DOE Joint Genome Institute


CalcUniqueness Guide

CalcUniqueness plots the fraction of unique reads produced by a sequencing run as a function of the number of reads sequenced. In other words, the output is similar to a rarefaction curve. It can help determine library complexity and whether additional sequencing might be useful. The way it determines whether a read has already been seen is probabilistic: it stores kmers from fixed locations (e.g., the first kmer in the read), and if a kmer has already been seen, it assumes that the read has already been seen. It also tracks pair uniqueness, using a hashcode generated from one kmer in read 1 and another in read 2.
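The first-kmer check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the BBTools implementation: the function name `first_kmer_uniqueness`, the use of a Python set, and the toy reads are all assumptions made for this example.

```python
def first_kmer_uniqueness(reads, k=20):
    """Return the fraction of reads whose first k-mer has not been seen before.

    Hypothetical helper for illustration; a read is assumed to be a duplicate
    whenever its first k-mer was already stored, as the guide describes.
    """
    seen = set()
    unique = 0
    total = 0
    for read in reads:
        if len(read) < k:
            continue  # too short to yield a k-mer; skipped in this sketch
        total += 1
        kmer = read[:k]
        if kmer not in seen:
            seen.add(kmer)
            unique += 1
    return unique / total if total else 0.0


# Toy reads: the last one differs from the first only after position 8,
# so with k=8 it is still treated as "already seen".
reads = ["ACGTACGTAC", "ACGTACGTAC", "TTGCATGCAA", "ACGTACGTTT"]
print(first_kmer_uniqueness(reads, k=8))  # 0.5
```

The last read shows why the method is probabilistic: it is a different sequence overall, but it shares its first kmer with an earlier read and is therefore counted as a duplicate.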

*Notes*

Memory:

CalcUniqueness grabs all available memory, even though it normally does not need all of it. It requires approximately 50 bytes per unique read.
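As a rough back-of-the-envelope check, the 50-bytes-per-unique-read figure can be turned into a memory estimate. The helper name `estimated_memory_gb` and the example read count are assumptions for illustration; actual usage will vary.

```python
BYTES_PER_UNIQUE_READ = 50  # approximate figure quoted in this guide


def estimated_memory_gb(unique_reads):
    """Estimate memory needed, in gigabytes (10^9 bytes), for a number of unique reads."""
    return unique_reads * BYTES_PER_UNIQUE_READ / 1e9


# Hypothetical example: 100 million unique reads
print(estimated_memory_gb(100_000_000))  # 5.0
```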

Legacy Aspects:

CalcUniqueness was designed to replace an existing, inefficient pipeline and to provide output matching that old pipeline, which I did not design. As a result, some of the features do not make a lot of sense, such as using K=20 (which is too short) and the “random kmer” columns (which are of questionable utility; I ignore them).

Data Quality:

Kmer matches must be exact. As a result, low-quality data will give artificially high uniqueness estimates, because a sequencing error inside the kmer makes a duplicate read look new. For the same reason, this program cannot be used on raw PacBio data. Interestingly, the graphs from this program can reveal where on the flowcell the sequencing machine has quality issues; these show up as spikes.
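The effect of exact matching on noisy data can be seen in a small sketch. The function name `count_unique_first_kmers` and the toy reads are assumptions for illustration only; the "noisy" reads are copies of the same fragment with a single substituted base in two of them.

```python
def count_unique_first_kmers(reads, k):
    """Count distinct first k-mers, using exact string matching."""
    return len({read[:k] for read in reads})


# Four copies of the same fragment, sequenced without errors
clean = ["ACGTACGTACGT"] * 4

# The same four copies, two of which carry one substitution in the first 8 bases
noisy = ["ACGTACGTACGT", "ACGTACGAACGT", "ACGTACGTACGT", "ACCTACGTACGT"]

print(count_unique_first_kmers(clean, 8))  # 1
print(count_unique_first_kmers(noisy, 8))  # 3
```

With errors present, one underlying fragment is counted as three unique reads, which is how low-quality data inflates the uniqueness estimate.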

Histogram Output:

There are 3 columns for single reads, 6 columns for paired:

count       number of reads or pairs processed
r1_first    percent unique 1st kmer of read 1
r1_rand     percent unique random kmer of read 1
r2_first    percent unique 1st kmer of read 2
r2_rand     percent unique random kmer of read 2
pair        percent unique concatenated kmer from read 1 and 2
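For downstream plotting, the histogram could be loaded into named columns along these lines. This is a sketch under stated assumptions: it assumes the file is whitespace-delimited, that any header line starts with "#", and that single-read output contains the first three columns listed above. The function name `parse_histogram` and the sample values are made up for illustration; check them against your actual output file.

```python
# Column names as listed in this guide, in order
PAIRED_COLUMNS = ["count", "r1_first", "r1_rand", "r2_first", "r2_rand", "pair"]


def parse_histogram(lines):
    """Parse histogram lines into a list of dicts keyed by column name.

    Assumes whitespace-delimited fields and skips blank lines and lines
    starting with '#'. Any fields beyond the six known columns are ignored.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        names = PAIRED_COLUMNS[:len(fields)]
        row = {"count": int(fields[0])}
        for name, value in zip(names[1:], fields[1:]):
            row[name] = float(value)
        rows.append(row)
    return rows


# Made-up single-read sample, not real program output
sample = ["#count\tr1_first\tr1_rand", "25000\t98.5\t97.9", "50000\t96.2\t95.4"]
for row in parse_histogram(sample):
    print(row["count"], row["r1_first"])
```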

One line is printed every X reads (default is 25000), showing the percent of reads that were unique in the last interval. Setting “cumulative=t” still prints once per interval, but reports the number of reads that were unique overall (which is generally a higher number, and not useful in most cases).
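The difference between per-interval and cumulative reporting can be illustrated with a small sketch. The function name `uniqueness_by_interval`, the tiny interval of 4, and the input flags are assumptions for illustration; for easy comparison this sketch expresses both values as percentages, which may differ from the program's exact output.

```python
def uniqueness_by_interval(flags, interval):
    """Return (reads_processed, interval_percent, cumulative_percent) per interval.

    flags: iterable of booleans, True if the read was unique when first seen.
    """
    rows = []
    total_unique = 0
    interval_unique = 0
    for i, is_unique in enumerate(flags, start=1):
        if is_unique:
            total_unique += 1
            interval_unique += 1
        if i % interval == 0:
            rows.append((i,
                         100.0 * interval_unique / interval,
                         100.0 * total_unique / i))
            interval_unique = 0  # reset for the next interval
    return rows


# Made-up flags: uniqueness drops as more reads are processed
flags = [True] * 4 + [True, False, True, False] + [False, False, False, True]
for row in uniqueness_by_interval(flags, interval=4):
    print(row)
```

Because uniqueness usually declines as sequencing proceeds, the cumulative value stays above the per-interval value, which is why the per-interval curve is the more informative one.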

*Usage Examples*

To generate a uniqueness plot:

bbcountunique.sh in=reads out=histogram.txt

Lawrence Berkeley National Lab Biosciences Area
A project of the US Department of Energy, Office of Science

JGI is a DOE Office of Science User Facility managed by Lawrence Berkeley National Laboratory

© 1997-2021 The Regents of the University of California