BBNorm Guide

BBNorm is designed to normalize coverage by down-sampling reads over high-depth areas of a genome, resulting in a flat coverage distribution. This process can dramatically accelerate assembly, render intractable datasets tractable, and often improve assembly quality. It can also do depth-binning, kmer frequency histogram generation, error-correction, error-marking, and genome-size estimation. BBNorm has 4 particularly notable features:
1) It stores kmers in a probabilistic data structure called a count-min sketch. This means it will never run out of memory, or swap to disk, on any dataset. Rather, as the number of unique kmers increases, accuracy gradually declines.
2) It offers features such as multipass normalization, which reduce the average error rate of the normalized output; standard normalization, by contrast, enriches for reads containing errors.
3) It is extremely fast and easy to use compared to other normalization programs.
4) It supports unlimited kmer lengths.

*Notes*

Data Structures:

A Count-Min Sketch (CMS) is also called a “counting Bloom filter”. It is a type of hash table that only stores values, not keys, and ignores collisions. To mitigate the negative effects of collisions, each value is stored in multiple locations, in the hope that at least one of them won’t collide with anything else; when reading a kmer’s count, all of its locations are read and the lowest value is used. For example, if a kmer’s three cells hold the counts 7, 7, and 12 (the last inflated by a collision), the reported count is min(7, 7, 12) = 7.
BBNorm can use CMSs with 1, 2, 4, 8, 16, or 32 bits per cell. The more bits, the higher the maximum count (up to 2^bits-1), but the fewer cells are available; for example, 1GB RAM will accommodate 4 billion 2-bit cells, with counts up to 3, or 500 million 16-bit cells, with counts up to 65535. If your data has expected coverage of 200x, there is little reason to use 32-bit cells.
Also, the number of locations used for storing a kmer’s count (the number of “hashes”) can be specified, from 1 to infinity (default 3). More hashes are more accurate (until the table becomes too full), but slower. To determine the optimal number of hashes, please read about Bloom filters.
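
As an illustrative sketch only (both flags are described above; the specific values are arbitrary), a run that requests 32-bit cells and 4 hash locations per kmer could look like:

bbnorm.sh in=reads.fq out=normalized.fq target=100 bits=32 hashes=4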

Memory and Capacity:

BBNorm should be run with all available memory (which is what the shell script will try to request by default); the more memory available, the more accurate the results. It is possible to process an arbitrarily large dataset with even a tiny amount of memory, but doing so will produce a warning message like this:

“Made hash table: hashes = 1 mem = 581.26 MB cells = 152.38M used = 93.540%
Warning: This table is extremely full, which may reduce accuracy. Ideal load is under 60% used.
For better accuracy, use the ‘prefilter’ flag; run on a node with more memory; quality-trim or error-correct reads; or increase the values of the minprob flag to reduce spurious kmers. In practice you should still get good normalization results even with loads over 90%, but the histogram and statistics will be off.”

Please don’t ignore this message! Memory can be used more efficiently by specifying “prefilter”, which stores low-count kmers in smaller cells (2-bit, by default) and high-count kmers in bigger cells (32-bit, by default). Prefilter is false by default, as it makes things slower, but it should always be enabled when maximal accuracy is desired or when the tables become too full (say, over 50% or so for normalization; lower for error-correction). You can also reduce the size of the primary cells with e.g. “bits=16”, or perform a first pass with only a 2-bit or 4-bit filter in which very-low-depth reads are removed, to reduce the total number of unique kmers. It is also possible to adjust the “minprob=X” flag, which ignores kmers whose probability of being error-free (based on quality scores) is below X. Normalization is very robust against the table being too full, but other operations, such as error-correction and histogram generation, are less robust.
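
For example, a normalization run applying the memory-saving options discussed above might look like this (a sketch; the minprob value is illustrative):

bbnorm.sh in=reads.fq out=normalized.fq target=100 prefilter bits=16 minprob=0.6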

Shell scripts:

BBNorm (whose Java class name is jgi.KmerNormalize) has 3 shell scripts – bbnorm.sh, ecc.sh, and khist.sh. All three call KmerNormalize and differ only in their default parameters; the separate scripts exist purely for convenience. It is possible to make kmer frequency histograms while doing error-correction and normalization at the same time, with a single command, from any of these shell scripts.

Dumping Kmers, Exact Counts, and Error Correction:

BBNorm cannot dump kmers and their counts because it only stores counts, not kmers. For this purpose, please use KmerCountExact instead, which explicitly tracks both kmers and their exact counts. Also, KmerCountExact can report the exact kmer frequency histogram. However, KmerCountExact cannot handle unlimited input data in finite memory like BBNorm can.
Tadpole uses exactly the same data structures as KmerCountExact, and as a result, error-correction by Tadpole is generally better than error-correction by BBNorm. Therefore, while BBNorm supports error-correction, it is recommended to use Tadpole instead when there is sufficient memory.
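
For reference, minimal sketches of those alternatives (assuming the kmercountexact.sh and tadpole.sh shell scripts shipped with BBTools; the flag names shown are assumptions and should be verified against those tools’ own guides):

kmercountexact.sh in=reads.fq khist=khist.txt out=kmers.fa
tadpole.sh in=reads.fq out=corrected.fq mode=correct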

When Not To Use BBNorm:

For normalization, BBNorm is mainly intended for use in assembly, and with short reads. Normalization is often useful if you have too much data (for example, 600x average coverage when you only want 100x) or uneven coverage (amplified single-cell, RNA-seq, viruses, metagenomes, etc.). It is not useful if you have smooth coverage and approximately the right amount of data, or too little data. BBNorm cannot inflate low coverage (e.g., bring 15x coverage up to 100x), only reduce it. Never normalize read data prior to a quantitative analysis (such as ChIP-seq, or RNA-seq for expression profiling); if you assemble normalized data and want to use mapping to determine coverage, map the non-normalized reads. Likewise, do not normalize data prior to mapping for variant discovery, as it will cause bias. If you need to reduce data volume in any of these scenarios, use subsampling rather than normalization. Do not attempt to normalize high-error-rate data from platforms such as PacBio or Nanopore; BBNorm is designed for relatively low-error-rate, short, fixed-length reads such as those from Illumina and Ion Torrent.
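
For instance, to reduce roughly 600x coverage to roughly 100x by random subsampling instead of normalization (a sketch, assuming BBTools’ reformat.sh and its samplerate flag):

reformat.sh in=reads.fq out=subsampled.fq samplerate=0.17
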
Also, error-correction is not advisable when you are looking for rare variants. It should generally be fine with relatively high-depth coverage of heterozygous mutations in a diploid (where you expect a 50/50 allele split), but with low-depth coverage (like 5x), or very lopsided distributions (like a 1/100 allele split), it may correct the minority allele into the majority allele, so should be used with caution.

Temp Files and Piping:

BBNorm needs to read input files multiple times (twice per pass), which means it is unable to accept piped input. In multipass mode, it also needs to generate temp files. The location of temp files can be specified with the “tmpdir” flag; it defaults to the environment variable $TMPDIR, which on Genepool points to local disk when available. Temp files will be cleaned up once BBNorm finishes.
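
For example, to direct temp files to a specific local directory (the path is illustrative):

bbnorm.sh in=reads.fq out=normalized.fq target=100 tmpdir=/path/to/local/disk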

Threads:

BBNorm is fully multithreaded both when counting kmers and when doing other operations such as normalization. The counting is lock-free, using atomic counters. As a result, it will default to using all available hardware threads; this can be adjusted with the “t” flag.
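
For example, to limit BBNorm to 8 threads (the thread count is illustrative):

bbnorm.sh in=reads.fq out=normalized.fq target=100 t=8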

*Usage Examples*

Estimating Memory Requirements:

loglog.sh in=reads.fq

This will estimate the number of unique kmers in a dataset, which dictates how much memory is needed by kmer-counting programs such as BBNorm. It does so very quickly while using virtually no memory, so it is recommended prior to running BBNorm (or any kmer-counting tool) if you need to know how much memory is required. For BBNorm, if LogLog reports 1 billion kmers (for example), then using 16-bit cells and 3 hashes, you would need roughly 3 hashes * 16 bits / 8 bits per byte * 1,000,000,000 kmers / 0.5 load = 12 GB to keep the table under 50% full.
Estimating the memory requirement is not strictly necessary, though; BBNorm will run on any dataset in any amount of memory, just with reduced accuracy if the tables become too full.
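
Breaking that estimate down for 1 billion unique kmers:

3 hashes * 16 bits per cell = 48 bits = 6 bytes per kmer
6 bytes * 1,000,000,000 kmers = 6 GB of cells
6 GB / 0.5 target load = 12 GB of RAM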

To normalize read coverage:

bbnorm.sh in=reads.fq out=normalized.fq target=100 min=5

This will run 2-pass normalization to produce an output file of reads with an average depth of 100x. Reads with an apparent depth of under 5x will be presumed to be errors and discarded.

To error-correct reads:

ecc.sh in=reads.fq out=corrected.fq
or equivalently
bbnorm.sh in=reads.fq out=corrected.fq ecc=t keepall passes=1 bits=16 prefilter

This will do error correction without discarding any reads. The “bits=16” and “prefilter” flags are not strictly necessary, but they typically make the correction more accurate by storing kmers more efficiently.

To generate a kmer-frequency histogram:

khist.sh in=reads.fq khist=khist.txt peaks=peaks.txt
or equivalently
bbnorm.sh in=reads.fq khist=khist.txt peaks=peaks.txt passes=1 prefilter minprob=0 minqual=0 mindepth=0

The histogram shows the number of unique kmers at a given depth. For example, a point at “Depth=10, UniqueKmers=248028” indicates that there are 248028 kmers that each occur 10 times in the input data. This should be plotted on a log-log scale. The peaks file contains the locations of peaks in the histogram, as well as estimates of genome size and ploidy. These estimates will only be accurate for randomly-sheared isolate genomic DNA with little contamination, and a ploidy of at most 4 (with 1 or 2 being more accurate). If there are no obvious peaks in the kmer histogram, the results of the peaks file will not be useful.
The additional arguments to bbnorm.sh (minprob=0 minqual=0 mindepth=0) are there to prevent low-depth kmers from being discarded.

To normalize and error-correct reads, creating before and after kmer histograms:

bbnorm.sh in=reads.fq out=normalized.fq target=100 min=5 prefilter ecc khist=khist_before.txt khistout=khist_after.txt

To make a high-pass or low-pass filter:

bbnorm.sh in=reads.fq out=highpass.fq outt=lowpass.fq passes=1 target=999999999 min=10

This will pass only reads with a depth of at least 10 to “out”, and low-depth reads under 10 to “outt” (outtoss).

To split by depth into 3 bins:

bbnorm.sh in=reads.fq outlow=low.fq outmid=mid.fq outhigh=high.fq passes=1 lowbindepth=10 highbindepth=80

This will put reads with coverage under 10x in low.fq; at least 80x in high.fq; and all others in mid.fq. Specifically, for pairs, if one read is below the low cutoff and the other is above the high cutoff, both go into mid.

Using additional files for kmer counts:

bbnorm.sh in=reads.fq out=corrected.fq passes=1 ecc extra=genome.fa,more_reads.fq

This will error-correct reads.fq using additional kmer count information from genome.fa and more_reads.fq. It can also be applied to other operations like normalization. The arguments to “extra” will be used only for kmer frequency data, but will not be part of the output.
