DOE Joint Genome Institute


BBMerge Guide

BBMerge is designed to merge two overlapping paired reads into a single read. For example, a 2x150bp read pair with an insert size of 270bp would result in a single 270bp read. This is useful in amplicon studies, as clustering and consensus are far easier with single reads than paired reads, and also in assembly, where longer reads allow the use of longer kmers (for kmer-based assemblers) or fewer comparisons (for overlap-based assemblers). In either case, the quality of the overlapping bases is improved. BBMerge can also error-correct the overlapping portion of reads without merging them, and can merge nonoverlapping reads if enough coverage is available. BBMerge is the fastest and by far the most accurate overlap-based read merger currently in existence.
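
To make the idea of overlap-based merging concrete, here is a toy Python sketch. It is not BBMerge's algorithm: it only accepts exact overlaps and ignores quality scores, and the names merge_pair and min_overlap are made up for illustration. With real data, a 2x150bp pair from a 270bp insert overlaps by 2*150 - 270 = 30 bases; the example below uses a 40bp fragment and 25bp reads so that it stays readable.

```python
def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA sequence (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[b] for b in reversed(seq))


def merge_pair(read1: str, read2: str, min_overlap: int = 12):
    """Toy merger: return the merged read, or None if no exact overlap exists.

    read2 is assumed to come from the opposite strand ("innie" orientation),
    so it is reverse-complemented before looking for an overlap.
    """
    r2 = reverse_complement(read2)
    longest = min(len(read1), len(r2))
    # Try the longest candidate overlap first: suffix of read1 vs prefix of r2.
    for ov in range(longest, min_overlap - 1, -1):
        if read1[-ov:] == r2[:ov]:
            return read1 + r2[ov:]
    return None


# A 40bp fragment sequenced as 2x25bp, so the reads overlap by 10 bases.
fragment = "ACGTTGCAAGGCTTACCGATAGGCTAACGTTAGCCATGGA"
read1 = fragment[:25]
read2 = reverse_complement(fragment[-25:])
merged = merge_pair(read1, read2, min_overlap=8)
print(merged == fragment, len(merged))
```

A real merger also has to tolerate mismatches and decide between competing overlaps, which is what the strictness settings described later control.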

BBMerge’s parameters are described in its shell script (bbmerge.sh). This file provides usage examples of various common tasks.

*Notes*

Memory and Kmer Operations:

BBMerge has two shell scripts, bbmerge.sh and bbmerge-auto.sh. They differ only in memory usage, so if you override memory autodetection with the -Xmx flag, they behave identically. bbmerge.sh is designed for overlap-based merging only, and uses a fixed 1GB of RAM (though it can function with much less than that). bbmerge-auto.sh attempts to grab all available physical memory. It is designed for kmer-based operations using Tadpole, which include merging of both overlapping and non-overlapping reads, kmer-based error-correction, and kmer-based filtering.

If you use an option such as extend, extend2, ecct, kfilter, rem, or rsem, then BBMerge will automatically store kmers. This uses much more time and memory, but can bring advantages such as increased accuracy and an increased merge rate for longer-insert pairs. Kmer-based operations should only be used with shotgun (randomly-fragmented) libraries, never with amplicon libraries (such as 16S). They require sufficient coverage; 5x is typically enough, but more is better.
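
As a rough aid for picking a script, the following Python sketch checks a list of command-line arguments for the kmer-triggering options named above. The helper names are hypothetical, and the option set contains only the six options listed in this guide, which says "such as" and so may not be exhaustive. It also does not check whether an option was explicitly turned off.

```python
# Options this guide names as causing BBMerge to store kmers.
# Assumption: the list may be incomplete; extend it as needed.
KMER_OPTIONS = {"extend", "extend2", "ecct", "kfilter", "rem", "rsem"}


def needs_kmers(args):
    """Return True if any argument names a kmer-based option."""
    for arg in args:
        name = arg.split("=", 1)[0].lower()  # "extend2=20" -> "extend2"
        if name in KMER_OPTIONS:
            return True
    return False


def suggested_script(args):
    """Suggest bbmerge-auto.sh when kmers will be stored, else bbmerge.sh."""
    return "bbmerge-auto.sh" if needs_kmers(args) else "bbmerge.sh"


print(suggested_script(["in=reads.fq", "out=merged.fq"]))
print(suggested_script(["in=reads.fq", "out=merged.fq", "ecct", "extend2=20"]))
```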

Output streams:

BBMerge supports “out” (aka “outm” or “outmerged”) and “outu” (“outunmerged”). Reads that are merged, or mergeable, go to out, and the rest go to outu. There is a “join” flag (default true) that controls whether mergeable reads get merged. If it is set to false, mergeable reads will be written interleaved to out. All output streams are optional.
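
The routing rules above fit in a few lines of Python. This is only a restatement of the paragraph as a decision function; route_pair and its return values are illustrative names, not part of BBMerge.

```python
def route_pair(mergeable: bool, join: bool = True):
    """Return (stream, form) for a read pair, following the rules above."""
    if not mergeable:
        return ("outu", "pair")               # unmergeable pairs go to outu
    if join:
        return ("out", "merged read")         # default: merged into one read
    return ("out", "interleaved pair")        # join=false: kept as a pair


print(route_pair(mergeable=True))
print(route_pair(mergeable=True, join=False))
print(route_pair(mergeable=False))
```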

Threads and speed:

BBMerge is multithreaded and scales linearly with the number of processor cores, so it’s best to let it automatically use all of them. You can restrict the number of worker threads with the “t” flag if you are working on a shared node. To achieve the maximal speed on a system with many (20+) cores, BBMerge should be fed two files (using in1 and in2) rather than a single interleaved file.

JNI acceleration:

BBMerge has an optional C component (written by Jonathan Rood) which will accelerate merging by approximately 20%. This can be activated with the “jni” flag, but it must first be compiled. For details on compiling it, see /bbmap/jni/README.txt

Strictness:

BBMerge has a lot of settings controlling merging stringency, such as maxratio, ratiomargin, entropy, efilter, etc. Advanced users should feel free to tune these as needed. But it’s a lot simpler to use the predefined strictness levels which adjust the specific settings according to the results of extensive benchmarking. To use a predefined strictness level, simply add a flag like “loose” (you don’t need to add one for “default”). The predefined strictness levels, from strictest to loosest:
xstrict, ustrict, vstrict, strict, default, loose, vloose, uloose, xloose
Stricter settings have lower merge rates and fewer false positives; looser settings have higher merge rates and more false positives. Loose settings are generally not necessary except with low-quality data (which often happens in low-diversity amplicon sequencing using long reads). A false positive means a read pair that merged with the wrong overlap length; these can cause problems in assembly or clustering (leading to spurious clusters). However, at any level of strictness, BBMerge has by far the lowest false-positive rate of any read merger, in my testing.

Trimming:

Adapter-trimming reads (for example, with BBDuk) is fine and is recommended prior to running BBMerge, particularly if kmers will be used. Quality-trimming is usually not recommended, unless the library has major quality issues at the right end of reads resulting in a low merge rate, in which case only weak trimming (such as qtrim=r trimq=8) should be used.

When not to use:

If you run BBMerge, and under, say, 15% of the reads merge, even at very loose stringency, it’s probably a waste of time to merge – you’ll just make the workflow more complicated, and possibly get a lot of false-positives. Also, don’t try to merge single-ended libraries or long-mate-pair libraries that are not in an “innie” orientation. Generally you should not be merging LMP libraries at all except for special analysis purposes, such as determining what fraction of your LMP library is actually short-insert fragments.
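
The 15% rule of thumb is easy to check from the pair counts of a trial run. A minimal sketch, assuming you already have the number of merged pairs and the total number of pairs; the function names and the threshold default are illustrative, and 15% is a rough guide rather than a hard limit.

```python
def merge_rate(merged_pairs: int, total_pairs: int) -> float:
    """Fraction of read pairs that merged."""
    if total_pairs <= 0:
        raise ValueError("total_pairs must be positive")
    return merged_pairs / total_pairs


def worth_merging(merged_pairs: int, total_pairs: int, threshold: float = 0.15) -> bool:
    """Apply the rule of thumb above: below the threshold, skip merging."""
    return merge_rate(merged_pairs, total_pairs) >= threshold


print(worth_merging(120_000, 1_000_000))  # 12% merged
print(worth_merging(640_000, 1_000_000))  # 64% merged
```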

*Usage Examples*

Basic merging:

bbmerge.sh in=reads.fq out=merged.fq outu=unmerged.fq ihist=ihist.txt

This will merge the reads by overlap. If no best overlap is found, the pair will go to outu; otherwise, the reads will be merged and sent to out. After finishing, an insert size histogram will be written to ihist.txt. This can be produced even if “out” or “outu” are not specified.

Overlap-based error-correction:

bbmerge.sh in=reads.fq out=corrected.fq ecco mix

This will correct reads that overlap, rather than merging them. Where the two reads agree, the quality score will be increased; where they disagree, the score will be reduced, and the base call will be changed to the base with the higher quality. If the bases differ and the scores are equal, the base will be replaced with N. The “mix” flag writes both the corrected (mergeable) pairs and the pairs that could not be merged to the same output file.
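
The per-base rule can be sketched as a small Python function. The direction of each adjustment follows the description above, but the amounts are assumptions made for illustration: agreement sums the two scores up to a cap (max_q, hypothetical), and disagreement leaves the winner with the difference between the two scores. BBMerge's actual arithmetic may differ.

```python
def consensus(b1: str, q1: int, b2: str, q2: int, max_q: int = 41):
    """Return (base, quality) for one overlapping position.

    b1/q1 come from read 1, b2/q2 from read 2 after it has been
    reverse-complemented so both bases describe the same strand.
    """
    if b1 == b2:
        return b1, min(q1 + q2, max_q)   # agreement: quality goes up (assumed: capped sum)
    if q1 == q2:
        return "N", 0                    # conflict with equal scores: no call
    if q1 > q2:
        return b1, q1 - q2               # higher-quality base wins, score reduced
    return b2, q2 - q1


print(consensus("A", 30, "A", 20))
print(consensus("A", 30, "C", 10))
print(consensus("A", 20, "C", 20))
```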

Merging of nonoverlapping reads using kmers:

bbmerge-auto.sh in=reads.fq out=merged.fq outu=unmerged.fq ihist=ihist.txt ecct extend2=20 iterations=5

This will attempt to merge each pair by overlap. If unsuccessful, both reads will be error-corrected using Tadpole, and then merging will be tried again. If still unsuccessful, both reads will be extended by 20bp, then merging will be attempted again. This will repeat up to 5 times, or until neither of the reads can be extended any more due to a branch or dead-end in the kmer graph. If the reads are not merged, all of the changes are undone and the original pair will be sent to outu. “extend2=20 iterations=5” will extend each read by up to 100bp, which increases the maximum insert sizes that can be merged by 200bp. So, for example, a 2x150bp library can normally only merge inserts up to around 290bp; this would extend that capability to 490bp, and the middle would be filled in with assembled bases.
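
The insert-size arithmetic in this example can be written out as follows. The minimum overlap of 10bp is an assumption chosen to match the "around 290bp" figure above; it is not necessarily BBMerge's actual setting, and the function name is illustrative.

```python
def max_mergeable_insert(read_len: int, min_overlap: int,
                         extend: int = 0, iterations: int = 0) -> int:
    """Largest insert size that can still be merged.

    Without extension, two reads must share at least min_overlap bases.
    Each iteration can extend each read by up to `extend` bases.
    """
    per_read_extension = extend * iterations
    return 2 * read_len - min_overlap + 2 * per_read_extension


print(max_mergeable_insert(150, 10))                           # overlap only
print(max_mergeable_insert(150, 10, extend=20, iterations=5))  # extend2=20 iterations=5
```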

Using kmers to reduce false positives:

bbmerge-auto.sh in=reads.fq out=merged.fq outu=unmerged.fq rem extend2=50 k=62
or
bbmerge-auto.sh in=reads.fq out=merged.fq outu=unmerged.fq rsem extend2=50 k=62

This is similar to the above section on merging nonoverlapping reads, but the goal is to perform conservative merges, particularly in repetitive areas. “rem” or “requireextensionmatch” will try to merge the raw reads, then try to extend them, then merge them again. If the two merges give the same result, then the reads will be merged. If they give different results, the reads will only be merged if the raw reads gave no solution but the extended reads gave a solution indicating the raw reads don’t overlap, or if the raw reads DID have a solution and no extension was possible. “rsem” is stricter, in that if the raw reads did not give a solution, extension will not be attempted. In practice rsem has a lower false-positive merge rate than rem, but rem allows non-overlapping merges.
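
The rem and rsem rules are easier to follow as a decision function. This Python sketch is one reading of the paragraph above, not BBMerge's source code. All names are hypothetical; a merge "solution" is represented by its insert size, with None meaning no solution, and an extended solution is taken to imply non-overlapping raw reads when the insert size is at least the sum of the two read lengths.

```python
def rem_decision(raw_insert, ext_insert, extension_possible: bool,
                 len1: int, len2: int, strict: bool = False):
    """Return the accepted insert size, or None if the pair stays unmerged.

    raw_insert: insert size from merging the raw reads (None = no solution)
    ext_insert: insert size from merging the extended reads (None = no solution)
    strict:     False models "rem", True models "rsem"
    """
    if raw_insert is None:
        if strict:
            return None                      # rsem: extension is not attempted
        # rem: accept only if the extended solution says the raw reads
        # did not overlap at all.
        if ext_insert is not None and ext_insert >= len1 + len2:
            return ext_insert
        return None
    if not extension_possible:
        return raw_insert                    # raw solution, nothing to compare against
    if ext_insert == raw_insert:
        return raw_insert                    # both merges agree
    return None                              # merges disagree: be conservative


print(rem_decision(270, 270, True, 150, 150))                # agree -> 270
print(rem_decision(270, 250, True, 150, 150))                # disagree -> None
print(rem_decision(None, 350, True, 150, 150))               # rem, non-overlapping -> 350
print(rem_decision(None, 350, True, 150, 150, strict=True))  # rsem -> None
```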

Discovering adapter sequences:

bbmerge.sh in=reads.fq outa=adapters.fa

This will report the consensus adapter sequences of pairs with insert size shorter than read length. The adapter sequences can then be used for trimming with BBDuk or fed back into BBMerge to improve merging accuracy.

Using adapter sequences to improve merging accuracy:

bbmerge.sh in=reads.fq out=merged.fq adapter1=GATCGGAAGAGCACACGTCTGAACTCCAGTC adapter2=GATCGGAAGAGCACACGTCTGAACTCCAGTC
or
bbmerge.sh in=reads.fq out=merged.fq adapters=adapters.fa

The argument for adapter1=, adapter2=, or adapters= can be a literal sequence, a fasta file, or a comma-delimited list of sequences and/or fasta files. The adapter sequences will only be used to ensure that reads that overlap with an implied insert size of less than read length actually contain adapter sequence at the expected location. This is optional but can substantially increase accuracy, so it is highly recommended if you know the adapter sequence. Do not use this option if the left end of the reads has been trimmed in any way. Trimming the right end is fine; e.g., ktrim=r or qtrim=r in BBDuk. However, it is generally not recommended to do any quality-trimming prior to merging.
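
To illustrate what "adapter sequence at the expected location" means: when the insert is shorter than the read, the read runs off the end of the fragment and into the adapter, so the bases after position insert_size should look like the start of the adapter. The Python sketch below checks that. It is a simplified illustration rather than BBMerge's implementation, and max_mismatches is a hypothetical tolerance.

```python
def adapter_supports_insert(read: str, insert_size: int, adapter: str,
                            max_mismatches: int = 1) -> bool:
    """Check whether the read tail after insert_size matches the adapter start."""
    if insert_size >= len(read):
        return True  # no adapter is expected inside the read, nothing to check
    tail = read[insert_size:]
    n = min(len(tail), len(adapter))
    mismatches = sum(
        1 for a, b in zip(tail[:n], adapter[:n])
        if a != b and "N" not in (a, b)   # treat N as matching anything
    )
    return mismatches <= max_mismatches


adapter = "GATCGGAAGAGCACACGTCTGAACTCCAGTC"   # sequence from the command above
read = "ACGTACGTAC" + adapter[:15]             # 10bp insert, then adapter read-through
print(adapter_supports_insert(read, 10, adapter))  # correct insert size
print(adapter_supports_insert(read, 12, adapter))  # wrong insert size
```

This also shows why left-end trimming is a problem: removing bases from the start of a read shifts where the adapter is expected to begin.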

Allowing perfect overlaps only:

bbmerge.sh in=reads.fq out=merged.fq pfilter=1

“pfilter” bans merges in which the probability of the resulting mismatches falls below a specified value, based on the quality values of the bases that don’t match. “pfilter=1” is the strictest possible setting, which bans merges in which there are any mismatches in the overlap region.

Recommended command for optimal accuracy:

bbmerge-auto.sh in=reads.fq out=merged.fq adapter1=something adapter2=something rem k=62 extend2=50 ecct

If you have sufficient depth for kmer-based extension and error-correction (and are using a shotgun fragment library), and you know the adapter sequences, this command will maximize the correct merges and minimize the incorrect merges, in my testing. You can also add a strictness modifier (such as “loose” or “vstrict”) as desired. I typically use vstrict when preparing reads for assembly and loose when calculating an insert size distribution.

Lawrence Berkeley National Lab Biosciences Area
A project of the US Department of Energy, Office of Science

JGI is a DOE Office of Science User Facility managed by Lawrence Berkeley National Laboratory

© 1997-2023 The Regents of the University of California