DOE Joint Genome Institute


Dedupe Guide

Dedupe was written to eliminate duplicate contigs in assemblies, and later expanded to find all contained and overlapping sequences in a dataset, allowing a specified number of substitutions or edit distance. It is now also capable of clustering sequences based on similarity, and printing dot-formatted all-to-all overlap graphs.
Kmer-based assemblers do not typically create redundant contigs when working correctly, though an exception can be made in the case of transcriptome assemblers. However, overlap-based assemblers may create duplicate sequences, and merged kmer-based assemblies (such as 5 assemblies of the same reads with different kmer lengths) will usually contain massive redundancy. Also, public databases such as nt and RefSeq often have hundreds of thousands of duplicate sequences due to poor curation. Dedupe is primarily designed to handle these situations. While there are other tools designed for this purpose, none are even remotely as fast or comprehensive as Dedupe.

*Notes*

Memory:

Dedupe stores all unique sequences in memory. The cost is around 500 bytes per unique sequence, plus the sequences themselves (1 byte per base). It is possible to run Dedupe in subset mode to deduplicate datasets that do not fit in memory, but that will not be covered in this guide.
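
As a rough illustration of the figures above, the following Python sketch estimates memory from the number of unique sequences and their total length. The function name and the example dataset are hypothetical, and the estimate ignores any additional overhead from the Java runtime.

```python
def estimate_dedupe_memory_bytes(num_unique, total_bases, per_seq_overhead=500):
    """Rough estimate: ~500 bytes per unique sequence plus 1 byte per base."""
    return num_unique * per_seq_overhead + total_bases

# Hypothetical dataset: 1 million unique contigs averaging 2,000 bp each.
n = 1_000_000
bases = n * 2_000
print(f"{estimate_dedupe_memory_bytes(n, bases) / 1e9:.1f} GB")  # prints "2.5 GB"
```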

Threads and Scaling:

Dedupe is fully multithreaded, and scales near-linearly with the number of cores. Finding exact duplicates is so fast that it typically becomes bottlenecked by the file input streams, which max out at around 500 Mbp/s each. When deduplicating multiple references, using “in=a.fasta,b.fasta,c.fasta” allows each to be read in a different stream, increasing the maximal throughput compared to first concatenating all references into a single file.

Phases:

Dedupe has 6 phases, most of which are optional and depend on the processing mode. They are always executed (or skipped) in the same order.
1) Exact Matches.
During this required phase, sequences are loaded into memory, and exact duplicates (including reverse-complements) are detected and discarded. Hash tables are filled with the hash codes of the sequences. If containments or overlaps will be examined in later phases, kmers from the ends of sequences will also be added to hash tables. After this phase, the input files will not be used again.
2) Absorb Containments.
If “absorbcontainments” is enabled (default), every read X is scanned for kmers; each kmer is looked up in a hashtable. If the kmer occurs in some other read Y, then Y is aligned with X to see if X contains Y (meaning Y is equal in length or shorter than X, and the number of substitutions or edits is at most the values specified with the “s” and “e” flags). If so, Y is discarded.
3) Find Overlaps.
If “findoverlaps” is enabled (non-default), overlaps will be sought using the same process as containment absorption, but X will not need to contain Y; they must simply have an overlap of at least “minoverlap” (default 200). Neither is absorbed, nor are they merged; the overlap is simply recorded.
4) Make Clusters.
If “cluster” is enabled (non-default), clusters will be created by searching the overlap graph. Each cluster is the set of all reads reachable via transitive overlaps. For example, if X overlaps Y, and Y overlaps Z, then X, Y, and Z will form a cluster, even if X does not overlap Z. That means if there is even a single edge between 2 clusters, they will become one cluster.
5) Process Clusters.
If “processclusters” is enabled (non-default), the clusters will be post processed to simplify them. This involves various graph simplification operations (which can be individually toggled) like removing redundant edges and (when possible) flipping some of the sequences so that they are all in the same orientation.
6) Output.
Finally, all of the output files are generated.
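
The idea behind phase 1 can be sketched in a few lines of Python. This is only an illustration of exact-duplicate detection that treats a sequence and its reverse-complement as the same, not Dedupe's actual implementation; the helper names are hypothetical, and a Python set stands in for the hash tables.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def canonical(seq):
    """Pick one representative of a sequence and its reverse-complement."""
    return min(seq, reverse_complement(seq))

def remove_exact_duplicates(seqs):
    """Keep the first copy of each sequence, ignoring strand."""
    seen = set()
    kept = []
    for s in seqs:
        key = canonical(s)
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept

# "AACGT" is the reverse-complement of "ACGTT", so it counts as a duplicate.
print(remove_exact_duplicates(["ACGTT", "AACGT", "ACGTT", "GGG"]))
# prints ['ACGTT', 'GGG']
```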

Read Deduplication:

Dedupe can be used for deduplicating read sets, and supports paired reads as well (in which case it requires a pair to be the duplicate of another pair). However, due to memory usage, it is not particularly efficient in this role, considering the volume of data that can be generated on modern sequencing platforms. Dedupe can easily handle a 10M read MiSeq run, but a HiSeq lane with 300M reads might take hundreds of GB of RAM. In those cases, deduplication methods based on sorting would be more efficient (for example, mapping and deduplicating based on mapping position). Dedupe does not perform those operations.
Also, the current implementation of Dedupe is strictly limited to 2 billion unique sequences regardless of how much memory is available.

Pair Limitations:

Dedupe supports paired reads, but it was not really designed for them. When processing paired reads, some parts of Dedupe are restricted to a single thread due to a complication that causes non-deterministic output. As such, processing paired reads is slower than unpaired reads. Also, pair support is limited to exact matches and overlaps, not containments.

JNI acceleration:

Dedupe has an optional C component (written by Jonathan Rood) which will at least double the speed of overlap and containment detection. However, it only has an effect if an edit distance is allowed. This can be activated with the “jni” flag, but it must first be compiled. For details on compiling it, see /bbmap/jni/README.txt. When clustering amplicons and allowing an edit distance, the “jni” and “pto” flags are highly recommended as they will dramatically increase speed.

Dedupe versus Dedupe2:

Dedupe and Dedupe2 are identical except that Dedupe2 allows an unlimited number of affixes (kmer prefixes and suffixes used for seeding overlap detection). This is only useful when searching for overlaps with a relatively low identity, since kmers are required to have exact matches. More affixes use more memory and slow things down, so don’t go overboard. You can call dedupe.sh or dedupe2.sh; internally, either Dedupe or Dedupe2 will get used depending on how many affixes were requested with the “nam” (“numaffixmaps”) flag. The fact that 2 shell scripts are present is a legacy.

*Usage Examples*

Exact duplicate removal only:

dedupe.sh in=X.fa out=Z.fa ac=f

The “ac=f” flag disables containment removal.

Exact duplicate and contained sequence removal:

dedupe.sh in=X.fa out=Y.fa

Finding duplicate sequences:

dedupe.sh in=X.fa out=Y.fa outd=duplicates.fa

All removed sequences will end up in “duplicates.fa”.

Deduplication of multiple files:

dedupe.sh in=X1.fa,X2.fa,X3.fa out=Y.fa

Deduplication allowing mismatches:

dedupe.sh in=X.fa out=Y.fa s=5 e=2

This will allow up to 5 substitutions, or 2 edits. What does this mean? 5 substitutions is OK. 2 insertions or 2 deletions is OK. 2 insertions and 3 substitutions is OK. But 5 insertions is not OK, because the edit distance specifies the bandwidth of the banded aligner, and more than 2 insertions or deletions would go out of bounds. “s=5” alone would allow 5 substitutions and zero indels, while “e=2” alone would allow up to 2 of any mutations (2 subs, 1 sub 1 insertion, etc).
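
The rule implied by these examples can be written as a small predicate. This Python sketch is one reading of the examples above, not code taken from Dedupe, and the function name is hypothetical: indels are limited by “e”, and the total number of mutations is limited by the larger of “s” and “e”.

```python
def within_budget(subs, indels, s=0, e=0):
    """True if an alignment with this many substitutions and indels is allowed."""
    return indels <= e and subs + indels <= max(s, e)

# s=5 e=2
print(within_budget(5, 0, s=5, e=2))  # 5 substitutions
print(within_budget(3, 2, s=5, e=2))  # 2 insertions and 3 substitutions
print(within_budget(0, 5, s=5, e=2))  # 5 insertions exceed the band
```

Under this reading the three calls print True, True, and False.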

Deduplication with a minimum identity:

dedupe.sh in=X.fa out=Y.fa minidentity=99

This will consider two sequences to be duplicates if their identity is at least 99%. Indels are not allowed unless you specifically set the “e” flag. So, “minidentity=99” would consider two 1000bp sequences to be duplicates if they had up to 1000*1% = 10 substitutions. “minidentity=99 e=5” would consider two 1000bp sequences to be duplicates if they had up to 10 mutations, but only up to 5 of them could be indels. “minidentity=99 e=20” would consider two 1000bp sequences to be duplicates if they had up to 20 mutations, all of which could be indels. Why not 10? Because it uses the max of the number of edits set by “e” and the number set by identity.
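
The arithmetic in this paragraph can be reproduced with a short Python sketch. The function name is hypothetical, and rounding the identity-derived limit down to a whole number is an assumption; Dedupe may round differently.

```python
def mutation_budget(length, minidentity=100.0, e=0):
    """Return (max total mutations, max indels) for a sequence of this length."""
    from_identity = int(length * (100.0 - minidentity) / 100.0)  # assumed to round down
    max_total = max(from_identity, e)   # the larger of the two limits applies
    max_indels = e                      # indels are only allowed via "e"
    return max_total, max_indels

print(mutation_budget(1000, minidentity=99))        # (10, 0)
print(mutation_budget(1000, minidentity=99, e=5))   # (10, 5)
print(mutation_budget(1000, minidentity=99, e=20))  # (20, 20)
```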

Clustering by overlap:

dedupe.sh in=X.fq pattern=cluster%.fq ac=f am=f s=1 mo=200 c pc csf=stats.txt outbest=best.fq fo c mcs=3 cc dot=graph.dot

This will find overlaps (fo) using a min overlap length (mo) of 200 and allowing 1 substitution (s). Then, reads will be clustered (c), and clusters of at least size 3 (mcs) will be written to individual files: cluster1.fq, cluster2.fq, etc. Also, the single best representative of each cluster (based on length and quality scores) will be written to best.fq, the file named by “outbest” (this makes more sense for amplicon clustering than assembly). A graph representing the overlaps will be written to graph.dot, which can be visualized with graphviz.
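
Clustering by transitive overlap amounts to finding connected components of the overlap graph. The Python sketch below shows that idea with a union-find structure and a minimum cluster size; it is an illustration under that assumption, not Dedupe's implementation, and the function and variable names are hypothetical.

```python
from collections import defaultdict

def cluster_by_overlap(names, overlaps, min_cluster_size=1):
    """Group reads connected by overlaps; drop clusters below the minimum size."""
    parent = {n: n for n in names}

    def find(x):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in overlaps:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # a single edge is enough to merge two clusters

    groups = defaultdict(list)
    for n in names:
        groups[find(n)].append(n)
    return [sorted(g) for g in groups.values() if len(g) >= min_cluster_size]

reads = ["X", "Y", "Z", "A", "B"]
edges = [("X", "Y"), ("Y", "Z"), ("A", "B")]
print(cluster_by_overlap(reads, edges, min_cluster_size=3))
# prints [['X', 'Y', 'Z']]
```

X and Z end up in the same cluster even though they do not overlap directly, and the two-read cluster is dropped by the size filter.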

Clustering full-length PacBio 16S reads of insert:

reformat.sh in=reads_of_insert.fastq out=filtered.fq minlen=1420 maxlen=1640 maq=20 qin=33
then:
dedupe.sh in=filtered.fq csf=stats_e26.txt outbest=best_e26.fq qin=33 usejni=t am=f ac=f fo c rnc=f mcs=3 k=27 mo=1420 ow cc pto nam=4 e=26 pattern=cluster_%.fq dot=graph.dot

This first filters out low-quality data and probable chimeras (based on length) using Reformat. Then, clustering is done allowing up to 26 edits (this was chosen to allow roughly 99% accurate 1540bp amplicons to overlap; it should be adjusted depending on the accuracy and length of the data). A minimum overlap length is set to 1420bp. “nam=4 k=27” means 4 nonoverlapping 27-mers are used as seeds from each end of the sequences.
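
To picture what “4 nonoverlapping 27-mers from each end” means, the Python sketch below slices seed kmers from the start and end of a sequence. The exact positions Dedupe uses are an assumption here, and the function name is hypothetical.

```python
def end_seeds(seq, k=27, nam=4):
    """Return nam nonoverlapping k-mers from each end of seq."""
    if len(seq) < k * nam:
        raise ValueError("sequence too short for the requested seeds")
    # Prefix seeds are consecutive k-mers starting at the first base.
    prefixes = [seq[i * k:(i + 1) * k] for i in range(nam)]
    # Suffix seeds are consecutive k-mers ending at the last base, outermost first.
    n = len(seq)
    suffixes = [seq[n - (i + 1) * k:n - i * k] for i in range(nam)]
    return prefixes, suffixes

# Tiny example with k=3 and 2 seeds per end.
print(end_seeds("AAACCCGGGTTT", k=3, nam=2))
# prints (['AAA', 'CCC'], ['TTT', 'GGG'])
```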

*Set Operations*

It is possible to do arbitrary set operations (intersection, union, subtraction) with Dedupe, though it’s not trivial. They are made possible by the “uniqueonly” flag, which discards all copies of sequences that have duplicates, rather than retaining exactly one. Note that similar operations are possible on kmer sets rather than sequence sets using kcompress.sh.

Set creation:

dedupe.sh in=X.fa out=X2.fa ac=f
dedupe.sh in=Y.fa out=Y2.fa ac=f

This is a necessary first step to ensure that X2 and Y2 are sets, meaning they have no duplicates.

Set union:

dedupe.sh in=X.fa,Y.fa out=union.fa ac=f

Set subtraction:

dedupe.sh in=X2.fa,union.fa out=Y2_minus_X2.fa uniqueonly ac=f

Set symmetric difference:

dedupe.sh in=X2.fa,Y2.fa out=symmetric_difference.fa uniqueonly ac=f

Set intersection:

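dedupe.sh in=union.fa,symmetric_difference.fa out=intersection.fa uniqueonly ac=f

Every sequence in the symmetric difference also occurs in the union, so “uniqueonly” discards those and leaves only the sequences shared by X2 and Y2.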
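
The “uniqueonly” logic can be checked on toy data. The Python sketch below models only exact matching of whole sequences (reverse-complements are ignored); the function name, the keyword argument, and the example data are hypothetical. Here the intersection is taken from the union and the symmetric difference, which is one combination that works under this model.

```python
from collections import Counter

def dedupe(*files, uniqueonly=False):
    """Toy model: keep one copy of each sequence, or only those seen exactly once."""
    counts = Counter(seq for f in files for seq in f)
    if uniqueonly:
        return [s for s, c in counts.items() if c == 1]
    return list(counts)

X = ["AAA", "CCC", "CCC"]
Y = ["CCC", "GGG"]

X2 = dedupe(X)                                    # set creation
Y2 = dedupe(Y)
union = dedupe(X, Y)                              # set union
y_minus_x = dedupe(X2, union, uniqueonly=True)    # set subtraction
sym_diff = dedupe(X2, Y2, uniqueonly=True)        # symmetric difference
intersection = dedupe(union, sym_diff, uniqueonly=True)

print(union, y_minus_x, sym_diff, intersection)
# prints ['AAA', 'CCC', 'GGG'] ['GGG'] ['AAA', 'GGG'] ['CCC']
```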

Lawrence Berkeley National Lab Biosciences Area
A project of the US Department of Energy, Office of Science

JGI is a DOE Office of Science User Facility managed by Lawrence Berkeley National Laboratory

© 1997-2021 The Regents of the University of California