DOE Joint Genome Institute

  • COVID-19
  • About Us
  • Contact Us
  • Our Science
    • DOE Mission Areas
    • Bioenergy Research Centers
    • Science Programs
    • Science Highlights
    • Scientists
    Data yielded from RIViT-seq increased the number of sigma factor-gene pairs confirmed in Streptomyces coelicolor from 209 to 399. Here, grey arrows denote previously known regulation and red arrows are regulation identified by RIViT-seq; orange nodes mark sigma factors while gray nodes mark other genes. (Otani, H., Mouncey, N.J. Nat Commun 13, 3502 (2022). https://doi.org/10.1038/s41467-022-31191-w)
    Streamlining Regulon Identification in Bacteria
    Regulons are a group of genes that can be turned on or off by the same regulatory protein. RIViT-seq technology could speed up associating transcription factors with their target genes.

    More

    (PXFuel)
    Designer DNA: JGI Helps Users Blaze New Biosynthetic Pathways
    In a special issue of the journal Synthetic Biology, JGI scientific users share how they’ve worked with the JGI DNA Synthesis Science Program and what they’ve discovered through their collaborations.

    More

    A genetic element that generates targeted mutations, called diversity-generating retroelements (DGRs), are found in viruses, as well as bacteria and archaea. Most DGRs found in viruses appear to be in their tail fibers. These tail fibers – signified in the cartoon by the blue virus’ downward pointing ‘arms’— allow the virus to attach to one cell type (red), but not the other (purple). DGRs mutate these ‘arms,’ giving the virus opportunities to switch to different prey, like the purple cell. (Courtesy of Blair Paul)
    A Natural Mechanism Can Turbocharge Viral Evolution
    A team has discovered that diversity generating retroelements (DGRs) are not only widespread, but also surprisingly active. In viruses, DGRs appear to generate diversity quickly, allowing these viruses to target new microbial prey.

    More

  • Our Projects
    • Search JGI Projects
    • DOE Metrics/Statistics
    • Approved User Proposals
    • Legacy Projects
    Photograph of a stream of diatoms beneath Arctic sea ice.
    Polar Phytoplankton Need Zinc to Cope with the Cold
    As part of a long-term collaboration with the JGI Algal Program, researchers studying function and activity of phytoplankton genes in polar waters have found that these algae rely on dissolved zinc to photosynthesize.

    More

    This data image shows the monthly average sea surface temperature for May 2015. Between 2013 and 2016, a large mass of unusually warm ocean water--nicknamed the blob--dominated the North Pacific, indicated here by red, pink, and yellow colors signifying temperatures as much as three degrees Celsius (five degrees Fahrenheit) higher than average. Data are from the NASA Multi-scale Ultra-high Resolution Sea Surface Temperature (MUR SST) Analysis product. (Courtesy NASA Physical Oceanography Distributed Active Archive Center)
    When “The Blob” Made It Hotter Under the Water
    Researchers tracked the impact of a large-scale heatwave event in the ocean known as “The Blob” as part of an approved proposal through the Community Science Program.

    More

    A plantation of poplar trees. (David Gilbert)
    Genome Insider podcast: THE Bioenergy Tree
    The US Department of Energy’s favorite tree is poplar. In this episode, hear from ORNL scientists who have uncovered remarkable genetic secrets that bring us closer to making poplar an economical and sustainable source of energy and materials.

    More

  • Data & Tools
    • IMG
    • Data Portal
    • MycoCosm
    • PhycoCosm
    • Phytozome
    • GOLD
    HPCwire Editor's Choice Award (logo crop) for Best Use of HPC in the Life Sciences
    JGI Part of Berkeley Lab Team Awarded Best Use of HPC in Life Sciences
    The HPCwire Editors Choice Award for Best Use of HPC in Life Sciences went to the Berkeley Lab team comprised of JGI and ExaBiome Project team, supported by the DOE Exascale Computing Project for MetaHipMer, an end-to-end genome assembler that supports “an unprecedented assembly of environmental microbiomes.”

    More

    With a common set of "baseline metadata," JGI users can more easily access public data sets. (Steve Wilson)
    A User-Centered Approach to Accessing JGI Data
    Reflecting a structural shift in data access, the JGI Data Portal offers a way for users to more easily access public data sets through a common set of metadata.

    More

    Phytozome portal collage
    A More Intuitive Phytozome Interface
    Phytozome v13 now hosts upwards of 250 plant genomes and provides users with the genome browsers, gene pages, search, BLAST and BioMart data warehouse interfaces they have come to rely on, with a more intuitive interface.

    More

  • User Programs
    • Calls for Proposals
    • Special Initiatives & Programs
    • Product Offerings
    • User Support
    • Policies
    • Submit a Proposal
    screencap from Amundson and Wilkins subsurface microbiome video
    Digging into Microbial Ecosystems Deep Underground
    JGI users and microbiome researchers at Colorado State University have many questions about the microbial communities deep underground, including the role viral infection may play in other natural ecosystems.

    Read more

    Yeast strains engineered for the biochemical conversion of glucose to value-added products are limited in chemical output due to growth and viability constraints. Cell extracts provide an alternative format for chemical synthesis in the absence of cell growth by isolating the soluble components of lysed cells. By separating the production of enzymes (during growth) and the biochemical production process (in cell-free reactions), this framework enables biosynthesis of diverse chemical products at volumetric productivities greater than the source strains. (Blake Rasor)
    Boosting Small Molecule Production in Super “Soup”
    Researchers supported through the Emerging Technologies Opportunity Program describe a two-pronged approach that starts with engineered yeast cells but then moves out of the cell structure into a cell-free system.

    More

    These bright green spots are fluorescently labelled bacteria from soil collected from the surface of plant roots. For reference, the scale bar at bottom right is 10 micrometers long. (Rhona Stuart)
    A Powerful Technique to Study Microbes, Now Easier
    In JGI's Genome Insider podcast: LLNL biologist Jennifer Pett-Ridge collaborated with JGI scientists through the Emerging Technologies Opportunity Program to semi-automate experiments that measure microbial activity in soil.

    More

  • News & Publications
    • News
    • Blog
    • Podcasts
    • Webinars
    • Publications
    • Newsletter
    • Logos and Templates
    • Photos
    A view of the mangroves from which the giant bacteria were sampled in Guadeloupe. (Hugo Bret)
    Giant Bacteria Found in Guadeloupe Mangroves Challenge Traditional Concepts
    Harnessing JGI and Berkeley Lab resources, researchers characterized a giant - 5,000 times bigger than most bacteria - filamentous bacterium discovered in the Caribbean mangroves.

    More

    In their approved proposal, Frederick Colwell of Oregon State University and colleagues are interested in the microbial communities that live on Alaska’s glacially dominated Copper River Delta. They’re looking at how the microbes in these high latitude wetlands, such as the Copper River Delta wetland pond shown here, cycle carbon. (Courtesy of Rick Colwell)
    Monitoring Inter-Organism Interactions Within Ecosystems
    Many of the proposals approved through JGI's annual Community Science Program call focus on harnessing genomics to developing sustainable resources for biofuels and bioproducts.

    More

    Coloring the water, the algae Phaeocystis blooms off the side of the sampling vessel, Polarstern, in the temperate region of the North Atlantic. (Katrin Schmidt)
    Climate Change Threatens Base of Polar Oceans’ Bountiful Food Webs
    As warm-adapted microbes edge polewards, they’d oust resident tiny algae. It's a trend that threatens to destabilize the delicate marine food web and change the oceans as we know them.

    More

Data & Tools
Home › Data & Tools › Software › BBTools › BBTools User Guide › Dedupe Guide

Dedupe Guide

Dedupe was written to eliminate duplicate contigs in assemblies, and later expanded to find all contained and overlapping sequences in a dataset, allowing a specified number of substitutions or edit distance. It is now also capable of clustering sequences based on similarity, and printing dot-formatted all-to-all overlap graphs.
Kmer-based assemblers do not typically create redundant contigs when working correctly, though an exception can be made in the case of transcriptome assemblers. However, overlap-based assemblers may create duplicate sequences, and merged kmer-based assemblies (such as 5 assemblies of the same reads with different kmer lengths) will usually contain massive redundancy. Also, public databases such as nt and RefSeq often have hundreds of thousands of duplicate sequences due to poor curation. Dedupe is primarily designed to handle these situations. While there are other tools designed for this purpose, none are even remotely as fast or comprehensive as Dedupe.

*Notes*

Memory:

Dedupe stores all unique sequences in memory. The cost is around 500 bytes per unique sequence, plus the sequences themselves (1 byte per base). It is possible to run Dedupe in subset mode to deduplicate datasets that do not fit in memory, but that will not be covered in this guide.

Threads and Scaling:

Dedupe is fully multithreaded, and scales near-linearly with the number of cores. Finding exact duplicates is so fast that it typically becomes bottlenecked by the file input streams, which max out at around 500 Mbp/s each. When deduplicating multiple references, using “in=a.fasta,b.fasta,c.fasta” allows each to be read in a different stream, increasing the maximal throughput compared to first concatenating all references into a single file.

Phases:

Dedupe has 6 phases, most of which are optional and depend on the processing mode. They are always executed (or skipped) in the same order.
1) Exact Matches.
During this required phase, sequences are loaded into memory, and exact duplicates (including reverse-complements) are detected and discarded. Hashtables are filled with sequence hash codes of sequences. If containments or overlaps will be examined in later stages, kmers from the ends of sequences will also be added to hash tables. After this phase, the input files will not be used again.
2) Absorb Containments.
If “absorbcontainments” is enabled (default), every read X is scanned for kmers; each kmer is looked up in a hashtable. If the kmer occurs in some other read Y, then Y is aligned with X to see if X contains Y (meaning Y is equal in length or shorter than X, and the number of substitutions or edits is at most the values specified with the “s” and “e” flags). If so, Y is discarded.
3) Find Overlaps.
If “findoverlaps” is enabled (non-default), overlaps will be sought using the same process as containment-absorbtion, but X will not need to contain Y; they must simply have an overlap of at least minoverlap (default 200). Neither is absorbed, and nor are they merged; the overlap is simply recorded.
4) Make Clusters.
If “cluster” is enabled (non-default), clusters will be created by searching the overlap graph. Each cluster is the set of all reads reachable via transitive overlaps. For example, if X overlaps Y, and Y overlaps Z, then X, Y, and Z will form a cluster, even if X does not overlap Z. That means if there is even a single edge between 2 clusters, they will become one cluster.
5) Process Clusters.
If “processclusters” is enabled (non-default), the clusters will be post processed to simplify them. This involves various graph simplification operations (which can be individually toggled) like removing redundant edges and (when possible) flipping some of the sequences so that they are all in the same orientation.
6) Output
Finally, all of the output files are generated.

Read Deduplication:

Dedupe can be used for deduplicating read sets, and supports paired reads as well (in which case it requires a pair to be the duplicate of another pair). However, due to memory usage, it is not particularly efficient in this role, considering the volume of data that can be generated on modern sequencing platforms. Dedupe can easily handle a 10M read MiSeq run, but a HighSeq lane with 300M reads might take hundreds of GB of RAM. In those cases, deduplication methods based on sorting would be more efficient (for example, mapping and deduplicating based on mapping position). Dedupe does not perform those operations.
Also, the current implementation of Dedupe is strictly limited to 2 billion unique sequences regardless of how much memory is available.

Pair Limitations:

Dedupe supports paired reads, but it was not really designed for them. When processing paired reads, some parts of Dedupe are restricted to a single thread due to a complication that causes non-deterministic output. As such, processing paired reads is slower than unpaired reads. Also, pair support is limited to exact matches and overlaps, not containments.

JNI acceleration:

Dedupe has an optional C component (written by Jonathan Rood) which will accelerate overlap and containment detection by a lot (at least double). However, it only has an effect if an edit distance is allowed. This can be activated with the “jni” flag, but it must first be compiled. For details on compiling it, see /bbmap/jni/README.txt. When clustering amplicons and allowing an edit distance, the “jni” and “pto” flags are highly recommended as they will dramatically increase speed.

Dedupe versus Dedupe2:

Dedupe and Dedupe2 are identical except that Dedupe2 allows an unlimited number of affixes (kmer prefixes and suffixes used for seeding overlap detection). This is only useful when searching for overlaps with a relatively low identity, since kmers are required to have exact matches. More affixes use more memory and slow things down, so don’t go overboard. You can call dedupe.sh or dedupe2.sh; internally, either Dedupe or Dedupe2 will get used depending on how many affixes were requested with the “nam” (“numaffixmaps”) flag. The fact that 2 shell scripts are present is a legacy.

*Usage Examples*

Exact duplicate removal only:

dedupe.sh in=X.fa out=Z.fa ac=f

The “ac=f” flag disables containment removal.
Exact duplicate and contained sequence removal:
dedupe.sh in=X.fa out=Y.fa

Finding duplicate sequences:

dedupe.sh in=X.fa out=Y.fa outd=duplicates.fa

All removed sequences will end up in “duplicates.fa”.

Deduplication of multiple files:

dedupe.sh in=X1.fa,X2.fa,X3.fa out=Y.fa

Deduplication allowing mismatches:

dedupe.sh in=X.fa out=Y.fa s=5 e=2

This will allow up to 5 substitutions, or 2 edits. What does this mean? Well, 5 substitutions is OK. 2 insertions or 2 deletions is OK. 2 insertions and 3 substitutions is OK. But, 5 insertions is not OK, because the edit distance specifies the bandwith of the banded aligner, and more than 2 insertions or deletions would go out of bounds. “s=5” alone would allow 5 substitutions and zero indels, while “e=2” alone would allow up to 2 of any mutations (2 subs, 1 sub 1 insertion, etc).

Deduplication with a minimum identity:

dedupe.sh in=X.fa out=Y.fa minidentity=99

This will consider two sequences to be duplicates if their identity is at least 99%. Indels are not allowed unless you specifically set the “e” flag. So, “minidentity=99” would consider 2 1000bp sequences to be duplicates if they had up to 1000*1% = 10 substitutions. “minidentity=99 e=5” would consider 2 1000bp sequences to be duplicates if they had up to 10 mutations, but only up to 5 of them could be indels. “minidentity=99 e=20” would consider 2 1000bp sequences to be duplicates if they had up to 20 mutations, all of which could be indels. Why not 10? Because it uses the max of the number of edits set by “e” and the number set by identity.

Clustering by overlap:

dedupe.sh in=X.fq pattern=cluster%.fq ac=f am=f s=1 mo=200 c pc csf=stats.txt outbest=best.fq fo c mcs=3 cc dot=graph.dot

This will find overlaps (fo) using a min overlap length (mo) of 200 and allowing 1 substitution (s). Then, reads will be clustered (c), and clusters of at least size 3 (mcs) will be written to individual files: cluster1.fq, cluster2.fq, etc. Also, the single best representative of each cluster (based on length and quality scores) will be written to outbest.fq (this makes more sense for amplicon clustering than assembly). A graph representing the overlaps will be written to graph.dot, which can be visualized with graphviz.

Clustering full-length PacBio 16s reads of insert:

reformat.sh in=reads_of_insert.fastq out=filtered.fq minlen=1420 maxlen=1640 maq=20 qin=33
then:
dedupe.sh in=filtered.fq csf=stats_e26.txt outbest=best_e26.fq qin=33 usejni=t am=f ac=f fo c rnc=f mcs=3 k=27 mo=1420 ow cc pto nam=4 e=26 pattern=cluster_%.fq dot=graph.dot

This first filters out low-quality data and probable chimeras (based on length) using Reformat. Then, clustering is done allowing up to 26 edits (this was chosen to allow roughly 99% accurate 1540bp amplicons to overlap; it should be adjusted depending on the accuracy and length of the data). A minimum overlap length is set to 1420bp. “nam=4 k=27” means 4 nonoverlapping 27-mers are used as seeds from each end of the sequences.

*Set Operations*

It is possible to do arbitrary set operations (intersection, union, subtraction) with Dedupe, though it’s not trivial. They are made possible by the “uniqueonly” flag, which discards all copies of sequences that have duplicates, rather than retaining exactly one. Note that similar operations are possible on kmer sets rather than sequence sets using kcompress.sh.

Set creation:

dedupe.sh in=X.fa out=X2.fa ac=f
dedupe.sh in=Y.fa out=Y2.fa ac=f

This is a necessary first step to ensure that X2 and Y2 are sets, meaning they have no duplicates.

Set union:

dedupe.sh in=X.fa,Y.fa out=union.fa ac=f

Set subtraction:

dedupe.sh in=X2.fa,union.fa out=Y2_minus_X2.fa uniqueonly ac=f

Set symmetric difference:

dedupe.sh in=X2.fa,Y2.fa out=symmetric_difference.fa uniqueonly ac=f

Set intersection:

dedupe.sh in=X2.fa,symmetric_difference.fa out=intersection.fa uniqueonly ac=f

  • BBTools User Guide
    • Installation Guide
    • Usage Guide
    • Data Preprocessing
    • Add Adapters Guide
    • BBDuk Guide
    • BBMap Guide
    • BBMask Guide
    • BBMerge Guide
    • BBNorm Guide
    • CalcUniqueness Guide
    • Clumpify Guide
    • Dedupe Guide
    • Reformat Guide
    • Repair Guide
    • Seal Guide
    • Split Nextera Guide
    • Statistics Guide
    • Tadpole Guide
    • Taxonomy Guide
  • BBTools FAQ and Support Forums

More topics:

  • COVID-19 Status
  • News
  • Science Highlights
  • Blog
  • Webinars
  • CSP Plans
  • Featured Profiles
  • Careers
  • Contact Us
  • Events
  • User Meeting
  • MGM Workshops
  • Internal
  • Disclaimer
  • Credits
  • Policies
  • Emergency Info
  • Accessibility / Section 508 Statement
  • Flickr
  • LinkedIn
  • RSS
  • Twitter
  • YouTube
Lawrence Berkeley National Lab Biosciences Area
A project of the US Department of Energy, Office of Science

JGI is a DOE Office of Science User Facility managed by Lawrence Berkeley National Laboratory

© 1997-2023 The Regents of the University of California