DOE Joint Genome Institute

Tadpole Guide

Tadpole is a kmer-based assembler that can also error-correct and extend reads. It does not perform any complicated graph analysis or scaffolding, so it is not particularly well suited to diploid organisms. Compared to most other assemblers, however, it is very fast, has a very low misassembly rate, and handles extremely irregular or very high coverage distributions well. It does not generate temp files or directories as a side effect. It can also selectively assemble a coverage ‘band’ from a dataset (for example, only areas with a depth between 1000x and 1500x). These features make it a good choice for microbial single-cell data, viruses, organelles, and preliminary assemblies for use in binning, quality recalibration, insert-size estimation, and so forth. Tadpole has no upper limit on kmer length.

Tadpole’s parameters are described in its shell script (tadpole.sh). This guide provides usage examples for various common tasks.

*Notes*

Memory:

Tadpole will, by default, attempt to claim all available memory. It uses approximately 20 bytes per unique kmer for k=1-31, 30 bytes per kmer for k=32-62, and so forth in increments of 31. However, with most datasets, the bulk of the kmers (and thus memory) are unwanted error kmers rather than genomic kmers. It is possible to save memory by making Tadpole ignore low-quality kmers using the “minprob” flag (this ignores kmers that, based on their quality scores, have less than a specified probability of being error-free). Alternatively, Bloom filters can be used to screen out low-depth kmers efficiently using the “prefilter” flag. Memory is also used somewhat more efficiently if the “prealloc” flag is applied, which makes Tadpole allocate all physical memory immediately rather than growing as needed. If Tadpole still runs out of memory on a dataset despite these options, consider using BBNorm to normalize or error-correct the data first. Either step will reduce the number of unique kmers in the dataset.
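
As a rough planning aid, the per-kmer figures above can be turned into a back-of-the-envelope estimate. The sketch below is not part of BBTools; the helper names are hypothetical, the step of 10 extra bytes per additional 31 bases beyond k=62 is an extrapolation of “and so forth”, and the result ignores any other overhead.

```shell
# Approximate bytes per unique kmer, following the figures quoted above:
# 20 bytes for k=1-31, 30 bytes for k=32-62, and (assumed) 10 more bytes
# for each additional block of 31 bases.
# The ( ... ) function body runs in a subshell, so variables stay local.
bytes_per_kmer() (
    k=$1
    words=$(( (k + 30) / 31 ))      # number of 31-base blocks needed for k
    echo $(( 10 + 10 * words ))
)

# Approximate memory, in megabytes, for a given number of unique kmers.
# Hypothetical helper; it only multiplies the kmer count by the figure above.
estimate_kmer_mem_mb() (
    unique_kmers=$1
    k=$2
    per=$(bytes_per_kmer "$k")
    echo $(( unique_kmers * per / 1000000 ))
)

estimate_kmer_mem_mb 500000000 31   # 500 million unique 31-mers: prints 10000
estimate_kmer_mem_mb 500000000 62   # the same number of 62-mers: prints 15000
```

If the estimate is well above the memory available, that suggests trying “minprob”, “prefilter”, or BBNorm before assembling.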

Processing modes and output streams:

The default mode is contig-building: reads are processed, kmers are counted, then contigs are made from kmers and written to a file. The alternate mode is error correction / extension, which can be entered with the flag “mode=correct” or “mode=extend”; either of those modes supports both error-correction and extension (making the reads longer by assembling at their ends). In contig mode, the reads will be processed once, and the contigs will be written to “out”. In correct or extend mode, the reads will be processed twice (once to count kmers, and once to modify the reads), and the modified reads will be written to “out”.

Threads:

Tadpole is fully multithreaded, both for kmer-counting and for the output phase (contig-building, error-correction, or extension). You should allow it to use all available processors except when operating on a shared node, in which case you may need to cap the number of threads with the “t” flag.

Kmer Length:

Tadpole supports unlimited kmer length, but it does not support all kmer lengths. Specifically, it supports every value of K from 1-31, every multiple of 2 from 32-62 (meaning 32, 34, 36, etc), every multiple of 3 from 63-93, and so forth. There is a wrapper script, tadwrapper.sh, that will assemble with a range of different kmer lengths to determine which is best. Typically, about two-thirds of read length is a good value of K for assembly. For error-correction, about one-third of read length is better. In order to assemble with longer kmers, it is possible to error-correct and extend reads with short kmers (such as 31-mers), then use the longer extended (and potentially merged) reads to assemble with a longer kmer. Longer kmers are better able to resolve repetitive features in genomes, and thus tend to yield more continuous assemblies. The tradeoff is that longer kmers have lower coverage, because each read contains fewer of them.
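
The length rule and the rules of thumb above can be sketched as small shell helpers. These are illustrative only and not part of BBTools: the function names and the “assemble”/“correct” labels are hypothetical, the rounding follows the rule as stated here rather than Tadpole’s own code, and the depth estimate uses the usual relation kmer depth = read depth × (L − K + 1) / L with integer arithmetic.

```shell
# Round a requested K down to the nearest length that fits the stated rule:
# any K up to 31, multiples of 2 up to 62, multiples of 3 up to 93, and so on.
# The ( ... ) function body runs in a subshell, so variables stay local.
supported_k() (
    k=$1
    if [ "$k" -le 31 ]; then
        echo "$k"
    else
        words=$(( (k + 30) / 31 ))          # 2 for 32-62, 3 for 63-93, ...
        rounded=$(( k - k % words ))
        lower=$(( 31 * (words - 1) + 1 ))   # first K in this range
        if [ "$rounded" -lt "$lower" ]; then
            rounded=$(( lower - 1 ))        # top of the previous range
        fi
        echo "$rounded"
    fi
)

# Starting K for a read length: about 2/3 of read length for assembly,
# about 1/3 for error correction.
suggest_k() (
    read_len=$1
    purpose=$2    # "assemble" or "correct" (hypothetical labels)
    if [ "$purpose" = "correct" ]; then
        supported_k $(( read_len / 3 ))
    else
        supported_k $(( read_len * 2 / 3 ))
    fi
)

# Approximate kmer depth for a read depth ($1), read length ($2), and K ($3).
kmer_depth() (
    echo $(( $1 * ($2 - $3 + 1) / $2 ))
)

suggest_k 150 assemble    # prints 100
suggest_k 150 correct     # prints 50
kmer_depth 60 150 100     # prints 20
kmer_depth 60 150 50      # prints 40
```

For 150bp reads at 60x, moving from K=50 to K=100 halves the kmer depth in this estimate, which illustrates the coverage tradeoff of longer kmers.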

Shave and Rinse:

These flags examine the graph immediately after kmer-counting is finished, to remove kmers that cause error-induced branches. Specifically, “shave” removes kmers along dead-end paths with very low depth that branch off from a higher-depth path, and “rinse” removes kmers along very-low-depth bubbles that start and end at branches off a higher-depth path. Both are optional and can be applied to any processing mode. They do not currently seem to make a significant difference.

Continuity and fragmentation:

Tadpole is designed to be conservative and avoid misassemblies in repetitive regions. As a result, the assemblies may sometimes be more fragmented than necessary. With sufficient coverage and read length, fragmentation can often be reduced by choosing a longer kmer. Alternatively, reducing the values of branchmult1 and branchmult2 (to, say, “bm1=8 bm2=2”) can often increase the continuity of an assembly, though that does come with an increased risk of misassemblies.

*Usage Examples*

Assembly:

tadpole.sh in=reads.fq out=contigs.fa k=93

This will assemble the reads into contigs. Each contig will consist of unique kmers, so contigs will not overlap by more than K-1 bases. Contigs end when there is a branch or dead-end in the kmer graph. The specific triggers for detecting a branch or dead-end are controlled by the flags mincountextend, branchmult1, branchmult2, and branchlower. Contigs will only be assembled starting with kmers with depth at least mincountseed, and contigs shorter than mincontig or with average coverage lower than mincoverage will be discarded.

Error correction:

tadpole.sh in=reads.fq out=ecc.fq mode=correct k=50

This corrects the reads and outputs corrected reads. Correction is handled by two algorithms, “pincer” and “tail”. Pincer corrects errors bidirectionally, using kmers on the left and right; therefore, it can only work on bases in the middle of the read, at least K away from either end. Tail is not as robust, but is able to work on the ends of the read. So, it’s best to leave them both enabled, in which case the middle bases are corrected with pincer, and the end bases are corrected with tail.
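
To see why K matters for correction, it can help to count how many bases pincer can reach in a read. The sketch below takes “at least K away from either end” literally; the helper name is hypothetical and Tadpole’s exact boundary handling may differ by a base or so.

```shell
# Number of bases in a read that lie at least K bases from both ends,
# i.e. the bases that pincer could correct under the description above.
# The ( ... ) function body runs in a subshell, so variables stay local.
pincer_bases() (
    read_len=$1
    k=$2
    span=$(( read_len - 2 * k ))
    if [ "$span" -lt 0 ]; then span=0; fi
    echo "$span"
)

pincer_bases 150 50   # prints 50
pincer_bases 150 75   # prints 0
pincer_bases 100 31   # prints 38
```

With K at half the read length, no bases are left for pincer and everything falls to tail, which is the less robust of the two algorithms.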

Error marking:

tadpole.sh in=reads.fq out=ecc.fq mode=correct k=50 ecc=f mbb=2

This will not correct bases, but simply mark bases that appear to be errors by replacing them with N. A base is considered a probable error (in this mode) if it is fully covered by kmers with depth below the mbb value (in this case, 2). The mbb and ecc flags can be used together.

Read Extension:

tadpole.sh in=reads.fq out=extended.fq mode=extend k=93 el=50 er=50

This will extend reads by up to 50bp to the left and 50bp to the right. Extension will stop prematurely if a branch or dead-end is encountered. Read extension and error-correction may be done at the same time, but that’s not always ideal, as they may have different optimal values of K. Error-correction should use kmers shorter than 1/2 read length at the longest; otherwise, the middle of the read can’t get corrected.
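
The staged approach described under Kmer Length (correct and extend with a short kmer, then assemble with a longer one) can be sketched by chaining the examples above. This is one possible arrangement, not a recommended pipeline: the intermediate file names, the K values, and the TADPOLE_RUN variable are assumptions, and the optional merging step is left out. By default the script only prints the commands.

```shell
# Print each command; also execute it when TADPOLE_RUN=1 is set.
step() {
    echo "$*"
    if [ "${TADPOLE_RUN:-0}" = "1" ]; then "$@"; fi
}

# Correct and extend with a short kmer, then assemble with a longer one.
# The ( ... ) function body runs in a subshell, so variables stay local.
staged_assembly() (
    reads=$1
    short_k=$2
    long_k=$3
    step tadpole.sh in="$reads" out=ecc.fq mode=correct k="$short_k" &&
    step tadpole.sh in=ecc.fq out=extended.fq mode=extend k="$short_k" el=50 er=50 &&
    step tadpole.sh in=extended.fq out=contigs.fa k="$long_k"
)

staged_assembly reads.fq 31 93
```

Keeping correction and extension as separate steps allows a different K for each, at the cost of counting kmers one extra time.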

Lawrence Berkeley National Lab Biosciences Area
A project of the US Department of Energy, Office of Science

JGI is a DOE Office of Science User Facility managed by Lawrence Berkeley National Laboratory

© 1997-2021 The Regents of the University of California