DOE Joint Genome Institute


Tadpole Guide

Tadpole is a kmer-based assembler that can also error-correct and extend reads. It does not do any complicated graph analysis or scaffolding, and is therefore not particularly good for diploid organisms. However, compared to most other assemblers, it is very fast, has a very low misassembly rate, and handles extremely irregular or very high coverage distributions well. It does not generate temp files or directories as a side effect. It can also selectively assemble a coverage ‘band’ from a dataset (for example, just areas with a depth between 1000x and 1500x). These features make it a good choice for microbial single-cell data, viruses, organelles, and preliminary assemblies for use in binning, quality recalibration, insert-size estimation, and so forth. Tadpole has no upper limit on kmer length.

Tadpole’s parameters are described in its shell script (tadpole.sh). This guide provides usage examples of various common tasks.

*Notes*

Memory:

Tadpole will, by default, attempt to claim all available memory. It uses approximately 20 bytes per unique kmer for k=1-31, 30 bytes per kmer for k=32-62, and so forth in increments of 31. However, with most datasets, the bulk of the kmers (and thus memory) are unwanted error kmers rather than genomic kmers. It is possible to save memory by making Tadpole ignore low-quality kmers using the “minprob” flag (this ignores kmers that, based on their quality scores, have less than a specified probability of being error-free). Alternatively, bloom filters can be used to screen out low-depth kmers efficiently using the “prefilter” flag. Memory will also be used somewhat more efficiently if the “prealloc” flag is applied, which makes Tadpole allocate all physical memory immediately rather than growing its tables as needed. If Tadpole still runs out of memory on a dataset despite these options, consider using BBNorm to normalize or error-correct the data first. Both normalization and error-correction reduce the number of unique kmers in the dataset.
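The per-kmer figures above can be turned into a rough memory estimate before starting a run. The following is a minimal sketch and not part of BBTools: the function names are hypothetical, the unique-kmer count is a made-up input, and it assumes that "and so forth" means 10 more bytes per kmer for each further block of 31 in K.

```bash
#!/usr/bin/env bash
# Rough estimate of kmer-table memory, based on the figures quoted above.
# ASSUMPTION: each additional block of 31 in K adds 10 bytes per kmer
# (20 bytes for k=1-31, 30 bytes for k=32-62, 40 bytes for k=63-93, ...).

# bytes_per_kmer K -> approximate bytes used per unique kmer
bytes_per_kmer() {
    local k=$1
    local blocks=$(( (k + 30) / 31 ))   # 1 for k=1-31, 2 for k=32-62, ...
    echo $(( 10 * (blocks + 1) ))
}

# estimate_gib K UNIQUE_KMERS -> estimated memory in GiB, rounded up
estimate_gib() {
    local k=$1 kmers=$2
    local bytes=$(( kmers * $(bytes_per_kmer "$k") ))
    local gib=$(( 1024 * 1024 * 1024 ))
    echo $(( (bytes + gib - 1) / gib ))
}

estimate_gib 31 2000000000   # 2 billion unique 31-mers -> prints 38
estimate_gib 93 2000000000   # same kmer count at k=93  -> prints 75
```

If the estimate is well above the memory available, that suggests trying the minprob or prefilter flags, or preprocessing with BBNorm, as described above.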

Processing modes and output streams:

The default mode is contig-building; reads are processed, kmers are counted, then contigs are made from kmers and written to a file. The alternate mode is error correction / extension, which can be entered with the flag “mode=correct” or “mode=extend”; either of those modes supports both error-correction and extension (making the reads longer by assembling at their ends). In contig mode, the reads will be processed once, and the contigs will be written to “out”. In correct or extend mode, the reads will be processed twice (once to count kmers, and once to modify the reads), and the modified reads will be written to “out”.

Threads:

Tadpole is fully multithreaded, both for kmer-counting and for the output phase (contig-building, error-correction, or extension). You should allow it to use all available processors except when operating on a shared node, in which case you may need to cap the number of threads with the “t” flag.

Kmer Length:

Tadpole supports unlimited kmer length, but it does not support all kmer lengths. Specifically, it supports every value of K from 1-31, every multiple of 2 from 32-62 (meaning 32, 34, 36, etc), every multiple of 3 from 63-93, and so forth. There is a wrapper script, tadwrapper.sh, that will assemble with a range of different kmer lengths to determine which is best. Typically, about 2/3 of read length is a good value of K for assembly. For error-correction, about 1/3 of read length is better. In order to assemble with longer kmers, it is possible to error-correct and extend reads with short kmers (such as 31-mers), then use the longer extended (and potentially merged) reads to assemble with a longer kmer. Longer kmers are better able to resolve repetitive features in genomes, and thus tend to yield more continuous assemblies. The tradeoff is that longer kmers have lower coverage: a read of length L contains only L-K+1 kmers, so kmer depth drops as K grows.
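The supported-K pattern and the read-length rules of thumb above can be sketched as a small helper. This is an illustration only, not part of BBTools: the function names are hypothetical, and the rule for K above 93 is an extrapolation of "and so forth" (multiples of 4 from 94-124, and so on).

```bash
#!/usr/bin/env bash
# is_supported_k K -> exit status 0 if K fits the pattern described above.
# ASSUMPTION: in the Nth block of 31 values, only multiples of N are supported.
is_supported_k() {
    local k=$1
    (( k >= 1 )) || return 1
    local blocks=$(( (k + 30) / 31 ))   # 1 for 1-31, 2 for 32-62, 3 for 63-93
    (( k % blocks == 0 ))
}

# nearest_supported_k K -> largest supported value that is <= K
nearest_supported_k() {
    local k=$1
    while (( k > 1 )) && ! is_supported_k "$k"; do
        k=$(( k - 1 ))
    done
    echo "$k"
}

# suggest_k READ_LENGTH -> starting points from the rules of thumb above:
# about 2/3 of read length for assembly, about 1/3 for error-correction
suggest_k() {
    local len=$1
    echo "assembly k=$(nearest_supported_k $(( len * 2 / 3 )))"
    echo "correction k=$(nearest_supported_k $(( len / 3 )))"
}

suggest_k 150
```

For 150bp reads this prints "assembly k=100" and "correction k=50". These are only starting points; comparing several values of K, for example with tadwrapper.sh, is still the more reliable way to choose.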

Shave and Rinse:

These flags examine the graph immediately after kmer-counting is finished, to remove kmers that cause error-induced branches. Specifically, “shave” removes kmers along dead-end paths with very low depth that branch off from a higher-depth path, and “rinse” removes kmers along very-low-depth bubbles that start and end at branches off a higher-depth path. Both are optional and can be applied to any processing mode. They do not currently seem to make a significant difference.

Continuity and fragmentation:

Tadpole is designed to be conservative and avoid misassemblies in repetitive regions. As a result, the assemblies may sometimes be more fragmented than necessary. With sufficient coverage and read length, fragmentation can often be reduced by choosing a longer kmer. Alternatively, reducing the values of branchmult1 and branchmult2 (to, say, “bm1=8 bm2=2”) can often increase the continuity of an assembly, though that does come with an increased risk of misassemblies.

*Usage Examples*

Assembly:

tadpole.sh in=reads.fq out=contigs.fa k=93

This will assemble the reads into contigs. Each contig will consist of unique kmers, so contigs will not overlap by more than K-1 bases. Contigs end when there is a branch or dead-end in the kmer graph. The specific triggers for detecting a branch or dead-end are controlled by the flags mincountextend, branchmult1, branchmult2, and branchlower. Contigs will only be assembled starting with kmers with depth at least mincountseed, and contigs shorter than mincontig or with average coverage lower than mincoverage will be discarded.

Error correction:

tadpole.sh in=reads.fq out=ecc.fq mode=correct k=50

This corrects the reads and outputs corrected reads. Correction is handled by two algorithms, “pincer” and “tail”. Pincer corrects errors bidirectionally, using kmers on the left and right; therefore, it can only work on bases in the middle of the read, at least K away from either end. Tail is not as robust, but is able to work on the ends of the read. So, it’s best to leave them both enabled, in which case the middle bases are corrected with pincer, and the end bases are corrected with tail.

Error marking:

tadpole.sh in=reads.fq out=ecc.fq mode=correct k=50 ecc=f mbb=2

This will not correct bases, but simply mark bases that appear to be errors by replacing them with N. A base is considered a probable error (in this mode) if it is fully covered by kmers with depth below the mbb value (in this case, 2). Mbb and ecc can be used together.
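The marking rule can be illustrated with a toy example. The sketch below is not Tadpole's code; it is one reading of "fully covered by kmers with depth below the value", in which a base becomes N only when every kmer overlapping it has a depth under the threshold. The function name, the read, and the kmer depths are all made up for illustration.

```bash
#!/usr/bin/env bash
# mark_bases READ K MIN_DEPTH DEPTH...
# DEPTH... lists the depth of each kmer in the read, left to right
# (a read of length L has L-K+1 kmers).
# ASSUMPTION: a base is marked only if ALL kmers overlapping it are low-depth.
mark_bases() {
    local read=$1 k=$2 min=$3
    shift 3
    local depths=("$@")
    local len=${#read} n=${#depths[@]}
    local out="" i j lo hi all_low
    for (( i = 0; i < len; i++ )); do
        # kmers overlapping base i start at positions lo..hi
        lo=$(( i - k + 1 ))
        if (( lo < 0 )); then lo=0; fi
        hi=$i
        if (( hi > n - 1 )); then hi=$(( n - 1 )); fi
        all_low=1
        for (( j = lo; j <= hi; j++ )); do
            if (( depths[j] >= min )); then all_low=0; break; fi
        done
        if (( all_low )); then out+="N"; else out+="${read:i:1}"; fi
    done
    echo "$out"
}

# One bad base at position 4 (0-based) drags down the three 3-mers covering it.
mark_bases ACGTACGT 3 2   9 9 1 1 1 9   # prints ACGTNCGT
```

Only the base shared by all three low-depth kmers is marked; its neighbors are each covered by at least one high-depth kmer and are left alone.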

Read Extension:

tadpole.sh in=reads.fq out=extended.fq mode=extend k=93 el=50 er=50

This will extend reads by up to 50bp to the left and 50bp to the right. Extension will stop prematurely if a branch or dead-end is encountered. Read extension and error-correction may be done at the same time, but that’s not always ideal, as they may have different optimal values of K. For error-correction, K should be shorter than half the read length; otherwise, the middle of the read can’t get corrected.
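When correction and extension call for different values of K, one option is to run them as two separate passes. The sketch below simply chains the two commands shown earlier, using only flags that appear in this guide. The wrapper function, the intermediate file name ecc.fq, and the dry-run switch are assumptions added for illustration; by default it prints the commands rather than running them.

```bash
#!/usr/bin/env bash
# Dry-run by default: each command is printed instead of executed.
# Set DRY_RUN=0 to actually run them (requires tadpole.sh on the PATH).
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# correct_then_extend READS K_CORRECT K_EXTEND EXTEND_BP
# Pass 1 corrects with a shorter kmer; pass 2 extends with a longer one.
correct_then_extend() {
    local reads=$1 k_ecc=$2 k_ext=$3 ext=$4
    run tadpole.sh in="$reads" out=ecc.fq mode=correct k="$k_ecc"
    run tadpole.sh in=ecc.fq out=extended.fq mode=extend k="$k_ext" el="$ext" er="$ext"
}

correct_then_extend reads.fq 50 93 50
```

The values 50 and 93 are taken from the examples above and are not tuned for any particular dataset.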

Lawrence Berkeley National Lab Biosciences Area
A project of the US Department of Energy, Office of Science

JGI is a DOE Office of Science User Facility managed by Lawrence Berkeley National Laboratory

© 1997-2022 The Regents of the University of California