DOE Joint Genome Institute

Usage Guide

Terminology Notes

“Read” in this file is used synonymously with “sequence”, whether it is a contig in a fasta file or a short read produced by a sequencing platform. “Paired reads” or “pair” refer to 2 reads that are generated by sequencing both ends of a single fragment of DNA. These are typically delivered in two fastq files, named something like “read1.fastq.gz” and “read2.fastq.gz”. The alternative is single-ended reads, in which only one end of the molecule is sequenced. When paired reads are available, it is important to always process them together, rather than, for example, mapping the read 1 file and the read 2 file in two separate processes.

Usage

Most BBTools use the same syntax and operate with a set of standard flags. Individual tools also have specific flags – for example, kmer-based tools support the flag “k” to specify the kmer length, and non-kmer-based tools don’t. This guide describes the standard syntax and most common flags. Custom syntax and flags for a given tool are described in that tool’s shellscript.

Standard Syntax

Most BBTools (such as Reformat or BBNorm) process genomic sequences in some fashion, and are executed like this:

reformat.sh in=reads.fq out=processed.fq

The shellscript allows autodetection of memory (in some cases) and classpath.
The above command is equivalent to this:

java -ea -Xmx200m -cp /path/to/bbmap/current/ jgi.ReformatReads in=reads.fq out=processed.fq

Note that “/path/to/bbmap/current/” needs to be replaced with an actual path. While the shellscript will only work in bash (or some other Linux/Unix/MacOS shells), the full Java command will work in any environment with Java installed, such as Windows.

Tools that use a reference (such as BBMap, BBDuk, and Seal) will also need the additional flag “ref=”:

bbmap.sh in=reads.fq out=mapped.sam ref=genome.fasta

In each of the above cases, the flags can be arranged in any order.

Paired Reads

Most BBTools support paired reads. These may be in two files, or interleaved in a single file, which BBTools will autodetect based on the read names. When the reads are in two files, you can use the in2 and out2 flags, like this:

reformat.sh in1=read1.fq in2=read2.fq out1=processed1.fq out2=processed2.fq

It is also possible to specify paired files like this:

reformat.sh in=read#.fq out=processed#.fq

…which is equivalent to the above command; the “#” symbol is replaced by 1 and 2 to form the two filenames.

It is important to process paired files together in one command so that they are kept in the proper order. If you have dual input files and only 1 output file, the output will be written interleaved, and vice-versa. All tools that support paired reads will keep pairs together. For example, Reformat supports subsampling; if read 1 is discarded, read 2 will also be discarded. This prevents a loss of synchronization that corrupts the output.
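
As a rough illustration of what "staying synchronized" means, the sketch below compares the number of reads in two paired files before they are processed. It is not part of BBTools; the helper name count_reads and the tiny demo files are made up for this example, and it assumes uncompressed fastq with exactly 4 lines per record.

```shell
# Hypothetical helper (not a BBTools command): count reads in an
# uncompressed fastq file, assuming exactly 4 lines per record.
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Tiny made-up demo files standing in for read1.fq and read2.fq.
pair_dir=$(mktemp -d)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGA\n+\nIIII\n' > "$pair_dir/read1.fq"
printf '@r1/2\nCCGA\n+\nIIII\n@r2/2\nGGTA\n+\nIIII\n' > "$pair_dir/read2.fq"

n1=$(count_reads "$pair_dir/read1.fq")
n2=$(count_reads "$pair_dir/read2.fq")

if [ "$n1" -eq "$n2" ]; then
    echo "pairs look synchronized: $n1 reads per file"
else
    echo "read counts differ: $n1 vs $n2" >&2
fi

rm -rf "$pair_dir"
```

Equal counts do not prove that the reads are in the same order, so this is only a quick sanity check, not a replacement for processing both files in one command.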

Multiple Output and % Symbol

Some tools (such as Seal, BBSplit, BBMap, Dedupe) can use the % symbol as a wildcard, to be replaced by some other word when generating many files from a single input. It is recommended that the % symbol be avoided in filenames. As an example, assume you run Seal to bin some reads based on matching sequences in the fasta file “ref.fa”, which contains the genomes of e.coli and salmonella:

seal.sh in=reads.fq pattern=out_%.fq ref=ref.fa

This would produce the output files “out_e.coli.fq” and “out_salmonella.fq”.
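
The substitution can be pictured with a small shell sketch. This only mimics the naming rule described above; it is not how Seal is implemented, and the function name expand_pattern is hypothetical.

```shell
# Hypothetical helper: replace the first % in a pattern with a given word.
# Words containing "/" or "&" would need escaping for sed; they are not
# handled in this sketch.
expand_pattern() {
    printf '%s\n' "$1" | sed "s/%/$2/"
}

# Names taken from the example above.
for name in e.coli salmonella; do
    expand_pattern 'out_%.fq' "$name"
done
```

Run as written, the loop prints out_e.coli.fq and out_salmonella.fq, matching the filenames in the example.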

File Formats and Extensions

BBTools support most standard sequence formats, including fastq, fasta, fasta+qual, scarf, sam, and (if samtools is installed) bam. They also support gzip and (if bzip2 or pbzip2 is installed) bzip2. The tools are sensitive to file extensions. For example:

reformat.sh in=reads.fq.gz out=processed.fa

In this case, reformat will try to read a gzip-compressed fastq file and output an uncompressed fasta file. For BBMap, this means that it will output a sam file if you name the output “.sam”, bam if you name it “.bam”, fastq if you name it “.fastq”, and so forth. BBTools are usually capable of autodetecting input format (for example, if you feed one a fasta file called “stuff.txt” it will be able to determine that it is in fasta format), but relying on this is not recommended. Also, it is possible to specify an extensionless name for an output file, in which case the default format is used; the default varies by tool.

List of supported file extensions:

Fastq: fastq, fq
Fasta: fasta, fa, fas, fna, ffn, frn, seq, fsa, faa
Bread: bread (BBMap internal format; deprecated)
Sam: sam
Bam: bam
Qual: qual (should be accompanied by a fasta file)
Scarf: scarf (an old Illumina format; input only)
Phylip: phylip (only supported by phylip2fasta; input only)
Text: txt (used for logs, stats, and histograms)
Header: header (use this extension to write read names only)

List of supported compression extensions:

Gzip: gzip, gz
Bzip2: bz2
Zip: zip
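
To make the extension rules concrete, here is a minimal sketch that classifies a filename using a subset of the extensions listed above. It only imitates the idea of extension sensitivity; the function name describe_file is made up, and the real tools may resolve formats differently.

```shell
# Hypothetical helper: report "<format> <compression>" for a filename,
# using some of the extensions listed above.
describe_file() {
    name=$1
    compression=none
    # Strip a compression extension first, if present.
    case "$name" in
        *.gz|*.gzip) compression=gzip;  name=${name%.*} ;;
        *.bz2)       compression=bzip2; name=${name%.*} ;;
        *.zip)       compression=zip;   name=${name%.*} ;;
    esac
    # Then look at the remaining sequence-format extension.
    case "$name" in
        *.fastq|*.fq) format=fastq ;;
        *.fasta|*.fa|*.fas|*.fna|*.ffn|*.frn|*.seq|*.fsa|*.faa) format=fasta ;;
        *.sam) format=sam ;;
        *.bam) format=bam ;;
        *)     format=unknown ;;
    esac
    echo "$format $compression"
}

describe_file reads.fq.gz
describe_file processed.fa
```

For the two filenames from the reformat example, this prints "fastq gzip" and "fasta none".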

Piping and Screen Output

Most tools can accept input from stdin and write output to stdout, with notable exceptions being BBNorm and Tadpole in some processing modes, which require reading the input file multiple times. Piping works like this:

cat reads.fq.gz | reformat.sh in=stdin.fq.gz out=stdout.fa int=f > x.fa

Note that the extensions are added to stdin and stdout so that Reformat knows how to interpret the data; when piping, it cannot first autodetect the file format. Similarly, it cannot autodetect whether the reads are interleaved or not. So, “int=f” (equivalent to “interleaved=false”) was added to force it to treat the data as single-ended.

By default, all tools write status information to stderr, not stdout. To capture a program’s screen output, do this:

reformat.sh in=a.fq out=b.fq 1>out.txt 2>err.txt

Or, to direct both to a single file:

reformat.sh in=a.fq out=b.fq 1>out.txt 2>&1
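
The effect of these redirections can be tried without BBTools installed. In the sketch below, demo_tool is a made-up stand-in that writes one line to stdout and one to stderr, as a tool writing data to stdout and status to stderr would.

```shell
# Stand-in for a tool: data goes to stdout, status goes to stderr.
demo_tool() {
    echo "processed reads"
    echo "status: 2 reads processed" >&2
}

capture_dir=$(mktemp -d)

# Separate files for stdout and stderr.
demo_tool 1>"$capture_dir/out.txt" 2>"$capture_dir/err.txt"

# Both streams in a single file.
demo_tool 1>"$capture_dir/both.txt" 2>&1

stderr_text=$(cat "$capture_dir/err.txt")
both_lines=$(( $(wc -l < "$capture_dir/both.txt") ))

echo "stderr captured: $stderr_text"
echo "lines in combined file: $both_lines"

rm -rf "$capture_dir"
```

Note that the order matters in the second form: "2>&1" must come after "1>file" so that stderr follows stdout into the file.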

Memory and Java Flags

Two flags are passed by the shellscripts directly to Java rather than to BBTools: “-Xmx” and “-da”.

Java does not dynamically grow virtual memory as needed like C programs. The amount of virtual memory must be specified up front, and it will immediately be grabbed; the physical memory used will only increase as needed. The shellscripts will try to autodetect memory and set it to an appropriate value, but sometimes this will need to be overridden (for example, if you are using a shared node and don’t really need all the memory, or not enough memory was allocated and the program crashed with a memory exception). To force a program to use 3 gigs of RAM, use the flag “-Xmx3g”. For example:

reformat.sh in=reads.fq out=processed.fq -Xmx3g

That’s the equivalent of:

java -ea -Xmx3g -cp /path/to/bbmap/current/ jgi.ReformatReads in=reads.fq out=processed.fq

The “-ea” flag means “enable assertions”, which will make BBTools crash if they detect a problem. If you want to ignore the problem and force it to run anyway, you can use the “-da” flag. The -da flag may also increase speed slightly.

Threads

Most BBTools are multithreaded, and will automatically detect and use all available threads. This is usually desirable when you have exclusive use of a computer, but may not be on a shared node. The number of threads can be capped at X with the flag “t=X” (threads=X). The total CPU usage may still go higher, though, due to several factors:
1) Input and output are handled in separate threads; “t=X” only regulates the number of worker threads.
2) Java uses additional threads for garbage collection and other virtual machine tasks.
3) When subprocesses (such as pigz) are spawned, they also individually obey the thread limit, but if you set “t=4” and the process spawns 3 pigz instances, you could still potentially use over 16 threads – 4 worker threads, 4 threads for each pigz process, plus other threads for the JVM and I/O.
If you have exclusive use of a computer, you don’t need to worry about spawning too many threads; this is only an issue with regard to fairness on shared nodes.
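
The arithmetic behind the pigz example in point 3 can be written out as follows. The variable names are illustrative only, and the JVM and I/O threads are left out because their number is not fixed.

```shell
# Numbers taken from the example in point 3.
worker_threads=4      # set with t=4
pigz_processes=3      # pigz instances spawned by the tool
pigz_threads_each=4   # each pigz process obeys the same limit

total=$(( worker_threads + pigz_processes * pigz_threads_each ))
echo "worker + pigz threads: $total (JVM and I/O threads come on top)"
```

This gives 16 threads before counting the JVM and I/O threads, which is why the total can exceed 16 even with “t=4”.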

Subprocesses

If they are installed, BBTools will automatically use samtools for sam<->bam conversion, and bzip2 or pbzip2 for processing bz2 files. They may also use pigz to accelerate processing of gzipped files, depending on the number of threads available. This is generally fine on a standalone computer, but in some circumstances, depending on the cluster configuration, the scheduler may kill a process that spawns a subprocess for violating virtual memory limits (Amazon instances may do this). In that case, you can disable pigz support with the flags “pigz=f unpigz=f”. Alternatively, you can force pigz to be used with “pigz=t unpigz=t”. The default for those flags depends on the tool. Pigz processes will never be spawned unless the number of threads allowed is at least 3. It is also possible to spawn gzip instances instead of pigz instances, but this only gives a small speed increase over using Java for gzip processing.

Additional Help

There are many forum threads on SeqAnswers describing the usage of different BBTools, linked from this thread:
http://seqanswers.com/forums/showthread.php?t=41057

*Standard Flags*

The flags below work with many or all BBTools that process reads, but the list is not complete because it does not include flags specific to only one or a few tools – those are listed in the shellscript. They are listed with their default setting, but some of the defaults differ between tools; the specific default is also listed in the shellscript. Where the description starts with something in parentheses, like “(in1)”, that is an acceptable alternative version of the flag.

Flag Syntax

With the exception of certain special flags like help flags (--help, --version) and Java flags (-Xmx, -da), all flags are in the same format: “a=b” where “a” is the name of the flag, and is not case-sensitive, and “b” is the value, which is case-sensitive (for filenames). Flags may be in any order, and never need leading hyphens, except for those special flags mentioned above. If a flag is set twice, the later value will override the former; for example, “reformat.sh in=x.fq in=y.fq” will use y.fq as input. The special value “null” means “blank”. For example, “reformat.sh in=a.fq out=null” and “reformat.sh in=a.fq” are equivalent – neither will output anything.
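
The "later value wins" and case-insensitive-name rules can be illustrated with a small shell sketch. This is not BBTools' actual parser; the function name final_flag_value is hypothetical, and it only handles arguments of the form “a=b”.

```shell
# Hypothetical helper: print the value that a given flag ends up with
# after scanning all a=b arguments from left to right.
final_flag_value() {
    wanted=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    shift
    value=
    for arg in "$@"; do
        # Flag names are compared case-insensitively; values are kept as-is.
        name=$(printf '%s' "${arg%%=*}" | tr '[:upper:]' '[:lower:]')
        if [ "$name" = "$wanted" ]; then
            value=${arg#*=}
        fi
    done
    printf '%s\n' "$value"
}

final_flag_value in in=x.fq out=b.fq IN=y.fq
```

The call prints y.fq, because the second setting of the flag overrides the first regardless of the case used for the flag name.
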
For boolean variables, “null” is equivalent to “true”, and values may be abbreviated to “t” and “f”. So, for example, “overwrite=true” and “overwrite=t” are equivalent.

Help Flags

--help                  Print the usage information from the shellscript (when run from a
			shellscript).  Alternatively, you can just look at the shellscript with
			a text editor.
--version               Print the version of BBTools.

Config Files

config=file             A file or a comma-delimited list of files.  If this flag is present,
			the contents of the config file will be added to the command line.
			Config files must contain one argument per line.  Config files are
			never required, but may be useful when a command line would be too
			long or when arguments contain whitespace. See readme_config.txt
			for more information.
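
As a minimal sketch of the one-argument-per-line layout, the snippet below writes a small config file and prints the command that would use it. The file name and the flag values are made up for illustration, and the reformat.sh command is only printed, not executed; see readme_config.txt for the authoritative rules.

```shell
# Hypothetical config file: one argument per line.
config_dir=$(mktemp -d)
cat > "$config_dir/reformat_config.txt" <<'EOF'
in=reads.fq
out=processed.fq
overwrite=t
EOF

# Count how many arguments the config file would contribute.
config_args=$(( $(wc -l < "$config_dir/reformat_config.txt") ))
echo "config file supplies $config_args arguments"

# The command that would pick them up (printed here, not run):
echo "reformat.sh config=$config_dir/reformat_config.txt"
```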

Input Flags

in=file                 (in1) Main input.
in2=file                Input for 2nd read of pairs in a different file.
interleaved=auto        (int) t/f overrides interleaved autodetection.
samplerate=1            Set lower to only process a fraction of input reads.
qfin=.qual file         Read qualities from this qual file, for the reads coming from 'in'
qfin2=.qual file        Read qualities from this qual file, for the reads coming from 'in2'
extin=                  Allows overriding of input file format. For example, "extin=.fq"
			would force the input to be read in fastq format regardless of the
			file name.
trimreaddescription=f   (trd) Trim the names of reads after the first whitespace.
touppercase=f           (tuc) Convert lowercase letters in reads to upper case.
lowercaseton=f          (lctn) Convert lowercase letters in reads to N.
utot=f                  Convert U bases to T.

Output Flags

out=file                (out1) Main output.
out2=file               Output for 2nd read of pairs in a different file.
qfout=.qual file        Write qualities to this qual file, for the reads going to 'out'
qfout2=.qual file       Write qualities to this qual file, for the reads going to 'out2'
extout=                 Allows overriding of output file format.  For example, "extout=.fq"
			would force the output to be written in fastq format regardless of
			the file name.
fastawrap=70            Length of lines in fasta output.
overwrite=f             Allow overwriting of existing files.
append=f                Append to existing files.

Sampling Flags

reads=-1                Set to a positive number to only process this many input reads (or
			pairs), then quit.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is
			disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing
			deterministic sampling).  A negative number will use a random seed.

Threading Flags

threads=auto            (t) Number of worker threads to spawn.

Compression Flags

ziplevel=2              (zl) Compression level for zip or gzip output; 1-9.
unpigz=                 Spawn a pigz process for faster decompression. Requires pigz to be
			installed.  Valid values are t or f; the default varies by program.
pigz=                   Spawn a pigz process for faster compression. Requires pigz to be
			installed.  Valid values are t, f, or a number; the default varies by
			program.  "pigz=X" will enable pigz, and also force all pigz
			processes to use exactly X threads.

Quality-Related Flags

qin=auto                Input quality offset: 33 (Sanger), 64 (Illumina), or auto.
qout=auto               ASCII offset for output quality.  May be 33 (Sanger), 64 (Illumina),
			or auto (same as input).
qfake=30                Quality value used for fasta to fastq reformatting.
maxcalledquality=41     Cap quality values at this upper level.
mincalledquality=0      Cap quality values at this lower level.
ignorebadquality        (ibq) Don't crash if quality values appear to be incorrect.
qtrim=f                 Enable or disable quality trimming.  May be set to r, l, or rl to
			trim the right, left, or both sides.
trimq=                  Trim bases below this quality value.

Length-Related Flags

fastareadlen=           Fasta sequences longer than this are broken into subsequences of at
			most this length, and given a suffix such as _part_1. Only works with
			fasta files; generally designed for mapping very long sequences with BBMap.
fastaminlen=            Discard fasta sequences shorter than this.
maxlen=                 Has different meanings depending on the program. For BBMap, reads
			longer than this will be broken to pieces this length. For most other
			programs, it acts as a filter.
minlen=                 Has different meanings depending on the program. Typically, reads
			shorter than this will be discarded.

Histogram Flags

bhist=file              Base composition histogram by position.
qhist=file              Quality histogram by position.
qchist=file             Count of bases with each quality value.
aqhist=file             Histogram of average read quality.
bqhist=file             Quality histogram designed for box plots.
lhist=file              Read length histogram.
gchist=file             Read GC content histogram.
gcbins=100              Number of gchist bins.  Set to 'auto' to use read length.

*Advanced Flags*

Debugging and Benchmarking Flags

verbose=f               Print status messages for debugging.
parsecustom=f           Parse synthetic read names for custom data stored by RandomReads.

Buffering and I/O Flags

readbufferlength=200    Number of reads to store per ListNum.  A ListNum is the smallest unit
			of work sent to a worker thread.
readbuffers=            Number of ListNums to store in the queue waiting for worker threads.
			The default is 150% of the number of threads.
bf1=                    Set to true to force ByteFile1 to be used for reading files.
bf2=                    Set to true to force ByteFile2 to be used for reading files (faster).

MPI and JNI Flags

usejni=f                Set to true to enable JNI-accelerated versions of BBMerge, BBMap, and
			Dedupe. Requires the C code to be compiled.

Lawrence Berkeley National Lab Biosciences Area
A project of the US Department of Energy, Office of Science

JGI is a DOE Office of Science User Facility managed by Lawrence Berkeley National Laboratory

© 1997-2023 The Regents of the University of California