BBTools contains a taxonomy package designed for processing NCBI taxonomy information, which is in the form of taxonomy files, and sequence name annotations in sequence files from NCBI’s ftp site. The related tools allow you to filter or bin annotated sequences by taxonomy, or use Seal to classify read abundance at a specific taxonomic level.
*Notes*
Acquiring taxonomy data:
First, the taxonomy files must be downloaded from NCBI. The are currently available here:
ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip
and
ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz
In Linux, it is most convenient to fetch them like this:
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip
Before further use, these must be processed by gitable.sh and taxtree.sh. taxdmp.zip needs to be unzipped first. For more details, see the Usage Examples section.
Acquiring sequence data:
NCBI is constantly rearranging its site, but currently, you can get sequence data here:
ftp://ftp.ncbi.nlm.nih.gov/genomes/
And specifically, ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/all.fna.tar.gz contains all of the RefSeq bacterial data. The current version should be available here: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/*.fna*, though that’s inconvenient as it’s in many files. There are also other clades in ftp://ftp.ncbi.nlm.nih.gov/refseq/release/
Sequence naming conventions:
The only requirement for reference sequence data is that it is named in a recognizable manner. This includes 3 possibilities:
1) It starts with a gi number, in this format:
gi|123|other stuff
This is the current naming convention used by all NCBI data.
2) It starts with an NCBI taxonomic identifier, in this format:
ncbi|123|other stuff
3) It is a proper Genus_species pair, with whitespace replaced by underscores, like this:
Homo_sapiens
Note that 3 is not recommended because there could be a name collision, but in general, it should be fine. It is not strictly necessary to replace whitespace with underscores, but that is often convenient. For taxa above the species level, just a single taxonomic name is fine, such as “Gammaproteobacteria”. Sequence names are case-insensitive.
Specifying processed taxonomy files:
Most taxonomy-aware tools (Seal, FilterByTaxa, etc) require the path to the tree and/or gitable files to be specified in the command line. The default locations of processed taxonomy files are hard-coded in TaxTree.DefaultTreeFile and DefaultTableFile, currently (on Genepool, JGI’s cluster) at /global/projectb/sandbox/gaag/bbtools/tax. So, on Genepool, you can use the flag “tree=auto” instead of the full path, and the default will be used. For non-JGI users the full path would be needed.
Memory and Threads:
The taxonomy-related tools (other than Seal) are all single-threaded and relatively low-memory.
*Usage Examples*
Creating a TaxTree file:
taxtree.sh names.dmp nodes.dmp tree.taxtree.gz
names.dmp and nodes.dmp come from unzipping taxdmp.zip (see “Acquiring taxonomy data”). The TaxTree file is needed by every tool processing taxonomy. Note that this command’s flags are order-sensitive.
Creating a GiTable file:
gitable.sh gi_taxid_nucl.dmp.gz gitable.int1d.gz
gi_taxid_nucl.dmp.gz comes from NCBI (see “Acquiring taxonomy data”). The GiTable is needed by tools processing taxonomy, but only if reference sequences are named with gi numbers (e.g. gi|123|stuff). Note that this command’s flags are order-sensitive.
Renaming gi numbers to NCBI Tax ID numbers:
gi2taxid.sh in=bacteria.fa out=renamed.fa gi=gitable.int1d.gz
This is an optional step, but will rename sequences such as “gi|123|stuff” to “ncbi|456|stuff”. Replacing the gi number with a Tax ID number means in subsequent steps the gitable is no longer needed, which is more efficient.
Filtering sequences by taxonomy, according to sequence names:
filterbytaxa.sh in=bacteria.fa out=filtered.fa tree=tree.taxtree.gz table=gitable.int1d.gz names=Escherichia_coli level=phylum include=t
This will create a file, “filtered.fa”, containing all the sequences in the same phylum as E.coli. It is also possible to use numeric taxonomic IDs with the “ids” flag, or create a file containing everything except E.coli’s phylum using the “include=f” flag. Gitable is not needed unless the sequences are named with gi numbers.
Binning sequences by taxonomy, according to sequence names:
splitbytaxa.sh in=bacteria.fa out=%.fa tree=tree.taxtree.gz table=gitable.int1d.gz level=phylum
This will split the file into many output files, such as Deinococcus-Thermus.fa and Bacteroidetes.fa. For taxonomic binning based on sequence content rather than names, see Seal. Gitable is not needed unless the sequences are named with gi numbers.
Printing the taxonomic ancestry of a taxa:
taxonomy.sh tree=tree.taxtree.gz table=gitable.int1d.gz homo_sapiens meiothermus_ruber 123 gi_123
This will print the complete taxonomy of Homo sapiens, Meiothermus ruber, and the organism has a sequence with an NCBI identifier of 123 (Pirellula), and the organism with a gi number of 123 (Bos taurus). Note that an underscore was used instead of the vertical line (gi_123) because vertical line is a reserved symbol for piping, so it’s annoying to use on the command line. Also note that “123” or “ncbi_123” indicate a taxonomy number, while “gi_123” indicates a gi number. Gitable is not needed unless you enter gi numbers.