The complex intron-exon structure of eukaryotic genes makes their prediction challenging. The quality of gene prediction in eukaryotic genomes can be improved by combining different gene prediction approaches (ab initio, homology-based, EST-based, synteny-based, or combinations of these) with experimental data (transcriptomics, proteomics, etc.). In the course of annotating fungal genomes, we compared different gene predictors and annotation pipelines to assess and refine our annotation strategies for future genomes. Results of two such tests are presented here:
1. Annotation of Heterobasidion annosum genome
Results: Several gene predictors and annotation pipelines were used to annotate the genome of the fungus H. annosum v1.0, and the accuracy of gene prediction was compared based on homology and EST support. The combination of tools used in the JGI annotation pipeline predicted a larger set of genes with the best support.
|Number of predicted gene models||11,547||9,609||8,409||12,270|
|with partial EST support||5,544||3,829||4,567||5,248|
|with full length EST support||2,538||1,182||2,896||3,073|
|with homology support||6,758||6,043||5,750||7,214|
|with strong homology support (>80% aa identity, >80% coverage)||112||109||174||187|
|with homology and EST support||2,894||2,172||2,720||2,953|
|Average EST coverage per gene||77.7%||68.2%||80.8%||79.1%|
|Supported splice sites||41,581||40,808||45,498||47,671|
|Average homology coverage per gene||64%||60%||68%||69%|
EuGene models were built and provided by a collaborator. All models were used in the JGI pipeline. EST support was computed from 40,807 ESTs and 10,126 EST cluster consensus sequences mapped by BLAT; protein homology was computed by BLAST against NCBI NR.
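The support metrics in the table above can be illustrated with a minimal sketch. The source does not say exactly how EST coverage was computed, so this assumes it is the fraction of exonic bases overlapped by at least one mapped EST alignment block. The names `merge_intervals`, `est_coverage` and `has_strong_homology` are hypothetical, and the coordinates in the usage example are invented for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) genomic coordinates


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def est_coverage(exons: List[Interval], est_blocks: List[Interval]) -> float:
    """Fraction of exonic bases overlapped by at least one EST block (assumed definition)."""
    exons = merge_intervals(exons)
    blocks = merge_intervals(est_blocks)
    total = sum(end - start for start, end in exons)
    if total == 0:
        return 0.0
    covered = 0
    for e_start, e_end in exons:
        for b_start, b_end in blocks:
            overlap = min(e_end, b_end) - max(e_start, b_start)
            if overlap > 0:
                covered += overlap
    return covered / total


def has_strong_homology(identity: float, coverage: float) -> bool:
    """Thresholds from the table row: >80% aa identity and >80% coverage."""
    return identity > 0.80 and coverage > 0.80


# Invented example: a two-exon gene model and three EST alignment blocks.
example_coverage = est_coverage(
    exons=[(100, 200), (300, 400)],
    est_blocks=[(150, 200), (300, 350), (340, 360)],
)
```

Merging the EST blocks first keeps bases hit by several ESTs from being counted more than once.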
2. Comparison of MAKER and JGI Annotation pipeline
Results: The publicly available annotation pipeline MAKER was compared with the JGI annotation pipeline [4,5]. For the Basidiomycete Dichomitus squalens, the JGI pipeline predicted more genes with better support from several lines of evidence.
|MAKER||JGI Annotation pipeline|
|Number of predicted gene models||9,940||12,290|
|with Swissprot hits||6,521||7,356|
|with non-repeat PFAM domains||5,365||6,010|
|with EST support||9,252||10,796|
|with >90% EST support||7,729||9,178|
|Number of unique PFAM domains||2,207||2,245|
|Average EST coverage per gene||93.0%||93.3%|
|Splice sites supported by ESTs||99,627||102,200|
Inputs: Assembly v1.0 of D. squalens, 359,410 protein seeds from NCBI NR, and 16,501 EST cluster consensus sequences mapped by BLAT to the assembly. MAKER used the following gene predictors: Exonerate, FgenesH (same parameters as in the JGI pipeline), and Augustus. All genes were BLASTed against the same Swissprot set of 530,264 protein sequences (downloaded Jul 5, 2011), EST sequences, and the PFAM database (Pfam_v21).
3. Comparative Analysis Methods and Tools
Genome annotation and analysis require the development and validation of new algorithms and tools. Directions of this development include methods to analyze eukaryotic genome organization (tandem and segmental duplication; gene-based synteny, including across multiple related genomes), gene structure (intron conservation or loss across genomes), gene gain/loss (detection of possible errors in automated clustering results for analysis of gene families, construction of whole-genome phylogenetic trees from clustering results, PFAM domain analysis to detect expanded and lost families), genome evolution, gene expression, genome variation, metabolic pathways, and regulatory elements. We also test new gene predictors (including those using RNA-Seq data and synteny-based approaches) on validated gene sets in terms of accuracy and speed, as well as pipelines (e.g., MAKER), repeat-finding software, and non-coding RNA-finding software. This project aims at (1) developing algorithms and prototypes for new genome analysis methods for publication; (2) testing new gene prediction and genome analysis tools for possible integration into the production annotation process.
Comparative Gene Modeling
Comparative gene modeling aims to improve the initial gene predictions for a set of closely related organisms and to correct missing or incorrectly predicted genes (incorrect splice sites, chimeras, gene fragments, etc.). The idea behind comparative modeling is that, for closely related genomes, most orthologs share the same conserved gene structure. The algorithm maps all gene models predicted in all genomes onto each individual genome and, for each locus, selects from the potentially many competing models the one that most closely resembles the homologous genes from the other genomes. This procedure may be iterated several times until no further change in gene models is observed.
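The selection-and-iteration step described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the JGI implementation: it assumes the mapping step has already grouped competing models by a shared locus key across genomes, it compares protein sequences with Python's `difflib.SequenceMatcher` as a stand-in for a real alignment score, and the names `similarity` and `select_models` and the toy sequences are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List

# genome -> locus key -> competing candidate protein sequences (assumed layout)
Candidates = Dict[str, Dict[str, List[str]]]
# genome -> locus key -> currently selected protein sequence
Selection = Dict[str, Dict[str, str]]


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; a real pipeline would score alignments."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def select_models(candidates: Candidates, initial: Selection,
                  max_rounds: int = 10) -> Selection:
    """Repeatedly pick, per locus, the candidate most similar to the models
    currently selected at the same locus in the other genomes."""
    current: Selection = {g: dict(loci) for g, loci in initial.items()}
    for _ in range(max_rounds):
        changed = False
        for genome, loci in candidates.items():
            selected = current.setdefault(genome, {})
            for locus, models in loci.items():
                homologs = [current[g][locus] for g in current
                            if g != genome and locus in current[g]]
                if not models or not homologs:
                    continue  # nothing to compare against
                best = max(models,
                           key=lambda m: sum(similarity(m, h) for h in homologs))
                if best != selected.get(locus):
                    selected[locus] = best
                    changed = True
        if not changed:  # converged: no model changed in this round
            break
    return current


# Toy example: genome A starts with a fragment at locus "og1".
toy_candidates = {
    "A": {"og1": ["MKTAYIAKQR", "MKTA"]},
    "B": {"og1": ["MKTAYIAKQR"]},
    "C": {"og1": ["MKTAYIAKQK"]},
}
toy_initial = {
    "A": {"og1": "MKTA"},
    "B": {"og1": "MKTAYIAKQR"},
    "C": {"og1": "MKTAYIAKQK"},
}
toy_result = select_models(toy_candidates, toy_initial)
```

The `max_rounds` cap is a safeguard added for this sketch, since the text only says the procedure is repeated until the models stop changing.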
For the Basidiomycete Dichomitus squalens, reannotation using comparative modeling is compared with the initial JGI production annotation:
|JGI Annotation pipeline||Comparative modeling|
|Number of predicted gene models||12,290||12,802|
|with Swissprot hits||7,356||7,900|
|with non-repeat PFAM domains||6,010||6,353|
|with EST support||10,796||11,105|
|with >90% EST support||9,178||9,444|
|Number of unique PFAM domains||2,245||2,322|
|Average EST coverage per gene||93.3%||93.3%|
|Splice sites supported by ESTs||102,200||104,246|
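To make the size of these differences easier to read, the absolute and relative changes between the two columns can be computed directly from the table. The short sketch below uses only the counts listed above; the helper name `relative_change` is hypothetical.

```python
# Counts copied from the table above: (JGI Annotation pipeline, Comparative modeling)
counts = {
    "Number of predicted gene models": (12290, 12802),
    "with Swissprot hits": (7356, 7900),
    "with non-repeat PFAM domains": (6010, 6353),
    "with EST support": (10796, 11105),
    "with >90% EST support": (9178, 9444),
    "Number of unique PFAM domains": (2245, 2322),
    "Splice sites supported by ESTs": (102200, 104246),
}


def relative_change(before: int, after: int) -> float:
    """Percent change from `before` to `after`."""
    return (after - before) / before * 100.0


for label, (jgi, comparative) in counts.items():
    delta = comparative - jgi
    print(f"{label}: {delta:+,d} ({relative_change(jgi, comparative):+.1f}%)")
```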
References
- Schiex T, Moisan A, Rouzé P. (2001) EuGène, an eukaryotic gene finder that combines several type of evidence. In Computational Biology, selected papers from JOBIM' 2000, LNCS 2066, Springer Verlag, pp. 118-133.
- Ter-Hovhannisyan V, Lomsadze A, Chernoff YO, Borodovsky M. (2008) Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Res. 18(12):1979-90.
- Solovyev V, Kosarev P, Seledsov I, Vorobyev D. (2006) Automatic annotation of eukaryotic genes, pseudogenes and promoters. Genome Biol. 7 Suppl 1:S10.1-12.
- Grigoriev IV, Martinez DA, Salamov AA. (2006) Fungal genomic annotation. In Applied Mycology and Biotechnology (Eds. Arora DK, Berka RM, Singh GB), Elsevier Press, Vol 6 (Bioinformatics), 123-142.
- Cantarel BL, Korf I, Robb SM, Parra G, Ross E, Moore B, Holt C, Sánchez Alvarado A, Yandell M. (2008) MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 18(1):188-96.