DOE Joint Genome Institute


June 11, 2015

Data Quality, Data Sets and New Directions: Plotting IMG’s Next 10 Years

DOE JGI Prokaryote Super Program Head Nikos Kyrpides

At the recent 10th Annual Genomics of Energy & Environment meeting hosted by the U.S. Department of Energy Joint Genome Institute (DOE JGI), a DOE Office of Science User Facility, Nikos Kyrpides, head of the DOE JGI Prokaryote Super Program, received the Van Niel International Prize in Bacterial Systematics. The Van Niel Prize was established in 1986 in honor of microbiologist Cornelis Van Niel’s contribution to scholarship in the field of microbiology, and is awarded every three years by the University of Queensland in Australia on the recommendation of a panel of experts of the International Committee on Systematics of Prokaryotes. Phil Hugenholtz, Director of the Center for Ecogenomics at the University of Queensland and a former DOE JGI colleague of Kyrpides, was on hand at the meeting to present the award.

An example of Kyrpides’ efforts to systematically describe and classify microbes in action can be seen in the Integrated Microbial Genomes (IMG) data management system that his program developed and maintains in partnership with the Biosciences Computing Group of Berkeley Lab’s Computational Research Division. IMG is the leading data analysis system of the DOE JGI’s Prokaryote Super Program, and Kyrpides has been pushing the developments as the scientific lead of the project from its first working prototype in 2005 to its current incarnation. On the IMG system’s 10-year anniversary, he took time to reflect on the milestones achieved thus far and future directions.

What are the highlights of the last 10 years to you?

In a period of 10 years, IMG has broken several records and has been established as one of the premier data management systems in the community for comparative analysis of microbial genomes and metagenomes. Its data size has grown 70-fold in terms of number of data sets and 22,000-fold in number of genes. We currently have almost 50,000 genomes in our system, containing 90 million genes. It’s taken 20 years to sequence all of those genomes; I anticipate we will easily double that number in the next two years. We have 6,000 metagenome data sets, which contain 29 billion genes. As far as I know, this represents the largest publicly available database of metagenomic genes, and therefore this is one more of IMG’s records. We’ve grown from a few hundred to about 12,000 registered users in more than 90 countries. We provide an alternative source of data, particularly for metagenomes, and we add significant value through the integration of various data types, as well as with curation and annotation.

In terms of data integration, we’ve managed to integrate several different data types including one of the largest collections of curated metadata from the GOLD database, as well as several omics types including transcriptomics, metatranscriptomics, proteomics, and methylomics. In an effort to connect to our DNA synthesis program at the JGI, we have integrated a large collection of known natural products and connected them to their biosynthetic gene clusters, creating one of the largest resources in the field. We are currently working towards the integration of metabolomics and transposomics data produced at the JGI. Adding all of these means a completely different operation from the straightforward comparison of genes and genomes. With transcriptomes, for example, you’re now talking about the expression of genes you already have, and expression levels vary under varying conditions. In transposomics, you look at the genes that are essential or have different fitness under varying conditions. So the original IMG’s three-dimensional model of genes, genomes and functions has become more multidimensional as you add each of the different data types.

What do you think has helped IMG grow over the past 10 years?

One of the critical things is that it was a joint development between a group of engineers under the leadership of Victor Markowitz, with long experience in genomic data, and a group of biologists that had very strong genomics and bioinformatics backgrounds. Biologists provided the requirements on how the data analysis tools and workflows should be organized, and the developers implemented exactly what the biologists wanted. It’s clear there was a grand vision upfront to handle this much growth in the past 10 years. However, there is always a difference between knowing the path and walking the path. In our case, I believe we did both. We can continue another 10 years on this current system, although we also need to start exploring new solutions for more efficient handling of the data deluge ahead.

One more of our early choices that I believe proved to be critical both for the growth and the success of the system was to offer only a single data processing option for all datasets submitted into our system. We do the annotation for the users, and we process the datasets the way we know best. Maintaining a huge system such as IMG gives you great power, and with great power comes great responsibility. I believe we’re obliged to figure out and apply the best annotation practice at any time rather than allowing users to figure out what to use and which one to choose, as some other systems do. Providing an environment where all the data are uniformly processed and annotated is of paramount value and importance.

(Left to Right:) Microbial ecologist and director of the University of Queensland Australian Centre for Ecogenomics Phil Hugenholtz, German microbiologist Hans-Peter Klenk, 2011-2014 Van Niel awardee Nikos Kyrpides and 2008-2011 Van Niel awardee George Garrity.

Looking forward to the next 10 years, what are some of the challenges the IMG system will need to tackle?

Our data sets are thousands of terabytes in size and we’ll be going to petabytes soon. We need to scale at the level of hundreds of thousands of data sets and hundreds of billions of genes. Right now our user interface can support the comparison of a few hundred datasets, but what we need and what researchers are asking for is to compare thousands against thousands. No one is doing something like that now. Everyone is currently comparing a metagenome against isolate genomes, but no system can efficiently provide a comparison of a metagenome against other metagenomes. Given the size of the data involved, that would take weeks and you can’t do this efficiently on a production scale (i.e. on a weekly basis) even with high performance computing (HPC) right now.

The National Energy Research Scientific Computing Center (NERSC) is a vital partner in succeeding in the era of big data. We’re already operating at the scale where processing of our data requires an HPC environment and we are very fortunate that at the JGI this is provided by NERSC. We need a bigger database and bigger computer clusters to support the growing community demand, but we also need to have the right computational environment to run our pipelines.

Another big challenge is how to support big data without sacrificing data quality. For example, annotating the metadata in the Genomes OnLine Database (GOLD) is heavily manual, but it adds tremendous value to the sequence data. Manual annotation certainly conflicts with scaling, but the availability of metadata is critical in order to interpret the data we have.

How do you see IMG integrating with KBase? What are the challenges here?

The two systems have different scientific goals and overall mission and because of that they also have fundamentally different design commitments, and follow different principles in data organization and user support. For example, while IMG’s focus is on the comparative analysis of microbial genomes and metagenomes with emphasis on the interface between the two, the focus of the Department of Energy Systems Biology Knowledgebase (KBase) seems to be more on the isolate genome side and metabolic modeling at least for now. Moreover, while one of IMG’s strengths lies in the state-of-the-art data processing pipelines integrated into the system, enabling uniform annotation (and therefore comparability) across all of its datasets, KBase is following a different principle, allowing users to select across various available annotation pipelines. Due to the above, system integration doesn’t seem to be the right path here. Our primary goal instead is to enable users to review and analyze their data as well as move easily across the two systems. In order to achieve that we need to develop a seamless data transfer/exchange between JGI and KBase and this is currently the direction of our joint efforts.

What do you hope IMG will look like and be able to do for users in 10 years?

Former JGIer Phil Hugenholtz (right), now at the University of Queensland Center for Ecogenomics, presented the Van Niel Prize to Nikos Kyrpides (left).

The exponential growth of sequence data fueled by the democratization of genome sequencing is already having a dramatic effect on the available solutions in data management, including integrating, storing, and processing of the data as well as enabling their analysis and distribution to the community. Some of the most frequently adopted solutions improvise on “cutting corners” with detrimental effects on the quality and precision of the results. Practitioners of these types of approaches invariably select data partitioning instead of integration, and speed over accuracy, thus severely inflating the ramifications of the “streetlight effect” in data interpretation.

My hope for the next 10 years is that IMG will persevere with its current course in supporting the JGI user community and JGI Science through its emphasis on high quality, and will maintain its position as a premier comparative analysis system for microbial genomes and metagenomes worldwide. To achieve that, IMG will need to remain strongly coupled to JGI’s constantly evolving scientific directions and technology developments, particularly in the area of functional genomics, and at the same time continue providing unique solutions for data integration and visualization.

In terms of new directions, my expectation is that in the next decade, the biggest overhaul in the landscape of microbial genomics and metagenomics will be at the interface of the two, and therefore this is where a large part of future IMG developments will focus. In keeping with its goal of supporting the analysis of both the parts and the whole, I would like to see IMG playing a central role in enabling the identification and analysis of individual populations from environmental communities, as well as facilitating the elucidation of their role within the community.

What should the user community know about IMG as the data management system embarks on the next 10 years?

There’s a huge amount of functionality in IMG already, but we certainly need to continue adding more. The two main directions in the near future include adding more functionality and efficiently supporting data/size growth. New functionality will include expanding the system to support the new data types produced from JGI functional genomics efforts (e.g. metabolomics and transposomics), but also creating specialized datamarts such as the IMG-ABC (integrating Natural Products and their corresponding Biosynthetic gene clusters).

We are also expanding our coverage of eukaryotic genomes to bring more plant and fungal genomes into IMG. Our goal is to achieve a more holistic approach in data integration and analysis, in order to study complex biological systems, such as the plant microbiome. Of course, getting the large isolate genomes in the system will mean substantial increases in comparison times and investments in computational resources. But that’s the obvious way to go: you need to have all the data integrated. If you have missing parts, discovery is missed.

Lawrence Berkeley National Lab Biosciences Area
A project of the US Department of Energy, Office of Science

JGI is a DOE Office of Science User Facility managed by Lawrence Berkeley National Laboratory

© 1997-2023 The Regents of the University of California