Sequencing technology has changed dramatically in the 25-plus years since the JGI’s inception, making it possible for researchers to get a close look at more ecosystems and organisms than ever before. In 2006, the JGI produced 33 billion base pairs of sequence; by 2023, that number was almost 717 trillion. Last year, the JGI surpassed three petabases of data sequenced — that’s three quadrillion base pairs of DNA sequence!
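To get a feel for that growth, here is a quick back-of-the-envelope calculation using the figures quoted above (illustrative only; the variable names are ours, not the JGI’s):

```python
# Illustrative arithmetic on the sequencing totals quoted above.
bp_2006 = 33e9      # 33 billion base pairs sequenced in 2006
bp_2023 = 717e12    # almost 717 trillion base pairs in 2023
cumulative = 3e15   # 3 petabases = 3 quadrillion base pairs, total sequenced

# Annual output grew more than 20,000-fold between 2006 and 2023.
fold_increase = bp_2023 / bp_2006
print(f"2006 -> 2023 annual output grew ~{fold_increase:,.0f}-fold")
print(f"3 petabases = {cumulative:,.0f} base pairs")
```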
To meet this “data tsunami,” as Kathy Yelick of the Computing Sciences Area at Lawrence Berkeley National Laboratory (Berkeley Lab) describes it, JGI researchers have also collaborated to advance computing infrastructure and analysis. This way, JGI users — both primary users who generate data and secondary users who download it — have access to high-quality, assembled datasets built from all of this information. The JGI aims to create pipelines, tools and other standardized data resources that the global research community can access and use to manage the information being generated across disciplines.
These advanced computing capabilities meet the needs of scientists out in the field, and developing such tools has been a collaborative effort. For example, with support from the Exascale Computing Project, the JGI — along with the National Energy Research Scientific Computing Center (NERSC), the Oak Ridge Leadership Computing Facility and the ExaBiome project — worked to offer researchers high-powered tools for large-scale assembly and analysis.
To put these efforts into perspective, we caught up with JGI staff, collaborators and users to see how software and supercomputing capabilities have evolved to enable large-scale analysis. The story of a massive metagenome assembly for a time-series experiment at the highly studied Lake Mendota is told in a three-part series of our Genome Insider podcast.
(Subscribe to Genome Insider on Apple Podcasts, Spotify or wherever you get your podcasts)
Microbial ecologist and JGI user Trina McMahon has been sampling microbes at Lake Mendota, which borders the University of Wisconsin-Madison, for over 20 years in order to better understand how the freshwater ecosystem works. When she set out to analyze 500 metagenomes from this sample set through the Community Science Program, it was the largest project the JGI had ever put together.
“When we proposed to do the Mendota time series, we knew that it was going to be more samples and more data per sample, and it was kind of … like — let’s write a proposal to sequence samples that will break JGI’s computers,” McMahon said, laughing, when we chatted with her in 2023.
Watch the sample preparation done for this project by Trina McMahon’s lab at the University of Wisconsin-Madison.
Luckily for us, McMahon’s work hasn’t broken anything at the JGI. The Lake Mendota dataset was run through the assembler MetaHipMer2. This assembler enables a more detailed read of the massive dataset, illuminating not just the highly abundant microbes we expect to see, but also low-abundance microorganisms whose signals would have been filtered out as noise by more traditional methods. Thus far, 25 terabases of metagenome data have been sequenced and assembled through the combined resources of the JGI, NERSC and ExaBiome teams.
- Part 1: Many, Many Mers — While biologists were out sampling Lake Mendota in their boats, data scientists and software developers at the JGI and Berkeley Lab were developing specific programs to handle this scale of data.
- Part 2: Souped Up Computing — Summit, Frontier, Perlmutter and now Dori, oh my! Take a look at the supercomputers that stitch together large datasets with the assembler program MetaHipMer2.
- Part 3: Boating Out to David Buoy — How did the mega dataset from Lake Mendota come together? We learn how researchers get these specialized snapshots of a freshwater ecosystem.
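The “mers” in Part 1’s title refer to k-mers — the short, fixed-length subsequences that assemblers like MetaHipMer2 count and link together to reconstruct genomes from billions of reads. Here is a toy sketch of the core idea (our own illustration, not MetaHipMer2’s actual code, which distributes this work across supercomputer nodes):

```python
from collections import Counter

def count_kmers(read: str, k: int) -> Counter:
    """Count every overlapping k-length substring (k-mer) in a sequencing read."""
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))

# Toy example: one short "read" and k = 4.
kmers = count_kmers("ATGCGATGCA", k=4)
# Real assemblers aggregate such counts across billions of reads;
# at that scale, even rare k-mers from low-abundance organisms
# accumulate enough support to survive noise filtering.
```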
The size of this dataset is a testament to sampling an ecosystem over a long period of time, and supporting datasets like the Lake Mendota time series is an investment that pays dividends. Microbes are at the core of many environmental processes, and their genetic makeup forms the basis of their behavior in the environment. Large metagenomic surveys could lead to better predictive models of global climate. They may also provide a basic understanding of the roles microbiomes play in determining the behavior of the physical environment and open opportunities for biologically based mitigation strategies.
“I can say this very confidently,” said JGI Metagenome Program Lead Emily Eloe-Fadrosh. “There is no other place in the world that can do these types of metagenome assemblies.”