Application of High Performance Computing to the DOE Joint Genome Institute’s Data Challenges

January 25-26, 2010
DOE Joint Genome Institute, Walnut Creek, CA USA
—by invitation only—

Meeting Motivation

The generation of sequence is no longer the bottleneck in the widespread application of genomic science to the understanding of fundamental biological processes. However, as running time and cost cease to be the limiting factors in sequence generation, the volume of sequence data that can be routinely generated raises a plethora of analysis and data management challenges.

The Department of Energy’s National Laboratories are at the cutting edge of high-performance scientific computing, with computational platforms, network infrastructure and personnel focused on enabling computationally intensive projects in DOE-relevant science. To date, this capability has been predominantly directed towards modeling and simulation in the physical and engineering sciences, including such complex problems as climate modeling. However, biological systems research in general, and sequence-enabled research at the JGI in particular, have reached a scale and computational complexity that require support from the largest available computational facilities if computational analysis and data management are not to become the new bottlenecks in the translation of sequence data to biological information.

This workshop will bring together producers, managers, and users of next-generation sequencing data from the Joint Genome Institute with computational scientists from the National Labs, providing a venue for the discussion of significant computational challenges. These will include the analysis, data integration and data management challenges faced by the JGI, and the identification of appropriate infrastructure and methodologies for addressing these challenges in partnership with the National Labs’ HPC resources.

Recognizing that in the present age of “commodity supercomputing” there is significant expertise and experience with large-scale computation on massive datasets outside the National Labs, and that other genome sequencing centers (Broad, Sanger, Wash U) are gearing up to face similar challenges, the workshop will also provide a venue to hear about best practices and novel algorithms being developed and deployed outside DOE facilities.

Workshop Goals

The essential goal of this meeting is to accelerate the translation of sequence data to biological information. Towards this end, the workshop will:

Identify and define the challenges that the JGI and the genomic science community face from the exponential data growth driven by next-generation sequencing, and identify potential computational solutions to these challenges.

Educate members of the HPC community on the types of algorithmic, scaling, performance, data management and analysis problems encountered in the genomic domain, empowering them to map frameworks and solutions from the HPC domain to the genomic domain.

Lay the groundwork for establishing collaborations focused on four to six central problems that individuals or institutions might be able to help the JGI solve.

Intended Participants (by invitation only)

Computational researchers and managers from DOE high-performance computing organizations at the national laboratories
Computational researchers from outside the DOE system who bring relevant experience and expertise
Researchers, biologists and informatics managers from the Joint Genome Institute involved in the production, management and analysis of next-generation sequencing data
Researchers who may have developed computational solutions to the issues outlined below

Introductory Presentations

Eddy Rubin: Overview of the DOE JGI and Genomics now and in the next decade

Rick Stevens: Overview of the path for DOE HPC to intersect with genomics

Presentations Followed by Discussion Groups

The goal of the presentations will be for individuals close to the data and associated science to educate the HPC community about the computational challenges they face. The discussion groups, led by the HPC participants, will work to decipher the issues and identify possible approaches to solutions.

Topics Include

I. Short Read Sequence Assembly and Integration of Reads from Multiple Sequencing Platforms
II. Development of Information from the Data: Annotation and Large-Scale Data Integration
III. Evolutionary Analysis and Genome Feature Detection
IV. Data Handling and Computational Infrastructure

Output

The output of the workshop will be a short document that articulates the key computational challenges the JGI will face in the next five years and discusses the possible HPC solutions to these challenges (that are offered by DOE National Laboratory computational facilities?).