Published in:
Bioinformatics (Sep 10 2013)
Author(s):
DOI:
10.1093/bioinformatics/btt528
Abstract:
MOTIVATION: The recent revolution in sequencing technologies has led to an exponential growth of sequence data. As a result, most of the current bioinformatics tools become obsolete as they fail to scale with data. To tackle this “data deluge”, here we introduce the BioPig sequence analysis toolkit as one of the solutions that scale to data and computation.
RESULTS: We built BioPig upon the Apache’s Hadoop MapReduce system and the Pig data flow language. Compared to traditional serial and MPI based algorithms, BioPig has three major advantages: first, BioPig’s programmability greatly reduces development time for parallel bioinformatics applications; second, testing BioPig with up to 500 Gb sequences demonstrates that it scales automatically with size of data; and finally, BioPig can be ported without modification on many Hadoop infrastructures, as tested with Magellan system at NERSC and the Amazon Elastic Compute Cloud. In summary, BioPig represents a novel program framework with the potential to greatly accelerate data-intensive bioinformatics analysis.
AVAILABILITY: BioPig is released as open source software under the BSD license at https://sites.google.com/a/lbl.gov/biopig/
CONTACT: [email protected].