Metagenomics on a Cloud

June 30, 2010

One goal of the Magellan project is to understand which science applications and user communities are best suited for science cloud computing. In fact, some DOE metagenomics researchers have already given public clouds a whirl. Their preliminary findings about the strengths and limitations of commercial cloud computing platforms will be extremely valuable as DOE explores cloud computing for science.

Metagenomics is the study of metagenomes, genetic material recovered directly from environmental samples. By identifying and understanding bacterial species based on sequence similarity, some researchers hope to put microbial communities to work mitigating global warming and cleaning up toxic waste sites, among other tasks. Today, relatively inexpensive platforms can sequence the genomes of various organisms in a matter of days. A few decades ago, it took 13 years and $2.3 billion for the Human Genome Project to sequence the entire human genome. Although the time and cost of sequencing have decreased, the biological interpretation of these rapidly growing datasets remains complex and time-consuming, creating a surge in computing demand.

To determine the best solution for meeting the metagenomics community's increasing demand for computing resources, a collaboration of researchers from the Lawrence Berkeley National Laboratory's Biological Data Management and Technology Center, Advanced Computing for Science Department and the National Energy Research Scientific Computing Center (NERSC) explored the performance and scalability of BLAST, a popular metagenomics application, on a variety of platforms including:

a traditional HPC platform: NERSC's Cray XT4 "Franklin" system
a traditional midrange platform: the 32-node "Planck" Linux cluster at NERSC
a commercial "infrastructure as a service" cloud: Amazon’s EC2
a shared research "platform as a service" cloud: Yahoo M45.

BLAST (Basic Local Alignment Search Tool) is the community's standard for sequence comparison, enabling researchers to compare a query sequence with a library or database of sequences and identify library sequences that resemble the query sequence above a certain threshold. The code is a part of the Microbial Genome and Metagenome (MGM) pipeline at the DOE Joint Genome Institute (JGI).
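To make the idea of "resembling the query above a certain threshold" concrete, here is a minimal Python sketch. It is not BLAST's algorithm: real BLAST uses seeded local alignment and statistical scoring, which this toy does not attempt. Instead it scores each library sequence by the fraction of the query's short substrings (k-mers) that it shares. The function names, the sample sequences, and the values of `k` and `threshold` are all assumptions made for illustration.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def search(query, library, k=3, threshold=0.5):
    """Return (name, score) pairs for library sequences whose share of the
    query's k-mers is at least `threshold`, best match first.

    Toy similarity measure only; BLAST itself scores local alignments.
    """
    query_kmers = kmers(query, k)
    if not query_kmers:  # query shorter than k: nothing to compare
        return []
    hits = []
    for name, seq in library.items():
        score = len(query_kmers & kmers(seq, k)) / len(query_kmers)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical three-sequence "database" for illustration.
library = {
    "seq_a": "ACGTACGTGA",
    "seq_b": "TTTTGGGGCC",
    "seq_c": "ACGTTCGTGA",
}
hits = search("ACGTACGT", library)
```

Raising `threshold` narrows the result to closer matches, which mirrors in spirit how a stricter cutoff reduces the number of reported hits.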

Although the team achieved scalable performance on all of the evaluated platforms, their results show that the cost of running BLAST-based codes on commercial cloud architectures increased significantly as they scaled up. This was primarily due to the premium cost associated with on-demand access. Additional overhead costs of scientific computing on a commercial cloud include boot-up time and data transfer and loading time. The team also found startup implementation costs for customizing a cloud environment for each science discipline, and concluded that applications such as BLAST need high-end machines with the latest CPUs and memory, which are less commonly used in commercial cloud applications. From a software perspective, they found that Hadoop, an open-source implementation of MapReduce, could successfully be used with BLAST and similar codes with a little more development work. They also identified a need for better tools for managing data, cost and virtual machine images.

In another study, Jared Wilkening, a software developer at Argonne National Laboratory, tested the feasibility of employing Amazon EC2 to run BLAST-based biology and bioinformatics tools. He notes that BLAST-based codes, like the one he ran on Amazon EC2, are well suited to cloud computing because they require little internal synchronization and therefore do not rely on high-performance interconnects. Like the Berkeley team, he found that Amazon is significantly more expensive than locally owned clusters. Wilkening's paper was published in Cluster 2009.
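The "little internal synchronization" property can be sketched as follows: a batch of queries is split into chunks, each chunk is searched independently, and the results are simply concatenated at the end. This is an illustrative Python sketch, not Wilkening's code. `search_chunk` is a hypothetical stand-in for a real search, and threads are used only to keep the example self-contained; in practice each chunk would more likely go to a separate process or machine instance.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items, n_chunks):
    """Split items into at most n_chunks contiguous, near-equal parts."""
    size, extra = divmod(len(items), n_chunks)
    parts, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty parts when there are few items
            parts.append(items[start:end])
        start = end
    return parts


def search_chunk(queries):
    """Hypothetical stand-in for one worker's search.

    It only records each query's length; a real worker would run the
    sequence search for its chunk here.
    """
    return [(q, len(q)) for q in queries]


def run_parallel(queries, n_workers=4):
    """Search chunks independently; workers never communicate."""
    parts = chunk(queries, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() yields results in the same order as the input chunks.
        results = list(pool.map(search_chunk, parts))
    return [hit for part in results for hit in part]


queries = ["ACGT", "AC", "ACGTAC", "A", "ACG"]
results = run_parallel(queries, n_workers=2)
```

Because the only coordination is the initial split and the final merge, the workers never need to exchange data while they run, which is why a fast interconnect matters little for this kind of workload.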

In the future, the Berkeley team hopes to test the performance and scalability of other portions of the MGM pipeline on a variety of platforms. They presented their BLAST performance comparisons at the Using Clouds for Parallel Computations in Systems Biology workshop at SC09.

Contributors: Lavanya Ramakrishnan, Victor Markowitz, Shane Canon, Shreyas Cholia, Keith Jackson, John Shalf and Linda Vu of the Lawrence Berkeley National Laboratory, and Jared Wilkening of the Argonne National Laboratory.

About NERSC and Berkeley Lab
The National Energy Research Scientific Computing Center (NERSC) is a U.S. Department of Energy Office of Science User Facility that serves as the primary high performance computing center for scientific research sponsored by the Office of Science. Located at Lawrence Berkeley National Laboratory, NERSC serves almost 10,000 scientists at national laboratories and universities researching a wide range of problems in climate, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines. Berkeley Lab is a DOE national laboratory located in Berkeley, California. It conducts unclassified scientific research and is managed by the University of California for the U.S. Department of Energy.