NERSC: Powering Scientific Discovery Since 1974

Meraculous

Description

De novo whole-genome assembly reconstructs a genomic sequence from short, overlapping, and potentially erroneous fragments called reads, such as those typically generated by state-of-the-art high-throughput sequencers. Meraculous is one component of a multi-part, massively parallel de novo genome assembly pipeline developed jointly by researchers at UC Berkeley, Lawrence Berkeley National Laboratory, and the DOE Joint Genome Institute. Like its namesake (originally implemented in Perl), Meraculous constructs and traverses the de Bruijn graph of all overlapping substrings of length k (k-mers) present in the input dataset of redundant short sequence reads. By traversing the de Bruijn graph and discovering all (possibly disconnected) linear subgraphs, Meraculous constructs high-quality contiguous sequences of genomic data, each typically composed of many of the original reads. These resulting sequences are known as contigs.
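The contig construction described above can be illustrated with a minimal, serial Python sketch. It is not the Meraculous implementation: it assumes error-free reads, ignores reverse complements and k-mer quality filtering, and skips isolated cycles. The function names (`build_kmer_set`, `assemble_contigs`, and so on) are hypothetical and chosen only for this example.

```python
def build_kmer_set(reads, k):
    """Collect every distinct substring of length k found in the reads."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}


def successors(kmer, nodes):
    """k-mers in the graph whose first k-1 bases equal the last k-1 of kmer."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in nodes]


def predecessors(kmer, nodes):
    """k-mers in the graph whose last k-1 bases equal the first k-1 of kmer."""
    return [b + kmer[:-1] for b in "ACGT" if b + kmer[:-1] in nodes]


def is_path_start(kmer, nodes):
    """A linear path starts where the left neighbourhood is not one-to-one."""
    preds = predecessors(kmer, nodes)
    if len(preds) != 1:
        return True
    return len(successors(preds[0], nodes)) != 1


def assemble_contigs(reads, k):
    """Return one contig per maximal linear (non-branching) path.

    Assumption: isolated cycles have no path start and are skipped.
    """
    nodes = build_kmer_set(reads, k)
    contigs = []
    for start in sorted(nodes):
        if not is_path_start(start, nodes):
            continue
        contig, current = start, start
        while True:
            succs = successors(current, nodes)
            # Stop at a dead end or a fork to the right.
            if len(succs) != 1:
                break
            nxt = succs[0]
            # Stop if another path merges into the next k-mer.
            if len(predecessors(nxt, nodes)) != 1:
                break
            contig += nxt[-1]
            current = nxt
        contigs.append(contig)
    return contigs


if __name__ == "__main__":
    # Three overlapping toy reads; k=4 is far smaller than realistic values.
    print(assemble_contigs(["ATGGCGT", "GGCGTGCA", "GTGCAAT"], 4))
```

In this toy case the three reads overlap without ambiguity, so the walk yields a single contig spanning all of them. When a k-mer has two possible extensions, the walk stops at the fork and the branches become separate contigs.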

Unlike its namesake, Meraculous is implemented in Unified Parallel C (UPC), which was chosen for its support of the one-sided communication model. This model is required to construct and later traverse the de Bruijn graph in a scalable manner; the graph is represented as a distributed hash table in which the k-mers are the keys. One-sided communication is necessary because of the complex, data-dependent access patterns of these operations: the communicating peers, the extents of data accessed, and the timing of accesses are not known a priori.
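To see why the access pattern is data dependent, consider the following single-process Python sketch of a hash table partitioned across ranks. It only simulates the data layout: there is no UPC and no real communication, and the class `ShardedKmerTable`, the CRC32-based owner function, and the helper names are assumptions made for illustration rather than details of the Meraculous code.

```python
import zlib


class ShardedKmerTable:
    """Single-process stand-in for a hash table partitioned across ranks."""

    def __init__(self, nranks):
        self.nranks = nranks
        self.shards = [{} for _ in range(nranks)]  # one dict per simulated rank

    def owner(self, kmer):
        # Deterministic hash, so the k-mer to rank mapping is reproducible.
        return zlib.crc32(kmer.encode("ascii")) % self.nranks

    def insert(self, kmer, next_base=None):
        """Store kmer on its owner rank; optionally record a right extension."""
        entry = self.shards[self.owner(kmer)].setdefault(kmer, set())
        if next_base is not None:
            entry.add(next_base)

    def extensions(self, kmer):
        """Fetch the right extensions of kmer from whichever rank owns it."""
        return self.shards[self.owner(kmer)].get(kmer, set())


def load_reads(table, reads, k):
    """Insert every k-mer of every read together with the base that follows it."""
    for read in reads:
        for i in range(len(read) - k + 1):
            following = read[i + k] if i + k < len(read) else None
            table.insert(read[i:i + k], following)


def walk(table, start):
    """Follow unique right extensions, recording which rank owns each k-mer."""
    trace, seen, kmer = [], set(), start
    while kmer not in seen:
        seen.add(kmer)
        trace.append((kmer, table.owner(kmer)))
        exts = table.extensions(kmer)
        if len(exts) != 1:  # dead end or fork: stop the walk
            break
        kmer = kmer[1:] + next(iter(exts))
    return trace


if __name__ == "__main__":
    table = ShardedKmerTable(nranks=4)
    load_reads(table, ["ATGGCGT", "GGCGTGCA", "GTGCAAT"], 4)
    for kmer, rank in walk(table, "ATGG"):
        print(kmer, "is owned by rank", rank)
```

The rank that owns the next k-mer is determined only after the current k-mer's extension has been read, so a traversing process cannot tell in advance which peer it will need to contact. In a real distributed setting each lookup would be a remote read that the owning rank does not explicitly participate in, which is the pattern one-sided communication supports.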

Required Problem Sets

Problems are defined in the ./benchmark directory. For Meraculous, small, medium, and large problems have been defined. Results of the large problem will be used as the reference and target in the calculation of the Sustained System Improvement (SSI) metric.

Source Distribution

Source can be downloaded here.

The data sets are very large (the metagenomes file alone is nearly 1 TB) and should only be downloaded over a high-speed connection. The data sets can be found here.

How to Build, Run and Verify

Refer to the README.APEX file in the source distribution.

Authorship

Meraculous was developed jointly by Lawrence Berkeley National Laboratory's Computational Research Division, the Joint Genome Institute, and the University of California, Berkeley, in an effort led by Evangelos Georganas.

Change Log

11/9/2015 Source distribution links enabled
10/30/2015 Initial release