Annual Report
2001
SCIENCE HIGHLIGHTS:
BIOLOGICAL and ENVIRONMENTAL RESEARCH

Computational Analysis of Genomic Sequence Data

On the right, each line segment represents a genomic fragment whose sequence has been determined at each end (arrows). On the left, blue rectangles represent contiguous stretches of reassembled sequence. This visualization tool allows rapid inspection of the automated assemblies produced by JAZZ. The genome shown is that of the white rot fungus Phanerochaete chrysosporium.

Research Objectives
We are developing, implementing, and applying parallel code for the assembly of whole genome shotgun sequence data, taking into account the constraints imposed by paired-end sequence information. Genomes to be assembled include that of Fugu rubripes, the Japanese pufferfish, which has a 400 million base-pair (Mbp) genome; the 200 Mbp Ciona intestinalis genome; and the 2.6 billion base-pair (Gbp) mouse genome. Other large animal, plant, and fungal genomes will be assembled in the future. We will then use parallel implementations of BLAST and other sequence comparison codes for the high-throughput analysis of genomic and other sequence data. Further development will include rapid parallel searches for short conserved elements.

Computational Approach
We use custom parallel code to accomplish genome assembly. The overall plan has three phases: (1) rapid identification of overlaps between pairs of sequence fragments, (2) the construction of a linear layout of these fragments that is consistent with overlaps and pair-end information, and (3) the conversion of this layout into a consensus sequence. Based on the sequence coverage goals of the Joint Genome Institute sequencing effort, contigs of 50,000 bases or more are expected. Phase 1 is the most time-consuming part of the project; it has been parallelized using MPI and tested on a 3× coverage mouse dataset. Phase 2 is memory intensive; it can be completed on a single IBM SP node, threaded over 16 processors, but requires a large amount of RAM. Phase 3 is embarrassingly parallel. We have developed a second large-scale assembler with an alternative Phase 2/3 division of labor, and we expect to use both implementations as needed, depending on the detailed characteristics of each dataset.
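To make Phase 1 concrete, here is a minimal Python sketch of one common way to find overlaps quickly: index every read by its k-mers, treat reads that share a k-mer as candidate pairs, and then check each candidate for an exact suffix-prefix match. This is an illustration only, not the JAZZ implementation. The function names, the value of `k`, and the toy reads are all assumptions, and the sketch ignores sequencing errors, reverse complements, and the MPI parallelization described above.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k):
    """Return pairs of read indices that share at least one k-mer."""
    index = defaultdict(set)
    for i, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(i)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b, or 0 if shorter than min_len."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def find_overlaps(reads, k=4, min_len=4):
    """Map (left read, right read) index pairs to their exact overlap length."""
    overlaps = {}
    for i, j in candidate_pairs(reads, k):
        # Test both orientations of the pair: i before j, and j before i.
        for x, y in ((i, j), (j, i)):
            n = suffix_prefix_overlap(reads[x], reads[y], min_len)
            if n:
                overlaps[(x, y)] = n
    return overlaps


# Hypothetical toy reads; real fragments are hundreds of bases long.
reads = ["ACGTTGCA", "TGCAGGTC", "GGTCAAAT"]
overlaps = find_overlaps(reads)
```

The k-mer index limits the expensive comparison to pairs that already share a short exact match, which is why the candidate step matters when the number of fragments is in the millions.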

Sequence comparison will be carried out using an implementation of parallel BLAST ported to the IBM SP. Since 5,000 to 10,000 contigs are expected, these comparisons can be distributed across multiple processors in an embarrassingly parallel fashion. Similarly, comparisons of microbial genomes with one another can be straightforwardly parallelized.
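Because each contig can be compared independently, the main practical question is how to divide the contigs among processors. The following Python sketch shows one simple option, a greedy longest-first assignment that keeps the total number of bases per worker roughly even. It is not taken from the project's code; `assign_contigs`, the contig names, and the lengths are hypothetical, and the comparison step itself (for example, a BLAST run on each batch) is left out.

```python
import heapq


def assign_contigs(contig_lengths, n_workers):
    """Greedy longest-first assignment of contigs to workers.

    contig_lengths maps contig name -> length in bases.
    Returns a list of n_workers lists of contig names.
    """
    # Min-heap of (bases assigned so far, worker index).
    heap = [(0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    batches = [[] for _ in range(n_workers)]
    # Place the longest contigs first; ties are broken by name for determinism.
    ordered = sorted(contig_lengths.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, length in ordered:
        load, w = heapq.heappop(heap)  # least-loaded worker
        batches[w].append(name)
        heapq.heappush(heap, (load + length, w))
    return batches


# Hypothetical contigs; a real run would have thousands.
contigs = {"c1": 50000, "c2": 30000, "c3": 20000, "c4": 20000, "c5": 10000}
batches = assign_contigs(contigs, 2)
```

Each resulting batch could then be handed to a separate processor with no communication between workers, which is what makes the workload embarrassingly parallel.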

Accomplishments
We have used the IBM SP to develop and test our new large-scale genome assembly system, JAZZ, which reconstructs contiguous genome sequences by overlapping the short subsequences that can be determined using modern DNA sequencing technology. JAZZ self-consistently uses pair-end information in the construction of contigs, and produces ordered and oriented sequence scaffolds as output. An initial mouse genome assembly has been carried out. The test dataset consisted of mouse sequence fragments that, on average, cover each base of the mouse genome three times, for a total of 14 million sequence fragments. We assembled these fragments into approximately one million 3,000-base-pair contiguous sequences. This assembly required over 100,000 hours of processor time, primarily for the fragment comparison step. We are now prepared to assemble future mammalian genome datasets, enabling public whole genome sequencing efforts.
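The figures quoted above can be related by the usual coverage formula, coverage = (number of fragments × mean fragment length) / genome size. The short Python sketch below works through that arithmetic using the 2.6 Gbp mouse genome size given under Research Objectives. The mean fragment length it derives is only implied by these numbers and is not stated in the report, and the pair count describes a naive all-against-all comparison, which the report does not say JAZZ performs.

```python
def coverage(n_fragments, mean_read_length, genome_size):
    """Average number of times each base is covered."""
    return n_fragments * mean_read_length / genome_size


def implied_read_length(n_fragments, target_coverage, genome_size):
    """Mean fragment length needed to reach a given coverage."""
    return target_coverage * genome_size / n_fragments


GENOME_SIZE = 2_600_000_000   # 2.6 Gbp mouse genome (from the text)
N_FRAGMENTS = 14_000_000      # 14 million fragments (from the text)
COVERAGE = 3                  # 3x coverage (from the text)

# Implied mean fragment length: an inference, not a reported value.
read_len = implied_read_length(N_FRAGMENTS, COVERAGE, GENOME_SIZE)
print(round(read_len))  # 557

# Number of distinct fragment pairs in a naive all-against-all comparison.
n_pairs = N_FRAGMENTS * (N_FRAGMENTS - 1) // 2
print(n_pairs)  # 97999993000000
```

The roughly 10^14 possible pairs give a sense of why the fragment comparison step dominates the processor time and why rapid overlap identification is needed.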

Significance
The availability of genomic sequences for a wide range of organisms allows unprecedented access to the fundamental parts-list-and-instructions for constructing these organisms, and allows comparison of these lists to uncover essential pathways for various bacterial and eukaryotic processes related to energy, the environment, and human susceptibilities. For example, the human and fish lineages diverged nearly 400 million years ago; direct comparison of their genomic sequence will reveal conserved (and therefore presumably functional) elements defining coding and regulatory sequences. Relatively small-scale comparisons of this sort have proven useful for identifying human genes; we expect the comparison of the human genome with that of a more distant vertebrate will highlight these conserved elements on a genomic scale.

Publications
http://www.jgi.doe.gov
