On the right, each line segment represents
a genomic fragment whose sequence has been determined at each end
(arrows). On the left, blue rectangles represent contiguous stretches
of reassembled sequence. This visualization tool allows rapid inspection
of the automated assemblies produced by JAZZ. The genome shown is
that of the white rot fungus Phanerochaete chrysosporium.
Daniel Rokhsar, Lawrence Berkeley National Laboratory and Joint Genome Institute
Research Objectives
We are developing, implementing, and applying parallel code for
the assembly of whole genome shotgun sequence data, incorporating the constraints
imposed by paired-end sequence information. Genomes to be assembled
include Fugu rubripes, the Japanese pufferfish, which has a 400
million base-pair (Mbp) genome; the 200 Mbp Ciona intestinalis
genome; and the 2.6 billion base pair (Gbp) mouse genome. Other large
animal, plant, and fungal genomes will be assembled in the future. We
will then use parallel implementations of BLAST and other sequence comparison
codes for the high-throughput analysis of genomic and other sequence data.
Further development will include rapid parallel searches for short conserved
elements.
Computational Approach
We use custom parallel code to accomplish genome assembly. The
overall plan has three phases: (1) rapid identification of overlaps between
pairs of sequence fragments, (2) the construction of a linear layout of
these fragments that is consistent with overlaps and pair-end information,
and (3) the conversion of this layout into a consensus sequence. Based
on the sequence coverage goals of the Joint Genome Institute sequencing
effort, contigs of 50,000 bases or more are expected. Phase 1 is the most
time-consuming part of the project; it has been parallelized using MPI
and tested on a 3× coverage mouse dataset. Phase 2 is memory intensive;
it can be completed on a single IBM SP node, threaded over 16 processors,
but requires a large amount of RAM. Phase 3 is embarrassingly parallel. We have developed
a second large-scale assembler with an alternative Phase 2/3 division
of labor, and we expect to use both implementations as needed depending
on detailed aspects of datasets.
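
As an illustration of Phase 1, the following is a minimal sketch of one common way to find overlaps without comparing every pair of fragments: index the fragments by short substrings (k-mers), then check only pairs that share a k-mer. This is not the JAZZ implementation, whose algorithm is not described here. The function names and the parameters `k` and `min_len` are assumptions made for the example, and the sketch ignores sequencing errors and reverse-complement matches.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k=8):
    """Return pairs of read indices that share at least one k-mer."""
    index = defaultdict(set)
    for i, seq in enumerate(reads):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].add(i)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def suffix_prefix_overlap(a, b, min_len=8):
    """Length of the longest suffix of a equal to a prefix of b.

    Returns 0 if no exact overlap of at least min_len bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def find_overlaps(reads, k=8, min_len=8):
    """Map (i, j) -> overlap length where a suffix of read i matches a prefix of read j."""
    overlaps = {}
    for i, j in candidate_pairs(reads, k):
        # Test both orientations of the pair: i before j, and j before i.
        for x, y in ((i, j), (j, i)):
            n = suffix_prefix_overlap(reads[x], reads[y], min_len)
            if n:
                overlaps[(x, y)] = n
    return overlaps


reads = ["ACGTACGTTTGACCA", "TTGACCAGGCTTAAC", "GGGGGGGGGGGGGGG"]
print(find_overlaps(reads, k=6, min_len=6))
```

Because each fragment's k-mers can be indexed and queried independently, work of this kind can be split across processes, which is in keeping with the MPI parallelization described above.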
Sequence comparison will be carried out using an implementation of parallel
BLAST ported to the IBM SP. Since 5,000 to 10,000 contigs are expected,
these comparisons can be distributed across multiple processors in an
embarrassingly parallel fashion. Similarly, comparisons of microbial genomes
with one another can be straightforwardly parallelized.
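
To make the embarrassingly parallel distribution concrete, here is a small sketch of how contigs might be divided among processors so that each receives a similar number of bases to search. It uses a greedy longest-first heuristic; the function `assign_contigs` and the contig names are hypothetical, and nothing here reflects how the parallel BLAST port actually schedules its work.

```python
import heapq


def assign_contigs(contig_lengths, n_workers):
    """Greedy longest-first assignment of contigs to workers.

    contig_lengths: dict mapping contig name -> length in bases.
    Returns one (total_bases, [contig names]) tuple per worker.
    """
    # Min-heap of (current load, worker id): the lightest worker is popped first.
    heap = [(0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    batches = [[] for _ in range(n_workers)]
    for name, length in sorted(contig_lengths.items(), key=lambda kv: -kv[1]):
        load, w = heapq.heappop(heap)
        batches[w].append(name)
        heapq.heappush(heap, (load + length, w))
    loads = [0] * n_workers
    for load, w in heap:
        loads[w] = load
    return list(zip(loads, batches))


contigs = {"c1": 50000, "c2": 30000, "c3": 20000, "c4": 10000}
for total, names in assign_contigs(contigs, n_workers=2):
    print(total, names)
```

Since the comparisons for one contig do not depend on those for another, each batch can be processed with no communication between workers.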
Accomplishments
We have used the IBM SP to develop and test our new
large-scale genome assembly system, JAZZ, which reconstructs contiguous
genome sequences by overlapping the short subsequences that can be determined
using modern DNA sequencing technology. JAZZ self-consistently uses pair-end
information in the construction of contigs, and produces ordered and oriented
sequence scaffolds as output. An initial mouse genome assembly has been
carried out. The test dataset consisted of mouse sequence fragments that,
on average, cover each base of the mouse genome three times, for a total of
14 million sequence fragments. We assembled these fragments into approximately
one million 3,000-base-pair contiguous sequences. This assembly required
over 100,000 hours of processor time, primarily for the fragment comparison
step. We are now prepared to assemble future mammalian genome datasets,
enabling public whole genome sequencing efforts.
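
The figures above can be checked with a short back-of-the-envelope calculation. The sketch below uses only numbers stated in this summary (a 2.6 Gbp genome, threefold coverage, 14 million fragments). The pair count is what a naive all-against-all comparison would face; it is included only to suggest why the fragment comparison step dominates the processor time, not as a description of what JAZZ actually computes.

```python
GENOME_BP = 2_600_000_000   # mouse genome size quoted in this summary
COVERAGE = 3                # average number of times each base is covered
N_FRAGMENTS = 14_000_000    # fragments in the test dataset

# Average fragment length implied by the coverage: about 557 bases.
mean_fragment_bp = COVERAGE * GENOME_BP / N_FRAGMENTS

# Distinct fragment pairs in a naive all-against-all comparison: about 9.8e13.
naive_pairs = N_FRAGMENTS * (N_FRAGMENTS - 1) // 2

print(round(mean_fragment_bp), naive_pairs)
```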
Significance
The availability of genomic sequences for a wide range of organisms
allows unprecedented access to the fundamental parts-list-and-instructions
for constructing these organisms, and allows comparison of these lists
to uncover essential pathways for various bacterial and eukaryotic processes
related to energy, the environment, and human susceptibilities. For example,
the human and fish lineages diverged nearly 400 million years ago; direct
comparison of their genomic sequence will reveal conserved (and therefore
presumably functional) elements defining coding and regulatory sequences.
Relatively small-scale comparisons of this sort have proven useful for
identifying human genes; we expect the comparison of the human genome
with that of a more distant vertebrate will highlight these conserved
elements on a genomic scale.
Publications
http://www.jgi.doe.gov