By Sarah Moussa
CS 267, Fall 2002
The Problem:
Protein folding is perhaps the most compelling and challenging problem facing computational biochemists today. In addition to being academically interesting, solving the protein-folding problem could also potentially save millions of development dollars for pharmaceutical companies and could make “designer” medicines a reality, as well as promoting our understanding of many cellular processes and diseases.
The essence of the problem is predicting the native conformation of a protein (the natural folded state which is a ridiculously complicated form) completely, given just the amino acid sequence of the protein and some information about its environment (for example the type of solution surrounding the protein). Many scientists are approaching this problem from the bottom up—using first principles to model the electrical, chemical, and quantum interactions of the individual atoms that make up the protein and the environment. One such researcher is Peter Kollman of UC San Francisco (as well as my research advisor Teresa Head-Gordon!). In his massive 1998 simulation, he considered a small 36 amino-acid protein called “villin headpiece sub-domain”. The simulation required 12,000 atoms in order to model the protein and the surrounding water.
Adding significantly to the complexity of such molecular dynamics simulations is the fact that the calculations of force, position, momentum, etc. must be carried out with a very small time step in order to avoid significant discretization and computational errors. Typically, simulations are carried out with a time step in the femtosecond range (10-15 sec), which means the simulated quantities must be fully calculated a billion times to simulate only a single microsecond of real time!
Fortunately, techniques can be used to efficiently parallelize even the complex particle-particle computations in molecular dynamics simulations. In Kollman’s group, a researcher by the name of Yong Duan applied many parallel programming techniques to optimize and improve the parallel scalability of the molecular dynamics piece of the AMBER software suite (the acronym stands for Assisted Model Building with Energy Refinement), a popular biomolecule modeling software package. His changes caused a 70% performance increase in single-processor performance and also improved the load balancing and communication between processors, increasing the performance of the application on a 256-processor Cray T3E six-fold and allowing the group to simulate a full microsecond of folding, 100 times longer than any previous protein folding simulation. Quite an impressive feat and a great motivation for mastering parallel programming techniques!
The Platform:
The improved molecular dynamics simulation code developed by Kollman’s group was written and tested on the Cray T3D at the Pittsburgh Supercomputing Center (PSC) in 1997, and was subsequently run on a (4x faster) Cray T3E. The 256-processor machines used by the group in 1997 of course don’t appear in the latest TOP500 list (published in June 2002) as they have been surpassed by larger Cray T3E systems, the best of which being the T3E1200 at Cray Research, which comes in at #22 on the latest list.
The Cray T3D system used by Kollman’s group was the first in a series of massively parallel processing (MPP) systems from CRAY Research but did not appear on the TOP500 lists from 1996 or 1997. The 256-processor T3E used by the group for the majority of the simulation did appear at #13 on the TOP500 list from November 1996.
|
Rank |
Manufacturer |
Computer |
Rmax |
Installation Site |
Country |
Year |
Area of Installation |
# Proc |
Rpeak |
Nmax |
N1/2 |
|
|
1 |
Hitachi/Tsukuba |
CP-PACS/2048 |
368200 |
Japan |
1996 |
Academic |
2048 |
614000 |
103680 |
30720 |
|
|
|
2 |
Fujitsu |
Numerical
Wind Tunnel |
229700 |
Japan |
1996 |
Research
Aerospace |
167 |
281000 |
66132 |
18018 |
|
|
|
3 |
Hitachi |
SR2201/1024 |
220400 |
University of Tokyo
Tokyo |
Japan |
1996 |
Academic |
1024 |
307000 |
138240 |
34560 |
|
|
4 |
Intel |
XP/S140 |
143400 |
Sandia National Labs Albuquerque |
USA |
1993 |
Research |
3680 |
184000 |
55700 |
20500 |
|
|
5 |
Intel |
XP/S-MP
150 |
127100 |
Oak Ridge National Laboratory Oak Ridge |
USA |
1995 |
Research |
3072 |
154000 |
86000 |
17800 |
|
|
13 |
Cray |
T3E
LC256-128 |
93200 |
Pittsburgh
Supercomputer Center Pittsburgh |
USA |
1996 |
Research |
256 |
154000 |
53664 |
11040 |
|
Table 1. The peak LINPACK performance of the T3E LC256-128 at the Pittsburgh Supercomputing Center was 93,200 (Gflops) and the overall rank was #13 on the TOP500 list from November 1996.
Kollman’s group was able to realize performance 170 times faster on the 256-processor machine than the single-processor performance, equivalent to an efficiency of 66%. This was a marked departure from previous molecular dynamics applications, which tended to level off at 40, 50, or 60 processors.
The Techniques:
The largest challenge to parallel scalability for this application was the enormous communication cost driven by the fact that the original algorithm required all processors to have information about all atoms in the system in order to calculate the full set of forces on each atom. In earlier versions of AMBER, this requirement drove up communication costs and latency as this bulk information was passed between processors which lay idle waiting for these huge information transfers to complete.
Duan broke the problem into smaller pieces, making each processor solely responsible for a different rectangular block. Instead of each processor maintaining the full position and force information for all atoms in the system, the acts of distributing updated positions and of gathering force information were broken into isolated tasks that could be performed by each processor on an as-needed basis. This presumably meant the processors could swap coordinate/force information on a pair-wise basis, incrementally updating the resultant force calculations for their own atoms while simultaneously waiting for information transfers from different neighboring blocks (this is my speculation; I had a hard time finding detailed papers on Duan’s precise algorithms).
References:
http://www.cray.com/products/applications/amber.pdf
http://www.psc.edu/science/Kollman98/kollman98.html
http://www.psc.edu/machines/cray/t3d/t3d.html