Protein Folding: “Polishing” AMBER to Increase Parallel Scalability

 

By Sarah Moussa

CS 267, Fall 2002

 

The Problem:

 

Protein folding is perhaps the most compelling and challenging problem facing computational biochemists today.  In addition to being academically interesting, solving the protein-folding problem could also potentially save millions of development dollars for pharmaceutical companies and could make “designer” medicines a reality, as well as promoting our understanding of many cellular processes and diseases.

 

The essence of the problem is predicting the native conformation of a protein (the natural folded state which is a ridiculously complicated form) completely, given just the amino acid sequence of the protein and some information about its environment (for example the type of solution surrounding the protein).  Many scientists are approaching this problem from the bottom up—using first principles to model the electrical, chemical, and quantum interactions of the individual atoms that make up the protein and the environment.  One such researcher is Peter Kollman of UC San Francisco (as well as my research advisor Teresa Head-Gordon!).  In his massive 1998 simulation, he considered a small 36 amino-acid protein called “villin headpiece sub-domain”.  The simulation required 12,000 atoms in order to model the protein and the surrounding water.

 

Adding significantly to the complexity of such molecular dynamics simulations is the fact that the calculations of force, position, momentum, etc. must be carried out with a very small time step in order to avoid significant discretization and computational errors.  Typically, simulations are carried out with a time step in the femtosecond range (10-15 sec), which means the simulated quantities must be fully calculated a billion times to simulate only a single microsecond of real time!

 

Fortunately, techniques can be used to efficiently parallelize even the complex particle-particle computations in molecular dynamics simulations.  In Kollman’s group, a researcher by the name of Yong Duan applied many parallel programming techniques to optimize and improve the parallel scalability of the molecular dynamics piece of the AMBER software suite (the acronym stands for Assisted Model Building with Energy Refinement), a popular biomolecule modeling software package.  His changes caused a 70% performance increase in single-processor performance and also improved the load balancing and communication between processors, increasing the performance of the application on a 256-processor Cray T3E six-fold and allowing the group to simulate a full microsecond of folding, 100 times longer than any previous protein folding simulation.  Quite an impressive feat and a great motivation for mastering parallel programming techniques!

 

The Platform:

 

The improved molecular dynamics simulation code developed by Kollman’s group was written and tested on the Cray T3D at the Pittsburgh Supercomputing Center (PSC) in 1997, and was subsequently run on a (4x faster) Cray T3E.  The 256-processor machines used by the group in 1997 of course don’t appear in the latest TOP500 list (published in June 2002) as they have been surpassed by larger Cray T3E systems, the best of which being the T3E1200 at Cray Research, which comes in at #22 on the latest list.

 

The Cray T3D system used by Kollman’s group was the first in a series of massively parallel processing (MPP) systems from CRAY Research but did not appear on the TOP500 lists from 1996 or 1997.  The 256-processor T3E used by the group for the majority of the simulation did appear at #13 on the TOP500 list from November 1996.

 

 

Rank

Manufacturer

Computer

Rmax

Installation Site

Country

Year

Area of Installation

# Proc

Rpeak

Nmax

N1/2

1

Hitachi/Tsukuba

CP-PACS/2048

368200

Center for Computational Physics, Univ of Tsukuba Tsukuba

Japan

1996

Academic

2048

614000

103680

30720

 

2

Fujitsu

Numerical Wind Tunnel

229700

NAL

Japan

1996

Research Aerospace

167

281000

66132

18018

 

3

Hitachi

SR2201/1024

220400

University of Tokyo Tokyo

Japan

1996

Academic

1024

307000

138240

34560

 

4

Intel

XP/S140

143400

Sandia National Labs Albuquerque

USA

1993

Research

3680

184000

55700

20500

 

5

Intel

XP/S-MP 150

127100

Oak Ridge National Laboratory Oak Ridge

USA

1995

Research

3072

154000

86000

17800

 

13

Cray

T3E LC256-128

93200

Pittsburgh Supercomputer Center Pittsburgh

USA

1996

Research

256

154000

53664

11040

 

 

Table 1. The peak LINPACK performance of the T3E LC256-128 at the Pittsburgh Supercomputing Center was 93,200 (Gflops) and the overall rank was #13 on the TOP500 list from November 1996.

 

Kollman’s group was able to realize performance 170 times faster on the 256-processor machine than the single-processor performance, equivalent to an efficiency of 66%.  This was a marked departure from previous molecular dynamics applications, which tended to level off at 40, 50, or 60 processors.

 

The Techniques:

 

The largest challenge to parallel scalability for this application was the enormous communication cost driven by the fact that the original algorithm required all processors to have information about all atoms in the system in order to calculate the full set of forces on each atom.  In earlier versions of AMBER, this requirement drove up communication costs and latency as this bulk information was passed between processors which lay idle waiting for these huge information transfers to complete.

 

Duan broke the problem into smaller pieces, making each processor solely responsible for a different rectangular block.  Instead of each processor maintaining the full position and force information for all atoms in the system, the acts of distributing updated positions and of gathering force information were broken into isolated tasks that could be performed by each processor on an as-needed basis.  This presumably meant the processors could swap coordinate/force information on a pair-wise basis, incrementally updating the resultant force calculations for their own atoms while simultaneously waiting for information transfers from different neighboring blocks (this is my speculation; I had a hard time finding detailed papers on Duan’s precise algorithms).

 

References:

 

http://www.cray.com/products/applications/amber.pdf

 

http://www.psc.edu/science/Kollman98/kollman98.html

 

http://www.psc.edu/machines/cray/t3d/t3d.html