David Beck
Case Study Worksheet
Project Information - Molecular Dynameomics
| Document Prepared By | David Beck |
|---|---|
| Project Title | Molecular Dynameomics |
| Principal Investigator | Valerie Daggett |
| Participating Organizations | University of Washington |
| Science Category | Climate Environmental Science Biological Sciences |
| Funding Agencies | DOE SC DOE NSA NSF NOAA NIH Other: Technical Computing Initiative of Microsoft |
Project Summary (Scientific Objectives)
Please give a brief description of your project and its scientific objectives for the next 3-5 years.
The Protein Data Bank (PDB) has been a tremendously useful repository of experimentally derived, static protein structures that have stimulated many important scientific discoveries. The utility of static physical representations of proteins is not in doubt, but because these molecules are fluid in vivo, there is a larger universe of knowledge to be tapped regarding protein dynamics. We therefore propose to construct a complementary database composed of molecular dynamics (MD) structures for representatives of all protein folds, including their unfolding pathways. We are calling this effort 'dynameomics.' Our goal is to simulate by MD the native (biologically active) state and the complete unfolding pathway of a representative from every protein fold. There are approximately 1641 known non-redundant folds, of which we have simulated ~800 for a combined total of ~100 microseconds of simulation time. The protocols employed in these simulations have been developed over the last 15 years in our lab. With continued access to DOE resources, we will be able to simulate all 1641 targets.
With the data resulting from the MD simulations, we will identify patterns and general features of transition, intermediate, and denatured states to improve structure prediction algorithms. Structure prediction remains one of the elusive goals of protein science. Successfully predicting the native states of proteins is necessary to translate the current deluge of genomic information into a form appropriate for better functional identification of proteins and for drug design. This is a data mining endeavor to identify similarities and differences between native and unfolded states across all secondary and tertiary structure types and sequences. This aim represents our immediate scientific goal for the data resulting from the dynameomics project; however, as with the PDB after its conception, much more will certainly come of it, including areas of inquiry by outside users that we cannot anticipate.
Current HPC Usage and Methods
| Facilities Used | NCCS | ACLF | NSF Centers | | |
|---|---|---|---|---|---|
| Architectures Used | BlueGene | | | | |
| Total Computational Hours Used per Year | ~8 million Core-Hours | NERSC Hours Used per Year | ~4 million Core-Hours | ||
| Number of Cores Used in Typical Production Run | 1152 | Wallclock Hours of Single Typical Production Run | 288 | ||
| Total Memory Used per Run | 768 GB | Minimum Memory Required per Core | <1 GB | ||
| Total Data Read & Written per Run | 1,536 GB | Size of Checkpoint File(s) | 2 GB | ||
| Amount of Data Moved In/Out of NERSC | 5120 GB | How Often | 30 days | ||
| On-Line File Storage Required (Directly Accessible from a Running Job) | 6 GB | 8488 Files | |||
| Off-Line Archival Storage Required | GB | Files | |||
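The figures in the table above can be cross-checked with a little arithmetic. The sketch below is only a rough consistency check: it assumes, for illustration, that every core-hour in the yearly total is spent on runs of the "typical" size listed, which the worksheet does not actually state. The variable names are ours.

```python
# Figures copied from the worksheet table above.
cores_per_run = 1152
wallclock_hours_per_run = 288
total_core_hours_per_year = 8_000_000  # "~8 million", an approximate figure
total_memory_gb_per_run = 768

# Cost of one typical production run in core-hours.
core_hours_per_run = cores_per_run * wallclock_hours_per_run

# Assumption: all yearly hours go to runs of this typical size.
runs_per_year = total_core_hours_per_year / core_hours_per_run

# Average memory per core implied by the per-run total.
memory_per_core_gb = total_memory_gb_per_run / cores_per_run

print(f"core-hours per run: {core_hours_per_run:,}")
print(f"approximate runs per year: {runs_per_year:.1f}")
print(f"average memory per core: {memory_per_core_gb:.2f} GB")
```

Under that assumption, the yearly total corresponds to roughly two dozen typical production runs, and the implied average memory per core is consistent with the stated minimum of less than 1 GB per core.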
Please list any required or important software, services, or infrastructure (beyond supercomputing and standard storage infrastructure) provided by HPC centers or system vendors.
GNU Scientific Library, rsync, gnuplot, Perl
Please list your current primary codes and their main mathematical methods and/or algorithms. Include quantities that characterize the size or scale of your simulations or numerical experiments; e.g., size of grid, number of particles, basis sets, etc. Also indicate how parallelism is expressed (e.g., MPI, OpenMP, MPI/OpenMP hybrid).
in lucem Molecular Mechanics (ilmm) is a scalable parallel molecular mechanics kernel. Its optimized force field evaluation is designed specifically for clusters of multi-processor (i.e., SMP) nodes. The forces, as a function of the coordinates, are used in the time-dependent treatment of the classical equations of motion for a condensed-phase molecular system (i.e., proteins in water). ilmm uses explicit-solvent, all-atom treatments of its systems to model their time-dependent dynamics realistically.
Numerical integration of the classical equations of motion for molecular systems (Allen & Tildesley, Computer Simulation of Liquids, Oxford University Press, 1989).
A modified version of the Brooks-Beeman integration algorithm is employed for our molecular dynamics simulations (Beck & Daggett, Methods for molecular dynamics simulations of protein folding/unfolding in solution, Methods, 34, 112-120, 2004).
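For readers unfamiliar with this family of integrators, the following is a minimal sketch of the standard textbook Beeman scheme applied to a one-dimensional harmonic oscillator. It is not the modified Brooks-Beeman algorithm used in ilmm, whose details are given in the cited reference. The function names, the test system, and the time step are illustrative assumptions.

```python
import math


def beeman_trajectory(x0, v0, accel, dt, n_steps):
    """Integrate 1-D motion with the textbook Beeman scheme.

    accel is a function a(x) returning the acceleration at position x.
    """
    x, v = x0, v0
    a = accel(x)
    a_prev = a  # bootstrap: no earlier step exists, so reuse a(t0)
    xs, vs = [x], [v]
    for _ in range(n_steps):
        # Position update uses the current and previous accelerations.
        x_new = x + v * dt + (4.0 * a - a_prev) * dt * dt / 6.0
        a_new = accel(x_new)
        # Velocity update uses the new, current, and previous accelerations.
        v_new = v + (2.0 * a_new + 5.0 * a - a_prev) * dt / 6.0
        x, v, a_prev, a = x_new, v_new, a, a_new
        xs.append(x)
        vs.append(v)
    return xs, vs


def harmonic_accel(x, k=1.0, m=1.0):
    """Acceleration of a unit harmonic oscillator (illustrative test system)."""
    return -k * x / m


def energy(x, v, k=1.0, m=1.0):
    """Total energy (kinetic plus potential) of the oscillator."""
    return 0.5 * m * v * v + 0.5 * k * x * x


xs, vs = beeman_trajectory(x0=1.0, v0=0.0, accel=harmonic_accel,
                           dt=0.01, n_steps=10_000)
e0 = energy(xs[0], vs[0])
max_drift = max(abs(energy(x, v) - e0) for x, v in zip(xs, vs))
print(f"max energy drift over trajectory: {max_drift:.2e}")
```

The maximum energy drift is a quick sanity check on an integrator like this: for a conservative system it should stay small and bounded over a long trajectory. In a production code, the acceleration would come from the full force field evaluation rather than a closed-form expression.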
Please list the known limitations, obstacles, and bottlenecks of currently available HPC systems and resources, in particular those at NERSC.
Disk space
Memory bandwidth
HPC Usage and Methods for the Next 3-5 Years
Anticipated changes to codes, mathematical methods and/or algorithms needed to achieve this project's scientific objectives.
Our simulation codes are well positioned for the multi- to many-core transition and the return of SIMD.
The next challenge will be analyzing the ensemble of 100s of TB of data as a whole. This is now a problem not only of HPC simulation but also of HPC analytics.
| Computational Hours Required per Year | 4 to 6 million | |
|---|---|---|
| Anticipated Number of Cores to be Used in a Typical Production Run | 1000s | |
| Anticipated Wallclock Hours to be Used in a Typical Production Run Using the Number of Cores Given Above | 300 | |
| Anticipated Total Memory Used per Run | 4000 GB | |
| Anticipated Minimum Memory Required per Core | 4 GB | |
| Anticipated total data read & written per run | 12000 GB | |
| Anticipated size of checkpoint file(s) | 500 GB | |
| Anticipated On-Line File Storage Required (Directly Accessible from a Running Job) | 10s to 100s GB | 100000s Files |
| Anticipated Off-Line Archival Storage Required | GB | Files |
Known or Anticipated architectural requirements (e.g., 2 GB memory/core).
More cores per node.
Please list any additional required or important software, services, or infrastructure beyond those listed in the previous section.
Virtualization to support our GrayWulf analytics techniques. Database instances for GrayWulf and tools like Dryad. Possibly MapReduce implementations.
It is believed that the dominant HPC architecture in the next 3-5 years will incorporate processing elements composed of 10s-1,000s of individual cores. It is unlikely that a programming model based solely on MPI will be effective, or even supported, on these machines. Do you have a strategy for computing in such an environment? If so, please briefly describe it.
Threads. OpenMP. Our code was written for this model of computation.
What Do You Need from NERSC?
Please tell us what you need from NERSC to meet your project's computing needs over the next 3-5 years. Also please feel free to make any general comments.
The Dynameomics project continues to have a significant HPC simulation component. However, we are facing new HPC challenges in our analytical computing when we consider the 100 TB+ data set as a whole. The HPC simulation community is facing the problem of how to manage and analyze massive ensembles of data. The computational requirements of analysis are rapidly becoming as significant as simulation. We believe that relational database solutions are an ideal way to leverage existing technologies for delivering precise streams of data to a large number of analytical consumers. We would like to see NERSC embrace analysis as a part of supercomputing.
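As a toy illustration of the idea of a relational database delivering a precise stream of data to an analysis consumer, the sketch below stores a per-frame property for two simulations and retrieves only the slice an analysis needs. Everything here is an assumption made for illustration: the table and column names, the values, and the use of an in-memory SQLite database as a self-contained stand-in for the server-class database instances discussed above.

```python
import sqlite3

# In-memory SQLite database: a stand-in, not the project's actual platform.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per simulation frame and computed property.
conn.execute(
    "CREATE TABLE frame_property ("
    " simulation_id INTEGER, time_ps REAL, property_value REAL)"
)

# Made-up values, purely for illustration.
rows = [
    (1, 0.0, 1.0), (1, 1.0, 1.4), (1, 2.0, 1.8),
    (2, 0.0, 0.9), (2, 1.0, 1.1), (2, 2.0, 1.0),
]
conn.executemany("INSERT INTO frame_property VALUES (?, ?, ?)", rows)

# The consumer asks only for what it needs: the mean property value per
# simulation after a chosen start time, not the raw trajectory files.
query = (
    "SELECT simulation_id, AVG(property_value) "
    "FROM frame_property WHERE time_ps >= ? "
    "GROUP BY simulation_id ORDER BY simulation_id"
)
averages = conn.execute(query, (1.0,)).fetchall()
for simulation_id, mean_value in averages:
    print(simulation_id, round(mean_value, 3))
conn.close()
```

The point of the design is that selection and aggregation happen next to the data, so each analytical consumer receives a small, well-defined result set instead of scanning the full ensemble.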


