

BES Requirements Worksheet

1. Project Information - Reaction Dynamics in Complex Molecular Systems

Document Prepared By

Thomas Miller

Project Title

Reaction Dynamics in Complex Molecular Systems

Principal Investigator

Thomas Miller

Participating Organizations

California Institute of Technology

Funding Agencies

 DOE SC  DOE NSA  NSF  NOAA  NIH  Other:

2. Project Summary & Scientific Objectives for the Next 5 Years

Please give a brief description of your project - highlighting its computational aspect - and outline its scientific objectives for the next 3-5 years. Please list one or two specific goals you hope to reach in 5 years.

In this project, we employ path integral methods, rare-event sampling methods, and classical molecular dynamics to simulate reactive processes in complex molecular systems. Areas of primary focus include (i) proton-coupled electron transfer (PCET) dynamics in enzymes and photo-catalysts and (ii) long-timescale dynamics in protein-conducting transmembrane channels. The simulation methods are used to investigate the kinetics and regulation of these processes. This work provides improved understanding of the fundamental mechanisms that govern biomolecular energy conversion and protein transport. 
 
From the computational perspective, these applications are unified by the challenge of bridging dynamical lengthscales and timescales in molecular simulations. PCET dynamics feature the coupling of electron transfer, proton transfer, and solvent fluctuation dynamics, whereas protein translocation involves the coupling of channel gating motions with the transport of proteins across the membrane. The PCET studies face the additional challenge of simulating coupled quantum and classical dynamics. Path integral methods and rare-event sampling methods offer a computationally scalable approach to mitigating these challenges. 
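To make the path-integral component of these methods concrete, the toy sketch below shows the ring-polymer representation that underlies PIMD for a single particle in a one-dimensional model potential. It is purely illustrative (arbitrary parameters, no thermostat, units with hbar = kB = mass = 1) and is not the modified DL_POLY implementation used in this project, where the analogous force evaluations are distributed over MPI for much larger systems.

    import numpy as np

    # Minimal ring-polymer (imaginary-time path-integral) sketch for one
    # particle in 1D.  Every number below is an illustrative placeholder,
    # not a value taken from the project.
    n_beads = 16                    # imaginary-time slices (beads)
    beta = 8.0                      # inverse temperature
    dt = 0.01                       # time step
    beta_n = beta / n_beads
    omega_n = 1.0 / beta_n          # intra-ring-polymer spring frequency

    def external_force(x):
        """Force from a model quartic potential V(x) = 0.25 * x**4."""
        return -x**3

    def spring_force(x):
        """Harmonic forces coupling neighboring beads around the ring."""
        return -omega_n**2 * (2.0 * x - np.roll(x, 1) - np.roll(x, -1))

    rng = np.random.default_rng(0)
    x = rng.normal(scale=0.1, size=n_beads)                    # bead positions
    p = rng.normal(scale=np.sqrt(1.0 / beta_n), size=n_beads)  # bead momenta

    # Velocity-Verlet propagation of the ring polymer (no thermostat shown).
    for step in range(1000):
        p += 0.5 * dt * (external_force(x) + spring_force(x))
        x += dt * p
        p += 0.5 * dt * (external_force(x) + spring_force(x))

    print("bead-averaged (centroid) position:", x.mean())

The harmonic bead-bead springs are what encode nuclear quantum statistics, which is why path-integral methods can address the coupled quantum and classical dynamics mentioned above using classical-like trajectories.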
 
Specific Goal #1: Direct simulations of PCET dynamics in a symmetric, mixed-valence iron bi-imidazoline system.  
 
Iron bi-imidazolines have been extensively studied as a prototype for PCET dynamics and as a model for the tyrosine reduction step in photosystem II. By combining path-integral molecular dynamics with the transition path sampling method for studying chemical reactions, we will perform the first direct simulations of these reactions, determine the chronology of the ET/PT events, and explore the origin of the unexpectedly low experimental kinetic isotope effect in the PCET reaction rate. Using a modified version of the DL_POLY molecular simulation package, we will sample ~5,000 PIMD trajectories at an estimated cost of 1M CPU hours. 
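As an indication of how transition path sampling operates, the sketch below implements a simplified one-way shooting variant for overdamped Langevin dynamics on a one-dimensional double-well potential. It is a toy example only: the potential, basin definitions, and parameters are arbitrary, and the production calculations instead couple path sampling to full PIMD trajectories in the modified DL_POLY code.

    import numpy as np

    # Toy one-way ("forward") shooting move from transition path sampling,
    # applied to overdamped Langevin dynamics on the double well
    # V(x) = (x**2 - 1)**2.  All parameters are arbitrary illustrations.
    rng = np.random.default_rng(1)
    beta = 3.0          # inverse temperature
    dt = 1e-3           # time step
    n_steps = 4000      # path length in steps

    def force(x):
        return -4.0 * x * (x**2 - 1.0)

    def in_A(x):
        return x < -0.8             # reactant basin

    def in_B(x):
        return x > 0.8              # product basin

    def propagate(x0, n):
        """Euler-Maruyama trajectory of n steps starting from x0."""
        path = np.empty(n + 1)
        path[0] = x0
        noise = rng.normal(size=n) * np.sqrt(2.0 * dt / beta)
        for i in range(n):
            path[i + 1] = path[i] + dt * force(path[i]) + noise[i]
        return path

    # Generate an initial reactive (A -> B) path by brute force; this is only
    # feasible because the toy barrier is low.
    path = propagate(-1.0, n_steps)
    while not (in_A(path[0]) and in_B(path[-1])):
        path = propagate(-1.0, n_steps)

    # Shooting moves: pick a slice, regrow the remainder of the path with
    # fresh noise, and accept only if the path still connects A to B.
    n_accept = 0
    for move in range(200):
        k = rng.integers(1, n_steps)                  # shooting point
        trial = np.concatenate([path[:k], propagate(path[k], n_steps - k)])
        if in_A(trial[0]) and in_B(trial[-1]):
            path, n_accept = trial, n_accept + 1

    print(f"accepted {n_accept} of 200 shooting moves")

Each accepted move replaces part of an existing reactive trajectory with a freshly generated segment, so an ensemble of unbiased reactive trajectories can be harvested without modifying the underlying dynamics.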
 
Specific Goal #2: Regulation of protein translocation and membrane integration via the Sec translocon.  
 
A critical step in the biosynthesis of many proteins involves either translocation across a cellular membrane or integration into a cellular membrane. Both processes proceed via the Sec translocon, a ubiquitous and highly conserved transmembrane channel. We will test the hypothesis that Sec-facilitated protein translocation and membrane integration are regulated by a mechanism in which the translocon acts as a substrate-controlled conformational switch between pathways for membrane integration and secretion. Using the string method in collective variables, we will characterize the energetics, mechanism, and dynamics of protein translocation and membrane integration. To perform these calculations, a modified version of the NAMD package will be used to simulate 40 independent replicas of the system along the transition path, each for approximately 40 ns. The estimated cost is 2M CPU hours.  
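For illustration, the sketch below applies a zero-temperature version of the string method to a two-dimensional model potential: a chain of images connecting two basins is relaxed downhill and repeatedly reparametrized to equal arc length. In the actual calculations the images are the 40 independent NAMD replicas evolving under mean forces in protein/translocon collective variables; the potential, step size, and iteration count below are arbitrary illustrative choices.

    import numpy as np

    # Zero-temperature string method on the 2D model potential
    # V(x, y) = (x**2 - 1)**2 + 2 * y**2, whose two minima stand in for the
    # end states of the translocation/integration pathway.
    def grad(p):
        """Gradient of the model potential at the image positions p."""
        x, y = p[..., 0], p[..., 1]
        return np.stack([4.0 * x * (x**2 - 1.0), 4.0 * y], axis=-1)

    n_images = 40       # echoes the 40 replicas planned for the NAMD runs
    step = 0.01

    # Initial string: a bent line between the minima at (-1, 0) and (+1, 0).
    t = np.linspace(0.0, 1.0, n_images)
    string = np.stack([2.0 * t - 1.0, 0.5 * np.sin(np.pi * t)], axis=1)

    for iteration in range(500):
        # 1) Evolve each image downhill (in practice: restrained-MD mean forces).
        string -= step * grad(string)
        # 2) Reparametrize so the images stay equally spaced in arc length.
        seg = np.linalg.norm(np.diff(string, axis=0), axis=1)
        arc = np.concatenate([[0.0], np.cumsum(seg)])
        arc /= arc[-1]
        target = np.linspace(0.0, 1.0, n_images)
        string = np.stack(
            [np.interp(target, arc, string[:, d]) for d in range(2)], axis=1
        )

    print("first three images of the converged string:\n", string[:3])

In the production calculations, step (1) is replaced by restrained NAMD simulations of each replica and the collective-variable space has many more dimensions, but the evolve-and-reparametrize structure is the same.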

3. Current HPC Usage and Methods

3a. Please list your current primary codes and their main mathematical methods and/or algorithms. Include quantities that characterize the size or scale of your simulations or numerical experiments; e.g., size of grid, number of particles, basis sets, etc. Also indicate how parallelism is expressed (e.g., MPI, OpenMP, MPI/OpenMP hybrid)

We typically perform large numbers (10^3 to 10^5) of nearly-independent MD or PIMD trajectories using molecular mechanics force fields and electron pseudopotentials. System sizes range from 10,000 to 150,000 atoms. 
 
Calculations are performed using versions of DL_POLY, GROMACS, and NAMD that we have modified to implement the path-integral and rare-event sampling techniques. 
 
Heavy communication between processors is needed for the force-evaluation step of the MD and PIMD trajectories. Small, but non-zero, communication is typically needed between the otherwise independent trajectories. Parallelization is achieved primarily using MPI. 
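The two levels of parallelism described above (tight coupling within a trajectory, loose coupling between trajectories) can be pictured with the schematic mpi4py sketch below. The group size, function names, and reporting pattern are hypothetical placeholders; the production parallelization lives inside the modified DL_POLY, GROMACS, and NAMD codes.

    from mpi4py import MPI

    # Schematic of the two-level parallelism: the world communicator is split
    # into groups, each group handles the heavily communicating force
    # evaluation for one trajectory, and the groups exchange only small
    # amounts of data.
    world = MPI.COMM_WORLD
    cores_per_traj = 64                              # hypothetical group size
    traj_id = world.Get_rank() // cores_per_traj
    traj_comm = world.Split(color=traj_id, key=world.Get_rank())

    def run_md_segment(comm, n_steps):
        """Placeholder for the communication-heavy MD/PIMD force loop."""
        # In the real codes this is where domain-decomposed force evaluation
        # and constraint communication occur within `comm`.
        return {"traj": traj_id, "steps": n_steps}

    result = run_md_segment(traj_comm, n_steps=1000)

    # Infrequent, lightweight communication between trajectories: one rank
    # per trajectory reports a small summary to a communicator of leaders.
    is_leader = traj_comm.Get_rank() == 0
    leaders = world.Split(color=0 if is_leader else MPI.UNDEFINED,
                          key=traj_id)
    if leaders != MPI.COMM_NULL:
        summaries = leaders.gather(result, root=0)
        if leaders.Get_rank() == 0:
            print(f"collected summaries from {len(summaries)} trajectories")

In practice this structure is provided by the simulation packages themselves; the sketch is only meant to show why inter-trajectory communication is a small fraction of the total.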

3b. Please list known limitations, obstacles, and/or bottlenecks that currently limit your ability to perform simulations you would like to run. Is there anything specific to NERSC?

1) Wait time in the queue is always a limiting factor. 
2) The speed of communication between processors limits the degree to which individual MD and PIMD trajectories can be parallelized. 

3c. Please fill out the following table to the best of your ability. This table provides baseline data to help extrapolate to requirements for future years. If you are uncertain about any item, please use your best estimate to use as a starting point for discussions.

Facilities Used or Using

 NERSC  OLCF  ALCF  NSF Centers  Other:  NCCS

Architectures Used

 Cray XT  IBM Power  BlueGene  Linux Cluster  Other:  

Total Computational Hours Used per Year

 2-4M Core-Hours

NERSC Hours Used in 2009

1.5M Core-Hours

Number of Cores Used in Typical Production Run

 128-5000

Wallclock Hours of Single Typical Production Run

 6-12

Total Memory Used per Run

 40-1024 GB

Minimum Memory Required per Core

 2 GB

Total Data Read & Written per Run

 4 GB

Size of Checkpoint File(s)

 2 GB

Amount of Data Moved In/Out of NERSC

 20 GB per  week

On-Line File Storage Required (For I/O from a Running Job)

 0.5 GB and  10000 Files

Off-Line Archival Storage Required

 10 GB and  100000 Files

Please list any required or important software, services, or infrastructure (beyond supercomputing and standard storage infrastructure) provided by HPC centers or system vendors.

NAMD, FFTW 

4. HPC Requirements in 5 Years

4a. We are formulating the requirements for NERSC that will enable you to meet the goals you outlined in Section 2 above. Please fill out the following table to the best of your ability. If you are uncertain about any item, please use your best estimate to use as a starting point for discussions at the workshop.

Computational Hours Required per Year

 30 M

Anticipated Number of Cores to be Used in a Typical Production Run

 5000-50000

Anticipated Wallclock to be Used in a Typical Production Run Using the Number of Cores Given Above

 6-12 hours

Anticipated Total Memory Used per Run

 400-10240 GB

Anticipated Minimum Memory Required per Core

 2 GB

Anticipated total data read & written per run

 40 GB

Anticipated size of checkpoint file(s)

 2 GB

Anticipated On-Line File Storage Required (For I/O from a Running Job)

 1 GB and  10000 Files

Anticipated Amount of Data Moved In/Out of NERSC

 200 GB per  week

Anticipated Off-Line Archival Storage Required

 50 GB and  100000 Files

4b. What changes to codes, mathematical methods and/or algorithms do you anticipate will be needed to achieve this project's scientific objectives over the next 5 years.

We anticipate increased use of electronic structure theory codes, such as plane-wave or DVR-basis DFT codes and Molpro.

4c. Please list any known or anticipated architectural requirements (e.g., 2 GB memory/core, interconnect latency < 3 µs).

2 GB of memory per core is generally fine for our purposes. Reduced interconnect latency would improve the strong scaling of the MD.

4d. Please list any new software, services, or infrastructure support you will need over the next 5 years.

We will considerably expand the number of nodes used in a given run. 

4e. It is believed that the dominant HPC architecture in the next 3-5 years will incorporate processing elements composed of 10s-1,000s of individual cores, perhaps GPUs or other accelerators. It is unlikely that a programming model based solely on MPI will be effective, or even supported, on these machines. Do you have a strategy for computing in such an environment? If so, please briefly describe it.

MD and electronic structure theory methods are proving amenable to architectures that incorporate GPUs. 

5. New Science With New Resources

To help us get a better understanding of the quantitative requirements we've asked for above, please tell us: What significant scientific progress could you achieve over the next 5 years with access to 50X the HPC resources you currently have access to at NERSC? What would be the benefits to your research field if you were given access to these kinds of resources?

Please explain what aspects of "expanded HPC resources" are important for your project (e.g., more CPU hours, more memory, more storage, more throughput for small jobs, ability to handle very large jobs).

The primary way in which our work would benefit from expanded HPC resources is through access to more CPU hours. 
 
A 50X increase in the number of CPU hours would benefit us in two ways. 
1) This increase in computational resources would open the door for us to use electronic structure methods (such as DFT or embedded DFT) instead of molecular mechanics force fields and electron pseudopotentials. This would have a major impact on the accuracy of the calculations performed and on the range of systems that can be studied, with significant societal benefit. 
 
2) This increase would allow us to look at larger systems, to construct free energy profiles in more dimensions, and to more fully converge challenging rate calculations. Many new systems will be brought within the realm of rigorous study.