BES Requirements Worksheet

1.1. Project Information - Quantum Monte Carlo for the Electronic Structure of Molecules

Document Prepared By

Brian Austin

Project Title

Quantum Monte Carlo for the Electronic Structure of Molecules

Principal Investigator

W. A. Lester, Jr.

Participating Organizations

 

Funding Agencies

 DOE SC  DOE NSA  NSF  NOAA  NIH  Other:

2. Project Summary & Scientific Objectives for the Next 5 Years

Please give a brief description of your project - highlighting its computational aspect - and outline its scientific objectives for the next 3-5 years. Please list one or two specific goals you hope to reach in 5 years.

This project uses quantum Monte Carlo (QMC) to compute properties of chemical systems that are of major importance to basic energy sciences. The use of QMC methods is warranted when less demanding methods such as density functional theory lack the precision required to answer the question at hand, or when discrepancies among other ab initio theories leave the issue unresolved. Unlike other quantum chemical methods, whose accuracy depends strongly on the single-particle basis set and the extent of the many-particle representation of the wave function, diffusion Monte Carlo (DMC) has only a weak and indirect dependence on its trial wave function. This allows QMC to resolve questions that cannot be adequately addressed otherwise. Projects currently pursued by this group include calculation of the potential energy curve for the interaction between lithium atoms and graphene sheets, computation of the O-H bond dissociation energy in phenol, and determination of the S0-S1 excitation energy of a retinal protonated Schiff base. 
 
As more computing resources become available, the systems that we study will increase in both number and size. First, the O(N) algorithms for wave function evaluation that have been developed in this group will leverage the expanded resources so that, within the next five years, we will be able to examine systems in the 250-350 atom range. Second, our QMC calculations will be extended to include both electronic and nuclear motion, so that thermochemical properties can be computed directly at the QMC level without requiring the Born-Oppenheimer approximation or the use of lower-level theories to determine molecular geometries or vibrational modes. In addition, we plan to increase the value of our QMC calculations by developing new analysis tools, such as the electron-pair localization function, to analyze our Monte Carlo 'trajectories' and improve the understanding of exotic binding motifs and electron correlation.  

3. Current HPC Usage and Methods

3a. Please list your current primary codes and their main mathematical methods and/or algorithms. Include quantities that characterize the size or scale of your simulations or numerical experiments; e.g., size of grid, number of particles, basis sets, etc. Also indicate how parallelism is expressed (e.g., MPI, OpenMP, MPI/OpenMP hybrid)

The primary code used by this project is Zori, which was written by this group.  
Zori performs a sophisticated Monte Carlo integration with importance sampling to solve the full many-body Schrödinger equation in imaginary time. The time-dependent Schrödinger equation for N electrons is a 3N-dimensional partial differential equation, and its solutions show that the coefficients of excited states decay exponentially in imaginary time. The isomorphism between the imaginary-time Schrödinger equation and a diffusion equation allows the imaginary-time evolution of the wave function to be simulated by a random walk, so that after sufficient simulation time the random walkers sample only the ground state. An estimate of the eigenvalue can then be obtained by Monte Carlo integration.  
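In this picture the propagated wave function behaves as Psi(tau) ~ sum_i c_i exp[-(E_i - E_T) tau] Phi_i, so for a reference energy E_T near the ground state only the ground-state component survives at long imaginary time. A minimal sketch of this diffuse-and-branch scheme is shown below for a one-dimensional harmonic oscillator; it omits the importance sampling used in Zori, and the potential, time step, and population-control constant are illustrative choices.

import numpy as np

# Minimal diffusion Monte Carlo sketch for a 1D harmonic oscillator
# (illustrative only; Zori treats 3N-dimensional molecular systems
#  with importance sampling and a trial wave function).

def potential(x):
    return 0.5 * x**2               # harmonic potential; exact E0 = 0.5

rng = np.random.default_rng(0)
n_target = 2000                     # target walker population
tau = 0.01                          # imaginary-time step
walkers = rng.normal(size=n_target)
e_ref = 0.5                         # reference (trial) energy, adjusted each step

for step in range(2000):
    # diffusion: Gaussian displacement with variance tau (D = 1/2)
    walkers = walkers + rng.normal(scale=np.sqrt(tau), size=walkers.size)
    # branching: weight exp(-tau * (V - E_ref)) sets birth/death of walkers
    weights = np.exp(-tau * (potential(walkers) - e_ref))
    copies = (weights + rng.uniform(size=walkers.size)).astype(int)
    walkers = np.repeat(walkers, copies)
    # population control: nudge E_ref to keep the population near its target
    e_ref += 0.1 * np.log(n_target / max(walkers.size, 1))

# the stabilized reference energy (growth estimator) approaches E0 = 0.5
print("estimated E0 ~", e_ref)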
 
Parallelization of the QMC algorithm is trivial because each walker's movement is independent of the others. Each core propagates only the walkers distributed to it, so communication is needed only for averaging and occasional load balancing. The simplicity of this mode of parallelism allows excellent parallel performance using MPI, even for calculations involving tens of thousands of cores, and has allowed us to focus on serial efficiency to improve the performance of our code. Likewise, the transparency of this high-level parallelism will provide a great deal of flexibility when modifying our code to take advantage of the low-level parallelism provided by GPUs.  
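As a sketch of this walker-level parallelism (not the actual Zori interfaces; propagate_walkers and local_energy below are hypothetical placeholders), an MPI layout using mpi4py might look like the following.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(rank)              # independent random stream per rank
walkers = rng.normal(size=(100, 3 * 10))       # e.g. 100 walkers, 10 electrons each

def propagate_walkers(w):                      # placeholder for drift/diffusion/branching
    return w + rng.normal(scale=0.05, size=w.shape)

def local_energy(w):                           # placeholder for E_L = (H Psi_T) / Psi_T
    return 0.5 * np.sum(w**2, axis=1)

for block in range(10):
    for step in range(50):                     # walkers move independently; no communication
        walkers = propagate_walkers(walkers)
    # communication occurs only when block averages are formed ...
    e_block = np.mean(local_energy(walkers))
    e_global = comm.allreduce(e_block, op=MPI.SUM) / size   # unweighted average, for simplicity
    # ... and, occasionally, when walker counts are gathered for load balancing
    counts = comm.allgather(walkers.shape[0])
    if rank == 0:
        print(f"block {block}: <E_L> = {e_global:.4f}, walkers/rank = {counts}")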
 
Nearly all of the CPU time used by our QMC calculations is spent evaluating the trial wave function and its derivatives. Our trial wave functions are the product of a Slater determinant (or a linear combination of determinants) and a three-body Jastrow correlation function of the Schmidt-Moskowitz / Boys-Handy (SMBH) form. The molecular orbitals (MOs) used to compute the Slater determinant are linear combinations of atom-centered basis functions. The SMBH function can be written as the trace over the product of three matrices, each composed of simple functions of the inter-particle distances. We use a combination of matrix compression and BLAS libraries to make the MO and SMBH evaluation routines linear scaling while maintaining high flop rates.  
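A schematic of this Slater-Jastrow structure is sketched below. The Gaussian basis, the simple distance function, and all parameters are illustrative stand-ins, not the actual SMBH parameterization or the compressed, BLAS-driven routines used in Zori.

import numpy as np

def basis_functions(r_elec, r_nuc, alphas):
    """Gaussian s-type basis functions chi_k(r_i), one per (nucleus, exponent) pair."""
    diffs = r_elec[:, None, :] - r_nuc[None, :, :]            # (n_elec, n_nuc, 3)
    r2 = np.sum(diffs**2, axis=-1)                            # (n_elec, n_nuc)
    return np.exp(-r2[:, :, None] * alphas).reshape(len(r_elec), -1)

def slater_determinant(r_elec, r_nuc, mo_coeff, alphas):
    chi = basis_functions(r_elec, r_nuc, alphas)              # (n_elec, n_basis)
    phi = chi @ mo_coeff                                      # MOs via a BLAS gemm: (n_elec, n_orb)
    return np.linalg.det(phi)                                 # n_orb == n_elec here

def jastrow_trace(r_elec, r_nuc, A):
    """Toy trace-form correlation factor, J = tr(X A X^T)."""
    d = np.linalg.norm(r_elec[:, None, :] - r_nuc[None, :, :], axis=-1)
    X = d / (1.0 + d)    # simple function of electron-nuclear distances
                         # (the full SMBH form also includes electron-electron terms)
    return np.trace(X @ A @ X.T)

# toy configuration: 4 electrons, 2 nuclei, 4 basis functions
rng = np.random.default_rng(1)
r_elec = rng.normal(size=(4, 3))
r_nuc = np.array([[0.0, 0.0, 0.0], [1.4, 0.0, 0.0]])
alphas = np.array([0.5, 1.5])
mo_coeff = rng.normal(size=(4, 4))
A = 0.1 * np.eye(2)

psi_T = slater_determinant(r_elec, r_nuc, mo_coeff, alphas) * np.exp(
    jastrow_trace(r_elec, r_nuc, A))
print("Psi_T =", psi_T)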
 
Parameters of the wave function are optimized by minimizing the energy of the wave function, or the variance or absolute deviation of its local energies. The preferred approach minimizes the energy using the recently developed linearized wave function approximation, which requires the diagonalization of a (potentially large) matrix of wave function derivatives. Alternatively, a variety of conjugate gradient methods provided by the GSL library can be used for the optimization. 
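The linearized step amounts to a generalized eigenvalue problem in the basis formed by the wave function and its parameter derivatives. The sketch below shows only that linear-algebra step, with random matrices standing in for the Monte Carlo estimates of the overlap and Hamiltonian blocks.

import numpy as np
from scipy.linalg import eig

n_params = 200
rng = np.random.default_rng(2)

# S_kl ~ <Psi_k Psi_l / Psi^2> and H_kl ~ <Psi_k H Psi_l / Psi^2>, estimated by MC;
# random stand-ins here, with the toy overlap kept positive definite
S = rng.normal(size=(n_params + 1, n_params + 1))
S = S @ S.T + np.eye(n_params + 1)
H = rng.normal(size=(n_params + 1, n_params + 1))   # nonsymmetric, as with finite MC sampling

# lowest eigenvector of the generalized problem H c = E S c defines the update
evals, evecs = eig(H, S)
idx = np.argmin(evals.real)
c = evecs[:, idx].real
delta_p = c[1:] / c[0]          # parameter changes, normalized to the Psi component
print("proposed parameter update norm:", np.linalg.norm(delta_p))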
 
The largest calculations that we have performed to date have involved 314 electrons, while calculations involving 50-75 electrons are routine. Simulations of larger systems are limited primarily by the availability of CPU cycles. Jobs in this size range use less than 200 MB of memory per core.  
 
Typical concurrencies are in the 2,000-4,000 core range, chosen to balance several competing needs. QMC calculations must use a large total number of walkers to minimize population bias. A small number of walkers per core allows each walker to propagate for many steps, ensuring equilibration and proper sampling, while a larger number of walkers per core minimizes load imbalance and amortizes communication costs. Jobs in this size range also benefit from low charge factors and reasonable queue turnaround times. 
 
 

3b. Please list known limitations, obstacles, and/or bottlenecks that currently limit your ability to perform simulations you would like to run. Is there anything specific to NERSC?

Historically, our QMC calculations have been entirely CPU bound; the memory, communication, and I/O requirements are quite modest. The number of systems that we study has been limited primarily by the overall size of our allocation.  
 
 

3c. Please fill out the following table to the best of your ability. This table provides baseline data to help extrapolate to requirements for future years. If you are uncertain about any item, please use your best estimate to use as a starting point for discussions.

Facilities Used or Using

 NERSC  OLCF  ALCF  NSF Centers  Other:  

Architectures Used

 Cray XT  IBM Power  BlueGene  Linux Cluster  Other:  

Total Computational Hours Used per Year

 4.5 million Core-Hours

NERSC Hours Used in 2009

4.5 million Core-Hours

Number of Cores Used in Typical Production Run

4,000

Wallclock Hours of Single Typical Production Run

2-5

Total Memory Used per Run

 200-400 GB

Minimum Memory Required per Core

 0.1 GB

Total Data Read & Written per Run

 20 GB

Size of Checkpoint File(s)

 1 GB

Amount of Data Moved In/Out of NERSC

 5 GB per month

On-Line File Storage Required (For I/O from a Running Job)

 0.1 GB and 1001 Files

Off-Line Archival Storage Required

0.1 GB and 100 Files

Please list any required or important software, services, or infrastructure (beyond supercomputing and standard storage infrastructure) provided by HPC centers or system vendors.

 

4. HPC Requirements in 5 Years

4a. We are formulating the requirements for NERSC that will enable you to meet the goals you outlined in Section 2 above. Please fill out the following table to the best of your ability. If you are uncertain about any item, please use your best estimate to use as a starting point for discussions at the workshop.

Computational Hours Required per Year

50,000,000

Anticipated Number of Cores to be Used in a Typical Production Run

32,000

Anticipated Wallclock to be Used in a Typical Production Run Using the Number of Cores Given Above

4

Anticipated Total Memory Used per Run

32,000 GB

Anticipated Minimum Memory Required per Core

 1 GB

Anticipated total data read & written per run

 500 GB

Anticipated size of checkpoint file(s)

 20 GB

Anticipated On-Line File Storage Required (For I/O from a Running Job)

2 GB and 100 Files

Anticipated Amount of Data Moved In/Out of NERSC

50 GB per month

Anticipated Off-Line Archival Storage Required

 1 GB and 500 Files

4b. What changes to codes, mathematical methods and/or algorithms do you anticipate will be needed to achieve this project's scientific objectives over the next 5 years?

The linear scaling algorithms that we have developed over the last 5 years have already increased the sizes of systems that we are able to study. Wave function optimization will be essential to pushing this limit higher. Recently developed methods for wave function optimization are more robust than earlier schemes and show great promise for optimizing a much larger number of parameters than before. However, the memory requirements for optimizing larger parameter sets are substantially greater than what we have encountered previously, and the optimization procedure now presents a serial bottleneck. Our implementation of these methods is quite new; the bottleneck will likely be remedied by the use of parallel eigensolvers or Davidson diagonalization methods. 
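For reference, a Davidson-type iteration needs only matrix-vector products and the diagonalization of a small projected matrix, which is what makes it attractive once the derivative matrix becomes too large to build and diagonalize directly. The sketch below is a generic textbook version for a symmetric matrix, not the scheme actually planned for Zori.

import numpy as np

def davidson_lowest(A, n_iter=30, tol=1e-8):
    """Find the lowest eigenpair of a symmetric matrix A by Davidson iteration."""
    n = A.shape[0]
    V = np.zeros((n, n_iter))                         # orthonormal subspace vectors
    V[:, 0] = np.random.default_rng(3).normal(size=n)
    V[:, 0] /= np.linalg.norm(V[:, 0])
    diag = np.diag(A)
    for m in range(1, n_iter):
        W = A @ V[:, :m]                              # only matrix-vector products are needed
        T = V[:, :m].T @ W                            # small projected matrix
        theta, s = np.linalg.eigh(T)
        u = V[:, :m] @ s[:, 0]                        # Ritz vector for the lowest Ritz value
        r = A @ u - theta[0] * u                      # residual
        if np.linalg.norm(r) < tol:
            return theta[0], u
        t = r / (theta[0] - diag + 1e-12)             # diagonal (Davidson) preconditioner
        t -= V[:, :m] @ (V[:, :m].T @ t)              # orthogonalize against the subspace
        V[:, m] = t / np.linalg.norm(t)
    return theta[0], u

# toy diagonally dominant matrix standing in for the derivative matrix
n = 500
rng = np.random.default_rng(4)
A = np.diag(np.arange(1.0, n + 1)) + 1e-3 * rng.normal(size=(n, n))
A = 0.5 * (A + A.T)
e0, _ = davidson_lowest(A)
print("lowest eigenvalue ~", e0)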
 
The ease of extending QMC methods to treat electronic and nuclear motion on an equal footing is a distinctive benefit of the QMC formalism. However, the efficiency of the method will hinge on the availability of accurate vibronic wave functions. Some exploration will be required to identify suitable forms for these wave functions, and new code will be required for their evaluation. 

4c. Please list any known or anticipated architectural requirements (e.g., 2 GB memory/core, interconnect latency < 3 μs).

Our linear scaling code for molecular orbital evaluation requires significantly more memory than simpler approaches. Memory requirements for jobs that are currently “large” are still modest (125 MB/core), but this could increase to 5 GB/core for the “large” jobs of the near future. The code changes needed to reduce this requirement to 1 GB or less, without changing the CPU scaling, are simple and can be made quickly when the need arises. 

4d. Please list any new software, services, or infrastructure support you will need over the next 5 years.

 

4e. It is believed that the dominant HPC architecture in the next 3-5 years will incorporate processing elements composed of 10s-1,000s of individual cores, perhaps GPUs or other accelerators. It is unlikely that a programming model based solely on MPI will be effective, or even supported, on these machines. Do you have a strategy for computing in such an environment? If so, please briefly describe it.

The transparency of the high-level parallelism inherent in QMC methods will provide a great deal of flexibility when modifying our code to take advantage of the low-level parallelism provided by GPUs.  
Several groups have reported significant (6x) speed-ups when their QMC codes were ported to GPUs, and we intend to port Zori to GPUs as well. We expect that the rate-limiting steps in the evaluation of the wave function (evaluation of basis functions, molecular orbitals, and correlation functions) will transfer well to GPUs.  
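The reason these kernels map well onto accelerators is that, batched over walkers, the molecular-orbital evaluation reduces to one large dense matrix multiply. A sketch of that batched structure is shown below with NumPy on the CPU; a drop-in GPU array library such as CuPy, which mirrors much of the NumPy interface, would run the same multiply on the device. The array sizes are illustrative.

import numpy as np      # swapping this for a GPU array module offloads the gemm

n_walkers, n_elec, n_basis, n_orb = 256, 50, 400, 50
rng = np.random.default_rng(5)

# chi[w, i, k]: basis function k evaluated at electron i of walker w
chi = rng.normal(size=(n_walkers, n_elec, n_basis))
mo_coeff = rng.normal(size=(n_basis, n_orb))

# one batched matrix multiply evaluates all MOs for all walkers at once
phi = (chi.reshape(-1, n_basis) @ mo_coeff).reshape(n_walkers, n_elec, n_orb)
print(phi.shape)        # (256, 50, 50)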
 

5. New Science With New Resources

To help us get a better understanding of the quantitative requirements we've asked for above, please tell us: What significant scientific progress could you achieve over the next 5 years with access to 50X the HPC resources you currently have access to at NERSC? What would be the benefits to your research field if you were given access to these kinds of resources?

Please explain what aspects of "expanded HPC resources" are important for your project (e.g., more CPU hours, more memory, more storage, more throughput for small jobs, ability to handle very large jobs).

A 50-fold increase in HPC resources would allow the simulation of systems with as many as 800-1600 electrons. If effective core potentials are used, this is sufficient to model protein reaction centers or small nanoparticles using a fully correlated electronic structure theory. Larger systems could be studied if the environment surrounding the QMC simulation is included via molecular mechanics (MM); the theory and code needed for such QMC/MM calculations are in a late stage of development within this group. Advances in this direction will require roughly 50x increases in CPU hours and storage. With minor code changes, memory requirements can be kept below 1-2 GB per core. Queuing and charge factor policies that encourage the highest possible concurrencies would also facilitate this research. 
 
Alternatively, increased HPC resources could be used to develop more accurate treatments of small molecules. Methods for optimizing molecular geometries and computing vibrational frequencies are available for many ab initio methods, but are missing from the QMC toolkit. With a larger CPU allocation we could begin to examine molecular geometries at the QMC level. In particular, QMC can be used without the Born-Oppenheimer approximation to sample the complete electron-nuclear wave function and compute high accuracy thermochemical data. Calculations of this type would require similar increases in CPU hours and storage, but minimal increases in memory.