Brian Austin
BES Requirements Worksheet
1. Project Information
Document Prepared By 
Brian Austin 
Project Title 
Quantum Monte Carlo for the Electronic Structure of Molecules 
Principal Investigator 
W. A. Lester, Jr. 
Participating Organizations 

Funding Agencies 
DOE SC  DOE NNSA  NSF  NOAA  NIH  Other: 
2. Project Summary & Scientific Objectives for the Next 5 Years
Please give a brief description of your project, highlighting its computational aspect, and outline its scientific objectives for the next 3-5 years. Please list one or two specific goals you hope to reach in 5 years.
This project uses quantum Monte Carlo (QMC) to compute properties of chemical systems that are of major importance to basic energy sciences. The use of QMC methods is warranted when less demanding methods such as density functional theory lack the precision required to answer the question at hand, or when discrepancies among other ab initio theories leave the issue unresolved. Unlike other quantum chemical methods, which depend strongly on the single-particle basis set and the extent of the many-particle representation of the wavefunction, the accuracy of diffusion Monte Carlo (DMC) has only a weak and indirect dependence on its trial wavefunction. This allows QMC to resolve questions that cannot be adequately addressed otherwise. Projects currently pursued by this group include calculation of the potential energy curve for the interaction between lithium atoms and graphene sheets, computation of the O-H bond dissociation energy in phenol, and determination of the S0-S1 excitation energy of a retinal protonated Schiff base.
As more computing resources become available, the systems that we study will increase in both number and size. First, the O(N) algorithms for wave function evaluation that have been developed in this group will leverage the expanded resources so that, within the next five years, we will be able to examine systems in the 250-350 atom range. Second, our QMC calculations will be extended to include both electronic and nuclear motion so that thermochemical properties can be computed directly at the QMC level, without requiring the Born-Oppenheimer approximation or the use of lower-level theories to determine molecular geometries or vibrational modes. In addition, we plan to increase the value of our QMC calculations by developing new tools, such as the electron-pair localization function, to analyze our MC 'trajectories' and improve the understanding of exotic binding motifs and electron correlation.
3. Current HPC Usage and Methods
3a. Please list your current primary codes and their main mathematical methods and/or algorithms. Include quantities that characterize the size or scale of your simulations or numerical experiments; e.g., size of grid, number of particles, basis sets, etc. Also indicate how parallelism is expressed (e.g., MPI, OpenMP, MPI/OpenMP hybrid).
The primary code used by this project is Zori, which was written by this group.
Zori performs a sophisticated Monte Carlo integration with importance sampling to solve the full many-body Schrödinger equation (a PDE) in imaginary time. The time-dependent Schrödinger equation for N electrons is a 3N-dimensional partial differential equation. When the equation is propagated in imaginary time, the coefficients of excited states decay exponentially relative to that of the ground state. The isomorphism between the imaginary-time Schrödinger equation and the diffusion equation allows the evolution of the wave function to be simulated by a random walk, so that after sufficient simulation time, random walkers sample only the ground state. An estimate of the ground-state eigenvalue can then be obtained from Monte Carlo integration.
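The projection scheme described above can be illustrated with a toy example. The following sketch (our assumption for illustration only, not Zori's production algorithm, which adds importance sampling and a trial wavefunction) applies DMC to the 1D harmonic oscillator, whose exact ground-state energy is 0.5 hartree:

```python
import numpy as np

def dmc_energy(n_target=2000, n_steps=4000, dt=0.01, seed=1):
    """Toy DMC for V(x) = x^2/2; exact ground-state energy is 0.5."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n_target)            # initial walker ensemble
    e_ref = 0.5                              # running reference energy
    samples = []
    for step in range(n_steps):
        # Diffusion step: the random walk realizes the kinetic operator.
        x = x + np.sqrt(dt) * rng.normal(size=x.size)
        v = 0.5 * x * x                      # harmonic potential
        # Branching: replicate/kill walkers with weight exp(-dt (V - E_ref)).
        mult = (np.exp(-dt * (v - e_ref)) + rng.random(x.size)).astype(int)
        x = np.repeat(x, mult)
        # Population control: steer the walker count back toward n_target.
        e_ref = v.mean() + 0.1 * (1.0 - x.size / n_target) / dt
        if step > n_steps // 2:              # discard equilibration phase
            samples.append(e_ref)
    return float(np.mean(samples))

e0 = dmc_energy()                            # close to the exact value 0.5
```

The averaged reference energy converges to the ground-state eigenvalue up to a small time-step bias and statistical error, which is the essence of the eigenvalue estimate mentioned above.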
Parallelization of the QMC algorithm is trivial because each walker's movement is independent of the others. Each core propagates only the walkers distributed to it, so communication is needed only for averaging and occasional load balancing. The simplicity of this mode of parallelism allows excellent parallel performance using MPI, even for calculations involving tens of thousands of cores. This has allowed us to focus on serial efficiency to improve the performance of our code. Likewise, the transparency of this high-level parallelism will provide a great deal of flexibility when modifying our code to take advantage of the low-level parallelism provided by GPUs.
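The communication pattern is essentially a single reduction. A minimal serial sketch of the idea (with "ranks" simulated by a loop standing in for MPI processes; the propagation and energy here are placeholders, not Zori's):

```python
import numpy as np

def propagate_walkers(rank, n_walkers=500, n_steps=200):
    """Each 'rank' propagates its own walkers with no communication."""
    rng = np.random.default_rng(rank)        # independent stream per rank
    x = rng.normal(size=n_walkers)
    for _ in range(n_steps):
        x += 0.1 * rng.normal(size=n_walkers)
    local_e = 0.5 * x * x                    # placeholder local energies
    return local_e.sum(), n_walkers          # only two scalars to reduce

def global_estimate(n_ranks=8):
    # Stand-in for MPI_Allreduce: combine partial sums and walker counts.
    parts = [propagate_walkers(r) for r in range(n_ranks)]
    e_sum = sum(p[0] for p in parts)
    n_tot = sum(p[1] for p in parts)
    return e_sum / n_tot
```

Because each rank contributes only a partial sum and a count, the communication volume is independent of the number of walkers, which is why the scheme scales to tens of thousands of cores.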
Nearly all of the CPU time used by our QMC calculations is spent evaluating the trial wave function and its derivatives. Our trial wavefunctions are the product of a Slater determinant (or a linear combination of determinants) and a three-body Jastrow correlation function of the Schmidt-Moskowitz/Boys-Handy (SMBH) form. The molecular orbitals (MOs) used to compute the Slater determinant are linear combinations of atom-centered basis functions. The SMBH function can be written as the trace of the product of three matrices, each composed of simple functions of the interparticle distances. We use a combination of matrix compression and BLAS libraries to make the MO and SMBH evaluation routines linear scaling while maintaining high flop rates.
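The determinant part of such a wavefunction can be sketched in a few lines. The following is a schematic only (it assumes a hypothetical minimal basis of Gaussian s-functions; real basis sets and the Jastrow factor are far richer), showing how basis-function values are contracted with MO coefficients before the determinant is taken:

```python
import numpy as np

def slater_det(elec_xyz, atom_xyz, mo_coeff, alpha=1.0):
    """Schematic Slater determinant: electrons (N,3), centers (K,3),
    MO coefficients (K,N); hypothetical Gaussian s-type basis."""
    # r[i, k] = distance from electron i to basis-function center k
    r = np.linalg.norm(elec_xyz[:, None, :] - atom_xyz[None, :, :], axis=2)
    basis = np.exp(-alpha * r**2)            # B[i, k]: basis fn k at electron i
    mos = basis @ mo_coeff                   # Phi[i, m]: MO m at electron i
    return np.linalg.det(mos)                # antisymmetric in the electrons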
Parameters of the wavefunction are optimized by minimizing the energy of the wavefunction, or the variance or absolute deviation of its local energies. The preferred approach minimizes the energy using the recently developed linearized wavefunction approximation, which requires the diagonalization of a (potentially large) matrix of wavefunction derivatives. Alternatively, a variety of conjugate gradient methods provided by the GSL library can be used for the optimization.
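The core numerical step of the linearized approach is a generalized eigenvalue problem H c = E S c in the basis of the current wavefunction and its parameter derivatives. A minimal sketch of that step (assuming, for illustration, symmetric H and a well-conditioned overlap matrix S; production matrices are larger and noisier):

```python
import numpy as np

def lowest_generalized_eigpair(H, S):
    """Solve H c = E S c for the lowest eigenpair, with S symmetric
    positive definite, via Cholesky reduction to a standard problem."""
    L = np.linalg.cholesky(S)                # S = L L^T
    Linv = np.linalg.inv(L)
    A = Linv @ H @ Linv.T                    # standard symmetric problem
    w, Y = np.linalg.eigh(A)
    c = Linv.T @ Y[:, 0]                     # back-transform eigenvector
    return w[0], c
```

For large parameter sets the dense diagonalization above becomes the bottleneck, which is what motivates the iterative and parallel eigensolvers discussed in Section 4b.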
The largest calculations that we have performed to date have involved 314 electrons, while calculations involving 50-75 electrons are routine. Simulations of larger systems are limited primarily by the availability of CPU cycles. Jobs in this size range use less than 200 MB of memory per core.
Typical concurrencies are in the 2,000-4,000 core range, chosen to balance several needs. QMC calculations must use a large total number of walkers to minimize population bias. Keeping the number of walkers per core small allows each walker to propagate for many steps, ensuring equilibration and proper sampling; increasing the number of walkers per core, on the other hand, minimizes load imbalance and amortizes communication costs. Jobs in this size range also benefit from low charge factors and reasonable queue turnaround times.
3b. Please list known limitations, obstacles, and/or bottlenecks that currently limit your ability to perform simulations you would like to run. Is there anything specific to NERSC?
Historically, our QMC calculations have been entirely CPU bound; the memory, communication, and I/O requirements are quite modest. The number of systems that we study has been limited primarily by the overall size of the allocation.
3c. Please fill out the following table to the best of your ability. This table provides baseline data to help extrapolate to requirements for future years. If you are uncertain about any item, please use your best estimate as a starting point for discussions.
Facilities Used or Using 
NERSC OLCF ALCF NSF Centers Other: 
Architectures Used 
Cray XT IBM Power BlueGene Linux Cluster Other: 
Total Computational Hours Used per Year 
4.5 million Core-Hours 
NERSC Hours Used in 2009 
4.5 million Core-Hours 
Number of Cores Used in Typical Production Run 
4,000 
Wallclock Hours of Single Typical Production Run 
25 
Total Memory Used per Run 
200-400 GB 
Minimum Memory Required per Core 
0.1 GB 
Total Data Read & Written per Run 
20 GB 
Size of Checkpoint File(s) 
1 GB 
Amount of Data Moved In/Out of NERSC 
5 GB per month 
On-Line File Storage Required (For I/O from a Running Job) 
0.1 GB and 100 Files 
Off-Line Archival Storage Required 
0.1 GB and 100 Files 
Please list any required or important software, services, or infrastructure (beyond supercomputing and standard storage infrastructure) provided by HPC centers or system vendors.
4. HPC Requirements in 5 Years
4a. We are formulating the requirements for NERSC that will enable you to meet the goals you outlined in Section 2 above. Please fill out the following table to the best of your ability. If you are uncertain about any item, please use your best estimate as a starting point for discussions at the workshop.
Computational Hours Required per Year 
50,000,000 
Anticipated Number of Cores to be Used in a Typical Production Run 
32,000 
Anticipated Wallclock Hours of a Typical Production Run Using the Number of Cores Given Above 
4 
Anticipated Total Memory Used per Run 
32,000 GB 
Anticipated Minimum Memory Required per Core 
1 GB 
Anticipated total data read & written per run 
500 GB 
Anticipated size of checkpoint file(s) 
20 GB 
Anticipated On-Line File Storage Required (For I/O from a Running Job) 
2 GB and 100 Files 
Anticipated Amount of Data Moved In/Out of NERSC 
50 GB per month 
Anticipated Off-Line Archival Storage Required 
1 GB and 500 Files 
4b. What changes to codes, mathematical methods and/or algorithms do you anticipate will be needed to achieve this project's scientific objectives over the next 5 years?
The linear-scaling algorithms that we have developed in the last 5 years have already increased the sizes of systems that we are able to study. Wavefunction optimization will be essential to pushing this limit higher. Recently developed methods for wave function optimization are more robust than earlier schemes and show great promise for optimizing a much larger number of parameters than before. However, the memory requirements for optimizing larger parameter sets are substantially larger than those we have encountered previously, and the optimization procedure now presents a serial bottleneck. Our implementation of these methods is quite new; these bottlenecks will likely be remedied by the use of parallel eigensolvers or Davidson diagonalization methods.
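To illustrate the Davidson approach mentioned above, here is a serial toy version for the lowest eigenpair of a symmetric, diagonally dominant matrix (a production remedy would use a parallel, restarted variant; the preconditioner and starting vector are our illustrative choices):

```python
import numpy as np

def davidson_lowest(A, tol=1e-8, max_iter=200):
    """Lowest eigenpair of symmetric, diagonally dominant A (toy version)."""
    n = A.shape[0]
    V = np.zeros((n, 1))
    V[0, 0] = 1.0                            # initial subspace vector
    diag = np.diag(A)
    theta, u = diag[0], V[:, 0]
    for _ in range(max_iter):
        V, _ = np.linalg.qr(V)               # keep the subspace orthonormal
        T = V.T @ A @ V                      # small projected matrix
        w, Y = np.linalg.eigh(T)
        theta = w[0]
        u = V @ Y[:, 0]                      # Ritz vector
        r = A @ u - theta * u                # residual
        if np.linalg.norm(r) < tol:
            break
        # Diagonal (Davidson) preconditioner for the correction vector.
        denom = diag - theta
        denom[np.abs(denom) < 1e-12] = 1e-12
        V = np.hstack([V, (r / denom)[:, None]])
    return theta, u
```

Because only matrix-vector products with A are needed, the method avoids storing or diagonalizing the full derivative matrix, which is exactly the memory bottleneck described above.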
The ease of extending QMC methods to treat electronic and nuclear motion on an equal footing is a distinctive benefit of the QMC formalism. However, the efficiency of the method will hinge on the availability of accurate vibronic wave functions. Some exploration will be required to identify suitable forms for the wave function, and new code will be required for their evaluation.
4c. Please list any known or anticipated architectural requirements (e.g., 2 GB memory/core, interconnect latency < 3 µs).
Our linear-scaling code for molecular orbital evaluation requires significantly more memory than simpler approaches. Memory requirements for jobs that are currently “large” are still modest (125 MB/core), but this could increase to 5 GB/core for the “large” jobs of the near future. The code changes needed to reduce this requirement to 1 GB or less, without changing the CPU scaling, are simple and can be made quickly when the need arises.
4d. Please list any new software, services, or infrastructure support you will need over the next 5 years.
4e. It is believed that the dominant HPC architecture in the next 3-5 years will incorporate processing elements composed of 10s-1,000s of individual cores, perhaps GPUs or other accelerators. It is unlikely that a programming model based solely on MPI will be effective, or even supported, on these machines. Do you have a strategy for computing in such an environment? If so, please briefly describe it.
The transparency of the high-level parallelism inherent in QMC methods will provide a great deal of flexibility when modifying our code to take advantage of the low-level parallelism provided by GPUs. Several groups have reported significant (6x) speedups when their QMC codes were ported to GPUs, and we intend to port Zori to GPUs as well. We expect that the rate-limiting steps in the evaluation of the wavefunction (evaluation of basis functions, molecular orbitals, and correlation functions) will transfer well to GPUs.
5. New Science With New Resources
To help us get a better understanding of the quantitative requirements we've asked for above, please tell us: What significant scientific progress could you achieve over the next 5 years with access to 50X the HPC resources you currently have access to at NERSC? What would be the benefits to your research field if you were given access to these kinds of resources?
Please explain what aspects of "expanded HPC resources" are important for your project (e.g., more CPU hours, more memory, more storage, more throughput for small jobs, ability to handle very large jobs).
A 50-fold increase in HPC resources would allow the simulation of systems with as many as 800-1600 electrons. If effective core potentials are used, this is sufficient to model protein reaction centers or small nanoparticles using a fully correlated electronic structure theory. Larger systems could be studied if the environment surrounding the QMC simulation is included via molecular mechanics (MM); the theory and code needed for such QMC/MM calculations are in a late stage of development within this group. Advances in this direction will require roughly 50x increases in CPU hours and storage. With minor code changes, memory requirements can be kept below 1-2 GB per core. Queuing and charge-factor policies that encourage the highest possible concurrencies would also facilitate this research.
Alternatively, increased HPC resources could be used to develop more accurate treatments of small molecules. Methods for optimizing molecular geometries and computing vibrational frequencies are available for many ab initio methods, but are missing from the QMC toolkit. With a larger CPU allocation, we could begin to examine molecular geometries at the QMC level. In particular, QMC can be used without the Born-Oppenheimer approximation to sample the complete electron-nuclear wave function and compute high-accuracy thermochemical data. Calculations of this type would require similar increases in CPU hours and storage, but minimal increases in memory.