NERSC: Powering Scientific Discovery Since 1974

Thomas Devereaux

BES Requirements Worksheet

1.1. Project Information - Computation and Simulation of Photon-Based Spectroscopies

Document Prepared By

Thomas Devereaux

Project Title

Computation and Simulation of Photon-Based Spectroscopies

Principal Investigator

Thomas Devereaux

Participating Organizations

Stanford University 
SLAC National Accelerator Laboratory 
Advanced Light Source, Lawrence Berkeley National Lab

Funding Agencies


2. Project Summary & Scientific Objectives for the Next 5 Years

Please give a brief description of your project - highlighting its computational aspect - and outline its scientific objectives for the next 3-5 years. Please list one or two specific goals you hope to reach in 5 years.

New DOE facilities available in the near future, such as the SPEAR3 storage ring and the Linac Coherent Light Source at SLAC, MERLIN/QUERLIN at the Advanced Light Source, the IXS beamline at the Advanced Photon Source, and NSLS-II at Brookhaven, are expected to improve experimental accuracy and hold the promise of examining new phenomena and materials. Although progress has been made in the interpretation of data, these experiments still lack coherent theoretical support. Developments in theory will be crucial to give meaning to and understanding of the results, to provide insight into the new physics revealed by both theory and experiment, and to chart new directions where theory and experiment can proceed in step.  
Simulations of photon-based spectroscopies for transition-metal oxides are performed in our group, with specific emphasis on angle-resolved photoemission spectroscopy (ARPES) using determinant quantum Monte Carlo (DQMC) techniques and on resonant x-ray scattering (RXS/REXS/RIXS) using exact diagonalization (ED) techniques. Specifically, these simulations will be carried out on both single- and multi-orbital Hubbard models in order to construct the RXS spectrum, and also on models that explicitly incorporate atomic displacement fields to construct the photocurrent or spectral function. For different values of doping, temperature, and electron-electron (e-e) and electron-phonon (e-ph) interactions, the resulting calculations will be compared directly to measured RXS and ARPES data.  
Our work also focuses on ultrafast nonequilibrium phenomena in strongly correlated materials using nonequilibrium (Keldysh) extensions of dynamical mean-field theory (DMFT). We want to understand how electron-electron correlations and large driving fields can change the way materials behave, and how studying this behavior can provide insight into the microscopic theories for such complex systems and their emergent strongly correlated behavior. We will study materials near a Mott metal-insulator transition and materials with an electron-driven charge-density-wave instability. Experiments have already been completed on some of these systems, and more are likely to come in the near future as photon science moves into the ultrafast regime for probing strongly correlated electron materials. These systems are believed to have the potential for wide application within energy science due to their high tunability and the change of their properties in response to external perturbations or driving forces. Our work, however, focuses entirely on the basic-science aspects of these systems and of these experimental probes. Beyond improving our understanding of strongly correlated electrons in new materials, the results obtained in this research program have direct consequences for our global understanding of the behavior of any dynamical system with strong interactions, and are thus applicable to all of the grand challenge questions. 

3. Current HPC Usage and Methods

3a. Please list your current primary codes and their main mathematical methods and/or algorithms. Include quantities that characterize the size or scale of your simulations or numerical experiments; e.g., size of grid, number of particles, basis sets, etc. Also indicate how parallelism is expressed (e.g., MPI, OpenMP, MPI/OpenMP hybrid)

The determinant quantum Monte Carlo (DQMC) code used to simulate the Hubbard model on a finite-size cluster of size N is based on a path integral formulation of the partition function, obtained from a trace over the 4^N dimensional Hilbert space associated with the fermionic degrees of freedom. This formulation of the DQMC algorithm is performed in the grand canonical ensemble implying a Hilbert space including states of all possible occupation numbers in which the density is controlled by the chemical potential. A Trotter approximation is employed to disentangle the interacting and non-interacting pieces of the Hamiltonian. The imaginary time interval is divided into L "slices" or subintervals over which the commutator between the interacting and non-interacting pieces of the Hamiltonian may be neglected. This step introduces errors associated with the neglected commutator which are reduced as the number of subintervals increases. In order to exploit simplifying identities that replace the trace over the quantum mechanical Hilbert space by the determinant of a matrix of size NxN, we must express the interaction term in the Hubbard Hamiltonian in a quadratic form in the fermion operators. This is accomplished by a discrete Hubbard-Stratonovich transformation which introduces an auxiliary Ising field of size NxL, reducing the problem to a classical Monte Carlo sampling of this Ising field with a weight governed by the product of determinants. Correlation functions are measured periodically in this Markov process. The single-particle Green functions are elements of the 
inverse of the matrix used to determine the Monte Carlo weight. Multi-particle correlation functions, following from Wick's theorem, are obtained from products of the elements in the Green function matrix. Comparison to photon spectroscopies is accomplished by analytically continuing the correlation functions to real frequency using a maximum entropy method (MEM). 
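The structure of this weight can be illustrated with a minimal sketch (toy parameters on a small periodic Hubbard chain, serial Python rather than our production code): for a given Ising field configuration, the Monte Carlo weight is the product of determinants of two N x N matrices, one per spin species.

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical toy parameters (not from the worksheet): a periodic
# Hubbard chain with N sites and L imaginary-time slices of width dtau.
N, L = 4, 8
t_hop, U, dtau = 1.0, 4.0, 0.125

# Kinetic (hopping) matrix K for the periodic chain.
K = np.zeros((N, N))
for i in range(N):
    K[i, (i + 1) % N] = K[(i + 1) % N, i] = -t_hop
expK = expm(-dtau * K)

# Discrete Hubbard-Stratonovich coupling: cosh(lam) = exp(dtau * U / 2).
lam = np.arccosh(np.exp(dtau * U / 2.0))

def dqmc_weight(s):
    """Monte Carlo weight det(M_up) * det(M_dn) for an N x L Ising field s.

    M_sigma = I + B_L ... B_1 with B_l = exp(-dtau K) exp(sigma lam diag(s_l));
    the single-particle Green function is the inverse of I + B_L ... B_1.
    """
    w = 1.0
    for sigma in (+1, -1):
        M = np.eye(N)
        for l in range(L):
            B = expK @ np.diag(np.exp(sigma * lam * s[:, l]))
            M = B @ M
        w *= np.linalg.det(np.eye(N) + M)
    return w

rng = np.random.default_rng(0)
s = rng.choice([-1, 1], size=(N, L))
# At half filling on a bipartite lattice the weight is positive
# (no fermion sign problem), so it can serve as a sampling probability.
print(dqmc_weight(s))
```

The Markov process sketched in the text flips entries of `s` and accepts or rejects moves according to the ratio of these determinant weights.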
Our state-of-the-art exact diagonalization (ED) code for small clusters modeled via a multi-orbital Hubbard Hamiltonian uses PARPACK (Parallel Arnoldi PACKage) to diagonalize the matrix via iterative methods (a Lanczos/Arnoldi process), with the Hermitian matrix stored in sparse form as input. The input Hamiltonian matrix is generated by a code that uses a binary search algorithm to manipulate the binary representation of the Hilbert-space basis functions in the occupation-number representation. This algorithm improves the scaling for forming the Hamiltonian matrix from O(N^2) to O(N ln N) in the Hilbert-space dimension N, an extremely important improvement given that the Hilbert space grows exponentially with system size and our current intent is to perform runs with Hilbert-space dimensions in excess of 100 million elements. For the largest problems, the code that generates the sparse matrix is combined with the PARPACK eigensolver routine to mitigate the I/O associated with writing the sparse input matrix to a file. While direct, full diagonalization is possible for small problems on the order of thousands of Hilbert-space elements, larger problems require a more sophisticated treatment. We have chosen to utilize the PARPACK libraries for the diagonalization of sparse matrices, relying on a Lanczos iterative algorithm. Each run produces a truncated number of low- or high-energy eigenstates, the total number of which is dictated by walltime restrictions on the computing platform. This approach will eventually be limited by the available memory resources. Taking the two-dimensional copper-oxide plane as an example, due to its relevance for high-temperature superconducting materials, the multi-orbital Hubbard code we intend to use can access cluster sizes on the order of 10 copper sites with the associated oxygen atoms. 
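The basis-lookup step can be sketched as follows (a hypothetical Python illustration, not the production code): each occupation-number basis state is stored as an integer bit pattern, the basis is kept sorted, and a binary search recovers the row/column index of any state produced by applying a Hamiltonian term.

```python
from bisect import bisect_left

def build_basis(n_sites, n_up):
    """All states with n_up electrons on n_sites orbitals, as sorted bit patterns."""
    return sorted(s for s in range(1 << n_sites) if bin(s).count("1") == n_up)

def state_index(basis, state):
    """Binary search: locate a basis state in O(ln N) rather than O(N)."""
    i = bisect_left(basis, state)
    assert basis[i] == state, "state not in this particle-number sector"
    return i

# Toy sector: 3 spin-up electrons on 6 orbitals -> C(6,3) = 20 basis states.
basis = build_basis(n_sites=6, n_up=3)
print(len(basis), state_index(basis, basis[7]))
```

In the real code this lookup runs once per nonzero Hamiltonian matrix element, which is why reducing it from a linear scan to a binary search dominates the improvement in matrix-construction time.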
For our ED codes, the matrix-vector product used in the iterative Arnoldi process is a key component with an important impact on the overall performance of the program. The Hamiltonian matrix for this problem is very large and also sparse; it is therefore more efficient to calculate the matrix-vector product using the graph partitioning provided by the ParMETIS libraries. 
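A serial stand-in for this kernel (using SciPy's CSR sparse format; in production the matrix rows are distributed across MPI ranks according to the ParMETIS partition, which is chosen to minimize communication) looks like:

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Toy Hilbert-space dimension and a random sparse "Hamiltonian";
# the real matrix comes from the basis-construction code.
rng = np.random.default_rng(1)
dim = 1000
H = sparse_random(dim, dim, density=1e-2, format="csr", random_state=rng)
H = H + H.T                      # keep the matrix Hermitian (real symmetric here)

v = rng.standard_normal(dim)     # current Arnoldi vector
w = H @ v                        # O(nnz) sparse matrix-vector product
print(w.shape)
```

Each Arnoldi iteration performs one such product, so its cost (and, in parallel, the communication across the partition boundary) sets the time per iteration.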
Our time-domain DMFT calculations run on a discretized Kadanoff-Baym-Keldysh time contour. The number of discretization points, which ranges from 900 to 5,700, determines the matrix sizes for the calculation. The main limiting factors are the matrix inversion and multiplication routines, which scale as the cube of this matrix size. We perform finite-size scaling to extrapolate our results to the continuum limit. The more correlated the system, or the higher the frequency of the driving fields, the larger the number of discretization points needed. Hence the majority of the computational time is concentrated in the largest simulated system sizes. 
The basic algorithm for all DMFT-based codes is an iterative solution of a set of coupled nonlinear equations. Starting with an initial guess for the self-energy, one calculates the local Green's function on the lattice, then extracts the effective medium by using the local Dyson equation, then solves the impurity problem in that effective medium for the impurity Green's function, and finally extracts the impurity self-energy via Dyson's equation. This loop is iterated until it converges. Because of its iterative nature, we cannot tell a priori how many iterations will be needed to achieve convergence (although experience tells us it is usually between 10 and 200 iterations with 
more iterations needed for larger interaction strengths or smaller fields). Within this basic structure, there are two critical elements. The first is the determination of the local Green's function. For nonequilibrium problems, this is achieved via a two-dimensional matrix-valued quadrature routine where each quadrature point requires one matrix inversion and two matrix multiplications to 
determine the value of the integrand. Standard LAPACK and BLAS routines are used because the matrices are general complex dense matrices and the codes are well optimized. 
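The skeleton of this self-consistency loop can be sketched as follows (an equilibrium Matsubara-frequency toy on a Bethe lattice, where the local lattice Green's function closes analytically; the impurity solver is a placeholder returning zero self-energy, whereas the production codes solve the full nonequilibrium impurity problem on the Keldysh contour at that step):

```python
import numpy as np

t, mu, beta, n_w = 0.5, 0.0, 10.0, 256
w_n = (2 * np.arange(n_w) + 1) * np.pi / beta   # fermionic Matsubara frequencies

def impurity_solver(G0):
    # Placeholder: a noninteracting "impurity" has zero self-energy.
    return np.zeros_like(G0)

Sigma = np.zeros(n_w, dtype=complex)            # initial guess for the self-energy
G = 1.0 / (1j * w_n + mu)                       # initial local Green's function
for iteration in range(500):
    # Local lattice Green's function: one step of the implicit Bethe-lattice
    # equation G = 1 / (i w_n + mu - t**2 G - Sigma).
    G_new = 1.0 / (1j * w_n + mu - t**2 * G - Sigma)
    if np.max(np.abs(G_new - G)) < 1e-10:       # converged?
        G = G_new
        break
    G = G_new
    G0 = 1.0 / (1.0 / G + Sigma)                # effective medium via local Dyson eq.
    Sigma = impurity_solver(G0)                 # impurity problem -> new self-energy
print(iteration)
```

The iteration count needed for convergence is not known a priori, which is exactly the scheduling uncertainty described in the text; in the nonequilibrium codes every line inside the loop becomes a dense contour-matrix operation that scales as the cube of the number of contour points.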

3b. Please list known limitations, obstacles, and/or bottlenecks that currently limit your ability to perform simulations you would like to run. Is there anything specific to NERSC?

For determinant quantum Monte Carlo (DQMC), the code architecture makes it almost perfectly parallel under weak scaling. The majority of our production runs at NERSC concentrate on providing adequate data for the post-processing analytic continuation. This involves spawning a large number of independent Markov chains of the same length. However, this code will not be run on more than 128 or 256 processors, in order to keep the “warm-up” overhead to a minimum. 
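This embarrassingly parallel layout can be sketched schematically (a toy Metropolis chain in Python, with the MPI ranks replaced by a loop; in production each rank runs one chain with an independently spawned seed and discards its own warm-up sweeps):

```python
import numpy as np

def run_chain(seed, n_warmup=500, n_measure=2000):
    """One independent Markov chain: Metropolis sampling of a N(0,1) target."""
    rng = np.random.default_rng(seed)
    x, samples = 0.0, []
    for step in range(n_warmup + n_measure):
        x_new = x + rng.normal(scale=0.5)
        if rng.random() < np.exp(-0.5 * (x_new**2 - x**2)):
            x = x_new
        if step >= n_warmup:          # discard the warm-up portion
            samples.append(x)
    return np.mean(samples), np.var(samples)

# SeedSequence.spawn gives statistically independent streams, one per "rank".
seeds = np.random.SeedSequence(42).spawn(8)
results = [run_chain(s) for s in seeds]
means = [m for m, v in results]
print(np.mean(means))
```

Because every chain pays the same warm-up cost regardless of chain count, adding more processors adds more wasted warm-up sweeps, which is the trade-off behind the 128-256 processor cap mentioned above.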
The exact diagonalization (ED) code has been analyzed under strong scaling for several small cases used for actual production runs. We have used two variations of the same ED code: one simply calling the PARPACK sparse-matrix libraries, based on parallel Arnoldi iterative methods, and a second that also includes calls to ParMETIS (Parallel Graph Partitioning and Fill-reducing Matrix Ordering), an MPI-based parallel library that implements a variety of algorithms for partitioning and repartitioning unstructured graphs. In both cases, performance at the low end of the scaling curve is limited by memory: there is a minimum number of processors below which production runs cannot fit, set by the number of rows or columns of the sparse matrix that can be held in memory, which in turn depends on the sparsity of the matrix for any single problem. The vectors storing the eigenfunctions of the first N eigenstates of the sparse matrix, each one the length of the Hilbert-space dimension, must also fit in the memory allocated to a single core. With ParMETIS and PARPACK, the code scales superlinearly up to tens of thousands of processors.  
Because our DMFT code involves both parallel and serial parts in the dynamical mean-field theory algorithm, the scaling begins to drop off as one uses 1,500 or more processors, and we get the best performance in the 500-1,000 range. Hence all runs will need to be completed at the full charging rate, with no discounts. In cases where we need to spread the problem across the memory of a full node (and hence keep some CPUs idle), the number of CPUs employed will grow to about twice as many as we use in cases where the problem fits entirely into the memory of one core.  

3c. Please fill out the following table to the best of your ability. This table provides baseline data to help extrapolate to requirements for future years. If you are uncertain about any item, please use your best estimate to use as a starting point for discussions.

Facilities Used or Using

 NERSC  OLCF  ALCF  NSF Centers  Other: local cluster, Canadian Grid

Architectures Used

 Cray XT  IBM Power  BlueGene  Linux Cluster  Other:  

Total Computational Hours Used per Year

 8,000,000 Core-Hours

NERSC Hours Used in 2009

 3,500,000 Core-Hours

Number of Cores Used in Typical Production Run


Wallclock Hours of Single Typical Production Run


Total Memory Used per Run

2000 GB

Minimum Memory Required per Core

1 GB

Total Data Read & Written per Run

 10 GB

Size of Checkpoint File(s)

 10 GB

Amount of Data Moved In/Out of NERSC

 8,000 GB per year

On-Line File Storage Required (For I/O from a Running Job)

 0.01 GB and  10 Files

Off-Line Archival Storage Required

 1 GB and 500 Files

Please list any required or important software, services, or infrastructure (beyond supercomputing and standard storage infrastructure) provided by HPC centers or system vendors.


4. HPC Requirements in 5 Years

4a. We are formulating the requirements for NERSC that will enable you to meet the goals you outlined in Section 2 above. Please fill out the following table to the best of your ability. If you are uncertain about any item, please use your best estimate to use as a starting point for discussions at the workshop.

Computational Hours Required per Year


Anticipated Number of Cores to be Used in a Typical Production Run


Anticipated Wallclock to be Used in a Typical Production Run Using the Number of Cores Given Above


Anticipated Total Memory Used per Run

 8000 GB

Anticipated Minimum Memory Required per Core

 4 GB

Anticipated total data read & written per run

 40 GB

Anticipated size of checkpoint file(s)

 40 GB

Anticipated On-Line File Storage Required (For I/O from a Running Job)

0.04 GB and 10 Files

Anticipated Amount of Data Moved In/Out of NERSC

 32,000 GB per year

Anticipated Off-Line Archival Storage Required

 4 GB and  2000 Files

4b. What changes to codes, mathematical methods and/or algorithms do you anticipate will be needed to achieve this project's scientific objectives over the next 5 years.

4c. Please list any known or anticipated architectural requirements (e.g., 2 GB memory/core, interconnect latency < 3 µs).

The more memory per core the better.

4d. Please list any new software, services, or infrastructure support you will need over the next 5 years.


4e. It is believed that the dominant HPC architecture in the next 3-5 years will incorporate processing elements composed of 10s-1,000s of individual cores, perhaps GPUs or other accelerators. It is unlikely that a programming model based solely on MPI will be effective, or even supported, on these machines. Do you have a strategy for computing in such an environment? If so, please briefly describe it.

We have ported our codes to a GPU environment on our local cluster machines, and we plan to devote more resources to this effort with the release of the NVIDIA Fermi CUDA architecture. 

New Science With New Resources

To help us get a better understanding of the quantitative requirements we've asked for above, please tell us: What significant scientific progress could you achieve over the next 5 years with access to 50X the HPC resources you currently have access to at NERSC? What would be the benefits to your research field if you were given access to these kinds of resources?

Please explain what aspects of "expanded HPC resources" are important for your project (e.g., more CPU hours, more memory, more storage, more throughput for small jobs, ability to handle very large jobs).