Quantum ESPRESSO Exact Exchange Case Study
Quantum ESPRESSO is an OpenSource density functional theory (DFT) code and widely used in Materials Science and Quantum Chemistry to compute states of complex systems. It can be used for numerous purposes, including calculating the adsorption properties of porous materials, computing the thermodynamic and optical properties of existing and new materials, and simulating X-ray absorption spectra.
This article focuses on calculations involving hybrid functionals, which differ from local and semi-local functionals by the inclusion of a fraction of exact exchange. Hybrid functionals are known to improve the accuracy of calculated band gaps, reaction barriers, and structural properties for many systems, and often avoid certain unphysical behaviors, such as excessive charge delocalization, that are commonly observed when using local and semi-local functionals. Widespread application of hybrid DFT calculations in the context of plane-wave DFT remains limited, primarily due to the large computational cost of calculating the exact exchange. In this case study, we discuss improvements to the scalability and performance of hybrid DFT calculations in Quantum ESPRESSO.
The most costly part of an exact exchange calculation is the calculation of the exact exchange functional,
The original algorithm for calculating these integrals is:
Quantum ESPRESSO implements hybrid parallelism based on OpenMP and MPI. We have optimized both parts and will split the discussion as the changes we applied to the code are almost completely orthogonal to each other.
Improving OpenMP scaling
The inner element-wise products over r and g in the algorithm described above are performed in loops over all points of the (dual) grid. In the naive approach, one would spawn OpenMP sections around the loops over those grids and destroy them after the loops are done. However, that produces very bad scaling as shown in the figure below. This figure shows thread scaling vs. MPI scaling for the unoptimized code (left panel) as well as the total wall time for the point which showed the best performance (right panel) for a system comprised of 16 water molecules.
Profiling tools demonstrate that the reason for poor scaling performance is due to the overhead from fork-join operations because the loop trip counts are very small. The best strategy in that case is to enlarge the parallel regions. In a first approach, we merged the loops over j and r or g respectively. In order to do so, we dressed all the loops over grid points with loops over j, i.e. we technically contracted the outer loop over j around all the inner loops over r and g. The loops were then collapsed using appropriate OpenMP 4.X SIMD statements. We furthermore replaced the distributed FFT's by local 3D FFT's. This requires some global data re-arrangement which was made possible by implementing data layout conversions which employs global data transformations to make all grid points involved in the FFT for a set of i and j band indices node-local (for more details about that, please read the next section). This improvement lead to much better scaling
The code now scales nicely up to 8 threads, but shows still inferior performance compared to the all-MPI implementation. The next step we performed was timing the sections of the calculation separately. We expected the FFT's to dominate the timings, but - to our surprise - the element-wise products were as expensive as the FFT's. Digging into this with profiling tools exhibited that the compiler generated vector-gather and vector-scatter instructions for data which was aligned and contiguous. In order to help the compiler to recognize this and replacing the gather-scatter-operations with contiguous loads, we applied cache blocking to the innermost layer of these loops. This is how the calculation of the density rho looks like after this improvement
nblock = 2048
nrt = nr / nblock
!$omp parallel do collapse(2) ...
do ist = 1, nrt
do j=1, n
ir_start = (irt - 1) * nblock + 1
ir_end = min(ir_start+nblock-1,nr)
do ir = ir_start, it_end
rho(ir,j) = conjg(psi(ir,j)) * psi(ir,i)
!$omp end parallel do
The plot below shows how that improvement improved scaling performance
Now, the OpenMP parallelized code outperforms the MPI parallelized code by a large fraction. However, there is another improvement which can be made: by dressing all the loops with loops over band-index j, we can now also accelerate the computation of the FFT's. Instead of calling a 3D FFT for each index j, we can use batch-processing (or pooling) to allow the FFT libraries to perform multiple FFT's at once (common FFT libraries such as FFTW or Intel's DFTI support these functionality). By employing this, we achieve very good scaling and impressive performance compared to the original all-MPI code
The scaling is still not optimal but that is mainly attributed to the fact that a system with 16 water molecules is rather small. Scaling studies on larger systems show better scaling and speedups for up to 64 cores.
This concludes our section of OpenMP improvements for vexx. We will now have a look at multi-rank scaling and how that can be improved as well.
Improving MPI scaling
Most of the changes described in this section are orthogonal to the changes applied above, with some minor exceptions such as the enablement of local 3D-FFT's due to data-layout rearrangements. We will first look at out-of-the-box performance of Quantum ESPRESSO's MPI parallelization. The figures below depict the strong scaling behavior of that code for different types of parallelizations in QE, denoted as task-group and band-group parallelization.
Those implementations are mutually exclusive and the former only benefits the local (plain DFT) part of the code, and the latter only benefits the EXX part of the code. We can see that band-group parallelization is preferred but also that only scales to about 16 ranks. The cartoon below shows how espresso 5.2.0 (and also up to version 6.0) implements band-group parallelism in QE.
This diagram depicts the different parts of the calculation, i.e. SCF loop and the EXX loops over band pairs i,j. Regions colored in blue are parallelized while regions colored in red are not. In the original code, the SCF loop is basically duplicated among each rank in the partition. For that reason, this part of the calculation dominates the wall clock time for large partition sizes, i.e. see the plot below.
In this plot, we employed one bandgroup per node and it can be seen that the fraction of time spent in SCF is independent of the system size. Since we aimed to enable users to run calculations of big systems on many nodes, we targeted improving that situation. The goal here is to move parallelism out as much as possible: instead of parallelizing over inner bands j, we parallelized the application over pairs of bands i,j. That means, that each node gets a chunk of i-indices as was done in naive band-group parallelization, and then every nodes additionally only gets a fraction of the j-indices. This significantly reduces memory footprint as not all components for all bands need to be stored on every node. However, it introduces a little more complexity as now each node has to pass it's j-components around to all other nodes. Furthermore, the number of i,j-pairs involved in the EXX step depends on the previous SCF step (and is usually reduced), so that this number needs to be recomputed in every step in order to avoid duplication of work.
These changes improve scaling for the EXX part of the calculation, but we are still left with improving the local part. In the original code, the local calculation was replicated on every node and thus significantly limiting scaling to many nodes. In the new approach, we parallelize the local part by distributing g-vector components. The layouts used in the different parts of the calculation are shown below
The panel on the left shows the distribution over g-vectors. Here, all bands reside on one node and thus the diagonalization can be performed in an embarassingly parallel way. For the EXX calculation, we distribute band pairs and keep g-vectors and r-components local to any node, so that we can employ local 3D FFT's and the loop merging described above, thus facilitating OpenMP scaling. The diagram for the optimized code shows that every part of the calculation is parallelized
Using different layouts for different parts of the calculation requires in-flight data transpositions. However, despite the associated overhead, our changes significantly improve scaling performance as shown below.
The plots show that the EXX-part of the improved code (right) scales close to ideal. The overall calculation (left) scales much better than the unimproved code and is currently limited by non-optimal scaling of the diagonalization routine. We are currently looking into different methods and libraries for also optimizing this part of the claculation.