Math Library Performance
Fully optimizing a given application’s performance often requires a deep understanding of the source, an accurate profile for a representative run, and the ability to have changes to the source accepted upstream. However, in many cases, significant performance gains can be achieved by simply optimizing the code over the matrix of possible compilers, compiler options and libraries available on a given machine. Here, we explore the performance variability of common materials science applications at NERSC with respect to the compilers and libraries available on Edison, NERSC’s Cray XC30.
NERSC currently supports compilers from three different vendors on the XC30 system, Edison: Intel, GNU and Cray. Materials science applications generally rely heavily on math libraries such as FFTW, BLAS, LAPACK and ScaLAPACK. NERSC provides several library options for these routines on Edison: FFTW2, FFTW3, LibSci and MKL. We compare the performance of BerkeleyGW, Quantum ESPRESSO, VASP, LAMMPS, NAMD and NWChem with the compilers and libraries listed above.
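The matrix of compilers and libraries described above can be enumerated before any builds are launched. The Python sketch below is illustrative only and is not the tooling used for these tests. The names COMPILERS, FFT_LIBS, LINALG_LIBS, is_supported and build_matrix are hypothetical, and the split of the libraries into FFT providers and linear algebra providers is an assumption. The only exclusion encoded is one stated in this document: at the time of writing there was no LibSci release compatible with the Intel compiler.

```python
from itertools import product

# Compilers and libraries named in the text. Grouping the libraries into
# FFT and linear algebra providers is an assumption made for this sketch.
COMPILERS = ["Intel", "GNU", "Cray"]
FFT_LIBS = ["FFTW2", "FFTW3", "MKL"]
LINALG_LIBS = ["LibSci", "MKL"]


def is_supported(compiler, fft_lib, linalg_lib):
    """Return False for combinations the text reports as unavailable.

    Only one exclusion is encoded: no Intel-compatible LibSci existed
    at the time of writing. Other combinations may still fail to build.
    """
    return not (compiler == "Intel" and linalg_lib == "LibSci")


def build_matrix():
    """List every (compiler, FFT library, linear algebra library) candidate."""
    return [
        combo
        for combo in product(COMPILERS, FFT_LIBS, LINALG_LIBS)
        if is_supported(*combo)
    ]


if __name__ == "__main__":
    for compiler, fft_lib, linalg_lib in build_matrix():
        print(f"{compiler:5s}  FFT={fft_lib:5s}  linear algebra={linalg_lib}")
```

Keeping the exclusion rules in one function makes it easy to add further constraints as they are discovered, such as combinations that fail to compile for a particular application.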
BerkeleyGW was tested with the GNU, Intel and Cray compilers and with FFTW, LibSci and MKL. A summary of the overall performance of the code on our example benchmark calculation is shown in Fig. 1. As seen in the figure, the Intel compiler + MKL combination is the best overall for both 1 and 4 OpenMP threads per MPI task. For a single thread, MKL is the best library choice for each compiler. The Cray compiler + MKL combination actually outperforms Intel + MKL when using a single thread. However, as we discuss in more detail below, our Cray compiler + MKL builds performed poorly when multiple threads were used. The (threads) label in the figure refers to the use of the provided libfftw3_threads library. The GNU compiler results carry an IO overhead of approximately 100 seconds relative to the Intel and Cray results, which we were not able to eliminate, as shown in Fig. 2. This IO overhead occurred for all libraries used and appears as the offset in Fig. 1. Thus, Intel + MKL is the only compiler and library combination that performs well across the different numbers of MPI tasks and OpenMP threads that we considered.
To compare the FFT and linear algebra libraries more fairly, it is useful to hold the compiler fixed. The GNU compiler was the only one with which we were able to successfully compile and run our benchmark using all the available libraries. The results are summarized in Fig. 2.
Fig. 2 shows that MKL consistently outperforms the FFTW + LibSci combinations. The label (threads) in the figure refers to use of the provided libfftw3_threads library, while (omp) refers to the use of our manually compiled libfftw3_omp library. The plots show a walltime reduction from using multiple OpenMP threads per MPI task (for a fixed number of cores) with the MKL build, but a significant overhead when using the libfftw3_threads library. A possible explanation is that the thread implementation in the libfftw3_threads library does not perform well when combined with explicit OpenMP in the code, seemingly even when the threaded regions are distinct from each other. The libfftw3_omp library, which was built for each compiler individually and therefore uses the same OpenMP implementation as the linear algebra libraries and the rest of the code, appears to mitigate this conflict.
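The comparisons above hold the number of cores fixed while varying the number of OpenMP threads per MPI task, so the MPI task count must shrink as the thread count grows. A minimal Python sketch of that bookkeeping follows. The function names mpi_tasks and launch_env and the core count of 96 are hypothetical and are not taken from the benchmark setup; OMP_NUM_THREADS is the standard OpenMP environment variable.

```python
def mpi_tasks(total_cores, threads_per_task):
    """Number of MPI tasks that fills total_cores with no core left idle."""
    if total_cores <= 0 or threads_per_task <= 0:
        raise ValueError("core and thread counts must be positive")
    if total_cores % threads_per_task != 0:
        raise ValueError("threads_per_task must divide total_cores evenly")
    return total_cores // threads_per_task


def launch_env(threads_per_task):
    """Environment settings for a hybrid MPI + OpenMP run."""
    return {"OMP_NUM_THREADS": str(threads_per_task)}


if __name__ == "__main__":
    total_cores = 96  # hypothetical core count, for illustration only
    for threads in (1, 4):  # the two thread counts compared in the text
        print(threads, mpi_tasks(total_cores, threads), launch_env(threads))
```

Rejecting thread counts that do not divide the core count keeps every configuration in a comparison on exactly the same hardware footprint.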
Fig. 3 shows a summary of the BerkeleyGW results for the Cray compiler. In this case, we show results only for Cray's provided libfftw3_threads library and for MKL. From Fig. 3, we see that neither the threaded FFTW library nor MKL performs well with multiple threads with the Cray compiler in our tests. This might again be attributed to conflicts arising from multiple OpenMP implementations when other OpenMP regions exist in the code. The Cray compiler + MKL combination proved to be particularly problematic. It should be noted that Intel does not provide an MKL library specific to the Cray compiler. For the numbers shown, we linked against the threaded MKL libraries intended for use with the GNU compiler. This may help explain the particularly poor performance of the Cray + MKL combination when multiple threads are used. One may use the sequential MKL libraries instead, but because BerkeleyGW relies heavily on threaded libraries, the result is also poor. We additionally tried using MKL with libiomp5 (while removing the cce libomp.a from the build link line), without better success. We were unable to find an MKL library fully compatible with the Cray compiler or a suitable workaround. This limited the applicability of the Cray compiler for BerkeleyGW, since MKL outperforms FFTW and LibSci substantially. If a workaround could be found, the Cray compiler + MKL option could potentially outperform Intel + MKL.
We next consider the performance of the various libraries with the Intel compiler for BerkeleyGW. At the time of writing, Cray has not released a version of LibSci compatible with the Intel compiler, so we limited our study to comparing the performance of BerkeleyGW using MKL for linear algebra while varying the FFT library.
From Fig. 4, we see that the code with MKL FFTs generally outperforms the code with FFTW. This is particularly evident with 4 threads, where, once again, the provided libfftw3_threads library performs poorly. While the manually compiled libfftw3_omp library does not incur significant overhead with the use of threads, it is still generally outperformed by the MKL library. In order to provide a clearer picture of library performance within BerkeleyGW, Fig. 5 shows the walltime spent in FFT routines and ZGEMM for the various libraries with a single OpenMP thread per MPI task. The results shown are from BerkeleyGW compiled with the GNU compiler. However, the library timings are expected to be insensitive to the compiler. Because the performance of libfftw3_threads and libfftw3_omp is nearly identical for a single thread, the FFTW curves have been combined. The MKL FFTW interface outperforms the FFTW3 interface across all core counts, and the MKL ZGEMM outperforms the LibSci ZGEMM across all core counts. The latter difference is very significant, reducing the ZGEMM walltime by nearly a factor of 2.
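Per-routine walltimes such as those for the FFT routines and ZGEMM are typically gathered by wrapping each library call in a timer and accumulating the elapsed time per label. The Python sketch below illustrates that general technique only. It is not how BerkeleyGW collects its timings, and the names totals, timed and walltime_ratio are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated walltime in seconds, keyed by a region label such as "fft".
totals = defaultdict(float)


@contextmanager
def timed(label):
    """Add the walltime of the enclosed block to totals[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[label] += time.perf_counter() - start


def walltime_ratio(baseline_s, candidate_s):
    """How many times longer the baseline ran than the candidate."""
    if baseline_s <= 0 or candidate_s <= 0:
        raise ValueError("walltimes must be positive")
    return baseline_s / candidate_s


if __name__ == "__main__":
    with timed("zgemm"):
        sum(i * i for i in range(100_000))  # stand-in for a library call
    print(dict(totals))
```

With totals collected for two builds, walltime_ratio applied to the same label gives the kind of factor quoted above for ZGEMM.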
Quantum ESPRESSO was tested with the GNU, Intel and Cray compilers and with FFTW, LibSci and MKL. A summary of the overall performance of the code on our example benchmark calculation is shown in Fig. 6. As seen in the figure, the Intel compiler + MKL and GNU compiler + MKL combinations are the best overall for both 1 and 4 OpenMP threads per MPI task. The Cray compiler + MKL combination slightly outperforms the other combinations when using a single thread. However, as was the case for BerkeleyGW, Cray compiler + MKL again yielded lower performance when multiple threads were used, again potentially attributable to conflicting OpenMP implementations (see the discussion in the BerkeleyGW section). As above, the (threads) label refers to the use of the provided libfftw3_threads library.
VASP was tested with the Intel and Cray compilers as well as FFTW, MKL FFTs and an internal FFT library denoted "FURTH". A summary of the overall performance of the code on our example benchmark calculations is shown in the figure. The internal "FURTH" FFT library performed the worst and is excluded from the figure for simplicity. The figure shows different combinations of compiler, linear algebra library and FFT library. As was the case for BerkeleyGW and Quantum ESPRESSO above, the best combination was the Intel compiler with MKL used for both linear algebra and FFTs. Once again, MKL provided the best-performing FFT library.
LAMMPS was tested with the GNU, Intel and Cray compilers. A summary of the overall performance of the code on our example benchmark calculations is shown in Fig. 9. Since the LAMMPS examples described above do not make significant use of math libraries (the Rodo example does use FFTW, but the time spent in FFTW is a small fraction of the total runtime), we did not perform an extensive analysis of library performance in LAMMPS. In all the example cases, the Intel compiler had the best performance, closely followed by the GNU compiler.
NAMD was tested with the Intel and GNU compilers as well as FFTW. A summary of the overall performance of the code on our example benchmark calculations is shown in the figure. From the figure, we once again see that the Intel compiler provided the highest-performing build. The NAMD benchmark is overall less sensitive to FFT libraries, so FFT tests are not shown in the figure for simplicity.