Compiler Comparisons on Hopper
Five compilers are available to users on Hopper, the NERSC Cray XE6: PGI, Cray, GNU, Intel, and Pathscale. All are provided as part of the Cray programming environment and are invoked through wrapper modules that ensure each compiler links against the proper system and MPI libraries.
Each of these compilers has a wide variety of options that control the level of optimization of the executable code it produces. We have collected several optimization recommendations for each compiler from Cray applications experts and various other sources and tested them on several benchmarks.
We provide a summary of the results below:
The benchmarks used to test these compilers fall into two categories: NERSC benchmarks devised for the NERSC6 procurement that led to the selection of Hopper, and the publicly available NAS Parallel Benchmarks (NPB) MPI 3.3 suite.
We used these benchmarks from the NERSC6 procurement:
|Benchmark|Science Area|Algorithm|Concurrency|Language|
|---|---|---|---|---|
|GTC|Fusion|PIC, finite difference|2048 (weak scaling)|f90|
|IMPACT-T|Accelerator Physics|PIC, FFT|1024 (strong scaling)|f90|
|MAESTRO|Astrophysics|Block structured-grid multiphysics|2048 (weak scaling)|f90|
|MILC|Lattice Gauge Physics|Conjugate gradient, sparse matrix, FFT|8192 (weak scaling)|c, assembly|
|PARATEC|Material Science|DFT; FFT, BLAS|1024 (strong scaling)|f90|
NPB 3.3 MPI Parallel Benchmarks
The following NPB 3.3 MPI Benchmarks will be run, all at a concurrency of 1024 processes:
|Benchmark|Name|Description|Class|
|---|---|---|---|
|BT|Block Tridiagonal|Solve a synthetic system of nonlinear PDEs using a block tridiagonal algorithm|E|
|CG|Conjugate Gradient|Estimate the smallest eigenvalue of a large sparse symmetric positive-definite matrix using inverse iteration with the conjugate gradient method as a subroutine for solving systems of linear equations|E|
|EP|Embarrassingly Parallel|Generate independent Gaussian random variates using the Marsaglia polar method|E|
|FT|Fast Fourier Transform|Solve a three-dimensional PDE using FFT|E|
|LU|Lower-Upper Symmetric Gauss-Seidel|Solve a synthetic system of nonlinear PDEs using a symmetric successive over-relaxation algorithm|E|
|MG|MultiGrid|Approximate the solution to a three-dimensional discrete Poisson equation using the V-cycle multigrid method|E|
|SP|Scalar Pentadiagonal|Solve a synthetic system of nonlinear PDEs using a scalar pentadiagonal algorithm|E|
PGI Compiler Recommendations
In Chapter 3, Optimizing and Parallelizing, of the PGI Compiler User's Guide, the section "Getting Started with Optimizations" recommends "-fast -Mipa=fast" as "a good set of options to use with any of the PGI compilers". In addition, Cray recommends -Mfprelaxed, which provides additional optimizations at the possible cost of some loss of floating-point precision.
Our benchmark runs will compare the performance of runs compiled with these options:
-fast - A level of optimization which chooses generally optimal flags for the target platform.
-fast -Mipa=fast - Enables interprocedural analysis and chooses generally optimal interprocedural options for the target platform.
-fast -Mfprelaxed - Generates relaxed precision code for those floating point operations that generate a significant performance improvement, depending on the target processor.
-fast -Mipa=fast -Mfprelaxed
The wall clock time of each run will be normalized against the "-fast" time, so if a job compiled with "-fast" completes in 20 seconds, a job completing in 23 seconds will appear on the graph as 1.15, and one completing in 18 seconds as 0.9.
This is a bar chart of the results.
Note: The Maestro benchmark does not run successfully with optimization options beyond "-fast"
Generally speaking, the other options do not significantly improve the performance over that obtained with "-fast", and in some cases worsen it. The only significant exception to this is the NPB FTD benchmark whose performance is greatly improved by each of the other three options.
Cray Compiler Recommendations
Cray recommends using the default optimization (-O2) which is equivalent to the higher levels of optimization with other compilers. In addition, the -O3 and -Ofp3 options can improve performance on some codes.
Our benchmark runs will compare the performance of runs compiled with these options:
-O3 - A higher general level of optimization than the default -O2.
-Ofp3 - This gives the compiler maximum freedom to optimize floating-point operations, even at the expense of not conforming to the IEEE floating point standard.
-O3,fp3 - The combination of the two options above.
The wall clock time is normalized against the default (-O2) time.
These are the results of the runs made with different optimizations.
Note: The GTC benchmark does not run when compiled with -Ofp3 as one of the optimizations. The FTD benchmark has much worse performance with the -O3 and -O3,fp3 options than with the other two, so much so that including them in the graph would seriously distort the mean.
Only one of the benchmarks shows a significant improvement over the default optimization, Paratec with -Ofp3. For all the other benchmarks, the higher levels of optimization give little or no improvement in performance.
GNU Compiler Recommendations
NERSC has found that for this compiler, -O3 produces well-optimized code for many benchmarks. In addition, Cray recommends the -ffast-math and -funroll-loops options for additional performance.
Our benchmark runs will compare the performance of runs compiled with these options:
-O3 - This compiles with a high level of optimization.
-O3 -ffast-math - This performs optimizations at the expense of an exact implementation of IEEE or ISO rules/specifications for math functions.
-O3 -funroll-loops - This unrolls loops whose number of iterations can be determined at compile time or upon entry to the loop. It also turns on complete loop peeling (i.e. complete removal of loops with a small constant number of iterations). This option makes code larger, and may or may not make it run faster.
-O3 -ffast-math -funroll-loops
The wall clock time is normalized against the -O3 time.
Note: The Maestro benchmark does not run successfully when compiled with the -ffast-math option.
-O3 generally gives a good level of optimization, but it seems worthwhile to try the -ffast-math option, since in many cases it improves a code's performance significantly.
Intel Compiler Recommendations
Based on Intel documentation and discussions with developers and benchmarkers, we tested the following options with the Intel compilers. The quotations are from the online ifort man page. The wall clock time in the graph below is normalized against the default (no optimization flags) wall clock time.
default (no optimization flags) - By default the Intel compiler has a high level of optimization. It is comparable to the -O2 optimization level.
-O2 - This "enables optimizations for speed", and is the recommended option for codes in the online man page.
-O3 - This performs all of the -O2 options as well as additional more aggressive loop transformations.
-O3 -unroll-aggressive -opt-prefetch - This was recommended to us by benchmarkers as being a good supplement to the -O3 optimizations.
-fast - This "maximizes speed across the entire program". It is a very high level of optimization, much more aggressive than that provided by the PGI "-fast" option, and includes interprocedural optimizations across files. It increases compilation time significantly, and occasionally compiles that succeed with the other options fail with this one, probably due to the greater processor and memory requirements.
Notes: The MAESTRO benchmark would not compile with the Intel compiler, claiming a Fortran standards violation. The MILC and FTD -fast compiles failed.
In general, as with the Cray compiler, the default compilation (no optimization arguments) gives very good performance for all the benchmarks.
Comparing the Compilers
In this section, for each benchmark, the best results for each compiler with the NERSC-recommended optimization arguments are compared against each other. Pathscale results are not included, since Cray no longer fully supports this compiler.
The results are normalized against the Intel compiler.
In general, the Cray and the Intel compilers outperform the others, so if performance is your prime consideration, it would repay you to test these compilers on your application with the recommended optimization options. The GNU and PGI compilers generally produce code 5-10% slower, but on some benchmarks one or the other outperforms all the other compilers.
Optimizing C++ Codes
C++ presents a great challenge to compilers because of its complexity and rich feature set, and many compilers that optimize more straightforward C and Fortran codes very well are not very effective at identifying optimization opportunities in C++ codes.
At C++ Benchmarks, there is a suite of C++ benchmarks that attempts to quantify how well vendors optimize a wide variety of C++ operations and language features.
There are six test programs. These are the programs and the developer's descriptions:
|Test|Description|
|---|---|
|stepanov_abstraction|"What happens to performance when I wrap a value in curly braces?" Almost all compilers do well on the original summation tests, but they don't do nearly so well on simple sort routines using the same abstractions.|
|stepanov_vector|"What happens to performance when I replace a pointer with a vector iterator? And what happens if I use reverse iterators?" This is a test of the compiler and of the STL implementation shipped with the compiler.|
|functionobjects|This is a benchmark for instantiation of simple functors, and partly a demonstration of the relative performance of function pointers, functors and inline operators. When a compiler works well, functors and inline operators should perform identically.|
|simple_types_constant_folding|Most developers assume that their compiler will do a good job of folding constant math expressions on simple data types. But do developers verify that assumption? One compiler does a decent job of folding the constants, but sometimes issues empty loops after removing constant calculations from the loops. Other compilers simplify some calculations but not other, similar calculations.|
|simple_types_loop_invariant|A test to see if the compiler will move loop invariant calculations out of the loop. This is something that a lot of developers assume the compiler does for them.|
|loop_unroll|This is almost a straightforward test to see if compilers will correctly unroll loops to hide instruction latency. "Almost" because hand-unrolling the loops would produce several hundred pages of source, so the developer used templates to do the unrolling, and found that some compilers have problems with such templates (which is yet another performance bug).|
Each of these programs consists of a dozen or so groups of similar operations, with the total time of the tests in each group summed. There are 49 total times across the six benchmark codes.
It should be emphasized that these tests do not test the floating point or communication ability of the compilers. They are serial codes that test the ability of a compiler to optimize the performance of C++ features.
For each compiler the codes were compiled with the optimization recommendations given in the previous sections and run on the compute nodes with the aprun job launcher.
This table shows the comparative performance of the different compilers. For each program the total elapsed time in seconds for all of the tests of that program is summed up, so the lower the number, the better the compiler performed. The stepanov_vector program would not compile with the PGI compiler.
As you can see, there is a considerable gap between the performance of the two best performing compilers, Intel and GNU, and the others, with Pathscale being the worst performer in virtually every category.