Early application case studies
The Babbage test system was used to study representative applications and kernels in various scientific fields to gain experience with the challenges and strategies needed to optimize code performance on the MIC architecture. Below we highlight a few examples:
The BerkeleyGW package is a materials science application that calculates electronic and optical properties with quantitative accuracy, a critical need in materials design for more efficient and cost-effective solar light-harvesting and energy conversion.
GW calculations within BerkeleyGW occur across two executables. The first depends heavily on dense linear algebra and FFT math libraries, while the second depends on custom code that expresses large summations in terms of reduction loops. We focused predominantly on optimizing the performance of the main kernel in the custom-coded executable.
The original code (revision 4770 in the figure below) was an MPI-only code. Using CPU cores in an MPI-only model, we were unable to fit our sample problem into memory on the Xeon-Phi. The first step taken during optimization (Rev. 4896) was to refactor the code to have a 3-loop form: 1) a large loop to be parallelized with MPI tasks, 2) a large middle loop to be parallelized with OpenMP threads, and 3) a large inner loop that can be vectorized.
We then added the appropriate OMP reduction pragmas to the loops targeted for OpenMP. We noted that the code scaled well to tens of threads, but still performed poorly on the Xeon-Phi because the compiler was not vectorizing the innermost loops due to loop dependency. We simplified the flow (splitting into multiple loops in some cases) and were able to ensure that critical inner loops were vectorized by the compiler. After this optimization, the performance on the Xeon-Phi is comparable to the host Sandy-Bridge CPUs for our test example. It should be noted, that the example walltime decrease between revision 5338 and 5349 in the figure below is not entirely attributable to vectorization, but a combination of vectorization, code simplification and restructuring of loops.
After optimizing the code targeting the Xeon-Phi, we went back and compared performance on Hopper for a production run of the code. The results are shown in the below figure. The optimized kernel is represented by the the CH/SX time in the table.
This BerkeleyGW kernel and example is, in many ways, an ideal case for the Xeon-Phi: reduction loops can benefit from the increased memory bandwidth and the code was able to be restructured in a way that left very long inner loops (1000-10,000 iterations) ideal for vectorization. While this is fortuitous, not all codes can be structured in such a way.
FLASH is an adaptive mesh refinement code used for the simulation and modeling of various astrophysical phenomena. In the application readiness program, we experimented with the Sedov explosion problem, which is a pure hydrodynamics test problem distributed with the FLASH release. We ran FLASH in "native" mode on a single Xeon-Phi and obtained the best performance when using an MPI+OpenMP FLASH configuration with 15 MPI ranks per card and 8 OpenMP threads per MPI rank.
Significantly, however, we found that the test problem took 2.9x longer to run on 1 Xeon-Phi co-processor than it did on 2 Sandy-Bridge host processors. We identified the reason for the poor relative performance by running an OpenMP and vectorization experiment; the results are shown in the figures below. The figure on the left is an OpenMP speedup graph obtained by running a fixed-size problem with 1 MPI rank per Xeon-Phi and various numbers of OpenMP threads placed on separate cores. The speedup is not ideal; however, it is not the main reason for the poor performance when considering that a production FLASH simulation would use up to 15-way MPI parallelism on a single Xeon-Phi (based on memory overhead). The figure on the right shows how compiler auto-vectorization affects time to solution over a range of job configurations. Results indicate that disabling compiler auto-vectorization (red data points) leads to better performance in 4 out of 5 job configurations. Failure to extract benefit from vector instructions is the main reason for the poor relative performance, highlighting the need to investigate this issue.
We found that approximately half of the run time is spent in a large data reconstruction subroutine in the hydrodynamics solver. This subroutine uses various characteristic tracing methods to advance fluid state in each cell of the local mesh by half a time step. A detailed look at the code reveals that there is minimal opportunity for vectorization because the subroutine updates the fluid state one cell at a time. Two code transformations are necessary to draw benefit from vectorization. Firstly, the subroutine needs to be passed fluid data for multiple cells at once, and secondly, the fluid fields such as density, pressure and temperature for each cell should no longer be in contiguous memory locations. If we assume that we have applied the first code transformation of passing multiple grid points into the subroutine then an example code layout looks as follows
do i = 1, GRID_PTS
do n = 1, HY_WAVENUM
prod(HY_DENS:HY_PRES) = leig(HY_DENS:HY_PRES,n,i) * &
dp = sum(prod)
sigL(HY_DENS:HY_PRES,i) = sigL(HY_DENS:HY_PRES,i) + (dp * reig(HY_DENS:HY_PRES,n,i))
Here, the outer loop iterates over grid points (cells) along one dimension of a single AMR patch (typically 16 or 32) and the inner loop iterates over wave numbers (5). The upper case labels HY_DENS and HY_PRES are used to select the density and pressure fluid fields; the range HY_DENS:HY_PRES selects 5 fluid fields. It is possible to vectorize over the inner loop, however, any gains would be minimal because the inner loop is very short. In addition the code fragment shows a non-common case where the same operation is applied to all fluid fields of a single grid point. We make the code more amenable to vectorizaion by changing the data layout and loop order as follows
do n = 1, HY_WAVENUM
!dir$ vector aligned
do i = 1, GRID_PTS
dp(i) = &
(leig_dens(i,n) * (constC*(delW_dens(i)+W6_dens(i))+constD*W6_dens(i))) + &
(leig_velx(i,n) * (constC*(delW_velx(i)+W6_velx(i))+constD*W6_velx(i))) + &
(leig_vely(i,n) * (constC*(delW_vely(i)+W6_vely(i))+constD*W6_vely(i))) + &
(leig_velz(i,n) * (constC*(delW_velz(i)+W6_velz(i))+constD*W6_velz(i))) + &
(leig_pres(i,n) * (constC*(delW_pres(i)+W6_pres(i))+constD*W6_pres(i)))
sigL_dens(i) = sigL_dens(i) + (dp(i) * reig_dens(i,n))
sigL_velx(i) = sigL_velx(i) + (dp(i) * reig_velx(i,n))
sigL_vely(i) = sigL_vely(i) + (dp(i) * reig_vely(i,n))
sigL_velz(i) = sigL_velz(i) + (dp(i) * reig_velz(i,n))
sigL_pres(i) = sigL_pres(i) + (dp(i) * reig_pres(i,n))
We use separate arrays for each fluid field so that we can guarantee that the first grid point value of each fluid field array is aligned on a 64-byte boundary. If we decided to keep the data for all fluid fields in a single array such as sigL(1:GRID_PTS,HY_DENS:HY_PRES) and GRID_PTS=20 then we could guarantee that sigL(1,HY_DENS) is aligned on a 64-byte boundary, however, the first element of the next fluid field would begin at a memory offset of GRID_PTS * sizeof(double) which is not divisible by 64 and hence not aligned on a 64-byte boundary. Another option, although untested, would be to manually pad the 1:GRID_PTS dimension in a single multidimensional array.
These code transformations require a reasonable investment of time because the data reconstruction subroutine is over 1000 lines of code, and contains a large amount of logic to change things like the order of the method and type of Riemann solver. The logic in this monolithic subroutine is an impediment to vectorization and makes it hard for us to investigate the value of different code transformations, and so, in this project we tested the code transformations in a stand-alone application with a stripped-down version of the subroutine and show results below.
We first obtained a baseline time by running the application on Babbage (Intel Xeon-Phi), Edison (Intel Ivy-bridge) and Mira (IBM BG/Q) platforms. We then measure runtime for the baseline code with compiler auto-vectorization enabled and the new code with and without compiler auto-vectorization. Results show that vectorization hardly changes the performance of the baseline code, in agreement with our observations of the full FLASH application, but significantly improves the performance of the new code. This shows that transformations for vectorization can be portable across architectures and gives assurance to both the Flash Center and the wider community that a restructuring effort of this kind can be worthwhile. In parallel with our exploration, the Flash Center recently released FLASH-4.2 which now uses smaller subroutines for different order characteristic tracing methods rather than a single monolithic subroutine. This removes a lot of control flow (if statements) from inner loops and will make it easier for the Flash Center to usefully incorporate our code transformation suggestions into future versions of the code.
Multigrid linear solver kernel
The multigrid linear solver kernel of the Colorado State University’s atmospheric model was used for a scalability study on the KNC architecture. The original code is a pure MPI code that uses a 2D horizontal decomposition. This study uses the numerical grid for a horizontal resolution of about 30 km and 20 vertical layers with 40 decomposition subdomains, meaning that up to 40 MPI processes can be used.
Our first task was to add OpenMP threading to the code in order to understand threading scalability on Xeon-Phi. To achieve and maintain the coarsest granularity for a large number of cores, as in the case of Xeon-Phi, we adopted a threading strategy that collapses the outermost loops so that the loop count remains large enough for the number of threads.
The OpenMP scaling for a single MPI task shown in the first plot below indicates a good scaling up to about 64 threads. The second plot below compares runtimes of hybrid MPI+OpenMP runs with a fixed concurrency count of 40. Note that the number of OpenMP threads increases from left to right while the number of MPI processes increases from right to left. It shows a moderate 30% degradation in the pure OpenMP case (the far right one), compared to the pure MPI case (the far left one).
We noticed that we can increase vectorization efficiency by changing the index order of an array which would ensure unit-strided access inside loops. The plot below compares runtimes of the original and modified codes with and without vectorization compile flag, for a fixed concurrency count of 40. Without vectorization, the original code runs faster. However, since the vectorization effect becomes much larger in the modified code, the modified code runs faster than the original one, at least for up to 40 threads considered in this study.
However, despite restructuring, performance still lags on the KNC architecture in comparison to the host Sandy Bridge node as shown in the next plot. Further work will need to be conducted to gain more performance on many core architectures.
3D Stencil Diffusion Kernel
The code examples used in the following tests are from the book by Jim Jeffers and James Reinders, "Intel Xeon Phi Coprocessor High Performance Programming". A 3D 7-point stencil kernel is to simulate the diffusion effects of volume of liquid over time within a 3D container. It is run on Babbage with no vectorization and no OpenMP first, on the host and MIC card separately, to establish the baseline performance.
Then OpenMP paralleisim is added to the outer loop. On the host, 16 threads with KMP_AFFINITY option of "scatter" were used. On the MIC card, we tested with different numbers of threads: 60, 120, 180, 236, 240, combined with various KMP_AFFINITY options. The best speedup on MIC is obtained via 180 threads with scatter affinity. The code still runs faster on host than on the MIC with base option and OpenMP only option.
Then vectorization is introduced (via the "#pragma simd" directive) on top of OpenMP. It now runs faster on MIC than on the host. When more advanced loop optmization, loop peel and loop tiling, are used, even more performance gains are obtained.
The basic message is that OpenMP and vectorization both play significant roles on MIC performance optimization, and more advanced loop optimization techniques (such as loop peel and tiling) can improve performance further. The speedup plot below also shows various OpenMP affinity options should be tried out on different loop algorithms.