
Getting Started and Optimization Strategy

The purpose of this page is to get you started thinking about how to optimize your application for the Knights Landing (KNL) architecture that will be on Cori. This page will walk you through the high-level steps and give an example using a real application that runs at NERSC.

How Cori Differs From Edison

There are several important differences between the Cori (Knights Landing) node architecture and the Edison (Ivy Bridge) node architecture that require special attention from application developers hoping to get optimal performance on Cori. The differences are summarized in the table below:

Edison (Intel Xeon "Ivy Bridge")             Cori (Intel Xeon Phi "Knights Landing")
12 cores per CPU                             68 physical cores per CPU
24 virtual cores per CPU                     272 virtual cores per CPU
2.4 GHz                                      1.4 GHz
4 double precision operations per cycle      8 double precision operations per cycle
64 GB of DDR memory per node                 16 GB of MCDRAM fast memory & 96 GB of DDR memory per node
~100 GB/s memory bandwidth                   MCDRAM: ~500 GB/s memory bandwidth


Optimization Areas

There are three important areas of improvement to consider for Cori:

  1. Identifying and adding more node-level parallelism and exposing it in your application. An MPI+X programming approach is encouraged, where MPI represents a layer of inter-node communication and X represents a conscious intra-node parallelization layer, where X could again stand for MPI or for OpenMP, pthreads, PGAS, etc.
  2. Evaluating and improving your Vector Processing Unit (VPU) utilization and efficiency. As shown in the table above, the Cori processors have a vector unit that is 8 double-precision words wide. This means that if your code produces scalar rather than vector instructions, you miss out on a potential 8x speedup.
  3. Evaluating and optimizing for your memory bandwidth and latency requirements. Many codes run at NERSC are performance limited not by the CPU clock speed or vector width but by waiting for memory accesses. As described in detail below, a lot can be done to optimize such codes for Cori, including memory locality improvements and use of the on-package high-bandwidth memory (HBM).

In the sections below, we describe in detail an optimization strategy for these three target areas.

Optimization Concepts

On-Node Parallelism

On Cori, it will be necessary to think about both inter-node parallelism and on-node parallelism. The baseline programming model for Cori is MPI+X, where X represents some conscious level of on-node parallelism, which could be expressed as MPI or as a shared-memory programming model such as OpenMP or pthreads. For many codes, running without changes on 68 (or up to 272) MPI tasks per node could be troublesome. Examples are codes that are MPI latency sensitive, such as 3D FFTs, and codes that duplicate data structures across MPI ranks (MPI codes often don't perfectly distribute every data structure across ranks), which could more quickly exhaust the HBM when running in pure MPI mode.
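
As a concrete illustration, a minimal MPI+OpenMP sketch of this layering might look like the following (the array, its size, and the reduction are purely illustrative and not taken from any particular application):

   program hybrid_sketch
     use mpi
     implicit none
     integer, parameter :: n = 1000000
     integer :: ierr, provided, rank, i
     real(8), allocatable :: a(:)
     real(8) :: local_sum

     ! MPI supplies the inter-node layer; request support for threads
     call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
     call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

     allocate(a(n))
     a = 1.0d0
     local_sum = 0.0d0

     ! OpenMP supplies the on-node layer: threads within a rank share memory
     !$omp parallel do reduction(+:local_sum)
     do i = 1, n
        local_sum = local_sum + a(i)
     end do
     !$omp end parallel do

     call MPI_Finalize(ierr)
   end program hybrid_sketch

Each MPI rank would typically be placed on its own NUMA domain (or subset of cores), with its OpenMP threads kept within that domain.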

The difference between the typical pure MPI application targeting inter-node communication and an OpenMP parallel code targeting on-node parallelism is shown in the following figure:

In the following figure we show the advantage of moving to an MPI+OpenMP programming model for a materials science application whose walltime is dominated by an MPI-latency-sensitive parallel 3D FFT routine.

(Figure courtesy of Andrew Canning.) The performance improves dramatically when reducing the number of MPI tasks (and messages) in favor of OpenMP threads. The jump in walltime between 6 and 12 OpenMP threads corresponds to threads from a process extending beyond a single NUMA domain on a Cray XE6.

More details on using OpenMP to optimize your application for Cori can be found at https://www.nersc.gov/users/computational-systems/cori/preparing-for-cori/improving-openmp-scaling/

Vectorization

Vectorization is another form of on-node parallelism. Modern CPUs have Vector Processing Units (VPUs) that allow the processor to perform the same instruction on multiple data (SIMD) each cycle. Consider the figure below. The loop can be written in "vector" notation:
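
A representative sketch of such a loop and its array-syntax equivalent (the array names and length here are illustrative, not taken from the figure):

   real(8) :: a(1024), b(1024), c(1024)
   integer :: i

   ! Elementwise loop: each iteration is independent of the others
   do i = 1, 1024
      c(i) = a(i) + b(i)
   end do

   ! The same operation in Fortran array ("vector") notation
   c(1:1024) = a(1:1024) + b(1:1024)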

On KNL, the VPU will be capable of computing the operation on 8 double-precision elements of the vector concurrently. This is equivalent to computing 8 iterations of the loop at a time.

The compilers on Cori want to give you this 8x speedup whenever possible. However, some constructs commonly found in codes stump the compiler and prevent it from vectorizing. The following figure shows examples of code the compiler won't generally choose to vectorize:
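
Representative sketches of the two patterns discussed below (array names are again illustrative):

   integer, parameter :: n = 1024
   real(8) :: a(n), b(n), c(n)
   integer :: i

   ! Top case: loop-carried dependence -- iteration i needs the result of iteration i-1
   do i = 2, n
      a(i) = a(i-1) + b(i)
   end do

   ! Bottom case: the loop body forks on a runtime condition
   do i = 1, n
      if (a(i) > 0.0d0) then
         c(i) = a(i) + b(i)
      else
         c(i) = 0.0d0
      end if
   end do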

In the top case, compilers are unable to generate vector code because the ith iteration of the loop depends on the (i-1)th iteration; in other words, it is not possible to compute both iterations at once. The bottom case shows an example where the execution of the loop forks based on an if statement. Depending on the exact situation, the compiler may or may not vectorize the loop. Compilers typically use heuristics to decide whether they can execute both paths and still achieve a speedup over the sequential code.

Memory Bandwidth

Processors have a finite amount of memory bandwidth - that is, there is a maximum number of bytes per second they can read from memory. This limits the performance of many applications at NERSC. Consider the following block of code:
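
A representative Fortran sketch with the access pattern described below (the scalar c and the array sizes are illustrative assumptions):

   integer, parameter :: n = 10000000, m = 10000000  ! large enough that a(:) and b(:) do not fit in cache
   real(8), allocatable :: a(:), b(:)
   real(8) :: c
   integer :: i, j

   allocate(a(n), b(m))
   ! ... a, b, and c initialized here ...

   do i = 1, n
      do j = 1, m
         a(i) = a(i) + c*b(j)   ! one multiply and one add per inner iteration
      end do
   end do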

Assuming the arrays a(:) and b(:) are very large - too large to fit in any level of cache on the processor - then, every time this block of code is executed, n*m + n data elements must be streamed into the processor from memory. The n*m factor comes from streaming in the complete array b(:) from memory n times, and the extra factor of n comes from the need to stream array a(:) in from memory once.

Let's assume for a minute that both a(:) and b(:) are arrays of double precision numbers (i.e. 8 bytes per element). Since there are 2*n*m flops (one add and one multiply per inner iteration), the above loop has an approximate flop-to-byte ratio of 2*n*m / (8*(n*m + n)) ≈ 1/4. The flop-to-byte ratio as defined above is what we call the "Operational Intensity" of the code.

Looking at the Edison configuration page tells us that the maximum bandwidth we can achieve between memory and the two CPUs on a single Edison node is approximately 100 GB/s. Given the flop-to-byte ratio of 1/4, this means the loop above can, at most, achieve a performance of 25 GFlops. On the same Edison configuration page, though, note that the maximum performance per node you are supposed to be able to achieve is over 450 GFlops! What gives?

The code above is "memory bandwidth bound" - i.e. its performance is limited not by the CPU clock speed, or ability to vectorize, but by the rate at which it can bring data from memory into the processor. Scientists at Berkeley Lab have developed a graphical model (termed the "Roofline Model") to illustrate the maximum performance an application can achieve on a given hardware based on its operational intensity (flops/byte ratio). Below is the roofline model for Edison and the marker for the code block above:

The red dashed line represents the code block above, which is limited to at best 25 GFlops. Codes that fall to the left on the roofline model are memory bandwidth bound; codes that fall to the right are limited by a number of other ceilings. The green line represents the best performance you can achieve if your code doesn't vectorize (see above), and the blue line represents the best performance you can achieve without balancing multiply and add operations. The red line is theoretically the best performance you can achieve on the system.

Consider now the following functionally equivalent transformation of the code block above:
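
A sketch of the blocked variant, reusing the arrays from the sketch above (block is an assumed chunk size, chosen so that a chunk of b(:) fits in the last-level cache):

   integer, parameter :: block = 4096
   integer :: i, j, jout

   do jout = 1, m, block
      do i = 1, n
         do j = jout, min(jout + block - 1, m)
            a(i) = a(i) + c*b(j)
         end do
      end do
   end do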

In the blocked code, for each i we now access a chunk of array b, b(jout:jout+block-1), at a time. Assuming these chunks fit into the last level of cache on the processor, the amount of data we now need to stream into the CPU from memory during execution is n*m/block + m. The first term comes from the fact that we stream in all of array a(:) m/block times, and the second from streaming all of b(:) in only once. Thus, the amount of data accessed from memory has gone down significantly, and our operational intensity is now roughly 2*block/8 - a big improvement! This corresponds to the following change on our roofline plot.

The code has moved significantly to the right in the roofline model, allowing it access to a higher ceiling.

 

Optimization Strategy

Now that we have reviewed the basic concepts needed for optimizing code on Cori, let's discuss our optimization strategy.

We spent a long time at NERSC deciding how best to describe the optimization process for Cori. We considered models of staircases, space elevators, and an inescapable dungeon. The figure below shows the model that we think best describes our strategy.

Ant Farm

Yes, an Ant Farm. The lawn mower represents a profiler that is continuously being run over the code and throwing into the air the tallest blades of grass that need to be considered further. The ants come along and take the grass into their underground lair for investigation. What the ant farm really represents, though, is a flow chart to follow for understanding and optimizing your application:

 

We describe in detail some of the steps in the optimization strategy workflow for Cori below. We have prepared a worksheet for code teams to work through in order to hit the important points of the above workflow.

 

Optimization Worksheet

Below are the steps in the Cori optimization strategy as a series of questions and experiments that can be run on your own applications.  Italic text corresponds to sample answers from the Quantum ESPRESSO application.

1. Defining your problem

Design a problem that you can’t solve today but could solve if you had access to a machine 5-10x more powerful than Hopper or Edison and could run across the whole machine. A problem composed of multiple smaller problems is OK.

QE Answer:

10,000 Atom Transition Metal Oxide System. MgO, ZnO

1000 Atom Realistic Organic Photovoltaic at Hybrid Functional Level. See the figure below for the example system, P3HT.

 

 

Create a sample input deck of a typical system you can run today on Edison (at realistic scale) that resembles a smaller version of the type of problem you want to run in 2017. This should be shareable with the NESAP team (NERSC Staff and Engineers at Intel and Cray).

QE Answer:

Using the QE-based MiniDFT "large" input deck: 2000 atoms of MgO. The input file is accessible at: https://drive.google.com/file/d/0B1vuU9dbI-YnSTczMWhmanFWekE/view?usp=sharing

2. Profile your application

Use VTune. Describe the results. Where are the hotspots in your calculation? What fraction of your run is serial vs. threaded on Edison?

To do this experiment:

    1. Build your executable using:

      1. Use the -g flag throughout

      2. Use -dynamic on your link line to dynamically link your application

      3. Do “module unload darshan” before building

Instructions for VTune at NERSC

QE Answer:

VTune Top Down View

VTune OMP Regions

VTune Serial Regions

You can see that most of the time is spent in serial regions of the code, and of this time, a significant fraction is spent in memset. From a top-down view, the routines h_psi and pcdiaghg take the most time. Here is a zoomed-in view of the stack starting from these routines:

3. MPI vs. OpenMP Scaling Curves

For a fixed number of CPU cores, run your sample calculation in pure MPI mode and mixed OpenMP-MPI mode varying from 2 to 12 threads per MPI task.

QE Answer:

From the output of question 2, you can see that significant portions of the sample runtime (with 200 MPI tasks and 12 OpenMP threads) appear in serial regions. Despite this, the MPI vs. OpenMP scaling curve below is relatively flat, with the optimal performance occurring at two threads per MPI task. At 12 threads per MPI task, performance is about 20% slower than pure MPI.

For Quantum ESPRESSO, we ran our test on a fixed 2400 cores and varied the number of MPI tasks between 2400 and 200 and the number of OpenMP threads between 1 and 12.

 

Run the VTune advanced-hotspots collection to see thread utilization and spin time information. Are threads active (and not spinning) throughout your sample calculation? Use the instructions in question 2 with “-collect hotspots” replaced by “-collect advanced-hotspots”.

QE Answer:

We ran advanced-hotspots. Below is the CPU Usage Histogram. A majority of the time is spent with only 1 thread active.

The figure below shows thread utilization (brown) and spin time (red) over the course of the run. You can see that it is rare for all threads to be active. As shown in the appendix, the green regions correspond to ScaLAPACK.

4. Memory use

The maximum memory per task used by your job can be recovered from the MyNERSC completed jobs web portal. Browse to https://my.nersc.gov/completedjobs.php, click on the job in question, and read off the memory value. (Note that this value does not include huge pages, which may affect some users who explicitly utilize large pages.)

 

QE Answer:

The “large” input above consumes 10 GB per MPI task when run with 200 MPI tasks and 12 OpenMP threads. This gives roughly 2 TB total memory.

The memory used in Quantum ESPRESSO scales as the square of the number of atoms. So, a 10,000 atom calculation (5x more atoms, hence 25x more memory) would require roughly 50 TB of total memory.

5. Bandwidth Sensitivity

Run your example on Edison using 12 tasks/threads per node (using 2 x N nodes) versus 24 tasks/threads per node (using N nodes). If you only utilize half of the cores on a node (and half-fill each socket), the memory bandwidth available to each core will be greater (approaching a factor of 2). An example comparison would be:

aprun -n 24 -N 12 -S 6 ...

vs.

aprun -n 24 -N 24 -S 12 ...

If the runtime varies significantly between these two runs, it indicates that your code or kernel is memory bandwidth bound.

QE Answer:

For the Quantum ESPRESSO example above, we ran the following test cases.

Packed: 2400 MPI tasks, 24 MPI tasks per node, 12 MPI tasks per socket

Unpacked: 2400 MPI tasks, 12 MPI tasks per node, 6 MPI tasks per socket

Unpacking leads to an approximately 20% improvement. This suggests that our target problem in Quantum ESPRESSO is significantly bandwidth bound.

We can look at how all the top routines fared under the packed to unpacked transition:

Routine      Packed  Half-Packed  Ratio
h_psi        6529    2190x2       67%
-vloc_psi_k  5524    1738x2       63%
--invfft_x   2652    788x2        59%
--fwfft_x    2466    842x2        68%
distmat      2532    1051x2       83%
-zgemm       1845    855x2        93%
pcdiaghg     2235    1194x2       107%
hsi_dot_v    1745    749x2        86%
-zgemm       1401    650x2        93%

We can see from the above table that most of the gains from unpacking occur in the areas of the code (h_psi/vloc_psi_k) that are dominated by the parallel FFTs. The parts of the code dominated by zgemm saw more modest gains (zgemm itself sees only a 7% boost). The pcdiaghg ScaLAPACK routine actually slows down, presumably due to increased network contention.

Collect the bandwidth report from your application in VTune. Do certain areas of the code use more bandwidth than others? If you compare the VTune output for the packed and unpacked tests, do certain regions slow down more significantly than others?

Instructions for VTune: repeat the instructions in question 2 with “-collect hotspots” replaced by “-collect bandwidth”.

QE Answer:

We ran the VTune bandwidth collection and show it next to the advanced-hotspots thread activity for comparison. For this test we ran with 200 MPI tasks and 12 OpenMP threads per task.

The bandwidth used appears to be under 16 GB/s. However, there are two issues here. First, unless you zoom in on a certain region, the graph shows an averaged value over each interval. If you zoom in on the first solid region in the plot, you see a structure like the following:

Second, many other regions are essentially bandwidth bound because only one thread is active and is using roughly the same bandwidth as a single thread of the STREAM triad benchmark. If we were to pack 12 MPI tasks onto the socket, we would require more bandwidth than is available from the socket.

What can you tell about the memory access patterns in the sensitive regions of your code? Are the memory accesses sequential, strided, or random? Do you know your cache utilization or miss rates? Does your code actively use blocking to reuse caches?

QE Answer:

For the ZGEMM regions the memory access patterns are sequential. For the FFT regions the access patterns are more complex - usually done in a divide-and-conquer approach. More work is needed here for QE.

Wondering what to do if you are sensitive to memory bandwidth? You should consider two things:

a. Try to improve the data-locality and cache re-use of your application as described above

b. Identify hot arrays and place them into high-bandwidth memory, as sketched below. You can do this right now on Edison! https://www.nersc.gov/users/computational-systems/cori/preparing-for-cori/using-high-performance-libraries-and-tools/#toc-anchor-1
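
For (b), one possible approach, assuming the Intel Fortran compiler and the memkind library described at the link above, is the FASTMEM attribute on an allocatable array; a minimal sketch (the array name psi and its size are placeholders):

   real(8), allocatable :: psi(:,:)
   ! Request that this allocation come from high-bandwidth (MCDRAM) memory;
   ! requires linking against the memkind library
   !DIR$ ATTRIBUTES FASTMEM :: psi

   allocate(psi(4096, 4096))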

6. Clock Speed Sensitivity

Run your example at full vs. reduced clock speed on Edison. This test can tell you whether your code/kernel is CPU bound or not. If reducing the clock speed makes a significant performance difference in your application, your application is at least partially CPU bound.

To perform this test, run your calculation on Edison and pass the "--p-state" flag to aprun to specify the CPU frequency in kHz (max value of 2.4 GHz, min value of 1.2 GHz). For example,

aprun --p-state=1900000 ...

specifies a clock speed of 1.9 GHz. Compare runs at 1.9 GHz and 2.4 GHz. Don’t go below 1.9 GHz, or other effects come into play.

QE Answer:

We performed the reduced clock speed test and found that the walltime increased by approximately 18% after reducing the clock speed by 21%. This suggests that while the code is bandwidth sensitive, the overall runtime is also CPU sensitive, presumably due to the dense linear algebra regions.

 

7. Vectorization Efficiency

Compile and run with the flags -no-vec vs. -xAVX on Edison. What is the performance difference?

QE Answer:

We ran with and without compiler vectorization. As shown in the plot below, the change makes little difference. This is because the code depends heavily on math libraries (FFTW and BLAS), and the code that is not in libraries is bandwidth or communication bound.

8. IO and Communication

Link your application with IPM by doing:

    1. “module load ipm” before building

    2. “module unload darshan” before building

    3. add $IPM to the end of your link line

Run your application and view the IPM report near the end of standard out.

QE Answer:

We performed an MPI scaling study with IPM. Communication rises to about 50% of the runtime at 4800 MPI tasks.