Measuring and Understanding Memory Bandwidth
Measuring Bandwidth Usage
Measuring memory bandwidth is a good way of understanding how well your application uses cache memory. Today's processors are built on the assumption that a small amount of fast (and expensive) memory, called "cache", is used more often than the larger amount of slower memory (DRAM) behind it. The extent to which this assumption holds for your code is referred to as memory locality. If your data sets fit entirely in a compute core's L2 cache, then the memory bandwidth usage numbers (as reported by a tool like VTune, for example) will be small. In practice, an achieved bandwidth of >140 GB/sec (for KNC/Babbage/Xeon Phi) is near the maximum that an application is likely to see.
There are multiple ways to measure memory bandwidth with VTune. The easiest is to perform a "Bandwidth" analysis as described in detail on the following page: https://www.nersc.gov/users/software/debugging-and-profiling/vtune/ - This analysis shows the bandwidth usage of your application over time throughout the run (we recommend analyzing results on a kernel-by-kernel basis for what follows). From VTune analysis plots, like the one below, you can quickly identify regions of your code that use a lot of (or a little) memory bandwidth.
Interpreting Measured Bandwidth Usage
If your code or kernel is achieving memory bandwidth utilization values close to the peak, then the code is "memory bandwidth bound." There are two optimization strategies that might apply to your code in this case:
- Consider whether the ratio of flops to bytes requested from DRAM in your algorithm or implementation can be improved. Can you reuse data in the caches more effectively (for example, by indexing arrays in the right order)? Can a cache-blocking approach (see below) be added?
- Strategize how to effectively use the High Bandwidth Memory (HBM) present on Cori nodes. Determine which arrays are responsible for the high bandwidth utilization (via VTune) and allocate them in the HBM.
If the achieved bandwidth is substantially less than the peak, this can be due to a number of reasons:
- Good memory locality and reuse in the code. That is, the algorithm and implementation perform many flops per byte fetched from DRAM.
- The code or kernel in question depends on the memory subsystem in another way: it is memory latency bound. This occurs when code has poor locality and accesses data in a non-sequential or non-deterministic way.
- The code or kernel in question is I/O or communication bound (including MPI or OpenMP load imbalance, etc.).
- The code or kernel in question is characterized by some combination of the above.
The figure below outlines a general optimization strategy for Cori. It describes how to determine if your performance is bound by the memory system and what to do about it.
In the above figure, there are several experiments that can be performed on your kernels in order to understand how they depend on the memory system.
"Half Packed" Experiment
(1) Run your example on Edison using 12 tasks/threads per node (on 2 x N nodes) versus 24 tasks/threads per node (on N nodes). If you utilize only half of the cores on each node (half-filling each socket), then the memory bandwidth available to each core is greater (approaching a factor of 2). An example comparison would be:
aprun -n 24 -N 12 -S 6 ...
aprun -n 24 -N 24 -S 12 ...
If the runtime varies significantly between these two runs, it indicates that your code or kernel is memory bandwidth bound.
"Half Clock" Experiment
(2) Run your example at full vs. reduced clock speed on Edison. This test can tell you whether your code/kernel is CPU bound. If reducing the clock speed makes a significant performance difference, your application is at least partially CPU bound.
To perform this test, run your calculation on Edison and pass the "--p-state" flag to aprun to specify the CPU frequency in kHz (maximum 2.4 GHz, minimum 1.2 GHz). For example,
aprun --p-state=2200000 ...
specifies a clock speed of 2.2 GHz.
Improving Memory Locality
There are several ways to increase the data locality of your algorithm. In general, it is desirable to reuse as much data as possible from the various levels of CPU cache before it is evicted back to main memory. This can involve re-ordering loops, changing data structures (e.g. struct-of-arrays vs. array-of-structs), or "cache-blocking" linear-algebra-type operations.
Consider the following loop:
do i = 1, n
  do j = 1, m
    c = c + a(i) * b(j)
  end do
end do
Assuming m and n are sufficiently large that a and b cannot fit entirely in cache, the number of words streamed to the processor from main memory during the computation is n*m + n: for each i, the entire array b must be streamed from main memory (plus a(i) once).
On the other hand, consider the following loop that is functionally equivalent:
do jout = 1, m, block
  do i = 1, n
    do j = jout, min(jout + block - 1, m)
      c = c + a(i) * b(j)
    end do
  end do
end do
In this case, if the subset b(jout:jout+block-1) fits into the largest level of cache on the processor, the number of words streamed in from main memory is (m/block) * (n + block) ~ m*n/block. So, the total amount of data streamed from main memory to the processor has been reduced by roughly a factor of block.