
Improving OpenMP Scaling

Introduction

Cori's Knights Landing (KNL) processors will each have more than 60 cores, with each core having the ability to support 4 hardware threads, leading to the possibility of up to 240 threads per single-socket node.  Your application is likely to run on KNL without significant modification, but achieving good performance while using all the cores and threads may be more difficult.  Applications may not fit into the memory of a Cori node if they try to use an MPI task for each core because of the memory overhead for each MPI task. 

To get around this limitation, a mixed MPI/OpenMP programming model is recommended to achieve scalability on the node.  Using a hybrid MPI/OpenMP model can reduce memory usage by having fewer MPI tasks per node, thus reducing the need to store local copies of variables and data structures for each MPI task.  For example, a grid based application that requires ghost cells for each MPI task can use hybrid MPI/OpenMP to reduce memory usage associated with storing ghost cells by using fewer MPI tasks per node. And with hybrid MPI/OpenMP, there will be fewer MPI messages, with larger message sizes, both of which help to improve application performance across the high speed interconnect.

Good thread scaling is important for an application to get good performance on Cori.  The best combination of MPI tasks and threads per task should be determined by experimentation.  You should run your code with various numbers of MPI tasks and OpenMP threads to find the "sweet spots" that minimize the overall run time, such as 4 MPI tasks x 15 OpenMP threads per task, or 8 MPI tasks x 10 OpenMP threads per task, etc.  We expect that the most common values will be 4, 8, 16, or 32 MPI tasks with somewhere between 4 and 16 OpenMP threads per MPI task.

You can use Babbage and Edison today to start experimenting with mixed MPI and OpenMP programming.  This page presents tips and considerations for efficiently scaling hybrid MPI/OpenMP programs, along with a few case studies of optimization techniques applied to real applications.

Fine Grain and Coarse Grain Models

There are basically two OpenMP programming models: fine grained and coarse grained. Each has its advantages and disadvantages. 

Below is an example code using the fine grain model. The program is single threaded except where it actively uses multiple threads, such as in loop processing.  A fine grain model is easier to implement within an MPI program; however, there is more thread overhead, and serial sections of the code may become bottlenecks.

! Program fine_grain
!$OMP PARALLEL DO
    do i=1,n
        … computation
    enddo
!$OMP END PARALLEL DO

! … some serial computation …

!$OMP PARALLEL DO
    do i=1,n
        … computation
    enddo
!$OMP END PARALLEL DO
end

Below is an example code using the coarse grain model, where most of the code runs within an OpenMP parallel region.  The coarse grain model has low thread creation overhead and consistent thread affinity.  However, it is harder to code and is prone to race conditions.

! Program coarse_grain
!$OMP PARALLEL
!$OMP DO
    do i=1,n
        … computation
    enddo
!$OMP END DO

!$OMP DO
    do i=1,n
        … computation
    enddo
!$OMP END DO
!$OMP END PARALLEL
 end

Memory Affinity: "First Touch" Policy

In order to get optimal performance it is best for each thread of execution to allocate memory "closest" to the core on which it is executing. This is accomplished by allocating and immediately initializing memory from the thread that will be using it.  This is often referred to as a “first touch” policy because the OS allocates memory as close as possible to the thread that "first touches" (initializes) it.

In the following code snippet, the initialization and compute sections use the same OpenMP parallel loop, so the portion of each array assigned to a given thread is placed in memory local to that thread.  If the first "#pragma" directive is removed, the initialization is done by the master thread only, and the memory associated with the arrays will be local to the master thread, potentially farther away, and thus slower to access, from the other threads.

// Initialization
#pragma omp parallel for
for (j = 0; j < VectorSize; j++) {
    a[j] = 1.0; b[j] = 2.0; c[j] = 0.0;
}

// Compute
#pragma omp parallel for
for (j = 0; j < VectorSize; j++) {
    a[j] = b[j] + d * c[j];
}

The figure below shows the effect of "first touch".  On Hopper, with "touch by all" the STREAM bandwidth scales linearly with the number of threads up to 24.  With "touch by one", it scales only up to 6 threads, which is the number of cores within a single Hopper NUMA domain.

 

(Advanced notes: In this example, we assume the array size is large enough that the array section local to one thread does not fit in the same page allocated for another thread.  For smaller shared variables, it is better for the thread that "potentially owns" most of the data within a page to do the first touch.  A workaround for a small array is to declare the variable "threadprivate"; each thread then has a copy of the complete array in its own local memory, and the threads can update the shared array with atomic operations afterwards.)
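
A minimal sketch of that threadprivate workaround in C is shown below; the array name, size, and the some_work() routine are hypothetical placeholders, not part of any specific application:

#define N 16

static double local_a[N];            /* per-thread copy */
#pragma omp threadprivate(local_a)

double a[N];                         /* small shared array */

static double some_work(int j)       /* hypothetical per-element work */
{
    return 0.5 * j;
}

void update_small_array(void)
{
    #pragma omp parallel
    {
        int j;
        for (j = 0; j < N; j++)      /* each thread fills its own local copy */
            local_a[j] = some_work(j);

        for (j = 0; j < N; j++) {    /* merge the copies into the shared array */
            #pragma omp atomic
            a[j] += local_a[j];
        }
    }
}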

It is very hard to do a "perfect touch" for real applications, so a common strategy is to limit the number of threads to no more than the number of cores in a NUMA domain (assuming one thread per core).  On Hopper the number of cores in a NUMA domain is 6, and on Edison it is 12.  For this reason you might expect good performance on Hopper using 4 MPI tasks with 6 OpenMP threads per task, and on Edison using 2 or 4 MPI tasks with 12 or 6 OpenMP threads per task.  When we have more experience with early Knights Landing (KNL) hardware, NERSC will be able to give more precise recommendations on memory affinity.  For now, Intel has revealed that there will be multiple NUMA domains on a KNL node and that the many compute cores on the node will be connected in a 2D mesh architecture, so we expect memory affinity to be important.

False Sharing 

Data in memory are accessed via cache lines.  Threads hold local copies of the same (global) data in their caches, and cache coherence keeps those local copies consistent with the global data: whenever a thread writes to its copy of a cache line, the line must be updated everywhere.  When multiple threads write to different data that happen to share a cache line, these updates are serialized even though no data are actually shared.  This is called false sharing, or cache thrashing, and it hurts parallel performance.

The following example illustrates false sharing.  Array a is shared, and the chunk size of 1 causes each thread to update one element of a.  But a[0] and a[1] are contiguous and sit on the same cache line, so the line must be updated after thread 0 writes a[0] and before thread 1 writes a[1], effectively serializing the loop.

int a[N];
#pragma omp parallel for schedule(static, 1)
for (i = 0; i < N; i++) {
    a[i] += i;
}

Tips for avoiding false sharing include using private variables, or padding arrays so that each thread updates a different cache line.  For example:

int a[N][cache_line_size];
#pragma omp parallel for schedule(static, 1)
for (i = 0; i < N; i++) {
    a[i][0] += i;
}
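
The private-variable alternative mentioned above can look like the following minimal sketch (the function and array names are arbitrary): each thread accumulates into a variable on its own stack and touches shared data only once at the end.

#define N 1000

int sum_with_private(const int *data)
{
    int total = 0;
    #pragma omp parallel
    {
        int local = 0;                  /* private: lives on each thread's own stack */
        #pragma omp for
        for (int i = 0; i < N; i++)
            local += data[i];
        #pragma omp atomic
        total += local;                 /* one shared update per thread */
    }
    return total;
}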

Race Condition

A race condition can occur when multiple threads update the same shared variable simultaneously: the variable ends up with the value from whichever thread updates it last, so the code can give different results from run to run or with different numbers of threads.  Solutions for race conditions include the "critical" and "atomic" directives and the "reduction" clause.

The following example has the race condition: 

#pragma omp parallel private(x, sum)
{
    #pragma omp for
    for (i = 0; i <= num_steps; i++) {
        x = (i + 0.5) * step;
        sum = sum + 4.0 / (1.0 + x * x);
    }
    pi = pi + step * sum;   /* race: multiple threads update pi */
}

To avoid the race condition for updating pi from multiple threads, an OpenMP directive "#pragma omp critical" or "#pragma omp atomic" can be added before that line.  The best solution in this case is to use the "reduction" clause as below:

#pragma omp parallel for private(x) reduction(+:sum)
for (i = 0; i <= num_steps; i++) {
    x = (i + 0.5) * step;
    sum = sum + 4.0 / (1.0 + x * x);
}
pi = pi + step * sum;
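
For reference, here is a self-contained, compilable version of the reduction example above (a sketch; the number of steps is an arbitrary choice):

#include <stdio.h>

int main(void)
{
    const long num_steps = 1000000;           /* arbitrary choice */
    const double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    double x, pi;
    long i;

    #pragma omp parallel for private(x) reduction(+:sum)
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;                 /* midpoint of interval i */
        sum = sum + 4.0 / (1.0 + x * x);      /* integrand 4/(1+x^2) */
    }

    pi = step * sum;
    printf("pi = %.12f\n", pi);
    return 0;
}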

Cache Locality

Cache locality means reusing data in cache as much as possible.  Some ways to improve cache locality are:

  • Use a memory stride of 1.
  • Keep in mind that arrays are stored in column-major order in Fortran and in row-major order in C and C++.
  • Access array elements in the same order as they are stored in memory.
  • Interchange loops or index orders if necessary. This tip is often used in real applications, and applies equally to serial and MPI programs.
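
For example, a minimal sketch of loop interchange for a row-major C array (the array names and size are arbitrary):

#define SIZE 1024

static double a[SIZE][SIZE], b[SIZE][SIZE];

/* Before: the inner loop walks down a column, so the memory stride is SIZE. */
void scale_bad(double s)
{
    for (int j = 0; j < SIZE; j++)
        for (int i = 0; i < SIZE; i++)
            a[i][j] = s * b[i][j];
}

/* After interchange: the inner loop walks along a row, so the stride is 1. */
void scale_good(double s)
{
    for (int i = 0; i < SIZE; i++)
        for (int j = 0; j < SIZE; j++)
            a[i][j] = s * b[i][j];
}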

Process and Thread Affinity

Process affinity, also called "CPU pinning", binds an MPI process to a CPU or a range of CPUs on a node.  It is important to spread MPI ranks evenly onto cores in different NUMA domains.  Thread affinity forces each thread (of a given process) to run on a specific subset of processors, to take advantage of local process state.  Thread locality is important since it affects both memory and intra-node performance.  Thread affinity should build on process affinity: threads forked by an MPI task should be bound close to the cores of their parent MPI task.

Hopper (Cray XE6) and Edison (Cray XC30)

Illustrations of the Hopper and Edison compute nodes are shown below.  Hopper, NERSC's Cray XE6 system, has 4 NUMA domains per node and 6 cores per NUMA domain.  Edison, NERSC's Cray XC30 system, has 2 NUMA domains per node, 12 cores per NUMA domain, and 2 hardware threads per core.  Memory bandwidth is non-homogeneous among NUMA domains: a processor accesses memory in its local NUMA domain faster than memory in a remote NUMA domain.

 

On Hopper or Edison, the aprun command option "-S" can be used to control process affinity.  For example, on Hopper, when "aprun -n 4 -d 6 ..." is used, all 4 MPI ranks are placed on the first NUMA domain (see the Hopper figure below).  If "aprun -n 4 -S 1 -d 6 ..." is used instead, the 4 MPI ranks are allocated one per NUMA domain (see the Hopper figure above), so the 6 threads of each rank have thread affinity within the NUMA domain local to their MPI task.

 

The figure below compares the performance of the GTC hybrid MPI/OpenMP code with and without proper process affinity.  The green arrow points to the run time using the "aprun -n 8192 -S 2 -d 3" option.  The sweet spot for this GTC application is 8 MPI tasks per node (2 per NUMA domain) with 3 OpenMP threads per MPI task.

 

On Hopper or Edison, the aprun command option "-cc" can be used to control thread affinity.  For non-Intel compilers, the default option is “-cc cpu” should be used.   For Intel compiler, we need to work around an extra thread that Intel compiler uses to manipulate threads: please use “-cc none” (instead of "-cc cpu") for 1 MPI process per node; or “-cc numa_node” (Hopper) or “-cc depth” (Edison) for multiple MPI processes per node.

Babbage (KNC)

An illustration of a MIC card on a Babbage compute node is shown below.  Babbage is NERSC's Intel Xeon Phi testbed with 45 compute nodes; each compute node has 1 Xeon host node and 2 MIC cards.  Each MIC card has 1 NUMA domain with 60 physical cores and 4 hardware threads per physical core, hence a total of 240 logical cores.  Notice that logical core 0 is on physical core 60.

 

There are two mechanisms for defining thread affinity on Babbage.  The simpler mechanism is the run time environment variable KMP_AFFINITY.  Available options are none, compact, scatter, balanced, and explicit.  "compact" binds threads as close to each other as possible.  "scatter" binds threads as far apart as possible.  "balanced" spreads threads across cores first, then places consecutively numbered threads on hardware threads of the same core; it is available only on the MIC cards (not on the host nodes).  "explicit" binds threads to an explicitly specified set of logical cores.

Another way of specifying thread affinity is the run time environment variable KMP_PLACE_THREADS.  This is also a new setting available on MIC only.  Settings of the form <n>Cx<m>T,<o>O mean: use <n> cores with <m> threads per core, offset by <o> cores.  For example, 40Cx3T,1O means to use 40 cores with 3 threads per core, with an offset of 1 core.

More details on MIC thread affinity with illustrations of various core binding settings can be found here.

MPI process affinity for hybrid MPI/OpenMP can be controlled by the run time environment variable I_MPI_PIN_DOMAIN, which defines the size and layout of domains.  A domain is a group of logical cores, so on the MIC card its size must be a multiple of 4.  One MPI rank is placed in each domain, and the OpenMP threads of that rank can then be pinned within the domain.  More details on MIC process affinity settings can be found here.

Programming Tips for Adding OpenMP

When adding OpenMP to your existing program, whether it is a serial code or an MPI code, a few programming tips should be considered:

  • Choose between fine grain or coarse grain parallelism implementation.
  • Use profiling tools to find hotspots. Add OpenMP and check correctness incrementally.
  • Parallelize outer loops and collapse loops if possible (see the sketch after this list).
  • Minimize shared variables, minimize barriers.
  • Decide whether to overlap MPI communication with thread computation.
    • Simplest and least error-prone way is to use MPI outside parallel region, and allow only master thread to communicate between MPI tasks.
    • Could use MPI inside parallel region with thread-safe MPI.
  • Consider OpenMP TASK.
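
As an illustration of the outer-loop and collapse suggestions above, a minimal sketch (the array names and sizes are arbitrary):

#define NI 8
#define NJ 10000

static double u[NI][NJ], v[NI][NJ];

void update(void)
{
    /* Parallelize the outer loop: each thread gets whole rows. */
    #pragma omp parallel for
    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NJ; j++)
            u[i][j] += v[i][j];

    /* collapse(2) merges the two loops into one NI*NJ iteration space,
       useful when NI alone is too small to keep every thread busy. */
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NJ; j++)
            v[i][j] = 0.5 * u[i][j];
}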

What to do if a routine does not scale well?

First understand why perfect speedup is not achieved.  Possible reasons include:
  • Serial code sections are not parallelized
  • Thread creation and synchronization cause overhead
  • The application is memory bandwidth limited
  • Increasing the number of threads could be lowering cache reuse
  • The application is not load balanced
  • There is not enough work for each thread.
 
Things to consider to improve OpenMP scaling are:
  • Examine code for serial/critical sections, eliminate if possible.
  • Reduce number of OpenMP parallel regions to reduce overhead costs. 
  • Try collapsing loops, loop fusion, or loop permutation to give all threads enough work and to optimize thread cache locality.  Use the NOWAIT clause where possible (see the sketch after this list).
  • Pay attention to load imbalance. If needed, try dynamic scheduling or implement own load balance scheme.
  • Experiment with different combinations of MPI tasks and number of threads per task.  Fewer MPI tasks may not saturate inter-node bandwidth.
  • Test different process and thread affinity options.
  • Leave some cores idle on purpose, for memory capacity or bandwidth capacity.
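
A minimal sketch of the NOWAIT clause mentioned above (the arrays are arbitrary; the two loops must be independent for this to be correct):

#define N 100000

static double a[N], b[N], c[N], d[N];

void two_independent_loops(void)
{
    #pragma omp parallel
    {
        #pragma omp for nowait      /* no barrier: threads move straight to the next loop */
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i];

        #pragma omp for             /* implicit barrier at the end of this loop */
        for (int i = 0; i < N; i++)
            c[i] = c[i] + d[i];
    }
}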

MPI vs. OpenMP Scaling Analysis

It is useful to understand your code by creating an MPI vs. OpenMP scaling plot to find the sweet spot for your hybrid MPI/OpenMP program, such as the sample figure below.  That sweet spot can serve as the baseline for further tuning and optimization.

 

In the above figure, each line represents multiple runs using a fixed total number of cores (= #MPI tasks x #OpenMP threads per task).  Scaling may depend on the kernel algorithms and problem sizes.  In this test case, 15 MPI tasks with 8 OpenMP threads per task is optimal.

Nested OpenMP

Sometimes it is beneficial to use nested OpenMP to expose more fine-grained thread parallelism.  Achieving the best possible process and thread affinity is crucial to getting good performance with nested OpenMP, yet it is not straightforward to do.  Please refer to the Nested OpenMP page for how to achieve the best affinity on Edison, Hopper, and Babbage.  A combination of OpenMP environment variables and run time flags is needed for different compilers on different systems.
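
A minimal sketch of nested OpenMP in C (the team sizes are arbitrary):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_nested(1);                        /* enable nested parallelism */
    #pragma omp parallel num_threads(4)       /* outer team */
    {
        int outer = omp_get_thread_num();
        #pragma omp parallel num_threads(3)   /* inner team spawned by each outer thread */
        {
            printf("outer thread %d, inner thread %d\n",
                   outer, omp_get_thread_num());
        }
    }
    return 0;
}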

Tools for OpenMP

Adding OpenMP to Your Program

On Hopper or Edison, Cray Reveal helps to perform scope analysis and suggests OpenMP compiler directives.  It is based on CrayPat performance analysis and uses Cray compiler optimization information.  Keep in mind that it can only be used under the Cray programming environment; the resulting OpenMP-enabled code can then be compiled and run with any compiler.

On Babbage, Intel Advisor tool helps to guide threading design options.

Performance Analysis for OpenMP Programs

Performance tools available on Hopper and Edison are: Cray performance tools, IPM, Allinea MAP and Performance Reports, and TAU.  VTune is also available on Edison.  Debugging tools available on Hopper and Edison are: Allinea DDT, TotalView, LGDB, and Valgrind.

Performance tools available on Babbage are: VTune, Intel Trace Analyzer and Collector, HPC Toolkit, and Allinea MAP.  Debugging tools available on Babbage are: Intel Inspector, Allinea DDT, and GDB.

Case Studies

LBM on TACC Stampede

This case study uses a Lattice Boltzmann Method (LBM) computational fluid dynamics code.  The study was done by Carlos Rosale on TACC Stampede, an Intel Xeon Phi (KNC) system.

The figure below shows the total run time as OpenMP is added in incremental steps.  There are four major functions in this application: Collision, PostCollision, Stream, and PostStream.  In the original serial execution, the run time in Collision is > 2500 sec (plotted below as 200 sec only for better display), which is more than 95% of the total run time.

 

Step 1: Add OpenMP to the hotspot, Collision; a 60X speedup is achieved in Collision.  Step 2: The new bottleneck is now Stream; adding OpenMP to Stream and the other routines achieves an 89X speedup in Stream.  Step 3: Adding vectorization gives an additional 5X speedup in Collision.

Various OpenMP affinity choices are compared: "balanced" provides best performance overall.

 

A couple of lessons learned from this case study are: add OpenMP incrementally; and compare OpenMP affinity.

MFDn on Hopper

MFDn is a nuclear physics code.  This test case was done by H. Metin Aktulga et al. on Hopper.  The key technique used in this study is overlapping communication with computation.  Below is pseudo code for illustration:

!$OMP PARALLEL
      if (my_thread_rank < 1) then
         call MPI_xxx(…)
      else
         ! do some computation
      endif
!$OMP END PARALLEL

This pattern requires an MPI thread support level of at least MPI_THREAD_FUNNELED: while the master (or a single) thread makes MPI calls, the other threads compute.  This is usually very hard to do in practice, because the application code must be separated into parts that can run before the halo information is received and parts that must run after.  The compiler will not do this restructuring automatically.
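
A sketch of this pattern in C with MPI_THREAD_FUNNELED is shown below; exchange_halo(), compute_interior(), and compute_boundary() are hypothetical application routines, shown here as stubs:

#include <mpi.h>
#include <omp.h>

static void exchange_halo(void)    { /* MPI sends/receives of ghost cells */ }
static void compute_interior(void) { /* work that does not need halo data */ }
static void compute_boundary(void) { /* work that needs the received halo */ }

static void step(void)
{
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)
            exchange_halo();          /* only the master thread makes MPI calls */
        else
            compute_interior();       /* the other threads keep computing */
    }                                 /* implicit barrier */
    compute_boundary();               /* can now use the halo data */
}

int main(int argc, char **argv)
{
    int provided;
    /* FUNNELED: only the thread that initialized MPI will make MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    step();
    MPI_Finalize();
    return 0;
}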

The figure below shows the speedup of various hybrid MPI/OpenMP implementations over pure MPI on Hopper, using a total of 12,096 processors.  Hybrid A uses 2016 MPI tasks with 6 OpenMP threads per MPI task.  Hybrid B is Hybrid A plus merging MPI_Reduce and MPI_Scatter into MPI_Reduce_scatter, and merging MPI_Gather and MPI_Bcast into MPI_Allgatherv.  Hybrid C is Hybrid B plus overlapping row-group MPI communication with computation.  Hybrid D is Hybrid C plus overlapping (most) column-group communication with computation.

  

NWChem on Babbage

NWChem is a computational chemistry package.  The following two test cases were done by Hongzhang Shan et al. on Babbage.

CCSD(T)

 

CCSD(T) is one of the methods in the Coupled Cluster (CC) family, and its triples algorithm is the most computationally expensive component.  Due to memory limitations, it can run with only 1 MPI process per MIC card on Babbage.

There are two computational sections: Loop Nests and GetBlock.  OpenMP is added at the outermost loops of the Loop Nests hotspots, which scale well up to 120 threads.  GetBlock is not parallelized with OpenMP.  The figure below shows the baseline OpenMP results.  Total time scales perfectly from 1 to 16 threads, and the best total time is reached at 120 threads.  Hyper-threading hurts performance.  The "balanced" affinity gives the best performance.

 

Optimization is then applied to the GetBlock function.  Techniques applied include: parallelizing the sort; loop unrolling; reordering array indices to match loop indices; merging adjacent loop indices to increase the number of iterations; aligning arrays to 64-byte boundaries; and exploiting OpenMP loop control directives, removing loop dependencies, and providing compiler hints.  The figure below shows the optimized OpenMP run time.  The total speedup over the baseline is 2.3x.
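
The 64-byte alignment mentioned above can be obtained in several ways; a minimal sketch in C (the names and sizes are arbitrary):

#include <stdlib.h>

#define N 1024

/* Static array aligned with a compiler attribute (GCC/Intel/Clang). */
static double a[N] __attribute__((aligned(64)));

/* Heap array aligned with posix_memalign (returns 0 on success). */
double *alloc_aligned(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 64, n * sizeof(double)) != 0)
        return NULL;
    return (double *)p;
}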

 

A few lessons learned from this case study are: add OpenMP at the outermost loop level; explore loop permutation and loop collapse; use reductions; and remove loop dependencies.

Fock Matrix Construction (FMC)


Fock Matrix Construction (and the associated TEXAS integral package) is the core computational operation in quantum chemistry approaches to solving Schrödinger's equation.

The figure below shows the total time spent in each of the top 10 subroutines (which account for 75% of the total CPU time) in the TEXAS integral package, using top-level threading with different numbers of OpenMP threads.  The total number of MPI ranks is fixed at 60; OMP=N means N threads per MPI rank.  The original code uses a shared global task counter for dynamic load balancing among MPI ranks.  OMP=1 has overhead compared to pure MPI, while OMP=2 gives the best overall performance in many routines.

 

Three OpenMP implementations were attempted.  OpenMP #1 uses flat MPI up to 60 MPI processes, then 2, 3, and 4 threads per MPI rank.  OpenMP #2 and #3 are pure OpenMP: OpenMP #2 uses module-level parallelism, and OpenMP #3 uses an OpenMP task implementation.  The pseudo code below illustrates the OpenMP task approach.  To avoid two threads updating the Fock matrix simultaneously, each thread works on a local copy, and a reduction is performed at the end.

 

c$OMP parallel
      myfock() = 0
c$OMP master
      current_task_id = 0
      mytid = omp_get_thread_num()
      my_task = global_task_counter(task_block_size)
      for ijkl = 2*ntype to 2 step -1 do
         for ij = min(ntype, ijkl-1) to max(1, ijkl-ntype) step -1 do
            kl = ijkl - ij
            if (my_task .eq. current_task_id) then
c$OMP task firstprivate(ij,kl) default(shared)
               create_task(ij, kl, ...)
c$OMP end task
               my_task = global_task_counter(task_block_size)
            end if
            current_task_id = current_task_id + 1
         end for
      end for
c$OMP end master
c$OMP taskwait
c$OMP end parallel
      Perform reduction on myfock into the Fock matrix

The OpenMP task model is flexible and powerful.  The task directive defines an explicit task, and the threads in the team share the work from all tasks in the task pool.  Here the master thread creates the tasks, and the taskwait directive ensures that all child tasks created for the current task have finished.  This approach helps to improve load balance.
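
A minimal, self-contained C sketch of the same task/taskwait pattern (do_task() is a hypothetical unit of work):

#include <stdio.h>
#include <omp.h>

static void do_task(int id)                   /* hypothetical unit of work */
{
    printf("task %d executed by thread %d\n", id, omp_get_thread_num());
}

int main(void)
{
    #pragma omp parallel
    {
        #pragma omp master                    /* one thread creates the tasks */
        {
            for (int id = 0; id < 16; id++) {
                #pragma omp task firstprivate(id)
                do_task(id);
            }
            #pragma omp taskwait              /* wait for all child tasks to finish */
        }
    }                                         /* the whole team executes the tasks */
    return 0;
}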

The figure below compares the run time of flat MPI and the three OpenMP implementations described above.  OpenMP #2 (module-level parallelism) saturates at 8 threads because of critical sections and reductions; beyond 60 threads, hyper-threading helps.  OpenMP #3 (task implementation) continues to scale beyond 60 cores and is 1.33x faster than pure MPI with 180 threads.

 

The OpenMP task implementation benefits both the MIC card and the host; see the performance results on the host below.

 

There is another way to present the MPI/OpenMP scaling analysis, illustrated below.  The horizontal axis shows the number of MPI tasks, and the vertical axis shows the total number of MPI tasks times the number of OpenMP threads per task, so each row corresponds to a constant total number of processes.  The sweet spot is either 4 MPI tasks with 60 OpenMP threads per task or 6 MPI tasks with 40 OpenMP threads per task; this is 1.64x faster than the original flat MPI and 22% faster than 60 MPI tasks with 4 OpenMP threads per task.

 

A few lessons learned from this case study are: add OpenMP to the most time-consuming functions; use OpenMP tasks; and find the sweet spot with an MPI/OpenMP scaling analysis.

Summary

Edison and Babbage can be used to help you prepare for Cori with respect to thread scalability in hybrid MPI/OpenMP implementations.  Please keep in mind that MPI performance across nodes or MIC cards on Babbage is not optimal, so it is better to concentrate on optimization within a single MIC card.

Case studies showed effectiveness of OpenMP, and lessons learned are:

  • Add OpenMP incrementally. Conquer one hotspot at a time.
  • Experiment with thread affinity choices. "balanced" is optimal for most applications and is low-hanging fruit.
  • Pay attention to cache locality and load balancing. Adopt loop collapse, loop permutation, etc.
  • Find sweet spot with MPI/OpenMP scaling analysis.
  • Consider OpenMP TASK; this may require a major code rewrite.
  • Consider overlapping communication with computation.
  • Optimizations targeted for one architecture (XE6, XC30, KNC) can help performance for other architectures (Xeon, XC30, KNL).

Related presentation

Explore MPI/OpenMP Scaling on NERSC Systems, Helen He, NERSC OpenMP and Vectorization Training, October 28, 2014.
 

References

OpenMP Resources: https://www.nersc.gov/users/computational-systems/edison/programming/using-openmp/openmp-resources

H. M. Aktulga, C. Yang, E. G. Ng, P. Maris and J. P. Vary, "Improving the Scalability of a Symmetric Iterative Eigensolver for Multi-core Platforms," Concurrency and Computation: Practice and Experience 25 (2013).

Carlos Rosale, "Porting to the Intel Xeon Phi: Opportunities and Challenges," Extreme Scaling Workshop 2013 (XSCALE2013), Boulder, CO, 2013.

Hongzhang Shan, Samuel Williams, Wibe de Jong, Leonid Oliker, "Thread-Level Parallelization and Optimization of NWChem for the Intel MIC Architecture", LBNL Technical Report, October 2014, LBNL 6806E.

Jim Jeffers and James Reinders, “Intel Xeon Phi Coprocessor High-Performance Programming”. Published by Elsevier Inc. 2013.

Intel Xeon Phi Coprocessor Developer Zone: http://software.intel.com/mic-developer

Programming and Compiling for Intel MIC Architecture: http://software.intel.com/en-us/articles/programming-and-compiling-for-intel-many-integrated-core-architecture

Interoperability with OpenMP API: http://software.intel.com/sites/products/documentation/hpc/ics/impi/41/win/Reference_Manual/Interoperability_with_OpenMP.htm

Intel Cluster Studio XE 2015: http://software.intel.com/en-us/intel-cluster-studio-xe/