
Using OpenMP with MPI

Overview

Adding OpenMP threading to an MPI code is an efficient way to run on multicore processors and nodes like those on Hopper. Because OpenMP operates in a shared memory space, it can reduce the memory overhead associated with MPI tasks and reduce the need for data replicated across tasks. A collection of OpenMP resources, tutorials, etc. can be found at OpenMP Resources.

Codes typically use one OpenMP thread per compute core. Therefore the maximum number of threads per node on Hopper is 24. OpenMP performance can be very dependent on the underlying architecture. On Hopper, NERSC has found a "sweet spot" for most codes at 4 MPI tasks per node and 6 OpenMP threads per MPI task. See more information at OpenMP Performance Implications on Hopper.

Compiling to use OpenMP

OpenMP is supported in all of the programming environments available on Cray systems; however, the command line option for enabling it differs between compiler suites.

Compiler Suite                  Programming Environment module name   Command line option for OpenMP
Portland Group                  PrgEnv-pgi                            -mp
Pathscale                       PrgEnv-pathscale                      -mp
Cray Compilers                  PrgEnv-cray                           none needed
Intel Compilers (Hopper only)   PrgEnv-intel                          -openmp
GNU Compilers                   PrgEnv-gnu                            -fopenmp
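
For example, a hybrid code is compiled through the standard Cray compiler wrappers (ftn for Fortran, cc for C) with the flag that matches the currently loaded programming environment. The source file name hybrid.f90 below is only a placeholder:

ftn -mp -o hybrid.x hybrid.f90        # with PrgEnv-pgi or PrgEnv-pathscale loaded
ftn -fopenmp -o hybrid.x hybrid.f90   # with PrgEnv-gnu loaded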

Running using OpenMP

You must tell the system how you want to run your job by answering the following questions:

  1. How many nodes do you want to use?
  2. How many total MPI tasks do you want?
  3. How many MPI tasks do you want to run on each node?
  4. How many OpenMP threads do you want per MPI task?
  5. How do you want the MPI tasks distributed among the "NUMA nodes" on Hopper?
  6. Do you want your OpenMP threads to have access only to the fastest memory pool (closest to it) on the node?

1. How many nodes?

In your batch script, use the directive:

#PBS -l mppwidth=cores_per_node*number_of_nodes

The value of cores_per_node is 24 for Hopper.
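
For example, a 4-node job on Hopper requests 4*24 = 96 cores:

#PBS -l mppwidth=96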

If you are running an interactive parallel job, use the following command to start your parallel environment:

qsub -I -q interactive -l mppwidth=cores_per_node*number_of_nodes
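
For example, the following requests a 2-node (48-core) interactive session on Hopper:

qsub -I -q interactive -l mppwidth=48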

2. How many total MPI tasks? (-n option to aprun)

In your batch script, or in your interactive parallel environment, launch your parallel job with "aprun" and use the -n option, followed by the total number of MPI tasks.

3. How many MPI tasks per compute node? (-N option to aprun)

Use the -N option to aprun to specify the number of MPI tasks to run on each node.

4. How many OpenMP threads per MPI task? (-d option to aprun, OMP_NUM_THREADS variable)

Another aprun option, -d, sets the number of OpenMP threads per MPI task. You must also set the environment variable OMP_NUM_THREADS to this same value. Make sure that this value multiplied by the value for -N does not exceed 24 for Hopper.
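
For example, a bash batch script could launch 16 MPI tasks at 4 tasks per node with 6 OpenMP threads each (4 tasks x 6 threads = 24 cores per node) as follows:

export OMP_NUM_THREADS=6
aprun -n 16 -N 4 -d 6 ./hybrid.x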

5. MPI task distribution on the node (-S option to aprun, optional)

You'll get better performance if you distribute your MPI tasks among the 4 "NUMA nodes" that constitute each Hopper compute node. You do that with the -S option to aprun, which tells the system how many MPI tasks to run on each "NUMA node." Valid values are 1-6 (an example combining -S with the -ss option appears after item 6 below).

6. Memory affinity and performance (-ss option to aprun, optional)

Your code may perform better if each OpenMP thread is limited to using the memory closest to it on the node. The -ss option to aprun restricts each thread to the memory nearest its NUMA node; each thread is then also limited to 1/4 of the memory on the node. (The default behavior without the -ss option is to use local memory first if possible.)
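
Continuing the example above, the same job with one MPI task per NUMA node and strict memory affinity is launched as:

export OMP_NUM_THREADS=6
aprun -n 16 -N 4 -S 1 -d 6 -ss ./hybrid.x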

7. Was the code compiled with the Pathscale Compiler?

You must set the run time environment variable PSC_OMP_AFFINITY to FALSE before the aprun command.
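
For example, in a bash batch script:

export PSC_OMP_AFFINITY=FALSE
aprun -n 16 -N 4 -S 1 -d 6 ./hybrid.x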

8. Was the code compiled with the Intel Compiler?

You must add one of these two arguments to aprun: "-cc none" or "-cc numa_node".
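
For example:

aprun -n 16 -d 6 -N 4 -cc numa_node ./hybrid.x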

Putting it all together

Here is a simple batch script for running an MPI/OpenMP code named "hybrid.x" using 4 nodes and the NERSC-recommended 4 MPI tasks per node and 6 OpenMP threads per MPI task. One MPI task is run per NUMA node and memory is limited to that closest to each OpenMP thread. This example uses the bash shell.

#!/bin/bash -l 
#PBS -q regular
#PBS -l mppwidth=96
#PBS -l walltime=1:00:00

cd $PBS_O_WORKDIR

export OMP_NUM_THREADS=6
# Pathscale compiler: export PSC_OMP_AFFINITY=FALSE

aprun -n 16 -d 6 -N 4 -S 1 -ss ./hybrid.x
# Intel compiled program: aprun -n 16 -d 6 -N 4 -cc numa_node ./hybrid.x
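
If the script above is saved in a file, for example run_hybrid.pbs (a name chosen here only for illustration), it is submitted with:

qsub run_hybrid.pbs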

Sample Batch Scripts

The following example is based on a code at the OpenMP.org site. The code solves a 2D Poisson equation using a finite difference discretization with a Jacobi iterative method as the linear solver.

Fortran Source Code

Below is a sample batch script to run the resulting executable using 1 compute node, with 4 MPI tasks per node, and 6 OpenMP threads per MPI task on Hopper:

#!/bin/csh
#PBS -N jacobi
#PBS -q debug
#PBS -l mppwidth=24
#PBS -l walltime=00:10:00
#PBS -e jacobijob.out
#PBS -j eo

cd $PBS_O_WORKDIR

setenv OMP_NUM_THREADS 6
aprun -n 4 -N 4 -S 1 -d 6 ./jacobi_mpiomp
 
The following example shows how to run on 2 Hopper nodes, using a total of 8 MPI tasks and 6 OpenMP threads per MPI task:

#!/bin/csh
#PBS -N jacobi
#PBS -q debug
#PBS -l mppwidth=48
#PBS -l walltime=00:10:00
#PBS -e jacobijob.out
#PBS -j eo

cd $PBS_O_WORKDIR

setenv OMP_NUM_THREADS 6
aprun -n 8 -N 4 -S 1 -d 6 ./jacobi_mpiomp

Supported Thread Levels

MPI defines four "levels" of thread safety. The default thread support level for all four programming environments on Hopper (PGI, Pathscale, Cray, and GNU) is MPI_THREAD_SINGLE, where only one thread of execution exists. The maximum thread support level is returned by the MPI_Init_thread() call in the "provided" argument.

You can set the environment variable MPICH_MAX_THREAD_SAFETY to one of the values below to increase the supported thread safety level.

MPICH_MAX_THREAD_SAFETY value   Supported Thread Level
not set                         MPI_THREAD_SINGLE
single                          MPI_THREAD_SINGLE
funneled                        MPI_THREAD_FUNNELED
serialized                      MPI_THREAD_SERIALIZED
multiple                        MPI_THREAD_MULTIPLE
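
For example, a bash batch script for an application whose MPI calls are made only by the master thread within OpenMP regions (MPI_THREAD_FUNNELED) could set, as a sketch:

export MPICH_MAX_THREAD_SAFETY=funneled
export OMP_NUM_THREADS=6
aprun -n 16 -N 4 -S 1 -d 6 ./hybrid.x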

Nested OpenMP

Nested OpenMP is supported on Babbage. Please see more information on example code and thread affinity control settings here.
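
As a generic sketch using only standard OpenMP environment variables (not the Babbage-specific thread affinity settings described on the linked page), nested parallelism can be enabled in a batch script as follows:

export OMP_NESTED=TRUE            # allow nested parallel regions
export OMP_MAX_ACTIVE_LEVELS=2    # limit active nesting to two levels
export OMP_NUM_THREADS=4,6        # threads at the outer and inner nesting levels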

Further Related Information

OpenMP Performance Implications on Hopper

Hopper Multi-Core FAQs