
Changes for Running on Haswell Nodes

October 31, 2016

Introduction

Process affinity (or CPU pinning) means binding an MPI process to a CPU or a range of CPUs on a node. It is important to spread MPI ranks evenly across the NUMA nodes. Thread affinity forces each process or thread to run on a specific subset of processors, taking advantage of local process state. Correct process and thread affinity is the basis for getting optimal performance.

(Note: if you are moving from Edison to Cori, please check the page Cori Phase 1 for Edison users)

[Figure: Haswell node affinity layout]

After the merge of the Cori Phase 1 Haswell and Phase 2 KNL cabinets, the way jobs must be run on Haswell nodes to get correct process and thread affinity has changed. This change comes from the use of a different task affinity configuration in SLURM, which is required for KNL jobs in order to achieve optimal job launch scalability and to use the high bandwidth memory (HBM) with memory binding.

Summary of Changes After Cori Merge

Change in Usage of "-c" Flag

  • Use of the "-c" flag for srun is recommended for most jobs. Set its value to the number of logical CPUs (hyperthreads) to reserve per MPI task for MPI and hybrid MPI/OpenMP jobs, not to the number of OpenMP threads as was done previously. Your job does not need to use every hyperthread; in fact, by reserving CPUs (hyperthreads) with "-c" you can prevent OpenMP threads from sharing the same physical core.
  • The "-c" flag is optional for fully packed pure MPI jobs. (default behavior is to distribute MPI processes evenly over all physical cores)
  • The "-c" value is independent from the number of OpenMP threads the job uses. (-c is how many hardware resources to reserve, not how many OpenMP threads to create)
  • On Haswell, each node has a total of 32 physical cores, each with 2 hyperthreads, yielding 64 logical CPUs per node. So the value of "-c" should be set to 64/#MPI_per_node.
    • For example, to use 16 MPI tasks per node, the "-c" value should be set to 64/16=4.
    • If #MPI_per_node is not a divisor of 64, the "-c" value should be set to floor(32/#MPI_per_node)*2. For example, to run with 12 MPI tasks per node, the "-c" value should be set to floor(32/12)*2 = 4. (See the sketch after this list.)
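
A minimal sketch of the resulting srun lines on a single Haswell node (the executable name ./myapp.exe is a placeholder; --cpu_bind is covered in the next subsection):

srun -N 1 -n 32 -c 2 ./myapp.exe                    # 32 MPI tasks per node: -c = 64/32 = 2
srun -N 1 -n 16 -c 4 ./myapp.exe                    # 16 MPI tasks per node: -c = 64/16 = 4
srun -N 1 -n 12 -c 4 --cpu_bind=cores ./myapp.exe   # 12 MPI tasks per node: -c = floor(32/12)*2 = 4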

Process Binding for Configurations Not Fully Packing the Node  

  • If the number of MPI tasks per node is not a divisor of 64 on Haswell (meaning the node is not fully packed), you need to add the srun flag "--cpu_bind=cores". Add "--cpu_bind=threads" instead if #MPI_per_node > 32 (see the sketch after this list).
  • Without this flag, processes and threads are bound to all cores instead of getting an optimal binding.
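
As a sketch, using the 12-task, 2-node configuration from the examples below (6 tasks per node, not a divisor of 64), plus an illustrative layout with more than 32 tasks per node (the 48-task count and the ./myapp.exe executable are assumptions for illustration only):

srun -N 2 -n 12 -c 10 --cpu_bind=cores ./myapp.exe    # 6 tasks per node: -c = floor(32/6)*2 = 10
srun -N 1 -n 48 -c 1 --cpu_bind=threads ./myapp.exe   # 48 tasks per node (> 32): one hyperthread per task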

Explicit Request for Haswell Nodes Required

  • It is now required to explicitly request Haswell nodes via "#SBATCH -C haswell" in a batch script or "salloc -C haswell ..." for an interactive batch job.
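
For example:

#SBATCH -C haswell                           # in a batch script
% salloc -N 2 -p debug -t 30:00 -C haswell   # for an interactive batch job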

We also recommend setting the following environment variables for thread binding in OpenMP codes:

  • Use OMP_PROC_BIND (set to "spread" for most jobs) and OMP_PLACES (set to "threads" for most jobs) to fine tune thread affinity.
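
For a hybrid job with 2 OpenMP threads per MPI task, the recommended settings look like this (as used in the examples below):

export OMP_NUM_THREADS=2
export OMP_PROC_BIND=spread
export OMP_PLACES=threads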

Batch Script Examples

Below are examples of a batch script and an interactive batch session showing the original way and the new way to run a hybrid MPI/OpenMP job using 2 nodes, 8 MPI tasks in total, and 2 OpenMP threads per MPI task. On Haswell, this runs with 4 MPI tasks per node and 2 MPI tasks per socket. A second batch script in each case uses 12 MPI tasks (6 per node) to illustrate the situation where #MPI_per_node is not a divisor of 64.

Original Way (Before Cori Merge)

#SBATCH -N 2 
#SBATCH -p debug
#SBATCH -t 30:00

export OMP_NUM_THREADS=2
srun -n 8 -c 2 check-hybrid.intel.cori     # -c value in New Way should be set differently

% salloc -N 2 -p debug -t 30:00
% export OMP_NUM_THREADS=2
% srun -n 8 -c 2 check-hybrid.intel.cori | sort -k4n,6n # -c value in New Way is set differently
Hello from rank 0, thread 0, on nid00102. (core affinity = 0,1,32,33)
Hello from rank 0, thread 1, on nid00102. (core affinity = 0,1,32,33)
Hello from rank 1, thread 0, on nid00102. (core affinity = 16,17,48,49)
Hello from rank 1, thread 1, on nid00102. (core affinity = 16,17,48,49)
Hello from rank 2, thread 0, on nid00102. (core affinity = 2,3,34,35)
Hello from rank 2, thread 1, on nid00102. (core affinity = 2,3,34,35)
Hello from rank 3, thread 0, on nid00102. (core affinity = 18,19,50,51)
Hello from rank 3, thread 1, on nid00102. (core affinity = 18,19,50,51)
Hello from rank 4, thread 0, on nid00114. (core affinity = 0,1,32,33)
Hello from rank 4, thread 1, on nid00114. (core affinity = 0,1,32,33)
...
Hello from rank 7, thread 0, on nid00114. (core affinity = 18,19,50,51)
Hello from rank 7, thread 1, on nid00114. (core affinity = 18,19,50,51)

#SBATCH -N 2
#SBATCH -p debug
#SBATCH -t 30:00

export OMP_NUM_THREADS=2
srun -n 12 -c 2 check-hybrid.intel.cori     # -c value in New Way should be set differently

New Way (After Cori Merge)

#SBATCH -N 2 
#SBATCH -p debug
#SBATCH -t 30:00
#SBATCH -C haswell # this is new

export OMP_NUM_THREADS=2
export OMP_PROC_BIND=spread     # new recommendations for hybrid MPI/OpenMP
export OMP_PLACES=threads
srun -n 8 -c 16 check-hybrid.intel.cori     # -c is set to 64/#MPI_per_node

% salloc -N 2 -p debug -t 30:00 -C haswell   # -C haswell is new
% export OMP_NUM_THREADS=2
# -c is set to 64/#MPI_per_node: each MPI task has 16 logical cores (8 physical cores)
% srun -n 8 -c 16 check-hybrid.intel.cori |sort -k4n,6n
Hello from rank 0, thread 0, on nid00021. (core affinity = 0-7,32-39)
Hello from rank 0, thread 1, on nid00021. (core affinity = 0-7,32-39)
Hello from rank 1, thread 0, on nid00021. (core affinity = 16-23,48-55)
Hello from rank 1, thread 1, on nid00021. (core affinity = 16-23,48-55)
Hello from rank 2, thread 0, on nid00021. (core affinity = 8-15,40-47)
Hello from rank 2, thread 1, on nid00021. (core affinity = 8-15,40-47)
...
Hello from rank 7, thread 0, on nid00022. (core affinity = 24-31,56-63)
Hello from rank 7, thread 1, on nid00022. (core affinity = 24-31,56-63)

% export OMP_PROC_BIND=spread # new recommendations for hybrid MPI/OpenMP
% export OMP_PLACES=threads
% srun -n 8 -c 16 check-hybrid.intel.cori |sort -k4n,6n
Hello from rank 0, thread 0, on nid00021. (core affinity = 0)
Hello from rank 0, thread 1, on nid00021. (core affinity = 4)
Hello from rank 1, thread 0, on nid00021. (core affinity = 16)
Hello from rank 1, thread 1, on nid00021. (core affinity = 20)
Hello from rank 2, thread 0, on nid00021. (core affinity = 8)
Hello from rank 2, thread 1, on nid00021. (core affinity = 12)
...
Hello from rank 5, thread 0, on nid00022. (core affinity = 16)
Hello from rank 5, thread 1, on nid00022. (core affinity = 20)
Hello from rank 6, thread 0, on nid00022. (core affinity = 8)
Hello from rank 6, thread 1, on nid00022. (core affinity = 12)
Hello from rank 7, thread 0, on nid00022. (core affinity = 24)
Hello from rank 7, thread 1, on nid00022. (core affinity = 28) 

#SBATCH -N 2
#SBATCH -p debug
#SBATCH -t 30:00
#SBATCH -C haswell              # this is new

export OMP_NUM_THREADS=2
export OMP_PROC_BIND=spread     # new recommendations for hybrid MPI/OpenMP
export OMP_PLACES=threads
srun -n 12 -c 10 --cpu_bind=cores check-hybrid.intel.cori   # -c is set to floor(32/#MPI_per_node)*2; --cpu_bind=cores is needed when #MPI_per_node is not a divisor of 64.

Please see the updated Example Batch Jobs page for more details and examples.

Job Script Generator

An interactive Job Script Generator is available at MyNERSC to provide some guidance on getting optimal process and thread binding on Edison, Cori Haswell, and Cori KNL.

Methods to Check Process and Thread Affinity

Pre-built binaries from a small test code (xthi.c), in pure MPI and hybrid MPI/OpenMP versions, can be used to check affinity. The binaries are in users' default path and are named check-mpi.<compiler>.<machine> (pure MPI) or check-hybrid.<compiler>.<machine> (hybrid MPI/OpenMP), for example: check-mpi.intel.cori, check-hybrid.intel.cori, check-mpi.gnu.cori, check-hybrid.gnu.cori, etc. Run one of these small test binaries using the same number of nodes, MPI tasks, and OpenMP threads as your application will use, and check whether the desired binding is obtained. The Cori binaries can be used to check both Haswell and KNL, since the binaries are compatible.
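
For example, to check the binding that the 8-task, 2-thread hybrid job from the examples above would get on 2 Haswell nodes:

% salloc -N 2 -p debug -t 30:00 -C haswell
% export OMP_NUM_THREADS=2
% export OMP_PROC_BIND=spread
% export OMP_PLACES=threads
% srun -n 8 -c 16 check-hybrid.intel.cori | sort -k4n,6n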

Alternatively, the srun flag "--cpu_bind=verbose" can be added to report process and thread binding.
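
For example, using the check-hybrid binary described above (the second line combines verbose with an explicit binding type):

% srun -n 8 -c 16 --cpu_bind=verbose check-hybrid.intel.cori          # report the binding applied by SLURM
% srun -n 12 -c 10 --cpu_bind=verbose,cores check-hybrid.intel.cori   # verbose combined with core binding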

Or you can set one of the following runtime environment variables to obtain affinity information as part of the job stdout:

export KMP_AFFINITY=verbose   #(for Intel compiler)
export CRAY_OMP_CHECK_AFFINITY=TRUE   #(for CCE compiler)