
Process and Thread Affinity

Introduction 

In an OpenMP parallel region, software threads are assigned to hardware threads. Thread affinity binds each process or thread to a specific subset of processors, to take advantage of memory locality and to avoid unnecessary thread migration. Improper process/thread affinity can slow down code performance significantly. Note that some applications run better with fewer software threads than available hardware threads.

Achieving the best process and thread affinity is crucial for getting good performance with nested OpenMP, yet it is not straightforward to do so. A combination of OpenMP environment variables and runtime flags is needed for different compilers and different batch schedulers on different systems.

Use the num_threads clause in the source code to set the thread count for nested regions. For most non-nested regions, use the OMP_NUM_THREADS environment variable for simplicity and flexibility.

Thread Affinity Control in OpenMP 4.0

Using OMP_PLACES Environment Variable

OMP_PLACES defines a list of places to which threads can be pinned; it is used to describe complex thread layouts (a short shell example follows the list below). The possible values are:

    • threads: Each place corresponds to a single hardware thread on the target machine.
    • cores: Each place corresponds to a single core (having one or more hardware threads) on the target machine.
    • sockets: Each place corresponds to a single socket (consisting of one or more cores) on the target machine.
    • A list with explicit place values, such as "{0,1,2,3},{4,5,6,7},{8,9,10,11},{12,13,14,15}" or "{0:4},{4:4},{8:4},{12:4}", can also be used. An interval has the following form:
      • { lower-bound : length : stride }: Thus, specifying {0:3:2} is the same as specifying {0,2,4}.
      • Multiple locations can be included in a place.
      • A list within a place creates a mask on which a thread can "float", i.e., the thread may run on any hardware thread in that place.
      • The list syntax is the same as the place syntax:
        • {0:4:2} = {0,2,4,6}
        • {0:4:2},{1:4:2} = {0,2,4,6},{1,3,5,7}
      • Place lists can be replicated
        • {0,64,128,192}:64  = {0,64,128,192},{1,65,129,193},…
        • {0,64,128,192}:32:2 = {0,64,128,192},{2,66,130,194}....
        • {0:2,64:2,128:2,192:2}:32:2 =  {0,1,64,65,128,129,192,193},{2,3,66,67,130,131,194,195}, …
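As an illustration, the following shell lines (a sketch only; they assume a hypothetical node whose logical CPUs are numbered 0-15) show three equivalent ways to define the same four places of four CPUs each. Each line is an alternative, and only the last one set takes effect:

export OMP_PLACES="{0,1,2,3},{4,5,6,7},{8,9,10,11},{12,13,14,15}"   # explicit CPU lists
export OMP_PLACES="{0:4},{4:4},{8:4},{12:4}"                        # {lower-bound:length} intervals
export OMP_PLACES="{0:4}:4:4"                                       # the place {0,1,2,3} replicated 4 times with stride 4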

Using OMP_PROC_BIND Environment Variable

OMP_PROC_BIND sets the binding of threads to processors (a short example follows the list below). The possible values are:

  • spread: Bind threads as evenly distributed (spread) as possible.
  • close: Bind threads to places close (consecutive) to the place of the master thread.
  • master: Bind threads to the same place as the master thread.
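
For example, the combination used in the batch scripts later on this page binds each OpenMP thread to its own hardware thread and spreads the threads as evenly as possible over the allocated CPUs:

export OMP_NUM_THREADS=16
export OMP_PLACES=threads
export OMP_PROC_BIND=spread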

Some Useful 'srun' Options

To execute a code on the Cori and Edison compute nodes, we need to use the 'srun' command. Some srun options and an explanation of how they affect thread affinity are given below, followed by a short example:

  • -n: The number of tasks or MPI ranks to be used
  • -N: Number of nodes to be used for the execution.
  • --cpu_bind=[{quiet,verbose},]type: Binds tasks to CPUs. Some types include:
    • no[ne]: Do not bind tasks to CPUs (default)               
    • rank:   Automatically  bind  by task rank. Task zero is bound to socket (or core or thread) zero, etc. This is not supported unless the entire node is allocated to the job.               
    • map_cpu:<list>: Bind  by  mapping  CPU  IDs  to  tasks as specified where  <list> is  <cpuid1>,<cpuid2>,...<cpuidN>. This is not supported unless the entire node is allocated to the job.
    • sockets: Automatically generate masks binding  tasks  to  sockets. Only the CPUs on the sockets that have been allocated to the job will be used. If the number of tasks differs from the number of allocated sockets this can result in sub-optimal binding.               
    • cores:  Automatically generate masks binding tasks to cores. If the number of tasks differs from the number of allocated cores this can result in sub-optimal binding.               
    • threads: Automatically generate masks binding  tasks  to  threads. If the number of tasks differs from the number of allocated threads this can result in sub-optimal binding.
  • -c or --cpus-per-task=<ncpus>: Request that <ncpus> CPUs be allocated per process. This may be useful if the job is multithreaded and requires more than one CPU per task for optimal performance. The default is one CPU per process. If -c is specified without -n, as many tasks will be allocated per node as possible while satisfying the -c restriction.
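
As a sketch of how these options combine on a Cori Phase 1 node (32 physical cores, 64 logical CPUs); the executable name ./myapp and the task/thread counts here are illustrative, not taken from the examples below:

export OMP_NUM_THREADS=8
export OMP_PLACES=threads
export OMP_PROC_BIND=spread
# 4 MPI tasks on 1 node; -c 16 reserves 16 logical CPUs (8 cores x 2 hyperthreads) per task,
# and --cpu_bind=cores makes srun generate a per-task core binding mask.
srun -N 1 -n 4 -c 16 --cpu_bind=cores ./myapp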

OpenMP Thread and Process Affinity on NERSC Computational Systems

Thread and Process Affinity on Cori Phase 1

Compute Nodes

Each Cori Phase 1 node has two sockets with 16 cores each (32 physical cores, 64 logical CPUs with hyperthreading). The arrangement of hardware threads and cores is as follows:

[Figure: arrangement of hardware threads and cores on a Cori Phase 1 compute node]

 

Using the Environment Variable OMP_PROC_BIND

Setting the environment variable to OMP_PROC_BIND=spread and using 32 OpenMP threads, we get the following binding:  

[Figure: thread binding for OMP_PROC_BIND=spread with 32 OpenMP threads]

Setting the environment variable to OMP_PROC_BIND=close and using 32 OpenMP threads, we get the following binding:  

[Figure: thread binding for OMP_PROC_BIND=close with 32 OpenMP threads]

Setting the environment variable to OMP_PROC_BIND=spread or OMP_PROC_BIND=close and using 64 OpenMP threads, we get the following binding: 

[Figure: thread binding for OMP_PROC_BIND=spread or close with 64 OpenMP threads]

Here, LC denotes logical core and TID represents Thread ID.
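
The bindings shown above can be reproduced interactively (a sketch, assuming one full Cori Phase 1 node and the affinity binary built in the next subsection):

export OMP_PLACES=threads
export OMP_PROC_BIND=spread        # or close
export OMP_NUM_THREADS=32          # or 64 to use every hardware thread
srun -N 1 -n 1 --cpus-per-task=64 ./affinity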

Example of Thread Affinity using a Sample Code

To see the code, please refer to the file attached to this page. Using OMP_PLACES=threads and the srun option '--cpus-per-task=16' with 16 OpenMP threads per process, we get the same results for code compiled with the Intel and Cray compilers.
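
If the attached code is not at hand, a quick alternative (a hedged suggestion, using the standard Linux taskset utility) is to print each task's allowed-CPU list; note that this shows the mask of the whole task, not the per-thread binding applied by the OpenMP runtime inside that mask:

% srun -n 2 --cpus-per-task=16 bash -c 'echo "rank $SLURM_PROCID: $(taskset -cp $$)"'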

Compiling the Code on Edison and Cori

To compile using Intel compiler:

% cc -qopenmp -o affinity affinity.c

To use Cray Compiler, do the following:

% module swap PrgEnv-intel PrgEnv-cray
% cc -o affinity affinity.c
Sample Batch Script
#!/bin/bash -l

#SBATCH -N 1

export OMP_NUM_THREADS=16
export OMP_PLACES=threads
export OMP_PROC_BIND=spread
export MPICH_MAX_THREAD_SAFETY=multiple

srun -n 1 --cpus-per-task=16 ./affinity | sort -k4n -k8n

To execute the following examples, simply change the value assigned to the OMP_PROC_BIND variable in the above batch script and use it for executing the corresponding example.

Output for 16 threads using 1 MPI task, OMP_PLACES=threads, option '--cpus-per-task=16' and OMP_PROC_BIND=spread 
Level 1: rank= 0, thread level 1=   0, on nid00831. (core affinity = 0)
Level 1: rank= 0, thread level 1=   1, on nid00831. (core affinity = 1)
Level 1: rank= 0, thread level 1=   2, on nid00831. (core affinity = 2)
Level 1: rank= 0, thread level 1=   3, on nid00831. (core affinity = 3)
Level 1: rank= 0, thread level 1=   4, on nid00831. (core affinity = 4)
Level 1: rank= 0, thread level 1=   5, on nid00831. (core affinity = 5)
Level 1: rank= 0, thread level 1=   6, on nid00831. (core affinity = 6)
Level 1: rank= 0, thread level 1=   7, on nid00831. (core affinity = 7)
Level 1: rank= 0, thread level 1=   8, on nid00831. (core affinity = 8)
Level 1: rank= 0, thread level 1=   9, on nid00831. (core affinity = 9)
Level 1: rank= 0, thread level 1=  10, on nid00831. (core affinity = 10)
Level 1: rank= 0, thread level 1=  11, on nid00831. (core affinity = 11)
Level 1: rank= 0, thread level 1=  12, on nid00831. (core affinity = 12)
Level 1: rank= 0, thread level 1=  13, on nid00831. (core affinity = 13)
Level 1: rank= 0, thread level 1=  14, on nid00831. (core affinity = 14)
Level 1: rank= 0, thread level 1=  15, on nid00831. (core affinity = 15)

This results in good affinity because we have one thread allocated to each of the first 16 cores, which are on the same socket.

Output for 16 threads using 1 MPI task, OMP_PLACES=threads, option '--cpus-per-task=16' and OMP_PROC_BIND=close
Level 1: rank= 0, thread level 1=   0, on nid00831. (core affinity = 0)
Level 1: rank= 0, thread level 1=   1, on nid00831. (core affinity = 32)
Level 1: rank= 0, thread level 1=   2, on nid00831. (core affinity = 1)
Level 1: rank= 0, thread level 1=   3, on nid00831. (core affinity = 33)
Level 1: rank= 0, thread level 1=   4, on nid00831. (core affinity = 2)
Level 1: rank= 0, thread level 1=   5, on nid00831. (core affinity = 34)
Level 1: rank= 0, thread level 1=   6, on nid00831. (core affinity = 3)
Level 1: rank= 0, thread level 1=   7, on nid00831. (core affinity = 35)
Level 1: rank= 0, thread level 1=   8, on nid00831. (core affinity = 4)
Level 1: rank= 0, thread level 1=   9, on nid00831. (core affinity = 36)
Level 1: rank= 0, thread level 1=  10, on nid00831. (core affinity = 5)
Level 1: rank= 0, thread level 1=  11, on nid00831. (core affinity = 37)
Level 1: rank= 0, thread level 1=  12, on nid00831. (core affinity = 6)
Level 1: rank= 0, thread level 1=  13, on nid00831. (core affinity = 38)
Level 1: rank= 0, thread level 1=  14, on nid00831. (core affinity = 7)
Level 1: rank= 0, thread level 1=  15, on nid00831. (core affinity = 39)

In this case, affinity is not as good as in the previous example because some cores remain idle while others have two threads running on them.

Output for 16 threads using 2 MPI tasks, OMP_PLACES=threads, option '--cpus-per-task=16' and OMP_PROC_BIND=spread
Level 1: rank= 0, thread level 1=   0, on nid00831. (core affinity = 0)
Level 1: rank= 0, thread level 1=   1, on nid00831. (core affinity = 1)
Level 1: rank= 0, thread level 1=   2, on nid00831. (core affinity = 2)
Level 1: rank= 0, thread level 1=   3, on nid00831. (core affinity = 3)
Level 1: rank= 0, thread level 1=   4, on nid00831. (core affinity = 4)
Level 1: rank= 0, thread level 1=   5, on nid00831. (core affinity = 5)
Level 1: rank= 0, thread level 1=   6, on nid00831. (core affinity = 6)
Level 1: rank= 0, thread level 1=   7, on nid00831. (core affinity = 7)
Level 1: rank= 0, thread level 1=   8, on nid00831. (core affinity = 8)
Level 1: rank= 0, thread level 1=   9, on nid00831. (core affinity = 9)
Level 1: rank= 0, thread level 1=  10, on nid00831. (core affinity = 10)
Level 1: rank= 0, thread level 1=  11, on nid00831. (core affinity = 11)
Level 1: rank= 0, thread level 1=  12, on nid00831. (core affinity = 12)
Level 1: rank= 0, thread level 1=  13, on nid00831. (core affinity = 13)
Level 1: rank= 0, thread level 1=  14, on nid00831. (core affinity = 14)
Level 1: rank= 0, thread level 1=  15, on nid00831. (core affinity = 15)
Level 1: rank= 1, thread level 1=   0, on nid00831. (core affinity = 16)
Level 1: rank= 1, thread level 1=   1, on nid00831. (core affinity = 17)
Level 1: rank= 1, thread level 1=   2, on nid00831. (core affinity = 18)
Level 1: rank= 1, thread level 1=   3, on nid00831. (core affinity = 19)
Level 1: rank= 1, thread level 1=   4, on nid00831. (core affinity = 20)
Level 1: rank= 1, thread level 1=   5, on nid00831. (core affinity = 21)
Level 1: rank= 1, thread level 1=   6, on nid00831. (core affinity = 22)
Level 1: rank= 1, thread level 1=   7, on nid00831. (core affinity = 23)
Level 1: rank= 1, thread level 1=   8, on nid00831. (core affinity = 24)
Level 1: rank= 1, thread level 1=   9, on nid00831. (core affinity = 25)
Level 1: rank= 1, thread level 1=  10, on nid00831. (core affinity = 26)
Level 1: rank= 1, thread level 1=  11, on nid00831. (core affinity = 27)
Level 1: rank= 1, thread level 1=  12, on nid00831. (core affinity = 28)
Level 1: rank= 1, thread level 1=  13, on nid00831. (core affinity = 29)
Level 1: rank= 1, thread level 1=  14, on nid00831. (core affinity = 30)
Level 1: rank= 1, thread level 1=  15, on nid00831. (core affinity = 31)

Here, the affinity is good because we have one thread running on each of the cores.

Output for 16 threads using 2 MPI tasks, OMP_PLACES=threads, option '--cpus-per-task=16' and OMP_PROC_BIND=close
Level 1: rank= 0, thread level 1=   0, on nid00831. (core affinity = 0)
Level 1: rank= 0, thread level 1=   1, on nid00831. (core affinity = 32)
Level 1: rank= 0, thread level 1=   2, on nid00831. (core affinity = 1)
Level 1: rank= 0, thread level 1=   3, on nid00831. (core affinity = 33)
Level 1: rank= 0, thread level 1=   4, on nid00831. (core affinity = 2)
Level 1: rank= 0, thread level 1=   5, on nid00831. (core affinity = 34)
Level 1: rank= 0, thread level 1=   6, on nid00831. (core affinity = 3)
Level 1: rank= 0, thread level 1=   7, on nid00831. (core affinity = 35)
Level 1: rank= 0, thread level 1=   8, on nid00831. (core affinity = 4)
Level 1: rank= 0, thread level 1=   9, on nid00831. (core affinity = 36)
Level 1: rank= 0, thread level 1=  10, on nid00831. (core affinity = 5)
Level 1: rank= 0, thread level 1=  11, on nid00831. (core affinity = 37)
Level 1: rank= 0, thread level 1=  12, on nid00831. (core affinity = 6)
Level 1: rank= 0, thread level 1=  13, on nid00831. (core affinity = 38)
Level 1: rank= 0, thread level 1=  14, on nid00831. (core affinity = 7)
Level 1: rank= 0, thread level 1=  15, on nid00831. (core affinity = 39)
Level 1: rank= 1, thread level 1=   0, on nid00831. (core affinity = 16)
Level 1: rank= 1, thread level 1=   1, on nid00831. (core affinity = 48)
Level 1: rank= 1, thread level 1=   2, on nid00831. (core affinity = 17)
Level 1: rank= 1, thread level 1=   3, on nid00831. (core affinity = 49)
Level 1: rank= 1, thread level 1=   4, on nid00831. (core affinity = 18)
Level 1: rank= 1, thread level 1=   5, on nid00831. (core affinity = 50)
Level 1: rank= 1, thread level 1=   6, on nid00831. (core affinity = 19)
Level 1: rank= 1, thread level 1=   7, on nid00831. (core affinity = 51)
Level 1: rank= 1, thread level 1=   8, on nid00831. (core affinity = 20)
Level 1: rank= 1, thread level 1=   9, on nid00831. (core affinity = 52)
Level 1: rank= 1, thread level 1=  10, on nid00831. (core affinity = 21)
Level 1: rank= 1, thread level 1=  11, on nid00831. (core affinity = 53)
Level 1: rank= 1, thread level 1=  12, on nid00831. (core affinity = 22)
Level 1: rank= 1, thread level 1=  13, on nid00831. (core affinity = 54)
Level 1: rank= 1, thread level 1=  14, on nid00831. (core affinity = 23)
Level 1: rank= 1, thread level 1=  15, on nid00831. (core affinity = 55)

In this case, there are two threads running on 50% of the cores and the other cores remain idle. This is not desirable.

Note: For information on process and thread affinity for Nested OpenMP, please visit the page on Nested OpenMP Process and Thread Affinity on Cori Phase 1.
The results on Edison are similar to those on Cori Phase 1, except that each Edison node has 24 physical cores (48 logical CPUs), compared with 32 physical cores (64 logical CPUs) on Cori Phase 1.

Thread and Process Affinity on KNL White Boxes

Compute Nodes

The login node for the KNL white boxes is carl.nersc.gov. Each compute node is a KNL with 64 cores, each supporting 4 hardware threads (256 logical CPUs per node). The arrangement of hardware threads is shown below:

[Figure: arrangement of hardware threads for a 64-core KNL node]

Default Affinity

The following diagrams represent the default affinity of threads to hardware threads.

[Figure: default thread affinity on KNL (1 of 2)]

[Figure: default thread affinity on KNL (2 of 2)]

 

For some applications (see the sketch after this list):

  • Putting sequential threads on the same core may be more cache friendly.

  • Allowing threads to "float" within a core may be more load-balance friendly.
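
A hedged sketch of both strategies on KNL (4 hardware threads per core); each pair of settings is an alternative:

# Consecutive threads packed onto the same core:
export OMP_PLACES=threads
export OMP_PROC_BIND=close

# One place per core, so each thread may float among that core's 4 hardware threads:
export OMP_PLACES=cores
export OMP_PROC_BIND=spread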

Using the Environment Variable OMP_PROC_BIND

Setting the environment variable to OMP_PROC_BIND=spread and using 64 or 128 OpenMP threads, we get the following bindings: 

[Figure: OMP_PROC_BIND=spread on KNL with 64 or 128 OpenMP threads]

Setting the environment variable to OMP_PROC_BIND=spread and using 192 or 256 OpenMP threads, we get the following bindings:  

[Figure: OMP_PROC_BIND=spread on KNL with 192 or 256 OpenMP threads]

 

Setting the environment variable to OMP_PROC_BIND=close and using 64 or 128 OpenMP threads, we get the following bindings:

[Figure: OMP_PROC_BIND=close on KNL with 64 or 128 OpenMP threads]

Setting the environment variable to OMP_PROC_BIND=close and using 192 or 256 OpenMP threads, we get the following bindings:

[Figure: OMP_PROC_BIND=close on KNL with 192 or 256 OpenMP threads]

 

Using the Environment Variable OMP_PLACES

The following diagram depicts the thread affinity for several values of OMP_PLACES.

[Figure: thread affinity for several OMP_PLACES settings on KNL]

 

Here, a 256-bit mask specifies where a thread may execute. 

Also, a list in a place allows a thread to float on any hardware thread in that place; in this case, the threads float within a core. This can be illustrated by examining the affinity for 64 and 128 threads with OMP_PLACES={0,64,128,192}:64.

[Figure: threads floating within a core on KNL, 64 OpenMP threads]

[Figure: threads floating within a core on KNL, 128 OpenMP threads]
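
To reproduce this floating placement on a 64-core KNL node (a sketch; CPU numbering 0-255 as in the figures above, with the hardware threads of core c numbered c, c+64, c+128, c+192):

export OMP_PLACES="{0,64,128,192}:64"    # 64 places, each covering one full core
export OMP_PROC_BIND=spread
export OMP_NUM_THREADS=64                # one thread per core, free to float among its 4 hardware threads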

 

Using the Environment Variable I_MPI_PIN_DOMAIN

To get good affinity on 68-core KNL nodes, it is recommended to set the I_MPI_PIN_DOMAIN environment variable. This is necessary to get optimal affinity even for a single level of OpenMP mixed with MPI; the requirement is more relaxed on 64-core KNL nodes. I_MPI_PIN_DOMAIN should be set to the number of physical cores per MPI process times the number of hardware threads per core, i.e., the number of logical CPUs per MPI task.

Please see the example below.  The second result is much better with I_MPI_PIN_DOMAIN. The code is available in the file attached to this page.
Affinity when I_MPI_PIN_DOMAIN is not set
% mpirun -n 8 -env OMP_PROC_BIND spread -env OMP_PLACES threads -env OMP_NUM_THREADS 4 ./xthi |sort -k4n,6n
Hello from rank 0, thread 0, on ekm118. (core affinity = 0)
Hello from rank 0, thread 1, on ekm118. (core affinity = 70)
Hello from rank 0, thread 2, on ekm118. (core affinity = 72)
Hello from rank 0, thread 3, on ekm118. (core affinity = 142)
Hello from rank 1, thread 0, on ekm118. (core affinity = 144)
Hello from rank 1, thread 1, on ekm118. (core affinity = 214)
Hello from rank 1, thread 2, on ekm118. (core affinity = 216)
Hello from rank 1, thread 3, on ekm118. (core affinity = 15)
Hello from rank 2, thread 0, on ekm118. (core affinity = 17)
Hello from rank 2, thread 1, on ekm118. (core affinity = 87)
Hello from rank 2, thread 2, on ekm118. (core affinity = 89)
Hello from rank 2, thread 3, on ekm118. (core affinity = 159)
Hello from rank 3, thread 0, on ekm118. (core affinity = 161)
Hello from rank 3, thread 1, on ekm118. (core affinity = 231)
Hello from rank 3, thread 2, on ekm118. (core affinity = 233)
Hello from rank 3, thread 3, on ekm118. (core affinity = 32)
Hello from rank 4, thread 0, on ekm118. (core affinity = 34)
Hello from rank 4, thread 1, on ekm118. (core affinity = 104)
Hello from rank 4, thread 2, on ekm118. (core affinity = 106)
Hello from rank 4, thread 3, on ekm118. (core affinity = 176)
Hello from rank 5, thread 0, on ekm118. (core affinity = 178)
Hello from rank 5, thread 1, on ekm118. (core affinity = 248)
Hello from rank 5, thread 2, on ekm118. (core affinity = 250)
Hello from rank 5, thread 3, on ekm118. (core affinity = 49)
Hello from rank 6, thread 0, on ekm118. (core affinity = 51)
Hello from rank 6, thread 1, on ekm118. (core affinity = 121)
Hello from rank 6, thread 2, on ekm118. (core affinity = 123)
Hello from rank 6, thread 3, on ekm118. (core affinity = 193)
Hello from rank 7, thread 0, on ekm118. (core affinity = 195)
Hello from rank 7, thread 1, on ekm118. (core affinity = 265)
Hello from rank 7, thread 2, on ekm118. (core affinity = 267)
Hello from rank 7, thread 3, on ekm118. (core affinity = 66)

[Figure: binding of ranks (R) and threads (T) without I_MPI_PIN_DOMAIN]

Affinity when I_MPI_PIN_DOMAIN is set
% mpirun -n 8 -env OMP_PROC_BIND spread -env OMP_PLACES threads -env OMP_NUM_THREADS 4 -env I_MPI_PIN_DOMAIN 32 ./xthi |sort -k4n,6n
Hello from rank 0, thread 0, on ekm118. (core affinity = 0)
Hello from rank 0, thread 1, on ekm118. (core affinity = 2)
Hello from rank 0, thread 2, on ekm118. (core affinity = 4)
Hello from rank 0, thread 3, on ekm118. (core affinity = 6)
Hello from rank 1, thread 0, on ekm118. (core affinity = 8)
Hello from rank 1, thread 1, on ekm118. (core affinity = 10)
Hello from rank 1, thread 2, on ekm118. (core affinity = 12)
Hello from rank 1, thread 3, on ekm118. (core affinity = 14)
Hello from rank 2, thread 0, on ekm118. (core affinity = 16)
Hello from rank 2, thread 1, on ekm118. (core affinity = 18)
Hello from rank 2, thread 2, on ekm118. (core affinity = 20)
Hello from rank 2, thread 3, on ekm118. (core affinity = 22)
Hello from rank 3, thread 0, on ekm118. (core affinity = 24)
Hello from rank 3, thread 1, on ekm118. (core affinity = 26)
Hello from rank 3, thread 2, on ekm118. (core affinity = 28)
Hello from rank 3, thread 3, on ekm118. (core affinity = 30)
Hello from rank 4, thread 0, on ekm118. (core affinity = 32)
Hello from rank 4, thread 1, on ekm118. (core affinity = 34)
Hello from rank 4, thread 2, on ekm118. (core affinity = 36)
Hello from rank 4, thread 3, on ekm118. (core affinity = 38)
Hello from rank 5, thread 0, on ekm118. (core affinity = 40)
Hello from rank 5, thread 1, on ekm118. (core affinity = 42)
Hello from rank 5, thread 2, on ekm118. (core affinity = 44)
Hello from rank 5, thread 3, on ekm118. (core affinity = 46)
Hello from rank 6, thread 0, on ekm118. (core affinity = 48)
Hello from rank 6, thread 1, on ekm118. (core affinity = 50)
Hello from rank 6, thread 2, on ekm118. (core affinity = 52)
Hello from rank 6, thread 3, on ekm118. (core affinity = 54)
Hello from rank 7, thread 0, on ekm118. (core affinity = 56)
Hello from rank 7, thread 1, on ekm118. (core affinity = 58)
Hello from rank 7, thread 2, on ekm118. (core affinity = 60)
Hello from rank 7, thread 3, on ekm118. (core affinity = 62)
[Figure: binding of ranks (R) and threads (T) with I_MPI_PIN_DOMAIN=32]

This affinity is much better than the previous one: each rank's threads are placed on separate cores in a compact, non-overlapping block.

Example of Thread Affinity using a Sample Code

To see the code, please refer to the file attached to this page.

The process and thread affinity are similar on the quadcache and quadflat partitions; in terms of memory, however, quadflat exposes two NUMA domains (DDR plus MCDRAM as a separate node), while quadcache exposes a single NUMA domain (MCDRAM is used as a cache).
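
To confirm the memory layout of the node you land on, the standard numactl utility (assumed to be available on these nodes) lists the NUMA domains:

% numactl --hardware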

Compiling the Code
% module purge
% module use /usr/common/software/carl_modulefiles

% module load impi
% module load intel
% module load memkind
% mpiicc -xMIC-AVX512 -qopenmp -o affinity_carl affinity.c
Sample Batch Script
#!/bin/bash -l
 
#SBATCH -p quadcache

export MPICH_MAX_THREAD_SAFETY=multiple
export OMP_NUM_THREADS=16
export OMP_PLACES=threads
export OMP_PROC_BIND=spread
mpirun -np 1 ./affinity_carl | sort -k4n -k8n

To execute the following examples, simply change the value assigned to the OMP_PROC_BIND variable in the above batch script and use it for executing the corresponding example.

Output for 16 threads using 1 MPI task, OMP_PLACES=threads and OMP_PROC_BIND=spread
Level 1: rank= 0, thread level 1=  0, on cc08. (core affinity = 0)
Level 1: rank= 0, thread level 1=  1, on cc08. (core affinity = 4)
Level 1: rank= 0, thread level 1=  2, on cc08. (core affinity = 8)
Level 1: rank= 0, thread level 1=  3, on cc08. (core affinity = 12)
Level 1: rank= 0, thread level 1=  4, on cc08. (core affinity = 16)
Level 1: rank= 0, thread level 1=  5, on cc08. (core affinity = 20)
Level 1: rank= 0, thread level 1=  6, on cc08. (core affinity = 24)
Level 1: rank= 0, thread level 1=  7, on cc08. (core affinity = 28)
Level 1: rank= 0, thread level 1=  8, on cc08. (core affinity = 32)
Level 1: rank= 0, thread level 1=  9, on cc08. (core affinity = 36)
Level 1: rank= 0, thread level 1= 10, on cc08. (core affinity = 40)
Level 1: rank= 0, thread level 1= 11, on cc08. (core affinity = 44)
Level 1: rank= 0, thread level 1= 12, on cc08. (core affinity = 48)
Level 1: rank= 0, thread level 1= 13, on cc08. (core affinity = 52)
Level 1: rank= 0, thread level 1= 14, on cc08. (core affinity = 56)
Level 1: rank= 0, thread level 1= 15, on cc08. (core affinity = 60)

Here, the threads are spread out evenly over the cores; although there are not enough threads to use all the cores, no core has more than one thread running on it, so this affinity is good.

Output for 16 threads using 1 MPI task, OMP_PLACES=threads and OMP_PROC_BIND=close
Level 1: rank= 0, thread level 1=  0, on cc08. (core affinity = 0)
Level 1: rank= 0, thread level 1=  1, on cc08. (core affinity = 64)
Level 1: rank= 0, thread level 1=  2, on cc08. (core affinity = 128)
Level 1: rank= 0, thread level 1=  3, on cc08. (core affinity = 192)
Level 1: rank= 0, thread level 1=  4, on cc08. (core affinity = 1)
Level 1: rank= 0, thread level 1=  5, on cc08. (core affinity = 65)
Level 1: rank= 0, thread level 1=  6, on cc08. (core affinity = 129)
Level 1: rank= 0, thread level 1=  7, on cc08. (core affinity = 193)
Level 1: rank= 0, thread level 1=  8, on cc08. (core affinity = 2)
Level 1: rank= 0, thread level 1=  9, on cc08. (core affinity = 66)
Level 1: rank= 0, thread level 1= 10, on cc08. (core affinity = 130)
Level 1: rank= 0, thread level 1= 11, on cc08. (core affinity = 194)
Level 1: rank= 0, thread level 1= 12, on cc08. (core affinity = 3)
Level 1: rank= 0, thread level 1= 13, on cc08. (core affinity = 67)
Level 1: rank= 0, thread level 1= 14, on cc08. (core affinity = 131)
Level 1: rank= 0, thread level 1= 15, on cc08. (core affinity = 195)

Here, only the first 4 cores are used and each core has 4 threads running on it. This binding is not good.

Output for 16 threads using 2 MPI tasks, OMP_PLACES=threads and OMP_PROC_BIND=spread
Level 1: rank= 0, thread level 1=  0, on cc01. (core affinity = 0)
Level 1: rank= 0, thread level 1=  1, on cc01. (core affinity = 2)
Level 1: rank= 0, thread level 1=  2, on cc01. (core affinity = 4)
Level 1: rank= 0, thread level 1=  3, on cc01. (core affinity = 6)
Level 1: rank= 0, thread level 1=  4, on cc01. (core affinity = 8)
Level 1: rank= 0, thread level 1=  5, on cc01. (core affinity = 10)
Level 1: rank= 0, thread level 1=  6, on cc01. (core affinity = 12)
Level 1: rank= 0, thread level 1=  7, on cc01. (core affinity = 14)
Level 1: rank= 0, thread level 1=  8, on cc01. (core affinity = 16)
Level 1: rank= 0, thread level 1=  9, on cc01. (core affinity = 18)
Level 1: rank= 0, thread level 1= 10, on cc01. (core affinity = 20)
Level 1: rank= 0, thread level 1= 11, on cc01. (core affinity = 22)
Level 1: rank= 0, thread level 1= 12, on cc01. (core affinity = 24)
Level 1: rank= 0, thread level 1= 13, on cc01. (core affinity = 26)
Level 1: rank= 0, thread level 1= 14, on cc01. (core affinity = 28)
Level 1: rank= 0, thread level 1= 15, on cc01. (core affinity = 30)
Level 1: rank= 1, thread level 1=  0, on cc01. (core affinity = 52)
Level 1: rank= 1, thread level 1=  1, on cc01. (core affinity = 54)
Level 1: rank= 1, thread level 1=  2, on cc01. (core affinity = 56)
Level 1: rank= 1, thread level 1=  3, on cc01. (core affinity = 58)
Level 1: rank= 1, thread level 1=  4, on cc01. (core affinity = 60)
Level 1: rank= 1, thread level 1=  5, on cc01. (core affinity = 62)
Level 1: rank= 1, thread level 1=  6, on cc01. (core affinity = 32)
Level 1: rank= 1, thread level 1=  7, on cc01. (core affinity = 34)
Level 1: rank= 1, thread level 1=  8, on cc01. (core affinity = 36)
Level 1: rank= 1, thread level 1=  9, on cc01. (core affinity = 38)
Level 1: rank= 1, thread level 1= 10, on cc01. (core affinity = 40)
Level 1: rank= 1, thread level 1= 11, on cc01. (core affinity = 42)
Level 1: rank= 1, thread level 1= 12, on cc01. (core affinity = 44)
Level 1: rank= 1, thread level 1= 13, on cc01. (core affinity = 46)
Level 1: rank= 1, thread level 1= 14, on cc01. (core affinity = 48)
Level 1: rank= 1, thread level 1= 15, on cc01. (core affinity = 50)

In this case, we have one thread each on 32 cores. This is good binding and we do not have more than one thread on any core.

Output for 16 threads using 2 MPI tasks, OMP_PLACES=threads and OMP_PROC_BIND=close
Level 1: rank= 0, thread level 1=  0, on cc01. (core affinity = 0)
Level 1: rank= 0, thread level 1=  1, on cc01. (core affinity = 64)
Level 1: rank= 0, thread level 1=  2, on cc01. (core affinity = 128)
Level 1: rank= 0, thread level 1=  3, on cc01. (core affinity = 192)
Level 1: rank= 0, thread level 1=  4, on cc01. (core affinity = 1)
Level 1: rank= 0, thread level 1=  5, on cc01. (core affinity = 65)
Level 1: rank= 0, thread level 1=  6, on cc01. (core affinity = 129)
Level 1: rank= 0, thread level 1=  7, on cc01. (core affinity = 193)
Level 1: rank= 0, thread level 1=  8, on cc01. (core affinity = 2)
Level 1: rank= 0, thread level 1=  9, on cc01. (core affinity = 66)
Level 1: rank= 0, thread level 1= 10, on cc01. (core affinity = 130)
Level 1: rank= 0, thread level 1= 11, on cc01. (core affinity = 194)
Level 1: rank= 0, thread level 1= 12, on cc01. (core affinity = 3)
Level 1: rank= 0, thread level 1= 13, on cc01. (core affinity = 67)
Level 1: rank= 0, thread level 1= 14, on cc01. (core affinity = 131)
Level 1: rank= 0, thread level 1= 15, on cc01. (core affinity = 195)
Level 1: rank= 1, thread level 1=  0, on cc01. (core affinity = 52)
Level 1: rank= 1, thread level 1=  1, on cc01. (core affinity = 116)
Level 1: rank= 1, thread level 1=  2, on cc01. (core affinity = 180)
Level 1: rank= 1, thread level 1=  3, on cc01. (core affinity = 244)
Level 1: rank= 1, thread level 1=  4, on cc01. (core affinity = 53)
Level 1: rank= 1, thread level 1=  5, on cc01. (core affinity = 117)
Level 1: rank= 1, thread level 1=  6, on cc01. (core affinity = 181)
Level 1: rank= 1, thread level 1=  7, on cc01. (core affinity = 245)
Level 1: rank= 1, thread level 1=  8, on cc01. (core affinity = 54)
Level 1: rank= 1, thread level 1=  9, on cc01. (core affinity = 118)
Level 1: rank= 1, thread level 1= 10, on cc01. (core affinity = 182)
Level 1: rank= 1, thread level 1= 11, on cc01. (core affinity = 246)
Level 1: rank= 1, thread level 1= 12, on cc01. (core affinity = 55)
Level 1: rank= 1, thread level 1= 13, on cc01. (core affinity = 119)
Level 1: rank= 1, thread level 1= 14, on cc01. (core affinity = 183)
Level 1: rank= 1, thread level 1= 15, on cc01. (core affinity = 247)

In this case, 4 threads run on each of only 8 cores (4 cores per rank) while the remaining cores stay idle. This is poor affinity.

Note: For information on process and thread affinity for Nested OpenMP, please visit the page on  Nested OpenMP Process and Thread Affinity on KNL White Boxes. 
 
