
Job Launch Command: aprun

You must use the aprun command to launch jobs on the Hopper compute nodes. Use it for serial, MPI, OpenMP, UPC, and hybrid MPI/OpenMP or hybrid MPI/CAF jobs.

You should view the man pages for aprun on Hopper by typing man aprun.

Basic aprun Options

OPTION    DESCRIPTION
-n    Number of MPI tasks.
-N    (Optional) Number of tasks per Hopper node. Default is 24.
-d    (Optional) Depth, or number of threads, per MPI task. Use this very important option in addition to OMP_NUM_THREADS for OpenMP. Values can be 1-24; the default is 1. For OpenMP, values of 2-6 are recommended.
-S    (Optional) Number of tasks per NUMA node. Values can be 1-6; the default is 6.
-sn   (Optional) Number of NUMA nodes to use per Hopper node. Values can be 1-4; the default is 4.
-ss   (Optional) Demands strict memory containment per NUMA node. The default is the opposite: to allow remote NUMA node memory access.
-cc   (Optional) Controls how tasks are bound to cores and NUMA nodes. The recommended setting for most codes is -cc cpu, which restricts each task to run on a specific core.
-m    (Optional) Memory required per task. Three ways to use it: -m size to request size bytes of memory; -m sizeh to request huge pages of size size; and -m sizehs to require huge pages of size size. See the Using Large Memory Pages section below for more details.

Typical usage:

aprun -n 192  ./a.out arg1 arg2

 or

aprun -n 192 ./a.out < in_file > out_file

where arg1 arg2 ... are optional command line arguments to your program and in_file and out_file are optional redirected files. These examples request that 192 instances of your code be run, 24 on each compute node, using 192 / 24 = 8 nodes.  You must request adequate resources for this in your script; i.e., mppwidth must be at least 192

#PBS -l mppwidth=192

NERSC recommends that mppwidth always be a multiple of 24, even if aprun will use fewer resources. There are alternative ways to request resources using other directives, but we've found that requesting all 24 cores on the desired number of nodes is the simplest and least error-prone way. In other words, multiply the number of nodes by 24 cores per node and set mppwidth to that value.
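For reference, a complete minimal batch script for this 192-task example might look like the following sketch (the walltime, job name, and queue are placeholders; adjust them for your run):

#PBS -l mppwidth=192
#PBS -l walltime=00:10:00
#PBS -N my_job
#PBS -q debug
#PBS -V

cd $PBS_O_WORKDIR

aprun -n 192 ./a.out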

For a complete description of all options to aprun, type man aprun on Hopper.

 

Memory affinity and placing MPI tasks optimally on a Hopper node

Each Hopper node consists of two processors, and each processor consists of two dies. Each die is directly connected to one-quarter of the total Hopper node memory; each die and its memory is called a NUMA node, so there are four NUMA nodes per Hopper node. Although any core in the Hopper node can access all of the memory, there are performance and capacity issues associated with the NUMA nodes. See the figure at the bottom of this page.

The  aprun -S, -sn, and -ss options control how your application uses the NUMA nodes.  If you are running an MPI application and are able to use all 24 cores on the Hopper node then you do not need to use these options.  These options were unimportant for Franklin but they are important on Hopper if you use OpenMP or if you don't fully populate the Hopper nodes.

Examples:

MPI tasks placement for pure MPI codes

Hopper nodes have 24 cores per node. Many users want to run applications on a power-of-2 number of cores (4, 8, 16, 32, etc.). In these cases, some cores on a Hopper node will be left idle. By default, aprun places MPI tasks on cores in a less than optimal way when a pure MPI application requests fewer than 24 cores per node. Here is an example.

#PBS -l mppwidth=24
#PBS -l walltime=00:10:00
#PBS -N my_job
#PBS -q debug
#PBS -V

cd $PBS_O_WORKDIR

aprun -n 4 ./parHelloWorld

This will place all 4 MPI tasks on the same NUMA node, which is not the optimal setup.

In order to place 1 MPI task on each NUMA node, add another flag, -S 1, to the aprun line as shown in the example below.

#PBS -l mppwidth=24
#PBS -l walltime=00:10:00
#PBS -N my_job
#PBS -q debug
#PBS -V

cd $PBS_O_WORKDIR

aprun -n 4 -S 1 ./parHelloWorld

Another example with 64 MPI tasks

#PBS -l mppwidth=72

aprun -n 64 ./a.out


Note that you could use #PBS -l mppwidth=64. However, since 64 is not an even multiple of 24, the batch system will still allocate 3 nodes with a total of 72 cores; aprun will place 24 tasks on each of the first two nodes and 16 on the third.

Run 64 MPI tasks with the nodes under-populated by 1/2 (that is, using only 12 cores per Hopper node).

 

#PBS -l mppwidth=144
aprun -n 64 -N 12 -S 3 ./a.out


You might do this if your code needs more than 1.33 GB of memory per core. The example uses #PBS -l mppwidth=144 because 128 cores are required (64 MPI tasks / 12 tasks used per Hopper node x 24 cores per Hopper node) and the next highest multiple of 24 is 144. Use the -S 3 option to place the 12 MPI tasks per Hopper node on cores from all four NUMA nodes, which ensures the best performance and access to all of the Hopper node memory. This option is needed because the default is for aprun to pack the NUMA nodes, which would put the 12 tasks on just two NUMA nodes.
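To see the memory arithmetic: most Hopper nodes have 32 GB of memory, so a fully packed node gives each of its 24 tasks roughly 32/24 = 1.33 GB, while running only 12 tasks per node leaves roughly 32/12 = 2.67 GB per task (minus the memory used by the operating system and system libraries).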

Note: you are charged for the number of nodes used, not number of cores used, because nodes cannot be used by more than one job simultaneously.

Run 24 MPI tasks with 6 OpenMP threads each.


#PBS -l mppwidth=144
setenv OMP_NUM_THREADS 6
aprun -n 24 -N 4 -S 1 -d 6 -ss ./a.out


This csh example uses the same number of total cores as the first two but uses fewer MPI tasks, with 6 OpenMP threads per task instead. NERSC recommends a depth of no more than 6, meaning no more than 6 OpenMP threads per MPI task. Use -N 4 to place 4 MPI tasks on each Hopper node. Use the -S 1 option to require that the MPI tasks are distributed with no more than one per NUMA node. Also use the -d 6 option to require that one core be reserved for each OpenMP thread, and use the -ss option to require that all memory allocated by the processes be contained within the NUMA node of those processes.
For 144 total cores:

1 MPI task per NUMA node with 6 threads each:

aprun -n 24 -N 4 -S 1 -d 6 -ss ./a.out

2 MPI tasks per NUMA node with 3 threads each:

aprun -n 48 -N 8 -S 2 -d 3 -ss ./a.out

3 MPI tasks per NUMA node with 2 threads each:

aprun -n 72 -N 12 -S 3 -d 2 -ss ./a.out
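Put together, a complete batch script for the first of these layouts (1 MPI task per NUMA node with 6 threads each) might look like the following sketch; the job name, queue, and walltime are placeholders, and csh syntax is assumed for setenv, as in the example above:

#PBS -l mppwidth=144
#PBS -l walltime=00:10:00
#PBS -N my_job
#PBS -q debug
#PBS -V

cd $PBS_O_WORKDIR

setenv OMP_NUM_THREADS 6
aprun -n 24 -N 4 -S 1 -d 6 -ss ./a.out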
To use fewer than 24 cores per node and guarantee that those cores can use all of the Hopper node memory, use the aprun -m size option, where size is given in megabytes and the K, M, and G suffixes can be used (e.g., 16M, 1G). For example, to use a single core with 21 GB of memory:

aprun -m 21G -N 1 -n 1 ./a.out

Note about the aprun -ss option: this option demands strict memory containment per NUMA node (the default is the opposite, allowing remote NUMA node memory access). It should be used for all hybrid MPI/OpenMP applications. However, you cannot use it if you use more than 6 OpenMP threads (which is not recommended anyway) or if you underpopulate the nodes with a pure-MPI application to gain access to more memory.

If you find all of this confusing, you can set an environment variable that will display the mapping between your MPI processes and the cores.

setenv MPICH_CPUMASK_DISPLAY
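If your script uses bash rather than csh, a line like the following should have the same effect (the variable just needs to be set; the particular value shown here is only an example):

export MPICH_CPUMASK_DISPLAY=1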

Here is an example showing how to run 1 MPI task on each NUMA node. The cpumask is shown as a binary string of 24 digits, one for each core on the Hopper node. The position of the "1" indicates the core on which the MPI task runs.

aprun -n 4 -S 1 ./a.out
[PE_0]: cpumask set to 1 cpu on nid00018, cpumask = 000000000000000000000001
[PE_2]: cpumask set to 1 cpu on nid00018, cpumask = 000000000001000000000000
[PE_3]: cpumask set to 1 cpu on nid00018, cpumask = 000001000000000000000000
[PE_1]: cpumask set to 1 cpu on nid00018, cpumask = 000000000000000001000000

 

Using Large Memory Pages

Memory for your application is allocated in increments of pages.  The default page size on Cray systems is 4 KB.  Some applications may benefit by having larger size pages which can improve memory performance  for common access patterns on large data sets.  Virtual memory page sizes larger than the default are called "huge pages".  See the intro_hugepages online man page for more information about this feature.

This feature is implemented at link and run time by loading the craype-hugepagesSIZE module, where SIZE is the desired page size for your application.

These are the available page size modules on Hopper:  craype-hugepages128K, craype-hugepages512K, craype-hugepages2M, craype-hugepages8M, craype-hugepages16M, and craype-hugepages64M.

You should use trial and error to determine whether this benefits your application and, if so, which page size is best.

To use huge pages, do the following.

Load the appropriate hugepages module and link your code.

module load craype-hugepages16M
cc -o my_app my_app.c

To run with huge pages, load the same craype-hugepages module in your batch script:

module load craype-hugepages16M
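In a batch script, the module load goes before the aprun line. A minimal sketch, reusing the 16M page size from above (the executable name my_app and the task count are placeholders):

module load craype-hugepages16M

cd $PBS_O_WORKDIR
aprun -n 24 ./my_app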

To verify that huge pages are being used by your running code, you can set the HUGETLB_VERBOSE environment variable:

bash:

export HUGETLB_VERBOSE=3

csh:

setenv HUGETLB_VERBOSE 3

 

Two Common Error Messages

  1. OOM killer terminated this process.  This error message results when your application exceeds the available memory on the Hopper node, which is 32 GB (1.33 GB/core) on most nodes or 64 GB (2.66 GB/core) on a subset of nodes.  "OOM" stands for Out of Memory.  Remember that Hopper node memory includes memory for your application, for the operating system, and for system libraries such as MPI.  The solution may be to try running with a smaller problem size or running the same problem over more MPI ranks.
  2. Claim exceeds reservation's node-count.  This error message results when the combination of the PBS mppwidth setting and the aprun options (for example, -N, -S, -ss, -sn, -m) requires more nodes than were reserved for you by the qsub command.  In general, "hybrid" applications with width MPI tasks, each of which spawns depth additional threads, require a total of width x depth cores.  The number of nodes required is ceiling(width x depth / 24).
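For example, the hybrid case shown earlier has width = 24 MPI tasks and depth = 6 OpenMP threads per task, so it needs 24 x 6 = 144 cores, or ceiling(144 / 24) = 6 nodes; mppwidth must therefore be at least 144.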

     

[Figure: Hopper 24-core Compute Node schematic]

     


The black lines in the image represent data paths of varying width. The links to the memory banks are the fastest (21 GB/s). The links between the NUMA nodes (and to the Gemini Interconnect) are HyperTransport links that allow tasks and threads running on one NUMA node to access memory allocated on the other NUMA nodes. This happens transparently; however, remote-NUMA-node memory references, such as a process running on NUMA node 0 accessing NUMA node 1 memory, can adversely affect performance. The vertical links in this diagram run at about 19 GB/s, the horizontal links run at about 13 GB/s, and the diagonal links run at about 6 GB/s.

     

More Details

The aprun command is part of Cray's Application Level Placement Scheduler (ALPS). ALPS provides the interface for compute node job placement and execution, forwards the user's environment to the compute nodes, and manages stdin, stdout, and stderr. Several other ALPS utilities are available on Hopper, including apstat (which provides status information about the Hopper compute system and the applications running on it) and xtnodestat (which provides a detailed node allocation view but produces an enormous amount of output).
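For a quick look at the system and at your running applications, both utilities can be run with no arguments, for example:

apstat        # summary of the compute system and running applications
xtnodestat    # per-node allocation map (produces very long output)

The exact output format varies with the installed system software.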