
Job Launch Command: aprun

Overview

You must use the aprun command to launch jobs on the Edison compute nodes. Use it for serial, MPI, OpenMP, UPC, and hybrid MPI/OpenMP or hybrid MPI/CAF jobs.

You should view the man page for aprun on Edison.

Basic aprun Options

OPTION  DESCRIPTION
-n      Number of MPI tasks.
-N      (Optional) Number of MPI tasks per Edison node. The default is 16.
-d      (Optional) Depth, or number of threads, per MPI task. Use this very important option in addition to OMP_NUM_THREADS for OpenMP. Values can be 1-16; the default is 1. For OpenMP, values of 2-8 are recommended.
-S      (Optional) Number of tasks per NUMA node. Values can be 1-8; the default is 8.
-sn     (Optional) Number of NUMA nodes to use per Edison node. Values can be 1-2; the default is 2.
-ss     (Optional) Demands strict memory containment per NUMA node. The default is the opposite: to allow remote NUMA node memory access.
-cc     (Optional) Controls how tasks are bound to cores and NUMA nodes. The default setting on Edison, -cc cpu, restricts each task to run on a specific core.
-m      (Optional) Memory required per task. There are three ways to use it: -m size to request size bytes of memory; -m sizeh to request huge pages of size size; and -m sizehs to require huge pages of size size. See the large pages section below for more details.
-j      (Optional) Use this option only to enable HyperThreading (HT), with -j 2. The default is no HyperThreading (-j 1). Using -j 2 must be done with the Torque directive #PBS -l mppnppn=32 and can be used with the aprun option -N 32.

Typical usage:

aprun -n 256  ./a.out arg1 arg2

 or

aprun -n 256 ./a.out < in_file > out_file

where arg1 arg2 ... are optional command line arguments to your program and in_file and out_file are optional redirected input and output files. These examples request that 256 processing elements be assigned to run your code. Because there is no -N option, there will be 16 processing elements on each compute node, using 256/16 = 16 nodes. You must request adequate resources for this in your batch script; i.e., mppwidth must be at least 256:

#PBS -l mppwidth=256
#PBS -l walltime=00:10:00
#PBS -N my_job
#PBS -q debug
#PBS -V 

cd $PBS_O_WORKDIR

aprun -n 256 ./a.out

NERSC recommends that mppwidth always be a multiple of 16, even if aprun will use fewer resources. (For example, if you need only 20 cores, request 32.) There are alternative ways to request resources using other directives, but we have found that requesting all 16 cores on the desired number of nodes is the simplest and least error-prone approach. In other words, multiply the number of nodes by 16 cores per node and set mppwidth to that value. This discussion assumes no HyperThreading.
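
For instance, a hypothetical 20-task job would request two full nodes while aprun launches only the 20 tasks actually needed (a.out and the walltime are placeholders):

#PBS -l mppwidth=32
#PBS -l walltime=00:10:00

cd $PBS_O_WORKDIR

aprun -n 20 ./a.out

With the default of 16 tasks per node, aprun places 16 tasks on the first node and 4 on the second.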

For a complete description of all options to aprun, type man aprun on Edison.  

Memory affinity and placing MPI tasks optimally on an Edison node

Each Edison node consists of two Intel Sandy Bridge processors, each with eight cores. Each processor is directly connected to one-half of the total Edison node memory. Each processor together with its local memory is called a NUMA node or NUMA domain, so there are two NUMA nodes per Edison node; see the figure below. Although any core in an Edison node can access all of the memory in the node, there may be performance and capacity issues associated with the NUMA nodes.

The aprun -S, -sn, -sl, -cc, and -ss options control how your application uses the NUMA nodes. If you are running an MPI-only application and are able to use all 16 cores on each Edison node, you generally do not need these options; the default settings on Edison work well for most MPI codes. These options become important if you use OpenMP or if you do not fully populate the Edison nodes.

[Figure: An Edison compute node, showing two NUMA nodes, each consisting of an 8-core Sandy Bridge processor and its directly attached half of the node memory.]

Run 64 MPI tasks with the nodes under-populated by 1/2 (that is, using only 8 cores per Edison node)

#PBS -l mppwidth=128
aprun -n 64 -N 8 -S 4 ./a.out

Say you want to run 64 MPI tasks, but each task needs more than the default 4 GB of memory per core. The way to do this (see above) is to use twice as many Edison nodes, with only half of the cores (8) active on each node. You will need to request 128 cores (64 MPI tasks / 8 tasks per Edison node x 16 cores per Edison node). Use the -S 4 option to spread the 8 MPI tasks on each Edison node across both NUMA nodes, ensuring the best performance and access to all of the Edison node memory. This option is needed because the default is for aprun to pack the NUMA nodes, which would place all the tasks on just one NUMA node.

Note: you are charged for the number of nodes used, not number of cores used, because nodes cannot be used by more than one job simultaneously.

Run 4 MPI tasks, 2 per node, with 8 OpenMP threads each

#PBS -l mppwidth=32
setenv OMP_NUM_THREADS 8
setenv KMP_AFFINITY compact
aprun -n 4 -N 2 -S 1 -d 8 -cc numa_node ./a.out


This example uses two Edison nodes with 4 MPI tasks, two per node, and 8 OpenMP threads per task. Use the -S 1 option to require that the MPI tasks are distributed with no more than one per NUMA node. Also, use the -d 8 option to reserve 8 cores for each MPI task and its threads, and the -cc numa_node option to bind each MPI task within its assigned NUMA node. NOTE: In the Intel programming environment, process/memory affinity works fine for pure MPI jobs, but thread affinity does not yet work properly. The workaround is to also set the KMP_AFFINITY environment variable. KMP_AFFINITY is needed only within the Intel programming environment, but you may want to experiment with various -cc, -ss, and KMP_AFFINITY settings to see how performance is affected.
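
A complete batch script for this hybrid example might look like the following sketch, assuming a csh-style shell as in the snippets above (the walltime, queue, and executable name a.out are placeholders):

#PBS -l mppwidth=32
#PBS -l walltime=00:10:00
#PBS -q debug
#PBS -V

cd $PBS_O_WORKDIR

setenv OMP_NUM_THREADS 8
setenv KMP_AFFINITY compact
aprun -n 4 -N 2 -S 1 -d 8 -cc numa_node ./a.out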


For 128 cores (8 nodes):

1 MPI per NUMA node with 8 threads each

setenv OMP_NUM_THREADS 8
aprun -n 16 -N 2 -S 1 -d 8 -ss ./a.out

2 MPI per NUMA node with 4 threads each

setenv OMP_NUM_THREADS 4
aprun -n 32 -N 4 -S 2 -d 4 -ss ./a.out

4 MPI per NUMA node with 2 threads each

setenv OMP_NUM_THREADS 2
aprun -n 64 -N 8 -S 4 -d 2 -ss ./a.out

One way to guarantee that your code can use all of the Edison node memory is via the aprun -m size option, where size is given in megabytes and the K, M, and G suffixes can be used (e.g., 16M, 1G). For example, to use a single core and 21 GB of memory: aprun -m 21G -N 1 -n 1 ./a.out
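
A minimal batch script sketch for this single-task, large-memory case follows (the mppwidth of 16, i.e., one full node, the walltime, and a.out are assumptions for illustration); note that the whole node is allocated and charged even though only one core runs:

#PBS -l mppwidth=16
#PBS -l walltime=00:10:00

cd $PBS_O_WORKDIR

aprun -m 21G -N 1 -n 1 ./a.out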

A note about the aprun -ss option: this option demands strict memory containment per NUMA node (the default is the opposite, allowing remote NUMA node memory access). Using this option prevents tasks from accessing the memory of the remote NUMA node, and it should be used for all hybrid MPI/OpenMP applications.

You can set an environment variable that will display the mapping between your MPI processes and the cores:

setenv MPICH_CPUMASK_DISPLAY

Here are two examples, showing first how to run one MPI task on each NUMA node and then two MPI tasks per NUMA node. The cpumask is shown as a binary string of 32 digits, two for each physical core (one for each HyperThreading logical core) on an Edison node. A "1" in the mask indicates the core on which the MPI task runs.

aprun -n 4 -S 1 ./a.out
[PE_1]: cpumask set to 1 cpu on nid00037, cpumask = 00000000000000000000000100000000
[PE_0]: cpumask set to 1 cpu on nid00037, cpumask = 00000000000000000000000000000001
[PE_2]: cpumask set to 1 cpu on nid00036, cpumask = 00000000000000000000000000000001
[PE_3]: cpumask set to 1 cpu on nid00036, cpumask = 00000000000000000000000100000000

aprun -n 4 -S 2 ./pi
[PE_0]: cpumask set to 1 cpu on nid00037, cpumask = 00000000000000000000000000000001
[PE_1]: cpumask set to 1 cpu on nid00037, cpumask = 00000000000000000000000000000010
[PE_2]: cpumask set to 1 cpu on nid00037, cpumask = 00000000000000000000000100000000
[PE_3]: cpumask set to 1 cpu on nid00037, cpumask = 00000000000000000000001000000000

Using Large Memory Pages

This section describes a possible way to improve performance by using larger memory pages. Memory for your application is allocated in increments of pages; the default page size is 4 KB. Some applications may run a little faster using large (or huge) page sizes, especially the largest size of 2 MB. You will have to use trial and error to determine whether this benefits your application.
To use huge pages, three steps are needed (a combined script sketch follows the list):

  1. Link your code with the hugetlbfs library:
    cc -o my_app my_app.c -lhugetlbfs
  2. Set the huge pages environment variable in your run script:

    For sh, bash:
    export HUGETLB_MORECORE=yes
    For csh:
    setenv HUGETLB_MORECORE yes        
  3. Use the -m sizeh or -m sizehs option with aprun.  The following example requests 700 MB of huge pages per instance of a.out.  If the request cannot be satisfied, the application gets as many huge pages as possible and 4KB pages after that.
    aprun -n 16 -N 2 -m 700h ./a.out
    The following example requires 700 MB of huge pages per instance of a.out.  If the request cannot be satisfied, the application will fail.
    aprun -n 16 -N 2 -m 700hs ./a.out
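
Putting the steps together, a huge-pages batch script might look like the following sketch, assuming a csh-style shell and the hypothetical executable my_app from step 1 (the mppwidth of 128 corresponds to 16 tasks at 2 per node, i.e., 8 nodes; the walltime is a placeholder):

#PBS -l mppwidth=128
#PBS -l walltime=00:30:00

cd $PBS_O_WORKDIR

setenv HUGETLB_MORECORE yes
aprun -n 16 -N 2 -m 700h ./my_app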

Hyperthreading

Hyperthreading is a technique that shares a single physical processor core among several threads of execution. It was developed to mitigate the observation that many programs spend a lot of time stalled, waiting for something. The most common cause of stalls is waiting for data to arrive from memory; in typical science and engineering applications, memory stalls amounting to one-third of the total execution time are not uncommon. The idea behind Hyperthreading is that, instead of letting the core stall and do nothing, it switches to a different thread of execution and later switches back to the first one. This context switching happens automatically in hardware, without user intervention, and incurs only a minuscule amount of overhead, so overall processor utilization can be higher. Whether utilization actually is higher depends on the program.

Hyperthreading does not require any changes to source code; it is handled entirely by the hardware and the operating system. The choice of whether or not to use HyperThreading is made by the user at run time. On Edison, HyperThreading can be enabled to a maximum of two threads per physical processor core. Edison nodes contain 16 physical cores, so if your code needs N cores (for N MPI ranks, for example) and you would normally use N/16 nodes, with HyperThreading you can use N/32 nodes, with 32 logical cores active per node. You need to test your code to determine whether there is a performance improvement.

To use HyperThreading on Edison, add the -j 2 option to the aprun command. One change is also required in the batch submission script: #PBS -l mppnppn must be set to 32.
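
For example, a sketch of a HyperThreaded version of the earlier 256-task job (a.out and the walltime are placeholders) might look like the following; with 32 logical cores per node, the job now fits on 8 nodes instead of 16:

#PBS -l mppwidth=256
#PBS -l mppnppn=32
#PBS -l walltime=00:10:00

cd $PBS_O_WORKDIR

aprun -j 2 -n 256 -N 32 ./a.out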

Two Important Error Messages

  1. OOM killer terminated this process.  This error message results when your application exceeds the available memory on the Edison node, which is 64 GB (about 4 GB/core).  "OOM" stands for Out of Memory.  Remember that Edison node memory includes memory for your application, for the operating system, and for system libraries such as MPI (including MPI buffer space).  If you get this error the solution may be to try running with a smaller problem size or running the same problem over more MPI ranks.
  2. Claim exceeds reservation's node-count.  This error message results when the combination of the PBS mppwidth setting and the aprun options (for example, -N, -S, -ss, -sn, -m) requires more nodes than were reserved for your job by the qsub command.  In general, "hybrid" applications with width MPI tasks, each of which runs depth threads, require a total of width x depth cores.  The number of nodes required is ceiling(width x depth / 16).
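
As a hypothetical worked example, a hybrid job with width = 32 MPI tasks and depth = 4 threads per task needs 32 x 4 = 128 cores, i.e., ceiling(128 / 16) = 8 nodes, so mppwidth must be at least 128 and the launch line must agree with it:

#PBS -l mppwidth=128
setenv OMP_NUM_THREADS 4
aprun -n 32 -N 4 -S 2 -d 4 ./a.out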

Additional Details

The aprun command is part of Cray's Application Level Placement Scheduler (ALPS). ALPS provides the interface for compute node job placement and execution, forwards the user's environment to the compute nodes, and manages stdin, stdout, and stderr. Several other ALPS utilities are available on Edison, including apstat (which provides status information about the Edison compute system and the applications running on it) and xtnodestat (which provides a detailed node-allocation view but produces an enormous amount of output).