Job Launch Command: aprun
You must use the aprun command to launch jobs on the Edison compute nodes. Use it for serial, MPI, OpenMP, UPC, and hybrid MPI/OpenMP or hybrid MPI/CAF jobs.
You should view the MAN pages for aprun on Edison.
Basic aprun Options
|-n||Number of MPI tasks.|
|-N||(Optional) Number of MPI tasks per Edison Node. Default is 24.|
|-d||(Optional) Depth, or number of threads, per MPI task. Use this very important option in addition to OMP_NUM_THREADS for OpenMP. Values can be 1-24. The default is 1. For OpenMP values of 2-12 are recommended.|
|-S||(Optional) Number of tasks per NUMA node. Values can be 1-12; default 12|
|-sn||(Optional) Number of NUMA nodes to use per Edison node. Values can be 1-2; default 2|
|-ss||(Optional) Demands strict memory containment per NUMA node. The default is the opposite - to allow remote NUMA node memory access.|
|-cc||(Optional) Controls how tasks are bound to cores and NUMA nodes. The default setting on Edison is -cc cpu which restricts each task to run on a specific core.|
|-m||(Optional) Memory required per task. Three ways to use it: -m size to request size bytes of memory; -m sizeh to request huge pages of size size; and -m sizehs to require huge pages of size size. See the large pages section below for more details.|
|-j||(Optional) Use this option only to enable HyperThreading (HT) and use -j 2. The default is no HyperThreading (-j 1).|
aprun -n 384 ./a.out arg1 arg2
aprun -n 384 ./a.out < in_file > out_file
where arg1 arg2 ... are optional command line arguments to your program and in_file and out_file are optional redirected files. These examples request that 384 processing elements be assigned to run your code. Because there is no "-N " option there will be 24 processing elements on each compute node, using 384/24 = 16 nodes. You must request adequate resources for this in your script; i.e., mppwidth must be at least 384:
#PBS -l mppwidth=384
#PBS -l walltime=00:10:00
#PBS -N my_job
#PBS -q debug
aprun -n 384 ./a.out
NERSC recommends that mppwidth should always be a multiple of 24 even if aprun will use fewer resources. (For example, if you need only 40 cores, request 48). There are alternative ways to request resources using other directives, but we've found requesting all 24 cores on the number of nodes desired is the simplest and least error prone way. In other words, multiply the number of nodes * 24 cores per node and set this value to mppwidth. This discussion assumes no HyperThreading.
For a complete description of all options to aprun, type man aprun on Edison.
Memory affinity and placing MPI tasks optimally on an Edison node
Each Edison node consists of two Intel SandyBridge processors, each with eight cores. Each processor is directly connected to one-half of the total Edison node memory. Each processor and its memory is called a NUMA node or NUMA domain. Thus, there are two NUMA nodes per Edison node. See the figure below. Although any core in the Edison node can access all the memory in the node there may be performance and capacity issues associated with the NUMA nodes.
The aprun -S, -sn, -sl, -cc, and -ss options control how your application uses the NUMA nodes. If you are running an MPI-only application and are able to use all 24 cores on the Edison node then you do not need to use these options in general, the default setting for these options on Edison should work well with the most of the MPI codes. These options are important on Edison if you use OpenMP or if you don't fully populate the Edison nodes.
Run 48 MPI tasks with the nodes under-populated by 1/2 (that is, using only 12 cores per Edison node)
#PBS –l mppwidth=96
aprun –n 48 –N 12 –S 6 ./a.out
Say you want to run 48 MPI tasks but each task needs more than the default 2.67 GB of memory per core. The way to do this (see above) is to use twice as many Edison nodes but with only 1/2 of the cores (only 12) in each node active. You will need to request 96 cores (48 MPI tasks / 12 tasks used per Edison node X 24 cores per Edison node). Use the –S 6 option to place the 12 MPI tasks per Edison node on both NUMA nodes to ensure best performance and access to all Edison node memory. We need this option because the default is for aprun to pack the NUMA nodes, meaning all tasks on just one NUMA node.
Note: you are charged for the number of nodes used, not number of cores used, because nodes cannot be used by more than one job simultaneously.
Run 4 MPI tasks, 2 per node, with 12 OpenMP threads each
#PBS –l mppwidth=48
setenv OMP_NUM_THREADS 12
aprun –n 4 –N 2 -S 1 –d 12 –cc numa_node ./a.out #for binaries compiled with Intel compielrs
aprun -n 4 -N 2 -S1 -d 12 ./a.out # for binaries compiled with gnu or cray compilers.
This example will use two Edison nodes with 4 MPI tasks, two per node, and 12 OpenMP threads each. Use the -S 1 option to require that the MPI tasks are distributed with no more than one per NUMA node. Also, make sure to use the –d 12 option to reserve 12 cores for each MPI task and its threads and use the –cc numa_node option to bind each MPI task within the assigned NUMA node. NOTE: In the Intel programming environment, the process/memory aﬃnity works ﬁne for pure MPI jobs, but thread aﬃnity does not work properly just yet. The work around is to also use the env KMP_AFFINITY. KMP_AFFINITY is needed only within the Intel programming environment but you might experiment with various "-cc", "-ss", and "KMP_AFFINITY" options to see how performance is affected.
For 192 cores (8 nodes):
1 MPI per NUMA node with 12 threads each
setenv OMP_NUM_THREADS 12
aprun –n 16 –N 2 -S 1 –d 12 -ss ./a.out
2 MPI per NUMA node with 4 threads each
setenv OMP_NUM_THREADS 6
aprun –n 32 –N 4 -S 2 –d 6 -ss ./a.out
4 MPI per NUMA node with 2 threads each
setenv OMP_NUM_THREADS 2
aprun –n 64 –N 8 -S 4 –d 3 -ss ./a.out
One way to guarantee that your code can use all the Edison node memory is via the
aprun -m size option, where size is given in megabytes and the K, M, and G suffixes can be used (e.g., 16M, 1G).
Example: to use a single core and 21 GB of memory: aprun –m 21G –N1 –n1 ./a.out
Note about the aprun -ss option. This option demands strict memory containment per NUMA node. The default is the opposite - to allow remote NUMA node memory access. Using this option prevents accessing the memory of the remote NUMA node and it should be used for all hybrid MPI/OpeMPI applications.
You can set an environment variable that will display the mapping between your MPI processes and the cores:
Here are two examples showing first how to run 1 MPI task on each NUMA node and then two MPI tasks per NUMA. The cpumask is shown as a binary string of 32 digits, two for each physical core (one for each HyperThreading logical core) per Edison node. A "1" shown in the mask indicates the core on which the MPI task is run.
aprun -n 4 -S 1 ./a.out
[PE_1]: cpumask set to 1 cpu on nid00037, cpumask = 00000000000000000000000100000000
[PE_0]: cpumask set to 1 cpu on nid00037, cpumask = 00000000000000000000000000000001
[PE_2]: cpumask set to 1 cpu on nid00036, cpumask = 00000000000000000000000000000001
[PE_3]: cpumask set to 1 cpu on nid00036, cpumask = 00000000000000000000000100000000
aprun -n 4 -S 2 ./pi
[PE_0]: cpumask set to 1 cpu on nid00037, cpumask = 00000000000000000000000000000001
[PE_1]: cpumask set to 1 cpu on nid00037, cpumask = 00000000000000000000000000000010
[PE_2]: cpumask set to 1 cpu on nid00037, cpumask = 00000000000000000000000100000000
[PE_3]: cpumask set to 1 cpu on nid00037, cpumask = 00000000000000000000001000000000
Using Large Memory Pages
Memory for your application is allocated in increments of pages. The default page size on Cray systems is 4 KB. Some applications may benefit by having larger size pages which can improve memory performance for common access patterns on large data sets. Virtual memory page sizes larger than the default are called "huge pages". See the intro_hugepages online man page for more information about this feature.
This feature is implemented at link and run time by loading the craype-hugepagesSIZE module where SIZE is the desired page size for you application.
These are the available page size modules on Edison: craype-hugepages2M, craype-hugepages4M, craype-hugepages8M, craype-hugepages16M, craype-hugepages32M, craype-hugepages64M, craype-hugepages128M, craype-hugepages256M , and craype-hugepages512M.
You should use trial and error to whether this benefits your application, and, if so, which page size is best.
To use huge pages, do the following.
Load the appropriate hugepages module and link your code.
module load craype-hugepages16M cc -o my_app my_app.c
To run with huge pages, load the cray-hugepages module in your script:
module load craype-hugepages16M
To verify that huge pages are being implemented in your running code you can set the HUGETLB_VERBOSE environment variable:
setenv HUGETLB_VERBOSE 3
Hyperthreading is a technique that shares a single physical processor core amongst several threads of execution. It was developed as a way of mitigating the observation that many programs spend a lot of time just stalled, waiting for something. The most common thing that stalls a program is waiting for data to arrive from memory - in typical science/engineering applications memory stalls amounting to one-third of the total execution time are not uncommon. Hyperthreading is the idea that, instead of having the processor stall and do nothing, the operating system switches the processor to operate on a different thread of execution and then later, switches back to the first one. This context switching happens completely automatically, without user intervention, and the idea is that the context switch itself incurs a very miniscule amount of overhead so that overall processor utilization can be higher. Whether utilization actually is higher depends on the program.
Hyperthreading does not require any changes to source code at all - it is carried out entirely by the operating system. The choice of whether or not to use HyperThreading is made by the user at runtime. On Edison HyperThreading can be enabled to a maximum of two threads per physical processor core. Edison nodes contain 24 physical cores. If your code needs N cores (for N MPI ranks, for example), and you would normally use N/24 nodes, with HyperThreading you can use N/24/2 nodes, with 48 logical cores active per node. You need to test your code to determine if there is a performance improvement.
Two Important Error Messages
- OOM killer terminated this process. This error message results when your application exceeds the available memory on the Edison node, which is 64 GB (about 4 GB/core). "OOM" stands for Out of Memory. Remember that Edison node memory includes memory for your application, for the operating system, and for system libraries such as MPI (including MPI buffer space). If you get this error the solution may be to try running with a smaller problem size or running the same problem over more MPI ranks.
- Claim exceeds reservation's node-count. This error message results when the combination of PBS mppwidth and aprun options (for example, –N, –S, –ss, –sn, –m) requires more nodes than were reserved for your job by the qsub command. In general, "hybrid" applications with width MPI tasks, each of which spawns depth additional threads, require a total of width X depth cores. The number of nodes required is the ceiling(width * depth / 24).
Using the -r cores aprun option enables core specialization, where cores is the number of cores to be used for Linux kernel system services. This might enable your application to more fully utilize the remaining cores in the node assigned to it by restricting all possible overhead processing to the specialized cores within the reservation. The extent to which this will improve application performance is very application dependent.
The aprun command is part of Cray's Application Level Placement Scheduler (ALPS). ALPS provides the interface for compute node job placement and execution, forwards the users environment to the compute nodes, and manages stdin, stdout, and stderr. Several other ALPS utilities are available on Edison, including apstat (provides status information about the Edison compute system and applications running on it) and xtnodestat (which provides a detailed node allocation view but produces an enormous amount of output).