SYNOPSIS
aprun [-a arch ] [-b ] [-B ] [-cc cpu_list | keyword ] [-cp cpu_placement_file_name ] [-d depth ]
[-D value ] [-L node_list ] [-m size[h|hs] ] [-n pes ] [-N pes_per_node ] [-F access_mode ] [-q ]
[-r cores ] [-S pes_per_numa_node ] [-sl list_of_numa_nodes ] [-sn numa_nodes_per_node ] [-ss ]
[-T ] [-t sec ]
executable [ arguments_for_executable ]
IMPLEMENTATION
Cray Linux Environment (CLE)
DESCRIPTION
To run an application on CNL compute nodes, use the Application Level
Placement Scheduler (ALPS) aprun command. The aprun command specifies
application resource requirements, requests application placement, and
initiates application launch.
The aprun utility provides user identity and environment information
as part of the application launch so that your login node session can
be replicated for the application on the assigned set of compute
nodes. This information includes the aprun current working directory,
which must be accessible from the compute nodes.
Before running aprun, ensure that your working directory is on a file
system accessible from the compute nodes. This will likely be a
Lustre-mounted directory, such as /lus/nid00007/user1/.
Do not suspend aprun; it is the local representative of the
application that is running on compute nodes. If aprun is suspended,
the application cannot communicate with ALPS; for example, it cannot
send exit notification to aprun when the application has completed.
Cray XT5 and Cray X6 compute node cores are paired to memory in NUMA
(Non-Uniform Memory Access) nodes. Local NUMA node access is defined
as memory accesses within the same NUMA node while remote NUMA node
access is defined as memory accesses between separate NUMA nodes in a
Cray compute node. Remote NUMA node accesses will have more latency as
a result of this configuration. Cray XT5 compute nodes have two NUMA
nodes while Cray X6 compute nodes have four NUMA nodes.
MPMD Mode
You can use aprun to launch applications in Multiple Program, Multiple
Data (MPMD) mode. The command format is:
aprun -n pes [other_aprun_options] executable1 [arguments_for_executable1] :
-n pes [other_aprun_options] executable2 [arguments_for_executable2] :
-n pes [other_aprun_options] executable3 [arguments_for_executable3] :
...
such as:
% aprun -n 12 ./app1 : -n 8 -d 2 ./app2 : -n 32 -N 2 ./app3
Cray XT5 and Cray X6 Compute Nodes
For quad-core Cray XT5 processors, NUMA node 0 has four processors
(logical CPUs 0-3), and NUMA node 1 has four processors (logical CPUs
4-7). For six-core Cray XT5 processors, NUMA node 0 has six processors
(logical CPUs 0-5), and NUMA node 1 has six processors (logical CPUs
6-11).
For eight-core Cray X6 processors, NUMA nodes 0 through 3 have four
processors each (logical CPUs 0-3, 4-7, 8-11, and 12-15,
respectively). For 12-core Cray X6 processors, NUMA nodes 0 through 3
have six processors each (logical CPUs 0-5, 6-11, 12-17, and 18-23,
respectively).
Two types of operations — remote-NUMA-node memory accesses and process
migration — can reduce performance. The aprun command provides memory
affinity and CPU affinity options that allow you to control these
operations. For more information, see the Memory Affinity and CPU
Affinity NOTES sections.
Note: Having a compute node reserved for your job does not
guarantee that you can use all NUMA nodes. You have to request
sufficient resources through qsub -l resource options and aprun
placement options (-n, -N, -d, and/or -m) to be able to use all
NUMA nodes. See the aprun option descriptions and the EXAMPLES
section for more information.
Cray XT4 Compute Nodes
Several aprun options apply to Cray XT4 compute nodes. Each Cray XT4
compute node consists of a quad-core processor (logical CPUs 0-3) and
memory. Process migration can reduce performance; you can avoid
process migration by using the aprun CPU affinity options to bind
processes to CPUs. For more information, see the CPU Affinity NOTES
section.
aprun Options
The aprun command accepts the following options:
-b Bypasses the transfer of the application executable to
compute nodes. By default, the executable is transferred to
the compute nodes as part of the aprun process of launching
an application. You would likely use the -b option only if
the executable to be launched was part of a file system
accessible from the compute node. For more information, see
the EXAMPLES section.
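For example, a minimal sketch, assuming the binary already
resides in a Lustre directory visible to the compute nodes (the
path and PE count shown are illustrative):
% aprun -n 16 -b /lus/nid00007/user1/a.out    # hypothetical path and PE count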
-B Tells ALPS to reuse the width, depth, nppn and memory
requests specified with the corresponding batch reservation.
This option obviates the need to specify the aprun options -n,
-d, -N, and -m; aprun exits with an error if any of these
options is specified together with -B.
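For example, a sketch of an aprun line inside a batch job script,
assuming the width, depth, nppn, and memory requests were already
given on the corresponding qsub command:
% aprun -B ./a.out    # reuses the batch reservation's requests; program name is illustrative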
-cc cpu_list | keyword
Binds processing elements (PEs) to CPUs. CNL does not
migrate processes that are bound to a CPU. This option
applies to all multicore compute nodes. The cpu_list is a
comma-separated or hyphen-separated list of logical CPUs; it
can also contain an x, which indicates that the
application-created process at that location in the fork
sequence should not be bound to a CPU.
Out-of-range cpu_list values are ignored unless all CPU
values are out of range, in which case an error message is
issued. For example, if you want to bind PEs starting with
the highest CPU on a compute node and work down from there,
you might use this -cc option:
% aprun -n 8 -cc 7-0 ./a.out
If the PEs were placed on Cray XT5 8-core compute nodes, the
specified -cc range would be valid. However, if the PEs were
placed on Cray XT4 compute nodes, CPUs 7-4 would be out of
range and therefore not used. See Example 4: Binding PEs to
CPUs (-cc cpu_list options).
The following keyword values can be used:
· The cpu keyword (the default) binds each PE to a CPU
within the assigned NUMA node. You do not have to
indicate a specific CPU.
If you specify a depth per PE (aprun -d depth), the PEs
are constrained to CPUs with a distance of depth between
them so each PE's threads can be constrained to the CPUs
closest to the PE's CPU.
The -cc cpu option is the typical use case for an MPI
application.
Note: If you oversubscribe CPUs for an OpenMP
application, Cray recommends that you not use the
-cc cpu default. Try the -cc none and -cc
numa_node options and compare results to determine
which option produces the better performance.
· The numa_node keyword causes a PE to be constrained to
the CPUs within the assigned NUMA node. CNL can migrate a
PE among the CPUs in the assigned NUMA node but not off
the assigned NUMA node. For example, if PE 2 is assigned
to NUMA node 0, CNL can migrate PE 2 among CPUs 0-3 but
not among CPUs 4-7.
If PEs create threads, the threads are constrained to the
same NUMA-node CPUs as the PEs. There is one exception.
If depth is greater than the number of CPUs per NUMA
node, once the number of threads created by the PE has
exceeded the number of CPUs per NUMA node, the remaining
threads are constrained to CPUs within the next NUMA node
on the compute node. For example, if depth is 5, threads
0-3 are constrained to CPUs 0-3 and thread 4 is constrained
to a CPU of the next NUMA node (CPUs 4-7).
· The none keyword specifies no binding; CNL can migrate
each PE and its threads among all the CPUs assigned to the
application on the compute node.
-D value The -D option value is an integer bitmask setting that
controls debug verbosity, where:
· A value of 1 provides a small level of debug messages
· A value of 2 provides a medium level of debug messages
· A value of 4 provides a high level of debug messages
Because this option is a bitmask setting, value can be set
to get any or all of the above levels of debug messages.
Therefore, valid values are 0 through 7. For example, -D 3
provides all small and medium level debug messages.
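For example, a hypothetical launch requesting both small and
medium levels of debug messages:
% aprun -D 3 -n 4 ./a.out    # illustrative PE count and program name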
-d depth Specifies the number of CPUs for each PE and its threads.
ALPS allocates the number of CPUs equal to depth times pes.
The -cc cpu_list option can restrict the placement of
threads, resulting in more than one thread per CPU.
The default depth is 1.
For OpenMP applications, use both the OMP_NUM_THREADS
environment variable to specify the number of threads and
the aprun -d option to specify the number of CPUs hosting
the threads. ALPS creates -n pes instances of the
executable, and the executable spawns OMP_NUM_THREADS-1
additional threads per PE.
Note: For a PathScale OpenMP program, set the
PSC_OMP_AFFINITY environment variable to FALSE before
compiling.
For Cray systems, compute nodes must have at least depth
CPUs. For Cray XT4 systems, depth cannot exceed 4; for Cray
XT5 systems, depth cannot exceed 12; and for Cray X6 compute
blades, depth cannot exceed 24.
See Example 3: OpenMP threads (-d option).
-L node_list
Specifies the candidate nodes to constrain application
placement. The syntax allows a comma-separated list of nodes
(such as -L 32,33,40), a range of nodes (such as -L 41-87),
or a combination of both formats. Node values can be
expressed in decimal, octal (preceded by 0), or hexadecimal
(preceded by 0x). The first number in a range must be less
than the second number (8-6, for example, is invalid), but
the nodes in a list can be in any order. See Example 12:
Using node lists (-L option).
This option is used for applications launched interactively;
use the qsub -lmppnodes=\"node_list\" option for batch and
interactive batch jobs.
-m size[h|hs]
Specifies the per-PE required memory size in megabytes. If
you do not include the -m option, the default amount of
memory available to each PE equals the minimum value of
(compute node memory size) / (number of CPUs) calculated for
each compute node.
For example, given Cray XT5 compute nodes with 32 GB of
memory and 8 CPUs, the default per-PE memory size is 32 GB /
8 CPUs = 4 GB. For another example, given a mixed-processor
system with 8-core, 32-GB Cray XT5 nodes (32 GB / 8 CPUs = 4
GB) and 4-core, 8-GB Cray XT4 nodes (8 GB / 4 CPUs = 2 GB),
the default per-PE memory size is the minimum of 4 GB and 2
GB = 2 GB. See Example 10: Memory per PE (-m option).
If you want hugepages (2 MB) allocated for a Cray XT
application, use the h or hs suffix. The default and maximum
hugepage size for Cray SeaStar systems is 2 MB. The default
for Cray Gemini systems is 2 MB but can be modified using
the HUGETLB_DEFAULT_PAGE_SIZE environment variable. For more
information on Cray Gemini hugepage sizes, see the NOTES
section.
-m sizeh On Cray SeaStar systems: requests size of
huge pages to be allocated to each PE. All
nodes use as much huge page memory as they
are able to allocate and 4 KB pages
thereafter. On Cray Gemini systems this
option pre-allocates the hugepages and
sets the lowater mark for hugepage size.
See the NOTES section and Example 11:
Using huge pages (-m h and hs suffixes).
-m sizehs Requires size of huge pages to be
allocated to each PE. If the request
cannot be satisfied, an error message is
issued and the application launch is
terminated. See Example 11: Using huge
pages (-m h and hs suffixes).
Note: To use huge pages, you must first load the huge pages
library during the linking phase, such as:
% cc -c my_hugepages_app.c
% cc -o my_hugepages_app my_hugepages_app.o -lhugetlbfs
Then set the huge pages environment variable:
% setenv HUGETLB_MORECORE yes
or
% export HUGETLB_MORECORE=yes
-N pes_per_node
Specifies the number of PEs to place per node. You can use
this option to reduce the number of PEs per node, thereby
making more resources available per PE. For Cray systems,
the default is the number of CPUs available on a node, and
the maximum pes_per_node is 24.
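For example, a sketch that places two PEs on each quad-core node,
doubling the CPUs and memory available to each PE relative to the
default of four PEs per node (PE count and program name are
illustrative):
% aprun -n 8 -N 2 ./a.out    # illustrative PE count and program name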
-F exclusive|share
exclusive mode provides a program with exclusive access to
all the processing and memory resources on a node. Used
together with the -cc option, it also binds processes to the
CPUs named in the affinity string. share mode restricts the
application-specific cpuset contents to only the
application-reserved cores and memory on NUMA node
boundaries, meaning the application will not have access to
cores and memory on other NUMA nodes on that compute node.
Exclusive access mode is enabled by default, so -F exclusive
normally does not need to be specified. However, if
nodeShare is set to share in /etc/alps.conf, you must use
-F exclusive to override the policy set in that file. You
can check the value of nodeShare by executing
apstat -svv | grep access.
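For example, a sketch that checks the site-wide access policy and
then forces exclusive access mode for a launch (PE count and
program name are illustrative):
% apstat -svv | grep access
% aprun -n 8 -F exclusive ./a.out    # illustrative PE count and program name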
-q Specifies quiet mode and suppresses all aprun-generated non-
fatal messages. Do not use this option with the -D (debug)
option; aprun terminates the application if both options are
specified. Even with the -q option, aprun writes its help
message and any fatal ALPS message when exiting. Normally,
this option should not be used.
-r cores Enables core specialization on Cray compute nodes, where
cores is the number of cores per node dedicated to system
services for the application.
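For example, a minimal sketch that dedicates one core per node to
system services (see the Core Specialization NOTES section; the PE
count and program name are illustrative):
% aprun -n 21 -r 1 ./a.out    # illustrative PE count and program name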
-S pes_per_numa_node
Specifies the number of PEs to allocate per NUMA node. This
option applies to both Cray XT5 and Cray X6 compute nodes.
You can use this option to reduce the number of PEs per NUMA
node, thereby making more resources available per PE. The
pes_per_numa_node value can be 1-6. For eight-core Cray XT5
nodes, the default is 4. For 12-core Cray XT5 and 24-core
Cray X6 nodes, the default is 6. A zero value is not allowed
and is a fatal error. For more information, see the Memory
Affinity NOTES section and Example 6: Optimizing NUMA-node
memory references (-S option).
-sl list_of_numa_nodes
Specifies the NUMA node or nodes (comma separated or hyphen
separated) to use for application placement. A space is
required between -sl and list_of_numa_nodes. This option
applies to Cray XT5 and Cray X6 compute nodes. The
list_of_numa_nodes value can be -sl <0,1> on Cray XT5
compute nodes, -sl <0,1,2,3> on Cray X6 compute nodes, or a
hyphen-separated range of those values. The default is no
placement constraints.
-sn numa_nodes_per_node
Specifies the number of NUMA nodes per node to be used for
application placement. This option applies to Cray XT5 and
Cray X6 compute nodes. The numa_nodes_per_node value can be
1 or 2 on Cray XT5 compute nodes, or 1, 2, 3, or 4 on Cray
X6 compute nodes. The
default is no placement constraints. You can use this option
to find out if restricting your PEs to one NUMA node per
node affects performance.
A zero value is not allowed and is a fatal error. For more
information, see the Memory Affinity NOTES section and
Example 8: Optimizing NUMA node-memory references (-sn
option).
-ss Specifies strict memory containment per NUMA node. This
option applies to Cray XT5 and Cray X6 compute nodes. When
-ss is specified, a PE can allocate only the memory local to
its assigned NUMA node.
The default is to allow remote-NUMA-node memory allocation
to all assigned NUMA nodes. You can use this option to find
out if restricting each PE's memory access to local-NUMA-
node memory affects performance. For more information, see
the Memory Affinity NOTES section.
-t sec Specifies the per-PE CPU time limit in seconds. The sec time
limit is constrained by your CPU time limit on the login
node. For example, if your time limit on the login node is
3600 seconds but you specify a -t value of 5000, your
application is constrained to 3600 seconds per PE. If your
time limit on the login node is unlimited, the sec value is
used (or, if not specified, the time per-PE is unlimited).
You can determine your CPU time limit by using the limit
command (csh) or the ulimit -a command (bash).
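For example, a sketch that checks the login-node CPU time limit
(csh) and then constrains each PE to 3600 CPU seconds (the PE count
and program name are illustrative):
% limit
% aprun -t 3600 -n 16 ./a.out    # illustrative values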
: Separates the names of executables and their associated
options for Multiple Program, Multiple Data (MPMD) mode. A
space is required before and after the colon.
NOTES
Standard I/O
When an application has been launched on compute nodes, aprun forwards
stdin only to PE 0 of the application. All of the other application
PEs have stdin set to /dev/null. An application's stdout and stderr
messages are sent from the compute nodes back to aprun for display.
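For example, in a hypothetical run that redirects a file to stdin,
only PE 0 reads the file; the remaining PEs read /dev/null:
% aprun -n 8 ./a.out < input.dat    # input.dat and PE count are hypothetical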
Signal Processing
The aprun command forwards the following signals to an application:
· SIGHUP
· SIGINT
· SIGQUIT
Environment Variables
The following environment variables affect aprun:
APRUN_DEFAULT_MEMORY
Specifies the default per-PE memory size. An explicit aprun
-m value overrides this setting.
APRUN_XFER_LIMITS
Sets the rlimit() transfer limits for aprun. If this is not
set or set to a non-zero string, aprun will transfer the
{get,set}rlimit() limits to apinit, which will use those
limits on the compute nodes. If it is set to 0, none of the
limits will be transferred other than RLIMIT_CORE,
RLIMIT_CPU, and possibly RLIMIT_RSS.
APRUN_SYNC_TTY
Sets synchronous tty for stdout and stderr output. Any non-
zero value enables synchronous tty output. An explicit aprun
-T value overrides this value.
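For example, a sketch (csh syntax) that enables synchronous tty
output and suppresses transfer of most rlimit values before a
launch (PE count and program name are illustrative):
% setenv APRUN_SYNC_TTY 1
% setenv APRUN_XFER_LIMITS 0
% aprun -n 4 ./a.out    # illustrative PE count and program name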
Memory Affinity
Cray XT5 compute blades use dual-socket quad-core or dual-socket, six-
core compute nodes. The Cray X6 compute blades use dual-socket,
twelve-core or eight-core compute nodes. Because these multicore
nodes can run more tasks simultaneously, they can increase overall
performance.
However, remote-NUMA-node memory references, such as a process running
on NUMA node 0 accessing NUMA node 1 memory, can adversely affect
performance. To give you run time controls that can optimize memory
references, Cray has added the following aprun memory affinity
options:
· -S pes_per_numa_node
· -sl list_of_numa_nodes
· -sn numa_nodes_per_node
· -ss
Hugepages for Cray XE Systems
The Cray Gemini MRT (Memory Relocation Table) is a feature of the
interconnect hardware on Cray XE systems that enables application
processes running on different compute nodes to directly access each
other's memory when that memory is backed by hugepages.
Without the Cray Gemini MRT, only 2GB of the application's address
space can be directly accessed from a different compute node. Your
application might not run if you do not take steps to place the
application's memory on hugepages.
Please see intro_hugepages(1) for information about how to back your
application on hugepages.
CPU Affinity
Note: On Cray XT5 and Cray X6 compute nodes, your application
can access only the resources you request on the aprun or qsub
command (or default values). Your application does not have
automatic access to all of a compute node's resources. For
example, if you request four or fewer CPUs per dual-socket, quad-
core compute node and you are not using the aprun -m option, your
application can access only the CPUs and memory of a single NUMA
node per node. If you include CPU affinity options that reference
the other NUMA node's resources, the kernel either ignores those
options or causes the application's termination. For more
information, see Example 4: Binding PEs to CPUs (-cc cpu_list
options) and Workload Management and Application Placement for
the Cray Linux Environment.
Core Specialization
When you use the -r option, cores are assigned to system services
associated with your application. Using this option may improve the
performance of your application. The width parameter of the batch
reservation (e.g. mppwidth) that you use may be affected. To help you
calculate the appropriate width when using core specialization, you
can use apcount. For more information, see the apcount(1) manpage.
Resolving "Claim exceeds reservation's node-count" Errors
If your aprun command requests more nodes than were reserved by the
qsub command, ALPS displays the Claim exceeds reservation's node-count
error message. For batch jobs, the number of nodes reserved is set
when the qsub command is successfully processed. If you subsequently
request additional nodes through aprun affinity options, apsched
issues the error message and aprun exits. For example, on a Cray XT4
quad-core system, the following qsub command reserves two nodes (290
and 294):
% qsub -I -lmppwidth=4 -lmppnppn=2
% aprun -n 4 -N 2 ./xthi | sort
Application 225100 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00290. (core affinity = 0)
Hello from rank 1, thread 0, on nid00290. (core affinity = 1)
Hello from rank 2, thread 0, on nid00294. (core affinity = 0)
Hello from rank 3, thread 0, on nid00294. (core affinity = 1)
In contrast, the following aprun command fails because the -S 1 option
constrains placement to one PE per NUMA node. Two additional nodes are
required:
% aprun -n 4 -N 2 -S 1 ./xthi | sort
Claim exceeds reservation's CPUs
ERRORS
If all application processes exit normally, aprun exits with zero. If
there is an internal aprun error or a fatal message is received from
ALPS on a compute node, aprun exits with 1. Otherwise, the aprun exit
code reflects how the application processes exited.
EXAMPLES
Example 1: Default placement (-n option)
ALPS places PEs on compute nodes according to your resource
requirements. For example, the command:
% aprun -n 24 ./a.out
places 24 PEs on:
· Cray XT4 single-socket, quad-core processors on 6 nodes
· Cray XT5 dual-socket, quad-core processors on 3 nodes
· Cray XT5 dual-socket, six-core processors on 2 nodes
· Cray X6 dual-socket, eight-core processors on 2 nodes
· Cray X6 dual-socket, 12-core processors on 1 node
The following example runs 12 PEs on three quad-core compute nodes
(nodes 28-30):
% cnselect coremask.eq.15
28-95,128-207
% qsub -I -lmppwidth=12 -lmppnodes=\"28-95,128-207\"
% aprun -n 12 ./xthi | sort
Application 1071056 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00028. (core affinity = 0)
Hello from rank 0, thread 1, on nid00028. (core affinity = 0)
Hello from rank 1, thread 0, on nid00028. (core affinity = 1)
Hello from rank 1, thread 1, on nid00028. (core affinity = 1)
Hello from rank 10, thread 0, on nid00030. (core affinity = 2)
Hello from rank 10, thread 1, on nid00030. (core affinity = 2)
Hello from rank 11, thread 0, on nid00030. (core affinity = 3)
Hello from rank 11, thread 1, on nid00030. (core affinity = 3)
Hello from rank 2, thread 0, on nid00028. (core affinity = 2)
Hello from rank 2, thread 1, on nid00028. (core affinity = 2)
Hello from rank 3, thread 0, on nid00028. (core affinity = 3)
Hello from rank 3, thread 1, on nid00028. (core affinity = 3)
Hello from rank 4, thread 0, on nid00029. (core affinity = 0)
Hello from rank 4, thread 1, on nid00029. (core affinity = 0)
Hello from rank 5, thread 0, on nid00029. (core affinity = 1)
Hello from rank 5, thread 1, on nid00029. (core affinity = 1)
Hello from rank 6, thread 0, on nid00029. (core affinity = 2)
Hello from rank 6, thread 1, on nid00029. (core affinity = 2)
Hello from rank 7, thread 0, on nid00029. (core affinity = 3)
Hello from rank 7, thread 1, on nid00029. (core affinity = 3)
Hello from rank 8, thread 0, on nid00030. (core affinity = 0)
Hello from rank 8, thread 1, on nid00030. (core affinity = 0)
Hello from rank 9, thread 0, on nid00030. (core affinity = 1)
Hello from rank 9, thread 1, on nid00030. (core affinity = 1)
The following example runs 12 PEs on one dual-socket, six-core compute
node:
Hello from rank 6, thread 0, on nid00170. (core affinity = 6)
Hello from rank 7, thread 0, on nid00170. (core affinity = 7)
Hello from rank 8, thread 0, on nid00170. (core affinity = 8)
Hello from rank 9, thread 0, on nid00170. (core affinity = 9)
Example 2: PEs per node (-N option)
If you want more compute node resources available for each PE, you can
use the -N option. For example, the following command used on a quad-
core system runs all PEs on one compute node:
% cnselect coremask.eq.15
25-88
% qsub -I -lmppwidth=4 -lmppnodes=\"25-88\"
% aprun -n 4 ./xthi | sort
Application 225102 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00028. (core affinity = 0)
Hello from rank 1, thread 0, on nid00028. (core affinity = 1)
Hello from rank 2, thread 0, on nid00028. (core affinity = 2)
Hello from rank 3, thread 0, on nid00028. (core affinity = 3)
In contrast, the following commands restrict placement to 1 PE per
node:
% qsub -I -lmppwidth=4 -lmppnppn=1 -lmppnodes=\"25-88\"
% aprun -n 4 -N 1 ./xthi | sort
Application 225103 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00028. (core affinity = 0)
Hello from rank 1, thread 0, on nid00029. (core affinity = 0)
Hello from rank 2, thread 0, on nid00030. (core affinity = 0)
Hello from rank 3, thread 0, on nid00031. (core affinity = 0)
Example 3: OpenMP threads (-d option)
For OpenMP applications, use the OMP_NUM_THREADS environment variable
to specify the number of OpenMP threads and the -d option to specify
the depth (number of CPUs) to be reserved for each PE and its threads.
Note: If you are using a PathScale compiler, set the
PSC_OMP_AFFINITY environment variable to FALSE before compiling:
% setenv PSC_OMP_AFFINITY FALSE
or:
% export PSC_OMP_AFFINITY=FALSE
ALPS creates -n pes instances of the executable, and the executable
spawns OMP_NUM_THREADS-1 additional threads per PE.
Hello from rank 0, thread 3, on nid00092. (core affinity = 0)
Hello from rank 1, thread 0, on nid00092. (core affinity = 1)
Hello from rank 1, thread 1, on nid00092. (core affinity = 1)
Hello from rank 1, thread 2, on nid00092. (core affinity = 1)
Hello from rank 1, thread 3, on nid00092. (core affinity = 1)
Hello from rank 2, thread 0, on nid00092. (core affinity = 2)
Hello from rank 2, thread 1, on nid00092. (core affinity = 2)
Hello from rank 2, thread 2, on nid00092. (core affinity = 2)
Hello from rank 2, thread 3, on nid00092. (core affinity = 2)
Hello from rank 3, thread 0, on nid00092. (core affinity = 3)
Hello from rank 3, thread 1, on nid00092. (core affinity = 3)
Hello from rank 3, thread 2, on nid00092. (core affinity = 3)
Hello from rank 3, thread 3, on nid00092. (core affinity = 3)
Because we used the default depth, each PE (rank) and its threads
execute on one CPU of a single compute node.
By setting the depth to 4, each PE and its threads run on separate
CPUs:
% cnselect coremask.eq.255
28-95
% qsub -I -lmppwidth=4 -lmppdepth=4 -lmppnodes=\"28-95\"
% setenv OMP_NUM_THREADS 4
% aprun -n 4 -d 4 ./xthi | sort
Application 225105 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00028. (core affinity = 0)
Hello from rank 0, thread 1, on nid00028. (core affinity = 1)
Hello from rank 0, thread 2, on nid00028. (core affinity = 2)
Hello from rank 0, thread 3, on nid00028. (core affinity = 3)
Hello from rank 1, thread 0, on nid00028. (core affinity = 4)
Hello from rank 1, thread 1, on nid00028. (core affinity = 5)
Hello from rank 1, thread 2, on nid00028. (core affinity = 6)
Hello from rank 1, thread 3, on nid00028. (core affinity = 7)
Hello from rank 2, thread 0, on nid00029. (core affinity = 0)
Hello from rank 2, thread 1, on nid00029. (core affinity = 1)
Hello from rank 2, thread 2, on nid00029. (core affinity = 2)
Hello from rank 2, thread 3, on nid00029. (core affinity = 3)
Hello from rank 3, thread 0, on nid00029. (core affinity = 4)
Hello from rank 3, thread 1, on nid00029. (core affinity = 5)
Hello from rank 3, thread 2, on nid00029. (core affinity = 6)
Hello from rank 3, thread 3, on nid00029. (core affinity = 7)
If you want all of a compute node's cores and memory available for one
PE and its threads, use -n 1 and -d depth. In the following example,
one PE and its threads run on cores 0-11 of a 12-core Cray XT5 compute
node:
% setenv OMP_NUM_THREADS 12
% aprun -n 1 -d 12 ./xthi | sort
Example 4: Binding PEs to CPUs (-cc cpu_list options)
This example uses the -cc option to bind the PEs to CPUs 0-2:
% aprun -n 6 -cc 0-2 ./xthi | sort
Application 225107 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00028. (core affinity = 0)
Hello from rank 1, thread 0, on nid00028. (core affinity = 1)
Hello from rank 2, thread 0, on nid00028. (core affinity = 2)
Hello from rank 3, thread 0, on nid00028. (core affinity = 0)
Hello from rank 4, thread 0, on nid00028. (core affinity = 1)
Hello from rank 5, thread 0, on nid00028. (core affinity = 2)
Normally, if the -d option and the OMP_NUM_THREADS values are equal,
each PE and its threads will run on separate CPUs. However, the -cc
cpu_list option can restrict the dynamic placement of PEs and threads:
% setenv OMP_NUM_THREADS 5
% aprun -n 4 -d 4 -cc 2,4 ./xthi | sort
Application 225108 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00028. (core affinity = 2)
Hello from rank 0, thread 1, on nid00028. (core affinity = 2)
Hello from rank 0, thread 2, on nid00028. (core affinity = 4)
Hello from rank 0, thread 3, on nid00028. (core affinity = 2)
Hello from rank 0, thread 4, on nid00028. (core affinity = 2)
Hello from rank 1, thread 0, on nid00028. (core affinity = 4)
Hello from rank 1, thread 1, on nid00028. (core affinity = 4)
Hello from rank 1, thread 2, on nid00028. (core affinity = 4)
Hello from rank 1, thread 3, on nid00028. (core affinity = 2)
Hello from rank 1, thread 4, on nid00028. (core affinity = 4)
Hello from rank 2, thread 0, on nid00029. (core affinity = 2)
Hello from rank 2, thread 1, on nid00029. (core affinity = 4)
Hello from rank 2, thread 2, on nid00029. (core affinity = 4)
Hello from rank 2, thread 3, on nid00029. (core affinity = 2)
Hello from rank 2, thread 4, on nid00029. (core affinity = 4)
Hello from rank 3, thread 0, on nid00029. (core affinity = 4)
Hello from rank 3, thread 1, on nid00029. (core affinity = 2)
Hello from rank 3, thread 2, on nid00029. (core affinity = 2)
Hello from rank 3, thread 3, on nid00029. (core affinity = 2)
Hello from rank 3, thread 4, on nid00029. (core affinity = 4)
If depth is greater than the number of CPUs per NUMA node, once the
number of threads created by the PE exceeds the number of CPUs per
NUMA node, the remaining threads are constrained to CPUs within the
next NUMA node on the compute node. In the following example, all
threads are placed on NUMA node 0 except thread 6, which is placed on
NUMA node 1:
% setenv OMP_NUM_THREADS 7
Hello from rank 1, thread 5, on nid00262. (core affinity = 5)
Hello from rank 1, thread 6, on nid00262. (core affinity = 6)
Example 5: Binding PEs to CPUs (-cc keyword options)
By default, each PE is bound to a CPU (-cc cpu). For a Cray XT5
application, each PE runs on a separate CPU of NUMA nodes 0 and 1. In
the following example, each PE is bound to a CPU of a 12-core Cray XT5
compute node:
% aprun -n 12 -cc cpu ./xthi | sort
Application 286323 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00514. (core affinity = 0)
Hello from rank 10, thread 0, on nid00514. (core affinity = 10)
Hello from rank 11, thread 0, on nid00514. (core affinity = 11)
Hello from rank 1, thread 0, on nid00514. (core affinity = 1)
Hello from rank 2, thread 0, on nid00514. (core affinity = 2)
Hello from rank 3, thread 0, on nid00514. (core affinity = 3)
Hello from rank 4, thread 0, on nid00514. (core affinity = 4)
Hello from rank 5, thread 0, on nid00514. (core affinity = 5)
Hello from rank 6, thread 0, on nid00514. (core affinity = 6)
Hello from rank 7, thread 0, on nid00514. (core affinity = 7)
Hello from rank 8, thread 0, on nid00514. (core affinity = 8)
Hello from rank 9, thread 0, on nid00514. (core affinity = 9)
For a Cray XT4 quad-core application, each PE runs on a separate CPU
of two nodes:
% cnselect coremask.eq.15
40-45
% qsub -I -lmppwidth=8 -lmppnodes=\"40-45\"
% aprun -n 8 -cc cpu ./xthi | sort
Application 225111 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00040. (core affinity = 0)
Hello from rank 1, thread 0, on nid00040. (core affinity = 1)
Hello from rank 2, thread 0, on nid00040. (core affinity = 2)
Hello from rank 3, thread 0, on nid00040. (core affinity = 3)
Hello from rank 4, thread 0, on nid00041. (core affinity = 0)
Hello from rank 5, thread 0, on nid00041. (core affinity = 1)
Hello from rank 6, thread 0, on nid00041. (core affinity = 2)
Hello from rank 7, thread 0, on nid00041. (core affinity = 3)
Cray XT4 nodes have one NUMA node (NUMA node 0). In the following
example, PEs 0-3 are each bound to a CPU in NUMA node 0 of node 40,
and PEs 4-7 are each bound to a CPU in NUMA node 0 of node 41:
% cnselect coremask.eq.15
40-45
% qsub -I -lmppwidth=8 -lmppnodes=\"40-45\"
The following commands use the -cc numa_node keyword. CNL can
migrate each PE among the CPUs of its assigned NUMA node but not
off the assigned NUMA node:
% cnselect coremask.eq.255
28-95
% qsub -I -lmppwidth=8 -lmppnodes=\"28-95\"
% aprun -n 8 -cc numa_node ./xthi | sort
Application 225113 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00028. (core affinity = 0-3)
Hello from rank 1, thread 0, on nid00028. (core affinity = 0-3)
Hello from rank 2, thread 0, on nid00028. (core affinity = 0-3)
Hello from rank 3, thread 0, on nid00028. (core affinity = 0-3)
Hello from rank 4, thread 0, on nid00028. (core affinity = 4-7)
Hello from rank 5, thread 0, on nid00028. (core affinity = 4-7)
Hello from rank 6, thread 0, on nid00028. (core affinity = 4-7)
Hello from rank 7, thread 0, on nid00028. (core affinity = 4-7)
For Cray XT4 nodes, the PEs are bound to NUMA-node 0 of nodes 40 and
41:
% cnselect coremask.eq.15
40-45
% qsub -I -lmppwidth=8 -lmppnodes=\"40-45\"
% aprun -n 8 -cc numa_node ./xthi | sort
Application 225114 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00040. (core affinity = 0-3)
Hello from rank 1, thread 0, on nid00040. (core affinity = 0-3)
Hello from rank 2, thread 0, on nid00040. (core affinity = 0-3)
Hello from rank 3, thread 0, on nid00040. (core affinity = 0-3)
Hello from rank 4, thread 0, on nid00041. (core affinity = 0-3)
Hello from rank 5, thread 0, on nid00041. (core affinity = 0-3)
Hello from rank 6, thread 0, on nid00041. (core affinity = 0-3)
Hello from rank 7, thread 0, on nid00041. (core affinity = 0-3)
The following command specifies no binding; CNL can migrate threads
among all the CPUs of node 28:
% aprun -n 8 -cc none ./xthi | sort
Application 225116 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00028. (core affinity = 0-7)
Hello from rank 1, thread 0, on nid00028. (core affinity = 0-7)
Hello from rank 2, thread 0, on nid00028. (core affinity = 0-7)
Hello from rank 3, thread 0, on nid00028. (core affinity = 0-7)
Hello from rank 4, thread 0, on nid00028. (core affinity = 0-7)
Hello from rank 5, thread 0, on nid00028. (core affinity = 0-7)
Hello from rank 6, thread 0, on nid00028. (core affinity = 0-7)
Hello from rank 7, thread 0, on nid00028. (core affinity = 0-7)
Example 6: Optimizing NUMA-node memory references (-S option)
The -S option reduces the number of PEs placed on each NUMA node,
making more resources available to each PE.
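A hypothetical invocation on dual-socket, quad-core Cray XT5
compute nodes, placing two PEs on each NUMA node (output not shown
here; the PE count is illustrative):
% aprun -n 8 -S 2 ./xthi | sort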
Example 7: Optimizing NUMA-node memory references (-sl option)
This example runs all PEs on NUMA node 0; the PEs cannot allocate
remote NUMA node memory:
% aprun -n 8 -sl 0 ./xthi | sort
Application 225118 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00028. (core affinity = 0)
Hello from rank 1, thread 0, on nid00028. (core affinity = 1)
Hello from rank 2, thread 0, on nid00028. (core affinity = 2)
Hello from rank 3, thread 0, on nid00028. (core affinity = 3)
Hello from rank 4, thread 0, on nid00029. (core affinity = 0)
Hello from rank 5, thread 0, on nid00029. (core affinity = 1)
Hello from rank 6, thread 0, on nid00029. (core affinity = 2)
Hello from rank 7, thread 0, on nid00029. (core affinity = 3)
This example runs all PEs on NUMA node 1:
% aprun -n 8 -sl 1 ./xthi | sort
Application 225119 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00028. (core affinity = 4)
Hello from rank 1, thread 0, on nid00028. (core affinity = 5)
Hello from rank 2, thread 0, on nid00028. (core affinity = 6)
Hello from rank 3, thread 0, on nid00028. (core affinity = 7)
Hello from rank 4, thread 0, on nid00029. (core affinity = 4)
Hello from rank 5, thread 0, on nid00029. (core affinity = 5)
Hello from rank 6, thread 0, on nid00029. (core affinity = 6)
Hello from rank 7, thread 0, on nid00029. (core affinity = 7)
Example 8: Optimizing NUMA node-memory references (-sn option)
This example runs four PEs on NUMA node 0 of node 28 and four PEs on
NUMA node 0 of node 29:
% aprun -n 8 -sn 1 ./xthi | sort
Application 225120 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00028. (core affinity = 0)
Hello from rank 1, thread 0, on nid00028. (core affinity = 1)
Hello from rank 2, thread 0, on nid00028. (core affinity = 2)
Hello from rank 3, thread 0, on nid00028. (core affinity = 3)
Hello from rank 4, thread 0, on nid00029. (core affinity = 0)
Hello from rank 5, thread 0, on nid00029. (core affinity = 1)
Hello from rank 6, thread 0, on nid00029. (core affinity = 2)
Hello from rank 7, thread 0, on nid00029. (core affinity = 3)
Example 9: Optimizing NUMA-node memory references (-ss option)
When the -ss option is used, a PE can allocate only the memory local
to its assigned NUMA node. The default is to allow remote-NUMA-node
memory allocation to all assigned NUMA nodes. For example, by default
any PE running on NUMA node 0 can allocate NUMA node 1 memory.
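A sketch of such an invocation on an 8-core Cray XT5 compute node
(the single output line retained below appears to come from a run
of this form):
% aprun -n 8 -ss ./xthi | sort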
Hello from rank 7, thread 0, on nid00028. (core affinity = 7)
Example 10: Memory per PE (-m option)
The -m option can affect application placement. This example runs all
PEs on node 43. The amount of memory available per PE is 4000 MB:
% aprun -n 8 -m4000m ./xthi | sort
Application 225122 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00043. (core affinity = 0)
Hello from rank 1, thread 0, on nid00043. (core affinity = 1)
Hello from rank 2, thread 0, on nid00043. (core affinity = 2)
Hello from rank 3, thread 0, on nid00043. (core affinity = 3)
Hello from rank 4, thread 0, on nid00043. (core affinity = 4)
Hello from rank 5, thread 0, on nid00043. (core affinity = 5)
Hello from rank 6, thread 0, on nid00043. (core affinity = 6)
Hello from rank 7, thread 0, on nid00043. (core affinity = 7)
In this example, node 43 does not have enough memory to fulfill the
request for 4001 MB per PE. PE 7 runs on node 44:
% aprun -n 8 -m4001 ./xthi | sort
Application 225123 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00043. (core affinity = 0)
Hello from rank 1, thread 0, on nid00043. (core affinity = 1)
Hello from rank 2, thread 0, on nid00043. (core affinity = 2)
Hello from rank 3, thread 0, on nid00043. (core affinity = 3)
Hello from rank 4, thread 0, on nid00043. (core affinity = 4)
Hello from rank 5, thread 0, on nid00043. (core affinity = 5)
Hello from rank 6, thread 0, on nid00043. (core affinity = 6)
Hello from rank 7, thread 0, on nid00044. (core affinity = 0)
Example 11: Using huge pages (-m h and hs suffixes)
This example requests 4000 MB of huge pages per PE:
% cc -o xthi xthi.c -lhugetlbfs
% HUGETLB_MORECORE=yes aprun -n 8 -m4000h ./xthi | sort
%
Application 225124 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00043. (core affinity = 0)
Hello from rank 1, thread 0, on nid00043. (core affinity = 1)
Hello from rank 2, thread 0, on nid00043. (core affinity = 2)
Hello from rank 3, thread 0, on nid00043. (core affinity = 3)
Hello from rank 4, thread 0, on nid00043. (core affinity = 4)
Hello from rank 5, thread 0, on nid00043. (core affinity = 5)
Hello from rank 6, thread 0, on nid00043. (core affinity = 6)
Hello from rank 7, thread 0, on nid00043. (core affinity = 7)
The following example terminates because the required 4000 MB of huge
pages per PE are not available:
% aprun -n 8 -m4000hs ./xthi | sort
[NID 00043] 2009-04-09 07:58:28 Apid 379231: unable to acquire enough huge memory: desired 32000M, actual 31498M
Example 12: Using node lists (-L option)
You can specify candidate node lists through the aprun -L option for
applications launched interactively and through the qsub -lmppnodes
option for batch and interactive batch jobs.
For an application launched interactively, use the cnselect command to
get a list of all Cray XT5 compute nodes. Then use the aprun -L
option to specify the candidate list:
% cnselect coremask.eq.255
28-95
% aprun -n 4 -N 2 -L 28-95 ./xthi | sort
Application 225127 resources: utime ~0s, stime ~0s
Hello from rank 0, thread 0, on nid00028. (core affinity = 0)
Hello from rank 1, thread 0, on nid00028. (core affinity = 1)
Hello from rank 2, thread 0, on nid00029. (core affinity = 0)
Hello from rank 3, thread 0, on nid00029. (core affinity = 1)
Example 13: Bypassing binary transfer (-b option)
This aprun command runs a shell command on the compute node to find
references to MemTotal in the compute node file /proc/meminfo:
% aprun -b /bin/ash -c "cat /proc/meminfo |grep MemTotal"
MemTotal: 32909204 kB
For further information about the commands you can use with the aprun
-b option, see Workload Management and Application Placement for the
Cray Linux Environment.
SEE ALSO
intro_alps(1), apkill(1), apstat(1), cnselect(1), qsub(1)
CC(1), cc(1), ftn(1)
Workload Management and Application Placement for the Cray Linux
Environment
Cray Application Developer's Environment User's Guide