
Aprun MAN Page

SYNOPSIS
     aprun [-a arch ] [-b ] [-B] [-cc cpu_list | keyword ] [-cp cpu_placement_file_name ] [-d depth ]
     [-D value ] [-L node_list ] [-m size[h|hs] ] [-n pes ] [-N pes_per_node ] [-F access_mode ] [-q ]
     [-r cores ] [-S pes_per_numa_node ] [-sl list_of_numa_nodes ] [-sn numa_nodes_per_node ] [-ss ]
     [-T ] [-t sec ]
      executable [ arguments_for_executable ]

IMPLEMENTATION
     Cray Linux Environment (CLE)

DESCRIPTION
     To run an application on CNL compute nodes, use the Application Level
     Placement Scheduler (ALPS) aprun command. The aprun command specifies
     application resource requirements, requests application placement, and
     initiates application launch.

     The aprun utility provides user identity and environment information
     as part of the application launch so that your login node session can
     be replicated for the application on the assigned set of compute
     nodes. This information includes the aprun current working directory,
     which must be accessible from the compute nodes.

     Before running aprun, ensure that your working directory is on a file
     system accessible from the compute nodes. This will likely be a
     Lustre-mounted directory, such as /lus/nid00007/user1/.

     Do not suspend aprun; it is the local representative of the
     application that is running on compute nodes. If aprun is suspended,
     the application cannot communicate with ALPS; for example, it cannot
     send exit notification to aprun when the application has completed.

     Cray XT5 and Cray X6 compute node cores are paired to memory in NUMA
     (Non-Uniform Memory Access) nodes. Local NUMA node access is defined
     as memory accesses within the same NUMA node while remote NUMA node
     access is defined as memory accesses between separate NUMA nodes in a
     Cray compute node. Remote NUMA node accesses will have more latency as
     a result of this configuration. Cray XT5 compute nodes have two NUMA
     nodes while Cray X6 compute nodes have four NUMA nodes.

   MPMD Mode
     You can use aprun to launch applications in Multiple Program, Multiple
     Data (MPMD) mode. The command format is:

       aprun -n pes [other_aprun_options] executable1 [arguments_for_executable1] :
             -n pes [other_aprun_options] executable2 [arguments_for_executable2] :
             -n pes [other_aprun_options] executable3 [arguments_for_executable3] :
        ...

     such as:

       % aprun -n 12 ./app1 : -n 8 -d 2 ./app2 : -n 32 -N 2 ./app3


      This command launches app1 with 12 PEs, app2 with 8 PEs and a depth
      of 2, and app3 with 32 PEs placed 2 per node.

   Cray XT5 and Cray X6 Compute Nodes
      For eight-core Cray X6 processors, NUMA nodes 0 through 3 have four
      processors each (logical CPUs 0-3, 4-7, 8-11, and 12-15,
      respectively). For 12-core Cray X6 processors, NUMA nodes 0 through 3
      have six processors each (logical CPUs 0-5, 6-11, 12-17, and 18-23,
      respectively).

     For quad-core Cray XT5 processors, NUMA node 0 has four processors
     (logical CPUs 0-3), and NUMA node 1 has four processors (logical CPUs
     4-7). For six-core Cray XT5 processors, NUMA node 0 has six processors
     (logical CPUs 0-5), and NUMA node 1 has six processors (logical CPUs
     6-11).

     Two types of operations — remote-NUMA-node memory accesses and process
     migration — can reduce performance. The aprun command provides memory
     affinity and CPU affinity options that allow you to control these
     operations. For more information, see the Memory Affinity and CPU
     Affinity NOTES sections.

          Note:  Having a compute node reserved for your job does not
          guarantee that you can use all NUMA nodes. You have to request
          sufficient resources through qsub -l resource options and aprun
          placement options (-n, -N, -d, and/or -m) to be able to use all
          NUMA nodes. See the aprun option descriptions and the EXAMPLES
          section for more information.

   Cray XT4 Compute Nodes
     Several aprun options apply to Cray XT4 compute nodes. Each Cray XT4
     compute node consists of a quad-core processor (logical CPUs 0-3) and
     memory. Process migration can reduce performance; you can avoid
     process migration by using the aprun CPU affinity options to bind
     processes to CPUs. For more information, see the CPU Affinity NOTES
     section.

   aprun Options
     The aprun command accepts the following options:

     -b        Bypasses the transfer of the application executable to
               compute nodes. By default, the executable is transferred to
               the compute nodes as part of the aprun process of launching
               an application. You would likely use the -b option only if
               the executable to be launched was part of a file system
               accessible from the compute node. For more information, see
               the EXAMPLES section.
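
                For example, the following sketch (the path is illustrative
                and assumes the binary already resides in a file system
                mounted on the compute nodes) runs a system binary without
                transferring it:

                % aprun -n 1 -b /bin/cat /proc/meminfo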

      -B        Tells ALPS to reuse the width, depth, nppn, and memory
                requests specified with the corresponding batch reservation.
                This option obviates the need to specify the aprun -n, -d,
                -N, and -m options; aprun exits with an error if any of
                these options is specified together with -B.
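
                For example, a minimal sketch (the reservation values are
                illustrative); aprun reuses the width and nppn values from
                the qsub reservation:

                % qsub -I -lmppwidth=8 -lmppnppn=4
                % aprun -B ./a.out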


      -cc cpu_list | keyword
                Binds processing elements (PEs) to CPUs. CNL does not
                migrate processes that are bound to a CPU. This option
                applies to all multicore compute nodes. The cpu_list is not
                used for placement decisions, but is used only by CNL
                during application execution. The cpu_list is a comma-
                separated or hyphen-separated list of logical CPUs; an x in
                the list indicates that the application-created process at
                that location in the fork sequence should not be bound to a
                CPU.

               Out-of-range cpu_list values are ignored unless all CPU
               values are out of range, in which case an error message is
               issued. For example, if you want to bind PEs starting with
               the highest CPU on a compute node and work down from there,
               you might use this -cc option:

               % aprun -n 8 -cc 7-0 ./a.out

               If the PEs were placed on Cray XT5 8-core compute nodes, the
               specified -cc range would be valid. However, if the PEs were
               placed on Cray XT4 compute nodes, CPUs 7-4 would be out of
               range and therefore not used. See Example 4: Binding PEs to
               CPUs (-cc cpu_list options).

               The following keyword values can be used:

               ·  The cpu keyword (the default) binds each PE to a CPU
                  within the assigned NUMA node. You do not have to
                  indicate a specific CPU.

                  If you specify a depth per PE (aprun -d depth), the PEs
                  are constrained to CPUs with a distance of depth between
                  them so each PE's threads can be constrained to the CPUs
                  closest to the PE's CPU.

                  The -cc cpu option is the typical use case for an MPI
                  application.

                         Note:  If you oversubscribe CPUs for an OpenMP
                         application, Cray recommends that you not use the
                         -cc cpu default. Try the -cc none and -cc
                         numa_node options and compare results to determine
                         which option produces the better performance.

               ·  The numa_node keyword causes a PE to be constrained to
                  the CPUs within the assigned NUMA node. CNL can migrate a
                  PE among the CPUs in the assigned NUMA node but not off
                  the assigned NUMA node. For example, if PE 2 is assigned
                  to NUMA node 0, CNL can migrate PE 2 among CPUs 0-3 but
                  not among CPUs 4-7.

                  If PEs create threads, the threads are constrained to the
                  same NUMA-node CPUs as the PEs. There is one exception.
                  If depth is greater than the number of CPUs per NUMA
                  node, once the number of threads created by the PE has
                  exceeded the number of CPUs per NUMA node, the remaining
                  threads are constrained to CPUs within the next NUMA node
                   on the compute node. For example, if depth is 5, threads
                   0-3 are constrained to CPUs 0-3 and thread 4 is
                   constrained to a CPU within the next NUMA node (CPUs
                   4-7).

                ·  The none keyword specifies no binding; CNL can migrate
                   processes among all the CPUs of the assigned node.

     -D value  The -D option value is an integer bitmask setting that
               controls debug verbosity, where:

               ·  A value of 1 provides a small level of debug messages

               ·  A value of 2 provides a medium level of debug messages

               ·  A value of 4 provides a high level of debug messages

               Because this option is a bitmask setting, value can be set
               to get any or all of the above levels of debug messages.
               Therefore, valid values are 0 through 7. For example, -D 3
               provides all small and medium level debug messages.
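
                For example, the following sketch combines the small (1)
                and medium (2) debug levels for a two-PE launch:

                % aprun -D 3 -n 2 ./a.out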

     -d depth  Specifies the number of CPUs for each PE and its threads.
               ALPS allocates the number of CPUs equal to depth times pes.
               The -cc cpu_list option can restrict the placement of
               threads, resulting in more than one thread per CPU.

               The default depth is 1.

               For OpenMP applications, use both the OMP_NUM_THREADS
               environment variable to specify the number of threads and
               the aprun -d option to specify the number of CPUs hosting
               the threads. ALPS creates -n pes instances of the
               executable, and the executable spawns OMP_NUM_THREADS-1
               additional threads per PE.

                Note:  For a PathScale OpenMP program, set the
                PSC_OMP_AFFINITY environment variable to FALSE before
                compiling.

                For Cray systems, compute nodes must have at least depth
                CPUs. The depth value cannot exceed 4 on Cray XT4 systems,
                12 on Cray XT5 systems, or 24 on Cray X6 compute blades.

               See Example 3: OpenMP threads (-d option).

     -L node_list
               Specifies the candidate nodes to constrain application
               placement. The syntax allows a comma-separated list of nodes
               (such as -L 32,33,40), a range of nodes (such as -L 41-87),
               or a combination of both formats. Node values can be
               expressed in decimal, octal (preceded by 0), or hexadecimal
               (preceded by 0x). The first number in a range must be less
               than the second number (8-6, for example, is invalid), but
               the nodes in a list can be in any order. See Example 12:
               Using node lists (-L option).

                This option is used for applications launched interactively;
                use the qsub -lmppnodes=\"node_list\" option for batch and
                interactive batch jobs.

      -m size[h|hs]
                Specifies the per-PE required memory size in megabytes. If
                you do not include the -m option, the default amount of
                memory available to each PE equals the minimum value of
                (compute node memory size) / (number of CPUs) calculated for
                each compute node.

               For example, given Cray XT5 compute nodes with 32 GB of
               memory and 8 CPUs, the default per-PE memory size is 32 GB /
               8 CPUs = 4 GB. For another example, given a mixed-processor
               system with 8-core, 32-GB Cray XT5 nodes (32 GB / 8 CPUs = 4
               GB) and 4-core, 8-GB Cray XT4 nodes (8 GB / 4 CPUs = 2 GB),
               the default per-PE memory size is the minimum of 4 GB and 2
               GB = 2 GB. See Example 10: Memory per PE (-m option).

               If you want hugepages (2 MB) allocated for a Cray XT
               application, use the h or hs suffix. The default and maximum
               hugepage size for Cray SeaStar systems is 2 MB. The default
               for Cray Gemini systems is 2 MB but can be modified using
               the HUGETLB_DEFAULT_PAGE_SIZE environment variable. For more
               information on Cray Gemini hugepage sizes, see the NOTES
               section.


              -m sizeh           On Cray SeaStar systems: requests size of
                                 huge pages to be allocated to each PE. All
                                 nodes use as much huge page memory as they
                                 are able to allocate and 4 KB pages
                                 thereafter. On Cray Gemini systems this
                                 option pre-allocates the hugepages and
                                  sets the low-water mark for hugepage size.
                                 See the NOTES section and Example 11:
                                 Using huge pages (-m h and hs suffixes).
              -m sizehs          Requires size of huge pages to be
                                 allocated to each PE. If the request
                                 cannot be satisfied, an error message is
                                 issued and the application launch is
                                 terminated. See Example 11: Using huge
                                 pages (-m h and hs suffixes).



                Note:  To use huge pages, you must first link the
                application with the huge pages library, for example:

               % cc -c my_hugepages_app.c
               % cc -o my_hugepages_app my_hugepages_app.o -lhugetlbfs

               Then set the huge pages environment variable:

               % setenv HUGETLB_MORECORE yes

               or

                % export HUGETLB_MORECORE=yes

      -n pes    Specifies the number of processing elements (PEs) that your
                application requires.

      -N pes_per_node
                Specifies the number of PEs to place per node. You can use
                this option to reduce the number of PEs per node, thereby
                making more resources available per PE. For Cray systems,
                the default is the number of CPUs available on a node.

               For Cray systems, the maximum pes_per_node is 24.

      -F exclusive|share
                Specifies the access mode for the node. exclusive mode
                gives a program exclusive access to all the processing and
                memory resources on a node. Used together with the -cc
                option, it binds processes to the CPUs named in the
                affinity string. share mode restricts the application-
                specific cpuset contents to only the application-reserved
                cores and memory on NUMA node boundaries, meaning the
                application does not have access to cores and memory on
                other NUMA nodes on that compute node. You do not need to
                specify exclusive because exclusive access mode is enabled
                by default. However, if nodeShare is set to share in
                /etc/alps.conf, you must use -F exclusive to override the
                policy set in that file. You can check the value of
                nodeShare by executing apstat -svv | grep access.
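
                For example, a minimal sketch that checks the configured
                access mode and then requests exclusive access (the PE
                count is illustrative):

                % apstat -svv | grep access
                % aprun -n 24 -F exclusive ./a.out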

     -q        Specifies quiet mode and suppresses all aprun-generated non-
               fatal messages. Do not use this option with the -D (debug)
               option; aprun terminates the application if both options are
               specified. Even with the -q option, aprun writes its help
               message and any fatal ALPS message when exiting. Normally,
               this option should not be used.

     -r cores  Enables core specialization on Cray compute nodes, where the
               number of cores specified is the number of system services
               cores per node for the application.
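
                For example, the following sketch (the PE count is
                illustrative) reserves one core per node for system
                services:

                % aprun -n 16 -r 1 ./a.out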

     -S pes_per_numa_node
               Specifies the number of PEs to allocate per NUMA node. This
               option applies to both Cray XT5 and Cray X6 compute nodes.
               You can use this option to reduce the number of PEs per NUMA
               node, thereby making more resources available per PE. The
               pes_per_numa_node value can be 1-6. For eight-core Cray XT5
               nodes, the default is 4. For 12-core Cray XT5 and 24-core
               Cray X6 nodes, the default is 6. A zero value is not allowed
               and is a fatal error. For more information, see the Memory
               Affinity NOTES section and Example 6: Optimizing NUMA-node
               memory references (-S option).
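
                For example, a minimal sketch: on eight-core Cray XT5
                nodes, the following places two PEs on each NUMA node, so
                four PEs per node and eight PEs across two nodes:

                % aprun -n 8 -S 2 ./a.out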

     -sl list_of_numa_nodes
               Specifies the NUMA node or nodes (comma separated or hyphen
               separated) to use for application placement. A space is
               required between -sl and list_of_numa_nodes. This option
                applies to Cray XT5 and Cray X6 compute nodes. The
                list_of_numa_nodes value can be -sl <0,1> on Cray XT5
                compute nodes, -sl <0,1,2,3> on Cray X6 compute nodes, or a
                range such as -sl 0-1 or -sl 0-3. The default is no
                placement constraints.

      -sn numa_nodes_per_node
                Specifies the number of NUMA nodes per node to use for
                application placement. A space is required between -sn and
                numa_nodes_per_node. This option applies to Cray XT5 and
                Cray X6 compute nodes. The numa_nodes_per_node value can be
                1 or 2 on Cray XT5 compute nodes, or 1, 2, 3, or 4 on Cray
                X6 compute nodes. The default is no placement constraints.
                You can use this option to find out if restricting your PEs
                to one NUMA node per node affects performance.

               A zero value is not allowed and is a fatal error. For more
               information, see the Memory Affinity NOTES section and
               Example 8: Optimizing NUMA node-memory references (-sn
               option).

     -ss       Specifies strict memory containment per NUMA node. This
               option applies to Cray XT5 and Cray X6 compute nodes. When
               -ss is specified, a PE can allocate only the memory local to
               its assigned NUMA node.

               The default is to allow remote-NUMA-node memory allocation
               to all assigned NUMA nodes. You can use this option to find
               out if restricting each PE's memory access to local-NUMA-
               node memory affects performance. For more information, see
               the Memory Affinity NOTES section.
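
                For example, the following sketch (the PE count is
                illustrative) restricts each PE to memory on its own
                assigned NUMA node:

                % aprun -n 8 -ss ./a.out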

     -t sec    Specifies the per-PE CPU time limit in seconds. The sec time
               limit is constrained by your CPU time limit on the login
               node. For example, if your time limit on the login node is
               3600 seconds but you specify a -t value of 5000, your
               application is constrained to 3600 seconds per PE. If your
               time limit on the login node is unlimited, the sec value is
               used (or, if not specified, the time per-PE is unlimited).
               You can determine your CPU time limit by using the limit
               command (csh) or the ulimit -a command (bash).
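
                For example, a minimal sketch (the limit value is
                illustrative) that limits each PE to 3600 CPU seconds:

                % aprun -t 3600 -n 4 ./a.out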

     :         Separates the names of executables and their associated
               options for Multiple Program, Multiple Data (MPMD) mode. A
               space is required before and after the colon.

NOTES
   Standard I/O
     When an application has been launched on compute nodes, aprun forwards
     stdin only to PE 0 of the application. All of the other application
     PEs have stdin set to /dev/null. An application's stdout and stderr
     messages are sent from the compute nodes back to aprun for display.
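
      For example, in the following sketch only PE 0 reads from input.dat
      (a hypothetical input file); the other PEs see an empty stdin:

        % aprun -n 4 ./a.out < input.dat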

   Signal Processing
     The aprun command forwards the following signals to an application:

     ·  SIGHUP

     ·  SIGINT

     ·  SIGQUIT


   Environment Variables
      The following environment variables affect aprun behavior:

      APRUN_DEFAULT_MEMORY
                Specifies the default per-PE memory size. An explicit aprun
                -m value overrides this setting.

     APRUN_XFER_LIMITS
               Sets the rlimit() transfer limits for aprun. If this is not
               set or set to a non-zero string, aprun will transfer the
               {get,set}rlimit() limits to apinit, which will use those
               limits on the compute nodes. If it is set to 0, none of the
               limits will be transferred other than RLIMIT_CORE,
               RLIMIT_CPU, and possibly RLIMIT_RSS.

     APRUN_SYNC_TTY
               Sets synchronous tty for stdout and stderr output. Any non-
               zero value enables synchronous tty output. An explicit aprun
               -T value overrides this value.
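
      For example, a minimal sketch (values are illustrative) that enables
      synchronous tty output and disables transfer of most resource
      limits:

        % setenv APRUN_SYNC_TTY 1
        % setenv APRUN_XFER_LIMITS 0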

   Memory Affinity
     Cray XT5 compute blades use dual-socket quad-core or dual-socket, six-
     core compute nodes. The Cray X6 compute blades use dual-socket,
     twelve-core or eight-core compute nodes. Because Cray systems can run
     more tasks simultaneously, they can increase overall performance.
     However, remote-NUMA-node memory references, such as a process running
     on NUMA node 0 accessing NUMA node 1 memory, can adversely affect
     performance. To give you run time controls that can optimize memory
     references, Cray has added the following aprun memory affinity
     options:

     ·  -S pes_per_numa_node

     ·  -sl list_of_numa_nodes

     ·  -sn numa_nodes_per_node

     ·  -ss
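
      For example, the following sketch (assuming eight-core Cray XT5
      nodes) places two PEs on each NUMA node and restricts each PE's
      memory allocation to its local NUMA node:

        % aprun -n 8 -S 2 -ss ./a.out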

   Hugepages for Cray XE Systems
     The Cray Gemini MRT (Memory Relocation Table) is a feature of the
     interconnect hardware on Cray XE systems that enables application
     processes running on different compute nodes to directly access each
     other's memory when that memory is backed by hugepages.

     Without the Cray Gemini MRT, only 2GB of the application's address
     space can be directly accessed from a different compute node. Your
     application might not run if you do not take steps to place the
     application's memory on hugepages.

     Please see intro_hugepages(1) for information about how to back your
     application on hugepages.
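
      As a minimal sketch (my_app and the per-PE size are illustrative),
      link against the huge pages library, enable hugepage allocation, and
      request hugepages at launch:

        % cc -o my_app my_app.c -lhugetlbfs
        % setenv HUGETLB_MORECORE yes
        % aprun -n 16 -m2000h ./my_app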

   CPU Affinity
          Note:  On Cray XT5 and Cray X6 compute nodes, your application
          can access only the resources you request on the aprun or qsub
          command (or default values). Your application does not have
          automatic access to all of a compute node's resources. For
          example, if you request four or fewer CPUs per dual-socket, quad-
          core compute node and you are not using the aprun -m option, your
          application can access only the CPUs and memory of a single NUMA
          node per node. If you include CPU affinity options that reference
          the other NUMA node's resources, the kernel either ignores those
          options or causes the application's termination. For more
          information, see Example 4: Binding PEs to CPUs (-cc cpu_list
          options) and the Workload Management and Application Placement
          for the Cray Linux Environment.

   Core Specialization
     When you use the -r option, cores are assigned to system services
     associated with your application. Using this option may improve the
     performance of your application. The width parameter of the batch
     reservation (e.g. mppwidth) that you use may be affected. To help you
     calculate the appropriate width when using core specialization, you
     can use apcount. For more information, see the apcount(1) manpage.

   Resolving "Claim exceeds reservation's node-count" Errors
     If your aprun command requests more nodes than were reserved by the
     qsub command, ALPS displays the Claim exceeds reservation's node-count
     error message. For batch jobs, the number of nodes reserved is set
     when the qsub command is successfully processed. If you subsequently
     request additional nodes through aprun affinity options, apsched
     issues the error message and aprun exits. For example, on a Cray XT4
     quad-core system, the following qsub command reserves two nodes (290
     and 294):

       % qsub -I -lmppwidth=4 -lmppnppn=2
       % aprun -n 4 -N 2 ./xthi | sort
       Application 225100 resources: utime ~0s, stime ~0s
       Hello from rank 0, thread 0, on nid00290. (core affinity = 0)
       Hello from rank 1, thread 0, on nid00290. (core affinity = 1)
       Hello from rank 2, thread 0, on nid00294. (core affinity = 0)
       Hello from rank 3, thread 0, on nid00294. (core affinity = 1)

     In contrast, the following aprun command fails because the -S 1 option
     constrains placement to one PE per NUMA node. Two additional nodes are
     required:

       % aprun -n 4 -N 2 -S 1 ./xthi | sort
       Claim exceeds reservation's CPUs


ERRORS
     If all application processes exit normally, aprun exits with zero. If
     there is an internal aprun error or a fatal message is received from
      ALPS on a compute node, aprun exits with 1. Otherwise, the aprun exit
      code reflects the exit status of the application processes.

EXAMPLES
      Example 1: Default placement

      By default, ALPS places PEs on the minimum number of compute nodes
      needed to meet the placement requirements. For example, the command:

       % aprun -n 24 ./a.out

     places 24 PEs on:

     ·  Cray XT4 single-socket, quad-core processors on 6 nodes

     ·  Cray XT5 dual-socket, quad-core processors on 3 nodes

     ·  Cray XT5 dual-socket, six-core processors on 2 nodes

     ·  Cray X6 dual-socket, eight-core processors on 2 nodes.

     ·  Cray X6 dual-socket, 12-core processors on 1 node.

     The following example runs 12 PEs on three quad-core compute nodes
     (nodes 28-30):

       % cnselect coremask.eq.15
       28-95,128-207
       % qsub -I -lmppwidth=12 -lmppnodes=\"28-95,128-207\"
       % aprun -n 12 ./xthi | sort
       Application 1071056 resources: utime ~0s, stime ~0s
       Hello from rank 0, thread 0, on nid00028. (core affinity = 0)
       Hello from rank 0, thread 1, on nid00028. (core affinity = 0)
       Hello from rank 1, thread 0, on nid00028. (core affinity = 1)
       Hello from rank 1, thread 1, on nid00028. (core affinity = 1)
       Hello from rank 10, thread 0, on nid00030. (core affinity = 2)
       Hello from rank 10, thread 1, on nid00030. (core affinity = 2)
       Hello from rank 11, thread 0, on nid00030. (core affinity = 3)
       Hello from rank 11, thread 1, on nid00030. (core affinity = 3)
       Hello from rank 2, thread 0, on nid00028. (core affinity = 2)
       Hello from rank 2, thread 1, on nid00028. (core affinity = 2)
       Hello from rank 3, thread 0, on nid00028. (core affinity = 3)
       Hello from rank 3, thread 1, on nid00028. (core affinity = 3)
       Hello from rank 4, thread 0, on nid00029. (core affinity = 0)
       Hello from rank 4, thread 1, on nid00029. (core affinity = 0)
       Hello from rank 5, thread 0, on nid00029. (core affinity = 1)
       Hello from rank 5, thread 1, on nid00029. (core affinity = 1)
       Hello from rank 6, thread 0, on nid00029. (core affinity = 2)
       Hello from rank 6, thread 1, on nid00029. (core affinity = 2)
       Hello from rank 7, thread 0, on nid00029. (core affinity = 3)
       Hello from rank 7, thread 1, on nid00029. (core affinity = 3)
       Hello from rank 8, thread 0, on nid00030. (core affinity = 0)
       Hello from rank 8, thread 1, on nid00030. (core affinity = 0)
       Hello from rank 9, thread 0, on nid00030. (core affinity = 1)
        Hello from rank 9, thread 1, on nid00030. (core affinity = 1)


     The following example runs 12 PEs on one dual-socket, six-core compute
     node:
       Hello from rank 6, thread 0, on nid00170. (core affinity = 6)
       Hello from rank 7, thread 0, on nid00170. (core affinity = 7)
       Hello from rank 8, thread 0, on nid00170. (core affinity = 8)
       Hello from rank 9, thread 0, on nid00170. (core affinity = 9)


     Example 2: PEs per node (-N option)

     If you want more compute node resources available for each PE, you can
     use the -N option. For example, the following command used on a quad-
     core system runs all PEs on one compute node:

       % cnselect coremask.eq.15
       25-88
       % qsub -I -lmppwidth=4 -lmppnodes=\"25-88\"
       % aprun -n 4 ./xthi | sort
       Application 225102 resources: utime ~0s, stime ~0s
       Hello from rank 0, thread 0, on nid00028. (core affinity = 0)
       Hello from rank 1, thread 0, on nid00028. (core affinity = 1)
       Hello from rank 2, thread 0, on nid00028. (core affinity = 2)
       Hello from rank 3, thread 0, on nid00028. (core affinity = 3)

     In contrast, the following commands restrict placement to 1 PE per
     node:

       % qsub -I -lmppwidth=4 -lmppnppn=1 -lmppnodes=\"25-88\"
       % aprun -n 4 -N 1 ./xthi | sort
       Application 225103 resources: utime ~0s, stime ~0s
       Hello from rank 0, thread 0, on nid00028. (core affinity = 0)
       Hello from rank 1, thread 0, on nid00029. (core affinity = 0)
       Hello from rank 2, thread 0, on nid00030. (core affinity = 0)
       Hello from rank 3, thread 0, on nid00031. (core affinity = 0)


     Example 3: OpenMP threads (-d option)

     For OpenMP applications, use the OMP_NUM_THREADS environment variable
     to specify the number of OpenMP threads and the -d option to specify
     the depth (number of CPUs) to be reserved for each PE and its threads.

          Note:  If you are using a PathScale compiler, set the
          PSC_OMP_AFFINITY environment variable to FALSE before compiling:

            % setenv PSC_OMP_AFFINITY FALSE

          or:

            % export PSC_OMP_AFFINITY=FALSE


     ALPS creates -n pes instances of the executable, and the executable
     spawns OMP_NUM_THREADS-1 additional threads per PE.
       Hello from rank 0, thread 3, on nid00092. (core affinity = 0)
       Hello from rank 1, thread 0, on nid00092. (core affinity = 1)
       Hello from rank 1, thread 1, on nid00092. (core affinity = 1)
       Hello from rank 1, thread 2, on nid00092. (core affinity = 1)
       Hello from rank 1, thread 3, on nid00092. (core affinity = 1)
       Hello from rank 2, thread 0, on nid00092. (core affinity = 2)
       Hello from rank 2, thread 1, on nid00092. (core affinity = 2)
       Hello from rank 2, thread 2, on nid00092. (core affinity = 2)
       Hello from rank 2, thread 3, on nid00092. (core affinity = 2)
       Hello from rank 3, thread 0, on nid00092. (core affinity = 3)
       Hello from rank 3, thread 1, on nid00092. (core affinity = 3)
       Hello from rank 3, thread 2, on nid00092. (core affinity = 3)
       Hello from rank 3, thread 3, on nid00092. (core affinity = 3)


     Because we used the default depth, each PE (rank) and its threads
     execute on one CPU of a single compute node.

     By setting the depth to 4, each PE and its threads run on separate
     CPUs:

       % cnselect coremask.eq.255
       28-95
       % qsub -I -lmppwidth=4 -lmppdepth=4 -lmppnodes=\"28-95\"
       % setenv OMP_NUM_THREADS 4
       % aprun -n 4 -d 4 ./xthi | sort
       Application 225105 resources: utime ~0s, stime ~0s
       Hello from rank 0, thread 0, on nid00028. (core affinity = 0)
       Hello from rank 0, thread 1, on nid00028. (core affinity = 1)
       Hello from rank 0, thread 2, on nid00028. (core affinity = 2)
       Hello from rank 0, thread 3, on nid00028. (core affinity = 3)
       Hello from rank 1, thread 0, on nid00028. (core affinity = 4)
       Hello from rank 1, thread 1, on nid00028. (core affinity = 5)
       Hello from rank 1, thread 2, on nid00028. (core affinity = 6)
       Hello from rank 1, thread 3, on nid00028. (core affinity = 7)
       Hello from rank 2, thread 0, on nid00029. (core affinity = 0)
       Hello from rank 2, thread 1, on nid00029. (core affinity = 1)
       Hello from rank 2, thread 2, on nid00029. (core affinity = 2)
       Hello from rank 2, thread 3, on nid00029. (core affinity = 3)
       Hello from rank 3, thread 0, on nid00029. (core affinity = 4)
       Hello from rank 3, thread 1, on nid00029. (core affinity = 5)
       Hello from rank 3, thread 2, on nid00029. (core affinity = 6)
       Hello from rank 3, thread 3, on nid00029. (core affinity = 7)


     If you want all of a compute node's cores and memory available for one
     PE and its threads, use -n 1 and -d depth. In the following example,
     one PE and its threads run on cores 0-11 of a 12-core Cray XT5 compute
     node:

       % setenv OMP_NUM_THREADS 12
       % aprun -n 1 -d 12 ./xthi | sort

     Example 4: Binding PEs to CPUs (-cc cpu_list options)

     This example uses the -cc option to bind the PEs to CPUs 0-2:

       % aprun -n 6 -cc 0-2 ./xthi | sort
       Application 225107 resources: utime ~0s, stime ~0s
       Hello from rank 0, thread 0, on nid00028. (core affinity = 0)
       Hello from rank 1, thread 0, on nid00028. (core affinity = 1)
       Hello from rank 2, thread 0, on nid00028. (core affinity = 2)
       Hello from rank 3, thread 0, on nid00028. (core affinity = 0)
       Hello from rank 4, thread 0, on nid00028. (core affinity = 1)
       Hello from rank 5, thread 0, on nid00028. (core affinity = 2)


     Normally, if the -d option and the OMP_NUM_THREADS values are equal,
     each PE and its threads will run on separate CPUs. However, the -cc
     cpu_list option can restrict the dynamic placement of PEs and threads:

       % setenv OMP_NUM_THREADS 5
       % aprun -n 4 -d 4 -cc 2,4 ./xthi | sort
       Application 225108 resources: utime ~0s, stime ~0s
       Hello from rank 0, thread 0, on nid00028. (core affinity = 2)
       Hello from rank 0, thread 1, on nid00028. (core affinity = 2)
       Hello from rank 0, thread 2, on nid00028. (core affinity = 4)
       Hello from rank 0, thread 3, on nid00028. (core affinity = 2)
       Hello from rank 0, thread 4, on nid00028. (core affinity = 2)
       Hello from rank 1, thread 0, on nid00028. (core affinity = 4)
       Hello from rank 1, thread 1, on nid00028. (core affinity = 4)
       Hello from rank 1, thread 2, on nid00028. (core affinity = 4)
       Hello from rank 1, thread 3, on nid00028. (core affinity = 2)
       Hello from rank 1, thread 4, on nid00028. (core affinity = 4)
       Hello from rank 2, thread 0, on nid00029. (core affinity = 2)
       Hello from rank 2, thread 1, on nid00029. (core affinity = 4)
       Hello from rank 2, thread 2, on nid00029. (core affinity = 4)
       Hello from rank 2, thread 3, on nid00029. (core affinity = 2)
       Hello from rank 2, thread 4, on nid00029. (core affinity = 4)
       Hello from rank 3, thread 0, on nid00029. (core affinity = 4)
       Hello from rank 3, thread 1, on nid00029. (core affinity = 2)
       Hello from rank 3, thread 2, on nid00029. (core affinity = 2)
       Hello from rank 3, thread 3, on nid00029. (core affinity = 2)
       Hello from rank 3, thread 4, on nid00029. (core affinity = 4)


     If depth is greater than the number of CPUs per NUMA node, once the
     number of threads created by the PE exceeds the number of CPUs per
     NUMA node, the remaining threads are constrained to CPUs within the
     next NUMA node on the compute node. In the following example, all
     threads are placed on NUMA node 0 except thread 6, which is placed on
     NUMA node 1:

       % setenv OMP_NUM_THREADS 7
       Hello from rank 1, thread 5, on nid00262. (core affinity = 5)
       Hello from rank 1, thread 6, on nid00262. (core affinity = 6)


     Example 5: Binding PEs to CPUs (-cc keyword options)

     By default, each PE is bound to a CPU (-cc cpu). For a Cray XT5
     application, each PE runs on a separate CPU of NUMA nodes 0 and 1. In
     the following example, each PE is bound to a CPU of a 12-core Cray XT5
     compute node:

       % aprun -n 12 -cc cpu ./xthi | sort
       Application 286323 resources: utime ~0s, stime ~0s
       Hello from rank 0, thread 0, on nid00514. (core affinity = 0)
       Hello from rank 10, thread 0, on nid00514. (core affinity = 10)
       Hello from rank 11, thread 0, on nid00514. (core affinity = 11)
       Hello from rank 1, thread 0, on nid00514. (core affinity = 1)
       Hello from rank 2, thread 0, on nid00514. (core affinity = 2)
       Hello from rank 3, thread 0, on nid00514. (core affinity = 3)
       Hello from rank 4, thread 0, on nid00514. (core affinity = 4)
       Hello from rank 5, thread 0, on nid00514. (core affinity = 5)
       Hello from rank 6, thread 0, on nid00514. (core affinity = 6)
       Hello from rank 7, thread 0, on nid00514. (core affinity = 7)
       Hello from rank 8, thread 0, on nid00514. (core affinity = 8)
       Hello from rank 9, thread 0, on nid00514. (core affinity = 9)


     For a Cray XT4 quad-core application, each PE runs on a separate CPU
     of two nodes:

       % cnselect coremask.eq.15
       40-45
       % qsub -I -lmppwidth=8 -lmppnodes=\"40-45\"
       % aprun -n 8 -cc cpu ./xthi | sort
       Application 225111 resources: utime ~0s, stime ~0s
       Hello from rank 0, thread 0, on nid00040. (core affinity = 0)
       Hello from rank 1, thread 0, on nid00040. (core affinity = 1)
       Hello from rank 2, thread 0, on nid00040. (core affinity = 2)
       Hello from rank 3, thread 0, on nid00040. (core affinity = 3)
       Hello from rank 4, thread 0, on nid00041. (core affinity = 0)
       Hello from rank 5, thread 0, on nid00041. (core affinity = 1)
       Hello from rank 6, thread 0, on nid00041. (core affinity = 2)
       Hello from rank 7, thread 0, on nid00041. (core affinity = 3)


     Cray XT4 nodes have one NUMA node (NUMA-node 0). In the following
     example, each PE of PEs 0-3 is bound to a CPU in node 40, NUMA-node 0
     and each PE of PEs 4-7 is bound to a CPU in node 41, NUMA-node 0:

       % cnselect coremask.eq.15
       40-45
       % qsub -I -lmppwidth=8 -lmppnodes=\"40-45\"

      When the numa_node keyword is used, CNL can migrate each PE among the
      CPUs of its assigned NUMA node but not off the assigned NUMA node:

       % cnselect coremask.eq.255
       28-95
       % qsub -I -lmppwidth=8 -lmppnodes=\"28-95\"
       % aprun -n 8 -cc numa_node ./xthi | sort
       Application 225113 resources: utime ~0s, stime ~0s
       Hello from rank 0, thread 0, on nid00028. (core affinity = 0-3)
       Hello from rank 1, thread 0, on nid00028. (core affinity = 0-3)
       Hello from rank 2, thread 0, on nid00028. (core affinity = 0-3)
       Hello from rank 3, thread 0, on nid00028. (core affinity = 0-3)
       Hello from rank 4, thread 0, on nid00028. (core affinity = 4-7)
       Hello from rank 5, thread 0, on nid00028. (core affinity = 4-7)
       Hello from rank 6, thread 0, on nid00028. (core affinity = 4-7)
       Hello from rank 7, thread 0, on nid00028. (core affinity = 4-7)


     For Cray XT4 nodes, the PEs are bound to NUMA-node 0 of nodes 40 and
     41:

       % cnselect coremask.eq.15
       40-45
       % qsub -I -lmppwidth=8 -lmppnodes=\"40-45\"
       % aprun -n 8 -cc numa_node ./xthi | sort
       Application 225114 resources: utime ~0s, stime ~0s
       Hello from rank 0, thread 0, on nid00040. (core affinity = 0-3)
       Hello from rank 1, thread 0, on nid00040. (core affinity = 0-3)
       Hello from rank 2, thread 0, on nid00040. (core affinity = 0-3)
       Hello from rank 3, thread 0, on nid00040. (core affinity = 0-3)
       Hello from rank 4, thread 0, on nid00041. (core affinity = 0-3)
       Hello from rank 5, thread 0, on nid00041. (core affinity = 0-3)
       Hello from rank 6, thread 0, on nid00041. (core affinity = 0-3)
       Hello from rank 7, thread 0, on nid00041. (core affinity = 0-3)


     The following command specifies no binding; CNL can migrate threads
     among all the CPUs of node 28:

       % aprun -n 8 -cc none ./xthi | sort
       Application 225116 resources: utime ~0s, stime ~0s
       Hello from rank 0, thread 0, on nid00028. (core affinity = 0-7)
       Hello from rank 1, thread 0, on nid00028. (core affinity = 0-7)
       Hello from rank 2, thread 0, on nid00028. (core affinity = 0-7)
       Hello from rank 3, thread 0, on nid00028. (core affinity = 0-7)
       Hello from rank 4, thread 0, on nid00028. (core affinity = 0-7)
       Hello from rank 5, thread 0, on nid00028. (core affinity = 0-7)
       Hello from rank 6, thread 0, on nid00028. (core affinity = 0-7)
       Hello from rank 7, thread 0, on nid00028. (core affinity = 0-7)


      Example 6: Optimizing NUMA-node memory references (-S option)

      Example 7: Optimizing NUMA-node memory references (-sl option)

      This example uses the -sl option to run all PEs on NUMA node 0; the
      PEs cannot allocate remote-NUMA-node memory:

       % aprun -n 8 -sl 0 ./xthi | sort
       Application 225118 resources: utime ~0s, stime ~0s
       Hello from rank 0, thread 0, on nid00028. (core affinity = 0)
       Hello from rank 1, thread 0, on nid00028. (core affinity = 1)
       Hello from rank 2, thread 0, on nid00028. (core affinity = 2)
       Hello from rank 3, thread 0, on nid00028. (core affinity = 3)
       Hello from rank 4, thread 0, on nid00029. (core affinity = 0)
       Hello from rank 5, thread 0, on nid00029. (core affinity = 1)
       Hello from rank 6, thread 0, on nid00029. (core affinity = 2)
       Hello from rank 7, thread 0, on nid00029. (core affinity = 3)


     This example runs all PEs on NUMA node 1:

       % aprun -n 8 -sl 1 ./xthi | sort
       Application 225119 resources: utime ~0s, stime ~0s
       Hello from rank 0, thread 0, on nid00028. (core affinity = 4)
       Hello from rank 1, thread 0, on nid00028. (core affinity = 5)
       Hello from rank 2, thread 0, on nid00028. (core affinity = 6)
       Hello from rank 3, thread 0, on nid00028. (core affinity = 7)
       Hello from rank 4, thread 0, on nid00029. (core affinity = 4)
       Hello from rank 5, thread 0, on nid00029. (core affinity = 5)
       Hello from rank 6, thread 0, on nid00029. (core affinity = 6)
       Hello from rank 7, thread 0, on nid00029. (core affinity = 7)


     Example 8: Optimizing NUMA node-memory references (-sn option)

     This example runs four PEs on NUMA node 0 of node 28 and four PEs on
     NUMA node 0 of node 29:

       % aprun -n 8 -sn 1 ./xthi | sort
       Application 225120 resources: utime ~0s, stime ~0s
       Hello from rank 0, thread 0, on nid00028. (core affinity = 0)
       Hello from rank 1, thread 0, on nid00028. (core affinity = 1)
       Hello from rank 2, thread 0, on nid00028. (core affinity = 2)
       Hello from rank 3, thread 0, on nid00028. (core affinity = 3)
       Hello from rank 4, thread 0, on nid00029. (core affinity = 0)
       Hello from rank 5, thread 0, on nid00029. (core affinity = 1)
       Hello from rank 6, thread 0, on nid00029. (core affinity = 2)
       Hello from rank 7, thread 0, on nid00029. (core affinity = 3)


     Example 9: Optimizing NUMA-node memory references (-ss option)

     When the -ss option is used, a PE can allocate only the memory local
     to its assigned NUMA node. The default is to allow remote-NUMA-node
     memory allocation to all assigned NUMA nodes. For example, by default
     any PE running on NUMA node 0 can allocate NUMA node 1 memory.
       Hello from rank 7, thread 0, on nid00028. (core affinity = 7)


     Example 10: Memory per PE (-m option)

     The -m option can affect application placement. This example runs all
     PEs on node 43. The amount of memory available per PE is 4000 MB:

       % aprun -n 8 -m4000m ./xthi | sort
       Application 225122 resources: utime ~0s, stime ~0s
       Hello from rank 0, thread 0, on nid00043. (core affinity = 0)
       Hello from rank 1, thread 0, on nid00043. (core affinity = 1)
       Hello from rank 2, thread 0, on nid00043. (core affinity = 2)
       Hello from rank 3, thread 0, on nid00043. (core affinity = 3)
       Hello from rank 4, thread 0, on nid00043. (core affinity = 4)
       Hello from rank 5, thread 0, on nid00043. (core affinity = 5)
       Hello from rank 6, thread 0, on nid00043. (core affinity = 6)
       Hello from rank 7, thread 0, on nid00043. (core affinity = 7)


     In this example, node 43 does not have enough memory to fulfill the
     request for 4001 MB per PE. PE 7 runs on node 44:

       % aprun -n 8 -m4001 ./xthi | sort
       Application 225123 resources: utime ~0s, stime ~0s
       Hello from rank 0, thread 0, on nid00043. (core affinity = 0)
       Hello from rank 1, thread 0, on nid00043. (core affinity = 1)
       Hello from rank 2, thread 0, on nid00043. (core affinity = 2)
       Hello from rank 3, thread 0, on nid00043. (core affinity = 3)
       Hello from rank 4, thread 0, on nid00043. (core affinity = 4)
       Hello from rank 5, thread 0, on nid00043. (core affinity = 5)
       Hello from rank 6, thread 0, on nid00043. (core affinity = 6)
       Hello from rank 7, thread 0, on nid00044. (core affinity = 0)


     Example 11: Using huge pages (-m h and hs suffixes)

     This example requests 4000 MB of huge pages per PE:

       % cc -o xthi xthi.c -lhugetlbfs
       % HUGETLB_MORECORE=yes aprun -n 8 -m4000h ./xthi | sort
       Application 225124 resources: utime ~0s, stime ~0s
       Hello from rank 0, thread 0, on nid00043. (core affinity = 0)
       Hello from rank 1, thread 0, on nid00043. (core affinity = 1)
       Hello from rank 2, thread 0, on nid00043. (core affinity = 2)
       Hello from rank 3, thread 0, on nid00043. (core affinity = 3)
       Hello from rank 4, thread 0, on nid00043. (core affinity = 4)
       Hello from rank 5, thread 0, on nid00043. (core affinity = 5)
       Hello from rank 6, thread 0, on nid00043. (core affinity = 6)
       Hello from rank 7, thread 0, on nid00043. (core affinity = 7)



     The following example terminates because the required 4000 MB of huge
     pages per PE are not available:

       % aprun -n 8 -m4000hs ./xthi | sort
       [NID 00043] 2009-04-09 07:58:28 Apid 379231: unable to acquire enough huge memory: desired 32000M, actual 31498M


     Example 12: Using node lists (-L option)

     You can specify candidate node lists through the aprun -L option for
     applications launched interactively and through the qsub -lmppnodes
     option for batch and interactive batch jobs.

     For an application launched interactively, use the cnselect command to
      get a list of all Cray XT5 compute nodes. Then use the aprun -L
      option to specify the candidate list:

       % cnselect coremask.eq.255
       28-95
       % aprun -n 4 -N 2 -L 28-95 ./xthi | sort
       Application 225127 resources: utime ~0s, stime ~0s
       Hello from rank 0, thread 0, on nid00028. (core affinity = 0)
       Hello from rank 1, thread 0, on nid00028. (core affinity = 1)
       Hello from rank 2, thread 0, on nid00029. (core affinity = 0)
       Hello from rank 3, thread 0, on nid00029. (core affinity = 1)


     Example 13: Bypassing binary transfer (-b option)

      This aprun command runs a shell on a compute node to find references
      to MemTotal in the compute node file /proc/meminfo:

       % aprun -b /bin/ash -c "cat /proc/meminfo |grep MemTotal"
       MemTotal:     32909204 kB


     For further information about the commands you can use with the aprun
     -b option, see Workload Management and Application Placement for the
     Cray Linux Environment.

SEE ALSO
     intro_alps(1), apkill(1), apstat(1), cnselect(1), qsub(1)

     CC(1), cc(1), ftn(1)

     Workload Management and Application Placement for the Cray Linux
     Environment

     Cray Application Developer's Environment User's Guide
