
Advanced Running Jobs Options

Running xfer Jobs

The intended use of the xfer queue is to transfer data between Cori and HPSS. The xfer jobs run on one of the login nodes and are therefore free of charge. If you want to transfer data to the HPSS archive system at the end of a job, you can submit an xfer job at the end of your batch job script via "sbatch -M escori hsi put <my_files>", so that you are not charged for the duration of the data transfer. The xfer jobs can be monitored via "squeue -M escori". Do not run computational jobs in the xfer queue.

Below is a sample xfer job script:

#!/bin/bash -l
#SBATCH -M escori
#SBATCH -p xfer
#SBATCH -t 12:00:00
#SBATCH -J my_transfer
#SBATCH -L SCRATCH

#Archive run01 to HPSS
htar -cvf run01.tar run01

Note that there is no "#SBATCH -N nodes" line in the above example. xfer jobs that specify "-N nodes" will be rejected at submission time. "-C haswell" is also not needed, since the job does not run on compute nodes. By default, xfer jobs are allocated about 2 GB of memory. If you are archiving larger files, you will need to request more memory by adding "#SBATCH --mem=XGB" to the above script (5 - 10 GB is a good starting point for large files).
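For example, when archiving large files you might add a line such as the following to the script above (10 GB is just an illustrative starting value):

#SBATCH --mem=10GB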

To monitor your xfer jobs, please use the "squeue -M escori" command, or "scontrol -M escori show job job_id". 
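To tie this together, here is a hedged sketch of a compute job that hands off archiving to the xfer queue in its last line, assuming the htar script above has been saved as xfer.sl (a hypothetical file name; the compute executable is also a placeholder):

#!/bin/bash -l
#SBATCH -p regular
#SBATCH -N 1
#SBATCH -t 02:00:00
#SBATCH -L SCRATCH
#SBATCH -C haswell

srun -n 32 ./my_simulation.exe      # main computation (placeholder executable)

# Submit the archiving step to the xfer queue; it runs on a login node
# after this job's computation, so the transfer time is not charged.
sbatch -M escori xfer.sl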

Running bigmem Jobs

There are two nodes with 750 GB of memory each that can be used for jobs requiring very high memory per node. With only two such nodes, this resource is limited and should be used only for jobs that genuinely need the memory. To make these nodes useful to more users at once, they can be shared among users. If you need to run with multiple threads, you will need to request the whole node: add "#SBATCH --exclusive" to your script and the "-c 32" flag to your srun call (a whole-node sketch follows the single-core example below). Below is a sample bigmem script for a job that needs only one core:

#!/bin/bash -l
#SBATCH -M escori
#SBATCH -p bigmem
#SBATCH -N 1
#SBATCH -t 01:00:00
#SBATCH -J my_big_job
#SBATCH -L SCRATCH
#SBATCH --mem=250GB

srun -N 1 -n 1 my_big_executable
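If your job instead needs multiple threads and therefore the whole node, here is a minimal sketch following the note above (the thread count and executable name are illustrative):

#!/bin/bash -l
#SBATCH -M escori
#SBATCH -p bigmem
#SBATCH -N 1
#SBATCH --exclusive
#SBATCH -t 01:00:00
#SBATCH -J my_big_threaded_job
#SBATCH -L SCRATCH

# --exclusive grants the whole node; give the single task all 32 CPUs
export OMP_NUM_THREADS=32   # assumes an OpenMP code; adjust as needed
srun -N 1 -n 1 -c 32 my_big_threaded_executable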

To monitor your bigmem jobs, please use the "squeue -M escori" command, or "scontrol -M escori show job job_id".

Running Multiple Parallel Jobs While Sharing Nodes

Under certain scenarios, you might want two or more independent applications running simultaneously on each compute node allocated to your job. For example, a pair of applications may interact in a client-server fashion via some on-node IPC mechanism (e.g., shared memory) but must be launched in distinct MPI communicators.

This latter constraint would mean that MPMD mode (see below) is not an appropriate solution, since although MPMD can allow multiple executables to share compute nodes, the executables will also share an MPI_COMM_WORLD at launch.

SLURM can allow multiple executables launched with concurrent srun calls to share compute nodes as long as the sum of the resources assigned to each application does not exceed the node resources requested for the job. Importantly, you cannot over-allocate the CPU, memory, or "craynetwork" resource. While the former two are self-explanatory, the latter refers to limitations imposed on the number of applications per node that can simultaneously use the Aries interconnect, which is currently limited to 4.

Here is a quick example of an sbatch script that uses two compute nodes and runs two applications concurrently. One application uses 8 physical cores on each node, while the other uses 24 per node. The number of CPUs per node is again controlled with the "-n" and "-N" flags, and the amount of memory per node with the "--mem" flag. To specify the "craynetwork" resource, we use the "--gres" flag, which is available in both "sbatch" and "srun".

#!/bin/bash -l

#SBATCH -p regular
#SBATCH -N 2
#SBATCH -t 12:00:00
#SBATCH --gres=craynetwork:2
#SBATCH -L SCRATCH
#SBATCH -C haswell

srun -N 2 -n 16 -c 2 --mem=51200 --gres=craynetwork:1 ./exec_a &
srun -N 2 -n 48 -c 2 --mem=61440 --gres=craynetwork:1 ./exec_b &
wait 

This example is quite similar to the multiple srun job shown here, with the following exceptions:

  1. For our sbatch job, we have requested "--gres=craynetwork:2" which will allow us to run up to two applications simultaneously per compute node.
  2. In our srun calls, we have explicitly defined the maximum amount of memory available to each application per node with "--mem" (in this example 50 and 60 GB, respectively) such that the sum is less than the resource limit per node (roughly 122 GB).
  3. In our srun calls, we have also explicitly used one of the two requested craynetwork resources per call.

Using this combination of resource requests, we are able to run multiple parallel applications per compute node.

One additional observation: when calling srun, it is permitted to specify "--gres=craynetwork:0" which will not count against the craynetwork resource. This is useful when, for example, launching a bash script or other application that does not use the interconnect.
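For example, a lightweight launcher or post-processing step that does not touch the interconnect (a hypothetical postprocess.sh here) could be started without consuming one of the requested craynetwork slots; its CPUs and memory must, of course, still fit within what the node has left:

srun -N 1 -n 1 -c 2 --mem=1024 --gres=craynetwork:0 ./postprocess.sh &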

We don't currently anticipate this being a common use case, but if your application(s) do employ this mode of operation it would be appreciated if you let us know.          

Use the "realtime" Partition for Realtime Jobs

The "realtime" partition is used for running jobs with the need of getting realtime turnaround time. Use of this partition requires special approval.  The "realtime" Queue Request Form can be found here.

The realtime partition is a user-selectable shared partition, meaning you can request either exclusive node access (with the "#SBATCH --exclusive" flag) or allow multiple applications to share a node (with the "#SBATCH --share" flag).

If exclusive node access is not needed, it is recommended to use the "--share" flag for realtime jobs so that more jobs can be scheduled in the partition.

#!/bin/bash -l

#SBATCH -p realtime
#SBATCH --share
#SBATCH -N 4
#SBATCH -n 12
#SBATCH -t 01:00:00
#SBATCH -J my_job
#SBATCH -L project
#SBATCH -C haswell

srun -N 4 -n 12 -c 1 ./mycode.exe   # pure MPI, 12 MPI tasks
#or
export OMP_NUM_THREADS=3
srun -N 4 -n 4 -c 3 ./mycode.exe  # hybrid MPI/OpenMP, 4 MPI tasks, 3 OpenMP threads per task

If you are requesting only a portion of a single node, please add "--gres=craynetwork:0" as shown below to allow more jobs to be scheduled on the node. Similar to using the "shared" partition, you can request a number of slots on the node (a Haswell node has 64 logical CPUs, i.e., 64 slots) with the "-n" and/or "--mem" flags. Note that SLURM counts logical CPUs, and each physical core counts as 2 CPUs on Haswell nodes, so request double the number of physical cores you need with the "-n" flag; for example, "#SBATCH -n 8" gives you access to 4 physical cores. (The "-c" and "--mem-per-cpu" flags can be used instead; see the non-MPI example further below.)

#!/bin/bash -l

#SBATCH -p realtime
#SBATCH --share
#SBATCH --gres=craynetwork:0
#SBATCH -n 8
#SBATCH --mem=10GB
#SBATCH -t 01:00:00
#SBATCH -J my_job
#SBATCH -L SCRATCH,project
#SBATCH -C haswell

export OMP_NUM_THREADS=4
srun -n 2 -c 4 ./mycode.exe

If you do not use MPI, you can instead use the "-c" and "--mem-per-cpu" flags to request the number of physical cores and the memory per CPU your job needs. Again, each physical core counts as 2 CPUs in SLURM, so to request 4 physical cores, use "#SBATCH -c 8".

#!/bin/bash -l

#SBATCH -p realtime
#SBATCH --share
#SBATCH --gres=craynetwork:0
#SBATCH -c 8
#SBATCH --mem-per-cpu=2GB
#SBATCH -t 01:00:00
#SBATCH -J my_job
#SBATCH -L SCRATCH,project
#SBATCH -C haswell

./mycode.exe

The following example requests 4 nodes with exclusive access in the "realtime" partition:

#!/bin/bash -l

#SBATCH -p realtime
#SBATCH --exclusive
#SBATCH -N 4
#SBATCH -t 01:00:00
#SBATCH -J my_job
#SBATCH -L project
#SBATCH -C haswell

srun -n 128 -c 2 ./my_exe # -c is optional since it is fully packed MPI
# or
export OMP_NUM_THREADS=8
srun -n 16 -c 16 ./mycode.exe

Running a Job on Specific Nodes

The following job script shows how to request specific nodes to run your job on.

#!/bin/bash -l
#SBATCH -p regular
#SBATCH -t 00:30:00
#SBATCH -N 4
#SBATCH -w "nid00[029-031],nid00036"
#SBATCH -J my_job
#SBATCH -o my_job.o%j
#SBATCH -L project,scratch3
#SBATCH -C haswell

srun -n 64 -c 4 ./a.out

Running Executables Built with Intel MPI

Applications built with Intel MPI can be launched via srun in the SLURM batch script on Cori compute nodes. Below is a sample compile and run script:

#!/bin/bash -l

#SBATCH -p regular      
#SBATCH -N 8  
#SBATCH -t 03:00:00
#SBATCH -L project,SCRATCH  
#SBATCH -C haswell

module load impi    
mpiicc -qopenmp -o mycode.exe mycode.c

export OMP_NUM_THREADS=8
export OMP_PROC_BIND=spread
export OMP_PLACES=threads

export I_MPI_FABRICS=ofi
export I_MPI_OFI_PROVIDER=gni
export I_MPI_OFI_LIBRARY=/usr/common/software/libfabric/1.5.0/gnu/lib/libfabric.so
export I_MPI_PMI_LIBRARY=/usr/lib64/slurmpmi/libpmi.so

srun -n 32 -c 16 ./mycode.exe

Note that the I_MPI_PMI_LIBRARY environment variable must be set to instruct Intel MPI to use SLURM's PMI library.

Configuring MPI Process Placement on Compute Nodes

The MPI implementations supported on Cori and Edison provide tools for controlling MPI process placement on compute nodes. This can be beneficial if an application tends to favor a particular type of process communication pattern, e.g., nearest-neighbor. Processes which communicate with each other more often than with others can be placed onto the same node in order to reduce the volume of MPI messages which must travel over the interconnect.

The Cray compiler wrappers "cc", "CC", and "ftn" will link applications to Cray MPI, which is derived from MPICH. By default, Cray MPI spreads MPI processes as widely across nodes as possible, and will try to place consecutive tasks on the same node, e.g.,

user@nid01041:~> srun -n 8 -c 2 check-mpi.intel.cori|sort -nk 4 
Hello from rank 0, on nid01041. (core affinity = 0-63)
Hello from rank 1, on nid01041. (core affinity = 0-63)
Hello from rank 2, on nid01111. (core affinity = 0-63)
Hello from rank 3, on nid01111. (core affinity = 0-63)
Hello from rank 4, on nid01118. (core affinity = 0-63)
Hello from rank 5, on nid01118. (core affinity = 0-63)
Hello from rank 6, on nid01282. (core affinity = 0-63)
Hello from rank 7, on nid01282. (core affinity = 0-63)

However, one can use the MPICH_RANK_REORDER_METHOD environment variable to specify other types of MPI task placement. For example, setting it to 0 results in a round-robin placement:

user@nid01041:~> MPICH_RANK_REORDER_METHOD=0 srun -n 8 -c 2 check-mpi.intel.cori|sort -nk 4 
Hello from rank 0, on nid01041. (core affinity = 0-63)
Hello from rank 1, on nid01111. (core affinity = 0-63)
Hello from rank 2, on nid01118. (core affinity = 0-63)
Hello from rank 3, on nid01282. (core affinity = 0-63)
Hello from rank 4, on nid01041. (core affinity = 0-63)
Hello from rank 5, on nid01111. (core affinity = 0-63)
Hello from rank 6, on nid01118. (core affinity = 0-63)
Hello from rank 7, on nid01282. (core affinity = 0-63)

There are other modes available with the MPICH_RANK_REORDER_METHOD environment variable, including one which lets the user provide a file called "MPICH_RANK_ORDER" which contains a list of each task's placement on each node. These options are described in detail in the "intro_mpi" man page on Cori and Edison.

For MPI applications which perform a large amount of nearest-neighbor communication, e.g., stencil-based applications on structured grids, Cray provides a tool in the "perftools-base" module called "grid_order" which can generate an MPICH_RANK_ORDER file automatically, taking as parameters the dimensions of the grid, core count, etc. For example, to place MPI tasks in row-major order on a Cartesian grid of size (4, 4, 4), using 32 tasks per node on Cori, the result is:

user@edison02:~> grid_order -R -c 32 -g 4,4,4
# grid_order -R -Z -c 32 -g 4,4,4
# Region 3: 0,0,1 (0..63)
0,1,2,3,16,17,18,19,32,33,34,35,48,49,50,51,4,5,6,7,20,21,22,23,36,37,38,39,52,53,54,55
8,9,10,11,24,25,26,27,40,41,42,43,56,57,58,59,12,13,14,15,28,29,30,31,44,45,46,47,60,61,62,63

One can then save this output to a file called "MPICH_RANK_ORDER" and then set MPICH_RANK_REORDER_METHOD=3 before running the job, which tells Cray MPI to read the MPICH_RANK_ORDER file to set the MPI task placement. For more information, please see the man page "grid_order" (available when the 'perftools-base' module is loaded) on Cori and Edison.
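Putting these pieces together, here is a hedged sketch of a batch script that generates the rank order file and uses it (the grid dimensions, task counts, and executable name are placeholders):

#!/bin/bash -l
#SBATCH -p regular
#SBATCH -N 2
#SBATCH -t 01:00:00
#SBATCH -C haswell

# grid_order comes with the perftools-base module
module load perftools-base

# Row-major placement for a 4x4x4 grid with 32 tasks per node,
# written to a file named MPICH_RANK_ORDER in the working directory
grid_order -R -c 32 -g 4,4,4 > MPICH_RANK_ORDER

# Method 3 tells Cray MPI to read MPICH_RANK_ORDER for task placement
export MPICH_RANK_REORDER_METHOD=3
srun -n 64 -c 2 ./my_stencil_code.exe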

One can also compile and run applications on Edison and Cori using Intel MPI instead of Cray MPI. While Intel MPI provides its own set of tools for controlling MPI task placement, most are not available on Edison and Cori due to the requirement that jobs launch with Slurm's "srun" instead of Intel MPI's native "mpirun". The default MPI task placement of Intel MPI is the same as Cray MPI: it attempts to spread the tasks as widely across nodes as possible, while grouping consecutive tasks together on the same node.

Running Executables Built with OpenMPI

Applications can be built on Cori with OpenMPI, which is based on libfabric. Below is a sample compile and run script:

#!/bin/bash -l

#SBATCH -p regular      
#SBATCH -N 8          
#SBATCH -t 03:00:00          
#SBATCH -L project
#SBATCH -C haswell

module load openmpi/2.0.3

mpicc -qopenmp -o mycode.exe mycode.c
export OMPI_MCA_pml=cm 
export OMPI_MCA_btl=self,vader,tcp 

export OMP_NUM_THREADS=8
srun -n 32 -c 16 ./mycode.exe   

Running CCM Jobs

By CCM jobs, we mostly mean single-node third-party applications, or multi-node applications that need SSH access between compute nodes. These jobs can run without srun. Below is a sample CCM job batch script:

#!/bin/bash -l

#SBATCH -p regular
#SBATCH -N 8
#SBATCH -n 256
#SBATCH -t 03:00:00
#SBATCH -J my_job
#SBATCH -L SCRATCH
#SBATCH -C haswell

# Launch 3rd party executable directly such as:
./mycode.exe

Running MPMD (Multiple Program Multiple Data) Jobs

To run MPMD jobs under SLURM, use the --multi-prog option with a configuration file, per the specifications and example below:

--multi-prog
    Run a job with different programs and different arguments for each task. In this case, the executable program specified is actually a configuration file specifying the executable and arguments for each task. Comments in the configuration file must have a "#" in column one. The configuration file contains the following fields, separated by white space:

    Task rank
        One or more task ranks to use this configuration. Multiple values may be comma separated. Ranges may be indicated with two numbers separated with a '-' with the smaller number first (e.g. "0-4" and not "4-0"). To indicate all tasks not otherwise specified, specify a rank of '*' as the last line of the file. If an attempt is made to initiate a task for which no executable program is defined, the following error message will be produced: "No executable program specified for this task".

    Executable
        The name of the program to execute. May be a fully qualified pathname if desired.

    Arguments
        Program arguments. The expression "%t" will be replaced with the task's number. The expression "%o" will be replaced with the task's offset within this range (e.g. a configured task rank value of "1-5" would have offset values of "0-4"). Single quotes may be used to avoid having the enclosed values interpreted. This field is optional. Any arguments for the program entered on the command line will be added to the arguments specified in the configuration file.
# Config file:

% cat mpmd.conf
0-35 ./a.out
36-96 ./b.out
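
If per-task arguments are needed, the "%t" and "%o" expressions described above can be used. A hypothetical variant of the configuration file might look like:

% cat mpmd_args.conf
# ranks 0-35 run a.out and receive their global rank as an argument
0-35  ./a.out %t
# ranks 36-96 run b.out and receive their offset within this range (0-60)
36-96 ./b.out %o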

# Batch Script:

#!/bin/bash -l

#SBATCH -p regular
#SBATCH -N 5
#SBATCH -n 97  # total of 97 tasks
#SBATCH -t 02:00:00
#SBATCH -C haswell

srun --multi-prog ./mpmd.conf

Please note that the MPMD components (a.out and b.out above) share MPI_COMM_WORLD, so this run method is not intended for running multiple copies of the same application simultaneously just to increase throughput.

Using Shifter to Run Custom Environments

Shifter is an open-source software stack that enables users to run custom environments on HPC systems. It is designed to be compatible with the popular Docker container format, so users can easily run Docker containers on NERSC systems. Information on running jobs with Shifter can be found here.
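As a brief sketch of what such a job typically looks like (the image name and executable are placeholders; see the Shifter documentation for the authoritative details), an image is requested with the --image directive and the application is wrapped with the shifter command:

#!/bin/bash -l
#SBATCH -p regular
#SBATCH -N 2
#SBATCH -t 01:00:00
#SBATCH -C haswell
#SBATCH --image=docker:myrepo/myimage:latest

# Each task runs inside the user-specified container environment
srun -n 64 shifter ./mycode.exe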

Running Jobs Under a Reservation

NERSC provides a service that allows users with special needs to run jobs under a reservation, which reserves a number of compute nodes for a certain time period. For instructions on how to request a compute node reservation, please check this website.

To run jobs under a reservation, add the #SBATCH --reservation=<reservation name> flag to your job script. Here is an example of running a job under a reservation named "benchmarking", assuming 200 Haswell nodes have been reserved for you under the repository m1759 for 6 hours.

#!/bin/bash -l

#SBATCH -p regular
#SBATCH -N 200
#SBATCH -t 03:00:00
#SBATCH -J test
#SBATCH -L SCRATCH
#SBATCH -C haswell
#SBATCH -A m1759
#SBATCH --reservation=benchmarking

srun -n 6400 ./a.out

The #SBATCH -A directive is required if your default repo is not the one for which your reservation was made. The details of your reservation can be found with the following command:

scontrol show reservation [reservation name]

The reservation name is optional in the above command. If the reservation name is not provided, it will display all the reservations currently on the system.

If you need to run jobs interactively, use the --reservation=<reservation name> flag with your salloc command (or srun if running interactively).

salloc -N 200 -p regular -t 3:00:00 -C haswell -A m1759 --reservation=benchmarking

Running with Different KNL Modes

KNL nodes have the option of being configured at boot-time in a variety of sub-NUMA clustering (SNC) modes and memory modes. Depending on the mode, a single KNL node can appear to the OS as having 1 (“quad” mode), 2 (“snc2” mode) or 4 (“snc4” mode) NUMA domains. It is also possible to configure the MCDRAM as a direct-map cache (“cache” mode) or to expose it as a directly accessible memory domain separate from the DDR (“flat” mode). More information about KNL modes can be found here.

To request running on KNL with a specific mode, the following line should be added to the batch script:

#SBATCH -C knl,clustering_mode,memory_mode

such as:
#SBATCH -C knl,quad,cache

or:
#SBATCH -C knl,quad,flat
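For example, here is a minimal sketch of a batch script requesting quad,flat mode (the task counts and executable name are placeholders):

#!/bin/bash -l
#SBATCH -p regular
#SBATCH -N 2
#SBATCH -t 02:00:00
#SBATCH -L SCRATCH
#SBATCH -C knl,quad,flat

# In quad,flat mode the MCDRAM appears as a separate NUMA node (node 1);
# using numactl to prefer it is one common way to place data there
# (an assumption in this sketch, not a requirement).
srun -n 128 -c 4 numactl --preferred=1 ./mycode.exe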

About 3,000 KNL compute nodes are allowed to reboot at job start time; the rest of the nodes are fixed in the quad,cache mode. The default mode is quad,cache, meaning that requesting nodes with just "-C knl" is equivalent to requesting "-C knl,quad,cache".

To check how many KNL nodes are currently available and idle in each cluster mode, the following sample "sinfo" command can be used:

cori11% sinfo -o "%.20b %.20F"
ACTIVE_FEATURES NODES(A/I/O/T)
knl 0/0/5/5
haswell 1939/430/19/2388
quad,cache,knl 3137/26/19/3182
quad,flat,knl 1/0/0/1
knl,cache,quad 6196/220/76/6492
knl,flat,quad 0/2/6/8

Here, Column 1 shows the available compute node features (Haswell or the various KNL cluster modes), and Column 2 shows the number of nodes that are Allocated/Idle/Other/Total with those features. Notice that the order of the clustering and memory modes in ACTIVE_FEATURES does not matter, so the overall "knl,quad,cache" nodes are the sum of those listed under "knl", "quad,cache,knl", and "knl,cache,quad".

Slurm allocates KNL nodes to a batch job without advance consideration of which KNL mode is requested. If some of the nodes allocated to your job are not already in the requested KNL mode, they will be rebooted before the job starts to run. While the nodes reboot, the job is in the "CF" state; rebooting can take about 25 minutes or longer, during which the job appears to be doing nothing because it has not yet started. The reboot time is not counted against the requested wall time; however, the job is charged for the entire reboot plus run time.

More information can be found on the Running Jobs on Cori FAQ page.