
Batch Jobs

Overview

Batch jobs run non-interactively under the control of a batch script -- a text file containing a number of job directives and Linux commands or utilities. Batch scripts are submitted to the batch system, where they are queued awaiting free resources. The batch scheduler/resource manager for Cori is SLURM.

Directives and Commands

Batch script directives tell SLURM how to run the job. You specify things like the number and kind of nodes to use and the time your job will run. Directive lines in the script start with the keyword #SBATCH. A minimal set of keywords is shown in the Bare-bones batch script example below.

When your job runs, Cori will execute the commands in the batch script as if it were a plain shell script (you specify the shell at the top of the script; see the example below). These commands execute serially on a single node (the lowest-numbered node SLURM allocates to your job). To run anything in parallel, you must use the "srun" command to launch multiple instances of your executable. More information about SLURM directives and srun is available below.
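As a minimal illustration (the module name and executable below are placeholders, not from this page), the script body below contrasts the two: the cd and module commands run once, on the first allocated node, while srun launches parallel copies of the executable across the allocation.

cd $SLURM_SUBMIT_DIR       #runs serially on the first node of the allocation
module load my_module      #placeholder module name; also runs serially
srun -n 64 ./my_executable #srun launches 64 parallel instances across the allocated nodes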

Submitting a Batch Script for Execution

Use the "sbatch" command to submit your script to the system. It will be placed in a queue and it will run when resources are available. See "Submitting Jobs" below.

Bare-bones Batch Script

A very simple SLURM batch script to run a job on the Haswell nodes of Cori looks like this: 

#!/bin/bash -l

#SBATCH -N 2 #Use 2 nodes
#SBATCH -t 00:30:00 #Set 30 minute time limit
#SBATCH -p regular #Submit to the regular 'partition'
#SBATCH -L SCRATCH #Job requires $SCRATCH file system
#SBATCH -C haswell #Use Haswell nodes

srun -n 32 -c 4 ./my_executable

Alternatively, the same script using the long format of SLURM keywords is shown below:

#!/bin/bash -l

#SBATCH --nodes=2
#SBATCH --time=00:30:00
#SBATCH --partition=regular
#SBATCH --license=SCRATCH #note: specify licenses for the file systems your job needs, such as SCRATCH,project
#SBATCH --constraint=haswell

srun -n 32 -c 4 ./my_executable
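To see how the srun options match the request: 32 tasks spread over 2 Haswell nodes is 16 tasks per node, and each Haswell node presents 64 logical cores (32 physical cores with 2 hyperthreads each), so -c is set to 64 / 16 = 4 logical cores per task.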

 The example above contains:

  • The Shell: SLURM batch scripts require users to specify which shell to use. In this case we use bash by specifying #!/bin/bash -l. Batch scripts won't run without a shell being specified on the first line.
  • #SBATCH directives: In the example above the directives tell the scheduler to allocate two nodes for your job, for thirty minutes, in the regular partition. The "-L" or "--license=" directive specifies the file systems your job needs (more details on requesting file system licenses can be found here). The "-C" or "--constraint" directive specifies that this job requires Haswell nodes. Directives can also specify things like what to name STDOUT files, what account to charge, whether to notify you by email when your job finishes, how many tasks to run and how many tasks per node, etc.; a sketch using some of these optional directives appears after this list.
  • Note that when your job runs, it starts in the job submit directory by default. 
  • The srun command is used to start execution of your code on Cori's compute nodes. You may recall that on earlier Cray systems, we used "aprun" rather than srun. Unlike our earlier systems, Cori uses what is known as native SLURM, in which SLURM serves as both a resource manager and job scheduler for the system.
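As an illustration only (the job name, output file name, repository, and email address below are placeholders rather than values from this page), a script using several of these optional directives might look like:

#!/bin/bash -l

#SBATCH -N 2
#SBATCH -t 00:30:00
#SBATCH -p regular
#SBATCH -L SCRATCH
#SBATCH -C haswell
#SBATCH -J my_job                     #Job name (placeholder)
#SBATCH -o my_job.%j.out              #Write STDOUT to my_job.<job_id>.out
#SBATCH -A my_repo                    #Charge a hypothetical NERSC repository
#SBATCH --mail-type=END,FAIL          #Email when the job ends or fails
#SBATCH --mail-user=user@example.com  #Placeholder address

srun -n 32 -c 4 ./my_executable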

SLURM Keywords

The following lists required and useful keywords that can be used in #SBATCH directives.

Required sbatch Options/Directives

-N count (long form: --nodes=count)
    Default: one node
    Used to allocate count nodes to your job.

-t HH:MM:SS (long form: --time=HH:MM:SS)
    Default: 00:30:00
    Always specify the maximum wallclock time for your job.

-C feature (long form: --constraint=feature)
    Default: none
    Always specify what type of nodes to run on: set to "haswell" for Haswell nodes, or to "knl,quad,cache" (or another mode) for KNL nodes.

-p partition (long form: --partition=partition)
    Default: debug
    Always specify your partition, which will usually be debug for testing and regular for production runs. See "Queues and Policies".

Useful sbatch Options/Directives

--ntasks-per-node=count (no short form)
    Default: 32 (Haswell) / 68 (KNL)
    Use count MPI tasks per node.

-c count (long form: --cpus-per-task=count)
    Default: 2 (Haswell) / 4 (KNL)
    Set to the number of logical cores (CPUs) per MPI task. For example, on Cori Haswell it is 64 logical cores divided by the number of MPI tasks per node.

-L list_of_filesystems (long form: --license=list_of_filesystems)
    Default: the file systems resolved by $TMPDIR and $PWD at the time of job submission
    Specifies the file systems (as a comma-separated list) needed for your job. Available choices for Cori are: cscratch1 (or SCRATCH), dna, project, projecta, projectb. More details on requesting file system licenses can be found here.

-J job_name (long form: --job-name=job_name)
    Default: the job script name
    job_name: up to 15 printable, non-whitespace characters.

-A repo (long form: --account=repo)
    Default: your default repo
    Charge this job to the NERSC repository repo.

-e filename (long form: --error=filename)
    Default: standard error is written together with standard output to "slurm-%j.out", where "%j" is replaced with the job id
    Write STDERR to filename.

-o filename (long form: --output=filename)
    Default: both standard output and standard error are directed to a file named "slurm-%j.out", where "%j" is replaced with the job id
    Write STDOUT to filename. See the -i option for the replacement symbols that may be used in the filename.

-i filename_pattern (long form: --input=filename_pattern)
    Default: "/dev/null" is open on the batch script's standard input
    Instruct SLURM to connect the batch script's standard input directly to the file named by filename_pattern. The pattern may contain one or more replacement symbols, which are a percent sign "%" followed by a letter. Supported replacement symbols are:
        %j  Job id.
        %N  Node name. Only one file is created, so %N is replaced by the name of the first compute node in the job, which is the one that runs the script.

--mail-type=events and --mail-user=address (no short form)
    Default: no email notification
    Valid event values are: BEGIN, END, FAIL, REQUEUE, ALL (equivalent to BEGIN, END, FAIL, REQUEUE, and STAGE_OUT), STAGE_OUT (burst buffer stage out completed), TIME_LIMIT, TIME_LIMIT_90 (reached 90 percent of time limit), TIME_LIMIT_80 (reached 80 percent of time limit), and TIME_LIMIT_50 (reached 50 percent of time limit). Multiple event values may be specified in a comma-separated list. The user to be notified is indicated with --mail-user. Mail notifications on job BEGIN, END and FAIL apply to a job array as a whole rather than generating individual email messages for each task in the job array.

-D directory_name (long form: --workdir=directory_name)
    Default: the directory from which the job was submitted
    Set the working directory of the batch script to directory_name before it is executed. The path can be specified as a full path or relative to the directory where the sbatch command is executed.

--export=ALL (no short form)
    Default: ALL (this is the default behavior)
    Export the current environment variables into the batch job environment. Other options (not recommended) are a specific list of environment variables, or NONE.

SLURM directives may also be specified on the sbatch command line. Note that if you use both command-line options and script directives, the command-line options will override the corresponding options in the batch script.
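For example, the following submission (using the bare-bones script above) overrides the 30-minute time limit given in the script with a one-hour limit:

% sbatch -t 01:00:00 myscript.sl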

An expanded list of sbatch job options and keywords can be found here, but keep in mind that the page describes a generic sbatch implementation and not all options and environment variables are relevant to or defined on Cori.

The srun Command

Because Cori runs native SLURM, a large portion of the options for sbatch can also be used with srun.

For more information on srun, check our page for frequently used srun options, visit the official documentation for srun, or refer to "srun" man page.
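As a sketch only (the counts and the binding flag are illustrative; check the srun man page for the options available in your SLURM version), a typical invocation combines the total task count, CPUs per task, and CPU binding:

srun -n 32 -c 4 --cpu_bind=cores ./my_executable  #32 tasks, 4 logical CPUs per task, tasks bound to cores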

Selecting the KNL Nodes

In the bare-bones batch script above, we selected the Haswell nodes. If we want to instead select the KNL nodes, we need to change the constraint from "haswell" to "knl".

There are multiple memory modes for the KNL nodes, the details of which are covered elsewhere. If you specify only "knl" as the constraint, you will receive the default configuration, which is currently "quad cache" mode. We recommend this mode, as most codes perform well under it.

Our bare-bones batch script for running 32 MPI processes across two nodes becomes:

#!/bin/bash -l

#SBATCH -N 2 #Use 2 nodes
#SBATCH -t 00:30:00 #Set 30 minute time limit
#SBATCH -p regular #Submit to the regular 'partition'
#SBATCH -L SCRATCH #Job requires $SCRATCH file system
#SBATCH -C knl,quad,cache #Use KNL nodes in quad cache format (default, recommended)

srun -n 32 -c 16 ./my_executable
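The arithmetic differs from the Haswell case: 32 tasks over 2 KNL nodes is again 16 tasks per node, and each KNL node in quad cache mode presents 272 logical CPUs (68 physical cores with 4 hardware threads each), so -c 16 gives each task 16 logical CPUs (4 physical cores) and uses 64 of the 68 cores per node.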

Submitting Jobs

Once your batch script is ready, you can submit as below:

% sbatch myscript.sl

from the directory that contains the script file. You can specify sbatch directives as options to sbatch but we recommend putting your directives in the script instead. Then you will have a record of the directives you used, which is useful for record-keeping as well as debugging should something go wrong.

Choosing a Partition

When you submit your batch job you will usually choose one of these SLURM partitions:

    • regular : Use this for production runs.
    • debug: Use this for small, short test runs.

Additional partitions are available for a number of special circumstances that are described elsewhere. See Cori Queues and Policies for more information.

Job Output

Standard output (STDOUT) and standard error (STDERR) messages from your job are written directly to the output and error file names you specified in your batch script, or to the default output file (slurm-<jobid>.out) in your submit directory ($SLURM_SUBMIT_DIR). You can monitor these files there while your job runs.
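For example (the job id below is a placeholder), you can check the state of your jobs and follow the output of a running job from a login node:

% squeue -u $USER        #list your queued and running jobs
% tail -f slurm-5332.out #follow the output file of a running job (placeholder job id)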

Job Steps and Dependencies

Both sbatch and srun accept the option -d (or --dependency) <dependency_list> for job dependencies. The most commonly used dependency_list is afterok:jobid[:jobid...], which means the job just submitted will execute only after the listed job(s) have terminated without error. Another option is afterany:jobid[:jobid...], which means the job just submitted will execute after the listed job(s) have terminated, with or without error. The afterany form is useful for chained restart runs in which the earlier job is expected to run into its wall-clock limit.

For example, to run batch job2 only after batch job1 is completed:

cori02% sbatch job1
Submitted batch job 5332
cori02% sbatch -d afterok:5332 job2
or
cori02% sbatch -d afterany:5332 job2  

Job steps and dependencies can be used in a workflow to prepare input data for simulation or to archive output data after a simulation. See the Job Steps and Dependencies example in Example Batch Scripts.
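As a sketch of such a workflow (the script names are placeholders; --parsable is a standard sbatch option that prints only the job id, which makes it convenient for chaining), the dependency can be set up from a wrapper script:

#!/bin/bash -l
#Submit the simulation and capture its job id
jobid=$(sbatch --parsable simulate.sl)

#Submit the archiving job, to run only if the simulation completes without error
sbatch -d afterok:${jobid} archive.sl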