
Example Batch Scripts

Please note that after the merge of the Cori Phase 1 (Haswell) and Phase 2 (KNL) cabinets, the procedure for obtaining proper process and thread affinity when running jobs on Haswell has changed. Please see details at the Changes to Run on Cori Haswell After Merge page.

Note: There are two types of compute nodes on Cori: Haswell and KNL. Haswell nodes (from Cori Phase 1) are designed to accelerate data-intensive applications. Users are encouraged to run large, massively parallel jobs on Edison; jobs using at least 683 nodes on Edison receive a 40% charging discount.

General Recommendations and FAQ on Running Jobs

Please make sure to read these recommendations and the Running Jobs FAQ first, especially regarding the use of the srun "-c" and "--cpu_bind=cores" options.

Job Script Generator

An interactive Job Script Generator is available at MyNERSC to provide guidance on obtaining optimal process and thread binding for running hybrid MPI/OpenMP programs on Edison, Cori Haswell, and Cori KNL. Select your machine, number of nodes, and more, and a skeleton script will be generated for you. Please note that you will still need to add some additional information, such as the walltime and of course the name of your executable, to complete the script.

Checking Process and Thread Affinity

Pre-built binaries of a small test code, in pure MPI and hybrid MPI/OpenMP versions, can be used to check affinity. Please see details here.
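
For instance, inside a one-node Haswell allocation the checks could look like the following (we assume the pre-built binaries are named check-mpi.intel.cori and check-hybrid.intel.cori; see the linked page for the actual names and locations):

srun -n 32 -c 2 --cpu_bind=cores check-mpi.intel.cori     # fully packed pure MPI: 32 tasks on 32 cores
export OMP_NUM_THREADS=8
srun -n 4 -c 16 --cpu_bind=cores check-hybrid.intel.cori  # hybrid: 4 tasks x 8 threads, 16 CPUs per task

Each rank (and thread) should report the node and the CPUs it is bound to, which you can compare against the intended layout.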

More Example Batch Scripts for KNL

Most of the examples below are for running on Cori Haswell nodes; however, the general recommendations above apply to running on both Haswell and KNL nodes. More KNL example batch scripts can be found here.

Running Pure MPI Jobs

Basic Scripts for Fully Packed Pure MPI Jobs

A fully packed pure MPI job on Haswell uses 32 physical cores per node (when hyperthreading is not used), with 1 MPI task per core; the job below, running on 64 nodes, therefore uses 2048 cores. The example job uses the $SCRATCH file system (which is /cscratch1 on Cori). More details on requesting file system licenses can be found here. The default distribution among sockets on Haswell is cyclic. Users can add -m block:block to the srun line to use block distribution among sockets instead, if preferred.

#!/bin/bash -l

#SBATCH -p debug
#SBATCH -N 64
#SBATCH -t 00:20:00
#SBATCH -J my_job
#SBATCH -L SCRATCH
#SBATCH -C haswell

srun -n 2048 ./mycode.exe # an extra -c 2 flag is optional for fully packed pure MPI
#or
srun -n 2048 -m block:block ./mycode.exe # to use block distribution among sockets

Below is the equivalent batch script using the long SLURM keyword format:

#!/bin/bash -l

#SBATCH --partition=debug
#SBATCH --nodes=64
#SBATCH --time=00:20:00
#SBATCH --job-name=my_job
#SBATCH --license=SCRATCH
#SBATCH --constraint=haswell

srun -n 2048 ./mycode.exe # an extra -c 2 flag is optional for fully packed pure MPI

Fully Packed Hyperthreading (HT) Pure MPI Jobs

Hyperthreading is enabled by default, which means you can run up to 64 processes per node. The following job runs on 64 nodes, with 4,096 logical cores in total. When more than 32 MPI tasks per node are requested, srun binds MPI tasks to both hyperthreads of each physical core. The example job uses the $SCRATCH file system (which is /cscratch1 on Cori).

#!/bin/bash -l

#SBATCH -p debug
#SBATCH -N 64
#SBATCH -t 00:20:00
#SBATCH -J my_job
#SBATCH -L SCRATCH
#SBATCH -C haswell

srun -n 4096 ./mycode.exe # an extra -c 1 flag is optional for fully packed pure MPI with hyperthreading

 "Unpacked" Node Pure MPI Script

This example shows how to run a total of 1024 MPI tasks using only 16 MPI tasks per node (rather than 32) using 64 Haswelll nodes. Set -c to number of logical cores per MPI task, which is 64 (total logical cores) / 16 (MPI tasks per node) = 4.

#!/bin/bash -l

#SBATCH -p regular
#SBATCH -N 64
#SBATCH -t 02:00:00
#SBATCH -J my_job
#SBATCH -L SCRATCH
#SBATCH -C haswell

srun -n 1024 -c 4 ./mycode.exe
#or
srun -n 1024 --cpu_bind=cores ./mycode.exe

Running Hybrid MPI/OpenMP Applications with Pure MPI

Note for a hybrid MPI/OpenMP application built with OpenMP compiler flags enabled: if you would like to run it with pure MPI (meaning only 1 OpenMP thread), OMP_NUM_THREADS needs to be set to 1 explicitly. Otherwise it will run with 2 OpenMP threads on each physical core, since the default behavior of the Intel and GNU compilers is to use the maximum number of available threads (the Cray compiler defaults to 1 thread).

#!/bin/bash -l

#SBATCH -p regular
#SBATCH -N 64
#SBATCH -t 12:00:00
#SBATCH -L project
#SBATCH -C haswell

#to run with pure MPI
export OMP_NUM_THREADS=1
srun -n 2048 -c 2 ./my_hybrid_code.exe # -c is optional since this example is fully packed pure MPI (32 MPI tasks per node)

Running Pure OpenMP Jobs

Pure OpenMP Example

Make sure to compile your application with the appropriate OpenMP compiler flags.

#!/bin/bash -l

#SBATCH -p regular
#SBATCH -N 1
#SBATCH -t 12:00:00
#SBATCH -L SCRATCH
#SBATCH -C haswell

export OMP_NUM_THREADS=32
srun -n 1 -c 64 ./mycode.exe

Pure OpenMP with Hyperthreading Example

With HT, you can run pure OpenMP with up to 64 threads. Make sure to compile your application with the appropriate OpenMP compiler flags. The example job uses the $SCRATCH (which is /cscratch1 on Cori) and /project file systems.

#!/bin/bash -l

#SBATCH -p regular
#SBATCH -N 1
#SBATCH -t 12:00:00
#SBATCH -L SCRATCH,project
#SBATCH -C haswell

export OMP_NUM_THREADS=64
export OMP_PROC_BIND=true #"spread" is also good for Intel and CCE compilers
export OMP_PLACES=threads
srun -n 1 -c 64 ./mycode.exe

Running Hybrid MPI/OpenMP Applications

With hyperthreading enabled, SLURM sees each compute node as having a total of 64 CPUs. The -c (--cpus-per-task) flag needs to be passed to the srun (or sbatch) command to specify the number of logical cores per MPI task.

Below is a sample script for running hybrid applications. The example job uses the /project file system. Notice that the -c value is independent of the value of OMP_NUM_THREADS.

#!/bin/bash -l

#SBATCH -p regular
#SBATCH -N 64
#SBATCH -t 12:00:00
#SBATCH -L project
#SBATCH -C haswell

export OMP_PROC_BIND=true #"spread" is also good for Intel and CCE compilers
export OMP_PLACES=threads

#to run with 8 threads per task (unpacked)
export OMP_NUM_THREADS=8
srun -n 128 -c 32 ./mycode.exe # 2 MPI tasks per node, -c is set to 64/2=32

# to run with 8 threads per task (fully packed)
export OMP_NUM_THREADS=8
srun -n 256 -c 16 ./mycode.exe # 4 MPI tasks per node, -c is set to 64/4=16
 
# to run with 16 threads per task (with hyperthreading)
export OMP_NUM_THREADS=16
srun -n 256 -c 16 ./mycode.exe # 4 MPI tasks per node, -c is set to 64/4=16 

When Not All CPUs On a Node Are Used 

On Haswell, for pure MPI or hybrid MPI/OpenMP jobs, if the number of MPI tasks per node is not a divisor of 64, the number of tasks per node times the -c value does not equal 64, meaning not all CPUs on a node are used. For example, when running with 6 MPI tasks per node, the -c value should be set to floor(32/6) * 2 = 10, allocating 10 CPUs to each MPI task. A total of 6x10=60 of the 64 CPUs (30 of the 32 physical cores) will be used. In this situation, an extra --cpu_bind=cores flag is needed to achieve optimal process and thread binding.

#!/bin/bash -l

#SBATCH -p regular
#SBATCH -N 2
#SBATCH -t 2:00:00
#SBATCH -L SCRATCH,project
#SBATCH -C haswell

# for pure MPI jobs
srun -n 12 -c 10 --cpu_bind=cores ./mycode.exe

# for hybrid MPI/OpenMP jobs
export OMP_NUM_THREADS=4
export OMP_PROC_BIND=true #"spread" is also good for Intel and CCE compilers
export OMP_PLACES=threads
srun -n 12 -c 10 --cpu_bind=cores ./mycode.exe

Running Multiple Parallel Jobs Sequentially

#!/bin/bash -l

#SBATCH -p regular
#SBATCH -N 4
#SBATCH -t 12:00:00
#SBATCH -L project,cscratch1
#SBATCH -C haswell

srun -n 128 -c 2 ./a.out # -c is optional for fully packed pure MPI
srun -n 64 -c 4 ./b.out
srun -n 32 -c 8 ./c.out

Running Multiple Parallel Jobs Simultaneously

Be sure to specify the total number of nodes needed to run all jobs at the same time. By default, multiple concurrent srun executions cannot share compute nodes under SLURM in the regular partition, so make sure that the total number of cores required fits within the number of nodes requested. For more details on use cases that must escape this restriction, see the next section.

In the following example, a total of 192 cores are required, which would hypothetically fit on 192 / 32 = 6 nodes. However, because sruns cannot share nodes by default, we instead have to dedicate 2 nodes to the first execution (44 cores), 4 to the second (108 cores), and 2 to the third (40 cores). For all three executables the nodes are not fully packed and the number of MPI tasks per node is not a divisor of 64, so both the -c and --cpu_bind flags are used in the srun commands.

Notice the "&" at the end of each srun command.  Also the "wait" command at the end of the script is very important.  It makes sure the batch job won't exit before all the simultaneous sruns are completed. 

#!/bin/bash -l

#SBATCH -p regular
#SBATCH -N 8
#SBATCH -t 12:00:00
#SBATCH -L cscratch1
#SBATCH -C haswell

srun -N 2 -n 44 -c 2 --cpu_bind=cores ./a.out &
srun -N 4 -n 108 -c 2 --cpu_bind=cores ./b.out &
srun -N 2 -n 40 -c 2 --cpu_bind=cores ./c.out &
wait

Use the "shared" Partition for Serial or Small Parallel Jobs (on Haswell Nodes Only)

The "shared" partition can be used to run serial jobs or small parallel jobs.  Unlike other partitions, such as "regular" or "debug", where each node allocated for is exclusive to one specific job and executable, a "shared" node can be shared by multiple users and multiple applications. The "shared" partition is available only on the Haswell nodes.

On each Cori node there are 32 physical cores, and each core has 2 hyperthreads, so there are a total of 64 CPUs (or 64 "slots") on each node. A "shared" job can request any number of "slots" (currently up to 32 are allowed, which is enforced by the submit filter) by requesting a number of CPUs, or a specific memory requirement that is equivalent to a number of CPUs. For example: "#SBATCH -n 5" or "#SBATCH --mem=20GB".

Notice that in the "#SBATCH -n 5" example, the batch job requests 5 CPUs and will actually get 3 physical cores, i.e., a total of 6 CPUs (6 logical cores); each physical core is exclusive to one executable only. In the "#SBATCH --mem=20GB" example, the batch job requests 20 GB out of the 122 GB of application-usable memory on the node (shared by 64 slots, i.e., 1952 MB per slot or CPU), thus requesting a total of 20*64/122 = 10.5 CPUs; it will get 6 physical cores, i.e., a total of 12 CPUs (12 logical cores).
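
The slot arithmetic can be sanity-checked with shell arithmetic. Below is a minimal sketch using the numbers above; the round-up-to-whole-slots-and-cores behavior is our assumption of how the scheduler treats fractional requests:

mem_mb=$((20*1024))                               # 20 GB requested, in MB
slot_mb=1952                                      # allocatable memory per slot (CPU)
slots=$(( (mem_mb + slot_mb - 1) / slot_mb ))     # round up: 11 slots
cores=$(( (slots + 1) / 2 ))                      # round up to whole physical cores: 6
echo "${slots} slots -> ${cores} physical cores -> $((2*cores)) CPUs"

This prints "11 slots -> 6 physical cores -> 12 CPUs", matching the example above.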

Below is a sample job batch script for the shared partition to run on 6 physical cores, a total of 12 CPUs or 12 logical cores:

#!/bin/bash -l

#SBATCH -p shared
#SBATCH -n 12
#SBATCH -t 02:00:00
#SBATCH -J my_job
#SBATCH -L project
#SBATCH -C haswell

srun -n 12 -c 1 ./my_executable   # pure MPI, 12 MPI tasks
#or
export OMP_NUM_THREADS=12
./mycode.exe    # pure OpenMP, 12 OpenMP threads, notice no “srun” command is needed.
#or
export OMP_NUM_THREADS=4
srun -n 3 -c 4 ./mycode.exe  # hybrid MPI/OpenMP, 3 MPI tasks, 4 OpenMP thread per task

The "shared" partition can be used to run "serial" jobs by requesting just 1 CPU slot, or more CPU slots if you need more memory, via the "#SBATCH --mem" flag. Below is a sample "serial" job script:

#!/bin/bash -l

#SBATCH -p shared
#SBATCH -n 1
#SBATCH -t 02:00:00
#SBATCH -J my_job
#SBATCH -L SCRATCH
#SBATCH -C haswell

./serial.exe   

or:

#!/bin/bash -l

#SBATCH -p shared
#SBATCH --mem=10GB
#SBATCH -t 02:00:00
#SBATCH -J my_job
#SBATCH -L SCRATCH
#SBATCH -C haswell

./serial.exe

The "srun" command is not recommended to launch a serial executable in the "shared" partition.

Running Serial Jobs

Using the "shared" Partition (Available Only on Haswell Nodes)

Nodes used for the "shared" partition are shared among multiple users.  Executables can be built with regular Cray compiler wrappers such as ftn, cc, or CC.  Each node allocated in the shared partition contains 64 slots, each slot is 1 hyperthread core, with allocatable memory of 1952 MB.  

Sample "serial" batch scripts can be found in the previous section on the "shared" partition.

Example of submitting a serial job via the interactive batch session in the "shared" partition:

% salloc -N 1 -n 1 -p shared -L SCRATCH -C haswell

More memory (and effectively more slots) can be requested via the "--mem" option:

% salloc -N 1 -n 1 -p shared --mem=10GB -L SCRATCH -C haswell
./serial.exe

 The "srun" command is not recommended to launch a serial executable in the "shared" partition.  Notice the "-N 1" is needed for using shared partition in the interactive batch mode via salloc.

Using the "regular" Partition

Serial jobs can run in the regular partition by requesting a single node. Nodes in the exclusive partitions (regular, debug, etc.) are not shared between users, and jobs will be charged for the entire node even if only 1 core is used. A sample script is as follows:

#!/bin/bash -l

#SBATCH -p debug
#SBATCH -N 1
#SBATCH -t 01:00:00
#SBATCH -L project
#SBATCH -C haswell

srun -n 1 ./serial.exe

Using "Taskfarmer" Utility

TaskFarmer is a workflow utility developed in-house at NERSC to farm tasks onto compute nodes; these can be single- or multi-core tasks. It tracks which tasks have completed successfully, and allows straightforward re-submission of failed or un-run jobs from a task list. Please see detailed TaskFarmer usage information here.
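
A minimal sketch of a TaskFarmer batch script is shown below. Here tasks.txt is a hypothetical task list containing one command per line, and we assume the TaskFarmer module provides the runcommands.sh wrapper and THREADS variable; please consult the linked page for the authoritative interface:

#!/bin/bash -l

#SBATCH -p regular
#SBATCH -N 2
#SBATCH -t 01:00:00
#SBATCH -C haswell

module load taskfarmer
export THREADS=32          # assumed: number of tasks to run concurrently per node
runcommands.sh tasks.txt   # farm out the commands listed in tasks.txt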

Using the Burst Buffer

The Burst Buffer is a layer of flash storage located within the high-speed interconnect of Cori. It offers users an additional storage layer with exceptional I/O performance (up to 1.7 TB/s read/write and 27M I/O operations per second are possible). Users can request an allocation on the Burst Buffer on a per-job basis, or as a short-term persistent reservation. The following example script requests a 100GB allocation on the Burst Buffer of "scratch" type that lasts for the duration of the job. It also stages data in and out of the allocation: data specified in the stage_in command will be moved onto your Burst Buffer allocation before the start of the compute job, and data specified in stage_out will be moved out after the job finishes. Your Burst Buffer allocation can be accessed using the variable $DW_JOB_STRIPED. For more details on how to use the Burst Buffer, including example scripts and use cases, please see here.

#!/bin/bash -l

#SBATCH -p regular
#SBATCH -N 1
#SBATCH -C haswell
#SBATCH -t 00:05:00
#DW jobdw capacity=100GB access_mode=striped type=scratch
#DW stage_in source=/global/cscratch1/sd/username/path/to/filename destination=$DW_JOB_STRIPED/filename type=file
#DW stage_out source=$DW_JOB_STRIPED/directoryname destination=/global/cscratch1/sd/username/path/to/directoryname type=directory
srun ./a.out INSERT_YOUR_CODE_OPTIONS_HERE
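
A short-term persistent reservation, mentioned above, is attached with a different directive. Below is a minimal sketch, assuming a persistent reservation named myBBname already exists (the name is hypothetical) and assuming the DataWarp convention that it is accessed via $DW_PERSISTENT_STRIPED_<name>; see the linked page for how to create and manage persistent reservations:

#!/bin/bash -l

#SBATCH -p regular
#SBATCH -N 1
#SBATCH -C haswell
#SBATCH -t 00:05:00
#DW persistentdw name=myBBname
srun ./a.out $DW_PERSISTENT_STRIPED_myBBname/input.dat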

STDOUT and STDERR 

By default, while your job is running, standard output (STDOUT) and standard error (STDERR) are combined, and written to slurm-$SLURM_JOBID.out in your submit directory (for example: slurm-13791.out).

Also by default, stderr is unbuffered while stdout is buffered in 8 KB segments, which are appended to this file in real time as the job runs, so you can check the file's contents for easier job monitoring. IMPORTANT: Do not remove, rename, or otherwise perturb this file while the job is still running!

If you specify your own stdout/stderr file names (via the #SBATCH -o or #SBATCH -e flags), or redirect your output to a file as follows, the temporary file names will be replaced with the file names of your choice. For example, if you have

srun -n 48 -c 4 ./a.out >& my_output_file          (for csh/tcsh)
#or: 
srun -n 48 -c 4 ./a.out > my_output_file 2>&1      (for bash)

Then the real-time output (with stdout still buffered) will be saved to "my_output_file" instead of "slurm-13791.out" while the job runs.
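
To name the files yourself via batch directives instead, a minimal sketch is (here %j is the standard SLURM job-ID filename pattern):

#SBATCH -o my_job.%j.out
#SBATCH -e my_job.%j.err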

In some cases, you may want STDOUT to be flushed instantaneously instead of being buffered. One possible use case is when you are debugging a code and the size of your output is less than 8 KB: if the code does not exit cleanly, the output buffer will not be flushed and you will not see any output. In such cases, you may want to add the "-u" argument to srun, which results in an unbuffered output stream. However, using this as a default is not a good idea, since it imposes considerable system overhead and may drastically slow down your simulation, even with a modest amount of output. It should be used only when you don't see any output because your code was terminated abruptly, or when your code generates less than one screenful of output.
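
For example (for debugging only; drop "-u" for production runs):

srun -u -n 32 -c 2 ./a.out   # "-u" disables stdout buffering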

Job Steps and Dependencies

The sbatch option -d <dependency_list> or --dependency=<dependency_list> (also usable as an #SBATCH directive) sets up job dependencies. The most commonly used dependency list is afterok:jobid[:jobid...], which means the job just submitted will be executed only after the dependent job(s) have terminated without an error. Another option is afterany:jobid[:jobid...], which means the job will be executed after the dependent job(s) have terminated, with or without an error. The latter can be useful for restart runs, where the first job is expected to run into its wall-clock limit.

For example, to run batch job2 only after batch job1 succeeds:

cori% sbatch job1
Submitted batch job 5547

cori06% sbatch --dependency=afterok:5547 job2
or
cori06% sbatch --dependency=afterany:5547 job2

or

cori06% sbatch job1
Submitted batch job 5547

cori06% cat job2
#!/bin/bash -l

#SBATCH -p regular
#SBATCH -N 1
#SBATCH -t 00:30:00
#SBATCH -d afterok:5547
#SBATCH -C haswell

srun -n 16 -c 4 ./a.out

cori06% sbatch job2

The second job will be in batch "Held" status until job1 has run successfully. Note that job2 has to be submitted while job1 is still in the batch system, either running or in the queue. If job1 has exited before job2 is submitted, job2 will not be released from the "Held" status.

It is also possible to submit the second job from within the batch script of the job it depends on (job1), using the SLURM environment variable $SLURM_JOB_ID:

#!/bin/bash -l
#SBATCH -p regular
#SBATCH -N 1
#SBATCH -t 00:30:00
#SBATCH -L project
#SBATCH -C haswell

sbatch -d afterok:$SLURM_JOB_ID job2
srun -n 16 -c 4 ./a.out

Please refer to the sbatch man page for other -d <dependency_list> options including afterany:jobid[:jobid...], afternotok:jobid[:jobid...], before:jobid[:jobid...], etc.

Sample Scripts for Submitting Chained Dependency Jobs

Below is a simple batch script, 'runit', for submitting three chained jobs in total (job_number_max=3). It sets the job sequence number (job_number) to 1 if this variable is undefined (that is, in the first job). When the value is less than job_number_max, the current job submits the next job. The value of job_number is incremented by 1, and the new value is provided to the subsequent job.

#!/bin/bash -l

#SBATCH -p regular
#SBATCH -N 1
#SBATCH -t 00:05:00
#SBATCH -L project
#SBATCH -C haswell

 : ${job_number:="1"}           # set job_number to 1 if it is undefined
 job_number_max=3

 echo "hi from ${SLURM_JOB_ID}"

 if [[ ${job_number} -lt ${job_number_max} ]]
 then
   (( job_number++ ))
   next_jobid=$(sbatch --export=job_number=${job_number} -d afterok:${SLURM_JOB_ID} runit | awk '{print $4}')
   echo "submitted ${next_jobid}"
 fi

 sleep 15
 echo "${SLURM_JOB_ID} done"

Using the above script, three batch jobs are submitted.

Job Arrays

Job arrays offer a mechanism for submitting and managing collections of similar jobs quickly and easily.  Job arrays are only supported for batch jobs.

Submit Job Arrays

# Submit a job array with index values between 0 and 31:
$ sbatch --array=0-31 -N1 your_batch_script
# Submit a job array with index values of 1, 3, 5 and 7:
$ sbatch --array=1,3,5,7 -N1 -n2 your_batch_script
# Submit a job array with index values between 1 and 7 with a step size of 2 (i.e. 1, 3, 5 and 7):
$ sbatch --array=1-7:2 -N1 -p regular your_batch_script

Arrays can also be specified as a batch directive, e.g.:

#!/bin/bash -l

#SBATCH -p regular
#SBATCH -N 1
#SBATCH -t 00:05:00
#SBATCH --array=1-10
#SBATCH -L SCRATCH
#SBATCH -C haswell

srun ./mycode.exe

A maximum number of simultaneously running tasks from the job array may be specified using a "%" separator. For example, "--array=0-15%4" will limit the number of simultaneously running tasks from this job array to 4. Job arrays will have two additional environment variables set: SLURM_ARRAY_JOB_ID will be set to the first job ID of the array, and SLURM_ARRAY_TASK_ID will be set to the job array index value.
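
For example, the following script (a minimal sketch; the input_${SLURM_ARRAY_TASK_ID}.dat naming scheme is hypothetical) uses the array index to select a per-task input file, while running at most 4 array tasks at once:

#!/bin/bash -l

#SBATCH -p regular
#SBATCH -N 1
#SBATCH -t 00:30:00
#SBATCH --array=0-15%4
#SBATCH -C haswell

srun ./mycode.exe input_${SLURM_ARRAY_TASK_ID}.dat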

Canceling Job Arrays

If the job ID of a job array is specified as input to the scancel command, all elements of that job array will be cancelled. Alternatively, an array ID, optionally using regular expressions, may be specified for job cancellation.

# Cancel array ID 1 to 3 from jobarray 20:
$ scancel 20_[1-3]
# Cancel array ID 4 and 5 from job array 20:
$ scancel 20_4 20_5
# Cancel all elements from job array 20:
$ scancel 20
# Cancel the current job or job array element (if job array)
if [[ -z $SLURM_ARRAY_JOB_ID ]]; then
  scancel $SLURM_JOB_ID
else
  scancel ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}
fi

Updating Array Jobs

All jobs in an array must have the same initial options (e.g. size, time limit, etc.); however, it is possible to change some of these options after the job has begun execution, using the scontrol command with the JobID of the array or an individual ArrayJobID.

$ scontrol update job=101 ...
$ scontrol update job=101_1 ...
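
As a concrete illustration (the job IDs are hypothetical; TimeLimit is a standard scontrol field):

$ scontrol update job=101 TimeLimit=00:30:00     # update every element of array job 101
$ scontrol update job=101_1 TimeLimit=00:30:00   # update only array element 1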

For more information on Job Arrays, visit the official SLURM documentation.

Additional Advanced Options

More example batch scripts with advanced options that apply to both Haswell and KNL nodes can be found here, including running xfer jobs, use realtime partition, request specific nodes, run MPMD jobs, run CCM jobs, run binaries built with Intel MPI and OpenMPI, run multiple parallel jobs sharing nodes, requesting specific nodes to run, etc.