
Example Batch Scripts


One of the most notable differences between the Hopper system and other NERSC systems is that the number of cores per node (24) is NOT a power of two.  This means that if you want to run a job on 32 cores, two nodes will be allocated and some cores will remain idle.

Basic Batch Script

This script uses the default 24 cores per node (except on the remainder node, when the number of cores requested is not a multiple of 24):

#PBS -q debug
#PBS -l mppwidth=1024
#PBS -l walltime=00:10:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out

cd $PBS_O_WORKDIR
aprun -n 1024 ./my_executable

Because of Hopper's compute node architecture, with two sockets per node and non-uniform memory access both on a chip and within a node, NERSC now recommends a simpler way to write batch scripts that requires only the mppwidth Torque directive for reserving compute resources.  If you want fewer than the default 24 MPI tasks per node, or you use OpenMP, request all 24 cores on the desired number of nodes with the mppwidth parameter.  This obviates the need for the mppnppn directive in the batch script. Note that -N is still added to the aprun line to launch the application with fewer than 24 tasks per node.  See the "Original" and "New Recommendation" alternatives below. Both are valid batch scripts, though task placement on specific cores will be limited with the "Original" scripts.

"Unpacked" Nodes Script: Original

This script shows an example using fewer than the default 24 cores per node, in this case 12.

#PBS -q regular
#PBS -l mppwidth=128
#PBS -l mppnppn=12
#PBS -l walltime=12:00:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out

cd $PBS_O_WORKDIR
aprun -n 128 -N 12 -S 3 ./my_executable

"Unpacked" Nodes Script: New Recommendation

This example shows how to run a total of 1024 MPI tasks using only 12 cores per node rather than all 24.  The number of nodes needed is ceiling(1024/12) = 86, so the script requests 86 * 24 = 2064 cores.

#PBS -q regular
#PBS -l mppwidth=2064 ## Note: ceiling (1024 (total tasks) / 12 (desired tasks per node)) * 24 (total cores per node)
#PBS -l walltime=12:00:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out

cd $PBS_O_WORKDIR
aprun -n 1024 -N 12 -S 3 ./my_executable

Notice the -S option for unpacked node usage, which specifies the number of MPI tasks per NUMA node. Refer to the Using aprun page for more advanced aprun options.
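As a sanity check, the mppwidth arithmetic above can be computed in the shell. This is a small illustrative snippet, not part of any NERSC tooling; the variable names are made up:

```shell
# Values from the example above: 1024 total MPI tasks,
# 12 tasks per node, 24 physical cores per node on Hopper.
tasks=1024
tasks_per_node=12
cores_per_node=24

# Integer ceiling division: (a + b - 1) / b
nodes=$(( (tasks + tasks_per_node - 1) / tasks_per_node ))
mppwidth=$(( nodes * cores_per_node ))

echo "nodes=$nodes mppwidth=$mppwidth"   # prints nodes=86 mppwidth=2064
```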

Submit Example

Submit your job with the qsub command.

hopper% qsub my_script

Requesting Large Memory Nodes

384 Hopper compute nodes have 64 GB of memory, rather than the 32 GB found on most nodes. Out of these, 369 nodes are available for regular batch jobs. To request these large-memory nodes, use the Torque keyword mpplabels=bigmem:

##PBS -q debug   (do not specify a queue)
#PBS -l mppwidth=1024
#PBS -l mpplabels=bigmem
#PBS -l walltime=00:10:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out

cd $PBS_O_WORKDIR
aprun -n 1024 ./my_executable

In this script, the user requests 1024 cores (43 nodes) that each contain 64 GB of memory.  Note: any queue specification in the batch script will be ignored and the job will be routed to the designated "bigmem" queue, so "#PBS -q premium" will not give the job a higher priority, nor will it be charged at the premium rate.

To request large memory nodes interactively, simply add -lmpplabels=bigmem to the qsub line. Do not specify a queue with the -q option.  For example:

hopper% qsub -I -lmppwidth=240 -lmpplabels=bigmem -lwalltime=00:15:00

Note that it might take longer for such a job to start, as the batch system must wait for the desired big memory nodes to become available.

Running Hybrid MPI/OpenMP Applications

Hybrid MPI/OpenMP Example: New Recommendations (recommended)

Instead of requesting mppwidth=256, set mppwidth = (256 total tasks / 4 tasks per node) * 24 cores per node = 1536.  In this case mppnppn and mppdepth are no longer needed. The -N and -d flags still need to be passed to the aprun command to specify the number of MPI tasks per node and the number of OpenMP threads to use.

#PBS -q regular
#PBS -l mppwidth=1536
#PBS -l walltime=12:00:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out

cd $PBS_O_WORKDIR
setenv OMP_NUM_THREADS 6
aprun -n 256 -N 4 -S 1 -d 6 ./my_executable

Notice the -S option for unpacked node usage, which specifies the number of MPI tasks per NUMA node. Please refer to the Using aprun page for more advanced aprun options.

Hybrid MPI/OpenMP Example: Original (phasing out, not recommended)

This script shows an application running 256 total MPI tasks (mppwidth) using 4 MPI tasks per node (mppnppn) and 6 OpenMP threads (mppdepth) per task.  The aprun options -n, -N, and -d need to match the corresponding mppxxx requests.  Make sure to compile your application with the appropriate OpenMP compiler flags.

Note: Please do not use the following "Original" example script for now; there is currently a bug with requesting nodes using mppdepth. Instead, please use the New Recommendation of requesting nodes with mppwidth only. (8/26/2013)

#Do not use this example script, due to a "mppdepth" related bug. 
# Instead, ask for nodes with mppwidth only.
#PBS -q regular
#PBS -l mppwidth=256
#PBS -l mppnppn=4
#PBS -l mppdepth=6
#PBS -l walltime=12:00:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out

cd $PBS_O_WORKDIR
setenv OMP_NUM_THREADS 6
aprun -n 256 -N 4 -S 1 -d 6 ./my_executable

 

Pure OpenMP Example: New Recommendations

Instead of requesting mppwidth=1 and mppnppn=1, set mppwidth=24 to request 1 node.  The aprun command options determine the number of MPI tasks (1 in this pure OpenMP example) and OpenMP threads (24 in this example) to use. Make sure to compile your application with the appropriate OpenMP compiler flags.

#PBS -q regular
#PBS -l mppwidth=24
#PBS -l walltime=12:00:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out

cd $PBS_O_WORKDIR
setenv OMP_NUM_THREADS 24
aprun -n 1 -N 1 -d 24 ./my_executable

Pure OpenMP Example: Original (phasing out, not recommended)

This script shows an application running a pure OpenMP code using 24 threads on the node. Even though there is no MPI, the request for the total number of MPI tasks (mppwidth) is 1, and MPI tasks per node (mppnppn) is also 1.  The number of threads is requested via mppdepth.  The aprun options -n, -N, and -d need to match the corresponding mppxxx requests. Make sure to compile your application with the appropriate OpenMP compiler flags.
Note: Please do not use the following "Original" example script for now; there is currently a bug with requesting nodes using mppdepth. Instead, please use the New Recommendation of requesting nodes with mppwidth only. (8/26/2013)

#PBS -q regular
#PBS -l mppwidth=1
#PBS -l mppnppn=1
#PBS -l mppdepth=24
#PBS -l walltime=12:00:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out

cd $PBS_O_WORKDIR
setenv OMP_NUM_THREADS 24
aprun -n 1 -N 1 -d 24 ./my_executable

Running Dynamic and Shared Library Applications

System Supported Dynamic and Shared Library Script

Note that the code is compiled with the -dynamic flag.

#PBS -q regular
#PBS -l mppwidth=128
#PBS -l walltime=12:00:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out

cd $PBS_O_WORKDIR
# setenv CRAY_ROOTFS DSL (not needed any more since it is now default)
aprun -n 128 ./my_executable

Running Multiple Serial Jobs

In some situations one needs to run the same serial program repeatedly with different input parameters, for example to examine the behavior of dynamics over a large domain of input parameter space. In that case it is more advantageous, both for managing one's project work and for job scheduling, to bundle multiple application runs into a few large batch jobs instead of submitting numerous serial jobs that each contain just a single run. Here we examine some ways to launch such multiple serial jobs:

In the examples below, we assume that the executable a.out resides in a directory and each run's input and output files are kept in a separate directory underneath it (dir00/infile, dir00/outfile, dir01/infile, dir01/outfile, ...).

Use the "serial" queue

A serial queue that allows multiple executables from different users is now available on Hopper. Please see details on how to compile and run.

Using TaskFarmer

One can use the taskfarmer utility provided by Cray to run multiple tasks in parallel, where each task is a serial job.   To use the utility, the taskfarmer module should be loaded. A sample batch script is below:

#PBS -q regular
#PBS -l mppwidth=48
#PBS -l walltime=2:00:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out

cd $PBS_O_WORKDIR

module load taskfarmer

tf -t 100 -n 2 -e serial.err -o serial.out ./task.sh

where a sample task.sh is given below, using $TF_TASKID to distinguish each task; task IDs range from 0 to total_num_tasks-1.

#!/bin/sh
echo "my taskid is: " $TF_TASKID

if [ $TF_TASKID -le 9 ]
then
    dirname="dir0$TF_TASKID"
else
    dirname="dir$TF_TASKID"
fi
echo $dirname

./a.out < $dirname/infile > $dirname/outfile

The 'tf' command is used to execute multiple single-core tasks.  The '-t' argument gives the number of tasks; the '-n' argument is the number of Hopper nodes to use.  The script used (in this case 'task.sh') must have its execute permission set.

The number of tasks can be larger than the number of compute cores allocated for the job. In the above example, 100 tasks are run on 2 nodes (48 cores).  When a task completes on a core, the next task from the pool of incomplete tasks is assigned to that core.  A "tf.log" file will be generated with task completion details.  "serial.err" and "serial.out" are also generated in the directory as the combined task error and output files. See the "tf" man page for more details on the optional arguments of the command.

Using CCM

One can run multiple serial workloads in parallel in the CCM (Cluster Compatibility Mode) environment. Detailed information about CCM can be found here. In the example below, the work contained in the shell script 'runtask.sh' is run in parallel in the CCM environment.

#!/bin/bash -l
#PBS -q ccm_queue
#PBS -l mppwidth=24
#PBS -l walltime=1:00:00

module load ccm
export CRAY_ROOTFS=DSL
cd $PBS_O_WORKDIR
ccmrun ./runtask.sh

where runtask.sh is given as follows:

#!/bin/bash
./a.out < dir00/infile > dir00/outfile &
./a.out < dir01/infile > dir01/outfile &
./a.out < dir02/infile > dir02/outfile &
...
./a.out < dir23/infile > dir23/outfile &
wait

Each run is assigned to a different core, and multiple runs are processed in parallel.
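The hand-written list in runtask.sh can also be generated with a loop. The sketch below keeps the same dirNN layout; for illustration it builds a small scratch sandbox and uses cat in place of ./a.out (both the sandbox and the use of cat are stand-ins, not part of the original workflow):

```shell
#!/bin/bash
# Illustrative stand-ins: a scratch sandbox and 'cat' instead of ./a.out.
app=cat
workdir=$(mktemp -d)
cd "$workdir"

# Create dir00..dir03 with an input file each (4 runs instead of 24).
for n in $(seq 0 3); do
    i=$(printf "%02d" "$n")
    mkdir -p "dir$i"
    echo "input for task $i" > "dir$i/infile"
done

# Launch one background run per directory, then wait for all of them,
# exactly as runtask.sh does with its explicit list of lines ending in '&'.
for n in $(seq 0 3); do
    i=$(printf "%02d" "$n")
    "$app" < "dir$i/infile" > "dir$i/outfile" &
done
wait

cat dir02/outfile   # prints: input for task 02
```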

Using Python

One can use the Python mpi4py module to launch multiple serial jobs. Below is a sample Python script, 'mwrapper.py':

#!/usr/bin/env python
from mpi4py import MPI
from subprocess import call
import sys

exctbl = sys.argv[1]
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
myDir = "dir"+str(rank).zfill(2)
cmd = "cd "+myDir+" ; "+exctbl+" < infile > outfile"
sts = call(cmd,shell=True)
comm.Barrier()

Below is a batch script to use it for a serial program, a.out:

#!/bin/bash -l
#PBS -q regular
#PBS -l mppwidth=24
#PBS -l walltime=1:00:00

module load python
export CRAY_ROOTFS=DSL
cd $PBS_O_WORKDIR
aprun -n 24 mwrapper.py $PBS_O_WORKDIR/a.out

Using MPI

MPI can also be used for simultaneous execution of the serial runs. The following example uses SYSTEM, a non-standard but widely supported Fortran intrinsic, to make each MPI rank run the serial application independently of the other ranks:

program mpiwrapped
include 'mpif.h'
character*80 cmd
call mpi_init(ierr)
call mpi_comm_rank(mpi_comm_world,me,ierr)
write(cmd,'("./a.out < dir",i2.2,"/infile > dir",i2.2,"/outfile")') me, me ! ./a.out < dir00/infile > dir00/outfile for rank 0, etc.
call system(trim(cmd))
call mpi_finalize(ierr)
end

Another way to run the multiple serial work items in parallel is to turn the serial code into an embarrassingly parallel MPI code by including a minimal set of MPI-related changes. The idea is to make each MPI task read its own data from a separate input file and write the result to a separate output file, with the input and output file names constructed from the MPI rank. The following code is turned into an MPI code simply by adding the "include 'mpif.h'", MPI_Init, MPI_Comm_rank and MPI_Finalize statements, and by having the input and output file names constructed from the MPI rank:

program mpiwrapped
include 'mpif.h'
...
character*12 file_in
character*13 file_out
...
call mpi_init(ierr)
call mpi_comm_rank(mpi_comm_world,me,ierr)
write(file_in,'("dir",i2.2,"/infile")') me ! dir00/infile for rank 0, etc.
write(file_out,'("dir",i2.2,"/outfile")') me ! dir00/outfile for rank 0, etc.
...
open(10,file=file_in)
read(10,*) ... ! instead of 'read(5,*) ...'
close(10)
...
open(11,file=file_out)
write(11,*) ... ! instead of 'write(6,*) ...'
close(11)
...
call mpi_finalize(ierr)
end

Below is a batch script to run the program above. 

#!/bin/bash
#PBS -q regular
#PBS -l mppwidth=24
#PBS -l walltime=1:00:00

cd $PBS_O_WORKDIR
aprun -n 24 ./mpiwrapped

Such wrapper codes can be similarly prepared in other languages such as C or C++; in fact, the Python example above can be considered one such example. Note also that distributed-memory parallel programming models other than MPI can be used in the same way.

Running Multiple Parallel Jobs Sequentially

#PBS -q regular
#PBS -l mppwidth=64
#PBS -l walltime=12:00:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out

cd $PBS_O_WORKDIR
aprun -n 36 ./a.out
aprun -n 64 ./b.out
aprun -n 24 ./c.out

Running Multiple Parallel Jobs Simultaneously

Be sure to specify the total number of nodes needed to run all jobs at the same time.  Note that multiple executables cannot share the same nodes.  If the number of cores required to launch an aprun command is not divisible by 24, an extra node must be added for that aprun command.  In this example, the first executable needs 2 nodes, the second needs 3 nodes, and the last needs 1 node.  The mppwidth requested is the total of 6 nodes * 24 cores/node = 144.

Notice the "&" at the end of each aprun command.  The "wait" command at the end of the script is also very important: it ensures the batch job does not exit before all the simultaneous apruns have completed. A general guideline is to have no more than 50 to 60 simultaneous apruns at a time in your workflow.

#PBS -q regular
#PBS -l mppwidth=144
#PBS -l walltime=12:00:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out

cd $PBS_O_WORKDIR
aprun -n 32 ./a.out &
aprun -n 65 ./b.out &
aprun -n 24 ./c.out &
wait

Running MPMD (Multiple Program Multiple Data) Jobs

Note that more than one executable will not run on a given node, so if the number of cores needed for an executable is not divisible by 24, an extra node must be added.  See the example below.  The executable a.out needs 11 nodes (the ceiling of 256 tasks / 24 cores per node).  The executable b.out needs 32 nodes (768 tasks / 24 cores per node).  mppwidth is set to (11 nodes + 32 nodes) * 24 cores per node = 1032.

#PBS -q regular
#PBS -l mppwidth=1032
#PBS -l walltime=02:00:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out

cd $PBS_O_WORKDIR
aprun -n 256 ./a.out : -n 768 ./b.out

Serial Jobs

Using the "serial" queue

Nodes used for the "serial" queue are shared among multiple users. Jobs in the "serial" queue are charged by core, instead of the entire node. Please see details on how to compile and run.

Using the regular queues

Serial jobs can run in the regular queues by requesting only 1 core. Nodes in the regular queues (regular, debug, premium, low, etc.) are not shared.  Jobs will be charged for the entire node, even if only 1 core is used. A sample script follows:

#PBS -q regular
#PBS -l mppwidth=1
#PBS -l walltime=02:00:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out

cd $PBS_O_WORKDIR
aprun -n 1 ./my_executable

STDOUT and STDERR

While your job is running, standard output (STDOUT) and standard error (STDERR) are written to temporary files in your submit directory (for example: 147546.hopper11.ER and 147546.hopper11.OU). The system will append to these files in real time as the job runs so you can check the contents of these files for easier job monitoring.  If you merge the stderr/stdout via "#PBS -j eo" or "#PBS -j oe" option, then only one such spool file will appear. IMPORTANT: Do not remove or rename these spool files while the job is still running!

After the batch job completes, these files are renamed to the corresponding stderr/stdout files (for example: jobscript.e147546 and jobscript.o147546).  If you specify your own stdout/stderr file names, or merge stderr into stdout (with the Torque keyword) and redirect your output to a file as follows, the temporary files will be renamed to the file names of your choice. For example, if you have:

...
#PBS -j oe
...
aprun -n 64 ./a.out >& my_output_file (for csh/tcsh)
or: aprun -n 64 ./a.out > my_output_file 2>&1 (for bash)

Then the temporary files will be copied to "my_output_file" instead of "jobscript.o147546" at job completion time.

Job Steps and Dependencies

There is a qsub option -W depend=dependency_list, or the Torque keyword #PBS -W depend=dependency_list, for job dependencies. The most commonly used dependency_list is afterok:jobid[:jobid...], which means the job just submitted will be executed only after the dependent job(s) have terminated without an error. Another option is afterany:jobid[:jobid...], which means the job just submitted will be executed after the dependent job(s) have terminated either with or without an error. The second option can be useful for many restart runs, where it is the user's intention for the first job to run until it exceeds the wall clock limit.

For example, to run batch job2 only after batch job1 succeeds,

hopper% qsub job1
697873.hopper11

hopper% qsub -W depend=afterok:697873 job2 
or
hopper% qsub -W depend=afterany:697873 job2 

or: 

hopper% qsub job1
697873.hopper11

hopper% cat job2
#PBS -q regular
#PBS -l mppwidth=8
#PBS -l walltime=30:00
#PBS -W depend=afterok:697873
#PBS -j oe
cd $PBS_O_WORKDIR
aprun -n 8 ./a.out

hopper% qsub job2

The second job will be in batch "Held" status until job1 has run successfully. Note that job2 has to be submitted while job1 is still in the batch system, either running or in the queue. If job1 has exited before job2 is submitted, job2 will not be released from the "Held" status.

It is also possible to submit the second job from within its dependent job's (job1's) batch script, using the $PBS_JOBID environment variable:

#PBS -q regular
#PBS -l mppwidth=8
#PBS -l walltime=00:30:00
#PBS -j oe

cd $PBS_O_WORKDIR
qsub -W depend=afterok:$PBS_JOBID job2
aprun -n 8 ./a.out

Please refer to qsub man page for other -W depend=dependency_list options including afterany:jobid[:jobid...], afternotok:jobid[:jobid...], before:jobid[:jobid...], etc.  

Sample Scripts for Submitting Chained Dependency Jobs

Below is a simple batch script, 'runit', for submitting three chained jobs in total (job_number_max=3). It sets the job sequence number (job_number) to 1 if the variable is undefined (that is, in the first job). While the value is less than job_number_max, the current job submits the next job: the value of job_number is incremented by 1 and the new value is passed to the subsequent job.

#!/bin/bash
#PBS -q regular
#PBS -l mppwidth=1
#PBS -l walltime=0:05:00
#PBS -j oe

: ${job_number:="1"} # set job_number to 1 if it is undefined
job_number_max=3
JOBID="${PBS_JOBID}" # use this on Hopper

cd $PBS_O_WORKDIR

echo "hi from ${PBS_JOBID}"

if [[ ${job_number} -lt ${job_number_max} ]]
then
  (( job_number++ ))
  next_jobid=$(qsub -v job_number=${job_number} -W depend=afterok:${JOBID} runit)
  echo "submitted ${next_jobid}"
fi

sleep 15
echo "${PBS_JOBID} done"

Using the above script, three batch jobs are submitted.

Submitting an xfer Job

An "xfer" queue is provided so that users can submit a batch xfer job at the end of their regular batch jobs to transfer data files. This avoids having the file transfer process hold a large number of compute nodes, and being charged for those nodes, for the duration of the transfer.

An xfer job could be submitted from any login node, or from within a batch job script. Below is a sample batch script:
#PBS -q xfer
#PBS -l walltime=2:00:00
#PBS -N my_job
#PBS -j oe

cd $PBS_O_WORKDIR
hsi put myfile

xfer jobs on Hopper run on a designated login node.  Because of this, notice there is no "#PBS -l mppwidth" line in the above example.  xfer jobs specifying -lmppwidth will not be able to start, since there is no such resource on the local batch server.
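As noted above, an xfer job can also be submitted from within a regular batch job. A sketch of a regular job that hands archiving off to the xfer queue at the end (the script name my_xfer_job and the job parameters here are hypothetical):

```shell
#PBS -q regular
#PBS -l mppwidth=24
#PBS -l walltime=1:00:00
#PBS -j oe

cd $PBS_O_WORKDIR
aprun -n 24 ./my_executable

# Hand the transfer off to the xfer queue so the compute nodes are
# released (and charging stops) before the transfer begins.
# my_xfer_job is a script like the "#PBS -q xfer" example above.
qsub my_xfer_job
```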

Job steps and dependencies can be used in a workflow to prepare input data for simulation or to archive output data after a simulation.