
Submitting Batch Jobs

Overview

A batch job is the most common way users run production applications on NERSC machines.  Carver's batch system is based on the PBS model, implemented with the Moab scheduler and Torque resource manager.

Typically, the user submits a batch script to the batch system. This script specifies, at the very least, how many nodes and cores the job will use, how long the job will run, and the name of the application to run. The job will advance in the queue until it has reached the top.   At this point, Torque will allocate the requested number of nodes to the batch job.  The batch script itself will execute on the "head node" (sometimes known as the "MOM" node).  See Queues and Policies for details of batch queues, limits, and policies.

Debug Jobs

Short jobs that request less than 30 minutes of walltime and 32 nodes (256 cores) or fewer can run in the debug queue.  From 5am to 6pm Pacific Time, 8 nodes are reserved for debugging and interactive use.
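For short tests it is often convenient to request an interactive batch session in the debug queue with "qsub -I" (interactive batch jobs are also mentioned under PBS_ENVIRONMENT below). A minimal example, assuming two nodes for 30 minutes:

carver% qsub -I -q debug -l nodes=2:ppn=8,walltime=00:30:00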

Sample Batch Scripts

Although there are default values for all batch parameters, it is a good idea to always specify the name of the queue, the number of nodes, and the walltime for all batch jobs.  To minimize the time spent waiting in the queue, specify the smallest walltime that will safely allow the job to complete.

A common convention is to append the suffix ".pbs" to batch scripts.

Basic Batch Script

This example requests 16 nodes with 8 tasks per node, for 10 minutes, in the debug queue.

#PBS -q debug
#PBS -l nodes=16:ppn=8
#PBS -l walltime=00:10:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out

cd $PBS_O_WORKDIR
mpirun -np 128 ./my_executable

Submit Example

Submit your batch script with the qsub command:

carver% qsub my_job.pbs
123456.cvrsvc09-ib

The qsub command displays the job_id (123456.cvrsvc09-ib in the above example).  It is important to keep track of your job_id to help with job tracking and problem resolution.
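For example, you can check on a submitted job with the standard Torque qstat command:

carver% qstat -u $USER                # all of your jobs
carver% qstat 123456.cvrsvc09-ib      # just this job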

Torque Keywords

The following keywords may be specified as qsub command line options, or as directives (preceded by #PBS) embedded in a batch script.

Required Torque Options/Directives

Option | Default | Description
-l nodes=num_nodes:ppn=tasks_per_node | 1 node and 1 task per node | Number of nodes assigned to the job and number of tasks per node.  Note: ppn must be less than or equal to 8 in all queues except reg_xlmem, where it can be up to 32.
-l walltime=HH:MM:SS | 00:05:00 | Maximum wallclock time for the job
-q queue_name | debug | Name of the submit queue
-N job_name | Name of job script | Name of the job; up to 15 printable, non-whitespace characters
Useful Torque Options/Directives

Option | Default | Description
-A repo | Default repo | Charge the job to repo
-e filename | <job_name>.e<job_id> | Write stderr to filename
-o filename | <job_name>.o<job_id> | Write stdout to filename
-j [oe|eo] | Do not merge | Merge (join) stdout and stderr: if oe, merge into the output file; if eo, merge into the error file
-m [a|b|e|n] | a | Email notification: a = send mail if the job is aborted by the system; b = send mail when the job begins; e = send mail when the job ends; n = never send email.  Options a, b, and e may be combined
-S shell | Login shell | Specify the shell used to interpret the batch script
-l gres=resource[%resource] | Run whether the resource is available or not | Resource can be gscratch, project, or projectb.  CURRENTLY BROKEN; DO NOT USE.  Specifies that the batch job uses /resource; when set, the job will not start during scheduled /resource file system maintenance.
-V | Do not export | Export current environment variables into the batch job environment
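As an illustration, the queue, node count, walltime, and job name from the basic script above could equally well be given on the qsub command line instead of as #PBS directives:

carver% qsub -q debug -l nodes=16:ppn=8 -l walltime=00:10:00 -N my_job my_job.pbs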

Torque Environment Variables

The batch system defines many environment variables, which are available for use in batch scripts.  The following tables list some of the more useful variables.  Users must not redefine the value of any of these variables!

Variable Name | Meaning
PBS_O_LOGNAME | Login name of the user who executed qsub.
PBS_O_HOME | Home directory of the submitting user.
PBS_O_WORKDIR | Directory in which the qsub command was executed.  Note that batch jobs begin execution in $PBS_O_HOME; many batch scripts execute "cd $PBS_O_WORKDIR" as the first executable statement.
PBS_O_HOST | Hostname of the system on which qsub was executed.  This is typically a Carver login node.
PBS_JOBID | Unique identifier for this job; important for tracking job status.
PBS_ENVIRONMENT | Set to "PBS_BATCH" for scripts submitted as batch jobs; "PBS_INTERACTIVE" for interactive batch jobs ("qsub -I ...").
PBS_O_QUEUE | Name of the submit queue.
PBS_QUEUE | Name of the execution queue.
PBS_O_JOBNAME | Name of this job.
PBS_NODEFILE | Name of the file containing the list of nodes assigned to this job.
PBS_NUM_NODES | Number of nodes assigned to this job.
PBS_NUM_PPN | Value of "ppn" (processes per node) for this job.
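A simple way to see these variables in action is to echo a few of them from a small batch script; a minimal sketch (debug queue assumed):

#PBS -q debug
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:05:00

cd $PBS_O_WORKDIR
echo "job id:       $PBS_JOBID"
echo "submit host:  $PBS_O_HOST"
echo "queue:        $PBS_QUEUE"
echo "nodes / ppn:  $PBS_NUM_NODES / $PBS_NUM_PPN"
cat $PBS_NODEFILE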

Standard Output and Error

While your job is running, standard output (stdout) and standard error (stderr) are written to temporary "spool" files (for example: 123456-cvrsvc09-ib.OU and 123456-cvrsvc09-ib.ER) in the submit directory.  If you merge  stderr/stdout via the "#PBS -j eo" or "#PBS -j oe" option, then only one such spool file will appear. 

These files will be updated in real-time while the job is running, allowing you to use them for job monitoring.  It is important that you do not modify, remove or rename these spool files while the job is still running!
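For example, to follow the output of a running job, from the submit directory on a login node (using the job id and spool file name from the example above):

carver% tail -f 123456-cvrsvc09-ib.OU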

After your batch job completes, the above files will be renamed to the corresponding stdout/stderr files (for example: my_job.o123456 and my_job.e123456).

Running Serial Jobs

A serial job is one that only requires a single computational core.  The serial queue is specifically configured to run serial jobs.  It runs on 80 12-core Westmere nodes, each having 48GB of memory. 

Serial jobs share nodes, rather than having exclusive access. Multiple jobs will be scheduled on an available node until either all cores are in use, or until there is not enough memory available for additional processes on that node.  This last characteristic makes it important to request sufficient memory for your serial job.   See Carver Memory Considerations for more information.

Serial jobs are not charged for an entire node; they are only charged for a single core.  See Usage Charging for more information.

The following script requests a single Westmere core and 10GB of memory for 12 hours:

#PBS -q serial
#PBS -l walltime=12:00:00
#PBS -l pvmem=10GB

cd $PBS_O_WORKDIR
./a.out

Note that it is not necessary to specify "-l nodes=1:ppn=1" for serial jobs.  Also note that if values other than 1 are specified for these attributes, the serial job will never start.

Running Multiple Serial Jobs

In some situations you need to run the same serial program repeatedly with different input parameters, for example to examine the behavior of a model over a large domain in input parameter space. In that case it is more advantageous, both for managing your project work and for job scheduling, to bundle multiple application runs into a single batch job and run a few such large bundled jobs rather than numerous serial jobs each containing a single run. Below we examine some ways to launch multiple serial runs:

  • Using the Serial Queue
  • Using pbsdsh
  • Using taskfarmer
  • Using Python
  • Using MPI to Make an Embarrassingly Parallel Code

 In the examples below, we assume that the executable a.out resides in a directory and each run's input and output files are kept in a separate directory underneath it (dir00/infile, dir00/outfile, dir01/infile, dir01/outfile, ...).

Using the Serial Queue

An easy way is to run the serial runs sequentially in the serial queue:

#!/bin/bash
#PBS -q serial
#PBS -l walltime=1:00:00
#PBS -l pvmem=2GB

cd $PBS_O_WORKDIR
./a.out < dir00/infile > dir00/outfile
./a.out < dir01/infile > dir01/outfile
./a.out < dir02/infile > dir02/outfile
./a.out < dir03/infile > dir03/outfile
./a.out < dir04/infile > dir04/outfile
./a.out < dir05/infile > dir05/outfile
./a.out < dir06/infile > dir06/outfile
./a.out < dir07/infile > dir07/outfile

This job uses a single core for all the serial runs and is not charged for the other cores on the assigned compute node. However, the runs are processed sequentially, so the time to complete the entire workload can become large.

Using pbsdsh

One can run the serial workload above in parallel with 'pbsdsh'. This command distributes the work over the allocated compute nodes: every allocated core executes the shell script $PBS_O_WORKDIR/runtask.sh in parallel.

#!/bin/bash
#PBS -q regular
#PBS -l nodes=1:ppn=8
#PBS -l walltime=1:00:00
#PBS -l pvmem=2GB

cd $PBS_O_WORKDIR
pbsdsh $PBS_O_WORKDIR/runtask.sh

 where runtask.sh is given as follows:

#!/bin/bash
cd $PBS_O_WORKDIR
wdir=$(printf "dir%2.2d" ${PBS_VNODENUM}) # dir00, dir01, dir02,...
./a.out < $wdir/infile > $wdir/outfile

The environment variable PBS_VNODENUM is a CPU (virtual node) identifier whose value runs from 0 to nodes*ppn-1. This is used in the above script to identify 8 tasks spawned by pbsdsh.
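If the number of work directories is smaller than nodes*ppn, a small guard in runtask.sh lets the surplus copies exit immediately; a hypothetical variant (the count of 8 directories is an assumption):

#!/bin/bash
# Hypothetical variant of runtask.sh: cores beyond the number of work
# directories exit right away instead of running a.out.
num_dirs=8
if [ ${PBS_VNODENUM} -ge ${num_dirs} ]; then
    exit 0
fi
cd $PBS_O_WORKDIR
wdir=$(printf "dir%2.2d" ${PBS_VNODENUM})
./a.out < $wdir/infile > $wdir/outfile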

Please note that the serial queue cannot be used here, since the job uses multiple cores (and multiple nodes if necessary). Because the workload is processed in parallel, the required walltime is reduced.

Using taskfarmer

One can use the taskfarmer utility developed at NERSC to process the workload in parallel in a similar way. To use the utility, both the tig and taskfarmer modules should be loaded. A sample batch script is below:

#!/bin/bash -l
#PBS -q regular
#PBS -l nodes=1:ppn=8
#PBS -l walltime=1:00:00
#PBS -l pvmem=2GB

module load tig taskfarmer
cd $PBS_O_WORKDIR
runcommands.sh jobfile

where jobfile is given as follows, with '/path/for/cwd' replaced by the proper full path of the current working directory:

/path/for/cwd/a.out < /path/for/cwd/dir00/infile > /path/for/cwd/dir00/outfile
/path/for/cwd/a.out < /path/for/cwd/dir01/infile > /path/for/cwd/dir01/outfile
/path/for/cwd/a.out < /path/for/cwd/dir02/infile > /path/for/cwd/dir02/outfile
/path/for/cwd/a.out < /path/for/cwd/dir03/infile > /path/for/cwd/dir03/outfile
/path/for/cwd/a.out < /path/for/cwd/dir04/infile > /path/for/cwd/dir04/outfile
/path/for/cwd/a.out < /path/for/cwd/dir05/infile > /path/for/cwd/dir05/outfile
/path/for/cwd/a.out < /path/for/cwd/dir06/infile > /path/for/cwd/dir06/outfile
/path/for/cwd/a.out < /path/for/cwd/dir07/infile > /path/for/cwd/dir07/outfile

Please note that jobfile is not a shell script; it is a text file that lists all the tasks to be done. Taskfarmer uses /tmp for some temporary files. On Carver, this may cause the job to fail with errors about /tmp being full. If this happens, please redirect the temporary files to global scratch by setting TF_TMPBASE to a directory you create there (e.g., "export TF_TMPBASE=$GSCRATCH/<directory_you_create>" in a bash script, or "setenv TF_TMPBASE $GSCRATCH/<directory_you_create>" under csh).
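Rather than typing every line, the jobfile can be generated with a short shell loop; a sketch, assuming it is run from the submit (current working) directory:

cwd=$(pwd)
rm -f jobfile
for i in $(seq 0 7)
do
    d=$(printf "dir%02d" $i)
    echo "$cwd/a.out < $cwd/$d/infile > $cwd/$d/outfile" >> jobfile
done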

Using Python

One can use Python's mpi4py module to launch multiple serial runs. Below is a sample Python script, 'mwrapper.py':

#!/usr/bin/env python
from mpi4py import MPI
from subprocess import call
import sys

exctbl = sys.argv[1]
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
myDir = "dir"+str(rank).zfill(2)
cmd = "cd "+myDir+" ; "+exctbl+" < infile > outfile"
sts = call(cmd,shell=True)
comm.Barrier()

Below is a batch script to use it for a serial program, a.out:

#!/bin/bash -l
#PBS -q regular
#PBS -l nodes=1:ppn=8
#PBS -l walltime=1:00:00
#PBS -l pvmem=2GB

module load python/2.7.3 mpi4py
cd $PBS_O_WORKDIR
mpirun -np 8 python mwrapper.py $PBS_O_WORKDIR/a.out

Using MPI

MPI can also be used to run the serial runs simultaneously. The following example uses the non-standard but widely supported Fortran intrinsic SYSTEM to make each MPI rank run the serial application independently of the other ranks:

program mpiwrapped
include 'mpif.h'
character*80 cmd
call mpi_init(ierr)
call mpi_comm_rank(mpi_comm_world,me,ierr)
write(cmd,'("./a.out < dir",i2.2,"/infile > dir",i2.2,"/outfile")') me, me   ! ./a.out < dir00/infile > dir00/outfile for rank 0, etc.
call system(trim(cmd))
call mpi_finalize(ierr)
end

Another way to run the multiple serial runs in parallel is to turn the serial code into an embarrassingly parallel MPI code with a minimal set of MPI-related changes. The idea is to make each MPI task read its own data from a separate input file and write its result to a separate output file, with the file names constructed from the MPI rank. The following code is turned into an MPI code simply by adding the "include 'mpif.h'", MPI_Init, MPI_Comm_rank, and MPI_Finalize statements, constructing the input and output file names from the MPI rank, and making the program use those files:

program mpiwrapped
include 'mpif.h'
...
character*12 file_in
character*13 file_out
...
call mpi_init(ierr)
call mpi_comm_rank(mpi_comm_world,me,ierr)
write(file_in,'("dir",i2.2,"/infile")') me     ! dir00/infile for rank 0, etc.
write(file_out,'("dir",i2.2,"/outfile")') me   ! dir00/outfile for rank 0, etc.
...
open(10,file=file_in)
read(10,*) ...        ! instead of 'read(5,*) ...'
close(10)
...
open(11,file=file_out)
write(11,*) ...       ! instead of 'write(6,*) ...'
close(11)
...
call mpi_finalize(ierr)
end

Below is a batch script to run the program above. 

#!/bin/bash
#PBS -q regular
#PBS -l nodes=1:ppn=8
#PBS -l walltime=1:00:00
#PBS -l pvmem=2GB

cd $PBS_O_WORKDIR
mpirun -np 8 ./mpiwrapped

Such wrapper codes can be prepared in other languages such as C or C++ in a similar way; the Python example above can be considered one such example. Note also that distributed-memory programming models other than MPI can be used similarly.

Running Multiple Parallel Jobs Sequentially

Make sure that the total number of nodes requested is sufficient for the largest (node-count) application, and that the requested walltime is sufficient for all applications combined.  Note that the repository charge for the batch job is based on the total number of nodes requested, regardless of whether any particular application uses all those nodes.

#PBS -q regular
#PBS -l nodes=16:ppn=8
#PBS -l walltime=4:00:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out

cd $PBS_O_WORKDIR
mpirun -np 128 ./my_executable1
mpirun -np 32 ./my_executable2
mpirun -np 64 ./my_executable3

Running Multiple Parallel Jobs Simultaneously

If you need to run many jobs that require approximately the same run time, you can bundle them up and run them in one job script. For example, to run 8 jobs simultaneously with 4 cores per job (4 nodes in total), here is a sample job script:

#!/bin/bash -l
#PBS -N test_multi
#PBS -q debug
#PBS -l nodes=4:ppn=8,walltime=30:00
#PBS -j oe

cd $PBS_O_WORKDIR

#Run 8 jobs simultaneously, each with 4 cores
let num_jobs=8
let tasks_per_job=4

#Assume jobs run in separate directories, job1, job2, ...
for i in $(seq $num_jobs)
do
    cd job$i

    #write hostfile for i-th job to use
    let lstart=($i-1)*${tasks_per_job}+1
    let lend=${lstart}+${tasks_per_job}-1
    sed -n ${lstart},${lend}'p' < $PBS_NODEFILE >nodefile$i

    mpirun -np $tasks_per_job -hostfile nodefile$i ./a.out  >& job$i.out &

    cd ..
done

wait

Please note that it is important to pass the hostfile to mpirun for each job; otherwise all jobs will run on the same first 4 cores, which will overload and likely kill that node. The final "wait" command is also necessary; otherwise the batch job will exit as soon as the script finishes executing, and all the runs sent to the background will be terminated.

If you want to run multiple hybrid MPI+OpenMP jobs at the same time, here is a sample job script. It runs 8 hybrid jobs simultaneously, each with 2 MPI tasks and 4 threads per task (so each job runs on 1 node, and 8 nodes are needed in total).

#!/bin/bash -l
#PBS -N test_hybrid
#PBS -q debug
#PBS -l nodes=8:ppn=2,walltime=30:00
#PBS -l pvmem=10GB
#PBS -j oe

cd $PBS_O_WORKDIR

#Run 8 jobs simultaneously, each running 2 mpi tasks 4 threads per task
let num_jobs=8
let tasks_per_job=2
let threads_per_task=4

export OMP_NUM_THREADS=$threads_per_task

#Assume jobs run in separate directories, job1, job2, ...
for i in $(seq $num_jobs)
do
    cd job$i

    #write hostfile for i-th job to use
    let lstart=($i-1)*${tasks_per_job}+1
    let lend=${lstart}+${tasks_per_job}-1
    sed -n ${lstart},${lend}'p' < $PBS_NODEFILE >nodefile
 
    mpirun -np $tasks_per_job -bysocket -bind-to-socket -hostfile nodefile ./a.out >& job$i.out &

    cd ..
done

wait

Again, it is important to pass the hostfile on the mpirun command line to spread your tasks over the nodes allocated to your job. Please also note that the Torque directive #PBS -l pvmem=10GB is necessary to allow each MPI task four times the default per-core memory (2.5GB); otherwise the 4 threads in each MPI task would share the default 2.5GB instead of the available 10GB. For more information about using memory properly, please refer to our Memory Considerations page. This job script is set up for 4 threads per task; to run hybrid jobs with other task-to-thread ratios, please refer to our MPI/OpenMP page for general instructions.

As a last note, be aware that the contents of the batch nodefile, $PBS_NODEFILE, depend on the ppn value of the Torque directive #PBS -l nodes. We recommend working out a more robust and convenient way to generate the nodefile for each of your simultaneous jobs, as sketched below.
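One possibility, sketched below under the assumptions that each simultaneous job uses whole nodes (as in the hybrid example above) and that the Open MPI "slots=" hostfile syntax is accepted, is to build each job's hostfile from the unique host names in $PBS_NODEFILE so that it no longer depends on the ppn value:

# Hypothetical sketch: assign one full node to each of the simultaneous jobs,
# independent of the ppn value used in "#PBS -l nodes".
nodes=($(sort -u $PBS_NODEFILE))
for i in $(seq $num_jobs)
do
    echo "${nodes[$((i-1))]} slots=$tasks_per_job" > job$i/nodefile
done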

Running Multiple Jobs with Job Arrays

Job arrays are a way to submit many jobs using only 1 batch submission script. 

The behavior of the different jobs of the array can be controlled by the different values of the PBS_ARRAYID environment variable for each job in the array.

#PBS -l walltime=00:05:00
#PBS -N jacar
#PBS -o jacar.out
#PBS -e jacar.err
#PBS -q serial
#PBS -t 1-20

cd $PBS_O_WORKDIR

cd workdir.$PBS_ARRAYID
./job.$PBS_ARRAYID <input.$PBS_ARRAYID >output.$PBS_ARRAYID

This example would submit 20 jobs to the Carver serial queue, with PBS_ARRAYID values running from 1 to 20.
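The array range can also be given on the qsub command line rather than in the script; for example, assuming the script above is saved as jacar.pbs:

carver% qsub -t 1-20 jacar.pbs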

For more information about job arrays see Using Job Arrays on Carver.

Job Steps and Dependencies

There is a batch option, depend=dependency_list, for job dependencies. The most commonly used dependency_list is afterok:jobid[:jobid...], which means the newly submitted job will be executed only after the dependent job(s) have terminated without an error. Another option is afterany:jobid[:jobid...], which means the newly submitted job will be executed after the dependent job(s) have terminated with or without an error. The afterany option can be useful for restart runs, where the first job is intentionally allowed to run until it hits its wall clock limit (and therefore terminates with an error).

For example, to run batch job2 only after batch job1 succeeds,

carver% qsub job1.pbs
123456.cvrsvc09-ib
carver% qsub -W depend=afterok:123456.cvrsvc09-ib job2.pbs
123457.cvrsvc09-ib

As with most batch options, the dependency can be included in a batch script rather than on the command line:

carver% qsub job1.pbs
123456.cvrsvc09-ib
carver% qsub job2.pbs
123457.cvrsvc09-ib

where the batch script job2.pbs contains the following line:

#PBS -W depend=afterok:123456.cvrsvc09-ib

The second job will be in batch "Held" status until job1 has run successfully. Note that job2 has to be submitted while job1 is still in the batch system, either running or in the queue. If job1 has exited before job2 is submitted, job2 will not be released from the "Held" status.

It is also possible for the first job to submit the second (dependent) job by employing the Torque environment variable "PBS_JOBID":

#PBS -q regular
#PBS -l nodes=4:ppn=8
#PBS -l walltime=0:30:00
#PBS -j oe

cd $PBS_O_WORKDIR
qsub -W depend=afterok:$PBS_JOBID job2.pbs
mpirun -np 32 ./a.out

Please refer to the qsub man page for other -W depend=dependency_list options, including afterany:jobid[:jobid...], afternotok:jobid[:jobid...], before:jobid[:jobid...], etc.
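For instance, afternotok could be used to submit a hypothetical cleanup or resubmission script, cleanup.pbs, that runs only if the first job fails:

carver% qsub -W depend=afternotok:123456.cvrsvc09-ib cleanup.pbs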

Sample Scripts for Submitting Chained Dependency Jobs

Below is a simple batch script, 'runit.pbs', for submitting three chained jobs in total (job_number_max=3). It sets the job sequence number (job_number) to 1 if this variable is undefined (that is, in the first job). When the value is less than job_number_max, the current job submits the next job. The value of job_number is incremented by 1, and the new value is provided to the subsequent job.

#!/bin/bash
#PBS -q regular
#PBS -l nodes=1
#PBS -l walltime=0:05:00
#PBS -j oe

: ${job_number:="1"}   # set job_number to 1 if it is undefined
job_number_max=3
JOBID="${PBS_JOBID}"

cd $PBS_O_WORKDIR

echo "hi from ${PBS_JOBID}"

if [[ ${job_number} -lt ${job_number_max} ]]
then
    (( job_number++ ))
    next_jobid=$(qsub -v job_number=${job_number} -W depend=afterok:${JOBID} runit.pbs)
    echo "submitted ${next_jobid}"
fi

sleep 15
echo "${PBS_JOBID} done"

Using the above script, three batch jobs are submitted:

carver% qsub runit.pbs
123456.cvrsvc09-ib
carver% ls -l runit.pbs.o*
-rw------- 1 xxxxx xxxxx 899 2011-03-09 09:27 runit.pbs.o123456
-rw------- 1 xxxxx xxxxx 949 2011-03-09 09:28 runit.pbs.o123457
-rw------- 1 xxxxx xxxxx 949 2011-03-09 09:28 runit.pbs.o123458
carver% cat runit.pbs.o123456
...
hi from 123456.cvrsvc09-ib
submitted 123457.cvrsvc09-ib
123456.cvrsvc09-ib done
...
carver% cat runit.pbs.o123457
...
hi from 123457.cvrsvc09-ib
submitted 123458.cvrsvc09-ib
123457.cvrsvc09-ib done
...
carver% cat runit.pbs.o123458
...
hi from 123458.cvrsvc09-ib
123458.cvrsvc09-ib done
...