Submitting Batch Jobs
Overview
A batch job is the most common way users run production applications on NERSC machines. Carver's batch system is based on the PBS model, implemented with the Moab scheduler and Torque resource manager.
Typically, the user submits a batch script to the batch system. This script specifies, at the very least, how many nodes and cores the job will use, how long the job will run, and the name of the application to run. The job will advance in the queue until it has reached the top. At this point, Torque will allocate the requested number of nodes to the batch job. The batch script itself will execute on the "head node" (sometimes known as the "MOM" node). See Queues and Policies for details of batch queues, limits, and policies.
Debug Jobs
Short jobs requesting less than 30 minutes of walltime and requiring 32 nodes (256 cores) or fewer can run in the debug queue. From 5am to 6pm Pacific Time, 8 nodes are reserved for debugging and interactive use.
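The debug queue can also be used for interactive testing by submitting an interactive batch job with "qsub -I"; for example (the resource values here are illustrative):
carver% qsub -I -q debug -l nodes=1:ppn=8,walltime=00:30:00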
Sample Batch Scripts
Although there are default values for all batch parameters, it is a good idea to always specify the name of the queue, the number of nodes, and the walltime for all batch jobs. To minimize the time spent waiting in the queue, specify the smallest walltime that will safely allow the job to complete.
A common convention is to append the suffix ".pbs" to batch scripts.
Basic Batch Script
This example requests 16 nodes with 8 tasks per node, for 10 minutes, in the debug queue.
#PBS -q debug
#PBS -l nodes=16:ppn=8
#PBS -l walltime=00:10:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out
#PBS -V
cd $PBS_O_WORKDIR
mpirun -np 128 ./my_executable
Submit Example
Submit your batch script with the qsub command:
carver% qsub my_job.pbs
123456.cvrsvc09-ib
The qsub command displays the job_id (123456.cvrsvc09-ib in the above example). It is important to keep track of your job_id, to help with job tracking and problem resolution.
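You can use the job_id to check on the status of a submitted job, for example with the Torque qstat command:
carver% qstat 123456.cvrsvc09-ib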
Torque Keywords
The following keywords may be specified as qsub command line options, or as directives (preceded by #PBS) embedded in a batch script.
Required Torque Options/Directives

| Option | Default | Description |
|---|---|---|
| -l nodes=num_nodes | 1 | Number of nodes assigned to the job |
| -l walltime=HH:MM:SS | 00:05:00 | Maximum wallclock time for the job |
| -q queue_name | debug | Name of submit queue |
| -N job_name | Name of job script | Name of job; up to 15 printable, non-whitespace characters |

Useful Torque Options/Directives

| Option | Default | Description |
|---|---|---|
| -A repo | Default repo | Charge job to repo |
| -e filename | <job_name>.e<job_id> | Write stderr to filename |
| -o filename | <job_name>.o<job_id> | Write stdout to filename |
| -j [oe \| eo] | Do not merge | Merge (join) stdout and stderr: if oe, merge as the output file; if eo, merge as the error file |
| -m [a \| b \| e \| n] | a | Email notification: a = send mail if the job is aborted by the system; b = send mail when the job begins; e = send mail when the job ends; n = never send email. Options a, b, and e may be combined |
| -S shell | Login shell | Specify shell to interpret the batch script |
| -r [y \| n] | n (Rerunable=False) | By default the system does not rerun the job after a system-wide outage. Users can override this behavior with "#PBS -r y" |
| -l gres=project:1 | Not set | Specify that a batch job uses /project. When set, the job will not start during scheduled /project file system maintenance |
| -V | Do not import | Export current environment variables into the batch job environment |
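These keywords can be given on the qsub command line instead of (or in addition to) directives embedded in the script; a command-line option overrides the corresponding embedded directive. For example, to submit the basic script above with a longer walltime:
carver% qsub -l walltime=00:20:00 my_job.pbs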
Torque Environment Variables
The batch system defines many environment variables, which are available for use in batch scripts. The following tables list some of the more useful variables. Users must not redefine the value of any of these variables!
| Variable Name | Meaning |
|---|---|
| PBS_O_LOGNAME | Login name of the user who executed qsub. |
| PBS_O_HOME | Home directory of the submitting user. |
| PBS_O_WORKDIR | Directory in which the qsub command was executed. Note that batch jobs begin execution in $PBS_O_HOME; many batch scripts execute "cd $PBS_O_WORKDIR" as their first executable statement. |
| PBS_O_HOST | Hostname of the system on which qsub was executed. This is typically a Carver login node. |
| PBS_JOBID | Unique identifier for this job; important for tracking job status. |
| PBS_ENVIRONMENT | Set to "PBS_BATCH" for scripts submitted as batch jobs; "PBS_INTERACTIVE" for interactive batch jobs ("qsub -I ..."). |
| PBS_O_QUEUE | Name of submit queue. |
| PBS_QUEUE | Name of execution queue. |
| PBS_JOBNAME | Name of this job. |
| PBS_NODEFILE | Name of the file containing the list of nodes assigned to this job. |
| PBS_NUM_NODES | Number of nodes assigned to this job. |
| PBS_NUM_PPN | Value of "ppn" (processes per node) for this job. |
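As a minimal sketch, a batch script might use these variables to record where and how it is running:
cd $PBS_O_WORKDIR
echo "Job $PBS_JOBID submitted from $PBS_O_HOST by $PBS_O_LOGNAME"
echo "Running on $PBS_NUM_NODES node(s) with $PBS_NUM_PPN task(s) per node:"
cat $PBS_NODEFILE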
Standard Output and Error
While your job is running, standard output (stdout) and standard error (stderr) are written to temporary "spool" files (for example: 123456-cvrsvc09-ib.OU and 123456-cvrsvc09-ib.ER) in the submit directory. If you merge stderr/stdout via the "#PBS -j eo" or "#PBS -j oe" option, then only one such spool file will appear.
These files will be updated in real-time while the job is running, allowing you to use them for job monitoring. It is important that you do not modify, remove or rename these spool files while the job is still running!
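For example, to follow the output of the running job shown earlier (reading the spool file does not modify it):
carver% tail -f 123456-cvrsvc09-ib.OU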
After your batch job completes, the above files will be renamed to the corresponding stdout/stderr files (for example: my_job.o123456 and my_job.e123456).
Running Serial Jobs
A serial job is one that only requires a single computational core. The serial queue is specifically configured to run serial jobs. It runs on 80 12-core Westmere nodes, each having 48GB of memory.
Serial jobs share nodes, rather than having exclusive access. Multiple jobs will be scheduled on an available node until either all cores are in use, or until there is not enough memory available for additional processes on that node. This last characteristic makes it important to request sufficient memory for your serial job. See Carver Memory Considerations for more information.
Serial jobs are not charged for an entire node; they are only charged for a single core. See Usage Charging for more information.
The following script requests a single Westmere core and 10GB of memory for 12 hours:
#PBS -q serial
#PBS -l walltime=12:00:00
#PBS -l pvmem=10GB
cd $PBS_O_WORKDIR
./a.out
Note that it is not necessary to specify "-l nodes=1:ppn=1" for serial jobs. Also note that if values other than 1 are specified for these attributes, the serial job will never start.
Running Multiple Serial Jobs
In some situations one needs to run the same serial program repeatedly with a set of different input parameters, for example to examine the behavior of the dynamics over a large domain in input parameter space. In such cases it is more advantageous, both for managing one's project work and for job scheduling, to bundle multiple application runs into a few large batch jobs, rather than to submit numerous serial jobs each containing just a single run. Here we examine some ways to launch such multiple serial runs:
- Using the Serial Queue
- Using pbsdsh
- Using taskfarmer
- Using Python
- Using MPI to Make an Embarrassingly Parallel Code
In the examples below, we assume that the executable a.out resides in a directory and each run's input and output files are kept in a separate directory underneath it (dir00/infile, dir00/outfile, dir01/infile, dir01/outfile, ...).
Using the Serial Queue
An easy approach is to run the serial tasks sequentially in the serial queue:
#!/bin/bash
#PBS -q serial
#PBS -l walltime=1:00:00
#PBS -l pvmem=2GB
cd $PBS_O_WORKDIR
./a.out < dir00/infile > dir00/outfile
./a.out < dir01/infile > dir01/outfile
./a.out < dir02/infile > dir02/outfile
./a.out < dir03/infile > dir03/outfile
./a.out < dir04/infile > dir04/outfile
./a.out < dir05/infile > dir05/outfile
./a.out < dir06/infile > dir06/outfile
./a.out < dir07/infile > dir07/outfile
This job uses a single core for the serial runs and is not charged for the other cores on the assigned compute node. However, the application runs are processed sequentially, so the time to complete the entire workload can become large.
Using pbsdsh
One can run the serial workload above in parallel with 'pbsdsh'. This command distributes the workload across the allocated compute nodes, and all of the allocated cores execute the shell script $PBS_O_WORKDIR/runtask.sh in parallel.
#!/bin/bash
#PBS -q regular
#PBS -l nodes=1:ppn=8
#PBS -l walltime=1:00:00
#PBS -l pvmem=2GB
cd $PBS_O_WORKDIR
pbsdsh $PBS_O_WORKDIR/runtask.sh
where runtask.sh is given as follows:
#!/bin/bash
cd $PBS_O_WORKDIR
wdir=$(printf "dir%2.2d" ${PBS_VNODENUM}) # dir00, dir01, dir02,...
./a.out < $wdir/infile > $wdir/outfile
The environment variable PBS_VNODENUM is a CPU (virtual node) identifier whose value runs from 0 to nodes*ppn-1. This is used in the above script to identify 8 tasks spawned by pbsdsh.
Please note that we cannot use the serial queue here, since we are using multiple cores (and multiple nodes if necessary). Since the workload is processed in parallel, the required walltime is reduced.
Using taskfarmer
One can use the taskfarmer utility developed at NERSC to process the workload in parallel in a similar way. To use the utility, both the tig and taskfarmer modules should be loaded. A sample batch script is below:
#!/bin/bash -l
#PBS -q regular
#PBS -l nodes=1:ppn=8
#PBS -l walltime=1:00:00
#PBS -l pvmem=2GB
module load tig taskfarmer
cd $PBS_O_WORKDIR
runcommands.sh jobfile
where jobfile is given as follows, with '/path/for/cwd' replaced by the proper full path of the current working directory:
/path/for/cwd/a.out < /path/for/cwd/dir00/infile > /path/for/cwd/dir00/outfile
/path/for/cwd/a.out < /path/for/cwd/dir01/infile > /path/for/cwd/dir01/outfile
/path/for/cwd/a.out < /path/for/cwd/dir02/infile > /path/for/cwd/dir02/outfile
/path/for/cwd/a.out < /path/for/cwd/dir03/infile > /path/for/cwd/dir03/outfile
/path/for/cwd/a.out < /path/for/cwd/dir04/infile > /path/for/cwd/dir04/outfile
/path/for/cwd/a.out < /path/for/cwd/dir05/infile > /path/for/cwd/dir05/outfile
/path/for/cwd/a.out < /path/for/cwd/dir06/infile > /path/for/cwd/dir06/outfile
/path/for/cwd/a.out < /path/for/cwd/dir07/infile > /path/for/cwd/dir07/outfile
Please note that the file, jobfile, is not a shell script. It is a text file that lists all the tasks to be done.
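For a large number of runs, jobfile need not be written by hand; a short shell loop can generate it. A minimal sketch, assuming the eight directories above and GNU seq:
#!/bin/bash
cwd=$(pwd)                        # full path of the current working directory
for i in $(seq -f "%02g" 0 7)     # 00, 01, ..., 07
do
    echo "$cwd/a.out < $cwd/dir$i/infile > $cwd/dir$i/outfile"
done > jobfile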
Using Python
One can use Python's mpi4py module to launch multiple serial runs. Below is a sample Python script, 'mwrapper.py':
#!/usr/bin/env python
from mpi4py import MPI
from subprocess import call
import sys

exctbl = sys.argv[1]   # full path of the serial executable, passed as an argument
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
myDir = "dir"+str(rank).zfill(2)
cmd = "cd "+myDir+" ; "+exctbl+" < infile > outfile"
sts = call(cmd,shell=True)
comm.Barrier()
Below is a batch script to use it for a serial program, a.out:
#!/bin/bash -l
#PBS -q regular
#PBS -l nodes=1:ppn=8
#PBS -l walltime=1:00:00
#PBS -l pvmem=2GB
module load python/2.7.3 mpi4py
cd $PBS_O_WORKDIR
mpirun -np 8 ./mwrapper.py $PBS_O_WORKDIR/a.out
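Note that mpirun launches mwrapper.py directly, so the script must be executable (its "#!/usr/bin/env python" line then selects the interpreter):
carver% chmod +x mwrapper.py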
Using MPI
MPI can also be used to execute the serial runs simultaneously. The following example uses a non-standard but widely supported Fortran intrinsic, SYSTEM, to make each MPI rank run the serial application independently of the other ranks within the MPI environment.
program mpiwrapped
include 'mpif.h'
character*80 cmd
call mpi_init(ierr)
call mpi_comm_rank(mpi_comm_world,me,ierr)
write(cmd,'("./a.out < dir",i2.2,"/infile > dir",i2.2,"/outfile")') &
     me, me   ! ./a.out < dir00/infile > dir00/outfile for rank 0, etc.
call system(trim(cmd))
call mpi_finalize(ierr)
end
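The wrapper is compiled like any other MPI program, for example with the Fortran compiler wrapper provided by the loaded MPI module (the wrapper name, mpif90 here, may differ by environment):
carver% mpif90 -o mpiwrapped mpiwrapped.f90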
Another way to run multiple serial tasks in parallel is to turn the serial code into an embarrassingly parallel MPI code with a minimal set of MPI-related changes. The idea is to make each MPI task read its own data from a separate input file and write its result to a separate output file, with the file names constructed from the MPI rank. The following code is turned into an MPI code simply by adding the "include 'mpif.h'", MPI_Init, MPI_Comm_rank, and MPI_Finalize statements, and by constructing the input and output file names from the MPI rank and making the program use those files:
program mpiwrapped
include 'mpif.h'
...
character*12 file_in
character*13 file_out
...
call mpi_init(ierr)
call mpi_comm_rank(mpi_comm_world,me,ierr)
write(file_in,'("dir",i2.2,"/infile")') me     ! dir00/infile for rank 0, etc.
write(file_out,'("dir",i2.2,"/outfile")') me   ! dir00/outfile for rank 0, etc.
...
open(10,file=file_in)
read(10,*) ...    ! instead of 'read(5,*) ...'
close(10)
...
open(11,file=file_out)
write(11,*) ...   ! instead of 'write(6,*) ...'
close(11)
...
call mpi_finalize(ierr)
end
Below is a batch script to run the program above.
#!/bin/bash
#PBS -q regular
#PBS -l nodes=1:ppn=8
#PBS -l walltime=1:00:00
#PBS -l pvmem=2GB
cd $PBS_O_WORKDIR
mpirun -np 8 ./mpiwrapped
Such wrapper codes can similarly be prepared in other languages such as C or C++; in fact, the Python example above can be considered one such case. Note also that distributed-memory parallel programming models other than MPI can be used in the same way.
Running Multiple Parallel Jobs Sequentially
Multiple parallel applications can be run one after another within a single batch job, as in the example below. Make sure that the total number of nodes requested is sufficient for the largest (by node count) application, and that the requested walltime is sufficient for all applications combined. Note that the repository charge for the batch job is based on the total number of nodes requested, regardless of whether any particular application uses all those nodes.
#PBS -q regular
#PBS -l nodes=16:ppn=8
#PBS -l walltime=4:00:00
#PBS -N my_job
#PBS -e my_job.$PBS_JOBID.err
#PBS -o my_job.$PBS_JOBID.out
#PBS -V
cd $PBS_O_WORKDIR
mpirun -np 128 ./my_executable1
mpirun -np 32 ./my_executable2
mpirun -np 64 ./my_executable3
Running Multiple Parallel Jobs Simultaneously
If you need to run many jobs that require approximately the same run time, you can bundle them up and run them from one job script. For example, to run 8 jobs simultaneously with 4 cores per job (4 nodes in total), here is a sample job script:
#!/bin/bash -l
#PBS -N test_multi
#PBS -q debug
#PBS -l nodes=4:ppn=8,walltime=30:00
#PBS -j oe
#PBS -V
cd $PBS_O_WORKDIR
#Run 8 jobs simultaneously, each with 4 cores
let num_jobs=8
let tasks_per_job=4
#Assume jobs run in separate directories, job1, job2, ...
for i in $(seq $num_jobs)
do
cd job$i
#write hostfile for i-th job to use
let lstart=($i-1)*${tasks_per_job}+1
let lend=${lstart}+${tasks_per_job}-1
sed -n ${lstart},${lend}'p' < $PBS_NODEFILE >nodefile
mpirun -np $tasks_per_job -hostfile nodefile ./a.out >& job$i.out &
cd ..
done
wait
Please note that it is important to pass the hostfile to mpirun for each job; otherwise all jobs will run on the same first 4 cores, which will overload the node and will likely kill it. The final "wait" command is also necessary; otherwise the batch job will exit as soon as the script has executed, and all the runs sent to the background will be terminated.
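As an illustration of the slicing above: for the nodes=4:ppn=8 request, $PBS_NODEFILE contains 32 lines (each node name repeated 8 times), and the sed command extracts lines 1-4 for job 1, lines 5-8 for job 2, and so on. A quick sketch to print each job's slice from inside such a batch job:
for i in $(seq 8)
do
    lstart=$(( (i-1)*4 + 1 ))
    lend=$(( lstart + 3 ))
    echo "job$i gets:"
    sed -n "${lstart},${lend}p" $PBS_NODEFILE
done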
If you want to run multiple hybrid MPI+OpenMP jobs at the same time, here is a sample job script. It runs 8 hybrid jobs simultaneously, each with 2 MPI tasks and 4 threads per task (so each job runs on 1 node, and you need 8 nodes in total).
#!/bin/bash -l
#PBS -N test_hybrid
#PBS -q debug
#PBS -l nodes=8:ppn=2,walltime=30:00
#PBS -l pvmem=10GB
#PBS -j oe
#PBS -V
cd $PBS_O_WORKDIR
#Run 8 jobs simultaneously, each with 2 MPI tasks and 4 threads per task
let num_jobs=8
let tasks_per_job=2
let threads_per_task=4
export OMP_NUM_THREADS=$threads_per_task
#Assume jobs run in separate directories, job1, job2, ...
for i in $(seq $num_jobs)
do
cd job$i
#write hostfile for i-th job to use
let lstart=($i-1)*${tasks_per_job}+1
let lend=${lstart}+${tasks_per_job}-1
sed -n ${lstart},${lend}'p' < $PBS_NODEFILE >nodefile
mpirun -np $tasks_per_job -bysocket -bind-to-socket -hostfile nodefile ./a.out >& job$i.out &
cd ..
done
wait
Again, it is important to pass the hostfile to the mpirun command line to spread your tasks over the nodes allocated to your job. Please also note that the Torque directive #PBS -l pvmem=10GB is necessary to give each MPI task four times the default per-core memory (2.5GB); otherwise the 4 threads in each MPI task would share the default 2.5GB instead of the available 10GB. For more information about using memory properly, please refer to our Memory Considerations page. This job script is set up for 4 threads per task; to run hybrid jobs with other task-to-thread ratios, please refer to our MPI/OpenMP page for general instructions.
As a last note, be aware that the contents of the batch nodefile, $PBS_NODEFILE, depend on the ppn value of the Torque directive #PBS -l nodes. We recommend working out a more robust and convenient way to generate the nodefile for each of your multiple jobs, such as the sketch below.
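A minimal sketch of one such approach, for the one-job-per-node layout of the hybrid example above (it assumes, as on Carver, that $PBS_NODEFILE lists each node name once per requested core, grouped by node):
tasks_per_job=2
uniq $PBS_NODEFILE > all_nodes    # one entry per node, regardless of ppn
i=1
while read node
do
    # job i runs entirely on node i; list the node once per MPI task
    for t in $(seq $tasks_per_job); do echo $node; done > hostfile.job$i
    i=$((i+1))
done < all_nodes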
Job Steps and Dependencies
There is a batch option, depend=dependency_list, for job dependencies. The most commonly used dependency_list is afterok:jobid[:jobid...], which means the job just submitted will be executed only after the dependent job(s) have terminated without an error. Another option is afterany:jobid[:jobid...], which means the job just submitted will be executed after the dependent job(s) have terminated, with or without an error. The latter can be useful for restart runs, where the first job is expected to reach its wall clock limit.
For example, to run batch job2 only after batch job1 succeeds,
carver% qsub job1.pbs
123456.cvrsvc09-ib
carver% qsub -W depend=afterok:123456.cvrsvc09-ib job2.pbs
123457.cvrsvc09-ib
As with most batch options, the dependency can be included in a batch script rather than on the command line:
carver% qsub job1.pbs
123456.cvrsvc09-ib
carver% qsub job2.pbs
123457.cvrsvc09-ib
where the batch script job2.pbs contains the following line:
#PBS -W depend=afterok:123456.cvrsvc09-ib
The second job will be in batch "Held" status until job1 has run successfully. Note that job2 has to be submitted while job1 is still in the batch system, either running or in the queue. If job1 has exited before job2 is submitted, job2 will not be released from the "Held" status.
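You can verify this with qstat; while job1 is still in the batch system, the state column for job2 will show "H" (held):
carver% qstat 123457.cvrsvc09-ib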
It is also possible for the first job to submit the second (dependent) job by employing the Torque environment variable "PBS_JOBID":
#PBS -q regular
#PBS -l nodes=4:ppn=8
#PBS -l walltime=0:30:00
#PBS -j oe
cd $PBS_O_WORKDIR
qsub -W depend=afterok:$PBS_JOBID job2.pbs
mpirun -np 32 ./a.out
Please refer to the qsub man page for other -W depend=dependency_list options, including afterany:jobid[:jobid...], afternotok:jobid[:jobid...], before:jobid[:jobid...], etc.
Sample Scripts for Submitting Chained Dependency Jobs
Below is a simple batch script, 'runit.pbs', for submitting three chained jobs in total (job_number_max=3). It sets the job sequence number (job_number) to 1 if this variable is undefined (that is, in the first job). When the value is less than job_number_max, the current job submits the next job. The value of job_number is incremented by 1, and the new value is provided to the subsequent job.
#!/bin/bash
#PBS -q regular
#PBS -l nodes=1
#PBS -l walltime=0:05:00
#PBS -j oe
: ${job_number:="1"} # set job_number to 1 if it is undefined
job_number_max=3
JOBID="${PBS_JOBID}"
cd $PBS_O_WORKDIR
echo "hi from ${PBS_JOBID}"
if [[ ${job_number} -lt ${job_number_max} ]]
then
(( job_number++ ))
next_jobid=$(qsub -v job_number=${job_number} -W depend=afterok:${JOBID} runit.pbs)
echo "submitted ${next_jobid}"
fi
sleep 15
echo "${PBS_JOBID} done"
Using the above script, three batch jobs are submitted:
carver% qsub runit.pbs
123456.cvrsvc09-ib
carver% ls -l runit.pbs.o*
-rw------- 1 xxxxx xxxxx 899 2011-03-09 09:27 runit.pbs.o123456
-rw------- 1 xxxxx xxxxx 949 2011-03-09 09:28 runit.pbs.o123457
-rw------- 1 xxxxx xxxxx 949 2011-03-09 09:28 runit.pbs.o123458
carver% cat runit.pbs.o123456
...
hi from 123456.cvrsvc09-ib
submitted 123457.cvrsvc09-ib
123456.cvrsvc09-ib done
...
carver% cat runit.pbs.o123457
...
hi from 123457.cvrsvc09-ib
submitted 123458.cvrsvc09-ib
123457.cvrsvc09-ib done
...
carver% cat runit.pbs.o123458
...
hi from 123458.cvrsvc09-ib
123458.cvrsvc09-ib done
...