NERSC: Powering Scientific Discovery Since 1974

Submitting Batch Jobs

Debug Jobs

Short jobs requesting less than 30 minutes and requiring 512 nodes (2,048 cores) or fewer can run in the debug queue. From 5am to 6pm Pacific Time, 256 nodes are reserved for debugging and interactive use. See also Running Interactive Jobs.

Sample Batch Scripts

The following batch script requests 8 cores on 2 nodes with a 10 minute wall clock limit in the debug queue. Torque directive lines tell the batch system how to run a job and begin with #PBS.

#PBS -q debug
#PBS -l mppwidth=8
#PBS -l walltime=00:10:00
#PBS -j eo
#PBS -V

cd $PBS_O_WORKDIR
aprun -n 8 ./a.out

Here is another example requesting 8 processors using 4 nodes with only 2 cores per node:

#PBS -q debug
#PBS -l mppwidth=8
#PBS -l mppnppn=2
#PBS -l walltime=00:10:00
#PBS -j eo

cd $PBS_O_WORKDIR
aprun -n 8 -N 2 ./a.out

Notice the number specified for -l mppwidth will always match the -n option for aprun. Since quad core mode is the default mode, the following code is equivalent to the first example:

#PBS -q debug
#PBS -l mppwidth=8
#PBS -l mppnppn=4
#PBS -l walltime=00:10:00
#PBS -j eo

cd $PBS_O_WORKDIR
aprun -n 8 -N 4 ./a.out
The Torque keyword "#PBS -l mppnppn=4" and the aprun option "-N 4" are optional in quad core mode. The following table lists the most important corresponding "aprun" and "#PBS -l" options:
aprun option    #PBS -l option    Description
-n 8            -l mppwidth=8     Width (number of PEs)
-N 2            -l mppnppn=2      Number of PEs per node

In the sample scripts above, the line cd $PBS_O_WORKDIR changes the current working directory to the directory from which the script was submitted. NERSC recommends running jobs from $SCRATCH or $SCRATCH2 instead of $HOME. The easiest way to run a job from $SCRATCH is to submit the job from the $SCRATCH directory. Alternatively, a user may replace cd $PBS_O_WORKDIR with cd $SCRATCH in the batch script.
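For example, a minimal variant of the first debug script that runs from scratch looks like this (the executable name a.out is a placeholder, as in the examples above):

```shell
#PBS -q debug
#PBS -l mppwidth=8
#PBS -l walltime=00:10:00
#PBS -j eo

# Run from the scratch file system rather than the submit directory
cd $SCRATCH
aprun -n 8 ./a.out
```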

Submit Batch Jobs

Follow these steps to submit an application on Franklin:

  1. Copy the example batch script above into a file.
  2. Edit the batch script's Torque keywords for the desired run.
  3. Verify that the -N argument of the aprun job-launch command (similar to the MPICH mpirun command) matches the Torque directive mppnppn.
  4. Submit the job using the qsub command. For example,
franklin% qsub myscript

Torque Keywords

STDOUT and STDERR

While your job is running, standard output (STDOUT) and standard error (STDERR) are written to temporary spool files in your submit directory (for example: 6561550.nid00003.ER and 6561550.nid00003.OU). These files are updated in real time while the job runs, so users can check their contents to monitor the job. If you merge stderr and stdout via the "#PBS -j eo" or "#PBS -j oe" option, only one such spool file will appear. It is important that users do not remove or rename these spool files while the job is still running!
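As a sketch of this monitoring workflow (the spool file name is the placeholder from the example above; a stand-in file is created here so the commands can run anywhere):

```shell
# Stand-in for a live spool file such as 6561550.nid00003.OU; on Franklin
# the batch system creates this file in the submit directory itself.
printf 'step 1 done\nstep 2 done\n' > 6561550.nid00003.OU

# Peek at the most recent output; with a real job, "tail -f" follows it live.
last_line=$(tail -n 1 6561550.nid00003.OU)
echo "$last_line"   # prints "step 2 done"

# Clean up the stand-in (never remove a real spool file while the job runs!)
rm 6561550.nid00003.OU
```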

After the batch job completes, these files are renamed to the corresponding stderr/stdout files (for example: jobscript.e6561550 and jobscript.o6561550). If you merge stderr into stdout (with the Torque keyword) and redirect your output to a file as follows, the output will go to the file name of your choice instead. For example, if you have:

...
#PBS -j oe
...
aprun -n 64 ./a.out >& my_output_file (for csh/tcsh)
or: aprun -n 64 ./a.out > my_output_file 2>&1 (for bash)

Then the output will be written to "my_output_file" instead of "jobscript.o6561550" at job completion time.

Undelivered Batch Output

Sometimes the batch system fails to deliver the stdout/stderr files back to the user. Once a night, the orphaned output files of a user's jobs are placed in the user's $SCRATCH/Undelivered_Batch_Output directory. The directory is created if it does not yet exist. Output files there are identified by the job id.

Job Steps and Dependencies

The qsub option -W depend=dependency_list (or the equivalent Torque keyword #PBS -W depend=dependency_list) declares job dependencies. The most commonly used dependency_list is afterok:jobid[:jobid...], which means the submitted job may only start after the listed job(s) have terminated without an error. Another option is afterany:jobid[:jobid...], which means the submitted job may only start after the listed job(s) have terminated with or without an error. The afterany option is useful for restart runs, where the first job is often expected to run until it hits its wall clock limit.

For example, to run batch job2 only after batch job1 succeeds,

franklin% qsub job1
297873.nid00003
franklin% qsub -W depend=afterok:297873.nid00003 job2
or
franklin% qsub -W depend=afterany:297873.nid00003 job2

or:

franklin% qsub job1
297873.nid00003
franklin% cat job2
#PBS -q debug
#PBS -l mppwidth=8
#PBS -l walltime=0:30:00
#PBS -W depend=afterok:297873.nid00003
#PBS -j oe

cd $PBS_O_WORKDIR
aprun -n 8 ./a.out
franklin% qsub job2

The second job will stay in the batch "Held" status until job1 has run successfully. Note that job2 must be submitted while job1 is still in the batch system, either running or queued. If job1 has exited before job2 is submitted, job2 will never be released from the "Held" status.
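To confirm the dependency was registered, list your jobs with qstat; a dependent job waits in state "H" (held) until the job it depends on completes:

```shell
franklin% qstat -u $USER
```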

It is also possible to submit the second job from within the batch script of the job it depends on (job1), using the environment variable $PBS_JOBID:

#PBS -q debug
#PBS -l mppwidth=8
#PBS -l walltime=0:30:00
#PBS -j oe

cd $PBS_O_WORKDIR
qsub -W depend=afterok:$PBS_JOBID job2
aprun -n 8 ./a.out

Please refer to the qsub man page for other -W depend=dependency_list options, including afterany:jobid[:jobid...], afternotok:jobid[:jobid...], before:jobid[:jobid...], etc.
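For instance, afternotok can launch a cleanup or diagnostic job only if the first job fails (the script name cleanup_job is a placeholder, not a NERSC-provided script):

```shell
franklin% qsub job1
297873.nid00003
franklin% qsub -W depend=afternotok:297873.nid00003 cleanup_job
```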

Running Multiple Parallel Jobs Sequentially

Multiple parallel jobs can be run sequentially in a single batch job. Be sure to specify the LARGEST number of cores needed by any one of them for the Torque keyword "mppwidth". For example, the following sample script reserves 10 cores and runs three executables one after another:

#PBS -q debug
#PBS -l mppwidth=10
#PBS -l walltime=0:30:00
#PBS -j oe

cd $PBS_O_WORKDIR
aprun -n 4 ./a.out
aprun -n 10 ./b.out
aprun -n 6 ./c.out

Running Multiple Parallel Jobs Simultaneously

Multiple parallel jobs can be run simultaneously in a single batch job. (Note: NERSC policy asks that you not bundle more than 10 simultaneous apruns in one job script.) Be sure to set the Torque keyword "mppwidth" to the TOTAL number of nodes needed for these jobs times 4, not simply the total number of cores, since only one executable can run on any given node. For example, the following sample script runs three executables simultaneously: a.out on 1 node, b.out on 4 nodes, and c.out on 3 nodes, for a total of 8 nodes, so mppwidth should be 4*8=32 (not simply 4+15+9=28):

#PBS -q debug
#PBS -l mppwidth=32
#PBS -l walltime=0:30:00
#PBS -j oe

cd $PBS_O_WORKDIR
aprun -n 4 ./a.out &
aprun -n 15 ./b.out &
aprun -n 9 ./c.out &
wait

Please notice the "&" and "wait" in the above example: "&" puts each aprun command in the background, and "wait" waits for all the background processes to finish before the batch script exits.
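The node arithmetic above can be sketched in shell (assuming quad-core nodes as on Franklin; the core counts are those from the example):

```shell
# Round each aprun's core count up to whole nodes, sum the nodes,
# then multiply by cores per node to get mppwidth.
cores_per_node=4
total_nodes=0
for pes in 4 15 9; do
    nodes=$(( (pes + cores_per_node - 1) / cores_per_node ))
    total_nodes=$(( total_nodes + nodes ))
done
mppwidth=$(( total_nodes * cores_per_node ))
echo "mppwidth=$mppwidth"   # prints "mppwidth=32" (8 nodes * 4 cores)
```

The same computation applies to MPMD jobs below, since each executable there also occupies whole nodes.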

Running MPMD (Multi-Program Multi-Data) Jobs

To run an MPMD job, use the aprun option syntax "-n pes executable1 : -n pes executable2 : ...". All the executables share a single MPI_COMM_WORLD.

For example, the following command runs a.out on 4 cores and b.out on 8 cores:

aprun -n 4 ./a.out : -n 8 ./b.out

Please note that the number of nodes needed for each executable must be calculated separately, since only one executable can run on each node. The number of nodes needed for the job is the sum of the nodes needed by each executable, and mppwidth must be set to that total number of nodes times 4.

For example, the following script runs a.out on 3 cores and b.out on 9 cores. The total number of nodes needed (with the default quad core mode) is 1 + 3 = 4, so mppwidth must be set to 16, not simply 3 + 9 = 12:

#PBS -q debug
#PBS -l mppwidth=16
#PBS -l walltime=0:30:00
#PBS -j oe

cd $PBS_O_WORKDIR
aprun -n 3 ./a.out : -n 9 ./b.out

Running Serial Jobs

To preserve interactive response on the login nodes, serial jobs that require more than 5 minutes and/or 1 GB of memory should be run in the same manner as larger parallel jobs. For example, a user with a serial code that needs to run for 2 hours could submit the following batch script:

#PBS -q regular
#PBS -l mppwidth=1
#PBS -l walltime=02:00:00
#PBS -j eo
#PBS -V

cd $PBS_O_WORKDIR
aprun -n 1 ./a.out