Job Arrays

Job arrays have many advantages, including reduced load on UGE, faster job submission, and easier job management. If you find yourself submitting thousands of jobs at a time that are largely identical, you should use job arrays. For example, if you have many different data sets, but want to run the same program on all of the data sets, you can either use Perl to generate one script for each data set, or you can use a job array with a single script.  Using a UGE job array is cleaner, easier and more efficient!

Here is a description of how UGE job arrays work:

Job arrays can be submitted from the command line with the -t option to qsub, e.g.,

genepool01% qsub -t 1-20:1 myjob.csh

This submits 20 identical tasks, each running an instance of myjob.csh, with task indices from 1 to 20.

To use a different input file for each task, reference $SGE_TASK_ID in myjob.csh.  For this example, if the inputs are named "inputs.1", ..., "inputs.20" and you want correspondingly named output files, you can do the following in myjob.csh:

./myprogram inputs.$SGE_TASK_ID results.$SGE_TASK_ID

The $SGE_TASK_ID variable holds the index of the task currently running on the compute node. For this example it takes each value from 1 to 20, so the program is executed once on each input file.
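As a minimal sketch, the same job script can also be written in POSIX shell. The `#$ -S /bin/sh` line, the fallback value of 1, and the check that the input file exists are assumptions added for illustration; only ./myprogram, the inputs.N and results.N naming, and $SGE_TASK_ID come from the example above.

```shell
#!/bin/sh
#$ -S /bin/sh
# Hypothetical POSIX shell variant of the myjob.csh example.

# UGE sets SGE_TASK_ID for each task of the array. Fall back to 1
# (an assumption for illustration) so the script can be tried by hand.
task_id="${SGE_TASK_ID:-1}"

# Build the per-task file names from the task index.
input="inputs.${task_id}"
output="results.${task_id}"

if [ -r "$input" ]; then
    ./myprogram "$input" "$output"
else
    # Report the problem on stderr instead of running on a missing file.
    echo "task ${task_id}: ${input} not found, nothing to do" >&2
fi
```

Checking for the input file first means a task whose file is missing leaves a clear message in its error log.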

This is computationally equivalent to 20 individual queue submissions. However:

  • only ONE qsub command is issued (and only ONE qdel command is required to delete all of these tasks)
  • the pending tasks appear as only ONE entry in the qstat output
  • the load on the UGE submit node (cluster login node) is significantly lighter than if it had to process 20 individual jobs

Other options for job arrays

If the ":1" is replaced with ":n", only every "n"th index in the range is run.  The task index does not need to start at one. You could instead do something like

genepool01% qsub -t 25-78:5 myjob.csh

which would start the tasks at 25, increment by 5, and end at 75.  Why does it end at 75 and not at 78 as entered?  UGE automatically lowers the upper bound to the largest value that equals the lower bound plus an integer multiple of the increment and does not exceed 78, i.e. 25+5*10 = 75.
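The index arithmetic can be checked with a short shell sketch. The function names task_indices and last_index are hypothetical helpers, not part of UGE; they only reproduce the rule described above.

```shell
#!/bin/sh
# List the task indices that "-t first-last:step" would produce.
# Arguments: first last step
task_indices() {
    i="$1"
    while [ "$i" -le "$2" ]; do
        echo "$i"
        i=$((i + $3))
    done
}

# Largest index of the form first + k*step that does not exceed last.
# Integer division discards the remainder, which gives the rounding down.
last_index() {
    span=$(($2 - $1))
    echo $(($1 + span / $3 * $3))
}

task_indices 25 78 5 | tr '\n' ' '; echo
echo "last task index: $(last_index 25 78 5)"
```

For the range 25-78:5 this lists 11 indices, from 25 up to 75.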

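For files with more complicated naming conventions, you can do the following in a Perl script:

#!/usr/bin/env perl
use strict;
use warnings;

# Read the list of input files, one name per line.
open(my $list, "<", "jobFiles.list") or die "Error: $!\n";
my @labels;
while (my $name = <$list>) {
    chomp $name;    # strip the trailing newline so the name is usable as a path
    push(@labels, $name);
}
close($list);

# SGE_TASK_ID starts at 1, Perl array indices start at 0.
my $whichFile = $ENV{SGE_TASK_ID} - 1;
my $filename = $labels[$whichFile];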

Here "jobFiles.list" is a file containing a list of input files, one per line, and $filename is the input file for the current task. This snippet runs when each task executes, not at submission time. The range given to the qsub -t option should match the actual number of files in your list, i.e. 1-N for a list of N files.
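For comparison, the same lookup can be sketched in POSIX shell. The helper names nth_line and count_tasks are hypothetical; jobFiles.list and SGE_TASK_ID are taken from the example above, and the fallback value of 1 is an assumption so that the sketch can be run by hand.

```shell
#!/bin/sh
# Print line N (counting from 1) of a list file.
# Arguments: file N
nth_line() {
    sed -n "${2}p" "$1"
}

# Count the entries in a list file, i.e. the N to use in "qsub -t 1-N".
# Assumes every entry, including the last, ends with a newline.
count_tasks() {
    wc -l < "$1" | tr -d ' '
}

list="jobFiles.list"
task_id="${SGE_TASK_ID:-1}"

if [ -r "$list" ]; then
    filename=$(nth_line "$list" "$task_id")
    echo "task ${task_id} of $(count_tasks "$list"): ${filename}"
else
    echo "${list} not found in the current directory" >&2
fi
```

Running count_tasks on the list before submission gives the upper bound for the -t range, so the range and the list stay in step.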

Limiting concurrent running tasks

Array jobs that are particularly I/O intensive may need to be restricted to only a few concurrently running tasks.  To specify the maximum number of tasks that can run concurrently, use the "-tc <maximum # of tasks>" option.

dmj@genepool04:~$ qsub -t 1-500 -tc 10 -b y sleep 300
Your job-array 3791842.1-500:1 ("sleep") has been submitted
dmj@genepool04:~$ qs -u dmj
JOBID      ST  PRIOR USER     PROJECT      QUEUE  NAME     R_N:s|TS R_RAM/N R_RAM/s    R_TIME    U_TIME       START/SUB_TIME TASK
-----------------------------------------------------------------------------------------------------------------
3791842     r 0.1395 dmj      system.p     normal sleep           1    5.0G    5.0G  12:00:00  00:00:23  2012-10-29 08:18:45 1
3791842     r 0.1373 dmj      system.p     normal sleep           1    5.0G    5.0G  12:00:00  00:00:23  2012-10-29 08:18:45 2
3791842     r 0.1351 dmj      system.p     normal sleep           1    5.0G    5.0G  12:00:00  00:00:23  2012-10-29 08:18:45 3
3791842     r 0.1331 dmj      system.p     normal sleep           1    5.0G    5.0G  12:00:00  00:00:23  2012-10-29 08:18:45 4
3791842     r 0.1312 dmj      system.p     normal sleep           1    5.0G    5.0G  12:00:00  00:00:23  2012-10-29 08:18:45 5
3791842     r 0.1293 dmj      system.p     normal sleep           1    5.0G    5.0G  12:00:00  00:00:23  2012-10-29 08:18:45 6
3791842     r 0.1276 dmj      system.p     normal sleep           1    5.0G    5.0G  12:00:00  00:00:23  2012-10-29 08:18:45 7
3791842     r 0.1259 dmj      system.p     normal sleep           1    5.0G    5.0G  12:00:00  00:00:23  2012-10-29 08:18:45 8
3791842     r 0.1243 dmj      system.p     normal sleep           1    5.0G    5.0G  12:00:00  00:00:23  2012-10-29 08:18:45 9
3791842     r 0.1227 dmj      system.p     normal sleep           1    5.0G    5.0G  12:00:00  00:00:23  2012-10-29 08:18:45 10
3791842    qw 0.0000 dmj      system.p     normal sleep           1    5.0G    5.0G  12:00:00  --:--:--  2012-10-29 08:18:25 11-500:1
2012-10-29 08:19:08.765629
dmj@genepool04:~$
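As a sketch, the same limits can be placed in the job script itself, since qsub reads options from script lines that begin with "#$". The script below is a hypothetical example, not a script from this site; the echo lines stand in for the real work, and the fallback value of 1 is an assumption for running it by hand.

```shell
#!/bin/sh
#$ -S /bin/sh
#$ -t 1-500
#$ -tc 10
# The directives above request 500 tasks with at most 10 running at
# once, and ask UGE to run the script with /bin/sh.

# Fall back to 1 when SGE_TASK_ID is unset (assumption for a dry run).
task_id="${SGE_TASK_ID:-1}"

echo "task ${task_id}: starting"
# ... the I/O-intensive work for this task goes here ...
echo "task ${task_id}: done"
```

With the options embedded this way, the script can be submitted with qsub and no -t or -tc on the command line.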