Running Jobs
Full Document |
Running on Bassi
|
| Submit Class1 | Job Type | Destination Class2 | Nodes | Available Processors | Max Wallclock | Relative Priority3 | MPP Charge (units of nodes*wall hrs) | Availability |
|---|---|---|---|---|---|---|---|---|
| interactive | parallel | interactive | 1-4 | 1-32 | 30 mins | 1 | 48 | Everyone |
| debug | parallel | debug | 1-8 | 1-64 | 30 mins | 2 | 48 | Everyone |
| premium | parallel | premium | 1-48 | 1-384 | 12 hrs | 4 | 96 | Everyone |
| regular | parallel | reg_1 | 1-15 | 1-120 | 36 hrs | 5 | 48 | Everyone |
| parallel | reg_16 | 16-31 | 121-248 | 18 hrs | 5 | 48 | Everyone | |
| parallel | reg_32 | 32-48 | 249-384 | 18 hrs | 5 | 48 | Everyone | |
| low | parallel | low | 1-32 | 1-256 | 12 hrs | 6 | 24 | Everyone |
| special4 | parallel | special | 1-64 | 1-512 | 48 hrs | 3 | 48 | By special arrangement |
| full_config4 | parallel | full_config | 1-ALL | 1-ALL | 48 hrs | 3 | 48 | By special arrangement |
1 - This is the class name to be used in LoadLeveler scripts.
2 - Users cannot submit scripts directly to a destination class,
but this is the class name that will appear when using job monitoring utilities.
3 - The priorites listed in the table are relative. NERSC assigns priorities in terms of
"equivalent days waiting in the queue".
In addition to the relative priority given to jobs depending on their LoadLeveler class, certain projects
with high priority within DOE receive a "scheduling boost". These tend to be INCITE projects.
4 - Available by special arrangement only.
See also Queue Policies, below, for information on run limits.
You can use the llclass command on the system to obtain information about the LoadLeveler classes. Detailed information about a single LoadLeveler class can be found using llclass -l classname.
If you request more wall clock time than allowed by the class (as indicated by the Max Wallclock column in the table above), your job will be submitted with the wall clock time adjusted to the maximum allowed. If you omit requested time, then a default of 30 minutes will be used.
Your job will be charged and scheduled according to the priority listed in the class name. Both interactive and debug are charged at the regular rate. See MPP Accounting.
The classes are configured to give the best service to premium and regular jobs. Premium jobs are charged at twice the rate of regular jobs, but are scheduled at a higher priority.
Loadleveler uses a scheduling technique called "backfilling". This method starts smaller, shorter jobs if they will not affect the start time for the job that is scheduled to begin next. This scheduling technique is advantageous from both a user and system perspective. It allows a faster turn around for shorter jobs, and it maximizes system usage.
Here's a list of commonly used Loadleveler commands with brief descriptions. Detailed information on these commands can be obtained by typing:
% man commandname
| Command | Description |
|---|---|
| llsubmit | Submits a script file to Loadleveler for execution |
| llqs | Lists all Loadleveler jobs, including requested wall clock time and number of nodes. |
| llq | Same as llqs but with less information. |
| llcancel | Deletes a job from the Loadleveler queue |
| llclass | Displays information about the configured classes |
| llstatus | Displays status of individual nodes on the SP |
The llqs command is specific to NERSC. There are a number of options available, please see the man page.
The llq command has been usurped by llqs but it is still useful for details about a particular job. See Monitoring jobs for more information.
To submit a batch job to Loadleveler you need to create a script file containing the commands that make up the batch job. This file must contain the following types of commands:
Here is an example of a script file named myjob that
runs a program named a.out when submitted from Bassi:
#@ job_name = myjob #@ account_no = repo_name #@ output = myjob.out #@ error = myjob.err #@ job_type = parallel #@ environment = COPY_ALL #@ notification = complete #@ network.MPI = sn_all,not_shared,us #@ node_usage = not_shared #@ class = regular #@ bulkxfer = yes # # #@ tasks_per_node = 8 #@ node = 1 #@ wall_clock_limit= 01:00:00 # #@ queue ./a.out
A program to check a LoadLeveler script for common simple errors is available on Bassi. This does not provide a comphrehensive error check (far from it, in fact!)
% ll_check_script script_name
For most jobs run on Bassi, NERSC suggests that the following keywords be specified. Some of these keywords may have default values, but NERSC considers them to be undefined and you should never rely on their existence or persistence.
There are many, many LoadLeveler keywords. Many do not have any meaning in the NERSC environment. Many others can contain site-specific values. NERSC recommends that scripts contain the minimum set of keywords needed to run your job in the way you want. If you are trying to use keywords that you do not understand, it is best to delete them. A common error is to specify a keyword and value that do not apply at NERSC (or are wrong for the NERSC environment); such jobs will be accepted into the queue, but will never run.
A complete description of Loadleveler keywords can be found in IBM's LoadLeveler manual.
#@ job_type = parallel |
Specifies that your job is parallel. |
#@ node = N |
Specifies that your job will use N nodes. |
#@ network.MPI =sn_all,not_shared,US |
Specifies communication protocol and adapters. csss is an alias for sn_all. Please see the LL documentation if you wish to use LAPI. |
#@ class = class |
Specifies that your job should be run in the submit class class. |
#@ tasks_per_node = N |
Specifies the number of tasks that will run on each node. To run a different number of tasks on different nodes, see Selecting the number of nodes and tasks. |
#@ wall_clock_limit = HH:MM:SS |
Specifies that your job requires HH:MM:SS of wall clock time. |
#@ queue |
Places your job in the queue. It must be the last keyword specified. Any keywords placed after this in the script are ignored by the current job step. |
#@ node_usage = not_shared |
Specifies that your job should not share nodes with other jobs. This is the default and cannot be overridden. |
#@ job_name = jobname |
Specifies the name of your job. |
#@ account_no = repo_name |
Specifies the repository that will be charged. If omitted, your default repository will be used. |
#@ notification = complete |
Specifies that email should be sent upon completion of the job. Other options include always, error, start, and never. The default is complete. |
#@ shell = /usr/bin/shell |
Specifies the shell that parses the script. If not specified, your login shell will be used. |
#@ bulkxfer = yes |
Specifies the the job use "bulk transfer" for message passing. Most codes will benefit from using "bulk transfer". |
#@resources = ConsumableMemory(N mb) |
Specifies the memory available to your job per node. See Memory usage considerations. The default value of N is 3395 mb per task. |
#@ output = myjob.out |
Specifies the name and location of STDOUT. If not given, the default is /dev/null. The default directory is the submitting directory. |
#@ error = myjob.err |
specifies the name and location of STDERR. If not given, the default is /dev/null. The default directory is the submitting directory. |
#@ environment = COPY_ALL |
Specifies that all environment variables from your shell should be used. You can also list individual variables which should be separated with semicolons. |
If you are using a script that was used on another (non-SMP) IBM machine, beware of the following keywords:
These keywords specify the maximum/minimum number of nodes (not processors!) requested for a parallel job, regardless of the number of processors contained in the node. The node keyword should be used instead.
You must specify the number of tasks and number of nodes you wish to use. Remember that Bassi has a maximum of 8 tasks per node; if you ask for more than 8 tasks per node (either explicitly or implicitly) your job will never run.
You usually specify numbers of tasks and nodes by including two of the following three LoadLeveler keywords in your batch script:
For example, to run on 4 nodes and pack the nodes with 8 MPI tasks per node, in your LoadLeveler script use:
#@ node = 4 #@ tasks_per_node = 8
To use a number of tasks that is not a multiple of 8, do not set the tasks_per_node keyword. For 29 tasks on 6 nodes, you would use:
#@ node = 6 #@ total_tasks = 29
LoadLeveler will allocate tasks on the nodes in a round-robin fashion.
It is also possible to select which tasks run together on a node. However you must be very careful to completely specify all tasks and the correct number of nodes. Do not use any of the three keywords listed above. Instead, to run 8 tasks on 3 nodes with task 0 on one node, odd tasks on another node and the remaining even tasks on the third node, use:
#@task_geometry = {(0)(1,3,5,7)(2,4,6)}
See LoadLeveler documentation on "Task Assignment Considerations".
Once you have created a script file, it can be submitted to Loadleveler for execution using the llsubmit command:
% llsubmit script_file llsubmit: The job "s01007.nersc.gov.101" has been submitted.
script_file specifies the name of the script file containing commands that comprise the batch job and keywords which control various aspects of the job, e.g., resource limits, output files, mail messages, etc.
You do not specify a job priority in addition to the class. The class will contain the name of the priority at which the job will be charged. See Class info and job scheduling.
There are system limits on the number of jobs a user can have in various states. See NERSC queue policies for Bassi.
LoadLeveler allows you to define "job steps" in a single batch script. Each step is essentially a different job that can run based on a set of "dependencies" on other job steps contained in the same script. LoadLeveler class limits apply to each step separately.
See the LoadLeveler documentation for full details. Here we describe how to run an executable based on whether or not a previous program completed successfully. This strategy might be used by a code that writes checkpoint files at the end of a run and then uses those files as input for the next run.
You define steps in a LoadLeveler script by naming them with the keyword definition
#@step_name = step_name
followed by other definitions for that step and finally a "queue" directive. Additional steps follow.
Some LoadLeveler directives must be included in each step. A typical first job step for an MPI code would look like this:
#@ step_name = step1 #@ job_type = parallel #@ node = 8 #@ wall_clock_limit = 24:00:00 #@ class = regular #@ tasks_per_node = 8 #@ network.MPI = sn_all,not_shared,us #@ executable = a.out #@ queue
The second job step contains a dependency, as specified on the with the "dependency" keyword. Here we are going to assume that if the first job step is successful it returns an error code of 0 (zero). Here is a typical second job step:
#@ step_name = step2 #@ dependency = (step1==0) #@ job_type = parallel #@ wall_clock_limit = 24:00:00 #@ class = regular #@ node = 4 #@ tasks_per_node = 8 #@ network.MPI = sn_all,not_shared,us #@ executable = b.out #@ queue
The second job step will only be run if step1 has an exit code of 0. The executable a.out and b.out can themselves be executable shell scripts.
You can also include "if" statements in your LoadLeveler script itself to take different actions for different job steps. Here's an full example script that run the toy "flip" program introduced in the Introduction to MPI tutorial.
#@ class = debug
#@ shell = /usr/bin/csh
#@ wall_clock_limit = 0:05:00
#@ notification = always
#@ output = $(host).$(jobid).$(stepid).out
#@ error = $(host).$(jobid).$(stepid).err
#@ environment = COPY_ALL
#
#@ step_name = step1
#@ job_type = parallel
#@ node = 1
#@ tasks_per_node = 4
#@ network.MPI = sn_all,not_shared,us
#@ queue
#
#@ step_name = step2
#@ dependency = (step1==0)
#@ job_type = parallel
#@ node = 1
#@ tasks_per_node = 2
#@ network.MPI = sn_all,not_shared,us
#@ queue
if ($LOADL_STEP_NAME == "step1") then
./flip
else if ($LOADL_STEP_NAME == "step2") then
echo "Step 1 returned exit code of 0."
echo "100500" >flip.in
./flip
else
echo "Error: no value for step name."
endif
exit
To monitor your job after submission, you can use the llqs command:
% llqs Step Id JobName UserName Class ST NDS WallClck Submit Time ------------- ------------ -------- ------- -- --- -------- ----------- b0307.1087.0 a240 buffy regular R 32 00:31:44 3/13 04:30 b0301.529.0 s1.x willow regular R 64 00:28:17 3/12 21:45 b0301.578.0 xdnull xander debug R 5 00:05:19 3/14 12:44 b0309.929.0 s01009.ners spike regular R 128 03:57:27 3/13 05:17 b0301.530.0 s2.x willow regular I 64 04:00:00 3/12 21:48 b0301.532.0 s3.x willow regular I 64 04:00:00 3/12 21:50 b0301.533.0 y1.x willow regular I 64 04:00:00 3/12 22:17 b0301.534.0 y2.x willow regular I 64 04:00:00 3/12 22:17 b0301.535.0 y3.x willow regular I 64 04:00:00 3/12 22:17 b0301.537.0 s01001.nerg spike regular I 128 02:30:00 3/13 06:10 b0309.930.0 s01009.nerg spike regular I 128 02:30:00 3/13 07:17
This information can also be found on the NERSC Batch Queue Status page.
NERSC has written a utility, phost, that can be used to record which tasks ran on which node.
The llcancel command is used to delete a job.
% llcancel joblist
The joblist contains the list of Job IDs you want to delete. You can get this name from the first column of the output from either the llqs or the llq command:
bassi% llq Id Owner Submitted ST PRI Class Running On ------------------- ---------- ----------- -- --- ------------ ----------- b01015.137.0 bones 10/7 12:05 R 50 regular s00608 b01005.84.0 spock 10/7 16:51 R 50 regular s01816 b01015.177.0 jimkirk 10/7 17:01 R 50 interactive s00713 b01015.174.0 sulu 10/7 16:54 R 50 low s01601 b01015.176.0 uhura 10/7 17:00 R 50 debug s00101 5 job steps in queue, 0 waiting, 0 pending, 5 running, 0 held
For example, if user spock wanted to delete a job, then the command is:
% llcancel s01005.84.0 llcancel: Cancel command has been sent to the central manager.
The llcancel command works for both running and queued jobs.
By default, job usage is charged to your default repository. Use the NERSC Information Management (NIM) system to view your default repo.
You can specify the repo to be charged in your LoadLeveler script. Use this keyword:
#@account_no = repo_name
Your job charge depends on which class you specify. In general, one submits a job to the class which contains the name of the desired NERSC batch charging priority. For example, a premium job is submitted to the class named "premium". Interactive and debug jobs are charged at the same rate as regular class jobs.
See Accounting on the IBM for information about how accounts are charged.
Here is an example batch script.
#@ job_name = myjob #@ account_no = repo_name #@ output = myjob.out #@ error = myjob.err #@ job_type = parallel #@ environment = COPY_ALL #@ notification = complete #@ network.MPI = sn_all,not_shared,us #@ node_usage = not_shared #@ class = regular # # #@ node = 16 #@ tasks_per_node = 8 #@ wall_clock_limit= 01:00:00 # #@ bulkxfer = yes # #@ queue ./a.out < input
![]() |
Page last modified: Sat, 01 Mar 2008 00:50:31 GMT Page URL: http://www.nersc.gov/nusers/systems/bassi/running_jobs/batch.php Web contact: webmaster@nersc.gov Computing questions: consult@nersc.gov Privacy and Security Notice |
![]() |