National Energy Research Scientific Computing Center
  A DOE Office of Science User Facility
  at Lawrence Berkeley National Laboratory
 

Running on Bassi

Bassi Batch Jobs

This document describes LoadLeveler, the IBM SP batch queuing system, as implemented at NERSC. It explains how to submit batch jobs, how to monitor their progress after submission, and how to delete unwanted jobs. The example scripts included here can be used as templates for customization.


Overview

All production runs on the SP should be done through the batch system. IBM's batch system software is called LoadLeveler. Because interactive resources are scarce on this system, most development and debugging runs should also be done in batch. This page describes how to use LoadLeveler.

To submit a job to LoadLeveler, create a batch script that contains the job's requirements, then submit the script to the batch system. You can then monitor the job as it progresses through the queue and delete it if it is no longer wanted. Jobs may be submitted at various NERSC charge priority values.

Here is a list of terms commonly used in reference to LoadLeveler.

node

A Bassi node contains 8 CPUs. Your batch job has exclusive use of each node it uses and you are charged for the use of all CPUs while your job is resident on a node.

Processors within a node share a common pool of memory.

keyword

A LoadLeveler batch script consists of UNIX shell commands and LoadLeveler keywords. See LoadLeveler keywords commonly used on Bassi. A complete list of all LoadLeveler keywords is available in IBM's documentation, but note that not all available keywords are relevant or useful on Bassi.

job ID

LoadLeveler assigns an ID to every submitted job. The term "Job ID" refers to the identifier used to monitor or delete a job.

class

This is similar to what is commonly called a queue. Each class has associated limits and properties, and a user selects a "submit" class for the job. There are also "destination" classes in which the job ultimately runs. In general, users submit to a submit class and the job is automatically routed to a destination class. Users cannot submit jobs to a destination class directly.

Class Info and Policies

Classes and Job Scheduling

All LoadLeveler jobs must be submitted to a valid submit class. If the class does not exist, no error message is issued: the job is accepted but sits in the queue indefinitely. If this happens, you must delete the job with the llcancel command and resubmit it to an available class.

NERSC users specify one of the following submit classes to queue jobs. Upon submission the job is routed to the appropriate LoadLeveler class according to the following criteria. (Users cannot directly access the LoadLeveler classes.)

Submit           Job       Destination  Nodes  Available   Max        Relative      MPP Charge        Availability
Class [1]        Type      Class [2]           Processors  Wallclock  Priority [3]  (nodes*wall hrs)
---------------  --------  -----------  -----  ----------  ---------  ------------  ----------------  ----------------------
interactive      parallel  interactive  1-4    1-32        30 mins    1             48                Everyone
debug            parallel  debug        1-8    1-64        30 mins    2             48                Everyone
premium          parallel  premium      1-48   1-384       12 hrs     4             96                Everyone
regular          parallel  reg_1        1-15   1-120       36 hrs     5             48                Everyone
                 parallel  reg_16       16-31  121-248     18 hrs     5             48                Everyone
                 parallel  reg_32       32-48  249-384     18 hrs     5             48                Everyone
low              parallel  low          1-32   1-256       12 hrs     6             24                Everyone
special [4]      parallel  special      1-64   1-512       48 hrs     3             48                By special arrangement
full_config [4]  parallel  full_config  1-ALL  1-ALL       48 hrs     3             48                By special arrangement

Notes

1 - This is the class name to be used in LoadLeveler scripts.
2 - Users cannot submit scripts directly to a destination class, but this is the class name that will appear when using job monitoring utilities.
3 - The priorities listed in the table are relative. NERSC assigns priorities in terms of "equivalent days waiting in the queue". In addition to the relative priority given to jobs depending on their LoadLeveler class, certain projects with high priority within DOE receive a "scheduling boost". These tend to be INCITE projects.
4 - Available by special arrangement only.

See also Queue Policies, below, for information on run limits.
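The routing of the regular submit class can be sketched as a small shell function. This is only an illustration of the node ranges in the table above; regular_destination is a hypothetical helper name, not a NERSC or LoadLeveler command, and the actual routing is done by the batch system.

```shell
# Sketch (POSIX sh): map a node count to the destination class that a
# job submitted to the "regular" class would be routed to, using the
# node ranges from the class table. Hypothetical helper, for illustration.
regular_destination() {
    nodes=$1
    if [ "$nodes" -ge 1 ] && [ "$nodes" -le 15 ]; then
        echo reg_1
    elif [ "$nodes" -ge 16 ] && [ "$nodes" -le 31 ]; then
        echo reg_16
    elif [ "$nodes" -ge 32 ] && [ "$nodes" -le 48 ]; then
        echo reg_32
    else
        echo "no regular destination class for $nodes nodes" >&2
        return 1
    fi
}

regular_destination 16   # prints reg_16
```

Note that the destination class also determines the wall clock limit: 36 hours for reg_1, but 18 hours for reg_16 and reg_32.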


You can use the llclass command on the system to obtain information about the LoadLeveler classes. Detailed information about a single LoadLeveler class can be found using llclass -l classname.

If you request more wall clock time than the class allows (see the Max Wallclock column in the table above), your job will be submitted with the wall clock time adjusted to the maximum allowed. If you omit the requested time, a default of 30 minutes is used.
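The wall clock rule can be illustrated with a short sketch. The function names below are hypothetical, and the code only mirrors the behavior described in the text (clamp to the class maximum, default to 30 minutes); it does not query LoadLeveler.

```shell
# Sketch (POSIX sh + awk): compute the wall clock limit, in seconds,
# that a job would effectively get. Hypothetical helpers, for illustration.

# to_seconds HH:MM:SS
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# effective_limit CLASS_MAX [REQUESTED]   (both as HH:MM:SS)
effective_limit() {
    max=$(to_seconds "$1")
    if [ -z "$2" ]; then
        echo 1800              # no request: 30 minute default
        return
    fi
    req=$(to_seconds "$2")
    if [ "$req" -gt "$max" ]; then
        echo "$max"            # request exceeds class maximum: clamp
    else
        echo "$req"
    fi
}

effective_limit 36:00:00 40:00:00   # prints 129600 (clamped to 36 hrs)
effective_limit 36:00:00            # prints 1800 (default)
```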

Your job will be charged and scheduled according to the priority listed in the class name. Both interactive and debug are charged at the regular rate. See MPP Accounting.

The classes are configured to give the best service to premium and regular jobs. Premium jobs are charged at twice the rate of regular jobs, but are scheduled at a higher priority.
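As a rough illustration of how the MPP Charge column works, the sketch below multiplies the per-class charge factor by the number of nodes and the wall clock hours. The function names are hypothetical, and the sketch assumes the charge is simply factor x nodes x wall hours, as the table header suggests; see MPP Accounting for the authoritative rules.

```shell
# Sketch (POSIX sh + awk): estimate the MPP charge for a job.
# Assumption: charge = class factor * nodes * wall clock hours.
# Hypothetical helpers, for illustration only.

# class_charge_factor SUBMIT_CLASS  (values from the class table)
class_charge_factor() {
    case "$1" in
        premium) echo 96 ;;
        low) echo 24 ;;
        interactive|debug|regular|special|full_config) echo 48 ;;
        *) echo "unknown submit class: $1" >&2; return 1 ;;
    esac
}

# job_charge SUBMIT_CLASS NODES HH:MM:SS
job_charge() {
    factor=$(class_charge_factor "$1") || return 1
    echo "$3" | awk -F: -v f="$factor" -v n="$2" \
        '{ printf "%.1f\n", f * n * ($1 * 3600 + $2 * 60 + $3) / 3600 }'
}

job_charge regular 16 01:00:00   # prints 768.0
job_charge premium 16 01:00:00   # prints 1536.0
```

The second call shows the premium trade-off described above: the same job costs twice as much in exchange for a higher scheduling priority.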

LoadLeveler uses a scheduling technique called "backfilling". This method starts smaller, shorter jobs if they will not affect the start time of the job that is scheduled to begin next. This technique is advantageous from both a user and a system perspective: it gives shorter jobs a faster turnaround, and it maximizes system usage.

NERSC Queue Policies for Bassi

  • For the production batch classes, each user may have:
    • 3 jobs running (this parameter can be adjusted depending on system load).
    • 4 jobs in Idle state (jobs queued to run; this parameter can be adjusted depending on system load).
    If you have 4 jobs queued (in Idle state) and need to run an Interactive or Debug job, place one of your jobs on User Hold: llhold jobid. To requeue the job: llhold -r jobid.
  • The combined number of debug and interactive jobs that a user may have submitted or running at a given time must be two or fewer. Note that this policy only applies to jobs run in the interactive and debug batch classes. This includes parallel jobs (anything compiled with one of the "mp" compilers, e.g. mpxlf90) that are executed from the command line, as well as those jobs (parallel or otherwise) that are explicitly submitted to these two classes with the "llsubmit" command. The policy has NO effect on sequential programs executed from the command line, including all the normal Unix commands.
  • The interactive and debug classes are to be used for code development, testing, and debugging. Production runs are strictly prohibited from using the interactive and debug classes. User accounts are subject to suspension if they are determined to be using the interactive or debug class for production computing.
  • Any job that has been in the queue for 7 days or more, and is in the "user hold" (LoadLeveler status HU) state, will be removed from the system. Note that this means:
    • Jobs may not be held for more than 7 days; and
    • Jobs older than 7 days may not be held.
  • Since jobs on User Hold age in the queue, their release may perturb the scheduler such that overall system throughput is degraded. In such circumstances NERSC may change the state of User Hold jobs to System Hold, and release them only when overall system throughput will not be affected.
  • A 60 minute time limit is enforced on all user processes on the login nodes.
  • Job "chaining" in the debug and interactive classes is strictly forbidden. Chaining is defined as using a batch script to submit another batch script. User accounts are subject to suspension if they are found to be chaining jobs in the debug or interactive classes.
  • Bassi is occasionally removed from service for maintenance. Users will be given seven days' notice before such events, usually on the "Message of the Day" (MOTD), which is displayed upon login and is also available here. Usually, a system reservation will be made so that all jobs will finish normally before a maintenance period; however, jobs that are running, for any reason, may be terminated at the start of a maintenance period.

Common commands

Here's a list of commonly used LoadLeveler commands with brief descriptions. Detailed information on these commands can be obtained by typing:

% man commandname

Command    Description
--------   -----------
llsubmit   Submits a script file to LoadLeveler for execution
llqs       Lists all LoadLeveler jobs, including requested wall clock time and number of nodes
llq        Same as llqs but with less information
llcancel   Deletes a job from the LoadLeveler queue
llclass    Displays information about the configured classes
llstatus   Displays status of individual nodes on the SP

The llqs command is specific to NERSC. A number of options are available; please see the man page.

The llq command has been superseded by llqs, but it is still useful for details about a particular job. See Monitoring jobs for more information.

Creating batch jobs

To submit a batch job to LoadLeveler you need to create a script file containing the commands that make up the batch job. This file must contain the following types of commands:

  • LoadLeveler keywords embedded in the beginning of the script file. They provide LoadLeveler with information needed to process the job.
  • Any AIX commands that can be entered interactively, including shell commands and user-created programs.

Here is an example of a script file named myjob that runs a program named a.out when submitted from Bassi:

#@ job_name        = myjob
#@ account_no      = repo_name
#@ output          = myjob.out
#@ error           = myjob.err
#@ job_type        = parallel
#@ environment     = COPY_ALL
#@ notification    = complete
#@ network.MPI     = sn_all,not_shared,us
#@ node_usage      = not_shared
#@ class           = regular
#@ bulkxfer        = yes
#
#
#@ tasks_per_node  = 8
#@ node            = 1
#@ wall_clock_limit= 01:00:00
#
#@ queue

./a.out 

A program to check a LoadLeveler script for common simple errors is available on Bassi. It does not provide a comprehensive error check (far from it, in fact!).

% ll_check_script script_name

LoadLeveler keywords commonly used at NERSC

For most jobs run on Bassi, NERSC suggests that the following keywords be specified. Some of these keywords may have default values, but NERSC considers them to be undefined and you should never rely on their existence or persistence.

There are many, many LoadLeveler keywords. Many do not have any meaning in the NERSC environment. Many others can contain site-specific values. NERSC recommends that scripts contain the minimum set of keywords needed to run your job in the way you want. If you are trying to use keywords that you do not understand, it is best to delete them. A common error is to specify a keyword and value that do not apply at NERSC (or are wrong for the NERSC environment); such jobs will be accepted into the queue, but will never run.

A complete description of LoadLeveler keywords can be found in IBM's LoadLeveler manual.

Required LoadLeveler Keywords

#@ job_type = parallel
    Specifies that your job is parallel.
#@ node = N
    Specifies that your job will use N nodes.
#@ network.MPI = sn_all,not_shared,US
    Specifies communication protocol and adapters. csss is an alias for sn_all. Please see the LL documentation if you wish to use LAPI.
#@ class = class
    Specifies that your job should be run in the submit class class.
#@ tasks_per_node = N
    Specifies the number of tasks that will run on each node. To run a different number of tasks on different nodes, see Selecting the number of nodes and tasks.
#@ wall_clock_limit = HH:MM:SS
    Specifies that your job requires HH:MM:SS of wall clock time.
#@ queue
    Places your job in the queue. It must be the last keyword specified. Any keywords placed after this in the script are ignored by the current job step.

Recommended LoadLeveler Keywords

#@ node_usage = not_shared
    Specifies that your job should not share nodes with other jobs. This is the default and cannot be overridden.
#@ job_name = jobname
    Specifies the name of your job.
#@ account_no = repo_name
    Specifies the repository that will be charged. If omitted, your default repository will be used.
#@ notification = complete
    Specifies that email should be sent upon completion of the job. Other options include always, error, start, and never. The default is complete.
#@ shell = /usr/bin/shell
    Specifies the shell that parses the script. If not specified, your login shell will be used.
#@ bulkxfer = yes
    Specifies that the job should use "bulk transfer" for message passing. Most codes will benefit from using "bulk transfer".
#@ resources = ConsumableMemory(N mb)
    Specifies the memory available to your job per node. See Memory usage considerations. The default value of N is 3395 mb per task.

Other useful LoadLeveler keywords

#@ output = myjob.out
    Specifies the name and location of STDOUT. If not given, the default is /dev/null. The default directory is the submitting directory.
#@ error = myjob.err
    Specifies the name and location of STDERR. If not given, the default is /dev/null. The default directory is the submitting directory.
#@ environment = COPY_ALL
    Specifies that all environment variables from your shell should be used. You can also list individual variables, which should be separated with semicolons.

If you are using a script that was used on another (non-SMP) IBM machine, beware of the following keywords:

  • #@ max_processors
  • #@ min_processors

These keywords specify the maximum/minimum number of nodes (not processors!) requested for a parallel job, regardless of the number of processors contained in the node. The node keyword should be used instead.

Selecting the number of tasks and nodes

You must specify the number of tasks and number of nodes you wish to use. Remember that Bassi has a maximum of 8 tasks per node; if you ask for more than 8 tasks per node (either explicitly or implicitly) your job will never run.

You usually specify numbers of tasks and nodes by including two of the following three LoadLeveler keywords in your batch script:

  • #@ node = desired number of nodes
  • #@ total_tasks = desired number of tasks
  • #@ tasks_per_node = desired number of tasks per node; up to a maximum of 8 on Bassi
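Because a job that implicitly asks for more than 8 tasks per node will never run, it can help to check a node/total_tasks combination before submitting. The sketch below is one way to do that; the function names are hypothetical and the check is based only on the 8-tasks-per-node limit stated above.

```shell
# Sketch (POSIX sh): check whether a node/total_tasks request fits
# within Bassi's limit of 8 tasks per node. Hypothetical helpers.

# implied_tasks_per_node NODES TOTAL_TASKS  -> ceiling(TOTAL_TASKS / NODES)
implied_tasks_per_node() {
    echo $(( ($2 + $1 - 1) / $1 ))
}

# fits_on_bassi NODES TOTAL_TASKS  -> exit status 0 if at most 8 per node
fits_on_bassi() {
    [ "$(implied_tasks_per_node "$1" "$2")" -le 8 ]
}

if fits_on_bassi 6 29; then
    echo "6 nodes, 29 tasks: ok"
fi
if ! fits_on_bassi 3 29; then
    echo "3 nodes, 29 tasks: more than 8 tasks per node"
fi
```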

For example, to run on 4 nodes and pack the nodes with 8 MPI tasks per node, in your LoadLeveler script use:

#@ node = 4
#@ tasks_per_node = 8

To use a number of tasks that is not a multiple of 8, do not set the tasks_per_node keyword. For 29 tasks on 6 nodes, you would use:

#@ node = 6
#@ total_tasks = 29

LoadLeveler will allocate tasks on the nodes in a round-robin fashion.
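To see what a round-robin allocation means for the 29-task example, the sketch below prints how many tasks each node would receive. It assumes the simplest form of round-robin (task i is placed on node i modulo the node count); the exact placement is decided by LoadLeveler, and round_robin_counts is a hypothetical helper.

```shell
# Sketch (POSIX sh): number of tasks per node under a simple
# round-robin assignment (task i -> node i mod NODES). This is an
# assumption about the placement order, for illustration only.

# round_robin_counts NODES TOTAL_TASKS
round_robin_counts() {
    nodes=$1
    total=$2
    base=$(( total / nodes ))    # every node gets at least this many
    extra=$(( total % nodes ))   # the first "extra" nodes get one more
    i=0
    out=""
    while [ "$i" -lt "$nodes" ]; do
        if [ "$i" -lt "$extra" ]; then
            c=$(( base + 1 ))
        else
            c=$base
        fi
        out="$out${out:+ }$c"
        i=$(( i + 1 ))
    done
    echo "$out"
}

round_robin_counts 6 29   # prints 5 5 5 5 5 4
```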

It is also possible to select which tasks run together on a node. However, you must be very careful to completely specify all tasks and the correct number of nodes. Do not use any of the three keywords listed above. Instead, to run 8 tasks on 3 nodes with task 0 on one node, odd tasks on another node and the remaining even tasks on the third node, use:

#@ task_geometry = {(0)(1,3,5,7)(2,4,6)}

See LoadLeveler documentation on "Task Assignment Considerations".
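Since every task must appear in the geometry, a quick sanity check of the string can catch omissions before submission. The following is a minimal sketch, not a LoadLeveler tool: geometry_summary is a hypothetical helper that counts the node groups and verifies that the task IDs are exactly 0 through N-1 with no gaps or duplicates.

```shell
# Sketch (POSIX sh + standard text tools): summarize a task_geometry
# value and check that task IDs 0..N-1 each appear exactly once.
# Hypothetical helper, for illustration only.

# geometry_summary '{(0)(1,3,5,7)(2,4,6)}'
geometry_summary() {
    geom=$1
    # Each "(" opens one node group.
    nodes=$(printf '%s' "$geom" | tr -cd '(' | wc -c | tr -d ' ')
    # Extract the task IDs, one per line, in numeric order.
    ids=$(printf '%s' "$geom" | tr -c '0-9' '\n' | grep -v '^$' | sort -n)
    tasks=$(printf '%s\n' "$ids" | wc -l | tr -d ' ')
    # The IDs should be exactly 0, 1, ..., tasks-1.
    expected=$(awk -v n="$tasks" 'BEGIN { for (i = 0; i < n; i++) print i }')
    if [ "$ids" = "$expected" ]; then
        status=complete
    else
        status=incomplete
    fi
    echo "nodes=$nodes tasks=$tasks $status"
}

geometry_summary '{(0)(1,3,5,7)(2,4,6)}'   # prints nodes=3 tasks=8 complete
```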

Submitting Jobs

Once you have created a script file, it can be submitted to LoadLeveler for execution using the llsubmit command:

% llsubmit script_file
llsubmit: The job "s01007.nersc.gov.101" has been submitted.

script_file specifies the name of the script file containing commands that comprise the batch job and keywords which control various aspects of the job, e.g., resource limits, output files, mail messages, etc.
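If you want to keep the job identifier for later use, it can be pulled out of the llsubmit message. The sketch below assumes the message has exactly the form shown in the example above; extract_job_id is a hypothetical helper, not part of LoadLeveler.

```shell
# Sketch (POSIX sh + sed): extract the quoted job identifier from the
# llsubmit confirmation message read on standard input.
# Assumes the message format shown in the example above.
extract_job_id() {
    sed -n 's/^llsubmit: The job "\(.*\)" has been submitted\.$/\1/p'
}

# Demonstration with the sample message from this page:
echo 'llsubmit: The job "s01007.nersc.gov.101" has been submitted.' | extract_job_id
# prints s01007.nersc.gov.101

# On Bassi this might be used as (not run here):
#   jobid=$(llsubmit script_file | extract_job_id)
```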

You do not specify a job priority in addition to the class. The class will contain the name of the priority at which the job will be charged. See Class info and job scheduling.

There are system limits on the number of jobs a user can have in various states. See NERSC queue policies for Bassi.

Job steps and dependencies

LoadLeveler allows you to define "job steps" in a single batch script. Each step is essentially a different job that can run based on a set of "dependencies" on other job steps contained in the same script. LoadLeveler class limits apply to each step separately.

See the LoadLeveler documentation for full details. Here we describe how to run an executable based on whether or not a previous program completed successfully. This strategy might be used by a code that writes checkpoint files at the end of a run and then uses those files as input for the next run.

You define steps in a LoadLeveler script by naming them with the keyword definition

#@ step_name = step_name

followed by other definitions for that step and finally a "queue" directive. Additional steps follow.

Some LoadLeveler directives must be included in each step. A typical first job step for an MPI code would look like this:

#@ step_name    = step1
#@ job_type = parallel
#@ node = 8
#@ wall_clock_limit = 24:00:00
#@ class = regular
#@ tasks_per_node = 8 
#@ network.MPI = sn_all,not_shared,us
#@ executable = a.out
#@ queue

The second job step contains a dependency, specified with the "dependency" keyword. Here we assume that the first job step returns an exit code of 0 (zero) if it is successful. Here is a typical second job step:

#@ step_name    = step2
#@ dependency   = (step1==0)
#@ job_type = parallel
#@ wall_clock_limit = 24:00:00
#@ class = regular
#@ node = 4
#@ tasks_per_node = 8 
#@ network.MPI = sn_all,not_shared,us
#@ executable = b.out
#@ queue

The second job step will only be run if step1 has an exit code of 0. The executable a.out and b.out can themselves be executable shell scripts.

You can also include "if" statements in your LoadLeveler script itself to take different actions for different job steps. Here is a full example script that runs the toy "flip" program introduced in the Introduction to MPI tutorial.

#@ class = debug       
#@ shell = /usr/bin/csh
#@ wall_clock_limit = 0:05:00
#@ notification = always 
#@ output = $(host).$(jobid).$(stepid).out
#@ error = $(host).$(jobid).$(stepid).err
#@ environment = COPY_ALL
#
#@ step_name    = step1
#@ job_type = parallel
#@ node = 1
#@ tasks_per_node =  4
#@ network.MPI = sn_all,not_shared,us 
#@ queue
#
#@ step_name    = step2
#@ dependency   = (step1==0)
#@ job_type = parallel
#@ node = 1
#@ tasks_per_node =  2
#@ network.MPI = sn_all,not_shared,us 
#@ queue


if ($LOADL_STEP_NAME == "step1") then
        ./flip
else if ($LOADL_STEP_NAME == "step2") then
        echo "Step 1 returned exit code of 0."
        echo "100500" >flip.in
        ./flip
else
        echo "Error: no value for step name."
endif 
exit

Monitoring jobs

To monitor your job after submission, you can use the llqs command:

%  llqs 
Step Id        JobName      UserName  Class  ST NDS WallClck Submit Time
------------- ------------ -------- ------- -- --- -------- -----------
b0307.1087.0   a240         buffy    regular R   32 00:31:44  3/13 04:30
b0301.529.0    s1.x         willow   regular R   64 00:28:17  3/12 21:45
b0301.578.0    xdnull       xander   debug   R    5 00:05:19  3/14 12:44
b0309.929.0    s01009.ners  spike    regular R  128 03:57:27  3/13 05:17
b0301.530.0    s2.x         willow   regular I   64 04:00:00  3/12 21:48
b0301.532.0    s3.x         willow   regular I   64 04:00:00  3/12 21:50
b0301.533.0    y1.x         willow   regular I   64 04:00:00  3/12 22:17
b0301.534.0    y2.x         willow   regular I   64 04:00:00  3/12 22:17
b0301.535.0    y3.x         willow   regular I   64 04:00:00  3/12 22:17
b0301.537.0    s01001.nerg  spike    regular I  128 02:30:00  3/13 06:10
b0309.930.0    s01009.nerg  spike    regular I  128 02:30:00  3/13 07:17

This information can also be found on the NERSC Batch Queue Status page.
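Output in the llqs layout shown above is easy to summarize with awk, for example to see how many of your jobs are Idle or Running relative to the per-user limits in the queue policies. The sketch below assumes the column layout of the sample output (user name in the third column, state in the fifth) and job names without spaces; count_jobs is a hypothetical helper.

```shell
# Sketch (POSIX sh + awk): count a user's jobs in a given state from
# llqs-style output on standard input. Assumes the column layout of
# the sample output on this page. Hypothetical helper.

# count_jobs USERNAME STATE
count_jobs() {
    awk -v user="$1" -v st="$2" \
        '$3 == user && $5 == st { n++ } END { print n + 0 }'
}

# A few lines copied from the sample llqs output:
sample='b0307.1087.0   a240         buffy    regular R   32 00:31:44  3/13 04:30
b0301.529.0    s1.x         willow   regular R   64 00:28:17  3/12 21:45
b0301.530.0    s2.x         willow   regular I   64 04:00:00  3/12 21:48
b0301.532.0    s3.x         willow   regular I   64 04:00:00  3/12 21:50
b0301.537.0    s01001.nerg  spike    regular I  128 02:30:00  3/13 06:10'

printf '%s\n' "$sample" | count_jobs willow I   # prints 2

# On Bassi this might be used as (not run here):
#   llqs | count_jobs "$USER" I
```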

Using phost to determine node IDs

NERSC has written a utility, phost, that can be used to record which tasks ran on which node.

Deleting jobs

The llcancel command is used to delete a job.

% llcancel joblist

The joblist contains the list of Job IDs you want to delete. You can get these IDs from the first column of the output from either the llqs or the llq command:

bassi%  llq 
Id                  Owner      Submitted   ST PRI Class        Running On 
------------------- ---------- ----------- -- --- ------------ -----------
 b01015.137.0       bones      10/7  12:05 R  50  regular       s00608    
 b01005.84.0        spock      10/7  16:51 R  50  regular       s01816    
 b01015.177.0       jimkirk    10/7  17:01 R  50  interactive   s00713    
 b01015.174.0       sulu       10/7  16:54 R  50  low           s01601    
 b01015.176.0       uhura      10/7  17:00 R  50  debug         s00101    

5 job steps in queue, 0 waiting, 0 pending, 5 running, 0 held

For example, if user spock wanted to delete the job listed above, the command would be:

% llcancel b01005.84.0
llcancel: Cancel command has been sent to the central manager.

The llcancel command works for both running and queued jobs.

Account charging

By default, job usage is charged to your default repository. Use the NERSC Information Management (NIM) system to view your default repo.

You can specify the repo to be charged in your LoadLeveler script. Use this keyword:

#@ account_no = repo_name

Your job charge depends on which class you specify. In general, one submits a job to the class which contains the name of the desired NERSC batch charging priority. For example, a premium job is submitted to the class named "premium". Interactive and debug jobs are charged at the same rate as regular class jobs.

See Accounting on the IBM for information about how accounts are charged.

Example batch script

Here is an example batch script.

#@ job_name        = myjob
#@ account_no      = repo_name
#@ output          = myjob.out
#@ error           = myjob.err
#@ job_type        = parallel
#@ environment     = COPY_ALL
#@ notification    = complete
#@ network.MPI     = sn_all,not_shared,us
#@ node_usage      = not_shared
#@ class           = regular
#
#
#@ node            = 16
#@ tasks_per_node  = 8 
#@ wall_clock_limit= 01:00:00
#
#@ bulkxfer        = yes
#
#@ queue

./a.out < input
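When many similar jobs are needed, a script like the one above can be generated from a few parameters. The following is a minimal sketch that emits only the required keywords described earlier; make_ll_script is a hypothetical helper, and the generated script should still be checked (for example with ll_check_script) before it is submitted.

```shell
# Sketch (POSIX sh): print a minimal LoadLeveler script containing the
# required keywords, with "#@ queue" as the last keyword.
# Hypothetical helper, for illustration only.

# make_ll_script NODES TASKS_PER_NODE HH:MM:SS SUBMIT_CLASS
make_ll_script() {
    cat <<EOF
#@ job_type        = parallel
#@ network.MPI     = sn_all,not_shared,us
#@ class           = $4
#@ node            = $1
#@ tasks_per_node  = $2
#@ wall_clock_limit= $3
#@ queue

./a.out
EOF
}

# Print a 16-node regular job; redirect to a file to save it.
make_ll_script 16 8 01:00:00 regular
```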


Page last modified: Sat, 01 Mar 2008 00:50:31 GMT
Page URL: http://www.nersc.gov/nusers/systems/bassi/running_jobs/batch.php
Web contact: webmaster@nersc.gov
Computing questions: consult@nersc.gov
