NERSC logo National Energy Research Scientific Computing Center
  A DOE Office of Science User Facility
  at Lawrence Berkeley National Laboratory
Back to Running on Bassi

Running Jobs on Bassi

On this page

Related pages

Introduction

NERSC's IBM POWER 5 p575 system, named Bassi, has 111 SMP "compute" nodes with 8 processors per node for a total of 888 processors. Each node has a common pool of 32 GBytes of memory.

The nodes are connected via a high-speed network known as HPS (also called Federation). Each node has a "dual-link" HPS adapter, which attaches to a dual-plane network. Each link can support 2 GB/s of data transfer in each direction.

Bassi supports both interactive and batch processing. Interactive jobs are limited to 4 nodes and 30 minutes of wallclock time. Parallel programs can be run either using distributed memory message passing, shared memory threading, or some combination of the two.

[Top]

MPI Codes

Most programs running on Bassi execute in parallel and use the Message Passing Interface, or MPI, to communicate among separate tasks. The default computing environment is configured to support MPI jobs as transparently as possible. The discussions on these pages assumes you are running an MPI code unless otherwise noted.

[Top]

OpenMP Codes

Interactive jobs that use OpenMP must run with the the environment variable MP_TASK_AFFINITY unset. As of August 2, 2006 (Parallel Environment 3.3.2.4) jobs run through LoadLeveler batch scripts ignore this environment variable. Make sure that you do not use the "rset" keyword to request task affinity in your batch script.

THESE SETTINGS ARE WRONG FOR OPENMP CODES. DO NOT USE FOR OPENMP CODES.
#@ rset = rset_mcm_affinity
#@ mcm_affinity_options = mcm_distribute, mcm_mem_pref, mcm_sni_none
THESE SETTINGS ARE WRONG FOR OPENMP CODES. DO NOT USE FOR OPENMP CODES.

For interactive jobs use the appropriate statement:

unsetenv MP_TASK_AFFINITY               (for csh, tcsh)

unset    MP_TASK_AFFINITY               (for sh, ksh, bash)

The default number of OpenMP threads per task (process) is 8. If you are running a pure OpenMP code, you should run using one task per node. If you are using more than one task per node, you should reduce OMP_NUM_THREADS so that the product of tasks x threads is 8. A combination of tasks and threads significantly in exess of 8 may result in poor performance.

OpenMP code will probably performer better with the following environment variable setting:

export XLSMPTOPTS=SPINS=0;YIELDS=0 (for sh-like shells)

setenv XLSMPTOPTS 'SPINS=0;YIELDS=0' (for csh-like shells)

If you are calling malloc() or ALLOCATE in OpenMP threads, you may want to set this environment variable:

export	MALLOCMULTIHEAP=true (sh-like shells)

setenv MALLOCMULTIHEAP true (csh-like shells)

Otherwise, memory allocations will be serialized on a node.

[Top]

MPI-I/O Codes

Jobs that use MPI-I/O or explict MPI threading routines must run with the the environment variable MP_SINGLE_THREAD unset. As of August 10, 2006 MP_SINGLE_THREAD is no longer set in the default Bassi environment.

[Top]

Running on Bassi

Job Launch Overview: POE and LoadLeveler

Bassi uses two software packages to run parallel programs: the Parallel Operating Environment (POE) executes parallel programs and LoadLeveler schedules jobs. Users can interact with this IBM software in a number of different ways and at a number of different levels. This can be very confusing, so a brief discussion of POE and LoadLeveler follow.

Parallel Operating Environment (POE)

POE is used to run parallel programs. This product augments the basic AIX operating system with software needed to run parallel programs. The command poe (all lower case) executes parallel programs. However, the poe command is not explicitly required to run a parallel program, depending on which options were used to compile an executable. POE recognizes environment variables and poe command line flags that specify how a parallel program should run. Please see "Operation and Use" in the IBM Manuals for more on POE.

LoadLeveler

LoadLeveler is used in addition to POE in order to run parallel jobs. Loadleveler is a "job management system" that is used to schedule all parallel jobs, regardless of whether the jobs are batch or interactive. More information on Loadleveler can be found on the IBM Batch page.

When running in job in batch mode, a user submits to LoadLeveler a script that contains commands and LoadLeveler keywords. The value of the LoadLeveler keywords determines how the code executes (e.g. number of nodes used, number of tasks, etc.)

You control how your parallel job executes by specifying

  1. LoadLeveler keyword values (batch mode), and/or
  2. values passed to POE on the command line, and/or
  3. environment variables

In batch mode you should completely specify how your job should run using LoadLeveler keywords exclusively, if possible. NERSC recommends that you be as explicit as possible in your specifications in order to avoid confusion.

In interactive mode poe command-line options override environment variable settings.

Avoid confusion! POE vs. LoadLeveler keywords and options

It is important to make the distinction between LoadLeveler keywords and poe command line options. They do not have the same names in general. For example, node is a LoadLeveler keyword, but is not a poe command-line option. The poe option is called nodes and is not a LoadLeveler keyword. total_tasks is a LoadLeveler keyword, but not a poe command-line switch. Therefore poe will completely ignore -total_tasks on the command line without warning or comment. For example, the following will run 4 tasks, rather than 8 tasks as might be expected:

 % poe ./a.out -nodes 4 -total_tasks 8 (does not work as expected!)

Because the default value of the MP_TASKS_PER_NODE POE environment variable is 1, this command line will run 1 task on each of 4 nodes, and ignore the total_tasks specification on the command line because it is not a valid poe command line option. See Interactive jobs.

Managing Memory Usage

Each Bassi node has 32 GB of total memory. Some memory is used by the operating system (AIX, HPS switch, etc); the remainder - about 26.4 GB - is available for user applications. If your code tries to use more than the available physical memory it will start writing memory pages to disk. This will certainly cause poor code performance and excessive paging will lead to unpredictable behavior: code crashes, hangs, and possibly node crashes. Any job that tries to use more than 27032 MB of memory on a node will be killed by the system software. Please plan your runs to fit into the available physical memory on each node.

Bassi's POWER 5 nodes support a "large page" memory configuration. High Performance scientific applications generally benefit from using "large page." By default, parallel applications built on Bassi are enabled to use large pages. It is possible to build binaries that do not use large pages, however. See Memory Considerations for more details.

The Bassi compute nodes are configured to use 20 GB of large-page memory. The size of the large-page pool is set at boot time and cannot be otherwise changed. If a large-page-enabled application requests more memory than is available in large pages, additional memory will be allocated and used from the small memory pool (perhaps at lower performance). However, a code that runs only in small pages cannot access the large-page memory pool.

Please use the following table to help make effective use of the memory on the Bassi compute nodes. For running interactive serical utilities on the login nodes, see Running Interactive Jobs.

Bassi Compute Nodes (8 CPUs per node)
Total Physical Memory: 32 GB
Large-Page Memory Pool Size: 20 GB
Large-Page Memory Available to Large-Page Enabled User Applications: 18.922 GB
Memory in Small Pages: 12 GB
Small-Page Memory Available to User Applications: ~7-8 GB
Total Memory Available to Large-Page Enabled User Applications: 26.4 GB
Total Memory Available to Non-Large-Page Enabled User Applications: ~8 GB max

Job Memory Limits

The maximum memory your code can use on a single node is 26.4 GB (27032 MB). If you want to be able to access the maximum amount of memory, consult the following table to see what you need to specify in your LoadLeveler batch script:

Tasks per NodeLL directive
8No specification needed.
7#@resources = ConsumableMemory(3861 mb)
6#@resources = ConsumableMemory(4505 mb)
5#@resources = ConsumableMemory(5406 mb)
4#@resources = ConsumableMemory(6758 mb)
3#@resources = ConsumableMemory(9010 mb)
2#@resources = ConsumableMemory(13516 mb)
1#@resources = ConsumableMemory(27032 mb)

Memory limits are being enforced via LoadLeveler. Jobs that request more than 27032 MB (26.4 GB) of memory per node will not start. Default values are in place so that job scripts that request 8 tasks per node can access the full 26.4 GB (27032 MB) of memory. Jobs will be killed if they try to access more than that amount of memory per node.

Jobs that run with fewer than 8 tasks per node need to correctly specify their memory usage in batch scripts. The maximum memory a job can use on a single node is calculated by:

(tasks per node requested)  * (ConsumableMemory requested per task)

The default value of ConsumableMemory per task is set at 3379 mb. If you run fewer than 8 tasks per node and want to access the full amount of memory permitted (27032 mb), you need add a line to your LoadLeveler batch script:

#@resources = ConsumableMemory(N mb)

where N is the amount of memory per task that your job will use. NOTE: Each task is not limited to N mb; rather, all tasks on a given node are limited to using an aggregate amount of memory equal to: (number of tasks on that node)*(N mb) with a maximum limit of 27032 mb. N must be an integer. Here "tasks" refers to the number of independent instances of your executable that are started (initiated) on a node.

For example, to run 2 tasks per node and access the maximum amount of memory, set this in your batch script:

#@tasks_per_node = 2
#@resources = ConsumableMemory(13516 mb)

Running on Bassi

Communication protocols

Most parallel jobs that run on Bassi do so through an explicit message passing interfaces. The choice of API and protocols must be specified at run time, by setting environment variables for interactive jobs and through a network LoadLeveler keyword for batch jobs (see below and elsewhere on these pages.)

Message Passing APIs

There are two explicit message passing interfaces, or APIs, available. Calls to these libraries are made in source code and a run-time library containing these functions is loaded when the job executes.

MPI

Most codes use MPI, the popular Message Passing Interface. The default programming and interactive run-time environment is configured to support MPI. See the IBM Documentation for details of the IBM implementation of MPI.

Most MPI batch job scripts need to include this line:

#@network.MPI = sn_all,not_shared,us
LAPI

The low-level IBM LAPI interface is also available. See the IBM Documentation for details.

Most LAPI batch job scripts need to include this line:

#@network.LAPI = sn_all,not_shared,us

Interactive jobs that use LAPI must set an environment variable thusly:

bassi% export MP_MSG_API=LAPI   (sh-like shells)
bassi% setenv MP_MSG_API LAPI   (csh-like shells)

Communications Protocols

There are two separate implementations of the communications library that actually ships data packets across the network:

Internet Protocol (IP)

This implementation uses the standard IP protocol that commonly links computers over ethernet networks. IP offers less performance that User Space (see below), but some specialized application may require IP.

Batch job scripts that require IP need to include one of these lines:

#@network.MPI = sn_all,not_shared,ip
#@network.LAPI = sn_all,not_shared,ip

Interactive jobs that use IP must set an environment variable thusly:

bassi% export MP_EUILIB=ip (sh-like shells)
bassi% setenv MP_EUILIB ip (csh-like shells)
User Space (US) Communication Subsystem

This communication subsystem is designed to take advantage of Bassi's high performance switch (HPS).

You will use the "US" protocol in almost all cases. The library is dynamically linked to your code at runtime. You do not have to modify your source code, but you must specify a protocol when you run your program.

For MPI batch jobs using US, the follow line must appear in the batch script:

#@network.MPI = sn_all,not_shared,us

Running on Bassi: Runtime Configuration and Options

NOTE: Beginning September 2, 2006 the environment variables settings described below are ignored by the runtime system for jobs submitted through the batch system. They now apply only to jobs launched interactively from the shell command line.

Managing Task and Memory Affinity on SMPs

Bassi's SMP nodes are organized around components called Multi-chip Modules, MCM's. An MCM contains several processors, I/O buses, and memory. An MCM on Bassi contains one active processor. (A Bassi MCM might be better termed a "DCM," or dual-chip module, of which one is active on Bassi.) While a processor in an MCM can access the I/O bus and memory in another MCM, most scientific applications will see improved performance if the processor, the memory it uses, and the I/O adapter it connects to, are all in the same MCM.

Threaded applications, including OpenMP codes, should not request affinity in most cases. If task affinity is requested, for example, all threads spawned from a single task will be bound to a single MCM. Usually these codes would prefer that AIX schedule and migrate the threads as efficiently as it can.

The runtime behavior is controlled by keyword settings in batch job scripts and by environment variable settings for interactive parallel jobs.

Batch jobs that want memory, task, or I/O affinity must include two lines in their batch scripts, one to request affinity and another to specify affinity options. The most common specification for jobs run at NERSC is expected to be the following:

#@ rset = rset_mcm_affinity
#@ mcm_affinity_options = mcm_distribute, mcm_mem_pref, mcm_sni_none

Memory Affinity

By requesting MCM memory affinity, a processor will preferentially obtain memory from the local MCM. This setting is independent of the task or I/O affinity described below.

TypeSpecificationValuesDefault
Environment variable: MEMORY_AFFINITY MCM | -1 (none) MCM*
LoadLeveler directive: #@mcm_affinity_options mcm_mem_none | mcm_mem_pref | mcm_mem_req mcm_mem_none
*This is set in the standard NERSC environment-defined home dot-files.

In batch jobs a setting of mcm_mem_pref will request that memory be allocated from the local MCM whenever possible, mcm_mem_none specifies no memory affinity, and mcm_mem_req requires that all memory be allocated from the local MCM only.

Task Affinity

Task affinity settings controls the placement of tasks of a parallel job. By requesting MCM afinity, a task will not be migrated between MCM's during its execution. The tasks are allocated in a round-robin fashion among the MCM's attached to the job. By default, the tasks are allocated to all the MCMs in the node.

Most codes will benefit from using MCM task affinity, but this also binds all tasks and spawned threads to a single MCM, thus disabling effective use of OpenMP. OpenMP users need to unset MP_TASK_AFFINITY before running interactive parallel jobs. Likewise, MPI-IO and parallel HDF5 make heavy use of threads and will perform poorly unless task affinity is disabled.

TypeSpecificationValuesDefault
Environment variable: MP_TASK_AFFINITY MCM | -1 (none) MCM*
LoadLeveler directive: #@ mcm_affinity_options mcm_distribute | mcm_accumulate mcm_accumulate
*This is set in the standard NERSC environment-defined home dot-files.

In batch job scripts a setting of mcm_distribute will distribute the tasks as described above. A setting of mcm_acculate will attempt to accumulate all tasks onto a single MCM whenever possible.

Communication over HPS

The network switch on Bassi is known as HPS (High Performance Switch), or "Federation."

Remote Direct Memory Access (RDMA) is a mechanism which allows large contiguous messages to be transferred while reducing the message transfer overhead. To use RDMA, in interactive parallel jobs set the environment variable MP_USE_BULK_XFER; for batch jobs you must add the keyword #@bulkxfer=yes.

TypeSpecificationValuesDefault
Environment variable: MP_USE_BULK_XFER yes | no yes*
LoadLeveler directive: #@bulkxfer yes | no no
*This is set in the standard NERSC environment-defined home dot-files.

Contiguous messages with data lengths greater than or equal to the value of MP_BULK_MIN_MSG_SIZE will use the bulk transfer path. Messages with data lengths that are smaller than the value you specify for this environment variable, or are noncontiguous, will use packet mode transfer.

TypeSpecificationValuesDefault
Environment variable: MP_BULK_MIN_MSG_SIZE 4K-2048M 150K
LoadLeveler directive: N/A N/A N/A

The following image shows the point-to-point bandwidth for RDMA vs. non-RDMA MPI communication using MP_BULK_MIN_MSG_SIZE=4096. Click for a larger PDF version of the image.

HPS RDMA vs. packet transfers

Improved MPI Latency for Single-Threaded Applications

To avoid lock overheads in a program that is known to be single-threaded (user-created threads), set the environment variable MP_SINGLE_THREAD to yes. The internode MPI latency is reduced from about 5.1 microseconds to 4.5 microseconds if MP_SINGLE_THREAD is set to "yes."

MPI-IO and MPI one-sided functions are unavailable if this variable is set to yes.

TypeSpecificationValuesDefault
Environment variable: MP_SINGLE_THREAD yes | no no
LoadLeveler directive: N/A N/A N/A
*This was set to "yes" in the default NERSC home dot-files prior to August 2, 2006.

Running on Bassi

Interactive jobs

Serial Jobs

You run interactive serial programs on the node you are currently logged onto by typing the executable file's name, e.g.:

bassi%  ./a.out

Processes run on the login nodes are subject to limits on certain resources such as memory. These limits are in place to make sure that the login nodes are available and responsive to all users logged in. If your program needs greater resources than provided by the interactive limits, you should run it in batch. There are both soft (default) and hard (maximum) limits. The limits are:

ResourceSoft LimitHard Limit
Memory (data)128 MB2 GB
CPU Time3600 secs3600 secs
MAX Processes512512

You query and change these limits with the limit command for csh and tcsh users and with the ulimit command for ksh, sh, and bash users.

Query the limits:

% limit 		(csh, tcsh: shows limits in effect)
% limit -h		(csh, tcsh: shows maximum values) 
% ulimit -a 		(bash, sh, ksh: shows limits in effect)
% limit -aH		(bash,sh, ksh: shows maximum values) 
Changing the limits (example for memory that can be allocated by a program):
% limit datasize 2097151 	(csh, tcsh)
% ulimit -d 2097152		(bash, sh, ksh) 

You cannot run interactive serial jobs on any of the "compute" nodes.

Parallel Jobs

Interactive parallel programs are executed by the Parallel Operating Environment (POE) software. POE can run the job only on the "compute" nodes. Parallel jobs do not run on the login nodes that are used for interactive terminal sessions and interactive serial jobs.

Interactive jobs are run through the LoadLeveler job scheduling software in the interactive class. Therefore interactive jobs are subject to class resource limits and policies associated with the interactive class.

To run a parallel job interactively:

  1. compile your program with one of the "parallel" compiler invocations, e.g. mpxlf
  2. use the poe command to invoke the Parallel Operating Environment.

NOTE: Use of the poe command is optional for programs compiled with one of the "parallel" compiler invocations. Those programs will execute just as if you had typed poe at the command line. In fact, there is no way to run an executable compiled with an "mp" compiler as a "serial" program.

Options can be passed to poe on the command line or by setting environment variables. Command line options override the environment variable settings.

Interactive jobs are charged to your default repository, unless you specify otherwise. See IBM Accounts & Charging.

Required POE flags

You should specify two of the following:

POE command line optionPOE Environment variable DefaultDescription
-procsMP_PROCS1 The number of program tasks.
-nodesMP_NODES1* Specifies the number of physical nodes on which to run the parallel tasks.
-tasks_per_nodeMP_TASKS_PER_NODE1* Specifies the number of tasks to be run on each of the physical nodes.
* Technically, these have undefined default values, but LoadLeveler's other defaults effectively set these default values to 1.

If you choose not to run with tasks per node set to the number of processors on the node, LoadLeveler will allocate tasks to the nodes in a balanced fashion.

There are a number of other available flags available. Two of the most useful flags are -retry N -retrycount M. These options specify an attempt to launch your parallel job should be made M times, with wait of N seconds between launch attempts. This is a good way of running an interactive job when the machine is busy.

Note that these are not the LoadLeveler keywords that are used in batch scripts, even though some of the names may be similar or even identical. Command-line arguments that poe does not recognize are passed to your program as arguments without warning or comment by poe.

See Operation and Use, Volume 1 Using the Parallel Operating Environment for POE command line flags and environment variables.

For example, to run a parallel job with N total tasks on M "compute" nodes, with 10 attempts to run the job, waiting 30 seconds between attempts, you would type:

bassi% poe ./a.out -procs N -nodes M -retry 30 -retrycount 10

Running on Bassi

Bassi Batch Jobs

This document describes the IBM SP batch queuing system called Loadleveler as implemented at NERSC. It explains how to submit batch jobs, how to monitor the progress of jobs after they are submitted, and how to delete unwanted jobs. Example scripts are included which can be used as templates for customization.


Overview

All production runs on the SP should be done through the batch system. IBM's batch system software is called LoadLeveler. Since interactive resources are scarce on this system, most development and debugging runs should also be done in batch. This page describes how to use Loadleveler.

To submit a job to LoadLeveler create a batch script which contains the job's requirements, then submit the job to the batch system. You can then monitor the job as it progresses through the queue and can delete unwanted jobs. Jobs may be submitted at various NERSC charge priority values.

Here is a list of terms commonly used in reference to Loadleveler.

node

A Bassi node contains 8 CPUs. Your batch job has exclusive use of each node it uses and you are charged for the use of all CPUs while your job is resident on a node.

Processors within a node share a common pool of memory.

keyword

A Loadleveler batch script consists of UNIX shell commands and LoadLeveler keywords. See Loadleveler keywords commonly used on Bassi. A complete list of all Loadleveler keywords is available in IBM's documentation, but note that not all available keywords are relevant or useful on Bassi.

job ID

Loadleveler assigns an ID to every submitted job. The term "Job ID" will refer to the identifier used to monitor or delete a job.

class

This is similar to what is commonly called a queue. Each class has associated limits and properties, and a user selects a "submit" class for the job. There are also "destination" classes in which the job ultimately runs. In general, users submit to a submit class and the job is automatically routed to a destination class. Users cannot submit jobs to a destination class directly.

Class Info and Policies

Classes and Job Scheduling

All Loadleveler jobs must be submitted to a valid submit class. If the class doesn't exist, no error message will be issued. The job will be submitted, but will sit in the queue indefinitely. If this happens to you, you must delete the job using the llcancel command and resubmit to an available class.

NERSC users specify one of the following submit classes to queue jobs. Upon submission the job is routed to the appropriate LoadLeveler class according to the following criteria. (Users can not directly access the LoadLeveler classes.)

Submit Class1 Job Type Destination Class2 Nodes Available Processors Max Wallclock Relative Priority3 MPP Charge (units of nodes*wall hrs) Availability
interactive parallel interactive 1-4 1-32 30 mins 1 48 Everyone
debug4 parallel debug 1-8 1-64 30 mins 2 48 Everyone
premium5 parallel premium 1-48 1-384 12 hrs 4 96 Everyone
regular parallel reg_1 1-15 1-120 36 hrs 5 48 Everyone
parallel reg_16 16-31 121-248 18 hrs 5 48 Everyone
parallel reg_32 32-48 249-384 18 hrs 5 48 Everyone
low parallel low 1-32 1-256 12 hrs 6 24 Everyone
special6 parallel special 1-64 1-512 48 hrs 3 48 By special arrangement
full_config6 parallel full_config 1-ALL 1-ALL 48 hrs 3 48 By special arrangement

Notes

1 - This is the class name to be used in LoadLeveler scripts.
2 - Users cannot submit scripts directly to a destination class, but this is the class name that will appear when using job monitoring utilities.
3 - The priorites listed in the table are relative. NERSC assigns priorities in terms of "equivalent days waiting in the queue". In addition to the relative priority given to jobs depending on their LoadLeveler class, certain projects with high priority within DOE receive a "scheduling boost". These tend to be INCITE projects.
4 - 4 nodes are reserved exclusively for interactive and debug use weekdays from 5:00 to 18:00 Pacific Time.
5 - The intent of the premium queue is to allow for faster turnaround before conferences and urgent project deadlines. It should be used with care, and in most cases a project should not spend more than 10 percent of its time in premium.
6 - Available by special arrangement only.

See also Queue Policies, below, for information on run limits.


You can use the llclass command on the system to obtain information about the LoadLeveler classes. Detailed information about a single LoadLeveler class can be found using llclass -l classname.

If you request more wall clock time than allowed by the class (as indicated by the Max Wallclock column in the table above), your job will be submitted with the wall clock time adjusted to the maximum allowed. If you omit requested time, then a default of 30 minutes will be used.

Your job will be charged and scheduled according to the priority listed in the class name. Both interactive and debug are charged at the regular rate. See MPP Accounting.

The classes are configured to give the best service to premium and regular jobs. Premium jobs are charged at twice the rate of regular jobs, but are scheduled at a higher priority.

Loadleveler uses a scheduling technique called "backfilling". This method starts smaller, shorter jobs if they will not affect the start time for the job that is scheduled to begin next. This scheduling technique is advantageous from both a user and system perspective. It allows a faster turn around for shorter jobs, and it maximizes system usage.

NERSC Queue Policies for Bassi

Common commands

Here's a list of commonly used Loadleveler commands with brief descriptions. Detailed information on these commands can be obtained by typing:

% man commandname
Command Description
llsubmit Submits a script file to Loadleveler for execution
llqs Lists all Loadleveler jobs, including requested wall clock time and number of nodes.
llq Same as llqs but with less information.
llcancel Deletes a job from the Loadleveler queue
llclass Displays information about the configured classes
llstatus Displays status of individual nodes on the SP

The llqs command is specific to NERSC. There are a number of options available, please see the man page.

The llq command has been usurped by llqs but it is still useful for details about a particular job. See Monitoring jobs for more information.

Creating batch jobs

To submit a batch job to Loadleveler you need to create a script file containing the commands that make up the batch job. This file must contain the following types of commands:

Here is an example of a script file named myjob that runs a program named a.out when submitted from Bassi:

#@ job_name        = myjob
#@ account_no      = repo_name
#@ output          = myjob.out
#@ error           = myjob.err
#@ job_type        = parallel
#@ environment     = COPY_ALL
#@ notification    = complete
#@ network.MPI     = sn_all,not_shared,us
#@ node_usage      = not_shared
#@ class           = regular
#@ bulkxfer	   = yes
#
#
#@ tasks_per_node  = 8
#@ node	           = 1
#@ wall_clock_limit= 01:00:00
#
#@ queue

./a.out 

A program to check a LoadLeveler script for common simple errors is available on Bassi. This does not provide a comphrehensive error check (far from it, in fact!)

% ll_check_script script_name

LoadLeveler keywords commonly used at NERSC

For most jobs run on Bassi, NERSC suggests that the following keywords be specified. Some of these keywords may have default values, but NERSC considers them to be undefined and you should never rely on their existence or persistence.

There are many, many LoadLeveler keywords. Many do not have any meaning in the NERSC environment. Many others can contain site-specific values. NERSC recommends that scripts contain the minimum set of keywords needed to run your job in the way you want. If you are trying to use keywords that you do not understand, it is best to delete them. A common error is to specify a keyword and value that do not apply at NERSC (or are wrong for the NERSC environment); such jobs will be accepted into the queue, but will never run.

A complete description of Loadleveler keywords can be found in IBM's LoadLeveler manual.

Required LoadLeveler Keywords
#@ job_type = parallel Specifies that your job is parallel.
#@ node = N Specifies that your job will use N nodes.
#@ network.MPI =sn_all,not_shared,US Specifies communication protocol and adapters. csss is an alias for sn_all. Please see the LL documentation if you wish to use LAPI.
#@ class = class Specifies that your job should be run in the submit class class.
#@ tasks_per_node = N Specifies the number of tasks that will run on each node. To run a different number of tasks on different nodes, see Selecting the number of nodes and tasks.
#@ wall_clock_limit = HH:MM:SS Specifies that your job requires HH:MM:SS of wall clock time.
#@ queue Places your job in the queue. It must be the last keyword specified. Any keywords placed after this in the script are ignored by the current job step.

Recommended Loadleveler Keywords
#@ node_usage = not_shared Specifies that your job should not share nodes with other jobs. This is the default and cannot be overridden.
#@ job_name = jobname Specifies the name of your job.
#@ account_no = repo_name Specifies the repository that will be charged. If omitted, your default repository will be used.
#@ notification = complete Specifies that email should be sent upon completion of the job. Other options include always, error, start, and never. The default is complete.
#@ shell = /usr/bin/shell Specifies the shell that parses the script. If not specified, your login shell will be used.
#@ bulkxfer = yes Specifies the the job use "bulk transfer" for message passing. Most codes will benefit from using "bulk transfer".
#@resources = ConsumableMemory(N mb) Specifies the memory available to your job per node. See Memory usage considerations. The default value of N is 3395 mb per task.

f
Other useful LoadLeveler keywords
#@ output = myjob.out Specifies the name and location of STDOUT. If not given, the default is /dev/null. The default directory is the submitting directory.
#@ error = myjob.err specifies the name and location of STDERR. If not given, the default is /dev/null. The default directory is the submitting directory.
#@ environment = COPY_ALL Specifies that all environment variables from your shell should be used. You can also list individual variables which should be separated with semicolons.

If you are using a script that was used on another (non-SMP) IBM machine, beware of the following keywords:

These keywords specify the maximum/minimum number of nodes (not processors!) requested for a parallel job, regardless of the number of processors contained in the node. The node keyword should be used instead.

Selecting the number of tasks and nodes

You must specify the number of tasks and number of nodes you wish to use. Remember that Bassi has a maximum of 8 tasks per node; if you ask for more than 8 tasks per node (either explicitly or implicitly) your job will never run.

You usually specify numbers of tasks and nodes by including two of the following three LoadLeveler keywords in your batch script:

For example, to run on 4 nodes and pack the nodes with 8 MPI tasks per node, in your LoadLeveler script use:

#@ node = 4
#@ tasks_per_node = 8

To use a number of tasks that is not a multiple of 8, do not set the tasks_per_node keyword. For 29 tasks on 6 nodes, you would use:

#@ node = 6
#@ total_tasks = 29

LoadLeveler will allocate tasks on the nodes in a round-robin fashion.

It is also possible to select which tasks run together on a node. However you must be very careful to completely specify all tasks and the correct number of nodes. Do not use any of the three keywords listed above. Instead, to run 8 tasks on 3 nodes with task 0 on one node, odd tasks on another node and the remaining even tasks on the third node, use:

#@task_geometry = {(0)(1,3,5,7)(2,4,6)}

See LoadLeveler documentation on "Task Assignment Considerations".

Submitting Jobs

Once you have created a script file, it can be submitted to Loadleveler for execution using the llsubmit command:

% llsubmit script_file
llsubmit: The job "s01007.nersc.gov.101" has been submitted.

script_file specifies the name of the script file containing commands that comprise the batch job and keywords which control various aspects of the job, e.g., resource limits, output files, mail messages, etc.

You do not specify a job priority in addition to the class. The class will contain the name of the priority at which the job will be charged. See Class info and job scheduling.

There are system limits on the number of jobs a user can have in various states. See NERSC queue policies for Bassi.

Job steps and dependencies

LoadLeveler allows you to define "job steps" in a single batch script. Each step is essentially a different job that can run based on a set of "dependencies" on other job steps contained in the same script. LoadLeveler class limits apply to each step separately.

See the LoadLeveler documentation for full details. Here we describe how to run an executable based on whether or not a previous program completed successfully. This strategy might be used by a code that writes checkpoint files at the end of a run and then uses those files as input for the next run.

You define steps in a LoadLeveler script by naming them with the keyword definition

#@step_name	=	step_name

followed by other definitions for that step and finally a "queue" directive. Additional steps follow.

Some LoadLeveler directives must be included in each step. A typical first job step for an MPI code would look like this:

#@ step_name    = step1
#@ job_type = parallel
#@ node = 8
#@ wall_clock_limit = 24:00:00
#@ class = regular
#@ tasks_per_node = 8 
#@ network.MPI = sn_all,not_shared,us
#@ executable = a.out
#@ queue

The second job step contains a dependency, as specified on the with the "dependency" keyword. Here we are going to assume that if the first job step is successful it returns an error code of 0 (zero). Here is a typical second job step:

#@ step_name    = step2
#@ dependency   = (step1==0)
#@ job_type = parallel
#@ wall_clock_limit = 24:00:00
#@ class = regular
#@ node = 4
#@ tasks_per_node = 8 
#@ network.MPI = sn_all,not_shared,us
#@ executable = b.out
#@ queue

The second job step will only be run if step1 has an exit code of 0. The executable a.out and b.out can themselves be executable shell scripts.

You can also include "if" statements in your LoadLeveler script itself to take different actions for different job steps. Here's an full example script that run the toy "flip" program introduced in the Introduction to MPI tutorial.

#@ class = debug       
#@ shell = /usr/bin/csh
#@ wall_clock_limit = 0:05:00
#@ notification = always 
#@ output = $(host).$(jobid).$(stepid).out
#@ error = $(host).$(jobid).$(stepid).err
#@ environment = COPY_ALL
#
#@ step_name    = step1
#@ job_type = parallel
#@ node = 1
#@ tasks_per_node =  4
#@ network.MPI = sn_all,not_shared,us 
#@ queue
#
#@ step_name    = step2
#@ dependency   = (step1==0)
#@ job_type = parallel
#@ node = 1
#@ tasks_per_node =  2
#@ network.MPI = sn_all,not_shared,us 
#@ queue


if ($LOADL_STEP_NAME == "step1") then
        ./flip
else if ($LOADL_STEP_NAME == "step2") then
        echo "Step 1 returned exit code of 0."
        echo "100500" >flip.in
        ./flip
else
        echo "Error: no value for step name."
endif 
exit

Monitoring jobs

To monitor your job after submission, you can use the llqs command:

%  llqs 
Step Id        JobName      UserName  Class  ST NDS WallClck Submit Time
------------- ------------ -------- ------- -- --- -------- -----------
b0307.1087.0   a240         buffy    regular R   32 00:31:44  3/13 04:30
b0301.529.0    s1.x         willow   regular R   64 00:28:17  3/12 21:45
b0301.578.0    xdnull       xander   debug   R    5 00:05:19  3/14 12:44
b0309.929.0    s01009.ners  spike    regular R  128 03:57:27  3/13 05:17
b0301.530.0    s2.x         willow   regular I   64 04:00:00  3/12 21:48
b0301.532.0    s3.x         willow   regular I   64 04:00:00  3/12 21:50
b0301.533.0    y1.x         willow   regular I   64 04:00:00  3/12 22:17
b0301.534.0    y2.x         willow   regular I   64 04:00:00  3/12 22:17
b0301.535.0    y3.x         willow   regular I   64 04:00:00  3/12 22:17
b0301.537.0    s01001.nerg  spike    regular I  128 02:30:00  3/13 06:10
b0309.930.0    s01009.nerg  spike    regular I  128 02:30:00  3/13 07:17

This information can also be found on the NERSC Batch Queue Status page.

Using phost to determine node IDs

NERSC has written a utility, phost, that can be used to record which tasks ran on which node.

Deleting jobs

The llcancel command is used to delete a job.

% llcancel joblist

The joblist contains the list of Job IDs you want to delete. You can get this name from the first column of the output from either the llqs or the llq command:

bassi%  llq 
Id                  Owner      Submitted   ST PRI Class        Running On 
------------------- ---------- ----------- -- --- ------------ -----------
 b01015.137.0       bones      10/7  12:05 R  50  regular       s00608    
 b01005.84.0        spock      10/7  16:51 R  50  regular       s01816    
 b01015.177.0       jimkirk    10/7  17:01 R  50  interactive   s00713    
 b01015.174.0       sulu       10/7  16:54 R  50  low           s01601    
 b01015.176.0       uhura      10/7  17:00 R  50  debug         s00101    

5 job steps in queue, 0 waiting, 0 pending, 5 running, 0 held

For example, if user spock wanted to delete a job, then the command is:

% llcancel s01005.84.0
llcancel: Cancel command has been sent to the central manager.

The llcancel command works for both running and queued jobs.

Account charging

By default, job usage is charged to your default repository. Use the NERSC Information Management (NIM) system to view your default repo.

You can specify the repo to be charged in your LoadLeveler script. Use this keyword:

#@account_no = repo_name

Your job charge depends on which class you specify. In general, one submits a job to the class which contains the name of the desired NERSC batch charging priority. For example, a premium job is submitted to the class named "premium". Interactive and debug jobs are charged at the same rate as regular class jobs.

See Accounting on the IBM for information about how accounts are charged.

Example batch script

Here is an example batch script.

#@ job_name        = myjob
#@ account_no      = repo_name
#@ output          = myjob.out
#@ error           = myjob.err
#@ job_type        = parallel
#@ environment     = COPY_ALL
#@ notification    = complete
#@ network.MPI     = sn_all,not_shared,us
#@ node_usage      = not_shared
#@ class           = regular
#
#
#@ node            = 16
#@ tasks_per_node  = 8 
#@ wall_clock_limit= 01:00:00
#
#@ bulkxfer        = yes
#
#@ queue

./a.out < input

Class Info and Policies

Classes and Job Scheduling

All Loadleveler jobs must be submitted to a valid submit class. If the class doesn't exist, no error message will be issued. The job will be submitted, but will sit in the queue indefinitely. If this happens to you, you must delete the job using the llcancel command and resubmit to an available class.

NERSC users specify one of the following submit classes to queue jobs. Upon submission the job is routed to the appropriate LoadLeveler class according to the following criteria. (Users can not directly access the LoadLeveler classes.)

Submit Class1 Job Type Destination Class2 Nodes Available Processors Max Wallclock Relative Priority3 MPP Charge (units of nodes*wall hrs) Availability
interactive parallel interactive 1-4 1-32 30 mins 1 48 Everyone
debug4 parallel debug 1-8 1-64 30 mins 2 48 Everyone
premium5 parallel premium 1-48 1-384 12 hrs 4 96 Everyone
regular parallel reg_1 1-15 1-120 36 hrs 5 48 Everyone
parallel reg_16 16-31 121-248 18 hrs 5 48 Everyone
parallel reg_32 32-48 249-384 18 hrs 5 48 Everyone
low parallel low 1-32 1-256 12 hrs 6 24 Everyone
special6 parallel special 1-64 1-512 48 hrs 3 48 By special arrangement
full_config6 parallel full_config 1-ALL 1-ALL 48 hrs 3 48 By special arrangement

Notes

1 - This is the class name to be used in LoadLeveler scripts.
2 - Users cannot submit scripts directly to a destination class, but this is the class name that will appear when using job monitoring utilities.
3 - The priorites listed in the table are relative. NERSC assigns priorities in terms of "equivalent days waiting in the queue". In addition to the relative priority given to jobs depending on their LoadLeveler class, certain projects with high priority within DOE receive a "scheduling boost". These tend to be INCITE projects.
4 - 4 nodes are reserved exclusively for interactive and debug use weekdays from 5:00 to 18:00 Pacific Time.
5 - The intent of the premium queue is to allow for faster turnaround before conferences and urgent project deadlines. It should be used with care, and in most cases a project should not spend more than 10 percent of its time in premium.
6 - Available by special arrangement only.

See also Queue Policies, below, for information on run limits.


You can use the llclass command on the system to obtain information about the LoadLeveler classes. Detailed information about a single LoadLeveler class can be found using llclass -l classname.

If you request more wall clock time than allowed by the class (as indicated by the Max Wallclock column in the table above), your job will be submitted with the wall clock time adjusted to the maximum allowed. If you omit requested time, then a default of 30 minutes will be used.

Your job will be charged and scheduled according to the priority listed in the class name. Both interactive and debug are charged at the regular rate. See MPP Accounting.

The classes are configured to give the best service to premium and regular jobs. Premium jobs are charged at twice the rate of regular jobs, but are scheduled at a higher priority.

Loadleveler uses a scheduling technique called "backfilling". This method starts smaller, shorter jobs if they will not affect the start time for the job that is scheduled to begin next. This scheduling technique is advantageous from both a user and system perspective. It allows a faster turn around for shorter jobs, and it maximizes system usage.

NERSC Queue Policies for Bassi

Running on Bassi

Monitoring your job

NERSC has developed a utility for users to get information about queued and running jobs called llqs. In addition to the information provided by the native Loadleveler llq command, it displays the number of nodes and wallclock time. The -C option shows the repository to which the job is being charged. The -u username option shows jobs of just one user.

Bassi's queues are displayed on the web, updated every 10 minutes. A completed jobs list is updated daily at midnight.

Other utilities are provided by IBM. See Monitoring Batch Jobs


LBNL Home
Page last modified: Thu, 17 Jan 2008 23:41:20 GMT
Page URL: http://www.nersc.gov/nusers/resources/bassi/running_jobs/print.php
Web contact: webmaster@nersc.gov
Computing questions: consult@nersc.gov

Privacy and Security Notice
DOE Office of Science