Running Jobs
Running on Bassi: Runtime Configuration and Options
NOTE: Beginning September 2, 2006, the environment variable settings described below are ignored by the runtime system for jobs submitted through the batch system. They now apply only to jobs launched interactively from the shell command line.
Managing Task and Memory Affinity on SMPs
Bassi's SMP nodes are organized around components called Multi-chip Modules (MCMs). In general, an MCM contains several processors, I/O buses, and memory; on Bassi, each MCM contains one active processor. (A Bassi MCM might be better termed a "DCM," or dual-chip module, with one processor active.) While a processor in one MCM can access the I/O bus and memory in another MCM, most scientific applications will see improved performance if the processor, the memory it uses, and the I/O adapter it connects to are all in the same MCM.
Threaded applications, including OpenMP codes, should not request affinity in most cases. If task affinity is requested, all threads spawned from a single task are bound to a single MCM, whereas these codes usually run better when AIX is free to schedule and migrate the threads as efficiently as it can.
The runtime behavior is controlled by keyword settings in batch job scripts and by environment variable settings for interactive parallel jobs. Batch jobs that want memory, task, or I/O affinity must include two lines in their batch scripts: one to request affinity and another to specify affinity options. The most common specification for jobs run at NERSC is expected to be the following:
    #@ rset = rset_mcm_affinity
    #@ mcm_affinity_options = mcm_distribute, mcm_mem_pref, mcm_sni_none
Memory Affinity
When MCM memory affinity is requested, a processor preferentially obtains memory from its local MCM. This setting is independent of the task and I/O affinity settings described below.
In batch jobs, a setting of mcm_mem_pref requests that memory be allocated from the local MCM whenever possible, mcm_mem_none specifies no memory affinity, and mcm_mem_req requires that all memory be allocated from the local MCM.
Task Affinity
Task affinity settings control the placement of the tasks of a parallel job. When MCM affinity is requested, a task is not migrated between MCMs during its execution. The tasks are allocated in round-robin fashion among the MCMs attached to the job. By default, the tasks are allocated to all the MCMs in the node. Most codes will benefit from MCM task affinity, but it also binds each task and all the threads it spawns to a single MCM, which prevents effective use of OpenMP. OpenMP users need to unset MP_TASK_AFFINITY before running interactive parallel jobs. Likewise, MPI-IO and parallel HDF5 make heavy use of threads and will perform poorly unless task affinity is disabled.
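As a minimal sketch of the interactive OpenMP case described above, the shell setup might look like the following. OMP_NUM_THREADS is the standard OpenMP thread-count variable; the value 8, the binary name ./my_openmp_app, and the poe options shown in the comment are placeholders chosen for illustration, not settings taken from this page.

```shell
# Remove any task-affinity request so AIX is free to schedule and
# migrate the OpenMP threads across MCMs.
unset MP_TASK_AFFINITY

# Standard OpenMP thread count; 8 is a placeholder value.
export OMP_NUM_THREADS=8

# Launch step, left as a comment because it needs POE and a real binary.
# ./my_openmp_app is a hypothetical program name.
# poe ./my_openmp_app -procs 1
```

Because these are environment variables, the sketch applies only to jobs launched from the shell command line, not to batch jobs.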
In batch job scripts, a setting of mcm_distribute distributes the tasks as described above. A setting of mcm_accumulate attempts to accumulate all tasks onto a single MCM whenever possible.
Communication over HPS
The network switch on Bassi is known as HPS (High Performance Switch), or "Federation." Remote Direct Memory Access (RDMA) is a mechanism that allows large contiguous messages to be transferred with reduced message transfer overhead. To use RDMA in interactive parallel jobs, set the environment variable MP_USE_BULK_XFER to yes; for batch jobs, you must add the keyword #@ bulkxfer = yes.
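The batch keywords discussed so far can be combined in a single job command file. The sketch below writes such a file from the shell. Only the rset, mcm_affinity_options, and bulkxfer lines come from this page; the file name job_affinity.ll, the job name, the node and task counts, the wall-clock limit, and the program name ./my_mpi_app are all placeholders, and a real NERSC job may need additional site-specific keywords.

```shell
# Write a LoadLeveler job command file; job_affinity.ll is a hypothetical name.
cat > job_affinity.ll <<'EOF'
# Generic job settings (placeholder values, adjust for the real job).
#@ job_name         = affinity_sketch
#@ job_type         = parallel
#@ node             = 2
#@ tasks_per_node   = 8
#@ wall_clock_limit = 00:30:00
# Affinity request and options, as given in the text above.
#@ rset             = rset_mcm_affinity
#@ mcm_affinity_options = mcm_distribute, mcm_mem_pref, mcm_sni_none
# Enable RDMA (bulk transfer) for this batch job.
#@ bulkxfer         = yes
#@ queue
# ./my_mpi_app is a hypothetical MPI executable.
poe ./my_mpi_app
EOF

# Submit with: llsubmit job_affinity.ll
```

A threaded or MPI-IO-heavy code would leave out the two affinity lines, following the guidance in the Task Affinity section.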
Contiguous messages with data lengths greater than or equal to the value of MP_BULK_MIN_MSG_SIZE will use the bulk transfer path. Messages with data lengths that are smaller than the value you specify for this environment variable, or are noncontiguous, will use packet mode transfer.
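For an interactive run, the two bulk-transfer settings could be combined as sketched here. The threshold of 4096 bytes is the value used for the bandwidth comparison in this section; the program name ./my_mpi_app and the task count in the commented launch line are placeholders.

```shell
# Request RDMA (bulk transfer) for an interactive parallel job.
export MP_USE_BULK_XFER=yes

# Contiguous messages of at least this many bytes use the bulk transfer path;
# smaller or noncontiguous messages use packet mode.
export MP_BULK_MIN_MSG_SIZE=4096

# Launch step, left as a comment because it needs POE and a real binary.
# ./my_mpi_app is a hypothetical program name.
# poe ./my_mpi_app -procs 16
```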
The following image shows the point-to-point bandwidth for RDMA vs. non-RDMA MPI communication using MP_BULK_MIN_MSG_SIZE=4096.
Improved MPI Latency for Single-Threaded Applications
To avoid lock overheads in a program that is known to be single-threaded (that is, one with no user-created threads), set the environment variable MP_SINGLE_THREAD to yes. With this setting, the internode MPI latency drops from about 5.1 microseconds to about 4.5 microseconds. MPI-IO and MPI one-sided functions are unavailable when MP_SINGLE_THREAD is set to yes.
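Since this setting disables MPI-IO and the MPI one-sided functions, a launch wrapper might guard it as in the sketch below. The variable APP_NEEDS_MPI_IO is a hypothetical flag invented for this example, not part of the runtime system; set it to yes for any code that uses MPI-IO or one-sided calls.

```shell
# APP_NEEDS_MPI_IO is a hypothetical, user-defined flag (defaults to "no").
APP_NEEDS_MPI_IO=${APP_NEEDS_MPI_IO:-no}

if [ "$APP_NEEDS_MPI_IO" = "yes" ]; then
    # MPI-IO and one-sided functions are unavailable in single-thread mode,
    # so make sure the variable is not set.
    unset MP_SINGLE_THREAD
else
    # Single-threaded code: skip lock overheads for lower MPI latency.
    export MP_SINGLE_THREAD=yes
fi
```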
Page last modified: Mon, 14 Jan 2008 23:15:32 GMT
Page URL: http://www.nersc.gov/nusers/systems/bassi/running_jobs/runtime_config.php
Web contact: webmaster@nersc.gov
Computing questions: consult@nersc.gov