NERSC: Powering Scientific Discovery Since 1974

Runtime Tuning Options

 MPI Task Distribution on Nodes

The distribution of MPI tasks on the nodes can be written to the standard output file by setting the environment variable MPICH_RANK_REORDER_DISPLAY to 1. Users can control that distribution with the environment variable MPICH_RANK_REORDER_METHOD.

The default task distribution in quad-core mode is SMP-style placement, which corresponds to setting the environment variable MPICH_RANK_REORDER_METHOD to 1. For example, 8 MPI tasks would be distributed as follows:

Node      Node 1                          Node 2
Core      Core 1  Core 2  Core 3  Core 4  Core 1  Core 2  Core 3  Core 4
MPI Rank  0       1       2       3       4       5       6       7

Setting MPICH_RANK_REORDER_METHOD to 0 selects a round-robin placement of MPI tasks:

Node      Node 1                          Node 2
Core      Core 1  Core 2  Core 3  Core 4  Core 1  Core 2  Core 3  Core 4
MPI Rank  0       2       4       6       1       3       5       7

Setting MPICH_RANK_REORDER_METHOD to 2 selects a folded-rank placement of MPI tasks:

Node      Node 1                          Node 2
Core      Core 1  Core 2  Core 3  Core 4  Core 1  Core 2  Core 3  Core 4
MPI Rank  0       3       4       7       1       2       5       6

Setting MPICH_RANK_REORDER_METHOD to 3 selects a custom placement of MPI ranks, which must be supplied in a user-defined file named MPICH_RANK_ORDER. See the intro_mpi man page for more information.
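The three built-in placements can be reproduced with a few lines of arithmetic, which may help when predicting where a given rank will land. The following Python sketch is illustrative only: the function name place_ranks is hypothetical, the formulas are inferred from the 8-task examples above rather than taken from the MPI library, and the rank count is assumed to divide evenly across the nodes.

```python
def place_ranks(num_ranks, num_nodes, method):
    """Return one list of MPI ranks per node for a reorder method.

    method 1: SMP-style, method 0: round-robin, method 2: folded-rank.
    Formulas are inferred from the 8-task examples, not from the library.
    """
    if num_ranks % num_nodes != 0:
        raise ValueError("sketch assumes ranks divide evenly across nodes")
    per_node = num_ranks // num_nodes
    nodes = [[] for _ in range(num_nodes)]
    for rank in range(num_ranks):
        if method == 1:
            # SMP-style: fill one node completely before moving to the next.
            node = rank // per_node
        elif method == 0:
            # Round-robin: deal ranks out to the nodes one at a time.
            node = rank % num_nodes
        elif method == 2:
            # Folded: sweep forward over the nodes, then backward, and repeat.
            pos = rank % (2 * num_nodes)
            node = pos if pos < num_nodes else 2 * num_nodes - 1 - pos
        else:
            raise ValueError("only methods 0, 1 and 2 are sketched here")
        nodes[node].append(rank)
    return nodes


for method in (1, 0, 2):
    print(method, place_ranks(8, 2, method))
```

For 8 ranks on 2 nodes, the per-node lists match the three tables above. Method 3 is not covered because its placement comes from the MPICH_RANK_ORDER file.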

 CNL malloc Environment Variables

The CNL kernel provides the following runtime malloc tunable environment variables to control how the system memory allocation routine "malloc" behaves (note the trailing underscores):

  • MALLOC_TRIM_THRESHOLD_
  • MALLOC_TOP_PAD_
  • MALLOC_MMAP_THRESHOLD_
  • MALLOC_MMAP_MAX_

The two variables that have been found most useful are MALLOC_MMAP_MAX_ and MALLOC_TRIM_THRESHOLD_ . The recommended settings for these two variables are:

  • MALLOC_TRIM_THRESHOLD_ = 536870912
  • MALLOC_MMAP_MAX_ = 0

Setting MALLOC_MMAP_MAX_ limits the number of 'internal' mmap regions. A value of 0, instead of the default of 64, means that the program will not use any non-heap mapped regions, which eliminates the system calls to mmap/munmap.

MALLOC_TRIM_THRESHOLD_ is the amount of free space that must exist at the top of the heap after a free() before malloc returns the memory to the OS. Raising it improves performance by reducing the number of calls to sbrk/brk, and with them the system time overhead. The default setting of 128 KBytes is much too low for a node with 4 GBytes of memory running a single application. We suggest setting it to 0.5 GBytes (536870912 bytes).
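As a quick check on the numbers, the recommended settings can be written out as an environment mapping. This Python sketch is illustrative only: the helper name recommended_malloc_env is hypothetical, and "KBytes" and "GBytes" are assumed to be binary units, which is consistent with 536870912 being exactly 2**29.

```python
KIB = 1024          # assumption: "KBytes" means 2**10 bytes
GIB = 1024 ** 3     # assumption: "GBytes" means 2**30 bytes


def recommended_malloc_env():
    """Return the two recommended settings as environment-variable strings."""
    return {
        "MALLOC_TRIM_THRESHOLD_": str(GIB // 2),  # 0.5 GBytes
        "MALLOC_MMAP_MAX_": "0",                  # no mmap regions for malloc
    }


env = recommended_malloc_env()
default_trim = 128 * KIB  # default trim threshold quoted in the text

print(env["MALLOC_TRIM_THRESHOLD_"])                      # 536870912
print(int(env["MALLOC_TRIM_THRESHOLD_"]) // default_trim)  # 4096
```

The recommended trim threshold is therefore 4096 times the default. The variables only take effect if they are set in the environment the application is launched from, for example by exporting them in the batch script before the job launch command.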

Please refer to Cray document CNL malloc Environment Variables for more information.

 Running Large Jobs

Many applications will run across the entire Franklin system with the default environment settings. However, if you have any problems, you may want to try adjusting the following variables. The standard error file of a failed run often reports which of these environment variable limits was reached.

Environment Variable    Default                               Recommended Range  Notes
MPICH_MSGS_PER_PROC     16384                                 1-38128            Jobs larger than the default setting do not necessarily need to adjust this parameter. Depends on the communication pattern.
MPICH_UNEX_BUFFER_SIZE  larger of 60M or (3000 * num_cores)   60M-260M           Try doubling the default value. Increase from there. Use the lowest value possible to avoid lowering the memory available to the application.
MPICH_PTL_UNEX_EVENTS   larger of 20480 or (2.2 * num_cores)  20480-163840       Try doubling the default value. Increase from there. Use the lowest possible value that enables a successful run.
MPICH_PTL_OTHER_EVENTS  larger of 2048 or (num_cores/4)       2048-131072        Try doubling the default value. Increase from there. Use the lowest possible value that enables a successful run.
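
The default formulas and the "try doubling" advice in the table can be worked through numerically for a given job size. The Python sketch below is illustrative only: the names default_limits and next_try are hypothetical, "60M" and "260M" are assumed to mean multiples of 2**20 bytes, fractional results are assumed to be truncated, and capping the doubled value at the top of the recommended range is a choice made here, not a documented rule.

```python
MIB = 1024 * 1024  # assumption: "M" in the table means 2**20 bytes

# Upper ends of the recommended ranges listed in the table.
RANGE_MAX = {
    "MPICH_MSGS_PER_PROC": 38128,
    "MPICH_UNEX_BUFFER_SIZE": 260 * MIB,
    "MPICH_PTL_UNEX_EVENTS": 163840,
    "MPICH_PTL_OTHER_EVENTS": 131072,
}


def default_limits(num_cores):
    """Evaluate the default formulas from the table for a job size."""
    return {
        "MPICH_MSGS_PER_PROC": 16384,
        "MPICH_UNEX_BUFFER_SIZE": max(60 * MIB, 3000 * num_cores),
        # 2.2 * num_cores in integer arithmetic (truncation assumed).
        "MPICH_PTL_UNEX_EVENTS": max(20480, (22 * num_cores) // 10),
        "MPICH_PTL_OTHER_EVENTS": max(2048, num_cores // 4),
    }


def next_try(name, current):
    """Double a setting, capped at the top of its recommended range."""
    return min(2 * current, RANGE_MAX[name])


limits = default_limits(38128)
for name, value in limits.items():
    print(f"{name}: default {value}, next try {next_try(name, value)}")
```

Under these assumptions, the constant term dominates each default for small jobs, and the per-core term takes over only at large core counts. When a run fails on one of these limits, only the variable named in the error needs to be raised, which keeps memory use as low as possible.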