Runtime Tuning Options
MPI Task Distribution on Nodes
The distribution of MPI tasks on the nodes can be written to the standard output file by setting the environment variable MPICH_RANK_REORDER_DISPLAY to 1. Users can control this distribution with the environment variable MPICH_RANK_REORDER_METHOD.
The default task distribution in quad-core mode is SMP-style placement, which corresponds to setting MPICH_RANK_REORDER_METHOD to 1: consecutive ranks fill all cores of one node before moving to the next node. For example, 8 MPI tasks would be distributed as follows:
| Node | Node 1 | Node 1 | Node 1 | Node 1 | Node 2 | Node 2 | Node 2 | Node 2 |
|---|---|---|---|---|---|---|---|---|
| Core | Core 1 | Core 2 | Core 3 | Core 4 | Core 1 | Core 2 | Core 3 | Core 4 |
| MPI Rank | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
Setting MPICH_RANK_REORDER_METHOD to 0 selects round-robin placement, in which consecutive MPI ranks are assigned to successive nodes:
| Node | Node 1 | Node 1 | Node 1 | Node 1 | Node 2 | Node 2 | Node 2 | Node 2 |
|---|---|---|---|---|---|---|---|---|
| Core | Core 1 | Core 2 | Core 3 | Core 4 | Core 1 | Core 2 | Core 3 | Core 4 |
| MPI Rank | 0 | 2 | 4 | 6 | 1 | 3 | 5 | 7 |
Setting MPICH_RANK_REORDER_METHOD to 2 selects folded-rank placement, in which ranks are assigned across the nodes in order and then back across them in reverse order:
| Node | Node 1 | Node 1 | Node 1 | Node 1 | Node 2 | Node 2 | Node 2 | Node 2 |
|---|---|---|---|---|---|---|---|---|
| Core | Core 1 | Core 2 | Core 3 | Core 4 | Core 1 | Core 2 | Core 3 | Core 4 |
| MPI Rank | 0 | 3 | 4 | 7 | 1 | 2 | 5 | 6 |
Setting MPICH_RANK_REORDER_METHOD to 3 selects a custom placement of MPI ranks, which is read from a user-defined file named MPICH_RANK_ORDER. See the intro_mpi man page for more information.
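The three built-in placements (methods 0, 1, and 2) can be reproduced with a short calculation, which may help when predicting which ranks will share a node. The following Python sketch is illustrative only: the function name `place_ranks` is hypothetical, and the generalization beyond the 2-node, 4-core example in the tables is an assumption. The intro_mpi man page remains the authoritative description.

```python
def place_ranks(num_ranks, cores_per_node, method):
    """Return a list of nodes, each holding the MPI ranks placed on it.

    method follows the MPICH_RANK_REORDER_METHOD values described above:
    0 = round-robin, 1 = SMP-style, 2 = folded-rank.
    Assumes num_ranks is a multiple of cores_per_node (hypothetical helper).
    """
    if num_ranks % cores_per_node != 0:
        raise ValueError("num_ranks must be a multiple of cores_per_node")
    num_nodes = num_ranks // cores_per_node
    nodes = [[] for _ in range(num_nodes)]
    for rank in range(num_ranks):
        if method == 1:
            # SMP-style: fill every core of a node before moving on.
            node = rank // cores_per_node
        elif method == 0:
            # Round-robin: consecutive ranks go to successive nodes.
            node = rank % num_nodes
        elif method == 2:
            # Folded: sweep forward across the nodes, then backward.
            sweep, pos = divmod(rank, num_nodes)
            node = pos if sweep % 2 == 0 else num_nodes - 1 - pos
        else:
            raise ValueError("only methods 0, 1 and 2 are sketched here")
        nodes[node].append(rank)
    return nodes


# Reproduce the 8-task, 2-node examples from the tables above.
for method in (1, 0, 2):
    print(method, place_ranks(8, 4, method))
```

Method 3 is left out deliberately, since its placement comes entirely from the MPICH_RANK_ORDER file.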
CNL malloc Environment Variables
The CNL kernel provides the following runtime malloc tunable environment variables to control how the system memory allocation routine "malloc" behaves (note the trailing underscores):
- MALLOC_TRIM_THRESHOLD_
- MALLOC_TOP_PAD_
- MALLOC_MMAP_THRESHOLD_
- MALLOC_MMAP_MAX_
The two variables that have been found most useful are MALLOC_MMAP_MAX_ and MALLOC_TRIM_THRESHOLD_. The recommended settings for these two variables are:
- MALLOC_TRIM_THRESHOLD_ = 536870912
- MALLOC_MMAP_MAX_ = 0
MALLOC_MMAP_MAX_ limits the number of internal mmap regions that malloc may use. The default value is 64; a setting of 0 means that the program will not use any mmap regions outside the heap, which eliminates the system calls to mmap/munmap.
MALLOC_TRIM_THRESHOLD_ is the amount of free space that must exist at the top of the heap after a free() before malloc will return the memory to the OS. Raising MALLOC_TRIM_THRESHOLD_ helps performance by reducing the number of calls to sbrk/brk, and with it the system time overhead. The default setting of 128 KBytes is much too low for a node with 4 GBytes of memory running a single application. We suggest setting it to 0.5 GBytes (536870912 bytes).
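As a rough illustration of the recommended values, the Python sketch below builds an environment dictionary carrying the two settings and shows that 0.5 GBytes corresponds to 536870912 bytes. The helper name `malloc_tuning_env` is hypothetical. Whether a launcher started with this environment passes the variables on to the compute nodes is an assumption that should be checked on the system in question.

```python
import os


def malloc_tuning_env(trim_threshold_bytes=512 * 1024**2, mmap_max=0,
                      base_env=None):
    """Return a copy of an environment with the two malloc tunables set.

    Defaults follow the recommended settings above: a trim threshold of
    0.5 GBytes (512 * 1024**2 = 536870912 bytes) and no mmap regions.
    Hypothetical helper, not part of any Cray or NERSC tool.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Note the trailing underscores in both variable names.
    env["MALLOC_TRIM_THRESHOLD_"] = str(trim_threshold_bytes)
    env["MALLOC_MMAP_MAX_"] = str(mmap_max)
    return env


# Start from an empty environment so only the two tunables are shown.
env = malloc_tuning_env(base_env={})
for name, value in sorted(env.items()):
    print(f"{name}={value}")
```

A dictionary built this way could be passed as the `env` argument of `subprocess.run` when starting the job launcher from a Python script. Setting the same two variables in the batch script works equally well.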
Please refer to the Cray document CNL malloc Environment Variables for more information.
Running Large Jobs
Many applications will run across the entire Franklin system with the default environment settings. However, if you have any problems, you may want to try adjusting the following variables. Often in the standard error file of a failed run, Franklin will report which of these environment variable limits were reached.
| Environment Variable | Default | Recommended Range | Notes |
|---|---|---|---|
| MPICH_MSGS_PER_PROC | 16384 | 1-38128 | Jobs larger than the default setting do not necessarily need to adjust this parameter; it depends on the communication pattern. |
| MPICH_UNEX_BUFFER_SIZE | larger of 60M and (3000 * num_cores) | 60M-260M | Try doubling the default value and increase from there. Use the lowest value that enables a successful run, to avoid reducing the memory available to the application. |
| MPICH_PTL_UNEX_EVENTS | larger of 20480 and (2.2 * num_cores) | 20480-163840 | Try doubling the default value and increase from there. Use the lowest value that enables a successful run. |
| MPICH_PTL_OTHER_EVENTS | larger of 2048 and (num_cores/4) | 2048-131072 | Try doubling the default value and increase from there. Use the lowest value that enables a successful run. |
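The defaults in the table depend on the job size, so it can help to compute them before choosing a first value to try. The Python sketch below evaluates the default formulas for a given core count and applies the "double the default" advice, capped at the top of the recommended range. The function names are hypothetical. Reading "M" as 2**20 bytes, taking 3000 * num_cores as a byte count, and truncating fractional results to integers are all assumptions.

```python
MIB = 1024 ** 2  # assumption: "M" in the table means 2**20 bytes

# Upper ends of the recommended ranges from the table above.
RECOMMENDED_MAX = {
    "MPICH_UNEX_BUFFER_SIZE": 260 * MIB,
    "MPICH_PTL_UNEX_EVENTS": 163840,
    "MPICH_PTL_OTHER_EVENTS": 131072,
}


def default_settings(num_cores):
    """Evaluate the default formulas from the table for a job size."""
    return {
        "MPICH_MSGS_PER_PROC": 16384,
        "MPICH_UNEX_BUFFER_SIZE": max(60 * MIB, 3000 * num_cores),
        # Fractional results are truncated (an assumption).
        "MPICH_PTL_UNEX_EVENTS": max(20480, int(2.2 * num_cores)),
        "MPICH_PTL_OTHER_EVENTS": max(2048, num_cores // 4),
    }


def next_value_to_try(name, current):
    """Double the current value, capped at the recommended maximum."""
    return min(2 * current, RECOMMENDED_MAX[name])


defaults = default_settings(4096)
for name in RECOMMENDED_MAX:
    print(name, defaults[name], "->", next_value_to_try(name, defaults[name]))
```

If a run still fails at the doubled value, calling `next_value_to_try` again on the new value follows the "increase from there" advice, while the cap keeps the result inside the recommended range.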


