
Runtime Tuning Options

Cray MPI Environment Variables

Cray and NERSC attempt to set MPI environment variables to the best defaults for the majority of applications; however, adjusting these variables may in some cases improve application performance, or may be necessary to enable an application to run at all.

MPICH_GNI_MAX_EAGER_MSG_SIZE
  Description: Controls the threshold for switching from the eager to the rendezvous protocol for inter-node messages.
  Default: 8,192 bytes
  Range: less than 131,072 bytes
  Recommendations: Increasing this value may help applications that send medium-sized messages; it can also help applications that spend a long time in MPI_Waitall.

MPICH_GNI_NUM_BUFS
  Description: Controls the number of 32 KB DMA buffers available to each rank for the eager protocol.
  Default: 64 buffers (2 MB total)
  Recommendations: A modest increase may help, but it will constrain other resources, so do so cautiously.

MPICH_GNI_NDREG_LAZYMEM
  Description: Controls whether a lazy memory deregistration policy is used inside UDREG.
  Default: on (variable set)
  Range: set or unset
  Recommendations: Disabling may help applications that use small 4 KB pages, but disable with caution: it causes a performance drop for large transfers. Try compiling and running with large pages before disabling this setting.

MPICH_GNI_DMAPP_INTEROP
  Description: Relevant only for mixed MPI/SHMEM/UPC/CAF codes. When set, allows MPICH2 and the DMAPP interface to share a memory registration cache.
  Default: on (variable set)
  Range: set or unset
  Recommendations: May need to be disabled for codes that call shmem_init after MPI_Init.

MPICH_GNI_RDMA_THRESHOLD
  Description: Controls the threshold at which RDMA read/write operations switch from the FMA method to the BTE method.
  Default: 1,024 bytes
  Range: up to 65,536 bytes
  Recommendations: The BTE method helps overlap communication with computation. NERSC is still exploring recommendations for this setting.

MPICH_SMP_SINGLE_COPY_SIZE
  Description: Specifies the threshold at which the shared-memory channel switches from a double-copy protocol to a single-copy (XPMEM) protocol for intra-node messages.
  Default: 8,192 bytes
  Recommendations: NERSC is still exploring recommendations for this setting.

MPICH_SMP_SINGLE_COPY_OFF
  Description: If set, disables single-copy mode for the SMP device and forces all on-node messages, regardless of size, to be buffered.
  Default: not enabled
  Recommendations: NERSC currently sets this variable to 1 by default.
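
These variables are set in the batch script before the application is launched. Below is a minimal sketch, assuming a hypothetical application my_app run with 192 ranks; the values shown are illustrative starting points rather than NERSC recommendations, and should be tuned one variable at a time.

#!/bin/bash
# Illustrative excerpt from a batch script: raise the eager threshold for an
# application that sends many medium-sized messages and spends a long time
# in MPI_Waitall (the values here are assumptions for illustration).
export MPICH_GNI_MAX_EAGER_MSG_SIZE=65536   # must stay below 131,072 bytes
export MPICH_GNI_NUM_BUFS=128               # default is 64; increase modestly

aprun -n 192 ./my_app                       # my_app and -n 192 are placeholders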

MPI Task Distribution Across Nodes

The way MPI tasks are distributed across the Hopper compute nodes can affect the performance of an application.  By default, MPI ranks fill up one node before moving on to fill the next.  MPI ranks can be reordered by setting an environment variable in your batch script.

setenv MPICH_RANK_REORDER_METHOD 2    (csh/tcsh)
or
export MPICH_RANK_REORDER_METHOD=2    (bash/ksh)

The table below describes the different reordering methods.

0 (round-robin placement): Sequential ranks are placed on successive nodes in the list; placement starts over at the first node when the end of the list is reached.
1 (SMP-style placement, the default): Sequential ranks fill up each node before moving to the next node in the list.
2 (folded rank placement): Like round-robin, except that each pass over the node list is in the opposite direction of the previous pass.
3 (custom placement): The ordering is specified in a file named MPICH_RANK_ORDER (see the example below).
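
As an illustration of the custom method, suppose an 8-rank job runs on 2 nodes with 4 ranks per node, and you want even ranks on the first node and odd ranks on the second (a hypothetical layout).  The batch script would set

export MPICH_RANK_REORDER_METHOD=3

and the file MPICH_RANK_ORDER in the job's working directory would list the ranks, comma-separated, in placement order:

0,2,4,6,1,3,5,7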

Your application might benefit from a different MPI rank order if it spends a significant amount of time in point-to-point communication, or if you want to spread the tasks doing I/O across nodes.  Because this is a simple runtime setting, users are encouraged to experiment with the different MPI rank order methods to improve performance.  See a performance study of 6 key NERSC benchmarks using different rank reorder methods.
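
One way to run such an experiment is to time the same binary under each built-in reorder method within a single batch script.  A minimal sketch, again assuming a hypothetical application my_app run with 192 ranks:

#!/bin/bash
# Time the same run under each built-in rank reorder method (0, 1, 2).
for method in 0 1 2; do
  export MPICH_RANK_REORDER_METHOD=$method
  echo "rank reorder method: $method"
  time aprun -n 192 ./my_app   # my_app and -n 192 are placeholders
done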

Additional References