Runtime Tuning Options
Cray MPI Environment Variables
Cray and NERSC attempt to set MPI environment variables to the best defaults for the majority of applications; however, adjusting these variables may in some cases improve application performance, or may be necessary to enable an application to run at all.
| Environment Variable Name | Description | Default | Range Recommendations |
|---|---|---|---|
| MPICH_GNI_MAX_EAGER_MSG_SIZE | Controls the threshold for switching from the eager to the rendezvous protocol for inter-node messages. | 8,192 bytes | < 131,072. Increasing this value may help applications that send medium-sized messages; for applications that spend a long time in MPI_Waitall, a higher value can also help. |
| MPICH_GNI_NUM_BUFS | Controls the number of 32 KB DMA buffers available to each rank for use in the eager protocol. | 64 32 KB buffers (2 MB total) | A modest increase may help, but this constrains other resources, so increase cautiously. |
| MPICH_GNI_NDREG_LAZYMEM | Controls whether a lazy memory deregistration policy is used inside UDREG. | On (variable set) | Set or unset. Disabling may help applications that use small 4 KB pages, but disable with caution: try compiling and running with large pages before disabling this setting, since disabling it reduces performance for large transfers. |
| MPICH_GNI_DMAPP_INTEROP | Relevant only for mixed MPI/SHMEM/UPC/CAF codes. When set, allows the MPICH2 and DMAPP interfaces to share a memory registration cache. | On (variable set) | Set or unset. May need to be disabled for codes that call shmem_init after MPI_Init. |
| MPICH_GNI_RDMA_THRESHOLD | Controls the threshold at which RDMA read/write operations switch from the FMA to the BTE method. | 1,024 bytes (max: 65,536 bytes) | The BTE method helps overlap communication with computation. NERSC is still exploring recommendations for this setting. |
| MPICH_SMP_SINGLE_COPY_SIZE | Specifies the threshold at which the shared-memory channel switches from a double-copy protocol to a single-copy (XPMEM) protocol for intra-node messages. | 8,192 bytes | NERSC is still exploring recommendations for this setting. |
| MPICH_SMP_SINGLE_COPY_OFF | If set, disables single-copy mode for the SMP device and forces all on-node messages, regardless of size, to be buffered. | Not enabled | NERSC currently sets this value to 1 by default. |
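Since the document's `setenv` example below uses csh syntax, here is a minimal sketch of how these variables might be set in a bash/sh batch script instead. The specific values are illustrative assumptions, not NERSC-recommended settings:

```shell
# bash/sh equivalents of csh `setenv` lines in a batch script.
# Values here are illustrative only; measure before adopting them.
export MPICH_GNI_MAX_EAGER_MSG_SIZE=16384   # raise the eager/rendezvous cutoff (default 8,192)
export MPICH_GNI_NUM_BUFS=128               # modest increase over the 64-buffer default
```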
MPI Task Distribution Across Nodes
The way MPI tasks are distributed across the Hopper compute nodes can affect the performance of an application. By default, MPI task ranks fill up one node before moving to fill up the next node. MPI ranks can be reordered by setting an environment variable in your batch script.
setenv MPICH_RANK_REORDER_METHOD 2
The table below describes the different reordering methods. Note that the example above selects folded rank placement (method 2).
| Environment Variable Setting | Method | Notes |
|---|---|---|
| 0 | Round-robin placement | Sequential ranks are placed on successive nodes in the list; placement starts over at the first node when the end of the list is reached |
| 1 | Default (SMP-style placement) | Sequential ranks fill each node before moving to the next node in the list |
| 2 | Folded rank placement | Like round-robin, except that each pass over the node list is in the opposite direction of the previous pass |
| 3 | Custom placement | The ordering is specified in a file named MPICH_RANK_ORDER |
Your application might benefit from a different MPI rank order if it spends a significant amount of time in point-to-point communication, or if you want to spread the tasks doing I/O across nodes. Because this is a simple runtime setting, users are encouraged to experiment with different MPI rank order methods to improve performance. See a performance study for 6 key NERSC benchmarks using different rank reorder methods.
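As a sketch of custom placement (method 3), the fragment below writes an MPICH_RANK_ORDER file and selects it. It assumes the Cray convention that the file lists global rank numbers, comma separated, in the order they should be placed (filling each node in turn); the 8-rank, 2-node layout shown is purely illustrative:

```shell
# Alternate ranks between two hypothetical 4-core nodes, e.g. to spread
# I/O-heavy ranks across nodes. Node 1 gets ranks 0,2,4,6; node 2 gets 1,3,5,7.
cat > MPICH_RANK_ORDER <<'EOF'
0,2,4,6,1,3,5,7
EOF
export MPICH_RANK_REORDER_METHOD=3   # select custom placement (bash syntax)
```

The file must be present in the job's working directory when the application launches.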
- XE6 Porting & Tuning Tips. Jeff Larkin, Cray. XE6 Workshop at NERSC. Feb 2011.