MPI Shared Memory Mode Setting Temporarily Changed to Prevent Node Failure

By Rebecca Hartman-Baker

NERSC engineers have identified an issue with the use of XPMEM for sharing on-node memory between processes on Perlmutter. Some applications have experienced crashes, instability, or performance that degrades over several hours. To alleviate this problem, NERSC temporarily changed the MPI shared memory mode from the default of XPMEM to CMA (cross-memory attach) during last week's maintenance. The change will remain in place while the issue is investigated and resolved.

This change may affect the performance of a variety of applications running at NERSC.

Users may choose to switch back to XPMEM by setting the environment variable MPICH_SMP_SINGLE_COPY_MODE=XPMEM. Do so at your own risk: this may lead to unstable behavior and cause jobs to crash. You can find more details about this setting in the intro-mpi man page on Perlmutter login nodes: type man mpi and search for XPMEM.
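
As a minimal sketch of the opt-in described above, the variable can be exported in the shell or batch script that launches the job, so that it applies only to that job. The variable name and value come from this announcement; the commented-out launch line and the executable name my_app are hypothetical placeholders, not a NERSC-recommended recipe.

```shell
# Opt back in to XPMEM for this shell or batch script only (at your own risk).
export MPICH_SMP_SINGLE_COPY_MODE=XPMEM

# Confirm the value that processes launched from this shell will inherit.
echo "MPICH_SMP_SINGLE_COPY_MODE=${MPICH_SMP_SINGLE_COPY_MODE}"

# Launch the application afterwards, for example (hypothetical executable name):
# srun ./my_app
```

Leaving the variable unset, or running unset MPICH_SMP_SINGLE_COPY_MODE, returns the job to the current system setting of CMA.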