NERSC: Powering Scientific Discovery Since 1974

If your codes hang or run much slower after the maintenance on 6/25/14

June 27, 2014

During the maintenance on 6/25/14, Edison was upgraded to Cray Linux Environment (CLE) 5.2UP01 and Cray Message Passing Toolkit (MPT) 7.0.0, which requires all MPI codes to be recompiled. After recompilation, some users may see their codes hang or run much slower than before. We believe this is due to a regression introduced in the new CLE version, and we have filed a bug with Cray. If your codes hang or run much slower than before, please use one of the two workarounds below until Cray provides a permanent fix.

  • Set the environment variable MPICH_GNI_MDD_SHARING, or UGNI_CDM_MDD_DEDICATED (a low-level UGNI environment variable), in your job script before invoking the aprun command:
setenv MPICH_GNI_MDD_SHARING disabled   #for csh/tcsh users
export MPICH_GNI_MDD_SHARING=disabled   #for bash shell users

The environment variable MPICH_GNI_MDD_SHARING controls whether MPI uses dedicated or shared Memory Domain Descriptors (MDDs). If set to enabled, shared MDDs are used; if set to disabled, dedicated MDDs are used. Shared MDDs make better use of system resources. The default is enabled.

More details about this environment variable can be found in the intro_mpi man page (type man intro_mpi).

Note that the environment variable above affects only MPI codes. To use dedicated MDDs for all programming models (including MPI, DMAPP, SHMEM, and CAF), use the following environment variable instead:

setenv UGNI_CDM_MDD_DEDICATED 2  #for csh/tcsh users
export UGNI_CDM_MDD_DEDICATED=2 #for bash shell users
  • Or use hugepages. To do so, load one of the available hugepages modules, then compile and run your code with that module loaded. For example:
module load craype-hugepages2M
cc my_app.c

Then load the hugepages module in your job script. A sample job script follows:

#!/bin/bash -l 
#PBS -N myjob
#PBS -l mppwidth=240
#PBS -l walltime=00:30:00
#PBS -j oe

module load craype-hugepages2M
aprun -n 240 ./a.out
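
Because the hugepages module has to be loaded at run time as well as at compile time, a job script can check for it before calling aprun. The following is only a sketch of one way to do that. It assumes the Environment Modules convention of recording loaded modules in the colon-separated LOADEDMODULES variable, and the helper name hugepages_module_loaded is hypothetical (it is not a Cray or NERSC tool).

```shell
#!/bin/bash
# Hypothetical helper: succeeds if a craype-hugepages module appears in the
# colon-separated module list given as $1. With no argument it falls back to
# $LOADEDMODULES, which Environment Modules is assumed to maintain.
hugepages_module_loaded() {
    local modules="${1-${LOADEDMODULES:-}}"
    case ":${modules}:" in
        *:craype-hugepages*) return 0 ;;
        *) return 1 ;;
    esac
}

if hugepages_module_loaded; then
    echo "hugepages module is loaded"
else
    echo "hugepages module is not loaded; run 'module load craype-hugepages2M' first" >&2
fi
```

In a job script, this check would sit between the module load line and the aprun line.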

More information about hugepages can be found in the intro_hugepages man page (type man intro_hugepages). We also have a web page on how to use Large page memory.
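
For the first workaround, the variable has to be set in the job script before the aprun line. A minimal sketch of such a script is shown here, modeled on the hugepages sample job script; the job name, core count, and executable are the same placeholder values. The guard around aprun is an illustrative addition, not part of the original sample, so that the script only attempts a launch where aprun exists.

```shell
#!/bin/bash -l
#PBS -N myjob
#PBS -l mppwidth=240
#PBS -l walltime=00:30:00
#PBS -j oe

# Workaround for MPI codes: use dedicated rather than shared MDDs.
export MPICH_GNI_MDD_SHARING=disabled

# To cover all programming models (MPI, DMAPP, SHMEM, CAF), set the
# lower-level UGNI variable instead:
# export UGNI_CDM_MDD_DEDICATED=2

# Illustrative guard (an assumption, not in the original sample):
# launch only where aprun is available.
if command -v aprun >/dev/null 2>&1; then
    aprun -n 240 ./a.out
else
    echo "aprun not found; MPICH_GNI_MDD_SHARING=${MPICH_GNI_MDD_SHARING}"
fi
```

csh/tcsh users would use the setenv form of the same setting.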