Some users have noted performance variability in the execution of their applications. There are many potential sources of variability on an HPC system and NERSC has identified the following best practices to mitigate variability and improve application performance.
This document is intended for users who are familiar with NERSC systems such as Edison. For new NERSC users.
Use of hugepages can reduce the cost of accessing memory, especially in the case of many `MPI_Alltoall` operations.
- load the hugepages module (`module load craype-hugepages2M`)
- recompile your code
- add `module load craype-hugepages2M` to your batch scripts
Note: you may consider adding `module load craype-hugepages2M` to your `~/.bashrc.ext` file.
For more details see the manual pages (`man intro_hugepages`) on Cori or Edison.
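A batch script can guard against running without the module. The following is a minimal sketch, assuming the module system exports the colon-separated `LOADEDMODULES` variable (as Environment Modules does); the helper name `hugepages_loaded` is hypothetical and not part of any NERSC tool.

```shell
# Hypothetical helper: succeeds if a craype-hugepages module appears in
# LOADEDMODULES (assumed to be the colon-separated list of loaded modules).
hugepages_loaded() {
    case ":${LOADEDMODULES:-}:" in
        *:craype-hugepages*) return 0 ;;
        *) return 1 ;;
    esac
}

if hugepages_loaded; then
    echo "hugepages module loaded"
else
    echo "warning: craype-hugepages2M is not loaded" >&2
fi
```

Placing a check like this near the top of a job script makes a missing `module load` visible in the job's error output instead of showing up only as slower runs.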
Location of executables
Compilation of executables should be done in `$HOME` or `/tmp`. Executables can be copied into the compute node memory at the start of a job with `sbcast` to greatly improve job startup times and reduce run-time variability in some cases:
sbcast -f --compress ./my_program.x /tmp/my_program.x
srun -n 1024 -c 2 --cpu_bind=cores /tmp/my_program.x
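The staging step can be wrapped so that a job still runs when the broadcast fails. This is only a sketch: the function name `stage_executable` is hypothetical, and it assumes that falling back to the original path is acceptable for your job.

```shell
# Hypothetical helper: broadcast an executable to /tmp on the compute nodes
# and print the path that srun should use. If sbcast is missing or fails,
# print the original path instead.
stage_executable() {
    src=$1
    dest=/tmp/$(basename "$src")
    # sbcast messages go to stderr so that stdout carries only the path.
    if command -v sbcast >/dev/null 2>&1 &&
       sbcast -f --compress "$src" "$dest" >&2; then
        printf '%s\n' "$dest"
    else
        printf '%s\n' "$src"
    fi
}
```

Inside a job script this could be used as `exe=$(stage_executable ./my_program.x)` followed by `srun -n 1024 -c 2 --cpu_bind=cores "$exe"`.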
Cori and Edison both use a Cray Aries network with a Dragonfly topology. SLURM has some options to control the placement of parallel jobs in the system. A "compact" placement can isolate a job from other traffic on the network.
SLURM's topology awareness can be enabled by adding `--switches=N` to your job script, where `N` is the maximum number of switches for your job. Each "switch" corresponds to approximately 384 compute nodes. This is most effective for jobs using fewer than ~300 nodes and 2 or fewer switches. Note: requesting fewer switches may increase the time it takes for your job to start.
#SBATCH -N 256
#SBATCH --switches=2
More details about NERSC's interconnect and specifics on how to query topology information from within a job can be found here.
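As a rough guide, the smallest switch count that can hold a job follows from the approximate figure of 384 nodes per switch quoted above. The sketch below prints the matching directives; `min_switches` is a hypothetical helper, and the result is an estimate because the nodes-per-switch figure is itself approximate.

```shell
# Approximate number of compute nodes per "switch", as stated in the text.
NODES_PER_SWITCH=384

# Hypothetical helper: smallest switch count that can hold the given node count
# (integer ceiling of nodes / NODES_PER_SWITCH).
min_switches() {
    echo $(( ($1 + NODES_PER_SWITCH - 1) / NODES_PER_SWITCH ))
}

nodes=256
echo "#SBATCH -N $nodes"
echo "#SBATCH --switches=$(min_switches "$nodes")"
```

Requesting the minimum gives the most compact placement but may also wait longest in the queue. Slurm additionally accepts an optional maximum wait in the form `--switches=<count>@<max-time>`; check `man sbatch` on the system for the accepted time formats.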
KNL Cache mode
The default memory mode for Cori's Xeon Phi nodes is quadrant,cache. In cache mode the 16 GB of on-package MCDRAM is configured as a direct-mapped cache. Direct-mapped caches are prone to cache thrashing, which hurts performance and causes variability because it depends on the state of each specific node. NERSC has implemented zonesort, a kernel module provided by Intel, to mitigate this issue. It is enabled by default in quadrant,cache mode. However, the best and most consistent performance is obtained when the memory used per node is less than 16 GB.
It is also possible to run jobs in flat mode, e.g.
#SBATCH -C knl,quad,flat
See here for details of using various KNL modes.
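Whether a job stays under the 16 GB guideline can be estimated before submission. The sketch below is simple arithmetic under the assumption that you already know the approximate memory used by each rank; `fits_in_mcdram` is a hypothetical helper, and the per-rank figure in the example is made up for illustration.

```shell
MCDRAM_MB=16384   # 16 GB of on-package MCDRAM per KNL node, expressed in MB

# Hypothetical helper: succeeds when the estimated per-node footprint
# (ranks per node x MB per rank) stays below the MCDRAM capacity.
fits_in_mcdram() {
    ranks_per_node=$1
    mb_per_rank=$2
    [ $(( ranks_per_node * mb_per_rank )) -lt "$MCDRAM_MB" ]
}

# Illustrative values only: 64 ranks per node at about 200 MB each.
if fits_in_mcdram 64 200; then
    echo "64 ranks x 200 MB fits in MCDRAM"
fi
```

When the estimate does not fit, reducing the number of ranks per node or spreading the job over more nodes lowers the per-node footprint.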
Running with correct affinity and binding options can greatly affect variability.
- use at least 8 ranks per node (1 rank per node cannot utilize the full network bandwidth)
- see `man intro_mpi` for additional options
- check job script generator to get correct binding
- use `check-hybrid.*.cori` to check affinity settings
user@nid01041:~> srun -n 8 -c 4 --cpu_bind=cores check-mpi.intel.cori|sort -nk 4
Hello from rank 0, on nid07208. (core affinity = 0,68,136,204)
Hello from rank 1, on nid07208. (core affinity = 1,69,137,205)
Hello from rank 2, on nid07208. (core affinity = 2,70,138,206)
Hello from rank 3, on nid07208. (core affinity = 3,71,139,207)
Hello from rank 4, on nid07208. (core affinity = 4,72,140,208)
Hello from rank 5, on nid07208. (core affinity = 5,73,141,209)
Hello from rank 6, on nid07208. (core affinity = 6,74,142,210)
Hello from rank 7, on nid07208. (core affinity = 7,75,143,211)
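Output in this format can be checked automatically for ranks that share a CPU on the same node. The sketch below is one possible check: `check_affinity` is a hypothetical helper, and it assumes every line has the layout shown above with the affinity given as a comma-separated list (not as ranges such as `0-3`).

```shell
# Hypothetical helper: read "Hello from rank ..." lines on stdin and report
# any logical CPU that is assigned to more than one rank on the same node.
check_affinity() {
    awk '
    /core affinity/ {
        rank = $4; sub(/,$/, "", rank)      # "0,"        -> "0"
        node = $6; sub(/\.$/, "", node)     # "nid07208." -> "nid07208"
        cpus = $0
        sub(/.*core affinity = /, "", cpus) # keep only the CPU list
        sub(/\).*/, "", cpus)
        n = split(cpus, list, ",")
        for (i = 1; i <= n; i++) {
            key = node ":" list[i]
            if (key in owner) {
                printf "overlap: CPU %s on %s used by ranks %s and %s\n",
                       list[i], node, owner[key], rank
                bad = 1
            }
            owner[key] = rank
        }
    }
    END { exit (bad ? 1 : 0) }
    '
}

check_affinity <<'EOF' && echo "no overlapping CPUs"
Hello from rank 0, on nid07208. (core affinity = 0,68,136,204)
Hello from rank 1, on nid07208. (core affinity = 1,69,137,205)
EOF
```

In a job, the output of the `srun ... check-mpi.intel.cori` command could be piped straight into `check_affinity`.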
Using core-specialization (`#SBATCH -S n`) moves all OS functions to cores not in use by user applications, where `n` is the number of cores to dedicate to the OS. The example shows 4 cores per node on KNL for the OS and the other 64 for the application.
#SBATCH -S 4
srun -n 128 -c 4 --cpu_bind=cores /tmp/my_program.x
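The `-c` value depends on how many cores remain after core specialization. The sketch below shows the arithmetic, assuming 68 cores per KNL node with 4 hardware threads each (consistent with the affinity output earlier, where CPUs 0, 68, 136 and 204 belong to one core); `cpus_per_task` is a hypothetical helper.

```shell
CORES_PER_NODE=68      # KNL cores per node (assumption stated in the lead-in)
THREADS_PER_CORE=4     # hardware threads per core

# Hypothetical helper: value for "srun -c" given the number of ranks per node
# and the number of cores reserved for the OS with "#SBATCH -S".
cpus_per_task() {
    ranks_per_node=$1
    specialized=$2
    usable=$(( CORES_PER_NODE - specialized ))
    # Whole cores per rank (integer division), times threads per core.
    echo $(( usable / ranks_per_node * THREADS_PER_CORE ))
}

# 64 ranks per node with 4 cores reserved for the OS, as in the example above.
echo "srun -n 128 -c $(cpus_per_task 64 4) --cpu_bind=cores /tmp/my_program.x"
```

Because the division is integer division, a rank count that does not divide the usable cores evenly leaves some cores idle, and a rank count larger than the usable cores yields 0, which signals that ranks would have to share cores.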
Following best practices for I/O can greatly improve application performance and reduce impact on and from shared resources such as the filesystem.
- Best practices for I/O
- The burst buffer is now available to all users and can provide greatly improved I/O performance (depending on the access pattern).
This is an example for Cori KNL including all of the above recommendations:
#!/bin/bash
#SBATCH -N 2
#SBATCH -C knl,quad,cache
#SBATCH -p regular
#SBATCH -t 60
#SBATCH -S 4
#SBATCH --switches=1
module load craype-hugepages2M
sbcast -f --compress ./my_program.x /tmp/my_program.x
srun -n 128 -c 4 --cpu_bind=cores /tmp/my_program.x