Running Large Scale Jobs
Users face various challenges when running and scaling large jobs on petascale production systems: an application may not have enough memory per core, the default environment variables may need adjustment, or I/O may dominate the run time. This page lists programming and run-time tuning options and tips that users can try to improve the performance of large-scale applications on Hopper.
Try different compilers and compiler options
The available compilers on Hopper are PGI, Cray, Intel, GNU, and Pathscale. Try various compilers and compiler flags to find which work best for your application. Please refer to Compiling Codes for usage and recommended compiler flags.
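As a sketch, compilers on a Cray system such as Hopper are selected by swapping the PrgEnv module and then compiling through the Cray wrappers (ftn, cc, CC); the source file names and flags below are illustrative:

```shell
# Switch from the default PGI environment to GNU (module names follow
# the usual Cray PrgEnv convention).
module swap PrgEnv-pgi PrgEnv-gnu

# The Cray wrappers invoke whichever compiler the PrgEnv selects.
ftn -O3 -o mycode.x mycode.f90   # Fortran
cc  -O3 -o mycode.x mycode.c     # C
```

Because the wrappers hide the underlying compiler, rebuilding with a different PrgEnv module is usually enough to test another compiler without changing the build commands.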
Change default MPI rank ordering
This is a simple yet sometimes effective run-time tuning option that requires no source code modification, recompilation, or relinking. The default MPI rank placement on the compute nodes is SMP-style; other choices are round-robin, folded-rank, and custom ordering. Please refer to MPI Task Distribution Across Nodes for more detailed information.
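With Cray MPI, the placement is selected through the MPICH_RANK_REORDER_METHOD environment variable; a minimal sketch (the executable name is hypothetical):

```shell
# 0 = round-robin, 1 = SMP-style (default), 2 = folded rank,
# 3 = custom order read from an MPICH_RANK_ORDER file.
export MPICH_RANK_REORDER_METHOD=2   # try folded-rank placement
aprun -n 96 ./mycode.x
```

Because no rebuild is needed, it is cheap to rerun a representative job with each placement and keep whichever gives the best communication performance.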
Adjust other MPI run time environment variables
Please see a list of suggested Cray MPI Environment Variables to improve performance or to enable an application to run.
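As an illustration, two commonly adjusted Cray MPI settings are shown below; the variable names come from Cray's MPT documentation (see "man intro_mpi" on the system for the authoritative list), and the buffer size is only an example value:

```shell
# Print the effective MPI environment settings at job startup,
# which helps confirm that tuning variables took effect.
export MPICH_ENV_DISPLAY=1

# Enlarge the buffer for unexpected messages (value in bytes);
# some codes at large scale fail without a larger buffer.
export MPICH_UNEX_BUFFER_SIZE=60000000

aprun -n 96 ./mycode.x
```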
Use fewer cores per node
Using fewer cores per node is helpful when each process needs more memory than the default allows, and it leaves fewer processes sharing the memory and interconnect bandwidth. This is another simple yet effective run-time tuning option. Make sure to choose the right -N and -S options with aprun to spread processes across the NUMA nodes. For example, "-N 12 -S 3" uses 12 cores per node with 3 cores per NUMA node, so all 4 NUMA nodes are used. Without the "-S" option, all 12 cores would be allocated on the first 2 NUMA nodes. Please refer to Example Batch Scripts and Using aprun for more information.
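A minimal batch-script sketch of the "-N 12 -S 3" example above (the job size and executable name are hypothetical; Hopper nodes have 24 cores in 4 NUMA nodes of 6 cores each):

```shell
#PBS -l mppwidth=48   # 2 whole 24-core nodes for 24 MPI tasks

# 24 tasks total, 12 per node, 3 per NUMA node: each task gets
# roughly twice the default memory, and all 4 NUMA nodes are used.
aprun -n 24 -N 12 -S 3 ./mycode.x
```

Note that you are charged for whole nodes, so mppwidth still requests all the cores on the allocated nodes even though only half of them run MPI tasks.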
Use hybrid MPI/OpenMP
Hybrid MPI/OpenMP is encouraged on Hopper since it also reduces the memory footprint. NERSC suggests using no more than 6 OpenMP threads per MPI task, so that each task fits within a single NUMA node. The aprun process affinity (-S) and memory affinity (-ss) options are essential to ensure that MPI tasks are spread across the NUMA nodes. Please refer to Using OpenMP Effectively for more information. Also consider overlapping communication with computation in hybrid MPI/OpenMP codes.
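A sketch of a hybrid launch that follows these guidelines, assuming a hypothetical 2-node job with 4 MPI tasks per node and 6 OpenMP threads per task:

```shell
export OMP_NUM_THREADS=6   # one thread team fills one 6-core NUMA node

# -N 4  : 4 MPI tasks per node (one per NUMA node)
# -S 1  : 1 task per NUMA node, spreading tasks across all 4 NUMA nodes
# -d 6  : reserve 6 cores per task for its OpenMP threads
# -ss   : restrict each task's memory to its local NUMA node
aprun -n 8 -N 4 -S 1 -d 6 -ss ./mycode.x
```

Keeping each thread team within one NUMA node avoids remote-memory accesses, which is why the -S and -ss options matter as much as the thread count.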
Use huge pages
Some applications may perform better when large memory pages are used. The default page size is 4 KB; the default huge page size is 2 MB. Please refer to Using Huge Pages for more usage information.
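A minimal sketch, assuming the craype-hugepages modules are available on the system (see Using Huge Pages for the exact module names and options):

```shell
module load craype-hugepages2M   # select 2 MB pages
cc -O3 -o mycode.x mycode.c      # relink so the huge-page support is used
aprun -n 24 ./mycode.x           # keep the module loaded at run time
```

The module must be loaded both when linking and when running, since it affects the libraries linked into the executable as well as the run-time environment.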
Adjust GNU malloc run time environment variables
Two GNU malloc environment variables can be adjusted at run time under the CNL kernel to improve the performance of system memory allocation. More information is here.
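As an illustration, the two glibc malloc variables typically recommended for this are shown below; the specific values are examples only (see the linked page for the recommended settings):

```shell
# Never fall back to mmap for large allocations; reuse the heap instead.
export MALLOC_MMAP_MAX_=0

# Keep freed memory in the heap rather than returning it to the OS,
# so repeated large allocations avoid page faults.
export MALLOC_TRIM_THRESHOLD_=536870912

aprun -n 24 ./mycode.x
```

These settings help codes that repeatedly allocate and free large buffers, at the cost of holding on to freed memory for the life of the process.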
Use IOBUF library
Buffering I/O on the compute nodes before writing data to or reading data from disk is a technique that can improve I/O performance for some applications. Please refer to IO Buffering Library for more usage information.
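A sketch of using Cray's IOBUF library (the IOBUF_PARAMS syntax follows the iobuf man page; the executable name is hypothetical):

```shell
module load iobuf
cc -O3 -o mycode.x mycode.c       # relink so I/O calls go through IOBUF

# Buffer all files and print per-file statistics at exit.
export IOBUF_PARAMS='*:verbose'
aprun -n 24 ./mycode.x
```

The verbose statistics show how much I/O was absorbed by the buffers, which makes it easy to judge whether IOBUF is actually helping before tuning buffer sizes further.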
Parallel I/O tuning
Please refer to Optimizing I/O Performance on the Lustre File System and Introduction to Scientific IO for more details on parallel I/O performance tuning.
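One common Lustre tuning step is adjusting file striping before writing; a sketch using the lfs command (the directory name and stripe values are illustrative):

```shell
# Stripe new files in this directory across 48 OSTs with a 1 MB
# stripe size; a file shared by many writers benefits from more OSTs.
lfs setstripe -c 48 -s 1m my_output_dir

# Verify the striping settings.
lfs getstripe my_output_dir
```

Striping set on a directory applies to files created in it afterward, so set it before the job writes its output.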