Memory Usage Considerations on Hopper
Most Hopper compute nodes have 32 GB of physical memory, but not all of that memory is available to user programs. Compute Node Linux (the kernel), the Lustre file system software, and message passing library buffers all consume memory, as does loading the executable itself, so the exact amount available to an application varies. Approximately 31 GB of memory can be allocated from within an MPI program using all 24 cores per node, i.e., 1.29 GB per MPI task on average. If an application uses 12 MPI tasks per node, then each MPI task can use about 2.58 GB of memory.
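The per-task figures above are simply the usable memory per node divided by the number of MPI tasks placed on that node. A minimal sketch of that arithmetic follows; the function name mem_per_task is hypothetical, and the 31 GB value is the approximate figure quoted above, not a guaranteed limit.

```shell
# Hypothetical helper: average memory per MPI task on one node.
#   $1 = usable memory per node in GB (approximate, from the text above)
#   $2 = number of MPI tasks per node
mem_per_task() {
  awk -v mem="$1" -v tasks="$2" 'BEGIN { printf "%.2f\n", mem / tasks }'
}

mem_per_task 31 24   # prints 1.29
mem_per_task 31 12   # prints 2.58
```

Because the usable memory varies from job to job, treat the result as an upper estimate and leave some headroom.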
You may see an error message such as "OOM killer terminated this process." "OOM" stands for Out of Memory; the message means that your code has exhausted the memory available on the node. There are two relatively simple things you can try to solve the OOM problem:
- Run on the large-memory nodes. 384 Hopper nodes have 64 GB of physical memory. Of these, 369 nodes are available for regular batch jobs. On these large-memory nodes, approximately 63 GB of memory can be used by an MPI program using all 24 cores per node, i.e., 2.62 GB per MPI task on average. Please see the Submitting Batch Jobs page for information on how to request large-memory nodes.
- Use more nodes and fewer cores per node. You can launch fewer than 24 tasks per node to increase the memory available to each MPI task. Note, though, that your account will be charged for all 24 cores on every node. Do this by using the -N option to aprun, e.g., the command aprun -n 96 -N 12 a.out in conjunction with the appropriate #PBS -l mppwidth=192 batch directive will run 96 MPI tasks on eight nodes (12 tasks per node), thereby nearly doubling the memory available to each MPI task.
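The relationship between the aprun options and the mppwidth request in the example above can be sketched as follows. The helper name layout and the variable CORES_PER_NODE are hypothetical, and the sketch assumes, as in the example, that mppwidth must cover all 24 cores on every node used.

```shell
# Hypothetical helper: derive the mppwidth request and aprun line
# for a job that places fewer than 24 MPI tasks on each node.
CORES_PER_NODE=24   # cores per Hopper compute node, per the text above

layout() {
  ntasks=$1      # total number of MPI tasks (aprun -n)
  per_node=$2    # MPI tasks per node (aprun -N)
  # Round up so a partially filled last node is still counted.
  nodes=$(( (ntasks + per_node - 1) / per_node ))
  # All cores on each node are requested (and charged), even if idle.
  mppwidth=$(( nodes * CORES_PER_NODE ))
  echo "#PBS -l mppwidth=$mppwidth"
  echo "aprun -n $ntasks -N $per_node a.out"
}

layout 96 12
# prints:
#   #PBS -l mppwidth=192
#   aprun -n 96 -N 12 a.out
```

The first printed line belongs in the batch script header and the second in its body; the helper only prints text and does not submit or launch anything.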
You can change MPI buffer sizes by setting certain MPICH environment variables. See the man page for intro_mpi for more details.
Jobs that run out of memory can sometimes leave compute nodes in an unhealthy state that affects later jobs landing on those nodes. Please evaluate your code's memory requirements carefully, either through internal checks or with one of the available tools: CrayPat can track heap usage, IPM also tracks memory usage, and Valgrind can check for memory leaks.