Programming Tuning Options
Using Huge Pages
Some applications may perform better when large memory pages are used. Huge pages can improve memory performance for common access patterns on large data sets. Huge pages also increase the maximum size of data and text in a program accessible by the high speed network.
When to Use Huge Pages
- For MPI applications, map the static data and/or heap onto huge pages.
- For SHMEM applications, map the static data and private heap onto huge pages.
- For applications written in Unified Parallel C, Co-array Fortran, and other languages based on the PGAS programming model, map the static data and private heap onto huge pages.
- For applications that use shared memory which must be concurrently registered with the high speed network drivers for remote communication.
- For an application doing heavy I/O.
- To improve memory performance for common access patterns on large data sets.
- To run ISV applications with large static data that is referenced over the high speed network (HSN).
The default page size is 4 KB. To use huge pages, link your application with the following library:
-lhugetlbfs
You will also need to set the following environment variables in your run script.
To map the read-write sections onto huge pages, in csh:
setenv HUGETLB_ELFMAP W
or in bash:
export HUGETLB_ELFMAP=W
To map the heap onto huge pages, in csh:
setenv HUGETLB_MORECORE yes
or in bash:
export HUGETLB_MORECORE=yes
For SHMEM applications, also set the environment variable XT_SYMMETRIC_HEAP_SIZE, the size of the symmetric heap per PE. Its default value is 0 (see "man shmem"). The recommended maximum value for this variable is 30 GB on Hopper II.
Then pass an additional flag to the aprun line in your batch script.
This example requests 500 MB of huge pages, takes as many as are available, and uses 4 KB pages for the remainder:
aprun -m500h -n ...
This example requests 500 MB of huge pages and terminates the job launch if they are not available. (The only difference is the additional s suffix on the -m option.)
aprun -m500hs -n ...
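The settings above can be combined in the huge page portion of a batch script. The following is a minimal bash sketch, not a definitive recipe: the names HUGE_MB, STRICT, NPES, MEM_OPT and ./my_app are hypothetical placeholders, and the application is assumed to have been linked with -lhugetlbfs already. The launch line is only printed so that it can be checked before use.

```shell
# Sketch of the huge page portion of a batch script (bash syntax).
# HUGE_MB, STRICT, NPES, MEM_OPT and ./my_app are hypothetical placeholders.

export HUGETLB_ELFMAP=W       # map read-write sections onto huge pages
export HUGETLB_MORECORE=yes   # map the heap onto huge pages

HUGE_MB=500    # huge page memory to request, in MB
STRICT=yes     # yes: terminate the launch if huge pages are unavailable
NPES=48        # hypothetical number of PEs

# Build the -m option: "h" requests huge pages, "s" makes the request strict.
MEM_OPT="-m${HUGE_MB}h"
if [ "$STRICT" = "yes" ]; then
    MEM_OPT="${MEM_OPT}s"
fi

# Print the launch line for inspection; remove "echo" to actually launch.
echo aprun "$MEM_OPT" -n "$NPES" ./my_app
```

With the values shown, the script prints "aprun -m500hs -n 48 ./my_app". Setting STRICT to anything other than yes gives the non-strict form, -m500h.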
Huge Memory Request Consideration
The memory available for huge pages is less than the total memory on the node. You must leave enough memory for CNL and I/O buffers. Also, because of memory fragmentation, less memory is available for huge pages after a node has run other jobs.
There is no guaranteed amount of huge page memory available to an application. Memory allocated as huge pages is unavailable for I/O, whether the application uses the memory or not. Less available memory for I/O buffers may result in performance degradation.
Each Hopper2 node has 32 GB of total memory; the recommended amount of memory available for huge pages is 30 GB.
For more information, please refer to the intro_hugepages man page.
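As a worked example of the sizing above, the following bash sketch turns the recommended per-node budget into a value for the aprun -m option. It rests on two assumptions that are not stated in this section and should be checked against the aprun man page and your job layout: that -m takes a size in MB per PE, and that 24 PEs run on each node. The variable names are hypothetical.

```shell
# Sketch (bash): derive a per-PE huge page request from the per-node budget.
# Assumptions: aprun -m takes a per-PE size in MB, and 24 PEs run per node.
HUGE_GB_PER_NODE=30    # recommended huge page budget per node, in GB
PES_PER_NODE=24        # assumed number of PEs placed on each node

# Convert GB to MB, then divide evenly among the PEs on the node.
PER_PE_MB=$(( HUGE_GB_PER_NODE * 1024 / PES_PER_NODE ))

echo "-m${PER_PE_MB}h"
```

Under these assumptions the result is 1280 MB per PE, so the script prints "-m1280h". Running fewer PEs per node raises the per-PE share accordingly.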
Memory Touch and Allocation
On Franklin or Hopper compute nodes, memory is not actually allocated until it is "touched" (for example, by assigning initial values). It is good practice to "first touch" memory as soon as it is allocated. An application that over-allocates memory will then fail immediately instead of much later, in a random state. Also, for OpenMP or other threaded codes, if each thread initially touches its own memory, that memory can be allocated on the thread's local NUMA node.
I/O Buffering Library
Buffering I/O on the compute node before writing data to or reading data from disk is a technique that can potentially improve I/O performance for applications. Cray provides a software module which, when loaded, links your application to the I/O buffering library.
module load iobuf
Compile and link your application as usual. Next you will need to set some environment variables in your batch script to control the amount of I/O buffering. By default, even when your application is linked to the iobuf library, buffering is not enabled. You can read all the options on the iobuf man page. Only a few options are discussed here.
There are a number of parameters you can set with the I/O buffering library through the environment variable IOBUF_PARAMS, which takes a comma-separated list of specifications. Each specification starts with the file name (or pattern) to which the buffering applies, followed by colon-separated parameters. The simplest case is to use a wildcard to specify all files (except stdout, stderr and a few other special cases).
with csh:
setenv IOBUF_PARAMS '*'
with bash:
export IOBUF_PARAMS='*'
The table below describes some additional parameters.
| Description | Parameter | Default | Notes |
| Sets the number of buffers | count=n | n=4 | The more buffers are used, the less memory will be available for the application on the compute nodes. |
| Sets the size of the buffers | size=n | n=1M | The larger the buffer size, the less memory will be available for the application on the compute nodes. Powers of 2 will have better performance (1M, 2M, etc.). |
| Controls the amount of output describing I/O performance | verbose | off | Prints a summary of I/O activity when the file is closed (one summary per file). |
Specifying additional parameters looks like this (csh syntax):
setenv IOBUF_PARAMS '*:verbose:count=24'
or
setenv IOBUF_PARAMS '*.in:count=4:size=32M:verbose,*.out:count=8:size=64M:verbose'
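For bash users, the second example can be written as the sketch below, which also estimates how much memory the buffers take. The estimate assumes that each open file gets its own set of buffers, so that a file uses roughly count times size; this is an assumption to confirm on the iobuf man page. All variable names other than IOBUF_PARAMS are hypothetical.

```shell
# Sketch (bash): build IOBUF_PARAMS from one specification per file pattern.
# All names except IOBUF_PARAMS are hypothetical.
IN_COUNT=4;  IN_SIZE_MB=32     # buffers for files matching *.in
OUT_COUNT=8; OUT_SIZE_MB=64    # buffers for files matching *.out

# Parameters within a specification are joined with ":".
IN_SPEC="*.in:count=${IN_COUNT}:size=${IN_SIZE_MB}M:verbose"
OUT_SPEC="*.out:count=${OUT_COUNT}:size=${OUT_SIZE_MB}M:verbose"

# Specifications are joined with ","; the quotes keep "*" from being expanded.
export IOBUF_PARAMS="${IN_SPEC},${OUT_SPEC}"

# Rough buffer memory per open file (assumed to be count * size).
IN_MB=$(( IN_COUNT * IN_SIZE_MB ))
OUT_MB=$(( OUT_COUNT * OUT_SIZE_MB ))

echo "IOBUF_PARAMS=${IOBUF_PARAMS}"
echo "approx. buffer memory per *.in file:  ${IN_MB} MB"
echo "approx. buffer memory per *.out file: ${OUT_MB} MB"
```

Under that assumption, each *.in file uses about 128 MB and each *.out file about 512 MB of compute node memory, which is worth weighing against the memory the application itself needs.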
GNU Malloc
The behavior of GNU malloc can be altered via compiler options, for example PGI's -Msmartalloc, or via run time environment variables. The following settings have an effect similar to -Msmartalloc and can sometimes improve code performance:
# eliminate mmap use by malloc
setenv MALLOC_MMAP_MAX_ 0
# trim the heap only when this amount in total is freed (536870912 bytes = 512 MB; choose an appropriate size)
setenv MALLOC_TRIM_THRESHOLD_ 536870912
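The same settings in bash are sketched below, with the trim threshold computed from a size in MB rather than typed as a raw byte count. TRIM_MB is a hypothetical helper variable; the two MALLOC_ variables are the ones named above.

```shell
# Sketch (bash): bash equivalents of the csh settings for GNU malloc.
# TRIM_MB is a hypothetical helper variable.
TRIM_MB=512    # trim threshold in MB; choose a size appropriate for your code

# Eliminate mmap use by malloc.
export MALLOC_MMAP_MAX_=0

# The threshold is given in bytes, so convert from MB.
export MALLOC_TRIM_THRESHOLD_=$(( TRIM_MB * 1024 * 1024 ))

echo "MALLOC_TRIM_THRESHOLD_=${MALLOC_TRIM_THRESHOLD_}"
```

With TRIM_MB=512 this reproduces the value 536870912 used in the csh example.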
Additional References
- XE6 Porting & Tuning Tips. Jeff Larkin, Cray. XE6 Workshop at NERSC. Feb 2011.


