Programming Tuning Options
Using Huge Pages
Huge pages are virtual memory pages that are larger than the default base page size of 4 KB. Some applications may perform better when large memory pages are used: huge pages can improve memory performance for common access patterns on large data sets. Huge pages also increase the maximum size of data and text in a program that is accessible by the high speed network. The MPI implementation also uses huge pages for its internal buffers.
When to Use Huge Pages
- For MPI applications, map the static data and/or heap onto huge pages.
- For SHMEM applications, map the static data and private heap onto huge pages.
- For applications written in Unified Parallel C, Co-array Fortran, and other languages based on the PGAS programming model, map the static data and private heap onto huge pages.
- For applications that use shared memory which must be concurrently registered with the high speed network drivers for remote communication.
- For an application doing heavy I/O.
- To improve memory performance for common access patterns on large data sets.
- To run ISV applications with large static data that is referenced over the high speed network.
To use huge pages, load one of the six available craype-hugepages modules listed below at the compile and link stage to set the default huge page size to 128K, 512K, 2M, 8M, 16M, or 64M. No special flags are required at compile or link time; the loaded module sets the proper library paths.
% module avail craype-hugepages
-------------- /opt/cray/xt-asyncpe/default/modulefiles -----------------------
craype-hugepages128K craype-hugepages2M craype-hugepages64M
craype-hugepages16M craype-hugepages512K craype-hugepages8M
% module load craype-hugepages2M
% cc -o mycode mycode.c
Loading any of the above craype-hugepages modules also sets the following environment variables:
setenv HUGETLB_MORECORE yes
setenv HUGETLB_ELFMAP W
setenv HUGETLB_FORCE_ELFMAP yes+
To run with huge pages, load a craype-hugepages module in your batch script:
module load craype-hugepages2M
To verify that huge pages are being used by your running code, set the HUGETLB_VERBOSE environment variable:
setenv HUGETLB_VERBOSE 3
There is no guaranteed amount of huge page memory available to an application. Memory allocated as huge pages is unavailable for I/O, whether the application uses the memory or not. Less available memory for I/O buffers may result in performance degradation.
Each Hopper node has 32 GB of total memory. After a system reboot, about 30 GB is available for huge pages. However, because of bugs that cause memory leaks and fragmentation, the memory available for huge pages decreases the longer the system stays up.
For more information, please refer to the intro_hugepages man page.
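Because the amount of huge page memory is not guaranteed, it can help to check what a node reports before relying on it. The following is a minimal sketch, assuming a Linux node that exposes the standard HugePages_Free and Hugepagesize fields in /proc/meminfo; how Hopper's compute nodes report multiple huge page sizes is not covered here. The function names meminfo_value and read_meminfo_value are hypothetical helpers, not part of any Cray library.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return the number that follows "key:" at the start of a line in a
 * /proc/meminfo-style text, or -1 if no line starts with that key. */
long meminfo_value(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line != NULL && *line != '\0') {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':')
            return strtol(line + klen + 1, NULL, 10);
        line = strchr(line, '\n');
        if (line != NULL)
            line++;                 /* step past the newline */
    }
    return -1;
}

/* Read /proc/meminfo (Linux-specific) and look up one field.
 * Assumes the file fits in 8 KB; anything beyond that is ignored. */
long read_meminfo_value(const char *key)
{
    char buf[8192];
    FILE *fp = fopen("/proc/meminfo", "r");
    size_t n;

    if (fp == NULL)
        return -1;
    n = fread(buf, 1, sizeof buf - 1, fp);
    fclose(fp);
    buf[n] = '\0';
    return meminfo_value(buf, key);
}
```

Calling read_meminfo_value("HugePages_Free") at program start and multiplying by read_meminfo_value("Hugepagesize") (reported in kB) gives a rough figure for the huge page memory still free on the node at that moment.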
Memory Touch and Allocation
On Hopper compute nodes, memory is not actually allocated until it is "touched" (for example, by assigning initial values). It is good practice to always "first touch" memory right after it is allocated. An application that over-allocates memory will then fail immediately rather than much later, in an unpredictable state. Also, for OpenMP or other threaded codes, if each thread initially touches its own memory, that memory can be allocated on the thread's local NUMA node.
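The first-touch practice described above can be sketched in C as follows. This is an illustrative helper under stated assumptions, not code from the Cray documentation: the name alloc_and_touch and its fill-value parameter are hypothetical, and the NUMA benefit assumes that later compute loops use the same static OpenMP schedule as the initialization loop.

```c
#include <stdlib.h>

/* Allocate n doubles and write to every element immediately, so the
 * pages are physically allocated now rather than at first use later.
 * Returns NULL if malloc itself fails. */
double *alloc_and_touch(size_t n, double init)
{
    double *a = malloc(n * sizeof *a);
    long i;

    if (a == NULL)
        return NULL;

    /* With OpenMP enabled, the static schedule makes each thread touch
     * the chunk it is expected to work on later, so those pages can be
     * placed on that thread's local NUMA node. Without OpenMP the
     * pragma is ignored and the loop runs serially. */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < (long)n; i++)
        a[i] = init;

    return a;
}
```

Passing a fill value rather than hard-coding zero is a deliberate choice: some compilers can merge a malloc followed by a zero fill into a single calloc, which may leave the pages untouched.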
I/O Buffering Library
Buffering I/O on the compute node before writing data to or reading data from disk is a technique that can potentially improve I/O performance for applications. Cray provides a software module which, when loaded, will link your application to the I/O buffering library.
module load iobuf
Compile and link your application as usual. Next, you will need to set some environment variables in your batch script to control the amount of I/O buffering. By default, buffering is not enabled even when your application is linked to the iobuf library. All of the options are described in the iobuf man page; only a few are discussed here.
The I/O buffering library is controlled through the environment variable IOBUF_PARAMS, which takes a list of specifications. Each specification must begin with the file name (or pattern) to which the buffering applies. The simplest case is to use a wildcard to specify all files (except stdout, stderr, and a few other special cases):
setenv IOBUF_PARAMS '*'
The table below describes some additional parameters.
|Description|Parameter|Default|Notes|
|---|---|---|---|
|Sets the number of buffers|count=n|n=4|The larger the number of buffers, the less memory will be available for the application on the compute nodes.|
|Sets the size of the buffers|size=n|n=1M|The larger the buffer size, the less memory will be available for the application on the compute nodes. Powers of 2 will have better performance (1M, 2M, etc.).|
|Controls the amount of output describing I/O performance|verbose|not on|Prints a summary of I/O activity when the file is closed (one summary per file).|
Specifying additional parameters would look like this:
setenv IOBUF_PARAMS '*:verbose:count=24'
setenv IOBUF_PARAMS '*.in:count=4:size=32M:verbose,*.out:count=8:size=64M:verbose'
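Since both count and size reduce the memory left for the application, it can be worth estimating the cost of a specification before using it. The sketch below formats one specification in the style of the examples above and estimates the buffer memory it implies. It assumes that each open file matching the pattern gets its own set of count buffers and that "M" means 2^20 bytes; neither assumption is confirmed by the text here, so check the iobuf man page. The function names are hypothetical.

```c
#include <stdio.h>

/* Format one IOBUF_PARAMS specification, for example
 * "*.out:count=8:size=64M:verbose". Returns the length snprintf
 * reports (the number of characters the full string needs). */
int iobuf_spec(char *out, size_t outlen, const char *pattern,
               int count, int size_mb, int verbose)
{
    return snprintf(out, outlen, "%s:count=%d:size=%dM%s",
                    pattern, count, size_mb, verbose ? ":verbose" : "");
}

/* Estimated buffer memory in bytes for ONE open file matching the
 * specification: count buffers of size_mb megabytes each
 * (assumes 1M = 1024 * 1024 bytes). */
unsigned long long iobuf_bytes_per_file(int count, int size_mb)
{
    return (unsigned long long)count * (unsigned long long)size_mb
           * 1024ULL * 1024ULL;
}
```

Under these assumptions, the '*.out:count=8:size=64M' specification above would account for 8 x 64 MB = 512 MB of buffers per matching open file in each process.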
GNU Malloc Environment Variables
The GNU malloc implementation used on CNL compute nodes reads several runtime tunable environment variables that control how the system memory allocation routine "malloc" behaves (note the trailing underscores in the variable names).
The two variables that have been found most useful are MALLOC_MMAP_MAX_ and MALLOC_TRIM_THRESHOLD_ . The recommended settings for these two variables are:
# to eliminate mmap use by malloc
setenv MALLOC_MMAP_MAX_ 0
# to trim the heap only when this amount in total is freed (or choose another appropriate size)
setenv MALLOC_TRIM_THRESHOLD_ 536870912
Setting MALLOC_MMAP_MAX_ limits the number of 'internal' mmap regions. A setting of 0, instead of the default value of 64, means that the program will not use any non-heap mapping regions. This eliminates the system calls to mmap/munmap.
MALLOC_TRIM_THRESHOLD_ is the amount of free space that must exist at the top of the heap after a free() before malloc will return the memory to the OS. Raising MALLOC_TRIM_THRESHOLD_ helps performance by reducing the number of calls to sbrk/brk, and therefore the system time overhead. The default setting of 128 KBytes is much too low for a node with 4 GBytes of memory and one application. We suggest setting it to 0.5 GBytes.
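The same two settings can also be applied from inside a program with the glibc mallopt() call, which may be convenient when the batch environment cannot be changed. This is a minimal sketch that assumes a glibc-based system providing <malloc.h> with M_MMAP_MAX and M_TRIM_THRESHOLD; whether it is preferable to the environment variables on Hopper is not stated in the source. The wrapper name tune_malloc is hypothetical.

```c
#include <malloc.h>   /* glibc-specific: mallopt, M_MMAP_MAX, M_TRIM_THRESHOLD */

/* Apply the two recommended malloc settings programmatically.
 * Returns 1 if both mallopt calls succeed, 0 otherwise. */
int tune_malloc(void)
{
    int ok = 1;

    /* Counterpart of MALLOC_MMAP_MAX_ 0: do not let malloc use mmap. */
    ok &= mallopt(M_MMAP_MAX, 0);

    /* Counterpart of MALLOC_TRIM_THRESHOLD_ 536870912 (0.5 GBytes):
     * only trim the heap once this much free space is at its top. */
    ok &= mallopt(M_TRIM_THRESHOLD, 536870912);

    return ok;
}
```

Calling tune_malloc() at the very start of main, before any large allocation, keeps the settings consistent for the whole run.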
The behavior of GNU malloc can also be altered via compiler options, for example PGI's -Msmartalloc, which has an effect similar to the run time environment variable settings above and can sometimes improve code performance.
- XE6 Porting & Tuning Tips. Jeff Larkin, Cray. XE6 Workshop at NERSC. Feb 2011.