NERSC: Powering Scientific Discovery Since 1974

Using High Performance Libraries and Tools

Memkind Library on Edison

The memkind library is a user-extensible heap manager built on top of jemalloc that enables control of memory characteristics and partitioning of the heap between kinds of memory (including user-defined kinds). On the dual-socket Edison compute nodes, where the two sockets are connected by QPI (QuickPath Interconnect) links, this library can be used to estimate the benefit of the high-bandwidth memory that will be available on KNL systems. Memory on the far socket, reached across QPI, stands in for the slow memory (DDR), while memory on the near socket stands in for the high-bandwidth memory (MCDRAM or HBM). This is not an accurate model of the bandwidth and latency characteristics of the KNL on-package memory, but it is a reasonable way to determine which data structures rely critically on bandwidth.

This is how you can use this library tool on Edison:

1) Add the compiler directive !DIR$ ATTRIBUTES FASTMEM to the Fortran code

To use this library, mark the arrays that you want to allocate from the high-bandwidth memory with the compiler directive !DIR$ ATTRIBUTES FASTMEM. For example, if your code has three arrays, a, b, and c,

real, allocatable :: a(:,:), b(:,:), c(:)

and you want to allocate them from the high bandwidth memory, then you need to add the following compiler directive to your code

!DIR$ ATTRIBUTES FASTMEM :: a, b, c

2) Link the codes to the memkind and jemalloc libraries

module load memkind 
ftn -dynamic -g -O3 -openmp mycode.f90

Note that the compiler wrappers add -lmemkind and -ljemalloc to the link line automatically.

3) Run the code with numactl and the MEMKIND_HBW_NODES environment variable

#!/bin/bash -l
#SBATCH -J test_HBM
#SBATCH -p debug
#SBATCH -N 1


module load memkind
export MEMKIND_HBW_NODES=0

srun -n 1 numactl --membind=1 --cpunodebind=0 ./a.out

where setting the MEMKIND_HBW_NODES environment variable to zero binds high-bandwidth allocations to NUMA node 0. The --membind=1 flag to numactl binds standard allocations and stack variables to NUMA node 1. The --cpunodebind=0 option to numactl binds the process to the CPUs associated with NUMA node 0. With this configuration, standard allocations are fetched across the QPI bus (mimicking DDR), and high-bandwidth allocations are local to the process CPUs.
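The relationship between the three settings can be summarized in a small helper. This is only a sketch: emulation_cmd is a hypothetical function (not a NERSC or memkind tool), and it assumes a node with exactly two NUMA nodes numbered 0 and 1, as on the dual-socket Edison nodes. It prints the launch line rather than running it.

```shell
#!/bin/bash
# Sketch: derive the numactl bindings for the two-socket emulation described above.
# Assumption: exactly two NUMA nodes, numbered 0 and 1. emulation_cmd is hypothetical.
emulation_cmd() {
    local hbw_node="$1"   # NUMA node that plays the role of high-bandwidth memory
    local exe="$2"        # application binary
    case "$hbw_node" in
        0|1) ;;
        *) echo "hbw_node must be 0 or 1" >&2; return 1 ;;
    esac
    # The other socket plays the role of DDR: standard allocations go there,
    # while the process itself runs on the socket that owns the "fast" memory.
    local ddr_node=$(( 1 - hbw_node ))
    echo "MEMKIND_HBW_NODES=${hbw_node} srun -n 1 numactl --membind=${ddr_node} --cpunodebind=${hbw_node} ${exe}"
}

# Print the launch line for the configuration used in the batch script above.
emulation_cmd 0 ./a.out
```

The point of the helper is that the CPU binding and MEMKIND_HBW_NODES always name the same node, while --membind names the opposite one; swapping the argument to 1 mirrors the whole setup onto the other socket.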

AutoHBW library on Edison

The AutoHBW library is an Intel development tool that can be used to automatically allocate high-bandwidth (HBW) memory without any modifications to the source code of your application. It intercepts the standard heap allocations (e.g., malloc, calloc) in your program so that it can serve those allocations out of HBW memory. To use this library, link your application to the AutoHBW library as well as the memkind library and its dependent libraries (the AutoHBW library depends on the memkind library), and then at run time set a couple of environment variables so that allocations in a certain size range are automatically placed in HBW memory. This tool can be used on the dual-socket Edison compute nodes to estimate the performance impact of HBW memory on your application codes.

This is how you can use this library tool on Edison:

1) Link your code to the autohbw, memkind and jemalloc libraries

module load autohbw
ftn -g -O3 -openmp mycode.f90

2) Run the code with the numactl command and the proper environment variables

#!/bin/bash -l
#SBATCH -J test_HBW
#SBATCH -p debug
#SBATCH -N 1

export MEMKIND_HBW_NODES=0
export AUTO_HBW_LOG=0
export AUTO_HBW_MEM_TYPE=MEMKIND_HBW # this is the default, so you do not have to set explicitly
export AUTO_HBW_SIZE=1K:5K     # all allocations between sizes 1K and 5K allocated in HBW memory

srun -n 1 numactl --membind=1 --cpunodebind=0 ./a.out

where MEMKIND_HBW_NODES=0 defines the memory on NUMA node 0 (socket 0) as the high-bandwidth memory. The --membind=1 flag to numactl binds standard allocations and stack variables to NUMA node 1 (socket 1). The --cpunodebind=0 option to numactl binds the process to the CPUs associated with NUMA node 0. With this configuration, standard allocations are fetched across the QPI bus from NUMA node 1 (mimicking DDR), and HBW allocations come from NUMA node 0, which is local to the process CPUs. The environment variable AUTO_HBW_SIZE (=x[:y], where x and y are the lower and upper bounds on the allocation size, respectively) sets the size range of the allocations to be placed in HBW memory. AUTO_HBW_LOG sets the log level. AUTO_HBW_MEM_TYPE (=<memory_type>) sets the type of memory that is allocated automatically.
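To get a feel for which allocations a given AUTO_HBW_SIZE setting would capture, the range check can be mimicked in bash. This is an illustrative sketch, not the library's code: to_bytes and in_hbw_range are hypothetical helpers, the K/M/G suffixes are assumed to be powers of 1024, and the lower bound is assumed inclusive and the upper bound exclusive. Check the autohbw_README for the library's actual rules.

```shell
#!/bin/bash
# Convert a size such as 512, 1K, 4M or 2G to bytes.
# Assumption: K/M/G are powers of 1024.
to_bytes() {
    local s="$1" num unit
    num="${s%[KMG]}"       # strip a trailing K, M or G if present
    unit="${s#"$num"}"     # whatever was stripped (may be empty)
    case "$unit" in
        K) echo $(( num * 1024 )) ;;
        M) echo $(( num * 1024 * 1024 )) ;;
        G) echo $(( num * 1024 * 1024 * 1024 )) ;;
        *) echo "$num" ;;
    esac
}

# Succeed if an allocation of $2 bytes falls inside the range $1 (x or x:y).
# Assumption: lower bound inclusive, upper bound exclusive.
in_hbw_range() {
    local range="$1" size="$2" low high
    low=$(to_bytes "${range%%:*}")
    if [[ "$range" == *:* ]]; then
        high=$(to_bytes "${range#*:}")
        (( size >= low && size < high ))
    else
        (( size >= low ))
    fi
}

# Classify a few allocation sizes against the range used in the batch script.
for size in 512 2048 8192; do
    if in_hbw_range "1K:5K" "$size"; then
        echo "$size bytes -> HBW"
    else
        echo "$size bytes -> standard"
    fi
done
```

A quick check like this can help when choosing the range: large arrays that dominate bandwidth should fall inside it, while small, frequently allocated buffers can be left in standard memory.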

For more information about the AutoHBW library, please read the autohbw_README file, which is located in the /usr/common/usg/autohbw/default/intel/examples directory on Edison.

Note:

  • The autohbw library intercepts the malloc calls in your code, so the autohbw library must come before the system default libraries in the search order.
    • For dynamic builds, use LD_PRELOAD or LD_LIBRARY_PATH to put libautohbw.so:libmemkind.so ahead of the system default path.
    • For static builds, put the autohbw and memkind libraries at the front of the link line. You can use -Wl,-ymalloc to check which library malloc is resolved from.
  • To check how the memory of your process is distributed across the NUMA nodes, run the following on the compute node where your job is running:
module load numactl
numastat -p <pid>
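For the dynamic-build case in the note above, one way to keep the interposing libraries ahead of anything already preloaded is sketched below. prepend_preload is a hypothetical helper, and the sketch assumes that the autohbw module makes libautohbw.so and libmemkind.so findable by the dynamic loader. It only prints the resulting value.

```shell
#!/bin/bash
# Sketch: put the interposing libraries ahead of anything already in LD_PRELOAD.
# prepend_preload is a hypothetical helper; the library names come from the note above.
prepend_preload() {
    local libs="$1" current="${2:-}"
    if [ -n "$current" ]; then
        echo "${libs}:${current}"
    else
        echo "$libs"
    fi
}

preload_value=$(prepend_preload "libautohbw.so:libmemkind.so" "${LD_PRELOAD:-}")
echo "LD_PRELOAD=${preload_value}"

# In a batch script the value could then be applied to the launch line, e.g.:
#   LD_PRELOAD="$preload_value" srun -n 1 numactl --membind=1 --cpunodebind=0 ./a.out
```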

Memkind and AutoHBW libraries on Cori

See instructions for Edison.

Software Development Emulator (SDE) on Edison

SDE allows developers to use currently available compilers, assemblers, and processors (Edison) to gain insight into whether applications are ready to take advantage of the new instructions (AVX512) available on the future architecture (KNL).

SDE is available on Edison under Cluster Compatibility Mode (CCM) only, so you have to compile your code to run under CCM. Here is how you can use SDE on Edison:

1) Compile codes for KNL using the Intel compiler option -xMIC-AVX512

module load impi
mpiifort -g -O3 -openmp -debug inline-debug-info -xMIC-AVX512 mycode.f90

2) Run under SDE

 

#!/bin/bash -l
#SBATCH -J test_SDE
#SBATCH -p regular
#SBATCH --ccm
#SBATCH -N 1
#SBATCH -t 2:00:00

#module load ccm

module load impi
module load sde
export OMP_NUM_THREADS=12

num_top_blocks=100

sde -knl -mix -omix mymix.txt -iform -top_blocks $num_top_blocks -mask_profile -omask_profile mymask_profile.txt -dyn_mask_profile -odyn_mask_profile mydyn_mask_profile.txt -- ./a.out

where the relevant sde options are:

-knl                 Set chip-check and CPUID for Knights Landing
-mix                 Run mix histogram tool
-omix                Set the output file name for mix; implies -mix. Default is "mix.out"
-iform               [default 0] Compute ISA iform mix
-line_info           [default 1] Add line info to the top hot blocks
-top_blocks          [default 20] Specify a maximal number of top blocks for which icounts are printed
-mask_profile        [default 0] Enable mask profiling
-omask_profile       [default sde-mask-profile.txt] Specify profile file output name
-dyn_mask_profile    [default 0] Enable dynamic mask profiling
-odyn_mask_profile   [default sde-dyn-mask-profile.txt] Specify profile file output name

Please type sde -help or sde -help-long for more info. 
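The long sde command line in step 2 can be easier to read and adjust when the options are collected in a bash array. The sketch below uses only the options and file names from the batch script above; the array name sde_args is an arbitrary choice. It prints the command instead of running it.

```shell
#!/bin/bash
# Sketch: assemble the sde options from the batch script as a bash array.
num_top_blocks=100

sde_args=(
    -knl                                                         # emulate Knights Landing
    -mix -omix mymix.txt                                         # instruction mix histogram
    -iform                                                       # ISA iform mix
    -top_blocks "$num_top_blocks"                                # number of hot blocks reported
    -mask_profile -omask_profile mymask_profile.txt              # mask profiling
    -dyn_mask_profile -odyn_mask_profile mydyn_mask_profile.txt  # dynamic mask profiling
)

# Print the command for inspection; in the batch script, drop the echo to run it.
echo sde "${sde_args[@]}" -- ./a.out
```

Grouping each option with its output file on one line makes it harder to drop one half of a pair by accident when editing the script.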
