Using on-package memory
The NERSC-8 system will include a novel feature on its node architecture: 16 GB of high-bandwidth 3D stacked memory interposed between the KNL chip and the slower off-package DDR memory. Compared to the on-node DDR4 memory, the high-bandwidth memory (HBM) has approximately 5x the bandwidth but has similar latency. This new feature has the potential to accelerate those applications which are particularly sensitive to memory bandwidth limits. To investigate how sensitive your application is to memory bandwidth, consider running the packed versus half-packed experiments described 'here' or use the Intel VTune tool on Edison.
The HBM can be used in three different ways. First, in 'cache' mode, the HBM acts as a 16 GB cache for the off-package DRAM. It effectively works as another level in the cache hierarchy, with cache lines from DDR direct mapped to the HBM. Use of the HBM is managed automatically, with no user intervention required. Second, it can be configured in what is referred to as 'flat' mode, where the user must explicitly manage the use of the HBM, generally by modifying source code so that data structures selected by the user are placed into the HBM as opposed to DRAM. For C, this will require using slightly different library calls (hbw_malloc/hbw_free); for Fortran, this may require using directives or different type declarations (e.g. real, dimension(10), fastmemory :: x). Finally, in 'hybrid' mode, the HBM can be used as a mixture of cache and flat memory (e.g. 25%/75% or 50%/50%). There will be a default option for most jobs, but the user will be able to select the desired mode on a per-job basis.
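As a rough illustration of what the C source changes for flat mode might look like, the sketch below wraps the hbw_malloc/hbw_free calls from the memkind library's hbwmalloc.h. The helper names (alloc_bandwidth_critical, free_bandwidth_critical) and the USE_MEMKIND build flag are hypothetical and introduced only for this example. Without that flag the code falls back to ordinary malloc/free, so it can be compiled on systems that do not have memkind installed.

```c
#include <stddef.h>
#include <stdlib.h>

#ifdef USE_MEMKIND          /* hypothetical build flag, e.g. cc -DUSE_MEMKIND ... -lmemkind */
#include <hbwmalloc.h>      /* hbw_check_available, hbw_malloc, hbw_free */
#endif

/* Hypothetical helper: allocate n doubles, using HBM when the code is built
 * with USE_MEMKIND and high-bandwidth memory is reported as available.
 * *in_hbm records which allocator was used so the matching free is called. */
double *alloc_bandwidth_critical(size_t n, int *in_hbm)
{
    *in_hbm = 0;
#ifdef USE_MEMKIND
    if (hbw_check_available() == 0) {          /* 0 means HBM is available */
        double *p = hbw_malloc(n * sizeof *p);
        if (p != NULL) {
            *in_hbm = 1;
            return p;
        }
    }
#endif
    return malloc(n * sizeof(double));         /* ordinary DDR allocation */
}

/* Hypothetical helper: release memory obtained from alloc_bandwidth_critical. */
void free_bandwidth_critical(double *p, int in_hbm)
{
#ifdef USE_MEMKIND
    if (in_hbm) {
        hbw_free(p);
        return;
    }
#else
    (void)in_hbm;                              /* unused without memkind */
#endif
    free(p);
}
```

Only the arrays identified as bandwidth-limited would be allocated this way; everything else can continue to use the standard allocator and remain in DDR.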
How best to use the HBM is an issue under active consideration by NERSC staff. While the size of the HBM on NERSC-8 is 16 GB, we expect this will increase on future architectures. At present, users need to consider whether their strategy for using HBM might involve simply strong scaling their production science runs until their on-node memory usage allows them to place the entirety of their run into HBM. In this case, the user may simply choose to use the HBM in cache mode, and no further intervention is required. If users expect to place their own data structures onto the HBM explicitly, it will be necessary to determine experimentally which data structures are limited by memory bandwidth (more here). We expect this will be particularly important if an application's memory footprint does not fit entirely within DRAM. Users who wish to investigate to what extent HBM might be useful to their applications before the arrival of KNL hardware are encouraged to try the open source library memkind (https://github.com/memkind/memkind, also available on Edison as a module). This library takes advantage of the fact that the memory bandwidth from a socket to its associated DRAM ('near memory') is higher than the bandwidth to DRAM on the adjacent socket ('far memory'). In this way, the user can, to some extent, simulate the effect of having data structures in HBM versus DRAM by placing them in near memory or far memory, respectively. Intel is also currently working on tools, to be incorporated into VTune, that will help users discover which data structures within their code use significant memory bandwidth and thus should be considered for placement into HBM.
To get a sense of how much benefit one might expect from using HBM when performing calculations using streaming memory references, we developed a simple model that is a function of (a) the fraction of time the code spends servicing streaming memory references from DRAM only and (b) the fraction of streaming memory references that come from HBM. In the plot below, one can see that to achieve an overall 2x speedup, an application would have to spend a significant fraction of its time servicing memory requests, and a significant fraction of those requests would have to come from HBM. While hardly a detailed model, this should give some indication that HBM is not necessarily a magic bullet for improving application performance, especially when one takes into account the lower clock speed of KNL compared to, say, the current generation of Intel Xeon CPUs.
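The exact form of the model is not given here, so the following is only one plausible Amdahl-style reading of it, not necessarily the formula behind the plot. It assumes that the time spent on streaming references scales inversely with bandwidth and that the rest of the run time is unaffected. The function name hbm_speedup and its parameters are hypothetical: f is the fraction of baseline run time spent servicing streaming references from DRAM, h is the fraction of those references that are served from HBM instead, and bw_ratio is the HBM-to-DDR bandwidth ratio (approximately 5, per the figure quoted earlier).

```c
/* Hypothetical Amdahl-style estimate of the speedup from HBM.
 * This is an assumed form, not necessarily the model used for the plot.
 *   f        : fraction of baseline time spent on streaming references
 *              served from DRAM (0..1)
 *   h        : fraction of those references moved to HBM (0..1)
 *   bw_ratio : HBM bandwidth divided by DDR bandwidth (about 5) */
double hbm_speedup(double f, double h, double bw_ratio)
{
    /* Memory time after the move: the DDR share is unchanged,
     * the HBM share shrinks by the bandwidth ratio. */
    double mem_time = f * ((1.0 - h) + h / bw_ratio);

    /* New run time relative to a baseline of 1.0. */
    return 1.0 / ((1.0 - f) + mem_time);
}
```

Under these assumptions, hbm_speedup(0.5, 0.5, 5.0) is 1.25, and even with every streaming reference served from HBM (h = 1) an application would need f of at least 0.625 to reach a 2x speedup, which is consistent with the qualitative conclusion above.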