NERSCPowering Scientific Discovery Since 1974

KNL Processor Modes

The Xeon-Phi "Knights-Landing" 7250 processors in Cori have 68 CPU cores where are organized into 34 "tiles" (each tile comprising two CPU cores and a shared 1MB L2 cache) which are placed in a 2D mesh, connected via an on-chip interconnect as shown in the following figure:

KNLOverview

As shown in the figure, the KNL processor has 6 DDR channels, with controllers to the right and left of the mesh 8 MCDRAM channels, with controllers spread across 4 "corners" of the mesh. 

NUMA on KNL

NUMA stands for Non-Uniform Memory Access. It represents the situation where certain cores on a node can be considered "closer" to some part of the memory space than others or if some memory on the node has different lattencies or bandwidths to the cores. Both of these situations occur on the KNL nodes - there are two different types of memory available with different specs, and different channels of memory are closer to different cores in the 2d mesh.

NUMA Mode Options on KNL

The most useful NUMA modes on KNL are quadrant and sub-NUMA clustering. 

In quadrant mode the chip is divided into four virtual quadrants, but is exposed to the OS as a single NUMA domain. The diagram below shows illustrates the affinity of tag directory to the memory. In many cases, this mode is the easiest to use and will provide good performance.

cluster mode quadrant

 

In sub-NUMA mode each quadrant (or half) of the chip is exposed as a separate NUMA domain to the OS. In SNC4 (SNC2) mode the chip is analogous to a 4 (2) socket Xeon. This mode has potential for high performance, but software must be optimized for NUMA architectures to benefit. On Cori SNC4 can be complicated to fully utilize as there are non-homogenous number of cores per quadrant on a 68 core cpu (Because the 34 tiles cannot be evenly distributed among 4 quadrants).

cluster mode snc4

 

MCDRAM Memory Options on KNL

There is no shared L3 cache on the KNL processor. However, the 16 GB of MCDRAM (spread over 8 channels) can be configured either as a direct-mapped cache or as addressable memory. When configured as a cache, recently accessed data is automatically stored in cache, similarly to an L3 cache on a Xeon processor. However, there are some notable differences:

- The cache (16GB) is significantly larger than a typical L3 cache on a Xeon processor (usually in the tens of MB).

- The cache is direct-mapped. Meaning it is non-associative - each cache-line worth of data in DRAM has one location it can be cached in MCDRAM. This can lead to possible conflicts for apps with greater than 16GB working sets. 

- Data is not prefetched into the MCDRAM cache

The MCDRAM may be configured either as a cache, addressable memory or as a mix of the two. This is shown in the figure below:

KNLMemory

When the MCDRAM is configured at least in part as addressable memory, it is presented to the operating system and the user as an additional NUMA domain (quadrant mode) or multiple additional NUMA domains (SNC2 or SNC4 mode) as described above. 

On Cori, we have enabled two configurations for the MCDRAM:

cache mode - The MCDRAM is configured entirely as a last-level cache

flat mode - The MCDRAM is configured entirely as addressable memory

Other modes may be enabled in the future or on request.

To request these modes, you use the following syntax in your batch scripts (where "quad" would be replaced by your NUMA mode of choice):

#SBATCH -C knl,quad,flat

or

#SBATCH -C knl,quad,cache

In "flat" mode, the "numactl" command should be used to specify a preferred or default memory domain. For example, to allocate all your memory out of the MCDRAM in quad,flat mode you would launch your executable with the following:

srun <srun options> numactl -m 1 yourapplication.x

The above will fail if your applications requires more than 16GB of memory. To instead prefer MCDRAM for your first 16GB, but allocated remaining data in the DDR memory use the "-p" flag to numactl

srun <srun options> numactl -p 1 yourapplication.x

The above examples represent the quad,flat configuration where domain 0 corresponds to DDR and domain 1 corresponds to MCDRAM. On SNC4, domains 0-3 represent the 4 DDR NUMA domains (described above) and domains 4-7 represent the 4 MCDRAM NUMA domains.