Core Specialization (CS) is a feature of the Cray operating system that allows the user to reserve one or more cores per node for handling system services, reducing the effects of timing jitter from operating-system interruptions at the expense of (possibly) requiring more nodes to run an application. The specialized cores may also be used in conjunction with Cray's MPI asynchronous progress engine to improve the overlap of communication and computation in applications that use non-blocking MPI functions. In the absence of CS, the compute cores must service their own non-blocking calls.
Hyper-Threading (HT) complicates questions about the most effective use of processor resources. HT doubles the number of compute streams (i.e., processes or threads) that can schedule computation, but does not increase other resources such as floating-point units or cache. A key question for users is whether the additional 'logical cores' are best used by allocating additional MPI processes to the nodes, by using multiple threads per MPI task, or by devoting some of them to the operating system or MPI progress engine via CS. This question is addressed below by analyzing the effect of CS on the performance of several scientific applications.
Use of the CS and HT features is determined at runtime by options provided to the aprun command. The aprun options used for each experiment are listed in Table 1. Additional environment variables (MPICH_NEMESIS_ASYNC_PROGRESS=1 and MPICH_MAX_THREAD_SAFETY=multiple) were set to enable the progress engine whenever CS is used. For the MPI+CS experiment, a third environment variable (MPICH_GNI_USE_UNASSIGNED_CPUS=enabled) is needed to permit the progress engine to run on the same physical core as the application. For the four benchmark applications that include OpenMP directives (CAM, GTC, MILC, and MiniDFT), we also perform hybrid MPI+OpenMP calculations with 16 tasks per node and use HT by assigning two OpenMP threads per MPI task; a sample launch sketch is shown after Table 1.
| Experiment | aprun options |
|---|---|
| MPI+HT+CS | -N31 -r1 -j2 |
| Hybrid+HT | -N16 -d2 -j2 -cc numa_node |
| Hybrid+HT+CS | -N15 -r1 -d2 -j2 -cc numa_node |
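As an illustration, the sketch below shows how the Hybrid+HT+CS configuration in Table 1 might be launched from a batch script. The executable name (./app.x) and the total task count passed to -n are placeholders, not values from the experiments; the aprun options and environment variables are those described above.

```bash
#!/bin/bash
# Hypothetical launch of the Hybrid+HT+CS experiment (placeholder values:
# total task count and ./app.x; aprun options mirror Table 1).

# Enable the MPI asynchronous progress engine used with Core Specialization.
export MPICH_NEMESIS_ASYNC_PROGRESS=1
export MPICH_MAX_THREAD_SAFETY=multiple

# Two OpenMP threads per MPI task, matching the -d2 depth.
export OMP_NUM_THREADS=2

# 15 MPI tasks per node (-N15), one specialized core per node (-r1),
# two compute streams per physical core via Hyper-Threading (-j2),
# and task/thread binding within a NUMA node (-cc numa_node).
aprun -n 960 -N15 -r1 -d2 -j2 -cc numa_node ./app.x
```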
Figure 1 shows that the impact of CS is at best negligible and often negative. To account for a large range of absolute performance and resource allocations, MPP hours are used as the basis for comparing CS use modes. For every code where the comparison can be made, CS increases the MPP usage relative to the corresponding experiment without CS: MPP_Hours(MPI+HT) ≤ MPP_Hours(MPI+HT+CS) and MPP_Hours(Hybrid+HT) ≤ MPP_Hours(Hybrid+HT+CS). Because reserving a core leaves fewer application tasks per node, a CS run needs more nodes for the same total task count, so any speedup from the progress engine must outweigh the larger allocation in order to reduce MPP usage. Results of the MPI+CS experiment will be added when they become available.
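For concreteness, the following sketch illustrates the node-count effect, assuming MPP hours are charged as allocated nodes × cores per node × wall-clock hours. The 16 cores per node, the 32-tasks-per-node non-CS layout, the 960-task job size, and the one-hour runtime are all illustrative assumptions, not measurements from the experiments.

```bash
# Illustrative MPP-hour comparison under an assumed charging model of
# nodes x cores-per-node x wall-clock hours. All quantities below are
# placeholders chosen only to show the accounting.
CORES_PER_NODE=16
TASKS=960
HOURS=1

NODES_NO_CS=$(( (TASKS + 31) / 32 ))   # 32 tasks/node without CS -> 30 nodes
NODES_CS=$(( (TASKS + 30) / 31 ))      # 31 tasks/node with -N31  -> 31 nodes

echo "MPP hours without CS: $(( NODES_NO_CS * CORES_PER_NODE * HOURS ))"  # 480
echo "MPP hours with CS:    $(( NODES_CS * CORES_PER_NODE * HOURS ))"     # 496
```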
- H. Pritchard, D. Roweth, D. Henseler, and P. Cassella, "Leveraging the Cray Linux Environment Core Specialization Feature to Realize MPI Asynchronous Progress on Cray XE Systems," Proc. Cray User Group, 2012.