Processor Frequency on the Cori Data Partition
The Haswell processors in Cori's data partition have a "Turbo Boost" feature to dynamically adjust CPU frequency and achieve the maximum possible performance. When Turbo Boost is enabled, the processor operates at the maximum frequency allowed by the available power and thermal limits. Further, on Cori (unlike Edison), each core can operate at a different frequency.
The instantaneous turbo frequency can be above or below the nominal 2.3 GHz frequency, depending on the number of active cores and the type of calculations being performed. When only a few cores are active, the frequency of the active cores may be as high as 3.6 GHz for scalar code (i.e., code that does not use Advanced Vector Extensions, or AVX). The base AVX frequency (1.9 GHz) is the minimum frequency that would be obtained in turbo mode when all cores are running heavily vectorized code. In preliminary tests, however, NERSC has not observed average turbo frequencies lower than 2.3 GHz.
Setting the Processor Frequency
Turbo mode is active by default on Cori's data partition. For some experiments (such as those described below), users may wish to deactivate Turbo Boost and run at a constant frequency of 2.3 GHz or lower for the duration of their job. (Currently, there is no way to change the clock speed for different phases of a job.) This is accomplished using the --cpu-freq=<frequency in kHz> option of the srun job launch command. Valid frequencies are between 1.2 and 2.3 GHz, in increments of 0.1 GHz. For example, the following command launches a single process at 1.9 GHz:
srun -n 1 --cpu-freq=1900000 ./a.out
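Because --cpu-freq takes its value in kHz, it is easy to mistype the number of zeros when scripting a frequency sweep. The following is a minimal Python sketch, assuming only the range and increment stated above; the helper names cpu_freq_khz and srun_command are hypothetical and are not part of Slurm or any NERSC tool.

```python
import shlex

MIN_KHZ = 1_200_000  # 1.2 GHz, lowest valid setting quoted in the text
MAX_KHZ = 2_300_000  # 2.3 GHz, nominal frequency
STEP_KHZ = 100_000   # valid settings are spaced 0.1 GHz apart


def cpu_freq_khz(ghz):
    """Convert a frequency in GHz to the integer kHz value for --cpu-freq."""
    khz = round(ghz * 1_000_000)
    if not MIN_KHZ <= khz <= MAX_KHZ:
        raise ValueError(f"{ghz} GHz is outside the 1.2 to 2.3 GHz range")
    if (khz - MIN_KHZ) % STEP_KHZ != 0:
        raise ValueError(f"{ghz} GHz is not a multiple of 0.1 GHz")
    return khz


def srun_command(ghz, ntasks=1, executable="./a.out"):
    """Build the srun argument list for a fixed-frequency run (hypothetical helper)."""
    return ["srun", "-n", str(ntasks), f"--cpu-freq={cpu_freq_khz(ghz)}", executable]


if __name__ == "__main__":
    # Print the command rather than running it, so it can be inspected first.
    print(shlex.join(srun_command(1.9)))
```

Called with 1.9, the helper reproduces the argument list of the command shown above; values outside the documented range or off the 0.1 GHz grid raise a ValueError instead of being passed to srun.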
Frequency Scaling Experiments
We have investigated the behavior of Turbo Boost using two kernels: dense matrix multiplication (DGEMM), which uses AVX, and streaming evaluation of the exponential function (EXP), which can be compiled with AVX or, by using the -no-vec compiler option, without it. The array sizes for both kernels were adjusted to ensure that their performance would be limited by CPU frequency, not memory bandwidth.
The DGEMM and EXP kernels were run on a single core while manually adjusting the CPU frequency to constant values between 1.2 and 2.3 GHz. Figure 1 shows that the single-core performance of both kernels increases linearly with CPU frequency. The measured DGEMM performance is 88% of the theoretical peak performance at all frequencies. The performance of the EXP kernel increases by more than a factor of three when AVX is used.
|Figure 1. Single-core performance for the DGEMM (left) and EXP (right) kernels.|
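To make the 88% figure concrete, the sketch below computes a theoretical single-core peak and the corresponding expected DGEMM rate at each constant frequency. It rests on an assumption that is not stated in the text: that a Haswell core can retire 16 double-precision floating-point operations per cycle (two FMA units, four doubles per 256-bit vector, two operations per FMA). The function names are illustrative only.

```python
# Assumption (not from the text): 16 double-precision flops per cycle per
# Haswell core = 2 FMA units x 4 doubles per AVX vector x 2 flops per FMA.
FLOPS_PER_CYCLE = 16
DGEMM_EFFICIENCY = 0.88  # fraction of peak reported in the text


def peak_gflops(freq_ghz):
    """Theoretical single-core peak in GFLOP/s at a given clock frequency."""
    return FLOPS_PER_CYCLE * freq_ghz


def expected_dgemm_gflops(freq_ghz):
    """DGEMM rate implied by the 88% efficiency figure."""
    return DGEMM_EFFICIENCY * peak_gflops(freq_ghz)


if __name__ == "__main__":
    # Sweep the constant frequencies used in the experiment: 1.2 to 2.3 GHz.
    for tenths in range(12, 24):
        f = tenths / 10
        print(f"{f:.1f} GHz: peak {peak_gflops(f):5.1f} GFLOP/s, "
              f"DGEMM about {expected_dgemm_gflops(f):5.1f} GFLOP/s")
```

Under this assumption both quantities are proportional to the clock frequency, which is consistent with the linear scaling described above.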
Turbo Boost Behavior
The time-averaged single-core turbo frequency was estimated by measuring the performance in turbo mode, and extrapolating from a linear fit to the constant-frequency data. The single-core turbo frequencies for DGEMM, EXP(AVX) and EXP(scalar) were 3.3 GHz, 2.9 GHz and 3.5 GHz, respectively. The turbo frequency has a nonlinear dependency on the number of active cores per socket. We repeated the DGEMM and EXP frequency scaling experiment described above using 1-16 cores per socket. The estimated turbo frequencies are plotted in Figure 2, which shows that the turbo frequency decreases for higher core counts. It is also evident that lower frequencies are obtained for code that uses AVX.
|Figure 2. Lower Turbo Boosts at higher concurrency.|
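The extrapolation used to estimate the turbo frequency can be sketched as follows: fit a straight line to performance versus frequency from the constant-frequency runs, then invert the line at the performance measured in turbo mode. This is a minimal illustration, not the script NERSC used; the function names are hypothetical and the numbers in the demo are synthetic placeholders, not measurements.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_turbo_ghz(freqs_ghz, perfs, turbo_perf):
    """Frequency at which the fitted line reaches the turbo-mode performance."""
    slope, intercept = fit_line(freqs_ghz, perfs)
    return (turbo_perf - intercept) / slope


if __name__ == "__main__":
    # Synthetic, made-up data for illustration only; NOT NERSC measurements.
    freqs = [1.2 + 0.1 * i for i in range(12)]  # 1.2 to 2.3 GHz
    perfs = [10.0 * f for f in freqs]           # perfectly linear placeholder
    turbo_perf = 30.0                           # placeholder turbo-mode result
    print(f"estimated turbo frequency: "
          f"{estimate_turbo_ghz(freqs, perfs, turbo_perf):.2f} GHz")
```

Because the result lies outside the fitted range of 1.2 to 2.3 GHz, the estimate is only as good as the assumption that performance stays linear in frequency above 2.3 GHz.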
Figure 3 compares the performance benefits of vectorization to those of Turbo Boost when all cores are active. At constant frequency, AVX increases the performance of the EXP kernel by a factor of 3.3. The effects of Turbo Boost are significantly smaller: 23% for scalar code and 9% for AVX code. In turbo mode, EXP is 2.9 times faster with AVX than without.
|Figure 3. Vectorization performance with Turbo Boost|
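The figures quoted for the EXP kernel can be checked against each other. The sketch below assumes that the 23% and 9% Turbo Boost gains are each measured relative to the same constant-frequency baseline as the factor of 3.3; that interpretation is an assumption, and the function name is illustrative.

```python
AVX_SPEEDUP_CONSTANT_FREQ = 3.3  # AVX vs. scalar at constant frequency
TURBO_GAIN_SCALAR = 0.23         # Turbo Boost gain for scalar code
TURBO_GAIN_AVX = 0.09            # Turbo Boost gain for AVX code


def avx_speedup_in_turbo(base=AVX_SPEEDUP_CONSTANT_FREQ,
                         gain_avx=TURBO_GAIN_AVX,
                         gain_scalar=TURBO_GAIN_SCALAR):
    """AVX-over-scalar speedup when both versions run in turbo mode."""
    # Turbo scales the AVX run by (1 + gain_avx) and the scalar run by
    # (1 + gain_scalar), so the ratio shrinks by the quotient of the two.
    return base * (1 + gain_avx) / (1 + gain_scalar)


if __name__ == "__main__":
    print(f"AVX speedup in turbo mode: {avx_speedup_in_turbo():.1f}x")
```

Under that assumption the result rounds to 2.9, matching the turbo-mode speedup quoted above: scalar code gains more from Turbo Boost than AVX code does, so the relative advantage of vectorization narrows slightly.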
1. Intel has written a white paper describing the behavior of Turbo Boost with and without AVX instructions: "Optimizing Performance with Intel® Advanced Vector Extensions".