NERSC's newest supercomputer, named Cori, includes the Haswell partition (Phase I) and the KNL partition (Phase II). The Haswell partition has a theoretical peak performance of 1.92 petaflop/s, 2,004 compute nodes (64,128 compute cores) for running scientific applications, and 203 terabytes (TB) of memory. The KNL partition has a theoretical peak performance of 29.1 petaflop/s, 9,688 compute nodes (658,784 cores in total), and 1 petabyte (PB) of memory. The Cori system also has 28 PB of online disk storage with a peak I/O bandwidth of more than 700 gigabytes (GB) per second, and 1.8 PB of SSDs in the Burst Buffer, with more than 1.7 TB/s of aggregate I/O bandwidth.
Cori Phase I and II system configuration details are given below.
- Cray XC40 supercomputer
- Theoretical peak performance: Phase I Haswell: 1.92 PFlop/s; Phase II KNL: 29.1 PFlop/s.
- Sustained application performance on NERSC SSP codes: Phase I Haswell: 83 TFlop/s (vs. 129 TFlop/s for Edison and 52.1 TFlop/s for Hopper); Phase II KNL: 562.31 TFlop/s.
- Total compute nodes: Phase I Haswell: 2,004 compute nodes, 64,128 cores in total (32 cores per node); Phase II KNL: 9,688 compute nodes, 658,784 cores in total (68 cores per node).
- Cray Aries high-speed interconnect with Dragonfly topology as on Edison (0.25 μs to 3.7 μs MPI latency, ~8GB/sec MPI bandwidth)
- Aggregate memory: Phase I Haswell partition: 203 TB; Phase II KNL partition: 1 PB.
- Scratch storage capacity: 30 PB
- Burst Buffer capacity: 1.8 PB
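
The core totals in the list above follow directly from the node counts and the cores per node. A minimal Python sketch of that arithmetic is shown here; the dictionary layout and the function name `total_cores` are illustrative choices, not part of any NERSC tooling.

```python
# Node counts and cores per node as quoted in the list above.
PARTITIONS = {
    "Haswell": {"nodes": 2004, "cores_per_node": 32},
    "KNL": {"nodes": 9688, "cores_per_node": 68},
}


def total_cores(nodes, cores_per_node):
    """Total cores in a partition: nodes times cores per node."""
    return nodes * cores_per_node


for name, spec in PARTITIONS.items():
    print(f"{name}: {total_cores(**spec):,} cores")
```

Running it reproduces the 64,128 (Haswell) and 658,784 (KNL) core totals quoted above.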
| Component | Count | Details |
| --- | --- | --- |
| Haswell Cabinets | 12 | Each cabinet has 3 chassis; each chassis has 16 compute blades; each compute blade has 4 dual-socket nodes. |
| Haswell Compute nodes | 2,004 | Each node has two sockets; each socket is populated with a 16-core Intel® Xeon™ Processor E5-2698 v3 ("Haswell") at 2.3 GHz. |
| | | 32 cores per node. |
| | | Each core supports 2 hyper-threads and has two 256-bit-wide vector units. |
| | | 36.8 GFlops/core; 1.2 TFlops/node; 1.92 PFlops total (theoretical peak). |
| | | Each node has 128 GB DDR4 2133 MHz memory (four 16 GB DIMMs per socket); 203 TB total aggregate memory. |
| | | Each core has its own L1 and L2 caches, with 64 KB (32 KB instruction cache, 32 KB data) and 256 KB, respectively; there is also a 40 MB shared L3 cache per socket. |
| KNL Cabinets | 52 | Each cabinet has 3 chassis; each chassis has 16 compute blades; each compute blade has 4 nodes. |
| KNL Compute nodes | 9,688 | Each node is a single-socket Intel® Xeon Phi™ Processor 7250 ("Knights Landing") with 68 cores per node @ 1.4 GHz. |
| | | Each core has two 512-bit-wide vector processing units. Each core has 4 hardware threads (272 threads total). Two cores form a tile. |
| | | 44 GFlops/core; 3 TFlops/node; 29.1 PFlops total (theoretical peak). |
| | | Each node has 96 GB DDR4 2400 MHz memory, six 16 GB DIMMs (102 GB/s peak bandwidth). Total aggregate memory (combined with MCDRAM) is 1 PB. |
| | | Each node has 16 GB MCDRAM (multi-channel DRAM), > 460 GB/s peak bandwidth. |
| | | Each core has its own L1 caches, with 64 KB (32 KB instruction cache, 32 KB data). Each tile (2 cores) shares a 1 MB L2 cache. |
| Interconnect | | Cray Aries with Dragonfly topology with 5.625 TB/s global bandwidth (Phase I). 45.0 TB/s global peak bisection bandwidth (Phase II). |
| Login nodes | 12 | Dual socket (16 cores per socket, 32 total cores), 2.3 GHz Intel® Xeon™ Processor E5-2698 v3 ("Haswell") with 512 GB memory. |
| MOM nodes | -- | Unlike Edison, there are no dedicated MOM nodes on Cori. |
| Shared Root Server Nodes | 16 | |
| Lustre Router nodes | 130 | |
| DVS Server Nodes | 32 | |
| Scratch storage system | | Cray Sonexion 2000 Lustre appliance. Scratch storage maximum aggregate bandwidth: > 700 GB/sec. |
| Burst Buffer | | Cray DataWarp system consisting of 288 Burst Buffer nodes, with maximum aggregate I/O > 1.7 TB/s and > 28M IOP/s. |
| Operating System | | CNL on compute nodes: compute nodes run a lightweight kernel and run-time environment based on the SUSE Linux Enterprise Server (SLES) Linux distribution. |
| | | Full SUSE Linux on login nodes: external login nodes run a standard SLES distribution similar to the internal service nodes. |
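
The per-core and per-node peak figures in the table above can be approximated from the clock rate and vector width. The sketch below assumes double-precision arithmetic and that each vector unit retires one fused multiply-add (counted as two floating-point operations) per lane per cycle; that assumption, and the function names, are ours and are not stated in the table.

```python
def peak_gflops_per_core(clock_ghz, vector_units, vector_bits, fma=True):
    """Theoretical double-precision GFlop/s for a single core.

    Assumes one operation per lane per cycle on every vector unit,
    doubled when fused multiply-add is counted as two flops.
    """
    lanes = vector_bits // 64  # 64-bit doubles per vector register
    flops_per_cycle = vector_units * lanes * (2 if fma else 1)
    return clock_ghz * flops_per_cycle


def peak_tflops_per_node(core_gflops, cores_per_node):
    """Scale a per-core GFlop/s figure to TFlop/s for a whole node."""
    return core_gflops * cores_per_node / 1000.0


# Clock rates, vector widths, and core counts taken from the table above.
haswell_core = peak_gflops_per_core(2.3, vector_units=2, vector_bits=256)
knl_core = peak_gflops_per_core(1.4, vector_units=2, vector_bits=512)

print(f"Haswell: {haswell_core:.1f} GFlop/s per core, "
      f"{peak_tflops_per_node(haswell_core, 32):.2f} TFlop/s per node")
print(f"KNL:     {knl_core:.1f} GFlop/s per core, "
      f"{peak_tflops_per_node(knl_core, 68):.2f} TFlop/s per node")
```

Under these assumptions the Haswell figure matches the 36.8 GFlops/core in the table, and the KNL figure comes out at about 44.8 GFlop/s per core, close to the 44 GFlops/core and 3 TFlops/node quoted there.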
The Haswell processors in Cori's data partition have a "Turbo Boost" feature to dynamically adjust CPU frequency and achieve the maximum possible performance. When Turbo Boost is enabled, the processor operates at the maximum frequency allowed by the available power and thermal limits. Further, on Cori (unlike Edison), each core can operate at a different frequency. The instantaneous turbo frequency could be above or below the nominal 2.3 GHz frequency depending on the number of active cores…
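
Because each core can run at its own frequency, one way to observe the effect is to compare per-core clock readings against the nominal 2.3 GHz. The sketch below assumes the Linux x86 `/proc/cpuinfo` text format, where each logical CPU reports a `cpu MHz` line; the sample text is made up for illustration, and nothing here is a NERSC-provided tool.

```python
NOMINAL_MHZ = 2300.0  # nominal Haswell frequency (2.3 GHz)


def core_frequencies_mhz(cpuinfo_text):
    """Extract every 'cpu MHz' value from /proc/cpuinfo-style text."""
    freqs = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "cpu MHz":
            freqs.append(float(value))
    return freqs


# Made-up sample; on a Linux node one could instead pass the contents
# of /proc/cpuinfo, e.g. open("/proc/cpuinfo").read().
SAMPLE = (
    "processor : 0\n"
    "cpu MHz   : 2800.125\n"
    "processor : 1\n"
    "cpu MHz   : 1200.000\n"
)

for cpu, mhz in enumerate(core_frequencies_mhz(SAMPLE)):
    relation = "above" if mhz > NOMINAL_MHZ else "at or below"
    print(f"cpu {cpu}: {mhz:.0f} MHz ({relation} nominal)")
```

The readings are instantaneous snapshots, so repeated calls on a busy node would be expected to return different values.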
The Xeon Phi "Knights Landing" 7250 processors in Cori have 68 CPU cores, which are organized into 34 "tiles" (each tile comprising two CPU cores and a shared 1 MB L2 cache). The tiles are placed in a 2D mesh and connected via an on-chip interconnect, as shown in the following figure. As shown in the figure, the KNL processor has 6 DDR channels, with controllers to the right and left of the mesh, and 8 MCDRAM channels, with controllers spread across 4 "corners" of the mesh. NUMA on KNL: NUMA stands for…
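
The tile organization described above (two cores per tile, one shared 1 MB L2 cache per tile, 4 hardware threads per core) can be sketched as simple arithmetic. The core-to-tile mapping in `tile_of` is a hypothetical numbering that assumes consecutive core IDs share a tile; the actual numbering on a given node may differ.

```python
CORES_PER_NODE = 68
CORES_PER_TILE = 2
THREADS_PER_CORE = 4
L2_MB_PER_TILE = 1

tiles = CORES_PER_NODE // CORES_PER_TILE        # number of tiles per node
hw_threads = CORES_PER_NODE * THREADS_PER_CORE  # hardware threads per node
total_l2_mb = tiles * L2_MB_PER_TILE            # aggregate L2 cache per node


def tile_of(core_id):
    """Hypothetical mapping: cores 2k and 2k+1 are assumed to share tile k."""
    if not 0 <= core_id < CORES_PER_NODE:
        raise ValueError(f"core_id must be in [0, {CORES_PER_NODE})")
    return core_id // CORES_PER_TILE


def shares_l2(core_a, core_b):
    """True if both cores sit on the same tile and so share an L2 cache."""
    return tile_of(core_a) == tile_of(core_b)


print(f"{tiles} tiles, {hw_threads} hardware threads, {total_l2_mb} MB of L2")
```

Threads placed on cores of the same tile share that tile's L2 cache, which is one reason placement can matter for cache-sensitive codes.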