NERSC: Powering Scientific Discovery Since 1974

Throughput on Cori

September 5, 2019 by Richard Gerber

NERSC's Cori system is a unique resource for scientific research that can't be done anywhere else, owing to its more than 10,000 nodes of high-end processors and associated large data systems. The system is popular with NERSC's 7,000 users, and demand for its use can result in long waits in the batch queues before jobs start. This article discusses ways for NERSC users to take best advantage of queues, policies, and the different node types to optimize their job throughput on Cori.

Cori has two kinds of nodes that target different workloads. Its 2,000 Intel Xeon Haswell nodes accommodate traditional codes, providing continuity with previous-generation systems. The 10,000 Intel Xeon Phi (Knights Landing, or KNL) nodes enable large-scale science problems to be addressed using advanced, energy-efficient nodes with features such as fast MCDRAM, long vector units, and high thread-level parallelism that represent the future of computing architectures.

Because the Xeon Phi processors are compatible with the Haswell processors, almost any code that can run on Haswell can run on Phi. The rub is that in most cases, programming effort is required to modernize older codes that were not designed to take advantage of many-core processors like the Phi. Because the Phi cores have a lower clock speed than the Haswell cores, out-of-the-box codes generally run slower on KNL than on the equivalent number of processing units on Haswell. Taking all this into account, there are strategies NERSC users can use to improve their job throughput.

  • Update your code. If you have a code that has been around for a few years, chances are that it can be restructured to manage data structures in memory more efficiently, increase fine-grained parallelism, and increase instruction-level parallelism. In NERSC's experience, codes that have been through this effort run faster on KNL than on Haswell (on a node-to-node basis) and usually run at least twice as fast on Haswell afterwards as well. NERSC is here to help you with this; you can file a ticket and ask for assistance. We also have a number of training events that you can attend in person or watch later in archived form. If successful, your code will finish faster and you can take advantage of the shorter queue waits on KNL nodes.
  • Use Cori's KNL nodes. If your code has been optimized for KNL, then you'll run faster and cheaper on KNL, with much better queue turnaround time. If not, you still have a number of options to get answers faster and cheaper on KNL. See the next section.
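
As a rough illustration of the kind of data-structure restructuring mentioned under "Update your code", the sketch below contrasts an array-of-structures layout with a structure-of-arrays layout for a simple kinetic-energy sum. It is not taken from any NERSC code: the type `ParticleAoS` and the functions `kinetic_energy_aos` and `kinetic_energy_soa` are hypothetical names chosen for this example, and whether the second form actually runs faster depends on the compiler, its flags, and the problem size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical array-of-structures layout: the fields of one particle
   sit next to each other, so a loop that reads one field at a time
   strides through memory. */
typedef struct {
    double vx, vy, vz;
    double mass;
} ParticleAoS;

/* Straightforward loop over the array-of-structures layout. */
double kinetic_energy_aos(const ParticleAoS *p, size_t n)
{
    assert(p != NULL || n == 0);
    double e = 0.0;
    for (size_t i = 0; i < n; ++i) {
        e += 0.5 * p[i].mass *
             (p[i].vx * p[i].vx + p[i].vy * p[i].vy + p[i].vz * p[i].vz);
    }
    return e;
}

/* Same sum over a structure-of-arrays layout: each field is its own
   contiguous array, so every load in the loop is unit-stride. The
   restrict qualifiers tell the compiler the arrays do not overlap,
   which makes the loop easier to vectorize. An OpenMP directive could
   be added on top of this loop; it is left out to keep the sketch
   compiler-neutral. */
double kinetic_energy_soa(const double *restrict vx,
                          const double *restrict vy,
                          const double *restrict vz,
                          const double *restrict mass,
                          size_t n)
{
    double e = 0.0;
    for (size_t i = 0; i < n; ++i) {
        e += 0.5 * mass[i] *
             (vx[i] * vx[i] + vy[i] * vy[i] + vz[i] * vz[i]);
    }
    return e;
}
```

Both functions compute the same quantity; only the memory layout differs. Contiguous, unit-stride data of this kind is what long vector units are designed to consume, which is why layout changes are often an early step when modernizing a code for a many-core processor.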