NERSC: Powering Scientific Discovery Since 1974

Application Porting and Performance

Many applications will need code modifications to run efficiently on Cori's Intel Xeon Phi "Knights Landing" manycore processors. An application needs good thread scalability to take advantage of the 68-core Xeon Phi processor, a data layout that makes effective use of the 16 GB of on-package MCDRAM fast memory, and loop structures that exploit the 512-bit vector units. The pages that follow document strategies for improving your application's performance. Achieving good performance on Cori may take some work, but optimizations made for Cori will very likely improve your code's performance on other architectures as well.

Getting Started and Optimization Strategy

This page is meant to get you thinking about how to optimize your application for the Knights Landing (KNL) architecture that will be on Cori. It walks you through the high-level steps and gives an example using a real application that runs at NERSC.

How Cori Differs From Edison: There are several important differences between the Cori (Knights Landing) node architecture and the Edison (Ivy Bridge) node architecture that require special attention from application… Read More »

Application Case Studies

NERSC staff, along with engineers, have worked with NESAP applications to prepare for the Cori Phase 2 system based on the Xeon Phi "Knights Landing" processor. We document several optimization case studies below. Our presentations at the ISC 16 IXPUG Workshop are also available. Other pages of interest for those wishing to learn optimization strategies for Cori Phase 2 (Knights Landing): Getting Started, Measuring Arithmetic Intensity, Measuring and… Read More »

Profiling Your Application

A number of tools can help users profile applications to determine the best strategy for improving code performance on Cori. Some popular tools are VTune, CrayPat, and MAP. Read More »

Improving OpenMP Scaling

Each processor on Cori will have over 60 processor cores with 4 hardware threads each. Efficient thread scalability will be important to achieving good performance. Read More »
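To see why thread scalability matters at this core count, the sketch below applies Amdahl's law, which is a standard textbook model and not something taken from the linked page. It assumes the 68-core figure quoted at the top of this page and 4 hardware threads per core; the function name `amdahl_speedup` and the sample parallel fractions are illustrative choices. The model is idealized: it ignores synchronization overhead and the fact that hardware threads on one core share execution resources.

```python
# Idealized thread-scaling estimate using Amdahl's law.
# Assumptions: 68 cores per processor (from this page's introduction)
# and 4 hardware threads per core. The parallel fractions are made up.

CORES = 68
HW_THREADS_PER_CORE = 4
MAX_THREADS = CORES * HW_THREADS_PER_CORE


def amdahl_speedup(parallel_fraction: float, threads: int) -> float:
    """Return the ideal speedup when `parallel_fraction` of the runtime
    scales perfectly across `threads` and the rest stays serial."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


if __name__ == "__main__":
    for fraction in (0.90, 0.99, 0.999):
        speedup = amdahl_speedup(fraction, MAX_THREADS)
        print(f"parallel fraction {fraction}: "
              f"ideal speedup on {MAX_THREADS} threads = {speedup:.1f}")
```

Even a small serial fraction caps the achievable speedup well below the thread count, which is why reducing serial regions tends to pay off on a manycore processor.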

Measuring and Understanding Memory Bandwidth

It is important to understand whether your application is memory bandwidth bound, memory latency bound, or compute bound. These characteristics determine which tactics you should use to optimize your application. Read More »
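One common way to separate bandwidth-bound from compute-bound kernels is a roofline-style comparison of arithmetic intensity (floating-point operations per byte moved) against the machine's ratio of peak compute to peak bandwidth. The sketch below illustrates that idea only. The peak values are placeholders, not measured Cori figures, and the function names are hypothetical. This simple model does not detect latency-bound code, which needs profiling to diagnose.

```python
# Roofline-style classification sketch.
# PEAK_GFLOPS and PEAK_GBS are placeholder values chosen for illustration;
# substitute measured numbers for the system you are targeting.

PEAK_GFLOPS = 1000.0  # hypothetical peak compute rate, GFLOP/s
PEAK_GBS = 100.0      # hypothetical peak memory bandwidth, GB/s


def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Floating-point operations performed per byte of memory traffic."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


def attainable_gflops(intensity: float, peak_gflops: float, peak_gbs: float) -> float:
    """Upper bound on performance under the basic roofline model."""
    return min(peak_gflops, intensity * peak_gbs)


def classify(intensity: float, peak_gflops: float, peak_gbs: float) -> str:
    """Compare intensity with the machine balance (FLOPs per byte)."""
    machine_balance = peak_gflops / peak_gbs
    if intensity >= machine_balance:
        return "compute bound"
    return "memory bandwidth bound"


if __name__ == "__main__":
    # y[i] = a * x[i] + y[i] in double precision:
    # 2 FLOPs per element; 24 bytes per element (load x, load y, store y).
    ai = arithmetic_intensity(flops=2, bytes_moved=24)
    print(f"arithmetic intensity: {ai:.3f} FLOP/byte")
    print(classify(ai, PEAK_GFLOPS, PEAK_GBS))
    print(f"attainable: {attainable_gflops(ai, PEAK_GFLOPS, PEAK_GBS):.1f} GFLOP/s")
```

A kernel that lands on the bandwidth side of the comparison is more likely to benefit from improved data locality or faster memory than from additional vectorization.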


Vectorization

Enabling your application to take advantage of vectorization is an important part of achieving high performance on today's supercomputers. Vectorization lets a single instruction operate on multiple data elements in parallel within a single CPU core, thus improving performance. Read More »
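As a rough illustration of what "multiple data objects" means for the 512-bit vector units mentioned at the top of this page, the sketch below counts how many elements of a given width fit in one vector register and how a loop of a given length splits into full-width vector iterations plus a scalar remainder. The function names are hypothetical, and the lane count is only an upper bound on the speedup from vectorization; real gains depend on memory traffic, alignment, and whether the compiler can vectorize the loop at all.

```python
# Lane-count arithmetic for a 512-bit vector unit.
# VECTOR_WIDTH_BITS comes from the 512-bit figure quoted on this page;
# the helper names are illustrative.

VECTOR_WIDTH_BITS = 512


def lanes(element_bits: int, vector_bits: int = VECTOR_WIDTH_BITS) -> int:
    """Number of elements of `element_bits` that fit in one vector register."""
    if element_bits <= 0 or vector_bits % element_bits != 0:
        raise ValueError("element width must evenly divide the vector width")
    return vector_bits // element_bits


def vector_iterations(n_elements: int, element_bits: int) -> tuple:
    """Split a loop of `n_elements` into (full vector iterations, leftover
    elements that need a remainder loop or masked operation)."""
    return divmod(n_elements, lanes(element_bits))


if __name__ == "__main__":
    print("double-precision lanes:", lanes(64))
    print("single-precision lanes:", lanes(32))
    full, leftover = vector_iterations(1003, 64)
    print(f"1003 doubles: {full} full vector iterations, {leftover} leftover")
```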

Using On-Package Memory

The Knights Landing processor on Cori will have 16 GB of on-package "fast" memory (MCDRAM) with up to 5 times the bandwidth of conventional DRAM. Read More »

Using High Performance Libraries and Tools

Using libraries that are already highly tuned for the Cori architecture can help improve the performance of your application with minimal effort. Read More »

Application Readiness Papers

Deslippe, Jack, Brian Austin, Chris Daley, Woo-Sun Yang. "Lessons Learned from Optimizing Science Kernels for Intel's 'Knights Corner' Architecture." Computing in Science & Engineering, 17(3), pp. 30-42, 2015.

Zhao, Zhengji, Martijn Marsman. "Estimating the Performance Impact of the MCDRAM on KNL Using Dual-Socket Ivy Bridge nodes on Cray XC30." London, UK, May 11, 2016.

… Read More »