Latest NERSC/Intel/Cray ‘Dungeon Session’ Yields Impressive Code Speedups
Six NESAP Teams + 3 Days = Optimized Codes for KNL, Cori
August 19, 2016
Six application development teams participating in NESAP, NERSC’s next-generation code optimization effort, gathered at Intel in early August for a marathon “dungeon” session designed to help tune their codes for the Intel Xeon Phi “Knights Landing” manycore architecture and NERSC’s new Cori supercomputer.
Approximately 20 members of the NESAP teams made the trip, spending three days working closely with some 20 Intel engineers and tapping into their expertise in compilers, vectorization, KNL architecture and more. The teams represented six codes chosen last year for the NESAP program: Quantum ESPRESSO, a materials modeling code; M3D-C1, a plasma simulation code; ACME, CESM and MPAS, all climate modeling codes; and Chombo, an adaptive mesh refinement code used in flow simulations.
This was the eighth dungeon session so far; the sessions give the NESAP teams unprecedented access to Intel and Cray engineers and their expertise. Among those attending from NERSC were Taylor Barnes, Brandon Cook, Jack Deslippe, Helen He, Thorsten Kurth, Tareq Malas, Andre Ovsyannikov and Woo-Sun Yang. There were also representatives from several other facilities, including Princeton Plasma Physics Laboratory, Argonne National Laboratory, RPI and NCAR.
“Each team came in with a set of goals, parts of their applications that they wanted to investigate at a very deep level,” Deslippe said. “We try to prepare ahead of time to bring the types of problems that can only be solved with the experts at Intel and Cray present—deep questions about the architecture and how applications use the Xeon Phi processor. It’s all geared toward optimizing the codes to run on the new manycore architecture and on Cori.”
Deslippe, Barnes and Kurth joined engineers from Cray and Intel to work on the Quantum ESPRESSO application. The team achieved speedups of approximately 2x in the benchmark time, primarily by improving thread scaling and vector and streaming instruction generation with OpenMP (an application programming interface) pragmas and by employing cache blocking/tiling techniques where appropriate. They also investigated explicit management of data structures within the KNL high-bandwidth memory and were able to achieve speedups beyond the KNL performance in “cache mode,” where the high-bandwidth memory is instead configured as a cache.
Using OpenMP also led to performance improvements in the M3D-C1 code, according to Yang. The M3D-C1 team brought two source kernels to the dungeon session and focused on two key aspects of the code: optimizing the matrix assembly stage and testing particle-in-cell (PIC) kernels within M3D-C1. For the first, they used the Intel math library to streamline one of the most time-consuming operations in matrix assembly and restructured some functions to eliminate overhead and bad speculation, which led to an overall 2.8x speedup. They also parallelized the code with OpenMP, which Yang called “a very optimal way of optimizing this code,” and saw almost perfect parallel scaling up to 68 cores on a KNL node.
They achieved good code optimization in the PIC code, too. “By looking at the performance profiling, where the code spends most of its time, we identified one major function and restructured the code to optimize that part,” he explained. “This kernel has about 200,000 particles, and initially it ran in around 200 seconds, but after a series of optimizations we cut the runtime to 50 seconds—a 3.9x speedup.”
The CESM and ACME teams came away from the latest dungeon session with code improvements as well, according to He. One of their goals was to understand process and thread affinity, which is the basis for getting optimal performance on KNL and for guiding further performance optimizations for all NESAP teams.
During a mini-dungeon session at NERSC in July, He discovered that for the CESM code, using the OpenMP 4 standard affinity settings was slower than using the Intel compiler-specific KMP affinity settings, which was not the case for the other applications the team had tested. So she ran seven test scenarios of the CESM code that in theory should all have performed the same, but found that some ran twice as slow. At this most recent dungeon session, He worked closely with Intel engineers Karthik Raman and Larry Meadows, and in the end a glitch in the code was identified that was causing two extra threads to be spawned at the nested OpenMP level. Once the glitch was fixed, all seven cases ran at the same speed.
This understanding also applies directly to getting the best process and thread affinity for nested OpenMP, which both the CESM and ACME teams use to achieve better performance compared to single-level OpenMP. The ACME team also looked at optimizing its code with vectorization and threading, and achieved a 35% speedup for one of its key functions.
“The ACME and CESM teams appreciated the expert knowledge of the Intel people who were at the dungeon session and the guidance and tools they provided,” He said. “As one team member said, ‘now I am weaponized.’”
The dungeon sessions are also proving beneficial for Intel, according to Mike Greenfield, director of technical computing engineering in Intel’s Developer Relations Division. While the dungeon process is not new to the company’s product development practices, inviting a customer to participate in the sessions is, he added, emphasizing that the collaboration with NERSC and Cray has been “a very positive experience.”
“Working with NERSC is a unique opportunity for us because most of the applications NERSC has targeted are of interest to the worldwide HPC community,” Greenfield said. “It’s not just getting something to run on KNL; it’s about getting in-depth developer feedback, accelerating product readiness and advancing the scalability of applications that are of great interest to the HPC community. Through this collaboration, we want to help NERSC be even better positioned to get more value from its huge commitment to this next-generation system.”
The NESAP teams are now busy updating their production codes with the improvements they achieved during the latest dungeon session and looking forward to the next dungeon session, which will likely take place in the fall.
“Most everyone who has attended these sessions has come out with not just faster code but a much deeper knowledge of what it takes to optimize code and how the architecture and the processors work, which I think is invaluable,” Deslippe said.
About NERSC and Berkeley Lab
The National Energy Research Scientific Computing Center (NERSC) is a U.S. Department of Energy Office of Science User Facility that serves as the primary high-performance computing center for scientific research sponsored by the Office of Science. Located at Lawrence Berkeley National Laboratory, the NERSC Center serves more than 7,000 scientists at national laboratories and universities researching a wide range of problems in combustion, climate modeling, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines. Berkeley Lab is a DOE national laboratory located in Berkeley, California. It conducts unclassified scientific research and is managed by the University of California for the U.S. Department of Energy. Learn more about computing sciences at Berkeley Lab.