NERSC: Powering Scientific Discovery Since 1974

NERSC Staff Publications & Presentations

This page displays a bibliography of staff publications and presentations from Jan. 1, 2017 to present. Earlier publications are available in the archive.


T. Kurth, C. Yang, et al., "Roofline Performance Analysis of Deep Learning Kernels and Applications", The Platform for Advanced Scientific Computing Conference (PASC’21) (accepted), July 2021,


M. Del Ben, C. Yang, et al., Achieving Performance Portability on Leadership Class HPC Systems for Large Scale GW Calculations, Pacifichem Congress, December 2020,

Y. Wang, C. Yang, S. Farrell, Y. Zhang, T. Kurth, and S. Williams, "Time-Based Roofline for Deep Learning Performance Analysis", IEEE/ACM Deep Learning on Supercomputers Workshop 2020, November 2020,

S. Williams, A. Ilic, Z. Matveev, M. Katz, J. Kwack, C. Yang, and C. Bertoni, Performance Tuning with the Roofline Model on GPUs and CPUs, Half-Day Tutorial, Supercomputing Conference (SC’20), November 2020,

Zhengji Zhao, Rebecca Hartman-Baker, and Gene Cooperman, Deploying Checkpoint/Restart for Production Workloads at NERSC, A presentation at SC20 State of the Practice Talks, November 17, 2020,

M. Del Ben, C. Yang, Z. Li, F. H. da Jornada, S. G. Louie, and J. Deslippe, "Accelerating Large-Scale Excited-State GW Calculations on Leadership HPC Systems", ACM Gordon Bell Finalist, Supercomputing Conference (SC’20), November 2020,

Y. Wang, C. Yang, S. Farrell, T. Kurth, and S. Williams, "Hierarchical Roofline Analysis for Deep Learning Applications", IEEE International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems Workshop (PMBS'20), November 2020,

Benjamin Driscoll, and Zhengji Zhao, "Automation of NERSC Application Usage Report", Seventh Annual Workshop on HPC User Support Tools (HUST 2020), held in conjunction with SC20, Online, November 11, 2020,

Nicholas Balthaser, Wayne Hurlbert, Melinda Jacobsen, Owen James, Kristy Kallback-Rose, Kirill Lozinskiy, NERSC HPSS Site Update, 2020 HPSS User Forum, October 9, 2020,

Report on recent projects and challenges running HPSS at NERSC, including recent AQI issues and upcoming HPSS upgrade.

M. Del Ben, C. Yang, and J. Deslippe, Achieving Performance Portability on Hybrid GPU-CPU Architectures for a Large-Scale Material Science Code: The BerkeleyGW Case Study, IEEE International Workshop on Performance, Portability and Productivity in HPC (P3HPC), September 2020,

C. Yang, "8 Steps to 3.7 TFLOP/s on NVIDIA V100 GPU: Roofline Analysis and Other Tricks", arXiv preprint arXiv:2008.11326, August 2020,

C. Yang, GW Calculations at Scale, NERSC GPU for Science Day 2020, July 2020,

Zhengji Zhao, Using VASP on Cori, VASP Online Hands-on User Training, Berkeley, CA, June 30, 2020,

J. R. Madsen, M. G. Awan, H. Brunie, J. Deslippe, R. Gayatri, L. Oliker, Y. Wang, C. Yang, and S. Williams, "Timemory: Modular Performance Analysis for HPC", International Supercomputing Conference (ISC’20), June 2020,

Zhengji Zhao, Programming Environment and Compilation, NERSC New User Training, Berkeley, CA, June 16, 2020,

Zhengji Zhao, and Rebecca Hartman-Baker, Proposal to increase the flex queue priority, NERSC Queue Committee Meeting, Berkeley, CA, June 1, 2020,

Zhengji Zhao, Running Variable-Time Jobs on Cori, Hands-on User Training on Variable-Time Jobs, Berkeley, CA, May 21, 2020,

Abhinav Bhatele, Jayaraman J. Thiagarajan, Taylor Groves, Rushil Anirudh, Staci A. Smith, Brandon Cook, David Lowenthal, "The Case of Performance Variability on Dragonfly-based Systems", IPDPS 2020, May 21, 2020,

Zhengji Zhao, Rebecca Hartman-Baker, Checkpoint/Restart (C/R) Project Plan, Small P Project Meeting, Berkeley CA, May 15, 2020,

C. Yang, M. Del Ben, Accelerating Large-Scale GW Calculations in Material Science, NVIDIA GPU Technology Conference (GTC’20), March 2020,

C. Yang, S. Williams, Y. Wang, Roofline Performance Model for HPC and Deep Learning Applications, NVIDIA GPU Technology Conference (GTC’20), March 2020,

Shahzeb Siddiqui, "Buildtest: A Software Testing Framework with Module Operations for HPC Systems", HUST, Springer, March 25, 2020,

M. Del Ben, C. Yang, S. G. Louie, and J. Deslippe, Accelerating Large-Scale GW Calculations on Hybrid GPU-CPU Systems, American Physical Society (APS) March Meeting 2020, March 2020,

C. Yang, and J. Deslippe, Accelerate Science on Perlmutter with NERSC, American Physical Society (APS) March Meeting 2020, March 2020,

Greg Butler, Ravi Cheema, Damian Hazen, Kristy Kallback-Rose, Rei Lee, Glenn Lockwood, NERSC Community File System, March 4, 2020,

Presentation at Storage Technology Showcase providing an update on NERSC's Storage 2020 Strategy & Progress and newly deployed Community File System, including data migration process.

Nicholas Balthaser, Damian Hazen, Wayne Hurlbert, Owen James, Kristy Kallback-Rose, Kirill Lozinskiy, Moving the NERSC Archive to a Green Data Center, Storage Technology Showcase 2020, March 3, 2020,

Description of methods used and challenges involved in moving the NERSC tape archive to a new data center with environmental cooling.

Paul T. Lin, John N. Shadid, Paul H. Tsuji, "Krylov Smoothing for Fully-Coupled AMG Preconditioners for VMS Resistive MHD", Numerical Methods for Flows, Lecture Notes in Computational Science and Engineering, Volume 132, (February 23, 2020)

Zhengji Zhao, and Jessica Nettleblad, NERSC Dotfile Migration to /etc/profile.d, Consulting Team Meeting, Berkeley CA, February 18, 2020,

O. Hernandez, C. Yang, et al., Early Experience of Application Developers with OpenMP Offloading at ALCF, NERSC, and OLCF, Birds of a Feather (BoF), Exascale Computing Program (ECP) Annual Meeting, February 2020,

J. Doerfert, C. Yang, et al., OpenMP Roadmap for Accelerators Across DOE Pre-Exascale/Exascale Machines, Birds of a Feather (BoF), Exascale Computing Program (ECP) Annual Meeting, February 2020,

J. Srinivasan, C. Yang, et al., Perlmutter - a Waypoint for ECP Teams, Birds of a Feather (BoF), Exascale Computing Program (ECP) Annual Meeting, February 2020,

S. Williams, C. Yang, and J. Deslippe, Performance Tuning with the Roofline Model on GPUs and CPUs, Half-Day Tutorial, Exascale Computing Program (ECP) Annual Meeting, February 2020,

Nicholas Balthaser, Wayne Hurlbert, Long-term Data Management in the NERSC Archive, NITRD Middleware and Grid Interagency Coordination (MAGIC), February 5, 2020,

Description of data management practices contributing to long-term data integrity in the NERSC Archive.

Shahzeb Siddiqui, HPC Software Stack Testing Framework, FOSDEM, February 2, 2020,

Shahzeb Siddiqui, Buildtest: HPC Software Stack Testing Framework, Easybuild User Meeting, January 30, 2020,

Shahzeb Siddiqui, Building an Easybuild Container Library in Sylabs Cloud, Easybuild User Meeting, January 29, 2020,


Paul T. Lin, John N. Shadid, Paul H. Tsuji, "On the performance of Krylov smoothing for fully coupled AMG preconditioners for VMS resistive MHD", International Journal for Numerical Methods in Engineering, December 21, 2019, 120:1297-1309, doi: 10.1002/nme.6178

Nicholas Balthaser, Wayne Hurlbert, Kirill Lozinskiy, Owen James, Regent System Move Update, NERSC All-to-All Meeting, December 16, 2019,

Update on moving the NERSC center backup system from the Oakland Scientific Facility to LBL Building 59.

Zhengji Zhao, Automation of NERSC Application Usage Report, Application Usage Page Support Transition Meeting, Berkeley CA, December 4, 2019,

Glenn K. Lockwood, Kirill Lozinskiy, Lisa Gerhardt, Ravi Cheema, Damian Hazen, Nicholas J. Wright, "A Quantitative Approach to Architecting All-Flash Lustre File Systems", ISC High Performance 2019: High Performance Computing, edited by Michele Weiland, Guido Juckeland, Sadaf Alam, Heike Jagode, (Springer International Publishing: 2019) Pages: 183--197 doi: 10.1007/978-3-030-34356-9_16

New experimental and AI-driven workloads are moving into the realm of extreme-scale HPC systems at the same time that high-performance flash is becoming cost-effective to deploy at scale. This confluence poses a number of new technical and economic challenges and opportunities in designing the next generation of HPC storage and I/O subsystems to achieve the right balance of bandwidth, latency, endurance, and cost. In this work, we present quantitative models that use workload data from existing, disk-based file systems to project the architectural requirements of all-flash Lustre file systems. Using data from NERSC’s Cori I/O subsystem, we then demonstrate the minimum required capacity for data, capacity for metadata and data-on-MDT, and SSD endurance for a future all-flash Lustre file system.

M. Del Ben, C. Yang, F. H. da Jornada, S. G. Louie, and J. Deslippe, "Accelerating Large-Scale GW Calculations on Hybrid CPU-GPU Architectures", Supercomputing Conference (SC’19), November 2019,

Abe Singer, Shane Canon, Rebecca Hartman-Baker, Kelly L. Rowland, David Skinner, Craig Lant, "What Deploying MFA Taught Us About Changing Infrastructure", HPCSYSPROS19: HPC System Professionals Workshop, November 2019, doi: 10.5281/zenodo.3525375

NERSC is not the first organization to implement multi-factor authentication (MFA) for its users. We had seen multiple talks by other supercomputing facilities who had deployed MFA, but as we planned and deployed our MFA implementation, we found that nobody had talked about the more interesting and difficult challenges, which were largely social rather than technical. Our MFA deployment was a success, but, more importantly, much of what we learned could apply to any infrastructure change. Additionally, we developed the sshproxy service, a key piece of infrastructure technology that lessens user and staff burden and has made our MFA implementation more amenable to scientific workflows. We found great value in using robust open-source components where we could and developing tailored solutions where necessary.

S. Williams, C. Yang, A. Ilic, and K. Rogozhin, Performance Tuning with the Roofline Model on GPUs and CPUs, Half-Day Tutorial, Supercomputing Conference (SC’19), November 2019,

Glenn K. Lockwood, Kirill Lozinskiy, Kristy Kallback-Rose, NERSC's Perlmutter System: Deploying 30 PB of all-NVMe Lustre at scale, Lustre BoF at SC19, November 19, 2019,

Update at SC19 Lustre BoF on collaborative work with Cray on deploying an all-flash Lustre tier for NERSC's Perlmutter Shasta system.

Timothy G. Mattson, Yun (Helen) He, Alice E. Koniges, The OpenMP Common Core: Making OpenMP Simple Again, Book: Scientific and Engineering Computation Series, edited by William Gropp, Ewing Lusk, (The MIT Press: November 19, 2019) Pages: 320 pp

How to become a parallel programmer by learning the twenty-one essential components of OpenMP.

Hongzhang Shan, Zhengji Zhao, and Marcus Wagner, Accelerating the Performance of Modal Aerosol Module of E3SM Using OpenACC, Sixth Workshop on Accelerator Programming Using Directives (WACCPD) in SC19, November 18, 2019,

Hongzhang Shan, Zhengji Zhao, and Marcus Wagner, "Accelerating the Performance of Modal Aerosol Module of E3SM Using OpenACC", (won the best paper award), Sixth Workshop on Accelerator Programming Using Directives (WACCPD) in SC19, November 18, 2019,

Glenn K. Lockwood, Shane Snyder, Suren Byna, Philip Carns, Nicholas J. Wright, "Understanding Data Motion in the Modern HPC Data Center", 2019 IEEE/ACM Fourth International Parallel Data Systems Workshop (PDSW), Denver, CO, USA, IEEE, 2019, 74--83, doi: 10.1109/PDSW49588.2019.00012

The utilization and performance of storage, compute, and network resources within HPC data centers have been studied extensively, but much less work has gone toward characterizing how these resources are used in conjunction to solve larger scientific challenges. To address this gap, we present our work in characterizing workloads and workflows at a data-center-wide level by examining all data transfers that occurred between storage, compute, and the external network at the National Energy Research Scientific Computing Center over a three-month period in 2019. Using a simple abstract representation of data transfers, we analyze over 100 million transfer logs from Darshan, HPSS user interfaces, and Globus to quantify the load on data paths between compute, storage, and the wide-area network based on transfer direction, user, transfer tool, source, destination, and time. We show that parallel I/O from user jobs, while undeniably important, is only one of several major I/O workloads that occurs throughout the execution of scientific workflows. We also show that this approach can be used to connect anomalous data traffic to specific users and file access patterns, and we construct time-resolved user transfer traces to demonstrate that one can systematically identify coupled data motion for individual workflows.

George Michelogiannakis, Yiwen Shen, Min Yee Teh, Xiang Meng, Benjamin Aivazi, Taylor Groves, John Shalf, Madeleine Glick, Manya Ghobadi, Larry Dennison, Keren Bergman, "Bandwidth Steering in HPC using Silicon Nanophotonics", International Conference on High Performance Computing, Networking, Storage and Analysis (SC'19), November 17, 2019,

Sudheer Chunduri, Taylor Groves, Peter Mendygral, Brian Austin, Jacob Balma, Krishna Kandalla, Kalyan Kumaran, Glenn Lockwood, Scott Parker, Steven Warren, Nathan Wichmann, Nicholas Wright, "GPCNeT: Designing a Benchmark Suite for Inducing and Measuring Contention in HPC Networks", International Conference on High Performance Computing, Networking, Storage and Analysis (SC'19), November 16, 2019,

Network congestion is one of the biggest problems facing HPC systems today, affecting system throughput, performance, user experience and reproducibility. Congestion manifests as run-to-run variability due to contention for shared resources like filesystems or routes between compute endpoints. Despite its significance, current network benchmarks fail to proxy the real-world network utilization seen on congested systems. We propose a new open-source benchmark suite called the Global Performance and Congestion Network Tests (GPCNeT) to advance the state of the practice in this area. The guiding principles used in designing GPCNeT are described and the methodology employed to maximize its utility is presented. The capabilities of GPCNeT are evaluated by analyzing results from several of the world’s largest HPC systems, including an evaluation of congestion management on a next-generation network. The results show that systems of all technologies and scales are susceptible to congestion and this work motivates the need for congestion control in next-generation networks.

C. Yang, T. Kurth, and S. Williams, "Hierarchical Roofline Analysis for GPUs: Accelerating Performance Optimization for the NERSC-9 Perlmutter System", Concurrency and Computation: Practice and Experience, DOI: 10.1002/cpe.5547, November 2019,

Zhengji Zhao, Checkpointing and Restarting Jobs with DMTCP, NERSC User Training on Checkpointing and Restarting Jobs Using DMTCP, November 6, 2019,

Lixin Ge, Chou Ng, Zhengji Zhao, Jeff Hammond, and Karen Zhou, OpenMP Hackathon Acrum ACE3P, NERSC Application Readiness Meeting, October 30, 2019,

Nicholas Balthaser, Kirill Lozinskiy, Melinda Jacobsen, Kristy Kallback-Rose, NERSC Migration from Oracle tape libraries GPFS-HPSS-Integration Proof of Concept, October 16, 2019,

NERSC updates on Storage 2020 Strategy & Progress, GHI Testing, Tape Library Update, Futures

Ravi Cheema, Kristy Kallback-Rose, Storage 2020 Strategy & Progress - NERSC Site Update at HPCXXL User Group Meeting, September 24, 2019,

NERSC site update including Systems Overview, Storage 2020 Strategy & Progress, GPFS-HPSS-Integration Testing, Tape Library Update and Futures.

S. Williams, C. Yang, K. Ibrahim, T. Kurth, N. Ding, J. Deslippe, L. Oliker, "Performance Analysis using the Roofline Model", SciDAC PIs Meeting, 2019,

C. Yang, Performance-Related Activities at NERSC/CRD, RRZE Thomas Gruber Visit NERSC, July 2019,

C. Yang, The Current and Future of Roofline, LBNL Brown Bag Seminar, July 2019,

C. Yang, NERSC Application Readiness Process and Strategy, NERSC GPU for Science Day 2019, July 2019,

Zhengji Zhao, Programming Environment and Compilation, NERSC New User Training, June 21, 2019,

C. Yang, Z. Matveev, A. Ilic, and D. Marques, Performance Optimization of Scientific Codes with the Roofline Model, Half-Day Tutorial, International Supercomputing Conference (ISC’19), June 2019,

Yuping Fan, Zhiling Lan, Paul Rich, William E Allcock, Michael E Papka, Brian Austin, David Paul, "Scheduling Beyond CPUs for HPC", Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, Phoenix, AZ, ACM, June 19, 2019, 97-108, doi: 10.1145/3307681.3325401

High performance computing (HPC) is undergoing significant changes. The emerging HPC applications comprise both compute- and data-intensive applications. To meet the intense I/O demand from emerging data-intensive applications, burst buffers are deployed in production systems. Existing HPC schedulers are mainly CPU-centric. The extreme heterogeneity of hardware devices, combined with workload changes, forces the schedulers to consider multiple resources (e.g., burst buffers) beyond CPUs in decision making. In this study, we present a multi-resource scheduling scheme named BBSched that schedules user jobs based on not only their CPU requirements, but also other schedulable resources such as burst buffers. BBSched formulates the scheduling problem as a multi-objective optimization (MOO) problem and rapidly solves the problem using a multi-objective genetic algorithm. The multiple solutions generated by BBSched enable system managers to explore potential tradeoffs among various resources, and therefore obtain better utilization of all the resources. The trace-driven simulations with real system workloads demonstrate that BBSched improves scheduling performance by up to 41% compared to existing methods, indicating that explicitly optimizing multiple resources beyond CPUs is essential for HPC scheduling.

Zhengji Zhao, Running VASP on Cori KNL, VASP User Hands-on KNL Training, June 18, 2019, Berkeley CA, June 18, 2019,

C. Yang, Using Intel Tools at NERSC, Intel KNL Training, May 2019,

Nicholas Balthaser, Tape's Not Dead at LBNL/NERSC, MSST 2019 Conference, May 21, 2019,

Lightning talk on archival storage projects at NERSC for 2019 MSST conference.

Tiffany Connors, Taylor Groves, Tony Quan, Scott Hemmert, "Simulation Framework for Studying Optical Cable Failures in Dragonfly Topologies", Workshop on Scalable Networks for Advanced Computing Systems in conjunction with IPDPS, May 17, 2019,

Kirill Lozinskiy, Glenn K. Lockwood, Lisa Gerhardt, Ravi Cheema, Damian Hazen, Nicholas J. Wright, A Quantitative Approach to Architecting All‐Flash Lustre File Systems, Lustre User Group (LUG) 2019, May 15, 2019,

Kirill Lozinskiy, Lisa Gerhardt, Annette Greiner, Ravi Cheema, Damian Hazen, Kristy Kallback-Rose, Rei Lee, User-Friendly Data Management for Scientific Computing Users, Cray User Group (CUG) 2019, May 9, 2019,

Wrangling data at a scientific computing center can be a major challenge for users, particularly when quotas may impact their ability to utilize resources. In such an environment, a task as simple as listing space usage for one's files can take hours. The National Energy Research Scientific Computing Center (NERSC) has roughly 50 PBs of shared storage utilizing more than 4.6B inodes, and a 146 PB high-performance tape archive, all accessible from two supercomputers. As data volumes increase exponentially, managing data is becoming a larger burden on scientists. To ease the pain, we have designed and built a “Data Dashboard”. Here, in a web-enabled visual application, our 7,000 users can easily review their usage against quotas, discover patterns, and identify candidate files for archiving or deletion. We describe this system, the framework supporting it, and the challenges for such a framework moving into the exascale age.

Thorsten Kurth, Joshua Romero, Everett Phillips, Massimiliano Fatica, Brandon Cook, Rahul Gayatri, Zhengji Zhao, and Jack Deslippe, Porting Quantum ESPRESSO Hybrid Functional DFT to GPUs Using CUDA Fortran, Cray User Group Meeting, Montreal, Canada, May 5, 2019,

Zhengji Zhao, Introduction - The basics of compiling and running on KNL, Cori KNL: Programming and Optimization, Cray KNL training, April 16-18, 2019, Berkeley CA, April 16, 2019,

C. Yang, Performance Analysis of GPU-Accelerated Applications using the Roofline Model, Cray Center of Excellence (COE) Webinar, April 2019,

NERSC site update including Systems Overview, Storage 2020 Strategy & Progress and Superfacility Initiative.

C. Yang, Preparing NERSC Applications for Perlmutter as an Exascale Waypoint, Meet the Mentors, Pawsey Supercomputing Centre, March 2019,

R. Gayatri, and C. Yang, Optimizing Large Reductions in BerkeleyGW with CUDA, OpenACC, OpenMP 4.5 and Kokkos, NVIDIA GPU Technology Conference (GTC’19), March 2019,

C. Yang, S. Williams, Performance Analysis of GPU-Accelerated Applications using the Roofline Model, NVIDIA GPU Technology Conference (GTC’19), March 2019,

C. Yang, Preparing NERSC Applications for Perlmutter as an Exascale Waypoint, Tsukuba/LBNL Meeting, March 2019,

C. Yang, OpenACC Updates, NERSC Application Readiness Seminar, February 2019,

Zhengji Zhao, Introduction: The basics of compiling and running jobs on KNL, Cori KNL: Programming and Optimization, Cray Training, Feb 12-13, 2019, Berkeley CA, February 12, 2019,

C. Yang, Roofline Performance Analysis with nvprof, NERSC-NVIDIA Face-to-Face Meeting, February 2019,

J. Pennycook, C. Yang, and J. Deslippe, Quantitatively Assessing Performance Portability with Roofline, Exascale Computing Project (ECP) Interoperable Design of Extreme-scale Application Software (IDEAS) Webinar, January 2019,

S. Williams, J. Deslippe, C. Yang, Performance Tuning of Scientific Codes with the Roofline Model, Half-Day Tutorial, Exascale Computing Project (ECP) Annual Meeting, January 2019,

Phuong Hoai Ha, Otto J. Anshus, Ibrahim Umar, "Efficient concurrent search trees using portable fine-grained locality", IEEE Transactions on Parallel and Distributed Systems, January 14, 2019,

Abhinav Thota, Yun He, "Foreword to the Special Issue of the Cray User Group (CUG 2018)", Concurrency and Computation: Practice and Experience, January 11, 2019,

Weiqun Zhang, Ann Almgren, Vince Beckner, John Bell, Johannes Blaschke, Cy Chan, Marcus Day, Brian Friesen, Kevin Gott, Daniel Graves, Max P. Katz, Andrew Myers, Tan Nguyen, Andrew Nonaka, Michele Rosso, Samuel Williams, Michael Zingale, "AMReX: a framework for block-structured adaptive mesh refinement", Journal of Open Source Software, 2019, 4:1370,

Glenn K. Lockwood, Kirill Lozinskiy, Lisa Gerhardt, Ravi Cheema, Damian Hazen, Nicholas J. Wright, "Designing an All-Flash Lustre File System for the 2020 NERSC Perlmutter System", Proceedings of the 2019 Cray User Group, Montreal, January 1, 2019,

New experimental and AI-driven workloads are moving into the realm of extreme-scale HPC systems at the same time that high-performance flash is becoming cost-effective to deploy at scale. This confluence poses a number of new technical and economic challenges and opportunities in designing the next generation of HPC storage and I/O subsystems to achieve the right balance of bandwidth, latency, endurance, and cost. In this paper, we present the quantitative approach to requirements definition that resulted in the 30 PB all-flash Lustre file system that will be deployed with NERSC's upcoming Perlmutter system in 2020. By integrating analysis of current workloads and projections of future performance and throughput, we were able to constrain many critical design space parameters and quantitatively demonstrate that Perlmutter will not only deliver optimal performance, but effectively balance cost with capacity, endurance, and many modern features of Lustre.

Osni A. Marques, David E. Bernholdt, Elaine M. Raybourn, Ashley D. Barker, Rebecca J. Hartman-Baker, "The HPC Best Practices Webinar Series", Journal of Computational Science Education, January 2019, doi: 10.22369/issn.2153-4136/10/1/19

In this contribution, we discuss our experiences organizing the Best Practices for HPC Software Developers (HPC-BP) webinar series, an effort for the dissemination of software development methodologies, tools and experiences to improve developer productivity and software sustainability. HPC-BP is an outreach component of the IDEAS Productivity Project and has been designed to support the IDEAS mission to work with scientific software development teams to enhance their productivity and the sustainability of their codes. The series, which was launched in 2016, has just presented its 22nd webinar. We summarize and distill our experiences with these webinars, including what we consider to be “best practices” in the execution of both individual webinars and a long-running series like HPC-BP. We also discuss future opportunities and challenges in continuing the series.



Thomas Heller, Bryce Adelstein Lelbach, Kevin A. Huck, John Biddiscombe, Patricia Grubel, Alice E. Koniges, Matthias Kretz, Dominic Marcello, David Pfander, Adrian Serio, Juhan Frank, Geoffrey C. Clayton, Dirk Pflüger, David Eder, and Hartmut Kaiser, "Harnessing Billions of Tasks for a Scalable Portable Hydrodynamic Simulation of the Merger of Two Stars", The International Journal of High Performance Computing Applications, 2018, Accepted,

Florian Wende, Martijn Marsman, Jeongnim Kim, Fedor Vasilev, Zhengji Zhao, Thomas Steinke, "OpenMP in VASP: Threading and SIMD", International Journal of Quantum Chemistry, December 19, 2018,

Paul T. Lin, John N. Shadid, Jonathan J. Hu, Roger P. Pawlowski, Eric C. Cyr, "Performance of fully-coupled algebraic multigrid preconditioners for large-scale VMS resistive MHD", Journal of Computational and Applied Mathematics, December 15, 2018, 344:782-793, doi: 10.1016/

Zhengji Zhao, Transition from Edison to Cori KNL: How to compile and run on KNL, Edison to KNL User Training, December 12, 2018, Berkeley CA, December 12, 2018,

Vetter, Jeffrey S.; Brightwell, Ron; Gokhale, Maya; McCormick, Pat; Ross, Rob; Shalf, John; Antypas, Katie; Donofrio, David; Humble, Travis; Schuman, Catherine; Van Essen, Brian; Yoo, Shinjae; Aiken, Alex; Bernholdt, David; Byna, Suren; Cameron, Kirk; Cappello, Frank; Chapman, Barbara; Chien, Andrew; Hall, Mary; Hartman-Baker, Rebecca; Lan, Zhiling; Lang, Michael; Leidel, John; Li, Sherry; Lucas, Robert; Mellor-Crummey, John; Peltz Jr., Paul; Peterka, Thomas; Strout, Michelle; Wilke, Jeremiah, "Extreme Heterogeneity 2018 - Productive Computational Science in the Era of Extreme Heterogeneity: Report for DOE ASCR Workshop on Extreme Heterogeneity", December 2018, doi: 10.2172/1473756

Kurt Ferreira, Ryan E. Grant, Michael J. Levenhagen, Scott Levy, Taylor Groves, "Hardware MPI Message Matching: Insights into MPI Matching Behavior to Inform Design", Concurrency and Computation Practice and Experience, December 1, 2018,

S. Williams, A. Ilic, Z. Matveev, C. Yang, Performance Tuning of Scientific Codes with the Roofline Model, Half-Day Tutorial, Supercomputing Conference (SC’18), November 2018,

C. Yang, R. Gayatri, T. Kurth, P. Basu, Z. Ronaghi, A. Adetokunbo, B. Friesen, B. Cook, D. Doerfler, L. Oliker, J. Deslippe, and S. Williams, "An Empirical Roofline Methodology for Quantitatively Assessing Performance Portability", IEEE International Workshop on Performance, Portability and Productivity in HPC (P3HPC'18), November 2018,

R. Gayatri, C. Yang, T. Kurth, and J. Deslippe, "A Case Study for Performance Portability Using OpenMP 4.5", IEEE International Workshop on Accelerator Programming Using Directives (WACCPD'18), November 2018,

B. Austin, C. Daley, D. Doerfler, J. Deslippe, B. Cook, B. Friesen, T. Kurth, C. Yang, and N. Wright, "A Metric for Evaluating Supercomputer Performance in the Era of Extreme Heterogeneity", 9th IEEE International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS'18), November 2018,

Tim Mattson, Alice Koniges, Yun (Helen) He, David Eder, The OpenMP Common Core: A hands-on exploration, SuperComputing 2018 Tutorial, November 11, 2018,

Glenn K. Lockwood, Shane Snyder, Teng Wang, Suren Byna, Philip Carns, Nicholas J. Wright, "A Year in the Life of a Parallel File System", Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, Dallas, TX, IEEE Press, November 11, 2018, 71:1--74:1,

I/O performance is a critical aspect of data-intensive scientific computing. We seek to advance the state of the practice in understanding and diagnosing I/O performance issues through investigation of a comprehensive I/O performance data set that captures a full year of production storage activity at two leadership-scale computing facilities. We demonstrate techniques to identify regions of interest, perform focused investigations of both long-term trends and transient anomalies, and uncover the contributing factors that lead to performance fluctuation.

We find that a year in the life of a parallel file system is comprised of distinct regions of long-term performance variation in addition to short-term performance transients. We demonstrate how systematic identification of these performance regions, combined with comprehensive analysis, allows us to isolate the factors contributing to different performance maladies at different time scales. From this, we present specific lessons learned and important considerations for HPC storage practitioners.

Tianmu Xin, Zhengji Zhao, Yue Hao, Binping Xiao, Qiong Wu, Alexander Zaltsman, Kevin Smith, and Xinmin Tian, Performance Tuning to close Ninja Gap for Accelerator Physics Emulation System (APES) on Intel Xeon Phi Processors, A talk presented at the 14th International Workshop on OpenMP (IWOMP18), Barcelona, Spain, September 26, 2018,

Tianmu Xin, Zhengji Zhao, Yue Hao, Binping Xiao, Qiong Wu, Alexander Zaltsman, Kevin Smith, and Xinmin Tian, "Performance Tuning to close Ninja Gap for Accelerator Physics Emulation System (APES) on Intel Xeon Phi Processors", Proceedings of the 14th International Workshop on OpenMP (IWOMP18), Barcelona, Spain, September 26, 2018,

Yun (Helen) He, Barbara Chapman, Oscar Hernandez, Tim Mattson, Alice Koniges, Introduction to "OpenMP Common Core", OpenMPCon / IWOMP 2018 Tutorial Day, September 26, 2018,

Oscar Hernandez, Yun (Helen) He, Barbara Chapman, Using MPI+OpenMP for Current and Future Architectures, OpenMPCon 2018, September 24, 2018,

Robert Ross, Lee Ward, Philip Carns, Gary Grider, Scott Klasky, Quincey Koziol, Glenn K. Lockwood, Kathryn Mohror, Bradley Settlemyer, Matthew Wolf, "Storage Systems and I/O: Organizing, Storing, and Accessing Data for Scientific Discovery", 2018, doi: 10.2172/1491994

In September, 2018, the Department of Energy, Office of Science, Advanced Scientific Computing Research Program convened a workshop to identify key challenges and define research directions that will advance the field of storage systems and I/O over the next 5–7 years. The workshop concluded that addressing these combined challenges and opportunities requires tools and techniques that greatly extend traditional approaches and require new research directions. Key research opportunities were identified.

Yun (Helen) He, Michael Klemm, Bronis R. De Supinski, OpenMP: Current and Future Directions, 8th NCAR MultiCore Workshop (MC8), September 19, 2018,

Suzanne M. Kosina, Annette M. Greiner, Rebecca K. Lau, Stefan Jenkins, Richard Baran, Benjamin P. Bowen and Trent R. Northen, "Web of microbes (WoM): a curated microbial exometabolomics database for linking chemistry and microbes", BMC Microbiology, September 12, 2018, 18,

As microbiome research becomes increasingly prevalent in the fields of human health, agriculture and biotechnology, there exists a need for a resource to better link organisms and environmental chemistries. Exometabolomics experiments now provide assertions of the metabolites present within specific environments and how the production and depletion of metabolites is linked to specific microbes. This information could be broadly useful, from comparing metabolites across environments, to predicting competition and exchange of metabolites between microbes, and to designing stable microbial consortia. Here, we introduce Web of Microbes (WoM; freely available), the first exometabolomics data repository and visualization tool.

Teng Wang, Shane Snyder, Glenn K. Lockwood, Philip Carns, Nicholas Wright, Suren Byna, "IOMiner: Large-Scale Analytics Framework for Gaining Knowledge from I/O Logs", 2018 IEEE International Conference on Cluster Computing (CLUSTER), Belfast, UK, IEEE, 2018, 466-476, doi: 10.1109/CLUSTER.2018.00062

Modern HPC systems are collecting large amounts of I/O performance data. The massive volume and heterogeneity of this data, however, have made timely performance of in-depth integrated analysis difficult. To overcome this difficulty and to allow users to identify the root causes of poor application I/O performance, we present IOMiner, an I/O log analytics framework. IOMiner provides an easy-to-use interface for analyzing instrumentation data, a unified storage schema that hides the heterogeneity of the raw instrumentation data, and a sweep-line-based algorithm for root cause analysis of poor application I/O performance. IOMiner is implemented atop Spark to facilitate efficient, interactive, parallel analysis. We demonstrate the capabilities of IOMiner by using it to analyze logs collected on a large-scale production HPC system. Our analysis techniques not only uncover the root cause of poor I/O performance in key application case studies but also provide new insight into HPC I/O workload characterization.

Benjamin Driscoll and Zhengji Zhao, Automating NERSC Reporting, 2018 CS Summer Student Poster Session, Berkeley, CA, August 2, 2018,

Nathan Hjelm, Matthew Dosanjh, Ryan Grant, Taylor Groves, Patrick Bridges, Dorian Arnold, "Improving MPI Multi-threaded RMA Communication Performance", ACM International Conference on Parallel Processing (ICPP), August 1, 2018,

C. Yang, Introduction to Performance Scalability Tools, Department of Energy (DOE) Computational Science Graduate Fellowship (CSGF) Annual Review, July 2018,

Adam P Arkin, Robert W Cottingham, Christopher S Henry, Nomi L Harris, Rick L Stevens, Sergei Maslov, Paramvir Dehal, Doreen Ware, Fernando Perez, Shane Canon, Michael W Sneddon, Matthew L Henderson, William J Riehl, Dan Murphy-Olson, Stephen Y Chan, Roy T Kamimura, Sunita Kumari, Meghan M Drake, Thomas S Brettin, Elizabeth M Glass, Dylan Chivian, Dan Gunter, David J Weston, Benjamin H Allen, Jason Baumohl, Aaron A Best, Ben Bowen, Steven E Brenner, Christopher C Bun, John-Marc Chandonia, Jer-Ming Chia, Ric Colasanti, Neal Conrad, James J Davis, Brian H Davison, Matthew DeJongh, Scott Devoid, Emily Dietrich, Inna Dubchak, Janaka N Edirisinghe, Gang Fang, José P Faria, Paul M Frybarger, Wolfgang Gerlach, Mark Gerstein, Annette Greiner, James Gurtowski, Holly L Haun, Fei He, Rashmi Jain, Marcin P Joachimiak, Kevin P Keegan, Shinnosuke Kondo, Vivek Kumar, Miriam L Land, Folker Meyer, Marissa Mills, Pavel S Novichkov, Taeyun Oh, Gary J Olsen, Robert Olson, Bruce Parrello, Shiran Pasternak, Erik Pearson, Sarah S Poon, Gavin A Price, Srividya Ramakrishnan, Priya Ranjan, Pamela C Ronald, Michael C Schatz, Samuel M D Seaver, Maulik Shukla, Roman A Sutormin, Mustafa H Syed, James Thomason, Nathan L Tintle, Daifeng Wang, Fangfang Xia, Hyunseung Yoo, Shinjae Yoo, Dantong Yu, "KBase: the United States department of energy systems biology knowledgebase", Nature Biotechnology, July 6, 2018, 36.7, doi: 10.1038/nbt.4163.

Here we present the DOE Systems Biology Knowledgebase (KBase), an open-source software and data platform that enables data sharing, integration, and analysis of microbes, plants, and their communities. KBase maintains an internal reference database that consolidates information from widely used external data repositories. This includes over 90,000 microbial genomes from RefSeq, over 50 plant genomes from Phytozome, over 300 Biolog media formulations, and >30,000 reactions and compounds from KEGG, BIGG, and MetaCyc. These public data are available for integration with user data where appropriate (e.g., genome comparison or building species trees). KBase links these diverse data types with a range of analytical functions within a web-based user interface. This extensive community resource facilitates large-scale analyses on scalable computing infrastructure and has the potential to accelerate scientific discovery, improve reproducibility, and foster open collaboration.

Zhengji Zhao, Using VASP at NERSC, Chemistry and Materials Science Application Training, Berkeley, CA, June 29, 2018,

T. Koskela, A. Ilic, Z. Matveev, R. Belenov, C. Yang, and L. Sousa, A Practical Approach to Application Performance Tuning with the Roofline Model, Half-Day Tutorial, International Supercomputing Conference (ISC’18), June 2018,

B. Cook, C. Yang, B. Friesen, T. Kurth and J. Deslippe, "Sparse CSB_Coo Matrix-Vector and Matrix-Matrix Performance on Intel Xeon Architectures", Intel eXtreme Performance Users Group (IXPUG) at International Supercomputing Conference (ISC'18), June 2018,

T. Koskela, Z. Matveev, C. Yang, A. Adedoyin, R. Belenov, P. Thierry, Z. Zhao, R. Gayatri, H. Shan, L. Oliker, J. Deslippe, R. Green, and S. Williams, "A Novel Multi-Level Integrated Roofline Model Approach for Performance Characterization", International Supercomputing Conference (ISC'18), June 2018,

Tuomas Koskela, Zakhar Matveev, Charlene Yang, Adetokunbo Adedoyin, Roman Belenov, Philippe Thierry, Zhengji Zhao, Rahulkumar Gayatri, Hongzhang Shan, Leonid Oliker, Jack Deslippe, Ron Green, Samuel Williams, "A Novel Multi-Level Integrated Roofline Model Approach for Performance Characterization", International Conferences on High Performance Computing 2018, June 24, 2018,

Shahzeb Siddiqui, Software Stack Testing Framework, HPCKP, June 22, 2018,

Yun (Helen) He, Introduction to NERSC Resources, LBNL Computer Sciences Summer Student Classes #1, June 11, 2018,

Glenn K. Lockwood, Nicholas J. Wright, Shane Snyder, Philip Carns, George Brown, Kevin Harms, "TOKIO on ClusterStor: Connecting Standard Tools to Enable Holistic I/O Performance Analysis", Proceedings of the 2018 Cray User Group, Stockholm, SE, May 24, 2018,

At present, I/O performance analysis requires different tools to characterize individual components of the I/O subsystem, and institutional I/O expertise is relied upon to translate these disparate data into an integrated view of application performance. This process is labor-intensive and not sustainable as the storage hierarchy deepens and system complexity increases. To address this growing disparity, we have developed the Total Knowledge of I/O (TOKIO) framework to combine the insights from existing component-level monitoring tools and provide a holistic view of performance across the entire I/O stack. 

A reference implementation of TOKIO, pytokio, is presented here. Using monitoring tools included with Cray XC and ClusterStor systems alongside commonly deployed community-supported tools, we demonstrate how pytokio provides a lightweight foundation for holistic I/O performance analyses on two Cray XC systems deployed at different HPC centers. We present results from integrated analyses that allow users to quantify the degree of I/O contention that affected their jobs and probabilistically identify unhealthy storage devices that impacted their performance. We also apply pytokio to inspect the utilization of NERSC’s DataWarp burst buffer and demonstrate how pytokio can be used to identify users and applications who may stand to benefit most from migrating their workloads from Lustre to the burst buffer.

C. Yang, B. Friesen, T. Kurth, B. Cook, S. Williams, "Toward Automated Application Profiling on Cray Systems", Cray User Group conference (CUG'18), May 2018,

Ville Ahlgren, Stefan Andersson, Jim Brandt, Nicholas Cardo, Sudheer Chunduri, Jeremy Enos, Parks Fields, Ann Gentile, Richard Gerber, Joe Greenseid, Annette Greiner, Bilel Hadri, Helen He, Dennis Hoppe, Urpo Kaila, Kaki Kelly, Mark Klein, Alex Kristiansen, Steve Leak, Michael Mason, Kevin Pedretti, Jean-Guillaume Piccinali, Jason Repik, Jim Rogers, Susanna Salminen, Michael Showerman, Cary Whitney, Jim Williams, "Cray System Monitoring: Successes, Priorities, Visions", CUG 2018 Proceedings, Stockholm, Cray User Group, May 22, 2018,

Effective HPC system operations and utilization require unprecedented insight into system state, applications’ demands for resources, contention for shared resources, and system demands on center power and cooling. Monitoring can provide such insights when the necessary fundamental capabilities for data availability and usability are provided. In this paper, multiple Cray sites seek to motivate monitoring as a core capability in HPC design, through the presentation of success stories illustrating enhanced understanding and improved performance and/or operations as a result of monitoring and analysis. We present the utility, limitations, and gaps of the data necessary to enable the required insights. The capabilities developed to enable the case successes drive our identification and prioritization of monitoring system requirements. Ultimately, we seek to engage all HPC stakeholders to drive community and vendor progress on these priorities.

Stephen Leak, Annette Greiner, Ann Gentile, James Brandt, "Supporting failure analysis with discoverable, annotated log datasets", CUG 2018 Proceedings, Stockholm, Cray User Group, May 22, 2018,

Detection, characterization, and mitigation of faults on supercomputers is complicated by the large variety of interacting subsystems. Failures often manifest as vague observations like "my job failed" and may result from system hardware/firmware/software, filesystems, networks, resource manager state, and more. Data such as system logs, environmental metrics, job history, cluster state snapshots, published outage notices and user reports are routinely collected. These data are typically stored in different locations and formats for specific use by targeted consumers. Combining data sources for analysis generally requires a consumer-dependent custom approach. We present a vocabulary for describing data, including format and access details, an annotation schema for attaching observations to a dataset, and tools to aid in discovery and publishing system-related insights. We present case studies in which our analysis tools utilize information from disparate data sources to investigate failures and performance issues from user and administrator perspectives.

Nicholas Balthaser, NERSC Tape Technology, MSST 2018 Conference, May 16, 2018,

Description of tape storage technology in use at NERSC for 2018 MSST conference.

Tim Mattson, Yun (Helen) He, Beyond OpenMP Common Core, NERSC Training, May 4, 2018,

NERSC site update focusing on plans to implement new tape technology at the Berkeley Data Center. 

NERSC Site Report focusing on plans for migration of tape-based system to new location and new technology, and collection of metrics for GPFS.

Zhengji Zhao, New User Training: March 21, 2018, New User Training, Berkeley, CA, March 21, 2018,

C. Yang, T. Kurth, Roofline Performance Model and Intel Advisor, Performance Analysis and Modeling (PAM) Workshop 2018, February 2018,

Barbara Chapman, Oscar Hernandez, Yun (Helen) He, Martin Kong, Geoffroy Vallee, MPI + OpenMP Tutorial, DOE ECP Annual Meeting Tutorial, February 9, 2018,

S. Williams, J. Deslippe, C. Yang, P. Basu, Performance Tuning of Scientific Codes with the Roofline Model, Half-Day Tutorial, Exascale Computing Project (ECP) Annual Meeting, February 2018,

Alice Koniges, Yun (Helen) He, OpenMP Common Core, NERSC Training, February 6, 2018,

Lee, Jason R., et al, "Enhancing supercomputing with software defined networking", IEEE Conference on Information Networking (ICOIN), January 10, 2018,

Felix Ruehle, Johannes Blaschke, Jan-Timm Kuhr, Holger Stark, "Gravity-induced dynamics of a squirmer microswimmer in wall proximity", New J. Phys., 2018, 20:025003,

Zingale M, Almgren AS, Barrios Sazo MG, Beckner VE, Bell JB, Friesen B, Jacobs AM, Katz MP, Malone CM, Nonaka AJ, Willcox DE, Zhang W, "Meeting the Challenges of Modeling Astrophysical Thermonuclear Explosions: Castro, Maestro, and the AMReX Astrophysics Suite", 2018, doi: 10.1088/1742-6596/1031/1/012024

We describe the AMReX suite of astrophysics codes and their application to modeling problems in stellar astrophysics. Maestro is tuned to efficiently model subsonic convective flows while Castro models the highly compressible flows associated with stellar explosions. Both are built on the block-structured adaptive mesh refinement library AMReX. Together, these codes enable a thorough investigation of stellar phenomena, including Type Ia supernovae and X-ray bursts. We describe these science applications and the approach we are taking to make these codes performant on current and future many-core and GPU-based architectures.

Jialin Liu, Debbie Bard, Quincey Koziol, Stephen Bailey, Prabhat, "Searching for Millions of Objects in the BOSS Spectroscopic Survey Data with H5Boss", IEEE NYSDS'17, January 1, 2018,


C. S. Daley, D. Ghoshal, G. K. Lockwood, S. Dosanjh, L. Ramakrishnan, N. J. Wright, "Performance characterization of scientific workflows for the optimal use of Burst Buffers", Future Generation Computer Systems, December 28, 2017, doi: 10.1016/j.future.2017.12.022

Scientific discoveries are increasingly dependent upon the analysis of large volumes of data from observations and simulations of complex phenomena. Scientists compose the complex analyses as workflows and execute them on large-scale HPC systems. The workflow structures are in contrast with monolithic single simulations that have often been the primary use case on HPC systems. Simultaneously, new storage paradigms such as Burst Buffers are becoming available on HPC platforms. In this paper, we analyze the performance characteristics of a Burst Buffer and two representative scientific workflows with the aim of optimizing the usage of a Burst Buffer, extending our previous analyses (Daley et al., 2016). Our key contributions are a). developing a performance analysis methodology pertinent to Burst Buffers, b). improving the use of a Burst Buffer in workflows with bandwidth-sensitive and metadata-sensitive I/O workloads, c). highlighting the key data management challenges when incorporating a Burst Buffer in the studied scientific workflows.

Tyler Allen, Christopher S. Daley, Douglas Doerfler, Brian Austin, Nicholas J. Wright, "Performance and Energy Usage of Workloads on KNL and Haswell Architectures", High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation. PMBS 2017. Lecture Notes in Computer Science, Volume 10724, December 23, 2017,

Tiffany A. Connors, Apan Qasem, "Automatically Selecting Profitable Thread Block Sizes for Accelerated Kernels", 2017 IEEE 19th International Conference on High Performance Computing and Communications (HPCC17), December 18, 2017, doi: 10.1109/HPCC-SmartCity-DSS.2017.58

Biplab Kumar Saha, Tiffany Connors, Saami Rahman, Apan Qasem, "A Machine Learning Approach to Automatic Creation of Architecture-sensitive Performance Heuristics", 2017 IEEE 19th International Conference on High Performance Computing and Communications (HPCC), Bangkok, Thailand, December 18, 2017, doi: 10.1109/HPCC-SmartCity-DSS.2017.3

Scott Michael, Yun He, "Foreword to the Special Issue of the Cray User Group (CUG 2017)", Concurrency and Computation: Practice and Experience, December 5, 2017,

Wahid Bhimji, Debbie Bard, Kaylan Burleigh, Chris Daley, Steve Farrell, Markus Fasel, Brian Friesen, Lisa Gerhardt, Jialin Liu, Peter Nugent, Dave Paul, Jeff Porter, Vakho Tsulaia, "Extreme I/O on HPC for HEP using the Burst Buffer at NERSC", Journal of Physics: Conference Series, December 1, 2017, 898:082015,

Brian Van Straalen, David Trebotich, Andrey Ovsyannikov, Daniel T. Graves, "Exascale Scientific Applications: Programming Approaches for Scalability Performance and Portability", edited by Tjerk Straatsma, Timothy Williams, Katie Antypas, (CRC Press: December 1, 2017)

Taylor Groves, Ryan Grant, Aaron Gonzales, Dorian Arnold, "Unraveling Network-induced Memory Contention: Deeper Insights with Machine Learning", Transactions on Parallel and Distributed Systems, November 21, 2017, doi: 10.1109/TPDS.2017.2773483

Remote Direct Memory Access (RDMA) is expected to be an integral communication mechanism for future exascale systems enabling asynchronous data transfers, so that applications may fully utilize CPU resources while simultaneously sharing data amongst remote nodes. In this work we examine Network-induced Memory Contention (NiMC) on InfiniBand networks. We expose the interactions between RDMA, main-memory and cache, when applications and out-of-band services compete for memory resources. We then explore NiMC's resulting impact on application-level performance. For a range of hardware technologies and HPC workloads, we quantify NiMC and show that NiMC's impact grows with scale resulting in up to 3X performance degradation at scales as small as 8K processes even in applications that previously have been shown to be performance resilient in the presence of noise. Additionally, this work examines the problem of predicting NiMC's impact on applications by leveraging machine learning and easily accessible performance counters. This approach provides additional insights about the root cause of NiMC and facilitates dynamic selection of potential solutions. Lastly, we evaluated three potential techniques to reduce NiMC's impact, namely hardware offloading, core reservation and network throttling.

T. Koskela, A. Ilic, Z. Matveev, S. Williams, P. Thierry, and C. Yang, Performance Tuning of Scientific Codes with the Roofline Model, Half-Day Tutorial, Supercomputing Conference (SC’17), November 2017,

Colin A. MacLean, HonWai Leong, Jeremy Enos, "Improving the start-up time of python applications on large scale HPC systems", Proceedings of HPCSYSPROS 2017, Denver, CO, 2017,

B Friesen, MMA Patwary, B Austin, N Satish, Z Slepian, N Sundaram, D Bard, DJ Eisenstein, J Deslippe, P Dubey, Prabhat, "Galactos: Computing the Anisotropic 3-Point Correlation Function for 2 Billion Galaxies", November 2017, doi: 10.1145/3126908.3126927

The nature of dark energy and the complete theory of gravity are two central questions currently facing cosmology. A vital tool for addressing them is the 3-point correlation function (3PCF), which probes deviations from a spatially random distribution of galaxies. However, the 3PCF's formidable computational expense has prevented its application to astronomical surveys comprising millions to billions of galaxies. We present Galactos, a high-performance implementation of a novel, O(N^2) algorithm that uses a load-balanced k-d tree and spherical harmonic expansions to compute the anisotropic 3PCF. Our implementation is optimized for the Intel Xeon Phi architecture, exploiting SIMD parallelism, instruction and thread concurrency, and significant L1 and L2 cache reuse, reaching 39% of peak performance on a single node. Galactos scales to the full Cori system, achieving 9.8 PF (peak) and 5.06 PF (sustained) across 9636 nodes, making the 3PCF easily computable for all galaxies in the observable universe.

Damian Rouson, Ethan D Gutmann, Alessandro Fanfarillo, Brian Friesen, "Performance portability of an intermediate-complexity atmospheric research model in coarray Fortran", November 2017, doi: 10.1145/3144779.3169104

We examine the scalability and performance of an open-source, coarray Fortran (CAF) mini-application (mini-app) that implements the parallel, numerical algorithms that dominate the execution of The Intermediate Complexity Atmospheric Research (ICAR) [4] model developed at the National Center for Atmospheric Research (NCAR). The Fortran 2008 mini-app includes one Fortran 2008 implementation of a collective subroutine defined in the Committee Draft of the upcoming Fortran 2018 standard. The ability of CAF to run atop various communication layers and the increasing CAF compiler availability facilitated evaluating several compilers, runtime libraries and hardware platforms. Results are presented for the GNU and Cray compilers, each of which offers different parallel runtime libraries employing one or more communication layers, including MPI, OpenSHMEM, and proprietary alternatives. We study performance on multi- and many-core processors in distributed memory. The results show promising scaling across a range of hardware, compiler, and runtime choices on up to ~100,000 cores.

Glenn K. Lockwood, Wucherl Yoo, Suren Byna, Nicholas J. Wright, Shane Snyder, Kevin Harms, Zachary Nault, Philip Carns, "UMAMI: a recipe for generating meaningful metrics through holistic I/O performance analysis", Proceedings of the 2nd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISCS'17), Denver, CO, ACM, November 2017, 55-60, doi: 10.1145/3149393.3149395

I/O efficiency is essential to productivity in scientific computing, especially as many scientific domains become more data-intensive. Many characterization tools have been used to elucidate specific aspects of parallel I/O performance, but analyzing components of complex I/O subsystems in isolation fails to provide insight into critical questions: how do the I/O components interact, what are reasonable expectations for application performance, and what are the underlying causes of I/O performance problems? To address these questions while capitalizing on existing component-level characterization tools, we propose an approach that combines on-demand, modular synthesis of I/O characterization data into a unified monitoring and metrics interface (UMAMI) to provide a normalized, holistic view of I/O behavior.

We evaluate the feasibility of this approach by applying it to a month-long benchmarking study on two distinct large-scale computing platforms. We present three case studies that highlight the importance of analyzing application I/O performance in context with both contemporaneous and historical component metrics, and we provide new insights into the factors affecting I/O performance. By demonstrating the generality of our approach, we lay the groundwork for a production-grade framework for holistic I/O analysis.

Tim Mattson, Alice Koniges, Yun (Helen) He, Barbara Chapman, The OpenMP Common Core: A hands-on exploration, SuperComputing 2017 Tutorial, November 12, 2017,

Kurt Ferreira, Ryan E. Grant, Michael J. Levenhagen, Scott Levy, Taylor Groves, "Hardware MPI Message Matching: Insights into MPI Matching Behavior to Inform Design", ExaMPI in association with SC17, November 12, 2017,

Richard A. Gerber, Jack Deslippe, Manycore for the Masses Part 2, Intel HPC DevCon, November 11, 2017,

Richard A Gerber, NERSC Overview - Focus: Energy Technologies, November 6, 2017,

Glenn K. Lockwood, Damian Hazen, Quincey Koziol, Shane Canon, Katie Antypas, Jan Balewski, Nicholas Balthaser, Wahid Bhimji, James Botts, Jeff Broughton, Tina L. Butler, Gregory F. Butler, Ravi Cheema, Christopher Daley, Tina Declerck, Lisa Gerhardt, Wayne E. Hurlbert, Kristy A. Kallback-Rose, Stephen Leak, Jason Lee, Rei Lee, Jialin Liu, Kirill Lozinskiy, David Paul, Prabhat, Cory Snavely, Jay Srinivasan, Tavia Stone Gibbins, Nicholas J. Wright, "Storage 2020: A Vision for the Future of HPC Storage", October 20, 2017, LBNL LBNL-2001072,

As the DOE Office of Science's mission computing facility, NERSC will follow this roadmap and deploy these new storage technologies to continue delivering storage resources that meet the needs of its broad user community. NERSC's diversity of workflows encompass significant portions of open science workloads as well, and the findings presented in this report are also intended to be a blueprint for how the evolving storage landscape can be best utilized by the greater HPC community. Executing the strategy presented here will ensure that emerging I/O technologies will be both applicable to and effective in enabling scientific discovery through extreme-scale simulation and data analysis in the coming decade.

Richard A. Gerber, NERSC Overview - Focus: Berkeley Rotary Club, October 18, 2017,

Richard A. Gerber, Current and Next Generation Supercomputing and Data Analysis at NERSC, HPC Distinguished Lecture, Iowa State University & Ames Laboratory, October 18, 2017,

Taylor Groves, Networks, Damn Networks and Aries, NERSC CS/Data Seminar, October 6, 2017,

Presentation of the performance of the Cori Aries network. Highlights of monitoring and analysis efforts underway.

Douglas Doerfler, Steven Gottlieb, Carleton DeTar, Doug Toussaint, Karthik Raman, Improving the Performance of the MILC Code on Intel Knights Landing, An Overview, Intel Xeon Phi User Group Meeting 2017 Fall Meeting, September 26, 2017,

Jaehyun Han, Donghun Koo, Glenn K. Lockwood, Jaehwan Lee, Hyeonsang Eom, Soonwook Hwang, "Accelerating a Burst Buffer via User-Level I/O Isolation", Proceedings of the 2017 IEEE International Conference on Cluster Computing (CLUSTER), Honolulu, HI, IEEE, September 2017, 245-255, doi: 10.1109/CLUSTER.2017.60

Burst buffers tolerate I/O spikes in High-Performance Computing environments by using a non-volatile flash technology. Burst buffers are commonly located between parallel file systems and compute nodes, handling bursty I/Os in the middle. In this architecture, burst buffers are shared resources. The performance of an SSD is significantly reduced when it is used excessively because of garbage collection, and we have observed that SSDs in a burst buffer become slow when many users simultaneously use the burst buffer. To mitigate the performance problem, we propose a new user-level I/O isolation framework in a High-Performance Computing environment using a multi-streamed SSD. The multi-streamed SSD allocates the same flash block for I/Os in the same stream. We assign a different stream to each user; thus, the user can use the stream exclusively. To evaluate the performance, we have used open-source supercomputing workloads and I/O traces from real workloads in the Cori supercomputer at the National Energy Research Scientific Computing Center. Via user-level I/O isolation, we have obtained up to a 125% performance improvement in terms of I/O throughput. In addition, our approach reduces the write amplification in the SSDs, leading to improved SSD endurance. This user-level I/O isolation framework could be applied to deployed burst buffers without having to make any user interface changes.

Richard A. Gerber, Cori KNL Update, IXPUG 2017, Austin, TX, September 26, 2017,

Yun (Helen) He, Jack Deslippe, Enabling Applications for Cori KNL: NESAP, September 21, 2017,

NERSC Science Highlights - September 2017, NERSC Users Group Meeting 2017, September 19, 2017,

Douglas Doerfler, Brian Austin, Brandon Cook, Jack Deslippe, Krishna Kandalla, Peter Mendygral, "Evaluating the Networking Characteristics of the Cray XC-40 Intel Knights Landing Based Cori Supercomputer at NERSC", Concurrency and Computation: Practice and Experience, Volume 30, Issue 1, September 12, 2017,

Taylor Groves, Yizi Gu, Nicholas J. Wright, "Understanding Performance Variability on the Aries Dragonfly Network", HPCMASPA in association with IEEE Cluster, September 1, 2017,

Yun (Helen) He, Brandon Cook, Jack Deslippe, Brian Friesen, Richard Gerber, Rebecca Hartman-Baker, Alice Koniges, Thorsten Kurth, Stephen Leak, Woo-Sun Yang, Zhengji Zhao, Eddie Baron, Peter Hauschildt, "Preparing NERSC users for Cori, a Cray XC40 system with Intel Many Integrated Cores", Concurrency and Computation: Practice and Experience, August 2017, 30, doi: 10.1002/cpe.4291

The newest NERSC supercomputer Cori is a Cray XC40 system consisting of 2,388 Intel Xeon Haswell nodes and 9,688 Intel Xeon‐Phi “Knights Landing” (KNL) nodes. Compared to the Xeon‐based clusters NERSC users are familiar with, optimal performance on Cori requires consideration of KNL mode settings; process, thread, and memory affinity; fine‐grain parallelization; vectorization; and use of the high‐bandwidth MCDRAM memory. This paper describes our efforts preparing NERSC users for KNL through the NERSC Exascale Science Application Program, Web documentation, and user training. We discuss how we configured the Cori system for usability and productivity, addressing programming concerns, batch system configurations, and default KNL cluster and memory modes. System usage data, job completion analysis, programming and running jobs issues, and a few successful user stories on KNL are presented.

Zhaoyi Meng, Ekaterina Merkurjev, Alice Koniges, Andrea L. Bertozzi, "Hyperspectral Image Classification Using Graph Clustering Methods", IPOL Journal · Image Processing On Line, 2017, 2017-08-, doi:

Doug Jacobsen, Taylor Groves, Global Aries Counter Collection and Analysis, Cray Quarterly Meeting, July 25, 2017,

Alex Gittens et al, "Matrix Factorization at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies", 2016 IEEE International Conference on Big Data, July 1, 2017,

Barbara Chapman, Alice Koniges, Yun (Helen) He, Oscar Hernandez, and Deepak Eachempati, OpenMP, An Introduction, Scaling to Petascale Institute, XSEDE Training, Berkeley, CA, June 27, 2017,

Thorsten Kurth, William Arndt, Taylor Barnes, Brandon Cook, Jack Deslippe, Doug Doerfler, Brian Friesen, Yun (Helen) He, Tuomas Koskela, Mathieu Lobet, Tareq Malas, Leonid Oliker, Andrey Ovsyannikov, Samuel Williams, Woo-Sun Yang, Zhengji Zhao, "Analyzing Performance of Selected NESAP Applications on the Cori HPC System", High Performance Computing. ISC High Performance 2017. Lecture Notes in Computer Science, Volume 10524, June 22, 2017,

C. Yang, R. C. Bording, D. Price, and R. Nealon, "Optimizing Smoothed Particle Hydrodynamics Code Phantom on Haswell and KNL", International Supercomputing Conference (ISC'17), June 2017,

Shahzeb Siddiqui, HPC Application Testing Framework – buildtest, HPCKP, June 15, 2017,

L. Xu, C. J. Yang, D. Huang, and A. Cantoni, "Exploiting Cyclic Prefix for Turbo-OFDM Receiver Design", IEEE Access, vol. 5, pp. 15762-15775, June 2017,

Yun (Helen) He, Steve Leak, and Zhengji Zhao, Using Cori KNL Nodes, Cori KNL Training, Berkeley, CA, June 9, 2017,

Mustafa Mustafa, Deborah Bard, Wahid Bhimji, Rami Al-Rfou, Zarija Lukić, "Creating Virtual Universes Using Generative Adversarial Networks", Submitted To Sci. Rep., June 1, 2017,

Taylor A. Barnes, Thorsten Kurth, Pierre Carrier, Nathan Wichmann, David Prendergast, Paul RC Kent, Jack Deslippe, "Improved treatment of exact exchange in Quantum ESPRESSO", Computer Physics Communications, May 31, 2017,

Douglas Jacobsen and Zhengji Zhao, Instrumenting Slurm User Commands to Gain Workload Insight, Proceedings of the Cray User Group Meeting (CUG18), Stockholm, Sweden, May 20, 2017,

Yun (Helen) He, Brandon Cook, Jack Deslippe, Brian Friesen, Richard Gerber, Rebecca Hartman-Baker, Alice Koniges, Thorsten Kurth, Stephen Leak, Woo-Sun Yang, Zhengji Zhao, Eddie Baron, Peter Hauschildt, Preparing NERSC users for Cori, a Cray XC40 system with Intel Many Integrated Cores, Cray User Group 2017, Redmond, WA, May 12, 2017,

Yun (Helen) He, Brandon Cook, Jack Deslippe, Brian Friesen, Richard Gerber, Rebecca Hartman-Baker, Alice Koniges, Thorsten Kurth, Stephen Leak, Woo-Sun Yang, Zhengji Zhao, Eddie Baron, Peter Hauschildt, "Preparing NERSC users for Cori, a Cray XC40 system with Intel Many Integrated Cores", Cray User Group 2017, Redmond, WA, Best Paper First Runner-Up, May 12, 2017,

Colin MacLean, "Python Usage Metrics on Blue Waters", Cray User Group, Redmond, WA, 2017,

Jialin Liu, Quincey Koziol, Houjun Tang, François Tessier, Wahid Bhimji, Brandon Cook, Brian Austin, Suren Byna, Bhupender Thakur, Glenn K. Lockwood, Jack Deslippe, Prabhat, "Understanding the IO Performance Gap Between Cori KNL and Haswell", Proceedings of the 2017 Cray User Group, Redmond, WA, May 10, 2017,

The Cori system at NERSC has two compute partitions with different CPU architectures: a 2,004 node Haswell partition and a 9,688 node KNL partition, which ranked as the 5th most powerful supercomputer on the November 2016 Top 500 list. The compute partitions share a common storage configuration, and understanding the IO performance gap between them is important not only to NERSC/LBNL users and other national labs, but also to the relevant hardware vendors and software developers. In this paper, we have comprehensively analyzed the performance of single-core and single-node IO on the Haswell and KNL partitions, and have discovered the major bottlenecks, which include CPU frequencies and memory copy performance. We have also extended our performance tests to multi-node IO and revealed the IO cost difference caused by network latency, buffer size, and communication cost. Overall, we have developed a strong understanding of the IO gap between Haswell and KNL nodes, and the lessons learned from this exploration will guide us in designing optimal IO solutions in the many-core era.

Mario Melara, Todd Gamblin, Gregory Becker, Robert French, Matt Belhorn, Kelly Thompson, Peter Scheibel, Rebecca Hartman-Baker, "Using Spack to Manage Software on Cray Supercomputers", Cray User Group 2017, 2017,

Koskela TS, Deslippe J, Friesen B, Raman K, "Fusion PIC code performance analysis on the Cori KNL system", May 2017,

We study the attainable performance of Particle-In-Cell codes on the Cori KNL system by analyzing a miniature particle push application based on the fusion PIC code XGC1. We start from the most basic building blocks of a PIC code and build up the complexity to identify the kernels that cost the most in performance and focus optimization efforts there. Particle push kernels operate at high arithmetic intensity (AI) and are not likely to be memory bandwidth or even cache bandwidth bound on KNL. Therefore, we see only minor benefits from the high bandwidth memory available on KNL, and achieving good vectorization is shown to be the most beneficial optimization path, with a theoretical yield of up to 8x speedup on KNL. In practice we are able to obtain up to a 4x gain from vectorization due to limitations set by the data layout and memory latency.

Zhengji Zhao, Martijn Marsman, Florian Wende, and Jeongnim Kim, "Performance of Hybrid MPI/OpenMP VASP on Cray XC40 Based on Intel Knights Landing Many Integrated Core Architecture", May 8, 2017,

Abstract - With the recent installation of Cori, a Cray XC40 system with Intel Xeon Phi Knights Landing (KNL) many integrated core (MIC) architecture, NERSC is transitioning from the multi-core to the more energy-efficient many-core era. The developers of VASP, a widely used materials science code, have adopted MPI/OpenMP parallelism to better exploit the increased on-node parallelism, wider vector units, and the high bandwidth on-package memory (MCDRAM) of KNL. To achieve optimal performance, KNL specifics relevant for the build, boot and run time setup must be explored. In this paper, we present the performance analysis of representative VASP workloads on Cori, focusing on the effects of the compilers, libraries, and boot/run time options such as the NUMA/MCDRAM modes, HyperThreading, huge pages, core specialization, and thread scaling. The paper is intended to serve as a KNL performance guide for VASP users, but it will also benefit other KNL users.

Kirill Lozinskiy, GPFS & HPSS Interface (GHI), Spectrum Scale User Group 2017, April 5, 2017,

This presentation gives a brief overview of integration between the High Performance Storage System (HPSS) and the General Parallel File System (GPFS).

Friesen, B., Baron, E., Parrent, J. T., Thomas, R. C., Branch, D., Nugent, P., Hauschildt, P. H., Foley, R. J., Wright, D. E., Pan, Y.-C., Filippenko, A. V., Clubb, K. I., Silverman, J. M., Maeda, K., Shivvers, I., Kelly, P. L., Cohen, D. P., Rest, A., Kasen, D., "Optical and ultraviolet spectroscopic analysis of SN 2011fe at late times", Monthly Notices of the Royal Astronomical Society, February 27, 2017, 467:2392-2411, doi: 10.1093/mnras/stx241

We present optical spectra of the nearby Type Ia supernova SN 2011fe at 100, 205, 311, 349 and 578 d post-maximum light, as well as an ultraviolet (UV) spectrum obtained with the Hubble Space Telescope at 360 d post-maximum light. We compare these observations with synthetic spectra produced with the radiative transfer code phoenix. The day +100 spectrum can be well fitted with models that neglect collisional and radiative data for forbidden lines. Curiously, including these data and recomputing the fit yields a quite similar spectrum, but with different combinations of lines forming some of the stronger features. At day +205 and later epochs, forbidden lines dominate much of the optical spectrum formation; however, our results indicate that recombination, not collisional excitation, is the most influential physical process driving spectrum formation at these late times. Consequently, our synthetic optical and UV spectra at all epochs presented here are formed almost exclusively through recombination-driven fluorescence. Furthermore, our models suggest that the UV spectrum even as late as day +360 is optically thick and consists of permitted lines from several iron-peak species. These results indicate that the transition to the ‘nebular’ phase in Type Ia supernovae is complex and highly wavelength dependent.

Tutorial with handouts on the use of Shifter with an image of chos=sl64 from PDSF.


Rebecca Hartman-Baker, Craypat and Reveal, NERSC New User Training, February 23, 2017,

Rebecca Hartman-Baker, Accounts and Allocations, NERSC New User Training, February 23, 2017,

Rebecca Hartman-Baker, NERSC Overview, NERSC New User Training, February 23, 2017,

Richard A Gerber, February 2017 Allocations and Usage Update, February 23, 2017,

NERSC Users Group Webinar on 2017 allocations for Cori Knights Landing nodes and Queue Wait Time Reduction Actions

Florian Wende, Martijn Marsman, Zhengji Zhao, Jeongnim Kim, "Porting VASP from MPI to MPI+OpenMP [SIMD]: Optimization Strategies, Insights and Feature Proposals", 13th International Workshop on OpenMP (IWOMP), September 20–22, 2017, Stony Brook, NY, USA, February 18, 2017,

Richard A Gerber, NERSC Allocations Forecast for PIs, NERSC Users Group, February 16, 2017,

Evan Berkowitz, Thorsten Kurth, Amy Nicholson, Balint Joo, Enrico Rinaldi, Mark Strother, Pavlos Vranas, Andre Walker-Loud, "Two-Nucleon Higher Partial-Wave Scattering from Lattice QCD", Phys. Lett. B, February 10, 2017,

Richard A. Gerber, Allocations and Usage Update for DOE Program Managers, February 7, 2017,

Richard A. Gerber, NERSC's KNL System: Cori, Exascale Computing Project All-Hands Meeting, February 1, 2017,

Presented at the DOE Exascale Computing Project annual meeting in Knoxville, TN.

Wangyi Liu, Alice Koniges, Kevin Gott, David Eder, John Barnard, Alex Friedman, Nathan Masters, Aaron Fisher, "Surface tension models for a multi-material ALE code with AMR", Computers & Fluids, January 2017, doi:

Special Issue on Data-Intensive Scalable Computing Systems, Special Issue of Parallel Computing, Pages: 1-96, January 31, 2017,

Richard A. Gerber, Overview of NERSC, Presented to SLAC Computing, January 24, 2017,

Taylor Groves, Characterizing Power and Performance in HPC Networks, Future Technologies Group at ORNL, January 10, 2017,

Taylor Groves, Characterizing and Improving Power and Performance in HPC Networks, Advanced Technology Group -- NERSC, January 8, 2017,

Jan-Timm Kuhr, Johannes Blaschke, Felix Ruehle, Holger Stark, "Collective sedimentation of squirmers under gravity", Soft Matter, 2017, 13:7548-7555,

Jack Deslippe, Doug Doerfler, Brandon Cook, Tareq Malas, Samuel Williams, Sudip Dosanjh, "Optimizing Science Applications for the Cori, Knights Landing, System at NERSC", Advances in Parallel Computing, Volume 30: New Frontiers in High Performance Computing and Big Data, (January 1, 2017)

Ryan E. Grant, Taylor Groves, Simon Hammond, K. Scott Hemmert, Michael Levenhagen, Ron Brightwell, "Handbook of Exascale Computing: Network Communications", (ISBN:978-1466569003 Chapman and Hall: January 1, 2017)