NERSC Staff Publications & Presentations

This page displays a bibliography of staff publications and presentations from Jan. 1, 2017 to present. Earlier publications are available in the archive.


2023

Zhengji Zhao, Ermal Rrapaj, Sridutt Bhalachandra, Brian Austin, Hai Ah Nam, Nicholas Wright, "Power Analysis of NERSC Production Workloads", In Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis (SC-W '23), New York, NY, USA, Association for Computing Machinery, November 2023, 1279-1287, doi: 10.1145/3624062.3624200

Anish Govind, Sridutt Bhalachandra, Zhengji Zhao, Ermal Rrapaj, Brian Austin, and Hai Ah Nam, "Comparing Power Signatures of HPC Workloads: Machine Learning vs Simulation", In Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis (SC-W '23), New York, NY, USA, Association for Computing Machinery, November 2023, 1890-1893, doi: 10.1145/3624062.3624274

Nicholas Balthaser, Tape Library Incident Management, 2023 HPSS User Forum, October 28, 2023,

A look at how the NERSC HPSS Team handles off-hours tape library incidents.

Jan Balewski, Daan Camps, Katherine Klymko, Andrew Tritt, "Efficient Quantum Counting and Quantum Content-Addressable Memory for DNA similarity", accepted at the IEEE23 quantum week conference, August 1, 2023,

We present QCAM, a quantum analogue of Content-Addressable Memory (CAM), useful for finding matches in two sequences of bit-strings. Our QCAM implementation takes advantage of Grover's search algorithm and proposes a highly-optimized quantum circuit implementation of the QCAM oracle. Our circuit construction uses the parallel uniformly controlled rotation gates, which were used in previous work to generate QBArt encodings. These circuits have a high degree of quantum parallelism which reduces their critical depth. The optimal number of repetitions of the Grover iterator used in QCAM depends on the number of true matches and hence is input dependent. We additionally propose a hardware-efficient implementation of the quantum counting algorithm (HEQC) that can infer the optimal number of Grover iterations from the measurement of a single observable. We demonstrate the QCAM application for computing the Jaccard similarity between two sets of k-mers obtained from two DNA sequences.
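As a classical point of reference for the quantities named in this abstract, the sketch below computes the Jaccard similarity of two k-mer sets and the textbook estimate of the optimal Grover iteration count for a known number of matches. It is not the QCAM or HEQC circuit from the paper; the function names and example sequences are hypothetical, and the iteration count uses the standard relation that the success probability after j iterations is sin²((2j+1)θ) with sin θ = sqrt(M/N).

```python
import math


def kmers(sequence, k):
    """Return the set of all length-k substrings (k-mers) of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def grover_iterations(n_items, n_matches):
    """Textbook estimate of the Grover iteration count that maximizes
    the success probability for n_matches marked items out of n_items."""
    if not 0 < n_matches <= n_items:
        raise ValueError("need 0 < n_matches <= n_items")
    theta = math.asin(math.sqrt(n_matches / n_items))
    return max(0, round(math.pi / (4 * theta) - 0.5))


def success_probability(n_items, n_matches, iterations):
    """Probability of measuring a marked item after the given iterations."""
    theta = math.asin(math.sqrt(n_matches / n_items))
    return math.sin((2 * iterations + 1) * theta) ** 2


if __name__ == "__main__":
    # Hypothetical toy sequences, chosen only for illustration.
    a = kmers("ACGTAC", 3)
    b = kmers("ACGTTT", 3)
    print("Jaccard similarity:", jaccard(a, b))
    print("Grover iterations for 1 match in 4 items:", grover_iterations(4, 1))
```

The iteration count depends on the number of matches, which is why the abstract describes it as input dependent; a quantum counting step is one way to estimate that number before running the search.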

Jeffrey C. Carver, Nasir Eisty, Hai Ah Nam, Irina Tezaur, "Special Issue on the Future of Research Software Engineers in the United States—Part II", Computing in Science & Engineering (Volume: 24, Issue: 6, Nov.-Dec. 2022), June 2023, doi: 10.1109/MCSE.2023.3267696

Zhengji Zhao, Brian Austin, Stefan Maintz, Martijn Marsman, VASP Performance on Cray EX Based on NVIDIA A100 GPUs and AMD Milan CPUs, Cray User Group 2023, Helsinki, Finland, May 11, 2023,

Zhengji Zhao, Brian Austin, Stefan Maintz, Martijn Marsman, "VASP Performance on Cray EX Based on NVIDIA A100 GPUs and AMD Milan CPUs", Cray User Group 2023, Helsinki, Finland, May 11, 2023,

Nicholas Balthaser, Rosario Martinez, Current HPSS Projects 2023, LLNL HPSS DevOps Quarterly Telecon, May 11, 2023,

A look at current HPSS operational and development projects at NERSC, as of May 2023.

Nicholas Balthaser, Rosario Martinez, HPSS Software Development at NERSC, NERSC Software Engineering Group, May 4, 2023,

Overview of NERSC software development efforts for the High Performance Storage System (HPSS)

Jeffrey C. Carver, Nasir Eisty, Hai Ah Nam, Irina Tezaur, "Special Issue on the Future of Research Software Engineers in the United States—Part I", IEEE Computing in Science & Engineering (Volume: 24, Issue: 5, Sept.-Oct. 2022), May 2023, doi: 10.1109/MCSE.2023.3261221

Adam Winick, Jan Balewski, Gang Huang, Yilun Xu, "Predicting dynamics and measuring crosstalk with simultaneous Rabi experiments", APS March Meeting 2023, March 8, 2023,

Jan Balewski, Mercy G. Amankwah, Roel Van Beeumen, E. Wes Bethel, Talita Perciano, Daan Camps, "Quantum-parallel vectorized data encodings and computations on trapped-ions and transmons QPUs", January 19, 2023,

2022

"Co-design Center for Exascale Machine Learning Technologies (ExaLearn)", 2023 Exascale Computing Project Annual Meeting, December 22, 2022,

Zhengji Zhao and Rebecca Hartman-Baker, Checkpoint/Restart Project Update, Internal report, November 29, 2022,

Kevin Gott, Creating (or not creating) a Portable Test, SC22 Software Engineering and Reuse in Modeling, Simulation, and Data Analytics for Science and Engineering BOF Lightning Talk, November 16, 2022,

Zhengji Zhao, Introduction to the CRI standard 1.0, A BOF presentation in the International Conference for High Performance Computing, Networking, Storage and Analysis (SC22), November 15, 2022,

Tarun Malviya, Zhengji Zhao, Rebecca Hartman-Baker, Gene Cooperman, Extending MPI API Support in MANA, A lightning talk presented in SuperCheck-SC22 held in conjunction with SC22, November 14, 2022,

Zhengji Zhao, Welcome to SuperCheck-SC22, Opening/Closing Remarks in the third International Symposium on Checkpointing for Supercomputing (SuperCheck-SC22), held in conjunction with SC22, November 14, 2022,

Jan Balewski, Zhenying Liu, Alexander Tsyplikhin, Manuel Lopez Roland, Kristofer Bouchard, "Time-series ML-regression on Graphcore IPU-M2000 and Nvidia A100", 2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 14, 2022, doi: 10.1109/PMBS56514.2022.00019

Izumi Barker, Mozhgan Kabiri Chimeh, Kevin Gott, Thomas Papatheodore, Mary P. Thomas, "Approaching Exascale: Best Practices for Training a Diverse Workforce Using Hackathons", SC22: Ninth SC Workshop on Best Practices for HPC Training and Education, 2022,

Given the anticipated growth of the high-performance computing market, HPC is challenged with expanding the size, diversity, and skill of its workforce while also addressing post-pandemic distributed workforce protocols and an ever-expanding ecosystem of architectures, accelerators and software stacks. 

As we move toward exascale computing, training approaches need to address how best to prepare future computational scientists and enable established domain researchers to stay current and master tools needed for exascale architectures.

This paper explores adding in-person and virtual hackathons to the training mix to bridge traditional programming curricula and hands-on skills needed among the diverse communities. We outline current learning and development programs available; explain benefits and challenges in implementing hackathons for training; share specific use cases, including training “readiness,” outcomes and sustaining progress; discuss how to engage diverse communities—from early career researchers to veteran scientists; and recommend best practices for implementing these events into their training mix.

Luca Fedeli, Axel Huebl, France Boillod-Cerneux, Thomas Clark, Kevin Gott, Conrad Hillairet, Stephan Jaure, Adrien Leblanc, Rémi Lehe, Andrew Myers, Christelle Piechurski, Mitsuhisa Sato, Neil Zaïm, Weiqun Zhang, Jean-Luc Vay, Henri Vincenti, "Pushing the frontier in the design of laser-based electron accelerators with groundbreaking mesh-refined particle-in-cell simulations on exascale-class supercomputers", In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '22), IEEE Press, 2022, 3:1–12, doi: 10.5555/3571885.3571889

We present a first-of-kind mesh-refined (MR) massively parallel Particle-In-Cell (PIC) code for kinetic plasma simulations optimized on the Frontier, Fugaku, Summit, and Perlmutter supercomputers. Major innovations, implemented in the WarpX PIC code, include: (i) a three level parallelization strategy that demonstrated performance portability and scaling on millions of A64FX cores and tens of thousands of AMD and Nvidia GPUs (ii) a groundbreaking mesh refinement capability that provides between 1.5× to 4× savings in computing requirements on the science case reported in this paper, (iii) an efficient load balancing strategy between multiple MR levels. The MR PIC code enabled 3D simulations of laser-matter interactions on Frontier, Fugaku, and Summit, which have so far been out of the reach of standard codes. These simulations helped remove a major limitation of compact laser-based electron accelerators, which are promising candidates for next generation high-energy physics experiments and ultra-high dose rate FLASH radiotherapy.

Robin Shao, Thorsten Kurth, Zhengji Zhao, NERSC Job Script Generator, HUST-22: 9th International Workshop on HPC User Support Tools, November 14, 2022,

NERSC is the primary scientific computing facility for DOE’s Office of Science. NERSC supports diverse production workloads across a wide range of scientific disciplines, which requires a rather complicated queue structure with various resource limits and priorities. It has been challenging for users to generate proper job scripts to optimally use the systems. We developed a Slurm job script generator, a web application to help users not only generate job scripts but also learn how the batch system works. The job script generator was first deployed in 2016 to help generate an optimal process/threads affinity for the hybrid MPI + OpenMP applications for NERSC’s Cori system, and was recently extended to support more systems and use cases. In this talk, we will present the features supported in our job script generator, and describe the code design and implementation, which is easily adaptable to other centers who deploy Slurm.
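The idea of generating a batch script with process/thread affinity settings can be sketched as follows. This is not the NERSC Job Script Generator's code: the function name, the default node shape (64 cores with 2 hardware threads each), and the `regular` QOS value are assumptions made for illustration. The `#SBATCH` options, the `srun` options, and the `OMP_*` environment variables are standard Slurm and OpenMP settings.

```python
def make_job_script(nodes, tasks_per_node, omp_threads, walltime, executable,
                    cores_per_node=64, hw_threads_per_core=2, qos="regular"):
    """Build a Slurm batch script for a hybrid MPI + OpenMP job.

    cores_per_node, hw_threads_per_core, and qos are hypothetical defaults;
    adjust them to the target system.
    """
    if not 1 <= tasks_per_node <= cores_per_node:
        raise ValueError("tasks_per_node must be between 1 and cores_per_node")

    # Give each MPI task an equal share of whole cores, counted in
    # logical CPUs (hardware threads), which is what --cpus-per-task expects.
    cpus_per_task = (cores_per_node // tasks_per_node) * hw_threads_per_core
    if not 1 <= omp_threads <= cpus_per_task:
        raise ValueError("omp_threads must fit in the CPUs given to each task")

    total_tasks = nodes * tasks_per_node
    lines = [
        "#!/bin/bash",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        f"#SBATCH --cpus-per-task={cpus_per_task}",
        f"#SBATCH --time={walltime}",
        f"#SBATCH --qos={qos}",
        "",
        f"export OMP_NUM_THREADS={omp_threads}",
        "export OMP_PLACES=threads",
        "export OMP_PROC_BIND=spread",
        "",
        # Pass the task layout to srun explicitly rather than relying on
        # inheritance from the #SBATCH lines.
        f"srun --ntasks={total_tasks} --cpus-per-task={cpus_per_task} "
        f"--cpu-bind=cores {executable}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_job_script(2, 16, 4, "00:30:00", "./app"))
```

Computing `--cpus-per-task` from the node shape, rather than asking the user for it, is the kind of detail such a generator can take off the user's hands.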

Gene Cooperman, Dahong Li, Zhengji Zhao, "Debugging MPI Implementations via Reduction-to-Primitives", Third International Symposium on Checkpointing for Supercomputing (SuperCheck-SC22), held in conjunction with SC22, To be published with IEEE CPS, November 14, 2022,

Testing correctness of either a new MPI implementation or a transparent checkpointing package for MPI is inherently difficult. A bug is often observed when running a correctly written MPI application, and it produces an error. Tracing the bug to a particular subsystem of the MPI package is difficult due to issues of complex parallelism, race conditions, etc. This work provides tools to decide if the bug is: in the subsystem implementing collective communication; or in the subsystem implementing point-to-point communication; or in some other subsystem. The tools were produced in the context of testing a new system, MANA. MANA is not a standalone MPI implementation, but rather a package for transparent checkpointing of MPI applications. In addition, a short survey of other debugging tools for MPI is presented. The strategy of transforming the execution for purposes of diagnosing a bug appears to be distinct from most existing debugging approaches.

Nan Ding, Samuel Williams, Hai Ah Nam, Taylor Groves, Muaaz Gul Awan, LeAnn Lindsey, Christopher Daley, Oguz Selvitopi, Leonid Oliker, Nicholas Wright, "Methodology for Evaluating the Potential of Disaggregated Memory Systems", 2022 IEEE/ACM International Workshop on Resource Disaggregation in High-Performance Computing (RESDIS), November 2022, doi: 10.1109/RESDIS56595.2022.00006

Taylor Groves, Chris Daley, Rahulkumar Gayatri, Hai Ah Nam, Nan Ding, Lenny Oliker, Nicholas J. Wright, Samuel Williams, "A Methodology for Evaluating Tightly-integrated and Disaggregated Accelerated Architectures", 2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2022, doi: 10.1109/PMBS56514.2022.00012

Zhengji Zhao, Running VASP on Perlmutter, A presentation in the VASP Hands-on user training (Online), September 27, 2022,

Zhengji Zhao, Requirements Report on the Checkpoint/Restart Interface Standard, NERSC Lunchtime Talk, August 22, 2022,

Rebecca Hartman-Baker, Zhengji Zhao, Checkpoint/Restart Project Update, CR Collaboration Day, August 9, 2022,

Zhengji Zhao, Welcome to C/R Requirements Gathering Workshop, Opening/Closing Remarks in the Checkpoint/Restart Requirements Gathering Workshop (Online), July 12, 2022,

Alexander Ladd, Jan Balewski, Kyung G. Kim, Kristofer Bouchard, Roy Ben-Shalom, "Scaling and Benchmarking an Evolutionary Algorithm for Constructing Biophysical Neuronal Models", Frontiers in Neuroinformatics, May 22, 2022,

Yun (Helen) He, Rebecca Hartman-Baker, "Best Practices for NERSC Training", Journal of Computational Science Education, April 2022, 13:23-26, doi: 10.22369/issn.2153-4136/13/1/4

Yilun Xu, Gang Huang, Jan Balewski, Ravi K. Naik, Alexis Morvan, Brad Mitchell, Kasra Nowrouzi, David I. Santiago, Irfan Siddiqi, "Automatic Qubit Characterization and Gate Optimization with QubiC", ACM Transactions on Quantum Computing, April 13, 2022,

Jan Balewski, Gang Huang, Adam Winick, Yilun Xu, David I Santiago, Irfan Siddiqi, Ravi K Naik, Quantum processor crosstalk mitigation using QubiC controller aided by NERSC HPC, APS March Meeting 2022, March 16, 2022,

Kevin Gott, Weiqun Zhang, Andrew Myers, Ann S. Almgren, John B. Bell, Advances in GPU Methodologies in AMReX, 2022 SIAM Conference on Parallel Processing for Scientific Computing, 2022,

The AMReX software framework for massively parallel block-structured AMR applications has undergone extensive improvements to run efficiently on GPU supercomputers, especially the DOE exascale systems Perlmutter, Frontier and Aurora. The latest generation of computing technologies has led to additional studies in performance and algorithmic design with a focus on usability and scientific achievement. These advancements have demonstrated substantial gains across the AMReX suite of applications, including WarpX, Nyx, Castro, Pele, MFix-Exa and AMR-Wind. 

This talk will give an overview of recent AMReX advancements in GPU design and implementation. Topics include advancements in porting to AMD and Intel software frameworks, advances and remaining deficiencies in GPU performance and new technologies explored to enhance AMReX’s capabilities and prepare for the next-generation of scientific research.

Anshu Dubey, Klaus Weide, Jared O’Neal, Akash Dhruv, Sean Couch, J. Austin Harris, Tom Klosterman, Rajeev Jain, Johann Rudi, Bronson Messer, Michael Pajkos, Jared Carlson, Ran Chu, Mohamed Wahib, Saurabh Chawdhary, Paul M. Ricker, Dongwook Lee, Katie Antypas, Katherine M. Riley, Christopher Daley, Murali Ganapathy, Francis X. Timmes, Dean M. Townsley, Marcos Vanella, John Bachan, Paul M. Rich, Shravan Kumar, Eirik Endeve, W. Raphael Hix, Anthony Mezzacappa, Thomas Papatheodore, "Flash-X: A multiphysics simulation software instrument", SoftwareX, 2022, 19:101168,

Flash-X is a highly composable multiphysics software system that can be used to simulate physical phenomena in several scientific domains. It derives some of its solvers from FLASH, which was first released in 2000. Flash-X has a new framework that relies on abstractions and asynchronous communications for performance portability across a range of increasingly heterogeneous hardware platforms. Flash-X is meant primarily for solving Eulerian formulations of applications with compressible and/or incompressible reactive flows. It also has a built-in, versatile Lagrangian framework that can be used in many different ways, including implementing tracers, particle-in-cell simulations, and immersed boundary methods.

Orhean, A. I., Giannakou, A., Antypas, K., Raicu, I., & Ramakrishnan, L., "Evaluation of a scientific data search infrastructure", Concurrency and Computation: Practice and Experience, 2022, 34(27):e7261, doi: 10.1002/cpe.7261

The ability to search over large scientific datasets has become crucial to next-generation scientific discoveries as data generated from scientific facilities grow dramatically. In previous work, we developed and deployed ScienceSearch, a search infrastructure for scientific data which uses machine learning to automate metadata creation. Our current deployment is deployed atop a container based platform at a HPC center. In this article, we present an evaluation and discuss our experiences with the ScienceSearch infrastructure. Specifically, we present a performance evaluation of ScienceSearch's infrastructure focusing on scalability trends. The obtained results show that ScienceSearch is able to serve up to 130 queries/min with latency under 3 s. We discuss our infrastructure setup and evaluation results to provide our experiences and a perspective on opportunities and challenges of our search infrastructure.

Jonathan Schwartz, Chris Harris, Jacob Pietryga, Huihuo Zheng, Prashant Kumar, Anastasiia Visheratina, Nicholas A. Kotov, Brianna Major, Patrick Avery, Peter Ercius, Utkarsh Ayachit, Berk Geveci, David A. Muller, Alessandro Genova, Yi Jiang, Marcus Hanwell, Robert Hovden, "Real-time 3D analysis during electron tomography using tomviz", Nature Communications, 2022, 13:4458, doi: 10.1038/s41467-022-32046-0

2021

Antypas, KB and Bard, DJ and Blaschke, JP and Canon, RS and Enders, B and Shankar, M and Somnath, S and Stansberry, D and Uram, TD and Wilkinson, SR, "Enabling discovery data science through cross-facility workflows", Institute of Electrical and Electronics Engineers (IEEE), December 2021, 3671-3680, doi: 10.1109/bigdata52589.2021.9671421

Zhengji Zhao, Welcome to SuperCheck-SC21, Opening/Closing Remarks in the Second International Symposium on Checkpointing for Supercomputing (SuperCheck-SC21) held in conjunction with the International Conference for High Performance Computing, Networking, Storage and Analysis (SC21), November 15, 2021,

Yao Xu, Zhengji Zhao, Rohan Garg, Harsh Khetawat, Rebecca Hartman-Baker, and Gene Cooperman, MANA-2.0: A Future-Proof Design for Transparent Checkpointing of MPI at Scale, November 15, 2021,

Yao Xu, Zhengji Zhao, Rohan Garg, Harsh Khetawat, Rebecca Hartman-Baker, Gene Cooperman, "MANA-2.0: A Future-Proof Design for Transparent Checkpointing of MPI at Scale", Second International Symposium on Checkpointing for Supercomputing, held in conjunction with SC21. Conference website: https://supercheck.lbl.gov/, November 15, 2021,

K. Z. Ibrahim, T. Nguyen, H. Nam, W. Bhimji, S. Farrell, L. Oliker, M. Rowan, N. J. Wright, S. Williams, "Architectural Requirements for Deep Learning Workloads in HPC Environments", 2021 International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), IEEE, November 2021, doi: 10.1109/PMBS54543.2021.00007

Lisa Claus, Fighting Climate Change with Computer Software, Berkeley Lab Research SLAM 2021, 2021,

Zhengji Zhao, A Preliminary Report on MANA Status on Cori, MANA Collaboration Kick-off Meeting, October 26, 2021,

Nicholas Balthaser, Francis Dequenne, Melinda Jacobsen, Owen James, Kristy Kallback-Rose, Kirill Lozinskiy, NERSC HPSS Site Update, 2021 HPSS User Forum, October 20, 2021,

Melinda Jacobsen, Rosario Martinez, Containerization of HSI/HTAR Clients at NERSC, 2021 HPSS User Forum, October 13, 2021,

Zhengji Zhao, Rebecca Hartman-Baker, "Checkpoint/Restart Vision and Strategies for NERSC’s Production Workloads", https://escholarship.org/uc/item/48v5r5rj, August 20, 2021,

Zhengji Zhao, Rebecca Hartman-Baker, Checkpoint/Restart (C/R) Vision at NERSC, C/R project update meeting (internal), August 13, 2021,

Yuxing Peng, Jonathan Skone, Callista Christ, Hakizumwami Runesha, "Skyway: A Seamless Solution for Bursting Workloads from On-Premises HPC Clusters to Commercial Clouds", PEARC '21: Practice and Experience in Advanced Research Computing, ACM, July 17, 2021, 46:1-5, doi: 10.1145/3437359.3465607

C. Yang, Y. Wang, S. Farrell, T. Kurth, and S. Williams, "Hierarchical Roofline Performance Analysis for Deep Learning Applications", Science and Information (SAI) Conference series - Computing Conference 2021, July 2021,

Hannah E Ross, Sambit K Giri, Garrelt Mellema, Keri L Dixon, Raghunath Ghara, Ilian T Iliev, "Redshift-space distortions in simulations of the 21-cm signal from the cosmic dawn", Monthly Notices of the Royal Astronomical Society, July 9, 2021, 506:3717–3733, doi: 10.1093/mnras/stab1822

T. Kurth, C. Yang, et al, "Roofline Performance Analysis of Deep Learning Kernels and Applications", The Platform for Advanced Scientific Computing Conference (PASC’21) (accepted), July 2021,

Rowan, M.E., Gott, K.N., Deslippe, J., Huebl, A., Thévenet, M., Lehe, R., Vay, J.L., "In-situ assessment of device-side compute work for dynamic load balancing in a GPU-accelerated PIC code", Proceedings of the Platform for Advanced Scientific Computing Conference, 2021, 1-11, doi: 10.1145/3468267.3470614

Maintaining computational load balance is important to the performant behavior of codes which operate under a distributed computing model. This is especially true for GPU architectures, which can suffer from memory oversubscription if improperly load balanced. We present enhancements to traditional load balancing approaches and explicitly target GPU architectures, exploring the resulting performance. A key component of our enhancements is the introduction of several GPU-amenable strategies for assessing compute work. These strategies are implemented and benchmarked to find the best data collection methodology for in-situ assessment of GPU compute work. For the fully kinetic particle-in-cell code WarpX, which supports MPI+CUDA parallelism, we investigate the performance of the improved dynamic load balancing via a strong scaling-based performance model and show that, for a laser-ion acceleration test problem run with up to 6144 GPUs on Summit, the enhanced dynamic load balancing achieves 62% to 74% (88% when running on 6 GPUs) of the theoretically predicted maximum speedup; for the 96-GPU case, we find that dynamic load balancing improves performance relative to baselines without load balancing (3.8x speedup) and with static load balancing (1.2x speedup). Our results provide important insights into dynamic load balancing and performance assessment, and are particularly relevant in the context of distributed memory applications run on GPUs.
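To illustrate why measured compute costs matter for load balancing, the sketch below redistributes work units across ranks with a simple greedy heuristic (largest cost first, always to the least-loaded rank) and reports the balance efficiency, defined here as mean load divided by maximum load. This is a generic illustration under assumed inputs, not the algorithm or cost-collection strategy used in WarpX; the function names and cost values are hypothetical.

```python
import heapq


def greedy_balance(costs, n_ranks):
    """Assign work units to ranks, largest cost first, always choosing
    the currently least-loaded rank. Returns (assignment, loads)."""
    heap = [(0.0, rank) for rank in range(n_ranks)]
    heapq.heapify(heap)
    loads = [0.0] * n_ranks
    assignment = [None] * len(costs)
    order = sorted(range(len(costs)), key=lambda i: costs[i], reverse=True)
    for unit in order:
        load, rank = heapq.heappop(heap)
        assignment[unit] = rank
        loads[rank] = load + costs[unit]
        heapq.heappush(heap, (loads[rank], rank))
    return assignment, loads


def round_robin(costs, n_ranks):
    """Static baseline: unit i goes to rank i % n_ranks, ignoring cost."""
    loads = [0.0] * n_ranks
    for i, cost in enumerate(costs):
        loads[i % n_ranks] += cost
    return loads


def efficiency(loads):
    """Mean load over maximum load; 1.0 means perfectly balanced."""
    peak = max(loads)
    return (sum(loads) / len(loads)) / peak if peak > 0 else 1.0


if __name__ == "__main__":
    # Hypothetical per-unit compute costs, e.g. measured kernel times.
    costs = [5, 4, 3, 3, 2, 1]
    _, balanced = greedy_balance(costs, 2)
    print("cost-aware efficiency:", efficiency(balanced))
    print("round-robin efficiency:", efficiency(round_robin(costs, 2)))
```

In a dynamic setting the costs would be re-measured periodically and the assignment recomputed, which is why the quality of the in-situ cost measurement matters.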

Weile Wei, Eduardo D'Azevedo, Kevin Huck, Arghya Chatterjee, Oscar Hernandez, Hartmut Kaiser, "Memory Reduction using a Ring Abstraction over GPU RDMA for Distributed Quantum Monte Carlo Solver", The Platform for Advanced Scientific Computing (PASC) Conference, July 5, 2021,

Deborah J. Bard, Mark R. Day, Bjoern Enders, Rebecca J. Hartman–Baker, John Riney III, Cory Snavely, Gabor Torok, "Automation for Data-Driven Research with the NERSC Superfacility API", Lecture Notes in Computer Science, Springer International Publishing, 2021, 333, doi: 10.1007/978-3-030-90539-2_22

"Better Scientific Software (BSSw) Blogs", H. Nam, Better Scientific Software (https://bssw.io/), 2021,

Nicholas Balthaser, NERSC HPSS - ASCR Update, NERSC-9 IPT Weekly Meeting, June 28, 2021,

A look at the NERSC storage hierarchy with a focus on long-term high-capacity storage.

Doru Thom Popovici, Andrew Canning, Zhengji Zhao, Lin-Wang Wang, John Shalf, Data Layouts are Important, ARRAY 2021 (Online), June 21, 2021,

Zhengji Zhao, Rebecca Hartman-Baker, Our Vision and Strategies to Checkpoint/Restart Production Workloads at NERSC, NERSC Resilience Workshop (Online), June 17, 2021,

Doru Thom Popovici, Andrew Canning, Zhengji Zhao, Lin-Wang Wang, John Shalf, "A Systematic Approach to Improving Data Locality Across Fourier Transforms and Linear Algebra Operations", International Conference on Supercomputing 2021 (ICS-2021) (Online), https://ics21.github.io/, June 14, 2021,

Zhang W, Myers A, Gott K, Almgren A, Bell J, "AMReX: Block-structured adaptive mesh refinement for multiphysics applications", The International Journal of High Performance Computing Applications, 2021, 35(6):508-526, doi: 10.1177/10943420211022811

Block-structured adaptive mesh refinement (AMR) provides the basis for the temporal and spatial discretization strategy for a number of Exascale Computing Project applications in the areas of accelerator design, additive manufacturing, astrophysics, combustion, cosmology, multiphase flow, and wind plant modeling. AMReX is a software framework that provides a unified infrastructure with the functionality needed for these and other AMR applications to be able to effectively and efficiently utilize machines from laptops to exascale architectures. AMR reduces the computational cost and memory footprint compared to a uniform mesh while preserving accurate descriptions of different physical processes in complex multiphysics algorithms. AMReX supports algorithms that solve systems of partial differential equations in simple or complex geometries and those that use particles and/or particle–mesh operations to represent component physical processes. In this article, we will discuss the core elements of the AMReX framework such as data containers and iterators as well as several specialized operations to meet the needs of the application projects. In addition, we will highlight the strategy that the AMReX team is pursuing to achieve highly performant code across a range of accelerator-based architectures for a variety of different applications.

Zhengji Zhao, Checkpoint/Restart VASP Jobs Using MANA on Cori, NERSC User Training (Online), May 25, 2021,

Zhengji Zhao, Checkpoint/Restart MPI Applications with MANA on Cori, NERSC User Training (Online), May 7, 2021,

Zhengji Zhao, and Rebecca Hartman-Baker, Checkpoint/Restart Deployment at NERSC, PEAD-SIG Meeting at Cray User Group Meeting 2021 (Online), May 6, 2021,

Grzegorz Muszynski, Prabhat, Jan Balewski, Karthik Kashinath, Michael Wehner, Vitaliy Kurlin, "Atmospheric Blocking Pattern Recognition in Global Climate Model Simulation Data", ICPR 2020, May 5, 2021,

Sravani Konda, Dunni Aribuki, Weiqun Zhang, Kevin Gott, Christopher Lishka, Experiences Supporting DPC++ in AMReX, IWOCL/SYCLcon 2021, 2021,

William B Andreopoulos, Alexander M Geller, Miriam Lucke, Jan Balewski, Alicia Clum, Natalia Ivanova, Asaf Levy, "Deeplasmid: Deep learning accurately separates plasmids from bacterial chromosomes", Nucleic Acids Research, April 2, 2021,

Revathi Jambunathan, Don E. Willcox, Andrew Myers, Jean-Luc Vay, Ann S. Almgren, Ligia Diana Amorim, John B. Bell, Lixin Ge, Kevin N. Gott, David Grote, Axel Huebl, Rémi Lehe, Cho-Kuen Ng, Michael Rowan, Olga Shapoval, Maxence Thevenet, Eloise J. Yang, Weiqun Zhang, Yinjiang Zhao, Edoardo Zoni, Particle-in-Cell Simulations of Pulsar Magnetospheres, SIAM Conference on Computational Science and Engineering 2021, 2021,

WarpX is a highly scalable, electromagnetic particle-in-cell code developed as part of the Exascale Computing Project. While its primary purpose is to simulate plasma-based particle accelerators, its core PIC routines and advanced algorithms to mitigate numerical artifacts in mesh-refinement simulations can also be leveraged to study particle acceleration in the astrophysical context. In this presentation, we report on the use of WarpX to model pulsar magnetospheres and on the main challenge in using a fully-kinetic approach to model pulsar magnetospheres: the disparate length-scales that span the simulation domain. Indeed, the smallest skin-depth in the critical current-sheet region is six orders of magnitude smaller than the size of the domain required to model the pulsar magnetosphere. Resolving these small length-scales with a uniform grid is intractable even on large supercomputers. As a work-around, existing PIC simulations decrease the scale-difference by reducing the magnetic-field strength of the pulsar. We will present preliminary work on extending WarpX to model pulsar magnetospheres and study the effect of scaling-down the magnetic field-strength on the predictions of Poynting vector and braking-index of the pulsar. We will also explore the use of mesh-refinement for modeling current-sheet regions, which will enable us to extend the current state-of-the-art by enabling simulations with stronger magnetic fields.

Kevin Gott, Weiqun Zhang, Andrew Myers, Preparing AMReX for Exascale: Async I/O, Fused Launches and Other Recent Advancements, SIAM Conference on Computational Science and Engineering 2021, March 1, 2021,

AMReX, the block-structured AMR ECP Co-Design Center, is currently developing its software framework in preparation for the upcoming exascale systems, Frontier and Aurora. AMReX is targeting performance portable strategies that can be implemented natively in C++, require no additional dependencies, and can yield runtime improvements in CUDA, HIP and DPC++. The goal is to make AMReX-based applications as performant as possible on the next-generation exascale systems as soon as they are available. 

This talk will be an overview of some of AMReX’s advancements for targeting these supercomputers, focusing on general purpose algorithms that can be useful to the broader computational community. Discussed features will include asynchronous I/O, automated fused GPU kernel launches and other recent additions that are shaping AMReX’s workflow. An overview of the status of AMReX’s ECP applications will also be presented, highlighting how these algorithms are already impacting the scientific community.

H. Nam, D. Rouson, K. Niemeyer, N. Eisty, C. Rubio-Gonzalez, SIAM CSE21 Minisymposium: Better Scientific Software Fellowship, SIAM CSE 2021. https://www.siam.org/conferences/cm/conference/cse21, March 1, 2021,

Smith, J. S. and Nebgen, B. and Mathew, N. et al., "Automated discovery of a robust interatomic potential for aluminum", February 23, 2021, doi: 10.1038/s41467-021-21376-0

Thijs Steel, Daan Camps, Karl Meerbergen, Raf Vandebril, "A Multishift, Multipole Rational QZ Method with Aggressive Early Deflation", SIAM Journal on Matrix Analysis and Applications, February 19, 2021, 42:753-774, doi: 10.1137/19M1249631

In the article “A Rational QZ Method” by D. Camps, K. Meerbergen, and R. Vandebril [SIAM J. Matrix Anal. Appl., 40 (2019), pp. 943–972], we introduced rational QZ (RQZ) methods. Our theoretical examinations revealed that the convergence of the RQZ method is governed by rational subspace iteration, thereby generalizing the classical QZ method, whose convergence relies on polynomial subspace iteration. Moreover the RQZ method operates on a pencil more general than Hessenberg–upper triangular, namely, a Hessenberg pencil, which is a pencil consisting of two Hessenberg matrices. However, the RQZ method can only be made competitive to advanced QZ implementations by using crucial add-ons such as small bulge multishift sweeps, aggressive early deflation, and optimal packing. In this paper we develop these techniques for the RQZ method. In the numerical experiments we compare the results with state-of-the-art routines for the generalized eigenvalue problem and show that the presented method is competitive in terms of speed and accuracy.

Steven Farrell, Murali Emani, Jacob Balma, Lukas Drescher, Aleksandr Drozd, Andreas Fink, Geoffrey Fox, David Kanter, Thorsten Kurth, Peter Mattson, Dawei Mu, Amit Ruhela, Kento Sato, Koichi Shirahata, Tsuguchika Tabaru, Aristeidis Tsaris, Jan Balewski, Ben Cumming, Takumi Danjo, Jens Domke, Takaaki Fukai, Naoto Fukumoto, Tatsuya Fukushi, Balazs Gerofi, Takumi Honda, Toshiyuki Imamura, Akihiko Kasagi, Kentaro Kawakami, Shuhei Kudo, Akiyoshi Kuroda, Maxime Martinasso, Satoshi Matsuoka, Henrique Mendonça, Kazuki Minami, Prabhat Ram, Takashi Sawada, Mallikarjun Shankar, Tom St. John, Akihiro Tabuchi, Venkatram Vishwanath, Mohamed Wahib, Masafumi Yamazaki, Junqi Yin, "MLPerf HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems", 2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC), February 13, 2021, doi: 10.1109/MLHPC54614.2021.00009

Prashant Singh Chouhan, Harsh Khetawat, Neil Resnik, Jain Twinkle, Rohan Garg, Gene Cooperman, Rebecca Hartman-Baker and Zhengji Zhao, Improving scalability and reliability of MPI-agnostic transparent checkpointing for production workloads at NERSC, First International Symposium on Checkpointing for Supercomputing. Conference website: https://supercheck.lbl.gov/archive/supercheck21, February 4, 2021,

Lisa Claus, High-Performance Multifrontal Solver with Low-Rank Compression, Berkeley Lab's 2021 Computing Sciences Area Postdoc Symposium, 2021,

P. Maris, J.P. Vary, P. Navratil, W.E. Ormand, H. Nam, D.J. Dean, "Origin of the anomalous long lifetime of 14C", Phys. Rev. Lett. 105, 202502 (2011), 2021,

C. Yang, Accelerating Large-Scale Excited-State GW Calculations in Material Science, 2nd Berkeley Excited States Conference (BESC2021), January 2021,

Eric Suchyta, Scott Klasky, Norbert Podhorszki, Matthew Wolf, Abolaji Adesoji, CS Chang, Jong Choi, Philip E Davis, Julien Dominski, Stéphane Ethier, Ian Foster, Kai Germaschewski, Berk Geveci, Chris Harris, Kevin A Huck, Qing Liu, Jeremy Logan, Kshitij Mehta, Gabriele Merlo, Shirley V Moore, Todd Munson, Manish Parashar, David Pugmire, Mark S Shephard, Cameron W Smith, Pradeep Subedi, Lipeng Wan, Ruonan Wang, Shuangxi Zhang, "The Exascale Framework for High Fidelity coupled Simulations (EFFIS): Enabling whole device modeling in fusion science", The International Journal of High Performance Computing Applications, 2021, 36:106-128, doi: 10.1177/10943420211019119

2020

M. Del Ben, C. Yang, et al, Achieving Performance Portability on Leadership Class HPC Systems for Large Scale GW Calculations, Pacifichem Congress, December 2020,

Yosuke Oyama, Jan Balewski, and more, "The Case for Strong Scaling in Deep Learning: Training Large 3D CNNs With Hybrid Parallelism", IEEE Transactions on Parallel and Distributed Systems, December 1, 2020,

Yilun Xu, Gang Huang, Jan Balewski, Ravi Naik, Alexis Morvan, Bradley Mitchell, Kasra Nowrouzi, David I. Santiago, Irfan Siddiqi, "QubiC: An open source FPGA-based control and measurement system for superconducting quantum information processors", IEEE Transactions on Quantum Engineering, https://arxiv.org/abs/2101.00071, December 1, 2020,

Y. Wang, C. Yang, S. Farrell, Y. Zhang, T. Kurth, and S. Williams, "Time-Based Roofline for Deep Learning Performance Analysis", IEEE/ACM Deep Learning on Supercomputers Workshop (DLS'20), November 2020,

S. Williams, C. Yang, et al., Performance Tuning with the Roofline Model on GPUs and CPUs, Half-Day Tutorial, Supercomputing Conference (SC’20), November 2020,

Zhengji Zhao, Rebecca Hartman-Baker, and Gene Cooperman, Deploying Checkpoint/Restart for Production Workloads at NERSC, A presentation at SC20 State of the Practice Talks, November 17, 2020,

M. Del Ben, C. Yang, Z. Li, F. H. da Jornada, S. G. Louie, and J. Deslippe, "Accelerating Large-Scale Excited-State GW Calculations on Leadership HPC Systems", ACM Gordon Bell Finalist, Supercomputing Conference (SC’20), November 2020,

Weile Wei, Arghya Chatterjee, Kevin Huck, Oscar Hernandez, Hartmut Kaiser, "Performance Analysis of a Quantum Monte Carlo Application on Multiple Hardware Architectures Using the HPX Runtime", Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA), November 13, 2020,

B Enders, D Bard, C Snavely, L Gerhardt, J Lee, B Totzke, K Antypas, S Byna, R Cheema, S Cholia, M Day, A Gaur, A Greiner, T Groves, M Kiran, Q Koziol, K Rowland, C Samuel, A Selvarajan, A Sim, D Skinner, R Thomas, G Torok, "Cross-facility science with the Superfacility Project at LBNL", IEEE/ACM WND Annual Workshop on Extreme-scale Experiment-in-the-Loop Computing (XLOOP), November 12, 2020, 1-7, doi: 10.1109/XLOOP51963.2020.00006

Daan Camps, Roel Van Beeumen, "Approximate quantum circuit synthesis using block encodings", PHYSICAL REVIEW A, November 11, 2020, 102, doi: 10.1103/PhysRevA.102.052411

One of the challenges in quantum computing is the synthesis of unitary operators into quantum circuits with polylogarithmic gate complexity. Exact synthesis of generic unitaries requires an exponential number of gates in general. We propose a novel approximate quantum circuit synthesis technique by relaxing the unitary constraints and interchanging them for ancilla qubits via block encodings. This approach combines smaller block encodings, which are easier to synthesize, into quantum circuits for larger operators. Due to the use of block encodings, our technique is not limited to unitary operators and can be applied for the synthesis of arbitrary operators. We show that operators which can be approximated by a canonical polyadic expression with a polylogarithmic number of terms can be synthesized with polylogarithmic gate complexity with respect to the matrix dimension.

Gang Huang, Yilun Xu, Jan Balewski, QubiC - Qubits Control Systems at LBNL, Supercomputing Conference, SC20, November 11, 2020,

Benjamin Driscoll and Zhengji Zhao, "Automation of NERSC Application Usage Report", Seventh Annual Workshop on HPC User Support Tools (HUST 2020), held in conjunction with SC20, Online, November 11, 2020,

Max P. Katz, Ann Almgren, Maria Barrios Sazo, Kiran Eiden, Kevin Gott, Alice Harpole, Jean M. Sexton, Don E. Willcox, Weiqun Zhang, Michael Zingale, "Preparing nuclear astrophysics for exascale", SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2020, 91:1-12, doi: 10.5555/3433701.3433822

Astrophysical explosions such as supernovae are fascinating events that require sophisticated algorithms and substantial computational power to model. Castro and MAESTROeX are nuclear astrophysics codes that simulate thermonuclear fusion in the context of supernovae and X-ray bursts. Examining these nuclear burning processes using high resolution simulations is critical for understanding how these astrophysical explosions occur. In this paper we describe the changes that have been made to these codes to transform them from standard MPI + OpenMP codes targeted at petascale CPU-based systems into a form compatible with the pre-exascale systems now online and the exascale systems coming soon. We then discuss what new science is possible to run on systems such as Summit and Perlmutter that could not have been achieved on the previous generation of supercomputers.

Gabor Torok, Mark R. Day, Rebecca J. Hartman-Baker, Cory Snavely, "Iris: allocation banking and identity and access management for the exascale era", SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2020, 42:1-11, doi: 10.5555/3433701.3433756

Nicholas Balthaser, Wayne Hurlbert, Melinda Jacobsen, Owen James, Kristy Kallback-Rose, Kirill Lozinskiy, NERSC HPSS Site Update, 2020 HPSS User Forum, October 9, 2020,

Report on recent projects and challenges running HPSS at NERSC, including recent AQI issues and upcoming HPSS upgrade.

Zhengji Zhao, Checkpoint/Restart Options on Cori, NERSC User Group Monthly Webinar, September 24, 2020,

M. Del Ben, C. Yang, and J. Deslippe, Achieving Performance Portability on Hybrid GPU-CPU Architectures for a Large-Scale Material Science Code: The BerkeleyGW Case Study, IEEE International Workshop on Performance, Portability and Productivity in HPC (P3HPC), September 2020,

Daan Camps, Thomas Mach, Raf Vandebril, David Watkins, "On pole-swapping algorithms for the eigenvalue problem", ETNA - Electronic Transactions on Numerical Analysis, September 18, 2020, 52:480-508, doi: 10.1553/etna_vol52s480

Pole-swapping algorithms, which are generalizations of the QZ algorithm for the generalized eigenvalue problem, are studied. A new modular (and therefore more flexible) convergence theory that applies to all pole-swapping algorithms is developed. A key component of all such algorithms is a procedure that swaps two adjacent eigenvalues in a triangular pencil. An improved swapping routine is developed, and its superiority over existing methods is demonstrated by a backward error analysis and numerical tests. The modularity of the new convergence theory and the generality of the pole-swapping approach shed new light on bi-directional chasing algorithms, optimally packed shifts, and bulge pencils, and allow the design of novel algorithms.

D. Camps, R. Van Beeumen, C. Yang, "Quantum Fourier Transform Revisited", Numerical Linear Algebra with Applications, September 15, 2020, 28:e2331, doi: https://doi.org/10.1002/nla.2331

Miroslav Urbanek, Daan Camps, Roel Van Beeumen, Wibe A. de Jong, "Chemistry on quantum computers with virtual quantum subspace expansion", Journal of Chemical Theory and Computation, August 21, 2020, 16:5425–5431, doi: 10.1021/acs.jctc.0c00447

C. Yang, "8 Steps to 3.7 TFLOP/s on NVIDIA V100 GPU: Roofline Analysis and Other Tricks", arXiv preprint arXiv:2008.11326, August 2020,

C. Yang, GW Calculations at Scale, NERSC GPU for Science Day 2020, July 2020,

Zhengji Zhao, Using VASP on Cori, VASP Online Hands-on User Training, Berkeley, CA, June 30, 2020,

J. R. Madsen, M. G. Awan, H. Brunie, J. Deslippe, R. Gayatri, L. Oliker, Y. Wang, C. Yang, and S. Williams, "Timemory: Modular Performance Analysis for HPC", International Supercomputing Conference (ISC’20), June 2020,

Zhengji Zhao, Programming Environment and Compilation, NERSC New User Training, Berkeley, CA, June 16, 2020,

Hendrickson, Bruce; Bland, Buddy; Chen, Jackie; Colella, Phil; Dart, Eli; Dongarra, Jack; Dunning, Thom; Foster, Ian; Gerber, Richard; Harken, Rachel, et al., "ASCR@40: Highlights and Impacts of ASCR's Programs", June 1, 2020,

Zhengji Zhao, and Rebecca Hartman-Baker, Proposal to increase the flex queue priority, NERSC Queue Committee Meeting, Berkeley, CA, June 1, 2020,

Zhengji Zhao, Running Variable-Time Jobs on Cori, Hands-on User training on Variable-Time Jobs, Berkeley, CA, May 21, 2020,

Abhinav Bhatele, Jayaraman J. Thiagarajan, Taylor Groves, Rushil Anirudh, Staci A. Smith, Brandon Cook, David Lowenthal, "The Case of Performance Variability on Dragonfly-based Systems", IPDPS 2020, May 21, 2020,

Zhengji Zhao, Rebecca Hartman-Baker, Checkpoint/Restart (C/R) Project Plan, Small P Project Meeting, Berkeley CA, May 15, 2020,

Kevin Gott, Andrew Myers, Weiqun Zhang, AMReX in 2020: Porting for Performance to GPGPU Systems, 2020 Performance, Portability, and Productivity in HPC Forum, 2020,

M D Poat, J Lauret, J Porter, Jan Balewski, "STAR Data Production Workflow on HPC: Lessons Learned & Best Practices", 19th International Workshop on Advanced Computing and Analysis Techniques 2019, April 1, 2020,

C. Yang, M. Del Ben, Accelerating Large-Scale GW Calculations in Material Science, NVIDIA GPU Technology Conference (GTC’20), March 2020,

C. Yang, S. Williams, Y. Wang, Roofline Performance Model for HPC and Deep Learning Applications, NVIDIA GPU Technology Conference (GTC’20), March 2020,

Shahzeb Siddiqui, "Buildtest: A Software Testing Framework with Module Operations for HPC Systems", HUST, Springer, March 25, 2020, doi: https://doi.org/10.1007/978-3-030-44728-1_1

Brandon Cook, Jack Deslippe, Jonathan Madsen, Kevin Gott, Muaaz Awan, Enabling 800 Projects for GPU-Accelerated Science on Perlmutter at NERSC, GTC 2020, 2020,

The National Energy Research Scientific Computing Center (NERSC) is the mission HPC center for the U.S. Department of Energy Office of Science and supports the needs of 800+ projects and 7,000+ scientists with advanced HPC and data capabilities. NERSC’s newest system, Perlmutter, is an upcoming Cray system with heterogeneous nodes including AMD CPUs and NVIDIA Volta-Next GPUs. It will be the first NERSC flagship system with GPUs. Preparing our diverse user base for the new system is a critical part of making the system successful in enabling science at scale. The NERSC Exascale Science Application Program is responsible for preparing the simulation, data, and machine learning workloads to take advantage of the new architecture. We'll outline our strategy to enable our users to take advantage of the new architecture in a performance-portable way and discuss early outcomes. We'll highlight our use of tools and performance models to evaluate application readiness for Perlmutter and how we effectively frame the conversation about GPU optimization with our wide user base. In addition, we'll highlight a number of activities we are undertaking in order to make Perlmutter a more productive system when it arrives through compiler, library, and tool development. We'll also cover outcomes from a series of case studies that demonstrate our strategy to enable users to take advantage of the new architecture. We'll discuss the programming model used to port codes to GPUs, the strategy used to optimize code bottlenecks, and the GPU vs. CPU speedup achieved so far. The codes will include Tomopy (tomographic reconstruction), Exabiome (genomics de novo assembly), and AMReX (Adaptive Mesh Refinement software framework).

M. Del Ben, C. Yang, S. G. Louie, and J. Deslippe, Accelerating Large-Scale GW Calculations on Hybrid GPU-CPU Systems, American Physical Society (APS) March Meeting 2020, March 2020,

C. Yang, and J. Deslippe, Accelerate Science on Perlmutter with NERSC, American Physical Society (APS) March Meeting 2020, March 2020,

Greg Butler, Ravi Cheema, Damian Hazen, Kristy Kallback-Rose, Rei Lee, Glenn Lockwood, NERSC Community File System, March 4, 2020,

Presentation at Storage Technology Showcase providing an update on NERSC's Storage 2020 Strategy & Progress and newly deployed Community File System, including data migration process.

Nicholas Balthaser, Damian Hazen, Wayne Hurlbert, Owen James, Kristy Kallback-Rose, Kirill Lozinskiy, Moving the NERSC Archive to a Green Data Center, Storage Technology Showcase 2020, March 3, 2020,

Description of methods used and challenges involved in moving the NERSC tape archive to a new data center with environmental cooling.

Paul T. Lin, John N. Shadid, Paul H. Tsuji, "Krylov Smoothing for Fully-Coupled AMG Preconditioners for VMS Resistive MHD", Numerical Methods for Flows, Lecture Notes in Computational Science and Engineering, Volume 132, (February 23, 2020)

Zhengji Zhao, and Jessica Nettleblad, NERSC Dotfile Migration to /etc/profile.d, Consulting Team Meeting, Berkeley CA, February 18, 2020,

Ann S. Almgren, John B. Bell, Kevin N. Gott, Weiqun Zhang, Andrew Myers, AMReX: A Block-Structured AMR Software Framework for the Exascale, 2020 SIAM Conference on Parallel Processing for Scientific Computing, 2020,

AMReX is a software framework for the development of block-structured AMR algorithms on current and future architectures. AMR reduces the computational cost and memory footprint compared to a uniform mesh while preserving the essentially local descriptions of different physical processes in complex multiphysics algorithms. AMReX supports a number of different time-stepping strategies and spatial discretizations, and incorporates data containers and iterators for mesh-based fields, particle data and irregular embedded boundary (cut cell) representations of complex geometries. Current AMReX applications include accelerator design, additive manufacturing, astrophysics, combustion, cosmology, microfluidics, materials science and multiphase flow. In this talk I will focus on AMReX's strategy for balancing readability, usability, maintainability and performance across multiple applications and architectures.

Kevin Gott, Andrew Myers, Weiqun Zhang, John Bell, Ann Almgren, AMReX on GPUs: Strategies, Challenges and Lessons Learned, 2020 SIAM Conference on Parallel Processing for Scientific Computing, 2020,

AMReX is a software framework for building massively parallel block-structured AMR applications using mesh operations, particles, linear solvers and/or complex geometry. AMReX was originally designed to use MPI + OpenMP on multicore systems and recently has ported the majority of its features to GPU accelerators. AMReX’s porting strategy has been designed to allow code teams without a heavy computer science background to port their codes efficiently and quickly with the software framework of their choosing, while minimizing impact to CPU performance or the scientific readability of the code. Further elements of this strategy include providing a clear and concise recommended strategy to application teams, supporting features that allow porting to GPUs in a piece-meal fashion as well as creating sufficiently general interfaces to facilitate adaptation to future changes without user intervention. This talk will give an overview of AMReX's GPU porting strategy to date. This includes a general overview of the porting philosophy and some specific examples that generated noteworthy lessons about porting a large-scale scientific framework. The discussion will also include the current status of AMReX applications that have begun to migrate to hybrid CPU/GPU systems, detail into GPU specific features that have given substantial performance gains, issues with porting a hybrid C++/Fortran code to GPUs and an overview of the limitations of the strategy.

O. Hernandez, C. Yang, et al, Early Experience of Application Developers with OpenMP Offloading at ALCF, NERSC, and OLCF, Birds of a Feather (BoF), Exascale Computing Program (ECP) Annual Meeting, February 2020,

J. Doerfert, C. Yang, et al, OpenMP Roadmap for Accelerators Across DOE Pre-Exascale/Exascale Machines, Birds of a Feather (BoF), Exascale Computing Program (ECP) Annual Meeting, February 2020,

J. Srinivasan, C. Yang, et al, Perlmutter - a Waypoint for ECP Teams, Birds of a Feather (BoF), Exascale Computing Program (ECP) Annual Meeting, February 2020,

S. Williams, C. Yang, and J. Deslippe, Performance Tuning with the Roofline Model on GPUs and CPUs, Half-Day Tutorial, Exascale Computing Program (ECP) Annual Meeting, February 2020,

Michael E. Rowan, Jack R. Deslippe, Kevin N. Gott, Axel Huebl, Remi Lehe, Andrew T. Myers, Maxence Thévenet, Jean-Luc Vay, Weiqun Zhang, "Use of CUDA Profiling Tools Interface (CUPTI) for Profiling Asynchronous GPU Activity", 2020 Exascale Computing Project Annual Meeting, 2020,

Nicholas Balthaser, Wayne Hurlbert, Long-term Data Management in the NERSC Archive, NITRD Middleware and Grid Interagency Coordination (MAGIC), February 5, 2020,

Description of data management practices contributing to long-term data integrity in the NERSC Archive.

Ann Almgren, John Bell, Kevin Gott, Andrew Myers, AMReX and AMReX-Based Applications, 2020 Exascale Computing Project Annual Meeting, February 4, 2020,

Shahzeb Siddiqui, HPC Software Stack Testing Framework, FOSDEM, February 2, 2020,

Heroux, Michael A. and McInnes, Lois and Bernholdt, David and Dubey, Anshu and Gonsiorowski, Elsa and Marques, Osni and Moulton, J. David and Norris, Boyana and Raybourn, Elaine and Balay, Satish and Bartlett, Roscoe A. and Childers, Lisa and Gamblin, Todd and Grubel, Patricia and Gupta, Rinku and Hartman-Baker, Rebecca and Hill, Judith and Hudson, Stephen and Junghans, Christoph and Klinvex, Alicia and Milewicz, Reed and Miller, Mark and Ah Nam, Hai and O'Neal, Jared and Riley, Katherine and Sims, Ben and Schuler, Jean and Smith, Barry F. and Vernon, Louis and Watson, Gregory R. and Willenbring, James and Wolfenbarger, Paul, "Advancing Scientific Productivity through Better Scientific Software: Developer Productivity and Software Sustainability Report", https://www.osti.gov/biblio/1606662/, February 2020,

Shahzeb Siddiqui, Buildtest: HPC Software Stack Testing Framework, Easybuild User Meeting, January 30, 2020,

Shahzeb Siddiqui, Building an Easybuild Container Library in Sylabs Cloud, Easybuild User Meeting, January 29, 2020,

Marcus D. Hanwell, Chris Harris, Alessandro Genova, Mojtaba Haghighatlari, Muammar El Khatib, Patrick Avery, Johannes Hachmann, Wibe Albert de Jong, "Open Chemistry, JupyterLab, REST, and quantum chemistry", International Journal of Quantum Chemistry, 2020, 121:e26472, doi: https://doi.org/10.1002/qua.26472

Christian Lengauer, Sven Apel, Matthias Bolten, Shigeru Chiba, Ulrich Rüde, Jürgen Teich, Armin Größlinger, Frank Hannig, Harald Köstler, Lisa Claus, Alexander Grebhahn, Stefan Groth, Stefan Kronawitter, Sebastian Kuckuk, Hannah Rittich, Christian Schmitt, Jonas Schmitt, "ExaStencils: Advanced Multigrid Solver Generation", Software for Exascale Computing - SPPEXA 2016-2019, (Springer International Publishing: 2020) Pages: 405-452

Lisa Claus, Matthias Bolten, "Nonoverlapping block smoothers for the Stokes equations", Numerical Linear Algebra with Applications, 2020, 28:e2389, doi: 10.1002/nla.2389

Yang Liu, Pieter Ghysels, Lisa Claus, Xiaoye Sherry Li, "Sparse Approximate Multifrontal Factorization with Butterfly Compression for High-Frequency Wave Equations", SIAM Journal on Scientific Computing, 2020, 43:S367-S391, doi: 10.1137/20M1349667

Catherine A Watkinson, Sambit K Giri, Hannah E Ross, Keri L Dixon, Ilian T Iliev, Garrelt Mellema, Jonathan R Pritchard, "The 21-cm bispectrum as a probe of non-Gaussianities due to X-ray heating", Monthly Notices of the Royal Astronomical Society, January 2020, 482:2653–2669, doi: 10.1093/mnras/sty2740

Nasir U. Perez Eisty, "Testing Research Software: A Case Study", Computational Science - ICCS 2020, Cham, Springer International Publishing, 2020, 457-463,

2019

Paul T. Lin, John N. Shadid, Paul H. Tsuji, "On the performance of Krylov smoothing for fully coupled AMG preconditioners for VMS resistive MHD", International Journal for Numerical Methods in Engineering, December 21, 2019, 120:1297-1309, doi: 10.1002/nme.6178

Nicholas Balthaser, Wayne Hurlbert, Kirill Lozinskiy, Owen James, Regent System Move Update, NERSC All-to-All Meeting, December 16, 2019,

Update on moving the NERSC center backup system from the Oakland Scientific Facility to LBL Building 59.

Zhengji Zhao, Automation of NERSC Application Usage Report, Application Usage Page Support Transition Meeting, Berkeley CA, December 4, 2019,

Glenn K. Lockwood, Kirill Lozinskiy, Lisa Gerhardt, Ravi Cheema, Damian Hazen, Nicholas J. Wright, "A Quantitative Approach to Architecting All-Flash Lustre File Systems", ISC High Performance 2019: High Performance Computing, edited by Michele Weiland, Guido Juckeland, Sadaf Alam, Heike Jagode, (Springer International Publishing: 2019) Pages: 183-197 doi: 10.1007/978-3-030-34356-9_16

New experimental and AI-driven workloads are moving into the realm of extreme-scale HPC systems at the same time that high-performance flash is becoming cost-effective to deploy at scale. This confluence poses a number of new technical and economic challenges and opportunities in designing the next generation of HPC storage and I/O subsystems to achieve the right balance of bandwidth, latency, endurance, and cost. In this work, we present quantitative models that use workload data from existing, disk-based file systems to project the architectural requirements of all-flash Lustre file systems. Using data from NERSC’s Cori I/O subsystem, we then demonstrate the minimum required capacity for data, capacity for metadata and data-on-MDT, and SSD endurance for a future all-flash Lustre file system.

Hai Ah Nam, Elsa Gonsiorowski, Grigori Fursin, Lorena A. Barba, "Special Issue on the SC’18 Student Cluster Competition Reproducibility Initiative", Parallel Computing, Volume 90, December 2019,

M. Del Ben, C. Yang, F. H. da Jornada, S. G. Louie, and J. Deslippe, "Accelerating Large-Scale GW Calculations on Hybrid CPU-GPU Architectures", Supercomputing Conference (SC’19), November 2019,

Abe Singer, Shane Canon, Rebecca Hartman-Baker, Kelly L. Rowland, David Skinner, Craig Lant, "What Deploying MFA Taught Us About Changing Infrastructure", HPCSYSPROS19: HPC System Professionals Workshop, November 2019, doi: 10.5281/zenodo.3525375

NERSC is not the first organization to implement multi-factor authentication (MFA) for its users. We had seen multiple talks by other supercomputing facilities who had deployed MFA, but as we planned and deployed our MFA implementation, we found that nobody had talked about the more interesting and difficult challenges, which were largely social rather than technical. Our MFA deployment was a success, but, more importantly, much of what we learned could apply to any infrastructure change. Additionally, we developed the sshproxy service, a key piece of infrastructure technology that lessens user and staff burden and has made our MFA implementation more amenable to scientific workflows. We found great value in using robust open-source components where we could and developing tailored solutions where necessary.

S. Williams, C. Yang, A. Ilic, and K. Rogozhin, Performance Tuning with the Roofline Model on GPUs and CPUs, Half-Day Tutorial, Supercomputing Conference (SC’19), November 2019,

Glenn K. Lockwood, Kirill Lozinskiy, Kristy Kallback-Rose, NERSC's Perlmutter System: Deploying 30 PB of all-NVMe Lustre at scale, Lustre BoF at SC19, November 19, 2019,

Update at SC19 Lustre BoF on collaborative work with Cray on deploying an all-flash Lustre tier for NERSC's Perlmutter Shasta system.

Timothy G. Mattson, Yun (Helen) He, Alice E. Koniges, The OpenMP Common Core: Making OpenMP Simple Again, Book: Scientific and Engineering Computation Series, edited by William Gropp, Ewing Lusk, (The MIT Press: November 19, 2019) Pages: 320

How to become a parallel programmer by learning the twenty-one essential components of OpenMP.

R. Gayatri, K. Gott, J. Deslippe, "Comparing Managed Memory and ATS with and without Prefetching on NVIDIA Volta GPUs", 2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), 2019, 41-46, doi: 10.1109/PMBS49563.2019.00010

One of the major differences in many-core versus multicore architectures is the presence of two different memory spaces: a host space and a device space. In the case of NVIDIA GPUs, the device is supplied with data from the host via one of the multiple memory management API calls provided by the CUDA framework, such as cudaMallocManaged and cudaMemcpy. Modern systems, such as the Summit supercomputer, have the capability to avoid the use of CUDA calls for memory management and access the same data on GPU and CPU. This is done via the Address Translation Services (ATS) technology that gives a unified virtual address space for data allocated with malloc and new if there is an NVLink connection between the two memory spaces. In this paper, we perform a deep analysis of the performance achieved when using two types of unified virtual memory addressing: UVM and managed memory.

Hongzhang Shan, Zhengji Zhao, and Marcus Wagner, Accelerating the Performance of Modal Aerosol Module of E3SM Using OpenACC, Sixth Workshop on Accelerator Programming Using Directives (WACCPD) in SC19, November 18, 2019,

Hongzhang Shan, Zhengji Zhao, and Marcus Wagner, "Accelerating the Performance of Modal Aerosol Module of E3SM Using OpenACC", (won the best paper award), Sixth Workshop on Accelerator Programming Using Directives (WACCPD) in SC19, November 18, 2019,

Glenn K. Lockwood, Shane Snyder, Suren Byna, Philip Carns, Nicholas J. Wright, "Understanding Data Motion in the Modern HPC Data Center", 2019 IEEE/ACM Fourth International Parallel Data Systems Workshop (PDSW), Denver, CO, USA, IEEE, 2019, 74-83, doi: 10.1109/PDSW49588.2019.00012

The utilization and performance of storage, compute, and network resources within HPC data centers have been studied extensively, but much less work has gone toward characterizing how these resources are used in conjunction to solve larger scientific challenges. To address this gap, we present our work in characterizing workloads and workflows at a data-center-wide level by examining all data transfers that occurred between storage, compute, and the external network at the National Energy Research Scientific Computing Center over a three-month period in 2019. Using a simple abstract representation of data transfers, we analyze over 100 million transfer logs from Darshan, HPSS user interfaces, and Globus to quantify the load on data paths between compute, storage, and the wide-area network based on transfer direction, user, transfer tool, source, destination, and time. We show that parallel I/O from user jobs, while undeniably important, is only one of several major I/O workloads that occurs throughout the execution of scientific workflows. We also show that this approach can be used to connect anomalous data traffic to specific users and file access patterns, and we construct time-resolved user transfer traces to demonstrate that one can systematically identify coupled data motion for individual workflows.

George Michelogiannakis, Yiwen Shen, Min Yee Teh, Xiang Meng, Benjamin Aivazi, Taylor Groves, John Shalf, Madeleine Glick, Manya Ghobadi, Larry Dennison, Keren Bergman, "Bandwidth Steering in HPC using Silicon Nanophotonics", International Conference on High Performance Computing, Networking, Storage and Analysis (SC'19), November 17, 2019,

Sudheer Chunduri, Taylor Groves, Peter Mendygral, Brian Austin, Jacob Balma, Krishna Kandalla, Kalyan Kumaran, Glenn Lockwood, Scott Parker, Steven Warren, Nathan Wichmann, Nicholas Wright, "GPCNeT: Designing a Benchmark Suite for Inducing and Measuring Contention in HPC Networks", International Conference on High Performance Computing, Networking, Storage and Analysis (SC'19), November 16, 2019,

Network congestion is one of the biggest problems facing HPC systems today, affecting system throughput, performance, user experience and reproducibility. Congestion manifests as run-to-run variability due to contention for shared resources like filesystems or routes between compute endpoints. Despite its significance, current network benchmarks fail to proxy the real-world network utilization seen on congested systems. We propose a new open-source benchmark suite called the Global Performance and Congestion Network Tests (GPCNeT) to advance the state of the practice in this area. The guiding principles used in designing GPCNeT are described and the methodology employed to maximize its utility is presented. The capabilities of GPCNeT are evaluated by analyzing results from several of the world’s largest HPC systems, including an evaluation of congestion management on a next-generation network. The results show that systems of all technologies and scales are susceptible to congestion and this work motivates the need for congestion control in next-generation networks.

C. Yang, T. Kurth, and S. Williams, "Hierarchical Roofline Analysis for GPUs: Accelerating Performance Optimization for the NERSC-9 Perlmutter System", Concurrency and Computation: Practice and Experience, DOI: 10.1002/cpe.5547, November 2019,

Zhengji Zhao, Checkpointing and Restarting Jobs with DMTCP, NERSC User Training on Checkpointing and Restarting Jobs Using DMTCP, November 6, 2019,

Lixin Ge, Chou Ng, Zhengji Zhao, Jeff Hammond, and Karen Zhou, OpenMP Hackathon Acrum ACE3P, NERSC Application Readiness Meeting, October 30, 2019,

Nicholas Balthaser, Kirill Lozinskiy, Melinda Jacobsen, Kristy Kallback-Rose, NERSC Migration from Oracle tape libraries GPFS-HPSS-Integration Proof of Concept, October 16, 2019,

NERSC updates on Storage 2020 Strategy & Progress, GHI Testing, Tape Library Update, Futures

Ravi Cheema, Kristy Kallback-Rose, Storage 2020 Strategy & Progress - NERSC Site Update at HPCXXL User Group Meeting, September 24, 2019,

NERSC site update including Systems Overview, Storage 2020 Strategy & Progress, GPFS-HPSS-Integration Testing, Tape Library Update and Futures.

S. Williams, C. Yang, K. Ibrahim, T. Kurth, N. Ding, J. Deslippe, L. Oliker, "Performance Analysis using the Roofline Model", SciDAC PIs Meeting, 2019,

S. Lee, R. Corliss, I. Friščić, R. Alarcon, S. Aulenbacher, J. Balewski, S. Benson, J.C. Bernauer, J. Bessuille, J. Boyce, J. Coleman, D. Douglas, C.S. Epstein, P. Fisher, S. Frierson, M. Garçon, J. Grames, D. Hasell, C. Hernandez-Garcia, E. Ihloff, R. Johnston, K. Jordan, R. Kazimi, J. Kelsey, M. Kohl, A. Liyanage, M. McCaughan, R.G. Milner, P. Moran, J. Nazeer, D. Palumbo, M. Poelker, G. Randall, S.G. Steadman, C. Tennant, C. Tschalär, C. Vidal, C. Vogel, Y. Wang, S. Zhang, "Design and operation of a windowless gas target internal to a solenoidal magnet for use with a megawatt electron beam", Nuclear Instruments and Methods in Physics Research Section A, September 21, 2019,

Roy Ben-Shalom, Jan Balewski, Anand Siththaranjan, Vyassa Baratham, Henry Kyoung, Kyung Geun Kim, Kevin J. Bender, Kristofer E. Bouchard, "Inferring neuronal ionic conductances from membrane potentials using CNNs", August 6, 2019,

Ciston, J., Johnson, I., Draney, B., Ercius, P., Fong, E., Goldschmidt, A., . . . Denes, P., "The 4D Camera: Very High Speed Electron Counting for 4D-STEM", Microscopy and Microanalysis, 25(S2), 1930-1931, August 5, 2019, doi: 10.1017/S1431927619010389

 

C. Yang, Performance-Related Activities at NERSC/CRD, RRZE Thomas Gruber Visit NERSC, July 2019,

C. Yang, The Current and Future of Roofline, LBNL Brown Bag Seminar, July 2019,

Hannah E Ross, Keri L. Dixon, Raghunath Ghara, Ilian T. Iliev, Garrelt Mellema, "Evaluating the QSO contribution to the 21-cm signal from the Cosmic Dawn", Journal, July 2019, 487:1101–1119, doi: 10.1093/mnras/stz1220

C. Yang, NERSC Application Readiness Process and Strategy, NERSC GPU for Science Day 2019, July 2019,

Zhengji Zhao, Programming Environment and Compilation, NERSC New User Training, June 21, 2019,

C. Yang, Z. Matveev, A. Ilic, and D. Marques, Performance Optimization of Scientific Codes with the Roofline Model, Half-Day Tutorial, International Supercomputing Conference (ISC’19), June 2019,

Yuping Fan, Zhiling Lan, Paul Rich, William E Allcock, Michael E Papka, Brian Austin, David Paul, "Scheduling Beyond CPUs for HPC", Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, Phoenix, AZ, ACM, June 19, 2019, 97-108, doi: 10.1145/3307681.3325401

High performance computing (HPC) is undergoing significant changes. The emerging HPC applications comprise both compute- and data-intensive applications. To meet the intense I/O demand from emerging data-intensive applications, burst buffers are deployed in production systems. Existing HPC schedulers are mainly CPU-centric. The extreme heterogeneity of hardware devices, combined with workload changes, forces the schedulers to consider multiple resources (e.g., burst buffers) beyond CPUs in decision making. In this study, we present a multi-resource scheduling scheme named BBSched that schedules user jobs based on not only their CPU requirements, but also other schedulable resources such as burst buffer. BBSched formulates the scheduling problem into a multi-objective optimization (MOO) problem and rapidly solves the problem using a multi-objective genetic algorithm. The multiple solutions generated by BBSched enable system managers to explore potential tradeoffs among various resources and thereby obtain better utilization of all the resources. The trace-driven simulations with real system workloads demonstrate that BBSched improves scheduling performance by up to 41% compared to existing methods, indicating that explicitly optimizing multiple resources beyond CPUs is essential for HPC scheduling.

Zhengji Zhao, Running VASP on Cori KNL, VASP User Hands-on KNL Training, June 18, 2019, Berkeley CA, June 18, 2019,

C. Yang, Using Intel Tools at NERSC, Intel KNL Training, May 2019,

Nicholas Balthaser, Tape's Not Dead at LBNL/NERSC, MSST 2019 Conference, May 21, 2019,

Lightning talk on archival storage projects at NERSC for 2019 MSST conference.

Tiffany Connors, Taylor Groves, Tony Quan, Scott Hemmert, "Simulation Framework for Studying Optical Cable Failures in Dragonfly Topologies", Workshop on Scalable Networks for Advanced Computing Systems in conjunction with IPDPS, May 17, 2019,

Kirill Lozinskiy, Glenn K. Lockwood, Lisa Gerhardt, Ravi Cheema, Damian Hazen, Nicholas J. Wright, A Quantitative Approach to Architecting All‐Flash Lustre File Systems, Lustre User Group (LUG) 2019, May 15, 2019,

Weiqun Zhang, Ann Almgren, Vince Beckner, John Bell, Johannes Blaschke, Cy Chan, Marcus Day, Brian Friesen, Kevin Gott, Daniel Graves, Max P. Katz, Andrew Myers, Tan Nguyen, Andrew Nonaka, Michele Rosso, Samuel Williams, Michael Zingale, "AMReX: a framework for block-structured adaptive mesh refinement", Journal of Open Source Software, 2019, 4(37):1370, doi: 10.21105/joss.01370

JOSS Article for Citation of AMReX:

AMReX is a C++ software framework that supports the development of block-structured adaptive mesh refinement (AMR) algorithms for solving systems of partial differential equations (PDEs) with complex boundary conditions on current and emerging architectures.

Kirill Lozinskiy, Lisa Gerhardt, Annette Greiner, Ravi Cheema, Damian Hazen, Kristy Kallback-Rose, Rei Lee, User-Friendly Data Management for Scientific Computing Users, Cray User Group (CUG) 2019, May 9, 2019,

Wrangling data at a scientific computing center can be a major challenge for users, particularly when quotas may impact their ability to utilize resources. In such an environment, a task as simple as listing space usage for one's files can take hours. The National Energy Research Scientific Computing Center (NERSC) has roughly 50 PBs of shared storage utilizing more than 4.6B inodes, and a 146 PB high-performance tape archive, all accessible from two supercomputers. As data volumes increase exponentially, managing data is becoming a larger burden on scientists. To ease the pain, we have designed and built a “Data Dashboard”. Here, in a web-enabled visual application, our 7,000 users can easily review their usage against quotas, discover patterns, and identify candidate files for archiving or deletion. We describe this system, the framework supporting it, and the challenges for such a framework moving into the exascale age.

Thorsten Kurth, Joshua Romero, Everett Phillips, and Massimiliano Fatica, Brandon Cook, Rahul Gayatri, Zhengji Zhao, and Jack Deslippe, Porting Quantum ESPRESSO Hybrid Functional DFT to GPUs Using CUDA Fortran, Cray User Group Meeting, Montreal, Canada, May 5, 2019,

Zhengji Zhao, Introduction - The basics of compiling and running on KNL, Cori KNL: Programming and Optimization, Cray KNL training, April 16-18, 2018, Berkeley CA, April 16, 2019,

C. Yang, Performance Analysis of GPU-Accelerated Applications using the Roofline Model, Cray Center of Excellence (COE) Webinar, April 2019,

NERSC site update including Systems Overview, Storage 2020 Strategy & Progress and Superfacility Initiative.

Kevin Gott, AMReX: Enabling Applications on GPUs, 2019 DOE Performance, Portability and Productivity Annual Meeting, 2019,

C. Yang, Preparing NERSC Applications for Perlmutter as an Exascale Waypoint, Meet the Mentors, Pawsey Supercomputing Centre, March 2019,

R. Gayatri, and C. Yang, Optimizing Large Reductions in BerkeleyGW with CUDA, OpenACC, OpenMP 4.5 and Kokkos, NVIDIA GPU Technology Conference (GTC’19), March 2019,

C. Yang, S. Williams, Performance Analysis of GPU-Accelerated Applications using the Roofline Model, NVIDIA GPU Technology Conference (GTC’19), March 2019,

C. Yang, Preparing NERSC Applications for Perlmutter as an Exascale Waypoint, Tsukuba/LBNL Meeting, March 2019,

Kevin Gott, "An Overview of GPU Strategies for Porting Amrex-Based Applications to Next-generation HPC Systems", 2019 SIAM Conference on Computational Science and Engineering, 2019,

AMReX is a parallel computing framework for applying adaptive mesh refinement (AMR) to scientific applications. AMReX-based applications, including the astrophysics code Castro and the beam-plasma simulation code WarpX, have begun to implement AMReX's new GPU offloading paradigms to gain access to next generation HPC resources, including ORNL's Summit supercomputer. The AMReX library is exploring multiple paradigms using OpenMP, OpenACC, CUDA Fortran and CUDA to allow users to offload kernels in a manner that yields good speedups while maintaining readability for users. 

An overview of the paradigms will be presented and compared on Summit, LBNL's Cori supercomputer and other applicable HPC platforms. Selected AMReX-based applications that have been ported to GPUs will be presented, focusing on paradigms implemented, the difficulty of the conversion, runtime improvement compared to modern CPU-based HPC systems, and where additional optimizations could be made.

Andrew Myers, Ann S. Almgren, John B. Bell, Marcus Day, Brian Friesen, Kevin N. Gott, Andy J. Nonaka, Steven Reeves, Weiqun Zhang, "Overview of Amrex - a New Framework for Block-structured Adaptive Mesh Refinement Calculations", 2019 SIAM Conference on Computational Science and Engineering, 2019,

AMReX is a new software framework that supports the development of block-structured adaptive mesh refinement algorithms for solving systems of partial differential equations on emerging architectures. AMReX aims to provide all the tools necessary for performing complex multiphysics simulations on an adaptive hierarchy of meshes. We give an overview of the software components provided by AMReX, including support for cell, edge, face, and node-centered mesh data, particles, embedded boundary (cut cell) representations of complex geometries, linear solvers, profiling tools, and parallel load balancing. We describe the parallelization strategies supported, including straight MPI, hybrid MPI+OpenMP, and support for GPU systems. Finally, we also give an overview of the application codes built on top of AMReX, which span a wide range of scientific domains and include several ECP and SciDAC-supported projects.

C. Yang, OpenACC Updates, NERSC Application Readiness Seminar, February 2019,

Zhengji Zhao, Introduction: The basics of compiling and running jobs on KNL, Cori KNL: Programming and Optimization, Cray Training, Feb 12-13, 2019, Berkeley CA, February 12, 2019,

C. Yang, Roofline Performance Analysis with nvprof, NERSC-NVIDIA Face-to-Face Meeting, February 2019,

Marco Govoni, Milson Munakami, Aditya Tanikanti, Jonathan H. Skone, Hakizumwami Runesha, Federico Giberti, Juan De Pablo, Giulia Galli, "Qresp, a tool for curating, discovering and exploring reproducible scientific papers", Scientific Data, January 29, 2019, 6:190002, doi: 10.1038/sdata.2019.2

J. Pennycook, C. Yang, and J. Deslippe, Quantitatively Assessing Performance Portability with Roofline, Exascale Computing Project (ECP) Interoperable Design of Extreme-scale Application Software (IDEAS) Webinar, January 2019,

S. Williams, J. Deslippe, C. Yang, Performance Tuning of Scientific Codes with the Roofline Model, Half-Day Tutorial, Exascale Computing Project (ECP) Annual Meeting, January 2019,

Kevin Gott, Weiqun Zhang, Andrew Myers, Ann S. Almgren, John B. Bell, "AMReX Co-Design Center for Exascale Computing", 2019 Exascale Computing Project Annual Meeting, January 16, 2019,

Kevin Gott, Weiqun Zhang, Andrew Myers, Ann S. Almgren, John B. Bell, Breakout Session for AMReX Users and Developers, 2019 Exascale Computing Project Annual Meeting, 2019,

Phuong Hoai Ha, Otto J. Anshus, Ibrahim Umar, "Efficient concurrent search trees using portable fine-grained locality", IEEE Transactions on Parallel and Distributed Systems, January 14, 2019,

Abhinav Thota, Yun He, "Foreword to the Special Issue of the Cray User Group (CUG 2018)", Concurrency and Computation: Practice and Experience, January 11, 2019,

Daan Camps, Nicola Mastronardi, Raf Vandebril, Paul Van Dooren, "Swapping 2 × 2 blocks in the Schur and generalized Schur form", Journal of Computational and Applied Mathematics, 2019, doi: https://doi.org/10.1016/j.cam.2019.05.022

Pole swapping methods for the eigenvalue problem - Rational QR algorithms, Daan Camps, 2019,

Daan Camps, Karl Meerbergen, Raf Vandebril, "An implicit filter for rational Krylov using core transformations", Linear Algebra Appl., 2019, 561:113-140, doi: 10.1016/j.laa.2018.09.021

Daan Camps, Karl Meerbergen, Raf Vandebril, "A rational QZ method", SIAM J. Matrix Anal. Appl., January 1, 2019, 40:943-972, doi: 10.1137/18M1170480

Weiqun Zhang, Ann Almgren, Vince Beckner, John Bell, Johannes Blaschke, Cy Chan, Marcus Day, Brian Friesen, Kevin Gott, Daniel Graves, Max P. Katz, Andrew Myers, Tan Nguyen, Andrew Nonaka, Michele Rosso, Samuel Williams, Michael Zingale, "AMReX: a framework for block-structured adaptive mesh refinement", Journal of Open Source Software, 2019, 4:1370, doi: https://doi.org/10.21105/joss.01370

Glenn K. Lockwood, Kirill Lozinskiy, Lisa Gerhardt, Ravi Cheema, Damian Hazen, Nicholas J. Wright, "Designing an All-Flash Lustre File System for the 2020 NERSC Perlmutter System", Proceedings of the 2019 Cray User Group, Montreal, January 1, 2019,

New experimental and AI-driven workloads are moving into the realm of extreme-scale HPC systems at the same time that high-performance flash is becoming cost-effective to deploy at scale. This confluence poses a number of new technical and economic challenges and opportunities in designing the next generation of HPC storage and I/O subsystems to achieve the right balance of bandwidth, latency, endurance, and cost. In this paper, we present the quantitative approach to requirements definition that resulted in the 30 PB all-flash Lustre file system that will be deployed with NERSC's upcoming Perlmutter system in 2020. By integrating analysis of current workloads and projections of future performance and throughput, we were able to constrain many critical design space parameters and quantitatively demonstrate that Perlmutter will not only deliver optimal performance, but effectively balance cost with capacity, endurance, and many modern features of Lustre.

Osni A. Marques, David E. Bernholdt, Elaine M. Raybourn, Ashley D. Barker, Rebecca J. Hartman-Baker, "The HPC Best Practices Webinar Series", Journal of Computational Science Education, January 2019, doi: 10.22369/issn.2153-4136/10/1/19

In this contribution, we discuss our experiences organizing the Best Practices for HPC Software Developers (HPC-BP) webinar series, an effort for the dissemination of software development methodologies, tools and experiences to improve developer productivity and software sustainability. HPC-BP is an outreach component of the IDEAS Productivity Project and has been designed to support the IDEAS mission to work with scientific software development teams to enhance their productivity and the sustainability of their codes. The series, which was launched in 2016, has just presented its 22nd webinar. We summarize and distill our experiences with these webinars, including what we consider to be “best practices” in the execution of both individual webinars and a long-running series like HPC-BP. We also discuss future opportunities and challenges in continuing the series.

 

2018

Thomas Heller, Bryce Adelstein Lelbach, Kevin A. Huck, John Biddiscombe, Patricia Grubel, Alice E. Koniges, Matthias Kretz, Dominic Marcello, David Pfander, Adrian Serio, Juhan Frank, Geoffrey C. Clayton, Dirk Pflüger, David Eder, and Hartmut Kaiser, "Harnessing Billions of Tasks for a Scalable Portable Hydrodynamic Simulation of the Merger of Two Stars", The International Journal of High Performance Computing Applications, 2018, Accepted,

Florian Wende, Martijn Marsman, Jeongnim Kim, Fedor Vasilev, Zhengji Zhao, Thomas Steinke, "OpenMP in VASP: Threading and SIMD", International Journal of Quantum Chemistry, December 19, 2018,

Paul T. Lin, John N. Shadid, Jonathan J. Hu, Roger P. Pawlowski, Eric C. Cyr, "Performance of fully-coupled algebraic multigrid preconditioners for large-scale VMS resistive MHD", Journal of Computational and Applied Mathematics, December 15, 2018, 344:782-793, doi: 10.1016/j.cam.2017.09.028

Zhengji Zhao, Transition from Edison to Cori KNL How to compile and run on KNL, Edison to KNL User Training, December 12, 2018, Berkeley CA, December 12, 2018,

Vetter, Jeffrey S.; Brightwell, Ron; Gokhale, Maya; McCormick, Pat; Ross, Rob; Shalf, John; Antypas, Katie; Donofrio, David; Humble, Travis; Schuman, Catherine; Van Essen, Brian; Yoo, Shinjae; Aiken, Alex; Bernholdt, David; Byna, Suren; Cameron, Kirk; Cappello, Frank; Chapman, Barbara; Chien, Andrew; Hall, Mary; Hartman-Baker, Rebecca; Lan, Zhiling; Lang, Michael; Leidel, John; Li, Sherry; Lucas, Robert; Mellor-Crummey, John; Peltz Jr., Paul; Peterka, Thomas; Strout, Michelle; Wilke, Jeremiah, "Extreme Heterogeneity 2018 - Productive Computational Science in the Era of Extreme Heterogeneity: Report for DOE ASCR Workshop on Extreme Heterogeneity", December 2018, doi: 10.2172/1473756

Kurt Ferreira, Ryan E. Grant, Michael J. Levenhagen, Scott Levy, Taylor Groves, "Hardware MPI Message Matching: Insights into MPI Matching Behavior to Inform Design", Concurrency and Computation Practice and Experience, December 1, 2018,

S. Williams, A. Ilic, Z. Matveev, C. Yang, Performance Tuning of Scientific Codes with the Roofline Model, Half-Day Tutorial, Supercomputing Conference (SC’18), November 2018,

C. Yang, R. Gayatri, T. Kurth, P. Basu, Z. Ronaghi, A. Adetokunbo, B. Friesen, B. Cook, D. Doerfler, L. Oliker, J. Deslippe, and S. Williams, "An Empirical Roofline Methodology for Quantitatively Assessing Performance Portability", IEEE International Workshop on Performance, Portability and Productivity in HPC (P3HPC'18), November 2018,

R. Gayatri, C. Yang, T. Kurth, and J. Deslippe, "A Case Study for Performance Portability Using OpenMP 4.5", IEEE International Workshop on Accelerator Programming Using Directives (WACCPD'18), November 2018,

B. Austin, C. Daley, D. Doerfler, J. Deslippe, B. Cook, B. Friesen, T. Kurth, C. Yang, and N. Wright, "A Metric for Evaluating Supercomputer Performance in the Era of Extreme Heterogeneity", 9th IEEE International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS'18), November 2018,

Tim Mattson, Alice Koniges, Yun (Helen) He, David Eder, The OpenMP Common Core: A hands-on exploration, SuperComputing 2018 Tutorial, November 11, 2018,

Glenn K. Lockwood, Shane Snyder, Teng Wang, Suren Byna, Philip Carns, Nicholas J. Wright, "A Year in the Life of a Parallel File System", Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, Dallas, TX, IEEE Press, November 11, 2018, 71:1-74:1,

I/O performance is a critical aspect of data-intensive scientific computing. We seek to advance the state of the practice in understanding and diagnosing I/O performance issues through investigation of a comprehensive I/O performance data set that captures a full year of production storage activity at two leadership-scale computing facilities. We demonstrate techniques to identify regions of interest, perform focused investigations of both long-term trends and transient anomalies, and uncover the contributing factors that lead to performance fluctuation.


We find that a year in the life of a parallel file system is comprised of distinct regions of long-term performance variation in addition to short-term performance transients. We demonstrate how systematic identification of these performance regions, combined with comprehensive analysis, allows us to isolate the factors contributing to different performance maladies at different time scales. From this, we present specific lessons learned and important considerations for HPC storage practitioners.

Cory Snavely, Gonzalo Alvarez, Valerie Hendrix, Shreyas Cholia, Stefan Lasiewski, "Spin: A Docker-based Platform for Deploying Science Gateways at NERSC", Gateways 2018: The 13th Gateway Computing Environments Conference, Univ. of Texas at Austin, October 18, 2018, doi: 10.6084/m9.figshare.7071770.v2

Kevin Gott, Charles Lena, Ariel Biller, Josh Neitzel, Kai-Hsin Liou, Jack Deslippe, James R Chelikowsky, Scaling and optimization results of the real-space DFT solver PARSEC on Haswell and KNL systems, Intel Xeon Phi Users Group (IXPUG), 2017, 2018,

Tianmu Xin, Zhengji Zhao, Yue Hao, Binping Xiao, Qiong Wu, Alexander Zaltsman, Kevin Smith, and Xinmin Tian, Performance Tuning to close Ninja Gap for Accelerator Physics Emulation System (APES) on Intel Xeon Phi Processors, A talk presented at the 14th International Workshop on OpenMP (IWOMP18), Barcelona, Spain, September 26, 2018,

Tianmu Xin, Zhengji Zhao, Yue Hao, Binping Xiao, Qiong Wu, Alexander Zaltsman, Kevin Smith, and Xinmin Tian, "Performance Tuning to close Ninja Gap for Accelerator Physics Emulation System (APES) on Intel Xeon Phi Processors", Proceedings of the 14th International Workshop on OpenMP (IWOMP18), Barcelona, Spain, September 26, 2018,

Yun (Helen) He, Barbara Chapman, Oscar Hernandez, Tim Mattson, Alice Koniges, Introduction to "OpenMP Common Core", OpenMPCon / IWOMP 2018 Tutorial Day, September 26, 2018,

Oscar Hernandez, Yun (Helen) He, Barbara Chapman, Using MPI+OpenMP for Current and Future Architectures, OpenMPCon 2018, September 24, 2018,

Robert Ross, Lee Ward, Philip Carns, Gary Grider, Scott Klasky, Quincey Koziol, Glenn K. Lockwood, Kathryn Mohror, Bradley Settlemyer, Matthew Wolf, "Storage Systems and I/O: Organizing, Storing, and Accessing Data for Scientific Discovery", 2018, doi: 10.2172/1491994

In September, 2018, the Department of Energy, Office of Science, Advanced Scientific Computing Research Program convened a workshop to identify key challenges and define research directions that will advance the field of storage systems and I/O over the next 5–7 years. The workshop concluded that addressing these combined challenges and opportunities requires tools and techniques that greatly extend traditional approaches and require new research directions. Key research opportunities were identified.

Yun (Helen) He, Michael Klemm, Bronis R. De Supinski, OpenMP: Current and Future Directions, 8th NCAR MultiCore Workshop (MC8), September 19, 2018,

Suzanne M. Kosina, Annette M. Greiner, Rebecca K. Lau, Stefan Jenkins, Richard Baran, Benjamin P. Bowen and Trent R. Northen, "Web of microbes (WoM): a curated microbial exometabolomics database for linking chemistry and microbes", BMC Microbiology, September 12, 2018, 18, doi: https://doi.org/10.1186/s12866-018-1256-y

As microbiome research becomes increasingly prevalent in the fields of human health, agriculture and biotechnology, there exists a need for a resource to better link organisms and environmental chemistries. Exometabolomics experiments now provide assertions of the metabolites present within specific environments and how the production and depletion of metabolites is linked to specific microbes. This information could be broadly useful, from comparing metabolites across environments, to predicting competition and exchange of metabolites between microbes, and to designing stable microbial consortia. Here, we introduce Web of Microbes (WoM; freely available at: http://webofmicrobes.org), the first exometabolomics data repository and visualization tool.

Teng Wang, Shane Snyder, Glenn K. Lockwood, Philip Carns, Nicholas Wright, Suren Byna, "IOMiner: Large-Scale Analytics Framework for Gaining Knowledge from I/O Logs", 2018 IEEE International Conference on Cluster Computing (CLUSTER), Belfast, UK, IEEE, 2018, 466-476, doi: 10.1109/CLUSTER.2018.00062

Modern HPC systems are collecting large amounts of I/O performance data. The massive volume and heterogeneity of this data, however, have made timely performance of in-depth integrated analysis difficult. To overcome this difficulty and to allow users to identify the root causes of poor application I/O performance, we present IOMiner, an I/O log analytics framework. IOMiner provides an easy-to-use interface for analyzing instrumentation data, a unified storage schema that hides the heterogeneity of the raw instrumentation data, and a sweep-line-based algorithm for root cause analysis of poor application I/O performance. IOMiner is implemented atop Spark to facilitate efficient, interactive, parallel analysis. We demonstrate the capabilities of IOMiner by using it to analyze logs collected on a large-scale production HPC system. Our analysis techniques not only uncover the root cause of poor I/O performance in key application case studies but also provide new insight into HPC I/O workload characterization.

Chris Harris, Reproducible quantum chemistry in Jupyter, jupytercon, August 23, 2018,

Benjamin Driscoll, and Zhengji Zhao, Automating NERSC Reporting, 2018 CS Summer Student POSTER SESSION, August 2, 2018, Berkeley CA, August 2, 2018,

Johnson, IJ; Bustillo, KC; Ciston, J; Draney, BR; Ercius, P; Fong, E; Goldschmidt, A; Joseph, JM; Lee, JR; Minor, A.M.; Ophus, C; Selvarajan, A; Skinner, DE; Stezelberger, T; Tindall, CS; Denes, P., "A Next Generation Electron Microscopy Detector Aimed at Enabling New Scanning Diffraction Techniques and Online Data Reconstruction", Microscopy and Microanalysis, 24(S1), 166-167, August 2018, doi: 10.1017/S1431927618001320

Nathan Hjelm, Matthew Dosanjh, Ryan Grant, Taylor Groves, Patrick Bridges, Dorian Arnold, "Improving MPI Multi-threaded RMA Communication Performance", ACM International Conference on Parallel Processing (ICPP), August 1, 2018,

C. Yang, Introduction to Performance Scalability Tools, Department of Energy (DOE) Computational Science Graduate Fellowship (CSGF) Annual Review, July 2018,

Simulating the 21-cm Signal during the Cosmic Dawn, Hannah Elizabeth Ross, PhD, University of Sussex, July 2018,

Adam P Arkin, Robert W Cottingham, Christopher S Henry, Nomi L Harris, Rick L Stevens, Sergei Maslov, Paramvir Dehal, Doreen Ware, Fernando Perez, Shane Canon, Michael W Sneddon, Matthew L Henderson, William J Riehl, Dan Murphy-Olson, Stephen Y Chan, Roy T Kamimura, Sunita Kumari, Meghan M Drake, Thomas S Brettin, Elizabeth M Glass, Dylan Chivian, Dan Gunter, David J Weston, Benjamin H Allen, Jason Baumohl, Aaron A Best, Ben Bowen, Steven E Brenner, Christopher C Bun, John-Marc Chandonia, Jer-Ming Chia, Ric Colasanti, Neal Conrad, James J Davis, Brian H Davison, Matthew DeJongh, Scott Devoid, Emily Dietrich, Inna Dubchak, Janaka N Edirisinghe, Gang Fang, José P Faria, Paul M Frybarger, Wolfgang Gerlach, Mark Gerstein, Annette Greiner, James Gurtowski, Holly L Haun, Fei He, Rashmi Jain, Marcin P Joachimiak, Kevin P Keegan, Shinnosuke Kondo, Vivek Kumar, Miriam L Land, Folker Meyer, Marissa Mills, Pavel S Novichkov, Taeyun Oh, Gary J Olsen, Robert Olson, Bruce Parrello, Shiran Pasternak, Erik Pearson, Sarah S Poon, Gavin A Price, Srividya Ramakrishnan, Priya Ranjan, Pamela C Ronald, Michael C Schatz, Samuel M D Seaver, Maulik Shukla, Roman A Sutormin, Mustafa H Syed, James Thomason, Nathan L Tintle, Daifeng Wang, Fangfang Xia, Hyunseung Yoo, Shinjae Yoo, Dantong Yu, "KBase: the United States department of energy systems biology knowledgebase", Nature Biotechnology, July 6, 2018, 36.7, doi: 10.1038/nbt.4163.

Here we present the DOE Systems Biology Knowledgebase (KBase, http://kbase.us), an open-source software and data platform that enables data sharing, integration, and analysis of microbes, plants, and their communities. KBase maintains an internal reference database that consolidates information from widely used external data repositories. This includes over 90,000 microbial genomes from RefSeq, over 50 plant genomes from Phytozome, over 300 Biolog media formulations, and >30,000 reactions and compounds from KEGG, BIGG, and MetaCyc. These public data are available for integration with user data where appropriate (e.g., genome comparison or building species trees). KBase links these diverse data types with a range of analytical functions within a web-based user interface. This extensive community resource facilitates large-scale analyses on scalable computing infrastructure and has the potential to accelerate scientific discovery, improve reproducibility, and foster open collaboration.

Zhengji Zhao, Using VASP at NERSC, Chemistry and Materials Science Application Training, Berkeley CA, June 29, 2018,

T. Koskela, A. Ilic, Z. Matveev, R. Belenov, C. Yang, and L. Sousa, A Practical Approach to Application Performance Tuning with the Roofline Model, Half-Day Tutorial, International Supercomputing Conference (ISC’18), June 2018,

B. Cook, C. Yang, B. Friesen, T. Kurth and J. Deslippe, "Sparse CSB_Coo Matrix-Vector and Matrix-Matrix Performance on Intel Xeon Architectures", Intel eXtreme Performance Users Group (IXPUG) at International Supercomputing Conference (ISC'18), June 2018,

T. Koskela, Z. Matveev, C. Yang, A. Adetokunbo, R. Belenov, P. Thierry, Z. Zhao, R. Gayatri, H. Shan, L. Oliker, J. Deslippe, R. Green, and S. Williams, "A Novel Multi-Level Integrated Roofline Model Approach for Performance Characterization", International Supercomputing Conference (ISC'18), June 2018,

Tuomas Koskela, Zakhar Matveev, Charlene Yang, Adetokunbo Adedoyin, Roman Belenov, Philippe Thierry, Zhengji Zhao, Rahulkumar Gayatri, Hongzhang Shan, Leonid Oliker, Jack Deslippe, Ron Green, Samuel Williams, "A Novel Multi-Level Integrated Roofline Model Approach for Performance Characterization", International Conferences on High Performance Computing 2018, June 24, 2018,

Shahzeb Siddiqui, Software Stack Testing Framework, HPCKP, June 22, 2018,

Yun (Helen) He, Introduction to NERSC Resources, LBNL Computer Sciences Summer Student Classes #1, June 11, 2018,

Glenn K. Lockwood, Nicholas J. Wright, Shane Snyder, Philip Carns, George Brown, Kevin Harms, "TOKIO on ClusterStor: Connecting Standard Tools to Enable Holistic I/O Performance Analysis", Proceedings of the 2018 Cray User Group, Stockholm, SE, May 24, 2018,

At present, I/O performance analysis requires different tools to characterize individual components of the I/O subsystem, and institutional I/O expertise is relied upon to translate these disparate data into an integrated view of application performance. This process is labor-intensive and not sustainable as the storage hierarchy deepens and system complexity increases. To address this growing disparity, we have developed the Total Knowledge of I/O (TOKIO) framework to combine the insights from existing component-level monitoring tools and provide a holistic view of performance across the entire I/O stack. 

A reference implementation of TOKIO, pytokio, is presented here. Using monitoring tools included with Cray XC and ClusterStor systems alongside commonly deployed community-supported tools, we demonstrate how pytokio provides a lightweight foundation for holistic I/O performance analyses on two Cray XC systems deployed at different HPC centers. We present results from integrated analyses that allow users to quantify the degree of I/O contention that affected their jobs and probabilistically identify unhealthy storage devices that impacted their performance. We also apply pytokio to inspect the utilization of NERSC’s DataWarp burst buffer and demonstrate how pytokio can be used to identify users and applications who may stand to benefit most from migrating their workloads from Lustre to the burst buffer.

C. Yang, B. Friesen, T. Kurth, B. Cook, S. Williams, "Toward Automated Application Profiling on Cray Systems", Cray User Group conference (CUG'18), May 2018,

Ville Ahlgren, Stefan Andersson, Jim Brandt, Nicholas Cardo, Sudheer Chunduri, Jeremy Enos, Parks Fields, Ann Gentile, Richard Gerber, Joe Greenseid, Annette Greiner, Bilel Hadri, Helen He, Dennis Hoppe, Urpo Kaila, Kaki Kelly, Mark Klein, Alex Kristiansen, Steve Leak, Michael Mason, Kevin Pedretti, Jean-Guillaume Piccinali, Jason Repik, Jim Rogers, Susanna Salminen, Michael Showerman, Cary Whitney, Jim Williams, "Cray System Monitoring: Successes, Priorities, Visions", CUG 2018 Proceedings, Stockholm, Cray User Group, May 22, 2018,

Effective HPC system operations and utilization require unprecedented insight into system state, applications’ demands for resources, contention for shared resources, and system demands on center power and cooling. Monitoring can provide such insights when the necessary fundamental capabilities for data availability and usability are provided. In this paper, multiple Cray sites seek to motivate monitoring as a core capability in HPC design, through the presentation of success stories illustrating enhanced understanding and improved performance and/or operations as a result of monitoring and analysis. We present the utility, limitations, and gaps of the data necessary to enable the required insights. The capabilities developed to enable the case successes drive our identification and prioritization of monitoring system requirements. Ultimately, we seek to engage all HPC stakeholders to drive community and vendor progress on these priorities.

Stephen Leak, Annette Greiner, Ann Gentile, James Brandt, "Supporting failure analysis with discoverable, annotated log datasets", CUG 2018 Proceedings, Stockholm, Cray User Group, May 22, 2018,

Detection, characterization, and mitigation of faults on supercomputers is complicated by the large variety of interacting subsystems. Failures often manifest as vague observations like "my job failed" and may result from system hardware/firmware/software, filesystems, networks, resource manager state, and more. Data such as system logs, environmental metrics, job history, cluster state snapshots, published outage notices and user reports are routinely collected. These data are typically stored in different locations and formats for specific use by targeted consumers. Combining data sources for analysis generally requires a consumer-dependent custom approach. We present a vocabulary for describing data, including format and access details, an annotation schema for attaching observations to a dataset, and tools to aid in discovery and publishing system-related insights. We present case studies in which our analysis tools utilize information from disparate data sources to investigate failures and performance issues from user and administrator perspectives.

K.S. Hemmert, S.G. Moore, M.A. Gallis, M.E. Davis, J. Levesque, N. Hjelm, J. Lujan, D. Morton, H. Nam, A. Parga, P. Peltz, G. Shipman, A. Torrez, "Trinity: Opportunities and Challenges of a Heterogeneous System", Proceedings of the Cray Users Group Conference, Stockholm, Sweden, May 2018,

Nicholas Balthaser, NERSC Tape Technology, MSST 2018 Conference, May 16, 2018,

Description of tape storage technology in use at NERSC for 2018 MSST conference.

D Androić, J Balewski, and many more, "Precision measurement of the weak charge of the proton", Nature, May 9, 2018,

Hannah E Ross, Keri L Dixon, Ilian T Iliev, Garrelt Mellema, "New simulation of QSO X-ray heating during the Cosmic Dawn", Conference, Dubrovnik, Croatia, Cambridge University Press, May 8, 2018, 12:34 - 38, doi: 10.1017/S1743921317011115

Tim Mattson, Yun (Helen) He, Beyond OpenMP Common Core, NERSC Training, May 4, 2018,

NERSC site update focusing on plans to implement new tape technology at the Berkeley Data Center. 

NERSC Site Report focusing on plans for migration of tape-based system to new location and new technology, and collection of metrics for GPFS.

Zhengji Zhao, New User Training: March 21, 2018, New User Training, March 21, 2018, Berkeley CA, March 21, 2018,

Kevin Gott, Charles Lena, Kai-Hsin Liou, James Chelikowsky, Jack Deslippe, Scaling the Force Calculations of the Real Space Pseudopotential DFT solver PARSEC on Haswell and KNL systems, APS March Meeting 2018, 2018,

The ability to compute atomic forces through quantum contributions rather than through simple pairwise potentials is one of the most compelling reasons materials scientists use Kohn-Sham pseudopotential density functional theory (DFT). PARSEC is an actively developed real space pseudopotential DFT solver that uses Fortran MPI+OpenMP parallelization. PARSEC provides atomic forces by self-consistently solving for the electronic structure and then summing local and nonlocal contributions. Through experimentation with PARSEC, we present why increasingly bulk synchronous processing and vectorization of the contributions is not enough to fully utilize current HPC hardware. We address this limitation through a demonstration of multithreaded communication approaches for local and nonlocal force computations on Intel Knights Landing supercomputers that yield feasible calculation times for systems of over 20,000 atoms.

C. Yang, T. Kurth, Roofline Performance Model and Intel Advisor, Performance Analysis and Modeling (PAM) Workshop 2018, February 2018,

Barbara Chapman, Oscar Hernandez, Yun (Helen) He, Martin Kong, Geoffroy Vallee, MPI + OpenMP Tutorial, DOE ECP Annual Meeting Tutorial, 2018, February 9, 2018,

S. Williams, J. Deslippe, C. Yang, P. Basu, Performance Tuning of Scientific Codes with the Roofline Model, Half-Day Tutorial, Exascale Computing Project (ECP) Annual Meeting, February 2018,

Alice Koniges, Yun (Helen) He, OpenMP Common Core, NERSC Training, February 6, 2018,

Gerber, Richard; Hack, James; Riley, Katherine; Antypas, Katie; Coffey, Richard; Dart, Eli; Straatsma, Tjerk; Wells, Jack; Bard, Deborah; Dosanjh, Sudip, et al., "Crosscut report: Exascale Requirements Reviews", January 22, 2018,

Lee, Jason R., et al, "Enhancing supercomputing with software defined networking", IEEE Conference on Information Networking (ICOIN), January 10, 2018,

Lisa Claus, Rob Falgout, Matthias Bolten, "AMG Smoothers for Maxwell's Equations", 2018,

F.P. An, A.B. Balantekin, H.R. Band, M. Bishai, S. Blyth, D. Cao, G.F. Cao, J. Cao, Y.L. Chan, J.F. Chang, Y. Chang, H.S. Chen, Q.Y. Chen, S.M. Chen, Y.X. Chen, Y. Chen, J. Cheng, Z.K. Cheng, J.J. Cherwinka, M.C. Chu, A. Chukanov, J.P. Cummings, Y.Y. Ding, M.V. Diwan, M. Dolgareva, J. Dove, D.A. Dwyer, W.R. Edwards, R. Gill, M. Gonchar, G.H. Gong, H. Gong, M. Grassi, W.Q. Gu, L. Guo, X.H. Guo, Y.H. Guo, Z. Guo, R.W. Hackenburg, S. Hans, M. He, K.M. Heeger, Y.K. Heng, A. Higuera, Y.B. Hsiung, B.Z. Hu, T. Hu, E.C. Huang, H.X. Huang, X.T. Huang, P. Huber, W. Huo, G. Hussain, D.E. Jaffe, K.L. Jen, S. Jetter, X.P. Ji, X.L. Ji, J.B. Jiao, R.A. Johnson, D. Jones, L. Kang, S.H. Kettell, A. Khan, S. Kohn, M. Kramer, K.K. Kwan, M.W. Kwok, T. Kwok, T.J. Langford, K. Lau, L. Lebanowski, J. Lee, J.H.C. Lee, R.T. Lei, R. Leitner, C. Li, D.J. Li, F. Li, G.S. Li, Q.J. Li, S. Li, S.C. Li, W.D. Li, X.N. Li, X.Q. Li, Y.F. Li, Z.B. Li, H. Liang, C.J. Lin, G.L. Lin, S. Lin, S.K. Lin, Y.-C. Lin, J.J. Ling, J.M. Link, L. Littenberg, B.R. Littlejohn, J.L. Liu, J.C. Liu, C.W. Loh, C. Lu, H.Q. Lu, J.S. Lu, K.B. Luk, X.Y. Ma, X.B. Ma, Y.Q. Ma, Y. Malyshkin, D.A. Martinez Caicedo, K.T. McDonald, R.D. McKeown, I. Mitchell, Y. Nakajima, J. Napolitano, D. Naumov, E. Naumova, H.Y. Ngai, J.P. Ochoa-Ricoux, A. Olshevskiy, H.-R. Pan, J. Park, S. Patton, V. Pec, J.C. Peng, L. Pinsky, C.S.J. Pun, F.Z. Qi, M. Qi, X. Qian, R.M. Qiu, N. Raper, J. Ren, R. Rosero, B. Roskovec, X.C. Ruan, C. Sebastiani, H. Steiner, J.L. Sun, W. Tang, D. Taychenachev, K. Treskov, K.V. Tsang, C.E. Tull, N. Viaux, B. Viren, V. Vorobel, C.H. Wang, M. Wang, N.Y. Wang, R.G. Wang, W. Wang, X. Wang, Y.F. Wang, Z. Wang, Z. Wang, Z.M. Wang, H.Y. Wei, L.J. Wen, K. Whisnant, C.G. White, L. Whitehead, T. Wise, H.L.H. Wong, S.C.F. Wong, E. Worcester, C.-H. Wu, Q. Wu, W.J. Wu, D.M. Xia, J.K. Xia, Z.Z. Xing, J.L. Xu, Y. Xu, T. Xue, C.G. Yang, H. Yang, L. Yang, M.S. Yang, M.T. Yang, Y.Z. Yang, M. Ye, Z. Ye, M. Yeh, B.L. Young, Z.Y. Yu, S. Zeng, L. Zhan, C. Zhang, C.C. Zhang, H.H. Zhang, J.W. Zhang, Q.M. Zhang, X.T. Zhang, Y.M. Zhang, Y.X. Zhang, Y.M. Zhang, Z.J. Zhang, Z.Y. Zhang, Z.P. Zhang, J. Zhao, L. Zhou, H.L. Zhuang and J.H. Zou, "Seasonal Variation of the Underground Cosmic Muon Flux Observed at Daya Bay", Journal of Cosmology and Astroparticle Physics, January 2, 2018,

The Daya Bay Experiment consists of eight identically designed detectors located in three underground experimental halls named as EH1, EH2, EH3, with 250, 265 and 860 meters of water equivalent vertical overburden, respectively. Cosmic muon events have been recorded over a two-year period. The underground muon rate is observed to be positively correlated with the effective atmospheric temperature and to follow a seasonal modulation pattern.

Barnaby D.A. Levin, Yi Jiang, Elliot Padgett, Shawn Waldon, Cory Quammen, Chris Harris, Utkarsh Ayachit, Marcus Hanwell, Peter Ercius, David A. Muller, Robert Hovden, "Tutorial on the Visualization of Volumetric Data Using tomviz", January 1, 2018, Volume 2:12 - 17, doi: 10.1017/S1551929517001213

Felix Ruehle, Johannes Blaschke, Jan-Timm Kuhr, Holger Stark, "Gravity-induced dynamics of a squirmer microswimmer in wall proximity", New J. Phys., 2018, 20:025003,

Zingale M, Almgren AS, Barrios Sazo MG, Beckner VE, Bell JB, Friesen B, Jacobs AM, Katz MP, Malone CM, Nonaka AJ, Willcox DE, Zhang W, "Meeting the Challenges of Modeling Astrophysical Thermonuclear Explosions: Castro, Maestro, and the AMReX Astrophysics Suite", 2018, doi: 10.1088/1742-6596/1031/1/012024

We describe the AMReX suite of astrophysics codes and their application to modeling problems in stellar astrophysics. Maestro is tuned to efficiently model subsonic convective flows while Castro models the highly compressible flows associated with stellar explosions. Both are built on the block-structured adaptive mesh refinement library AMReX. Together, these codes enable a thorough investigation of stellar phenomena, including Type Ia supernovae and X-ray bursts. We describe these science applications and the approach we are taking to make these codes performant on current and future many-core and GPU-based architectures.

Jialin Liu, Debbie Bard, Quincey Koziol, Stephen Bailey, Prabhat, "Searching for Millions of Objects in the BOSS Spectroscopic Survey Data with H5Boss", IEEE NYSDS'17, January 1, 2018,

2017

C. S. Daley, D. Ghoshal, G. K. Lockwood, S. Dosanjh, L. Ramakrishnan, N. J. Wright, "Performance characterization of scientific workflows for the optimal use of Burst Buffers", Future Generation Computer Systems, December 28, 2017, doi: 10.1016/j.future.2017.12.022

Scientific discoveries are increasingly dependent upon the analysis of large volumes of data from observations and simulations of complex phenomena. Scientists compose the complex analyses as workflows and execute them on large-scale HPC systems. The workflow structures are in contrast with monolithic single simulations that have often been the primary use case on HPC systems. Simultaneously, new storage paradigms such as Burst Buffers are becoming available on HPC platforms. In this paper, we analyze the performance characteristics of a Burst Buffer and two representative scientific workflows with the aim of optimizing the usage of a Burst Buffer, extending our previous analyses (Daley et al., 2016). Our key contributions are (a) developing a performance analysis methodology pertinent to Burst Buffers, (b) improving the use of a Burst Buffer in workflows with bandwidth-sensitive and metadata-sensitive I/O workloads, and (c) highlighting the key data management challenges when incorporating a Burst Buffer in the studied scientific workflows.

Tyler Allen, Christopher S. Daley, Douglas Doerfler, Brian Austin, Nicholas J. Wright, "Performance and Energy Usage of Workloads on KNL and Haswell Architectures", High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation. PMBS 2017. Lecture Notes in Computer Science, Volume 10724., December 23, 2017,

Tiffany A. Connors, Apan Qasem, "Automatically Selecting Profitable Thread Block Sizes for Accelerated Kernels", 2017 IEEE 19th International Conference on High Performance Computing and Communications (HPCC17), December 18, 2017, doi: 10.1109/HPCC-SmartCity-DSS.2017.58

Biplab Kumar Saha, Tiffany Connors, Saami Rahman, Apan Qasem, "A Machine Learning Approach to Automatic Creation of Architecture-sensitive Performance Heuristics", 2017 IEEE 19th International Conference on High Performance Computing and Communications (HPCC), Bangkok, Thailand, December 18, 2017, doi: 10.1109/HPCC-SmartCity-DSS.2017.3

Scott Michael, Yun He, "Foreword to the Special Issue of the Cray User Group (CUG 2017)", Concurrency and Computation: Practice and Experience, December 5, 2017,

Wahid Bhimji, Debbie Bard, Kaylan Burleigh, Chris Daley, Steve Farrell, Markus Fasel, Brian Friesen, Lisa Gerhardt, Jialin Liu, Peter Nugent, Dave Paul, Jeff Porter, Vakho Tsulaia, "Extreme I/O on HPC for HEP using the Burst Buffer at NERSC", Journal of Physics: Conference Series, December 1, 2017, 898:082015,

Brian Van Straalen, David Trebotich, Andrey Ovsyannikov, Daniel T. Graves, "Exascale Scientific Applications: Programming Approaches for Scalability Performance and Portability", edited by Tjerk Straatsma, Timothy Williams, Katie Antypas, (CRC Press: December 1, 2017)

Taylor Groves, Ryan Grant, Aaron Gonzales, Dorian Arnold, "Unraveling Network-induced Memory Contention: Deeper Insights with Machine Learning", Transactions on Parallel and Distributed Systems, November 21, 2017, doi: 10.1109/TPDS.2017.2773483

Remote Direct Memory Access (RDMA) is expected to be an integral communication mechanism for future exascale systems enabling asynchronous data transfers, so that applications may fully utilize CPU resources while simultaneously sharing data amongst remote nodes. In this work we examine Network-induced Memory Contention (NiMC) on InfiniBand networks. We expose the interactions between RDMA, main-memory and cache, when applications and out-of-band services compete for memory resources. We then explore NiMC's resulting impact on application-level performance. For a range of hardware technologies and HPC workloads, we quantify NiMC and show that NiMC's impact grows with scale, resulting in up to 3X performance degradation at scales as small as 8K processes even in applications that previously have been shown to be performance resilient in the presence of noise. Additionally, this work examines the problem of predicting NiMC's impact on applications by leveraging machine learning and easily accessible performance counters. This approach provides additional insights about the root cause of NiMC and facilitates dynamic selection of potential solutions. Lastly, we evaluated three potential techniques to reduce NiMC's impact, namely hardware offloading, core reservation and network throttling.

T. Koskela, A. Ilic, Z. Matveev, S. Williams, P. Thierry, and C. Yang, Performance Tuning of Scientific Codes with the Roofline Model, Half-Day Tutorial, Supercomputing Conference (SC’17), November 2017,

Colin A. MacLean, HonWai Leong, Jeremy Enos, "Improving the start-up time of python applications on large scale HPC systems", Proceedings of HPCSYSPROS 2017, Denver, CO, 2017,

B Friesen, MMA Patwary, B Austin, N Satish, Z Slepian, N Sundaram, D Bard, DJ Eisenstein, J Deslippe, P Dubey, Prabhat, "Galactos: Computing the Anisotropic 3-Point Correlation Function for 2 Billion Galaxies", November 2017, doi: 10.1145/3126908.3126927

The nature of dark energy and the complete theory of gravity are two central questions currently facing cosmology. A vital tool for addressing them is the 3-point correlation function (3PCF), which probes deviations from a spatially random distribution of galaxies. However, the 3PCF's formidable computational expense has prevented its application to astronomical surveys comprising millions to billions of galaxies. We present Galactos, a high-performance implementation of a novel, O(N^2) algorithm that uses a load-balanced k-d tree and spherical harmonic expansions to compute the anisotropic 3PCF. Our implementation is optimized for the Intel Xeon Phi architecture, exploiting SIMD parallelism, instruction and thread concurrency, and significant L1 and L2 cache reuse, reaching 39% of peak performance on a single node. Galactos scales to the full Cori system, achieving 9.8 PF (peak) and 5.06 PF (sustained) across 9636 nodes, making the 3PCF easily computable for all galaxies in the observable universe.

Damian Rouson, Ethan D Gutmann, Alessandro Fanfarillo, Brian Friesen, "Performance portability of an intermediate-complexity atmospheric research model in coarray Fortran", November 2017, doi: 10.1145/3144779.3169104

We examine the scalability and performance of an open-source, coarray Fortran (CAF) mini-application (mini-app) that implements the parallel, numerical algorithms that dominate the execution of The Intermediate Complexity Atmospheric Research (ICAR) [4] model developed at the National Center for Atmospheric Research (NCAR). The Fortran 2008 mini-app includes one Fortran 2008 implementation of a collective subroutine defined in the Committee Draft of the upcoming Fortran 2018 standard. The ability of CAF to run atop various communication layers and the increasing CAF compiler availability facilitated evaluating several compilers, runtime libraries and hardware platforms. Results are presented for the GNU and Cray compilers, each of which offers different parallel runtime libraries employing one or more communication layers, including MPI, OpenSHMEM, and proprietary alternatives. We study performance on multi- and many-core processors in distributed memory. The results show promising scaling across a range of hardware, compiler, and runtime choices on up to ~100,000 cores.

Glenn K. Lockwood, Wucherl Yoo, Suren Byna, Nicholas J. Wright, Shane Snyder, Kevin Harms, Zachary Nault, Philip Carns, "UMAMI: a recipe for generating meaningful metrics through holistic I/O performance analysis", Proceedings of the 2nd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISCS'17), Denver, CO, ACM, November 2017, 55-60, doi: 10.1145/3149393.3149395

I/O efficiency is essential to productivity in scientific computing, especially as many scientific domains become more data-intensive. Many characterization tools have been used to elucidate specific aspects of parallel I/O performance, but analyzing components of complex I/O subsystems in isolation fails to provide insight into critical questions: how do the I/O components interact, what are reasonable expectations for application performance, and what are the underlying causes of I/O performance problems? To address these questions while capitalizing on existing component-level characterization tools, we propose an approach that combines on-demand, modular synthesis of I/O characterization data into a unified monitoring and metrics interface (UMAMI) to provide a normalized, holistic view of I/O behavior.

We evaluate the feasibility of this approach by applying it to a month-long benchmarking study on two distinct large-scale computing platforms. We present three case studies that highlight the importance of analyzing application I/O performance in context with both contemporaneous and historical component metrics, and we provide new insights into the factors affecting I/O performance. By demonstrating the generality of our approach, we lay the groundwork for a production-grade framework for holistic I/O analysis.

Tim Mattson, Alice Koniges, Yun (Helen) He, Barbara Chapman, The OpenMP Common Core: A hands-on exploration, SuperComputing 2017 Tutorial, November 12, 2017,

Kurt Ferreira, Ryan E. Grant, Michael J. Levenhagen, Scott Levy, Taylor Groves, "Hardware MPI Message Matching: Insights into MPI Matching Behavior to Inform Design", ExaMPI in association with SC17, November 12, 2017,

Richard A. Gerber, Jack Deslippe, Manycore for the Masses Part 2, Intel HPC DevCon, November 11, 2017,

Richard A Gerber, NERSC Overview - Focus: Energy Technologies, November 6, 2017,

GK Lockwood, D Hazen, Q Koziol, RS Canon, K Antypas, J Balewski, N Balthaser, W Bhimji, J Botts, J Broughton, TL Butler, GF Butler, R Cheema, C Daley, T Declerck, L Gerhardt, WE Hurlbert, KA Kallback-Rose, S Leak, J Lee, R Lee, J Liu, K Lozinskiy, D Paul, Prabhat, C Snavely, J Srinivasan, T Stone Gibbins, NJ Wright, "Storage 2020: A Vision for the Future of HPC Storage", October 20, 2017, LBNL LBNL-2001072,

As the DOE Office of Science's mission computing facility, NERSC will follow this roadmap and deploy these new storage technologies to continue delivering storage resources that meet the needs of its broad user community. NERSC's diversity of workflows encompasses significant portions of open science workloads as well, and the findings presented in this report are also intended to be a blueprint for how the evolving storage landscape can be best utilized by the greater HPC community. Executing the strategy presented here will ensure that emerging I/O technologies will be both applicable to and effective in enabling scientific discovery through extreme-scale simulation and data analysis in the coming decade.

Richard A. Gerber, NERSC Overview - Focus: Berkeley Rotary Club, October 18, 2017,

Richard A. Gerber, Current and Next Generation Supercomputing and Data Analysis at NERSC, HPC Distinguished Lecture, Iowa State University & Ames Laboratory, October 18, 2017,

A measurement of electron antineutrino oscillation by the Daya Bay Reactor Neutrino Experiment is described in detail.

Taylor Groves, Networks, Damn Networks and Aries, NERSC CS/Data Seminar, October 6, 2017,

Presentation of the performance of the Cori Aries network. Highlights of monitoring and analysis efforts underway.

Mustafa Mustafa, Jan Balewski, Jérôme Lauret, Jefferson Porter, Shane Canon, Lisa Gerhardt, Levente Hajdu, Mark Lukascsyk, "STAR Data Reconstruction at NERSC/Cori, an adaptable Docker container approach for HPC", CHEP 2016, October 1, 2017,

Douglas Doerfler, Steven Gottlieb, Carleton DeTar, Doug Toussaint, Karthik Raman, Improving the Performance of the MILC Code on Intel Knights Landing, An Overview, Intel Xeon Phi User Group Meeting 2017 Fall Meeting, September 26, 2017,

Jaehyun Han, Donghun Koo, Glenn K. Lockwood, Jaehwan Lee, Hyeonsang Eom, Soonwook Hwang, "Accelerating a Burst Buffer via User-Level I/O Isolation", Proceedings of the 2017 IEEE International Conference on Cluster Computing (CLUSTER), Honolulu, HI, IEEE, September 2017, 245-255, doi: 10.1109/CLUSTER.2017.60

Burst buffers tolerate I/O spikes in High-Performance Computing environments by using a non-volatile flash technology. Burst buffers are commonly located between parallel file systems and compute nodes, handling bursty I/Os in the middle. In this architecture, burst buffers are shared resources. The performance of an SSD is significantly reduced when it is used excessively because of garbage collection, and we have observed that SSDs in a burst buffer become slow when many users simultaneously use the burst buffer. To mitigate the performance problem, we propose a new user-level I/O isolation framework in a High-Performance Computing environment using a multi-streamed SSD. The multi-streamed SSD allocates the same flash block for I/Os in the same stream. We assign a different stream to each user; thus, the user can use the stream exclusively. To evaluate the performance, we have used open-source supercomputing workloads and I/O traces from real workloads in the Cori supercomputer at the National Energy Research Scientific Computing Center. Via user-level I/O isolation, we have obtained up to a 125% performance improvement in terms of I/O throughput. In addition, our approach reduces the write amplification in the SSDs, leading to improved SSD endurance. This user-level I/O isolation framework could be applied to deployed burst buffers without having to make any user interface changes.

Richard A. Gerber, Cori KNL Update, IXPUG 2017, Austin, TX, September 26, 2017,

Yun (Helen) He, Jack Deslippe, Enabling Applications for Cori KNL: NESAP, September 21, 2017,

NERSC Science Highlights - September 2017, NERSC Users Group Meeting 2017, September 19, 2017,

Douglas Doerfler, Brian Austin, Brandon Cook, Jack Deslippe, Krishna Kandalla, Peter Mendygral, "Evaluating the Networking Characteristics of the Cray XC-40 Intel Knights Landing Based Cori Supercomputer at NERSC", Concurrency and Computation: Practice and Experience, Volume 30, Issue 1, September 12, 2017,

Hai Ah Nam, Gabriel Rockefeller, Mike Glass, Shawn Dawson, John Levesque, Victor Lee, "The Trinity Center of Excellence Co-Design Best Practices", Computing in Science & Engineering, vol. 19, pp. 19-26, September/October 2017, September 2017,

Taylor Groves, Yizi Gu, Nicholas J. Wright, "Understanding Performance Variability on the Aries Dragonfly Network", HPCMASPA in association with IEEE Cluster, September 1, 2017,

Yun (Helen) He, Brandon Cook, Jack Deslippe, Brian Friesen, Richard Gerber, Rebecca Hartman-Baker, Alice Koniges, Thorsten Kurth, Stephen Leak, Woo-Sun Yang, Zhengji Zhao, Eddie Baron, Peter Hauschildt, "Preparing NERSC users for Cori, a Cray XC40 system with Intel Many Integrated Cores", Concurrency and Computation: Practice and Experience, August 2017, 30, doi: 10.1002/cpe.4291

The newest NERSC supercomputer Cori is a Cray XC40 system consisting of 2,388 Intel Xeon Haswell nodes and 9,688 Intel Xeon‐Phi “Knights Landing” (KNL) nodes. Compared to the Xeon‐based clusters NERSC users are familiar with, optimal performance on Cori requires consideration of KNL mode settings; process, thread, and memory affinity; fine‐grain parallelization; vectorization; and use of the high‐bandwidth MCDRAM memory. This paper describes our efforts preparing NERSC users for KNL through the NERSC Exascale Science Application Program, Web documentation, and user training. We discuss how we configured the Cori system for usability and productivity, addressing programming concerns, batch system configurations, and default KNL cluster and memory modes. System usage data, job completion analysis, programming and running jobs issues, and a few successful user stories on KNL are presented.

Zhaoyi Meng, Ekaterina Merkurjev, Alice Koniges, Andrea L. Bertozzi, "Hyperspectral Image Classification Using Graph Clustering Methods", IPOL Journal · Image Processing On Line, 2017, August 2017, doi: 10.5201/ipol.2017.204

Doug Jacobsen, Taylor Groves, Global Aries Counter Collection and Analysis, Cray Quarterly Meeting, July 25, 2017,

Hannah E Ross, Keri L Dixon, Ilian T Iliev, Garrelt Mellema, "Simulating the impact of X-ray heating during the cosmic dawn", July 2017, 468:3785–379, doi: 10.1093/mnras/stx649

Alex Gittens et al, "Matrix Factorization at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies", 2016 IEEE International Conference on Big Data, July 1, 2017,

Barbara Chapman, Alice Koniges, Yun (Helen) He, Oscar Hernandez, and Deepak Eachempati, OpenMP, An Introduction, Scaling to Petascale Institute, XSEDE Training, Berkeley, CA., June 27, 2017,

Thorsten Kurth, William Arndt, Taylor Barnes, Brandon Cook, Jack Deslippe, Doug Doerfler, Brian Friesen, Yun (Helen) He, Tuomas Koskela, Mathieu Lobet, Tareq Malas, Leonid Oliker, Andrey Ovsyannikov, Samuel Williams, Woo-Sun Yang, Zhengji Zhao, "Analyzing Performance of Selected NESAP Applications on the Cori HPC System", High Performance Computing. ISC High Performance 2017. Lecture Notes in Computer Science, Volume 10524, June 22, 2017,

C. Yang, R. C. Bording, D. Price, and R. Nealon, "Optimizing Smoothed Particle Hydrodynamics Code Phantom on Haswell and KNL", International Supercomputing Conference (ISC'17), June 2017,

Shahzeb Siddiqui, HPC Application Testing Framework – buildtest, HPCKP, June 15, 2017,

L. Xu, C. J. Yang, D. Huang, and A. Cantoni, "Exploiting Cyclic Prefix for Turbo-OFDM Receiver Design", IEEE Access, vol. 5, pp. 15762-15775, June 2017,

Yun (Helen) He, Steve Leak, and Zhengji Zhao, Using Cori KNL Nodes, Cori KNL Training, Berkeley, CA., June 9, 2017,

Mustafa Mustafa, Deborah Bard, Wahid Bhimji, Rami Al-Rfou, Zarija Lukić, "Creating Virtual Universes Using Generative Adversarial Networks", Submitted To Sci. Rep., June 1, 2017,

Taylor A. Barnes, Thorsten Kurth, Pierre Carrier, Nathan Wichmann, David Prendergast, Paul RC Kent, Jack Deslippe, "Improved treatment of exact exchange in Quantum ESPRESSO", Computer Physics Communications, May 31, 2017,

Douglas Jacobsen and Zhengji Zhao, Instrumenting Slurm User Commands to Gain Workload Insight, Proceedings of the Cray User Group Meeting (CUG18), Stockholm, Sweden, May 20, 2017,

Yun (Helen) He, Brandon Cook, Jack Deslippe, Brian Friesen, Richard Gerber, Rebecca Hartman-Baker, Alice Koniges, Thorsten Kurth, Stephen Leak, Woo-Sun Yang, Zhengji Zhao, Eddie Baron, Peter Hauschildt, Preparing NERSC users for Cori, a Cray XC40 system with Intel Many Integrated Cores, Cray User Group 2017, Redmond, WA, May 12, 2017,

Yun (Helen) He, Brandon Cook, Jack Deslippe, Brian Friesen, Richard Gerber, Rebecca Hartman-Baker, Alice Koniges, Thorsten Kurth, Stephen Leak, Woo-Sun Yang, Zhengji Zhao, Eddie Baron, Peter Hauschildt, "Preparing NERSC users for Cori, a Cray XC40 system with Intel Many Integrated Cores", Cray User Group 2017, Redmond, WA. Best Paper First Runner-Up., May 12, 2017,

Colin MacLean, "Python Usage Metrics on Blue Waters", Cray User Group, Redmond, WA, 2017,

Jialin Liu, Quincey Koziol, Houjun Tang, François Tessier, Wahid Bhimji, Brandon Cook, Brian Austin, Suren Byna, Bhupender Thakur, Glenn K. Lockwood, Jack Deslippe, Prabhat, "Understanding the IO Performance Gap Between Cori KNL and Haswell", Proceedings of the 2017 Cray User Group, Redmond, WA, May 10, 2017,

The Cori system at NERSC has two compute partitions with different CPU architectures: a 2,004 node Haswell partition and a 9,688 node KNL partition, which ranked as the 5th most powerful and fastest supercomputer on the November 2016 Top 500 list. The compute partitions share a common storage configuration, and understanding the IO performance gap between them is important, impacting not only NERSC/LBNL users and other national labs, but also the relevant hardware vendors and software developers. In this paper, we have analyzed performance of single core and single node IO comprehensively on the Haswell and KNL partitions, and have discovered the major bottlenecks, which include CPU frequencies and memory copy performance. We have also extended our performance tests to multi-node IO and revealed the IO cost difference caused by network latency, buffer size, and communication cost. Overall, we have developed a strong understanding of the IO gap between Haswell and KNL nodes, and the lessons learned from this exploration will guide us in designing optimal IO solutions in the many-core era.

Mario Melara, Todd Gamblin, Gregory Becker, Robert French, Matt Belhorn, Kelly Thompson, Peter Scheibel, Rebecca Hartman-Baker, "Using Spack to Manage Software on Cray Supercomputers", Cray User Group 2017, 2017,

Koskela TS, Deslippe J, Friesen B, Raman K, "Fusion PIC code performance analysis on the Cori KNL system", May 2017,

We study the attainable performance of Particle-In-Cell codes on the Cori KNL system by analyzing a miniature particle push application based on the fusion PIC code XGC1. We start from the most basic building blocks of a PIC code and build up the complexity to identify the kernels that cost the most in performance and focus optimization efforts there. Particle push kernels operate at high arithmetic intensity (AI) and are not likely to be memory bandwidth or even cache bandwidth bound on KNL. Therefore, we see only minor benefits from the high bandwidth memory available on KNL, and achieving good vectorization is shown to be the most beneficial optimization path, with a theoretical yield of up to 8x speedup on KNL. In practice we are able to obtain up to a 4x gain from vectorization due to limitations set by the data layout and memory latency.

Zhengji Zhao, Martijn Marsman, Florian Wende, and Jeongnim Kim, "Performance of Hybrid MPI/OpenMP VASP on Cray XC40 Based on Intel Knights Landing Many Integrated Core Architecture", https://cug.org/CUG2017, May 8, 2017,

With the recent installation of Cori, a Cray XC40 system with Intel Xeon Phi Knights Landing (KNL) many integrated core (MIC) architecture, NERSC is transitioning from the multi-core to the more energy-efficient many-core era. The developers of VASP, a widely used materials science code, have adopted MPI/OpenMP parallelism to better exploit the increased on-node parallelism, wider vector units, and the high bandwidth on-package memory (MCDRAM) of KNL. To achieve optimal performance, KNL specifics relevant for the build, boot and run time setup must be explored. In this paper, we present the performance analysis of representative VASP workloads on Cori, focusing on the effects of the compilers, libraries, and boot/run time options such as the NUMA/MCDRAM modes, HyperThreading, huge pages, core specialization, and thread scaling. The paper is intended to serve as a KNL performance guide for VASP users, but it will also benefit other KNL users.

Kirill Lozinskiy, GPFS & HPSS Interface (GHI), Spectrum Scale User Group 2017, April 5, 2017,

This presentation gives a brief overview of integration between the High Performance Storage System (HPSS) and the General Parallel File System (GPFS).

F. P. An, A. B. Balantekin, H. R. Band, M. Bishai, S. Blyth, D. Cao, G. F. Cao, J. Cao, Y. L. Chan, J. F. Chang, Y. Chang, H. S. Chen, Q. Y. Chen, S. M. Chen, Y. X. Chen, Y. Chen, J. Cheng, Z. K. Cheng, J. J. Cherwinka, M. C. Chu, A. Chukanov, J. P. Cummings, Y. Y. Ding, M. V. Diwan, M. Dolgareva, J. Dove, D. A. Dwyer, W. R. Edwards, R. Gill, M. Gonchar, G. H. Gong, H. Gong, M. Grassi, W. Q. Gu, L. Guo, X. H. Guo, Y. H. Guo, Z. Guo, R. W. Hackenburg, S. Hans, M. He, K. M. Heeger, Y. K. Heng, A. Higuera, Y. B. Hsiung, B. Z. Hu, T. Hu, E. C. Huang, H. X. Huang, X. T. Huang, Y. B. Huang, P. Huber, W. Huo, G. Hussain, D. E. Jaffe, K. L. Jen, X. P. Ji, X. L. Ji, J. B. Jiao, R. A. Johnson, D. Jones, L. Kang, S. H. Kettell, A. Khan, S. Kohn, M. Kramer, K. K. Kwan, M. W. Kwok, T. J. Langford, K. Lau, L. Lebanowski, J. Lee, J. H. C. Lee, R. T. Lei, R. Leitner, J. K. C. Leung, C. Li, D. J. Li, F. Li, G. S. Li, Q. J. Li, S. Li, S. C. Li, W. D. Li, X. N. Li, X. Q. Li, Y. F. Li, Z. B. Li, H. Liang, C. J. Lin, G. L. Lin, S. Lin, S. K. Lin, Y.-C. Lin, J. J. Ling, J. M. Link, L. Littenberg, B. R. Littlejohn, J. L. Liu, J. C. Liu, C. W. Loh, C. Lu, H. Q. Lu, J. S. Lu, K. B. Luk, X. Y. Ma, X. B. Ma, Y. Q. Ma, Y. Malyshkin, D. A. Martinez Caicedo, K. T. McDonald, R. D. McKeown, I. Mitchell, Y. Nakajima, J. Napolitano, D. Naumov, E. Naumova, H. Y. Ngai, J. P. Ochoa-Ricoux, A. Olshevskiy, H.-R. Pan, J. Park, S. Patton, V. Pec, J. C. Peng, L. Pinsky, C. S. J. Pun, F. Z. Qi, M. Qi, X. Qian, R. M. Qiu, N. Raper, J. Ren, R. Rosero, B. Roskovec, X. C. Ruan, H. Steiner, P. Stoler, J. L. Sun, W. Tang, D. Taychenachev, K. Treskov, K. V. Tsang, C. E. Tull, N. Viaux, B. Viren, V. Vorobel, C. H. Wang, M. Wang, N. Y. Wang, R. G. Wang, W. Wang, X. Wang, Y. F. Wang, Z. Wang, Z. Wang, Z. M. Wang, H. Y. Wei, L. J. Wen, K. Whisnant, C. G. White, L. Whitehead, T. Wise, H. L. H. Wong, S. C. F. Wong, E. Worcester, C.-H. Wu, Q. Wu, W. J. Wu, D. M. Xia, J. K. Xia, Z. Z. Xing, J. L. Xu, Y. Xu, T. Xue, C. G. Yang, H. Yang, L. Yang, M. S. Yang, M. T. Yang, Y. Z. Yang, M. Ye, Z. Ye, M. Yeh, B. L. Young, Z. Y. Yu, S. Zeng, L. Zhan, C. Zhang, C. C. Zhang, H. H. Zhang, J. W. Zhang, Q. M. Zhang, R. Zhang, X. T. Zhang, Y. M. Zhang, Y. X. Zhang, Y. M. Zhang, Z. J. Zhang, Z. Y. Zhang, Z. P. Zhang, J. Zhao, L. Zhou, H. L. Zhuang, J. H. Zou, "Evolution of the Reactor Antineutrino Flux and Spectrum at Daya Bay", Phys. Rev. Lett. 118, 251801 (2017), April 4, 2017,

The Daya Bay experiment has observed correlations between reactor core fuel evolution and changes in the reactor antineutrino flux and energy spectrum.

Friesen, B., Baron, E., Parrent, J. T., Thomas, R. C., Branch, D., Nugent, P., Hauschildt, P. H., Foley, R. J., Wright, D. E., Pan, Y.-C., Filippenko, A. V., Clubb, K. I., Silverman, J. M., Maeda, K., Shivvers, I., Kelly, P. L., Cohen, D. P., Rest, A., Kasen, D., "Optical and ultraviolet spectroscopic analysis of SN 2011fe at late times", Monthly Notices of the Royal Astronomical Society, February 27, 2017, 467:2392-2411, doi: 10.1093/mnras/stx241

We present optical spectra of the nearby Type Ia supernova SN 2011fe at 100, 205, 311, 349 and 578 d post-maximum light, as well as an ultraviolet (UV) spectrum obtained with the Hubble Space Telescope at 360 d post-maximum light. We compare these observations with synthetic spectra produced with the radiative transfer code phoenix. The day +100 spectrum can be well fitted with models that neglect collisional and radiative data for forbidden lines. Curiously, including these data and recomputing the fit yields a quite similar spectrum, but with different combinations of lines forming some of the stronger features. At day +205 and later epochs, forbidden lines dominate much of the optical spectrum formation; however, our results indicate that recombination, not collisional excitation, is the most influential physical process driving spectrum formation at these late times. Consequently, our synthetic optical and UV spectra at all epochs presented here are formed almost exclusively through recombination-driven fluorescence. Furthermore, our models suggest that the UV spectrum even as late as day +360 is optically thick and consists of permitted lines from several iron-peak species. These results indicate that the transition to the ‘nebular’ phase in Type Ia supernovae is complex and highly wavelength dependent.

Tutorial with handouts on the use of Shifter with an image of chos=sl64 from PDSF. Download the slides at https://docs.google.com/presentation/d/1Hh8vFE3ixxxiYTz9TgfljbUJcjmWUCNwzs-NugmLjSs/edit?usp=sharing

Rebecca Hartman-Baker, Craypat and Reveal, NERSC New User Training, February 23, 2017,

Rebecca Hartman-Baker, Accounts and Allocations, NERSC New User Training, February 23, 2017,

Rebecca Hartman-Baker, NERSC Overview, NERSC New User Training, February 23, 2017,

Richard A Gerber, February 2017 Allocations and Usage Update, February 23, 2017,

NERSC Users Group webinar on 2017 allocations for Cori Knights Landing nodes and queue wait time reduction actions.

Florian Wende, Martijn Marsman, Zhengji Zhao, Jeongnim Kim, "Porting VASP from MPI to MPI+OpenMP [SIMD] - Optimization Strategies, Insights and Feature Proposals", 13th International Workshop on OpenMP (IWOMP), September 20–22, 2017, Stony Brook, NY, USA, February 18, 2017,

Richard A Gerber, NERSC Allocations Forecast for PIs, NERSC Users Group, February 16, 2017,

Evan Berkowitz, Thorsten Kurth, Amy Nicholson, Balint Joo, Enrico Rinaldi, Mark Strother, Pavlos Vranas, Andre Walker-Loud, "Two-Nucleon Higher Partial-Wave Scattering from Lattice QCD", Phys. Lett. B, February 10, 2017,

Richard A. Gerber, Allocations and Usage Update for DOE Program Managers, February 7, 2017,

Richard A. Gerber, NERSC's KNL System: Cori, Exascale Computing Project All-Hands Meeting, February 1, 2017,

Presented at the DOE Exascale Computing Project annual meeting in Knoxville, TN.

Wangyi Liu, Alice Koniges, Kevin Gott, David Eder, John Barnard, Alex Friedman, Nathan Masters, Aaron Fisher, "Surface tension models for a multi-material ALE code with AMR", Computers & Fluids, January 2017, doi: 10.1016/j.compfluid.2017.01.016

Special Issue on Data-Intensive Scalable Computing Systems, Special Issue of Parallel Computing, Pages: 1-96, January 31, 2017,

Richard A. Gerber, Overview of NERSC, Presented to SLAC Computing, January 24, 2017,

Taylor Groves, Characterizing Power and Performance in HPC Networks, Future Technologies Group at ORNL, January 10, 2017,

Taylor Groves, Characterizing and Improving Power and Performance in HPC Networks, Advanced Technology Group -- NERSC, January 8, 2017,

Jan-Timm Kuhr, Johannes Blaschke, Felix Ruehle, Holger Stark, "Collective sedimentation of squirmers under gravity", Soft Matter, 2017, 13:7548-7555,

Jack Deslippe, Doug Doerfler, Brandon Cook, Tareq Malas, Samuel Williams, Sudip Dosanjh, "Optimizing Science Applications for the Cori, Knights Landing, System at NERSC", Advances in Parallel Computing, Volume 30: New Frontiers in High Performance Computing and Big Data, (January 1, 2017)

Ryan E. Grant, Taylor Groves, Simon Hammond, K. Scott Hemmert, Michael Levenhagen, Ron Brightwell, "Handbook of Exascale Computing: Network Communications", (ISBN:978-1466569003 Chapman and Hall: January 1, 2017)