
NERSC Staff Publications & Presentations

This page displays a bibliography of staff publications and presentations from Jan. 1, 2017 to present. Earlier publications are available in the archive.


Journal Article

2019

Phuong Hoai Ha, Otto J. Anshus, Ibrahim Umar, "Efficient concurrent search trees using portable fine-grained locality", IEEE Transactions on Parallel and Distributed Systems, January 14, 2019,

Abhinav Thota, Yun He, "Foreword to the Special Issue of the Cray User Group (CUG 2018)", Concurrency and Computation: Practice and Experience, January 11, 2019,

Osni A. Marques, David E. Bernholdt, Elaine M. Raybourn, Ashley D. Barker, Rebecca J. Hartman-Baker, "The HPC Best Practices Webinar Series", Journal of Computational Science Education, January 2019, doi: 10.22369/issn.2153-4136/10/1/19

In this contribution, we discuss our experiences organizing the Best Practices for HPC Software Developers (HPC-BP) webinar series, an effort for the dissemination of software development methodologies, tools and experiences to improve developer productivity and software sustainability. HPC-BP is an outreach component of the IDEAS Productivity Project and has been designed to support the IDEAS mission to work with scientific software development teams to enhance their productivity and the sustainability of their codes. The series, which was launched in 2016, has just presented its 22nd webinar. We summarize and distill our experiences with these webinars, including what we consider to be “best practices” in the execution of both individual webinars and a long-running series like HPC-BP. We also discuss future opportunities and challenges in continuing the series.

 

2018

Thomas Heller, Bryce Adelstein Lelbach, Kevin A. Huck, John Biddiscombe, Patricia Grubel, Alice E. Koniges, Matthias Kretz, Dominic Marcello, David Pfander, Adrian Serio, Juhan Frank, Geoffrey C. Clayton, Dirk Pflüger, David Eder, and Hartmut Kaiser, "Harnessing Billions of Tasks for a Scalable Portable Hydrodynamic Simulation of the Merger of Two Stars", The International Journal of High Performance Computing Applications, 2018, Accepted,

Vetter, Jeffrey S.; Brightwell, Ron; Gokhale, Maya; McCormick, Pat; Ross, Rob; Shalf, John; Antypas, Katie; Donofrio, David; Humble, Travis; Schuman, Catherine; Van Essen, Brian; Yoo, Shinjae; Aiken, Alex; Bernholdt, David; Byna, Suren; Cameron, Kirk; Cappello, Frank; Chapman, Barbara; Chien, Andrew; Hall, Mary; Hartman-Baker, Rebecca; Lan, Zhiling; Lang, Michael; Leidel, John; Li, Sherry; Lucas, Robert; Mellor-Crummey, John; Peltz Jr., Paul; Peterka, Thomas; Strout, Michelle; Wilke, Jeremiah, "Extreme Heterogeneity 2018 - Productive Computational Science in the Era of Extreme Heterogeneity: Report for DOE ASCR Workshop on Extreme Heterogeneity", December 2018, doi: 10.2172/1473756

Kurt Ferreira, Ryan E. Grant, Michael J. Levenhagen, Scott Levy, Taylor Groves, "Hardware MPI Message Matching: Insights into MPI Matching Behavior to Inform Design", Concurrency and Computation: Practice and Experience, December 1, 2018,

Suzanne M. Kosina, Annette M. Greiner, Rebecca K. Lau, Stefan Jenkins, Richard Baran, Benjamin P. Bowen and Trent R. Northen, "Web of microbes (WoM): a curated microbial exometabolomics database for linking chemistry and microbes", BMC Microbiology, September 12, 2018, 18, doi: 10.1186/s12866-018-1256-y

As microbiome research becomes increasingly prevalent in the fields of human health, agriculture and biotechnology, there exists a need for a resource to better link organisms and environmental chemistries. Exometabolomics experiments now provide assertions of the metabolites present within specific environments and how the production and depletion of metabolites is linked to specific microbes. This information could be broadly useful, from comparing metabolites across environments, to predicting competition and exchange of metabolites between microbes, and to designing stable microbial consortia. Here, we introduce Web of Microbes (WoM; freely available at: http://webofmicrobes.org), the first exometabolomics data repository and visualization tool.

Adam P Arkin, Robert W Cottingham, Christopher S Henry, Nomi L Harris, Rick L Stevens, Sergei Maslov, Paramvir Dehal, Doreen Ware, Fernando Perez, Shane Canon, Michael W Sneddon, Matthew L Henderson, William J Riehl, Dan Murphy-Olson, Stephen Y Chan, Roy T Kamimura, Sunita Kumari, Meghan M Drake, Thomas S Brettin, Elizabeth M Glass, Dylan Chivian, Dan Gunter, David J Weston, Benjamin H Allen, Jason Baumohl, Aaron A Best, Ben Bowen, Steven E Brenner, Christopher C Bun, John-Marc Chandonia, Jer-Ming Chia, Ric Colasanti, Neal Conrad, James J Davis, Brian H Davison, Matthew DeJongh, Scott Devoid, Emily Dietrich, Inna Dubchak, Janaka N Edirisinghe, Gang Fang, José P Faria, Paul M Frybarger, Wolfgang Gerlach, Mark Gerstein, Annette Greiner, James Gurtowski, Holly L Haun, Fei He, Rashmi Jain, Marcin P Joachimiak, Kevin P Keegan, Shinnosuke Kondo, Vivek Kumar, Miriam L Land, Folker Meyer, Marissa Mills, Pavel S Novichkov, Taeyun Oh, Gary J Olsen, Robert Olson, Bruce Parrello, Shiran Pasternak, Erik Pearson, Sarah S Poon, Gavin A Price, Srividya Ramakrishnan, Priya Ranjan, Pamela C Ronald, Michael C Schatz, Samuel M D Seaver, Maulik Shukla, Roman A Sutormin, Mustafa H Syed, James Thomason, Nathan L Tintle, Daifeng Wang, Fangfang Xia, Hyunseung Yoo, Shinjae Yoo, Dantong Yu, "KBase: the United States department of energy systems biology knowledgebase", Nature Biotechnology, July 6, 2018, 36.7, doi: 10.1038/nbt.4163.

Here we present the DOE Systems Biology Knowledgebase (KBase, http://kbase.us), an open-source software and data platform that enables data sharing, integration, and analysis of microbes, plants, and their communities. KBase maintains an internal reference database that consolidates information from widely used external data repositories. This includes over 90,000 microbial genomes from RefSeq, over 50 plant genomes from Phytozome, over 300 Biolog media formulations, and >30,000 reactions and compounds from KEGG, BiGG, and MetaCyc. These public data are available for integration with user data where appropriate (e.g., genome comparison or building species trees). KBase links these diverse data types with a range of analytical functions within a web-based user interface. This extensive community resource facilitates large-scale analyses on scalable computing infrastructure and has the potential to accelerate scientific discovery, improve reproducibility, and foster open collaboration.

Jason R. Lee, et al., "Enhancing supercomputing with software defined networking", IEEE International Conference on Information Networking (ICOIN), January 10, 2018,

2017

C. S. Daley, D. Ghoshal, G. K. Lockwood, S. Dosanjh, L. Ramakrishnan, N. J. Wright, "Performance characterization of scientific workflows for the optimal use of Burst Buffers", Future Generation Computer Systems, December 28, 2017, doi: 10.1016/j.future.2017.12.022

Scientific discoveries are increasingly dependent upon the analysis of large volumes of data from observations and simulations of complex phenomena. Scientists compose the complex analyses as workflows and execute them on large-scale HPC systems. The workflow structures are in contrast with monolithic single simulations that have often been the primary use case on HPC systems. Simultaneously, new storage paradigms such as Burst Buffers are becoming available on HPC platforms. In this paper, we analyze the performance characteristics of a Burst Buffer and two representative scientific workflows with the aim of optimizing the usage of a Burst Buffer, extending our previous analyses (Daley et al., 2016). Our key contributions are (a) developing a performance analysis methodology pertinent to Burst Buffers, (b) improving the use of a Burst Buffer in workflows with bandwidth-sensitive and metadata-sensitive I/O workloads, and (c) highlighting the key data management challenges when incorporating a Burst Buffer in the studied scientific workflows.

Scott Michael, Yun He, "Foreword to the Special Issue of the Cray User Group (CUG 2017)", Concurrency and Computation: Practice and Experience, December 5, 2017,

Taylor Groves, Ryan Grant, Aaron Gonzales, Dorian Arnold, "Unraveling Network-induced Memory Contention: Deeper Insights with Machine Learning", IEEE Transactions on Parallel and Distributed Systems, November 21, 2017, doi: 10.1109/TPDS.2017.2773483

Remote Direct Memory Access (RDMA) is expected to be an integral communication mechanism for future exascale systems, enabling asynchronous data transfers so that applications may fully utilize CPU resources while simultaneously sharing data amongst remote nodes. In this work we examine Network-induced Memory Contention (NiMC) on InfiniBand networks. We expose the interactions between RDMA, main memory, and cache when applications and out-of-band services compete for memory resources. We then explore NiMC's resulting impact on application-level performance. For a range of hardware technologies and HPC workloads, we quantify NiMC and show that its impact grows with scale, resulting in up to 3X performance degradation at scales as small as 8K processes, even in applications that have previously been shown to be performance resilient in the presence of noise. Additionally, this work examines the problem of predicting NiMC's impact on applications by leveraging machine learning and easily accessible performance counters. This approach provides additional insight into the root cause of NiMC and facilitates dynamic selection of potential solutions. Lastly, we evaluate three potential techniques to reduce NiMC's impact, namely hardware offloading, core reservation, and network throttling.

Douglas Doerfler, Brian Austin, Brandon Cook, Jack Deslippe, Krishna Kandalla, Peter Mendygral, "Evaluating the Networking Characteristics of the Cray XC-40 Intel Knights Landing Based Cori Supercomputer at NERSC", Concurrency and Computation: Practice and Experience, Volume 30, Issue 1, September 12, 2017,

Yun (Helen) He, Brandon Cook, Jack Deslippe, Brian Friesen, Richard Gerber, Rebecca Hartman-Baker, Alice Koniges, Thorsten Kurth, Stephen Leak, Woo-Sun Yang, Zhengji Zhao, Eddie Baron, Peter Hauschildt, "Preparing NERSC users for Cori, a Cray XC40 system with Intel Many Integrated Cores", Concurrency and Computation: Practice and Experience, August 2017, 30, doi: 10.1002/cpe.4291

The newest NERSC supercomputer Cori is a Cray XC40 system consisting of 2,388 Intel Xeon Haswell nodes and 9,688 Intel Xeon‐Phi “Knights Landing” (KNL) nodes. Compared to the Xeon‐based clusters NERSC users are familiar with, optimal performance on Cori requires consideration of KNL mode settings; process, thread, and memory affinity; fine‐grain parallelization; vectorization; and use of the high‐bandwidth MCDRAM memory. This paper describes our efforts preparing NERSC users for KNL through the NERSC Exascale Science Application Program, Web documentation, and user training. We discuss how we configured the Cori system for usability and productivity, addressing programming concerns, batch system configurations, and default KNL cluster and memory modes. System usage data, job completion analysis, programming and running jobs issues, and a few successful user stories on KNL are presented.

Zhaoyi Meng, Ekaterina Merkurjev, Alice Koniges, Andrea L. Bertozzi, "Hyperspectral Image Classification Using Graph Clustering Methods", IPOL Journal · Image Processing On Line, 2017, doi: 10.5201/ipol.2017.204

L. Xu, C. Yang, D. Huang, and A. Cantoni, "Exploiting Cyclic Prefix for Turbo-OFDM Receiver Design", IEEE Access, vol. 5, pp. 15762-15775, June 15, 2017,

Mustafa Mustafa, Deborah Bard, Wahid Bhimji, Rami Al-Rfou, Zarija Lukić, "Creating Virtual Universes Using Generative Adversarial Networks", Submitted to Sci. Rep., June 1, 2017,

Taylor A. Barnes, Thorsten Kurth, Pierre Carrier, Nathan Wichmann, David Prendergast, Paul RC Kent, Jack Deslippe, "Improved treatment of exact exchange in Quantum ESPRESSO.", Computer Physics Communications, May 31, 2017,

Friesen, B., Baron, E., Parrent, J. T., Thomas, R. C., Branch, D., Nugent, P., Hauschildt, P. H., Foley, R. J., Wright, D. E., Pan, Y.-C., Filippenko, A. V., Clubb, K. I., Silverman, J. M., Maeda, K., Shivvers, I., Kelly, P. L., Cohen, D. P., Rest, A., Kasen, D., "Optical and ultraviolet spectroscopic analysis of SN 2011fe at late times", Monthly Notices of the Royal Astronomical Society, February 27, 2017, 467:2392-2411, doi: 10.1093/mnras/stx241

We present optical spectra of the nearby Type Ia supernova SN 2011fe at 100, 205, 311, 349 and 578 d post-maximum light, as well as an ultraviolet (UV) spectrum obtained with the Hubble Space Telescope at 360 d post-maximum light. We compare these observations with synthetic spectra produced with the radiative transfer code phoenix. The day +100 spectrum can be well fitted with models that neglect collisional and radiative data for forbidden lines. Curiously, including these data and recomputing the fit yields a quite similar spectrum, but with different combinations of lines forming some of the stronger features. At day +205 and later epochs, forbidden lines dominate much of the optical spectrum formation; however, our results indicate that recombination, not collisional excitation, is the most influential physical process driving spectrum formation at these late times. Consequently, our synthetic optical and UV spectra at all epochs presented here are formed almost exclusively through recombination-driven fluorescence. Furthermore, our models suggest that the UV spectrum even as late as day +360 is optically thick and consists of permitted lines from several iron-peak species. These results indicate that the transition to the ‘nebular’ phase in Type Ia supernovae is complex and highly wavelength dependent.

Evan Berkowitz, Thorsten Kurth, Amy Nicholson, Balint Joo, Enrico Rinaldi, Mark Strother, Pavlos Vranas, Andre Walker-Loud, "Two-Nucleon Higher Partial-Wave Scattering from Lattice QCD", Phys. Lett. B, February 10, 2017,

Wangyi Liu, Alice Koniges, Kevin Gott, David Eder, John Barnard, Alex Friedman, Nathan Masters, Aaron Fisher, "Surface tension models for a multi-material ALE code with AMR", Computers & Fluids, January 2017, doi: 10.1016/j.compfluid.2017.01.016

Conference Paper

2018

Charlene Yang, Rahulkumar Gayatri, Thorsten Kurth, Protonu Basu, Zahra Ronaghi, Adedoyin Adetokunbo, Brian Friesen, Brandon Cook, Douglas Doerfler, Leonid Oliker, Jack Deslippe, Samuel Williams, "An Empirical Roofline Methodology for Quantitatively Assessing Performance Portability", International Workshop on Performance, Portability and Productivity in HPC (P3HPC18), November 16, 2018,

Brian Austin, Chris Daley, Douglas Doerfler, Jack Deslippe, Brandon Cook, Brian Friesen, Thorsten Kurth, Charlene Yang, Nicholas J. Wright, "A Metric for Evaluating Supercomputer Performance in the Era of Extreme Heterogeneity", 9th IEEE International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS18), November 12, 2018,

Glenn K. Lockwood, Shane Snyder, Teng Wang, Suren Byna, Philip Carns, Nicholas J. Wright, "A Year in the Life of a Parallel File System", Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, Dallas, TX, IEEE Press, November 11, 2018, 71:1--74:1,

I/O performance is a critical aspect of data-intensive scientific computing. We seek to advance the state of the practice in understanding and diagnosing I/O performance issues through investigation of a comprehensive I/O performance data set that captures a full year of production storage activity at two leadership-scale computing facilities. We demonstrate techniques to identify regions of interest, perform focused investigations of both long-term trends and transient anomalies, and uncover the contributing factors that lead to performance fluctuation.


We find that a year in the life of a parallel file system comprises distinct regions of long-term performance variation in addition to short-term performance transients. We demonstrate how systematic identification of these performance regions, combined with comprehensive analysis, allows us to isolate the factors contributing to different performance maladies at different time scales. From this, we present specific lessons learned and important considerations for HPC storage practitioners.

Nathan Hjelm, Matthew Dosanjh, Ryan Grant, Taylor Groves, Patrick Bridges, Dorian Arnold, "Improving MPI Multi-threaded RMA Communication Performance", August 1, 2018,

T. Koskela, Z. Matveev, C. Yang, A. Adetokunbo, R. Belenov, P. Thierry, Z. Zhao, R. Gayatri, H. Shan, L. Oliker, J. Deslippe, R. Green, and S. Williams, "A Novel Multi-Level Integrated Roofline Model Approach for Performance Characterization", International Supercomputing Conference (ISC) 2018, June 28, 2018,

B. Cook, C. Yang, B. Friesen, T. Kurth and J. Deslippe, "CSB COO sparse matrix vector performance on Intel Xeon and Xeon Phi architectures", Intel eXtreme Performance Users Group (IXPUG) at International Supercomputing Conference (ISC) 2018, June 28, 2018,

Glenn K. Lockwood, Nicholas J. Wright, Shane Snyder, Philip Carns, George Brown, Kevin Harms, "TOKIO on ClusterStor: Connecting Standard Tools to Enable Holistic I/O Performance Analysis", Proceedings of the 2018 Cray User Group, Stockholm, SE, May 24, 2018,

At present, I/O performance analysis requires different tools to characterize individual components of the I/O subsystem, and institutional I/O expertise is relied upon to translate these disparate data into an integrated view of application performance. This process is labor-intensive and not sustainable as the storage hierarchy deepens and system complexity increases. To address this growing disparity, we have developed the Total Knowledge of I/O (TOKIO) framework to combine the insights from existing component-level monitoring tools and provide a holistic view of performance across the entire I/O stack. 

A reference implementation of TOKIO, pytokio, is presented here. Using monitoring tools included with Cray XC and ClusterStor systems alongside commonly deployed community-supported tools, we demonstrate how pytokio provides a lightweight foundation for holistic I/O performance analyses on two Cray XC systems deployed at different HPC centers. We present results from integrated analyses that allow users to quantify the degree of I/O contention that affected their jobs and probabilistically identify unhealthy storage devices that impacted their performance. We also apply pytokio to inspect the utilization of NERSC's DataWarp burst buffer and demonstrate how pytokio can be used to identify users and applications who may stand to benefit most from migrating their workloads from Lustre to the burst buffer.

C. Yang, B. Friesen, T. Kurth, B. Cook, S. Williams, "Toward Automated Application Profiling on Cray Systems", Cray User Group (CUG) 2018, May 24, 2018,

Ville Ahlgren, Stefan Andersson, Jim Brandt, Nicholas Cardo, Sudheer Chunduri, Jeremy Enos, Parks Fields, Ann Gentile, Richard Gerber, Joe Greenseid, Annette Greiner, Bilel Hadri, Helen He, Dennis Hoppe, Urpo Kaila, Kaki Kelly, Mark Klein, Alex Kristiansen, Steve Leak, Michael Mason, Kevin Pedretti, Jean-Guillaume Piccinali, Jason Repik, Jim Rogers, Susanna Salminen, Michael Showerman, Cary Whitney, Jim Williams, "Cray System Monitoring: Successes, Priorities, Visions", CUG 2018 Proceedings, Stockholm, Cray User Group, May 22, 2018,

Effective HPC system operations and utilization require unprecedented insight into system state, applications' demands for resources, contention for shared resources, and system demands on center power and cooling. Monitoring can provide such insights when the necessary fundamental capabilities for data availability and usability are provided. In this paper, multiple Cray sites seek to motivate monitoring as a core capability in HPC design, through the presentation of success stories illustrating enhanced understanding and improved performance and/or operations as a result of monitoring and analysis. We present the utility, limitations, and gaps of the data necessary to enable the required insights. The capabilities developed to enable these successes drive our identification and prioritization of monitoring system requirements. Ultimately, we seek to engage all HPC stakeholders to drive community and vendor progress on these priorities.

Stephen Leak, Annette Greiner, Ann Gentile, James Brandt, "Supporting failure analysis with discoverable, annotated log datasets", CUG 2018 Proceedings, Stockholm, Cray User Group, May 22, 2018,

Detection, characterization, and mitigation of faults on supercomputers are complicated by the large variety of interacting subsystems. Failures often manifest as vague observations like "my job failed" and may result from system hardware/firmware/software, filesystems, networks, resource manager state, and more. Data such as system logs, environmental metrics, job history, cluster state snapshots, published outage notices and user reports is routinely collected. These data are typically stored in different locations and formats for specific use by targeted consumers. Combining data sources for analysis generally requires a consumer-dependent custom approach. We present a vocabulary for describing data, including format and access details, an annotation schema for attaching observations to a dataset, and tools to aid in discovery and publishing system-related insights. We present case studies in which our analysis tools utilize information from disparate data sources to investigate failures and performance issues from user and administrator perspectives.

Jialin Liu, Debbie Bard, Quincey Koziol, Stephen Bailey, Prabhat, "Searching for Millions of Objects in the BOSS Spectroscopic Survey Data with H5Boss", IEEE NYSDS'17, January 1, 2018,

Zingale M, Almgren AS, Barrios Sazo MG, Beckner VE, Bell JB, Friesen B, Jacobs AM, Katz MP, Malone CM, Nonaka AJ, Willcox DE, Zhang W, "Meeting the Challenges of Modeling Astrophysical Thermonuclear Explosions: Castro, Maestro, and the AMReX Astrophysics Suite", 2018, doi: 10.1088/1742-6596/1031/1/012024

We describe the AMReX suite of astrophysics codes and their application to modeling problems in stellar astrophysics. Maestro is tuned to efficiently model subsonic convective flows while Castro models the highly compressible flows associated with stellar explosions. Both are built on the block-structured adaptive mesh refinement library AMReX. Together, these codes enable a thorough investigation of stellar phenomena, including Type Ia supernovae and X-ray bursts. We describe these science applications and the approach we are taking to make these codes performant on current and future many-core and GPU-based architectures.

2017

Tyler Allen, Christopher S. Daley, Douglas Doerfler, Brian Austin, Nicholas J. Wright, "Performance and Energy Usage of Workloads on KNL and Haswell Architectures", High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation. PMBS 2017. Lecture Notes in Computer Science, Volume 10724, December 23, 2017,

Wahid Bhimji, Debbie Bard, Kaylan Burleigh, Chris Daley, Steve Farrell, Markus Fasel, Brian Friesen, Lisa Gerhardt, Jialin Liu, Peter Nugent, Dave Paul, Jeff Porter, Vakho Tsulaia, "Extreme I/O on HPC for HEP using the Burst Buffer at NERSC", Journal of Physics: Conference Series, December 1, 2017, 898:082015,

Colin A. MacLean, HonWai Leong, Jeremy Enos, "Improving the start-up time of python applications on large scale HPC systems", Proceedings of HPCSYSPROS 2017, Denver, CO, 2017,

Kurt Ferreira, Ryan E. Grant, Michael J. Levenhagen, Scott Levy, Taylor Groves, "Hardware MPI Message Matching: Insights into MPI Matching Behavior to Inform Design", ExaMPI in association with SC17, November 12, 2017,

B Friesen, MMA Patwary, B Austin, N Satish, Z Slepian, N Sundaram, D Bard, DJ Eisenstein, J Deslippe, P Dubey, Prabhat, "Galactos: Computing the Anisotropic 3-Point Correlation Function for 2 Billion Galaxies", November 2017, doi: 10.1145/3126908.3126927

The nature of dark energy and the complete theory of gravity are two central questions currently facing cosmology. A vital tool for addressing them is the 3-point correlation function (3PCF), which probes deviations from a spatially random distribution of galaxies. However, the 3PCF's formidable computational expense has prevented its application to astronomical surveys comprising millions to billions of galaxies. We present Galactos, a high-performance implementation of a novel, O(N²) algorithm that uses a load-balanced k-d tree and spherical harmonic expansions to compute the anisotropic 3PCF. Our implementation is optimized for the Intel Xeon Phi architecture, exploiting SIMD parallelism, instruction and thread concurrency, and significant L1 and L2 cache reuse, reaching 39% of peak performance on a single node. Galactos scales to the full Cori system, achieving 9.8 PF (peak) and 5.06 PF (sustained) across 9636 nodes, making the 3PCF easily computable for all galaxies in the observable universe.
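For orientation only (this equation is not quoted from the paper): the spherical-harmonic approach referenced above builds on the standard multipole expansion of the 3PCF about each galaxy, which in the isotropic case takes the form

\zeta(r_1, r_2, \hat{r}_1 \cdot \hat{r}_2) \;=\; \sum_{\ell} \zeta_\ell(r_1, r_2)\, P_\ell(\hat{r}_1 \cdot \hat{r}_2),

where r_1 and r_2 are the separations of two neighbors from a primary galaxy and P_ℓ are Legendre polynomials; the anisotropic estimator computed by Galactos generalizes this basis using spherical harmonics.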

Glenn K. Lockwood, Wucherl Yoo, Suren Byna, Nicholas J. Wright, Shane Snyder, Kevin Harms, Zachary Nault, Philip Carns, "UMAMI: a recipe for generating meaningful metrics through holistic I/O performance analysis", Proceedings of the 2nd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISCS'17), Denver, CO, ACM, November 2017, 55-60, doi: 10.1145/3149393.3149395

I/O efficiency is essential to productivity in scientific computing, especially as many scientific domains become more data-intensive. Many characterization tools have been used to elucidate specific aspects of parallel I/O performance, but analyzing components of complex I/O subsystems in isolation fails to provide insight into critical questions: how do the I/O components interact, what are reasonable expectations for application performance, and what are the underlying causes of I/O performance problems? To address these questions while capitalizing on existing component-level characterization tools, we propose an approach that combines on-demand, modular synthesis of I/O characterization data into a unified monitoring and metrics interface (UMAMI) to provide a normalized, holistic view of I/O behavior.

We evaluate the feasibility of this approach by applying it to a month-long benchmarking study on two distinct large-scale computing platforms. We present three case studies that highlight the importance of analyzing application I/O performance in context with both contemporaneous and historical component metrics, and we provide new insights into the factors affecting I/O performance. By demonstrating the generality of our approach, we lay the groundwork for a production-grade framework for holistic I/O analysis.

Damian Rouson, Ethan D Gutmann, Alessandro Fanfarillo, Brian Friesen, "Performance portability of an intermediate-complexity atmospheric research model in coarray Fortran", November 2017, doi: 10.1145/3144779.3169104

We examine the scalability and performance of an open-source, coarray Fortran (CAF) mini-application (mini-app) that implements the parallel, numerical algorithms that dominate the execution of The Intermediate Complexity Atmospheric Research (ICAR) [4] model developed at the National Center for Atmospheric Research (NCAR). The Fortran 2008 mini-app includes one Fortran 2008 implementation of a collective subroutine defined in the Committee Draft of the upcoming Fortran 2018 standard. The ability of CAF to run atop various communication layers and the increasing CAF compiler availability facilitated evaluating several compilers, runtime libraries and hardware platforms. Results are presented for the GNU and Cray compilers, each of which offers different parallel runtime libraries employing one or more communication layers, including MPI, OpenSHMEM, and proprietary alternatives. We study performance on multi- and many-core processors in distributed memory. The results show promising scaling across a range of hardware, compiler, and runtime choices on up to ~100,000 cores.

Jaehyun Han, Donghun Koo, Glenn K. Lockwood, Jaehwan Lee, Hyeonsang Eom, Soonwook Hwang, "Accelerating a Burst Buffer via User-Level I/O Isolation", Proceedings of the 2017 IEEE International Conference on Cluster Computing (CLUSTER), Honolulu, HI, IEEE, September 2017, 245-255, doi: 10.1109/CLUSTER.2017.60

Burst buffers tolerate I/O spikes in High-Performance Computing environments by using a non-volatile flash technology. Burst buffers are commonly located between parallel file systems and compute nodes, handling bursty I/Os in the middle. In this architecture, burst buffers are shared resources. The performance of an SSD is significantly reduced when it is used excessively because of garbage collection, and we have observed that SSDs in a burst buffer become slow when many users simultaneously use the burst buffer. To mitigate the performance problem, we propose a new user-level I/O isolation framework in a High-Performance Computing environment using a multi-streamed SSD. The multi-streamed SSD allocates the same flash block for I/Os in the same stream. We assign a different stream to each user; thus, the user can use the stream exclusively. To evaluate the performance, we have used open-source supercomputing workloads and I/O traces from real workloads in the Cori supercomputer at the National Energy Research Scientific Computing Center. Via user-level I/O isolation, we have obtained up to a 125% performance improvement in terms of I/O throughput. In addition, our approach reduces the write amplification in the SSDs, leading to improved SSD endurance. This user-level I/O isolation framework could be applied to deployed burst buffers without having to make any user interface changes.

Taylor Groves, Yizi Gu, Nicholas J. Wright, "Understanding Performance Variability on the Aries Dragonfly Network", HPCMASPA in association with IEEE Cluster, September 1, 2017,

Alex Gittens et al, "Matrix Factorization at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies", 2016 IEEE International Conference on Big Data, July 1, 2017,

Thorsten Kurth, William Arndt, Taylor Barnes, Brandon Cook, Jack Deslippe, Doug Doerfler, Brian Friesen, Yun (Helen) He, Tuomas Koskela, Mathieu Lobet, Tareq Malas, Leonid Oliker, Andrey Ovsyannikov, Samuel Williams, Woo-Sun Yang, Zhengji Zhao, "Analyzing Performance of Selected NESAP Applications on the Cori HPC System", High Performance Computing. ISC High Performance 2017. Lecture Notes in Computer Science, Volume 10524, June 22, 2017,

Yun (Helen) He, Brandon Cook, Jack Deslippe, Brian Friesen, Richard Gerber, Rebecca Hartman-Baker, Alice Koniges, Thorsten Kurth, Stephen Leak, Woo-Sun Yang, Zhengji Zhao, Eddie Baron, Peter Hauschildt, "Preparing NERSC users for Cori, a Cray XC40 system with Intel Many Integrated Cores", Cray User Group 2017, Redmond, WA (Best Paper First Runner-Up), May 12, 2017,

Colin MacLean, "Python Usage Metrics on Blue Waters", Cray User Group, Redmond, WA, 2017,

Jialin Liu, Quincey Koziol, Houjun Tang, François Tessier, Wahid Bhimji, Brandon Cook, Brian Austin, Suren Byna, Bhupender Thakur, Glenn K. Lockwood, Jack Deslippe, Prabhat, "Understanding the IO Performance Gap Between Cori KNL and Haswell", Proceedings of the 2017 Cray User Group, Redmond, WA, May 10, 2017,

The Cori system at NERSC has two compute partitions with different CPU architectures: a 2,004-node Haswell partition and a 9,688-node KNL partition, which ranked as the 5th fastest supercomputer on the November 2016 Top 500 list. The compute partitions share a common storage configuration, and understanding the IO performance gap between them is important not only to NERSC/LBNL users and other national labs, but also to the relevant hardware vendors and software developers. In this paper, we have comprehensively analyzed single-core and single-node IO performance on the Haswell and KNL partitions and have identified the major bottlenecks, which include CPU frequency and memory copy performance. We have also extended our performance tests to multi-node IO and revealed the IO cost differences caused by network latency, buffer size, and communication cost. Overall, we have developed a strong understanding of the IO gap between Haswell and KNL nodes, and the lessons learned from this exploration will guide us in designing optimal IO solutions in the many-core era.

Mario Melara, Todd Gamblin, Gregory Becker, Robert French, Matt Belhorn, Kelly Thompson, Peter Scheibel, Rebecca Hartman-Baker, "Using Spack to Manage Software on Cray Supercomputers", Cray User Group 2017, 2017,

Zhengji Zhao, Martijn Marsman, Florian Wende, and Jeongnim Kim, "Performance of Hybrid MPI/OpenMP VASP on Cray XC40 Based on Intel Knights Landing Many Integrated Core Architecture", Cray User Group (CUG) 2017, https://cug.org/CUG2017, May 8, 2017,

With the recent installation of Cori, a Cray XC40 system with Intel Xeon Phi Knights Landing (KNL) many integrated core (MIC) architecture, NERSC is transitioning from the multi-core to the more energy-efficient many-core era. The developers of VASP, a widely used materials science code, have adopted MPI/OpenMP parallelism to better exploit the increased on-node parallelism, wider vector units, and the high bandwidth on-package memory (MCDRAM) of KNL. To achieve optimal performance, KNL specifics relevant for the build, boot and run time setup must be explored. In this paper, we present the performance analysis of representative VASP workloads on Cori, focusing on the effects of the compilers, libraries, and boot/run time options such as the NUMA/MCDRAM modes, HyperThreading, huge pages, core specialization, and thread scaling. The paper is intended to serve as a KNL performance guide for VASP users, but it will also benefit other KNL users.

Koskela TS, Deslippe J, Friesen B, Raman K, "Fusion PIC code performance analysis on the Cori KNL system", May 2017,

We study the attainable performance of Particle-In-Cell codes on the Cori KNL system by analyzing a miniature particle push application based on the fusion PIC code XGC1. We start from the most basic building blocks of a PIC code and build up the complexity to identify the kernels that cost the most in performance and focus optimization efforts there. Particle push kernels operate at high arithmetic intensity (AI) and are not likely to be memory bandwidth or even cache bandwidth bound on KNL. Therefore, we see only minor benefits from the high-bandwidth memory available on KNL, and achieving good vectorization is shown to be the most beneficial optimization path, with a theoretical yield of up to 8x speedup on KNL. In practice we are able to obtain up to a 4x gain from vectorization due to limitations set by the data layout and memory latency.
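The reasoning above about arithmetic intensity and memory bandwidth follows the roofline model, in which attainable performance is bounded by the lesser of the compute peak and the product of arithmetic intensity and memory bandwidth. A minimal Python sketch of that bound follows; the peak and bandwidth values are assumed, KNL-like figures chosen for illustration and are not taken from the paper.

# Illustrative roofline bound: attainable GFLOP/s = min(peak, AI * bandwidth).
# The peak and bandwidth figures below are rough, assumed KNL-like numbers,
# not measurements from the paper above.

def roofline_gflops(ai_flops_per_byte, peak_gflops, bandwidth_gb_per_s):
    """Return the attainable performance for a kernel of the given arithmetic intensity."""
    return min(peak_gflops, ai_flops_per_byte * bandwidth_gb_per_s)

PEAK_GFLOPS = 3000.0   # assumed double-precision compute peak (~3 TF/s)
MCDRAM_GBS = 450.0     # assumed MCDRAM stream bandwidth (GB/s)
DDR_GBS = 90.0         # assumed DDR4 stream bandwidth (GB/s)

for ai in (0.5, 2.0, 10.0):  # FLOPs per byte moved from memory
    print(f"AI={ai:4.1f}  MCDRAM roof: {roofline_gflops(ai, PEAK_GFLOPS, MCDRAM_GBS):6.0f} GF/s"
          f"  DDR roof: {roofline_gflops(ai, PEAK_GFLOPS, DDR_GBS):6.0f} GF/s")

With these assumed numbers, a kernel only stops being MCDRAM-bandwidth-bound above roughly 3000/450 ≈ 7 FLOPs per byte, which is consistent with the abstract's observation that high-AI particle push kernels gain little from the high-bandwidth memory and benefit most from vectorization.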

Book Chapter

2017

Brian Van Straalen, David Trebotich, Andrey Ovsyannikov, Daniel T. Graves, "Exascale Scientific Applications: Programming Approaches for Scalability Performance and Portability", edited by Tjerk Straatsma, Timothy Williams, Katie Antypas, (CRC Press: December 1, 2017)

Ryan E. Grant, Taylor Groves, Simon Hammond, K. Scott Hemmert, Michael Levenhagen, Ron Brightwell, "Handbook of Exascale Computing: Network Communications", (ISBN: 978-1466569003, Chapman and Hall: January 1, 2017)

Jack Deslippe, Doug Doerfler, Brandon Cook, Tareq Malas, Samuel Williams, Sudip Dosanjh, "Optimizing Science Applications for the Cori, Knights Landing, System at NERSC", Advances in Parallel Computing, Volume 30: New Frontiers in High Performance Computing and Big Data, (January 1, 2017)

Presentation/Talk

2018

S. Williams, A. Ilic, Z. Matveev, C. Yang, Performance Tuning of Scientific Codes with the Roofline Model, Tutorial at Supercomputing Conference (SC) 2018, November 16, 2018,

Tim Mattson, Alice Koniges, Yun (Helen) He, David Eder, The OpenMP Common Core: A hands-on exploration, SuperComputing 2018 Tutorial, November 11, 2018,

Yun (Helen) He, Barbara Chapman, Oscar Hernandez, Tim Mattson, Alice Koniges, Introduction to "OpenMP Common Core", OpenMPCon / IWOMP 2018 Tutorial Day, September 26, 2018,

Oscar Hernandez, Yun (Helen) He, Barbara Chapman, Using MPI+OpenMP for Current and Future Architectures, OpenMPCon 2018, September 24, 2018,

Yun (Helen) He, Michael Klemm, Bronis R. De Supinski, OpenMP: Current and Future Directions, 8th NCAR MultiCore Workshop (MC8), September 19, 2018,

T. Koskela, A. Ilic, Z. Matveev, R. Belenov, C. Yang, L. Sousa, A Practical Approach to Application Performance tuning with the Roofline Model, Tutorial at International Supercomputing Conference (ISC) 2018, June 28, 2018,

Yun (Helen) He, Introduction to NERSC Resources, LBNL Computer Sciences Summer Student Classes #1, June 11, 2018,

Tim Mattson, Yun (Helen) He, Beyond OpenMP Common Core, NERSC Training, May 4, 2018,

Kristy Kallback-Rose, NERSC Site Update at the Linear Tape User Group, May 2, 2018,

NERSC site update focusing on plans to implement new tape technology at the Berkeley Data Center.

NERSC Site Report focusing on current storage challenges for disk-based and tape-based systems.

C. Yang, T. Kurth, Roofline Performance Model and Intel Advisor, Performance Analysis and Modeling (PAM) Workshop 2018, February 15, 2018,

S. Williams, J. Deslippe, C. Yang, P. Basu, Performance Tuning of Scientific Codes with the Roofline Model, Exascale Computing Project (ECP) 2nd Annual Meeting, February 9, 2018,

Barbara Chapman, Oscar Hernandez, Yun (Helen) He, Martin Kong, Geoffroy Vallee, MPI + OpenMP Tutorial, DOE ECP Annual Meeting Tutorial, 2018, February 9, 2018,

Alice Koniges, Yun (Helen) He, OpenMP Common Core, NERSC Training, February 6, 2018,

2017

T. Koskela, A. Ilic, Z. Matveev, S. Williams, P. Thierry, C. Yang, Performance Tuning of Scientific Codes with the Roofline Model, Tutorial at Supercomputing Conference (SC) 2017, November 17, 2017,

Tim Mattson, Alice Koniges, Yun (Helen) He, Barbara Chapman, The OpenMP Common Core: A hands-on exploration, SuperComputing 2017 Tutorial, November 12, 2017,

Richard A. Gerber, Jack Deslippe, Manycore for the Masses Part 2, Intel HPC DevCon, November 11, 2017,

Richard A Gerber, NERSC Overview - Focus: Energy Technologies, November 6, 2017,

Richard A. Gerber, NERSC Overview - Focus: Berkeley Rotary Club, October 18, 2017,

Richard A. Gerber, Current and Next Generation Supercomputing and Data Analysis at NERSC, HPC Distinguished Lecture, Iowa State University & Ames Laboratory, October 18, 2017,

Taylor Groves, Networks, Damn Networks and Aries, NERSC CS/Data Seminar, October 6, 2017,

Presentation on the performance of the Cori Aries network, with highlights of monitoring and analysis efforts underway.

Douglas Doerfler, Steven Gottlieb, Carleton DeTar, Doug Toussaint, Karthik Raman, Improving the Performance of the MILC Code on Intel Knights Landing, An Overview, Intel Xeon Phi User Group Meeting 2017 Fall Meeting, September 26, 2017,

Richard A. Gerber, Cori KNL Update, IXPUG 2017, Austin, TX, September 26, 2017,

Yun (Helen) He, Jack Deslippe, Enabling Applications for Cori KNL: NESAP, September 21, 2017,

NERSC Science Highlights - September 2017, NERSC Users Group Meeting 2017, September 19, 2017,

Doug Jacobsen, Taylor Groves, Global Aries Counter Collection and Analysis, Cray Quarterly Meeting, July 25, 2017,

Barbara Chapman, Alice Koniges, Yun (Helen) He, Oscar Hernandez, and Deepak Eachempati, OpenMP, An Introduction, Scaling to Petascale Institute, XSEDE Training, Berkeley, CA., June 27, 2017,

Yun (Helen) He, Steve Leak, and Zhengji Zhao, Using Cori KNL Nodes, Cori KNL Training, Berkeley, CA., June 9, 2017,

Yun (Helen) He, Brandon Cook, Jack Deslippe, Brian Friesen, Richard Gerber, Rebecca Hartman-Baker, Alice Koniges, Thorsten Kurth, Stephen Leak, Woo-Sun Yang, Zhengji Zhao, Eddie Baron, Peter Hauschildt, Preparing NERSC users for Cori, a Cray XC40 system with Intel Many Integrated Cores, Cray User Group 2017, Redmond, WA, May 12, 2017,

Kirill Lozinskiy, GPFS & HPSS Interface (GHI), Spectrum Scale User Group 2017, April 5, 2017,

This presentation gives a brief overview of integration between the High Performance Storage System (HPSS) and the General Parallel File System (GPFS).

Tutorial with handouts on the use of Shifter with an image of chos=sl64 from PDSF. Download the slides at https://docs.google.com/presentation/d/1Hh8vFE3ixxxiYTz9TgfljbUJcjmWUCNwzs-NugmLjSs/edit?usp=sharing

 

Rebecca Hartman-Baker, NERSC Overview, NERSC New User Training, February 23, 2017,

Rebecca Hartman-Baker, Accounts and Allocations, NERSC New User Training, February 23, 2017,

Rebecca Hartman-Baker, Craypat and Reveal, NERSC New User Training, February 23, 2017,

Richard A Gerber, February 2017 Allocations and Usage Update, February 23, 2017,

NERSC Users Group Webinar on 2017 allocations for Cori Knights Landing nodes and Queue Wait Time Reduction Actions

Richard A Gerber, NERSC Allocations Forecast for PIs, NERSC Users Group, February 16, 2017,

Richard A. Gerber, Allocations and Usage Update for DOE Program Managers, February 7, 2017,

Richard A. Gerber, NERSC's KNL System: Cori, Exascale Computing Project All-Hands Meeting, February 1, 2017,

Presented at the DOE Exascale Computing Project annual meeting in Knoxville, TN.

Richard A. Gerber, Overview of NERSC, Presented to SLAC Computing, January 24, 2017,

Taylor Groves, Characterizing Power and Performance in HPC Networks, Future Technologies Group at ORNL, January 10, 2017,

Taylor Groves, Characterizing and Improving Power and Performance in HPC Networks, Advanced Technology Group -- NERSC, January 8, 2017,

Report

2017

Glenn K. Lockwood, Damian Hazen, Quincey Koziol, Shane Canon, Katie Antypas, Jan Balewski, Nicholas Balthaser, Wahid Bhimji, James Botts, Jeff Broughton, Tina L. Butler, Gregory F. Butler, Ravi Cheema, Christopher Daley, Tina Declerck, Lisa Gerhardt, Wayne E. Hurlbert, Kristy A. Kallback-Rose, Stephen Leak, Jason Lee, Rei Lee, Jialin Liu, Kirill Lozinskiy, David Paul, Prabhat, Cory Snavely, Jay Srinivasan, Tavia Stone Gibbins, Nicholas J. Wright, "Storage 2020: A Vision for the Future of HPC Storage", October 20, 2017, LBNL LBNL-2001072,

As the DOE Office of Science's mission computing facility, NERSC will follow this roadmap and deploy these new storage technologies to continue delivering storage resources that meet the needs of its broad user community. NERSC's diversity of workflows encompasses significant portions of open science workloads as well, and the findings presented in this report are also intended to be a blueprint for how the evolving storage landscape can be best utilized by the greater HPC community. Executing the strategy presented here will ensure that emerging I/O technologies will be both applicable to and effective in enabling scientific discovery through extreme-scale simulation and data analysis in the coming decade.

Poster

2017

C. Yang, R. C. Bording, D. Price, and R. Nealon, "Optimizing Smoothed Particle Hydrodynamics Code Phantom on Haswell and KNL", International Supercomputing Conference (ISC) 2017, June 20, 2017,

Other

2017

Special Issue on Data-Intensive Scalable Computing Systems, Special Issue of Parallel Computing, Pages: 1-96, January 31, 2017,