Glenn K. Lockwood
Glenn K. Lockwood is a storage architect who specializes in I/O performance analysis, extreme-scale storage architectures, and emerging I/O technologies. He led NERSC design efforts for Perlmutter's 35 PB all-flash Lustre file system and played a key role in defining NERSC's Storage 2020 vision, which culminated in the deployment of the 128 PB Community File System. In addition to storage systems design, Glenn is actively engaged in the parallel I/O community: he represents NERSC on the HPSS Executive Committee, is a maintainer of the IOR and mdtest community I/O benchmarks, and is a contributor to the Darshan I/O profiling library.
Prior to joining the Advanced Technologies Group, Glenn was the acting lead of the Storage Systems Group at NERSC and held roles designing and supporting research computing in private industry and national HPC. He holds a Ph.D. in materials science and a B.S. in ceramic engineering, both from Rutgers University.
C. S. Daley, D. Ghoshal, G. K. Lockwood, S. Dosanjh, L. Ramakrishnan, N. J. Wright, "Performance characterization of scientific workflows for the optimal use of Burst Buffers", Future Generation Computer Systems, December 28, 2017, doi: 10.1016/j.future.2017.12.022
Scientific discoveries are increasingly dependent upon the analysis of large volumes of data from observations and simulations of complex phenomena. Scientists compose the complex analyses as workflows and execute them on large-scale HPC systems. The workflow structures are in contrast with monolithic single simulations that have often been the primary use case on HPC systems. Simultaneously, new storage paradigms such as Burst Buffers are becoming available on HPC platforms. In this paper, we analyze the performance characteristics of a Burst Buffer and two representative scientific workflows with the aim of optimizing the usage of a Burst Buffer, extending our previous analyses (Daley et al., 2016). Our key contributions are (a) developing a performance analysis methodology pertinent to Burst Buffers, (b) improving the use of a Burst Buffer in workflows with bandwidth-sensitive and metadata-sensitive I/O workloads, and (c) highlighting the key data management challenges when incorporating a Burst Buffer in the studied scientific workflows.
Grace X. Y. Zheng, Billy T. Lau, Michael Schnall-Levin, Mirna Jarosz, John M. Bell, Christopher M. Hindson, Sofia Kyriazopoulou-Panagiotopoulou, Donald A. Masquelier, Landon Merrill, Jessica M. Terry, Patrice A. Mudivarti, Paul W. Wyatt, Rajiv Bharadwaj, Anthony J. Makarewicz, Yuan Li, Phillip Belgrader, Andrew D. Price, Adam J. Lowe, Patrick Marks, Gerard M. Vurens, Paul Hardenbol, Luz Montesclaros, Melissa Luo, Lawrence Greenfield, Alexander Wong, David E. Birch, Steven W. Short, Keith P. Bjornson, Pranav Patel, Erik S. Hopmans, Christina Wood, Sukhvinder Kaur, Glenn K. Lockwood, David Stafford, Joshua P. Delaney, Indira Wu, Heather S. Ordonez, Susan M. Grimes, Stephanie Greer, Josephine Y. Lee, Kamila Belhocine, Kristina M. Giorda, William H. Heaton, Geoffrey P. McDermott, Zachary W. Bent, Francesca Meschi, Nikola O. Kondov, Ryan Wilson, Jorge A. Bernate, Shawn Gauby, Alex Kindwall, Clara Bermejo, Adrian N. Fehr, Adrian Chan, Serge Saxonov, Kevin D. Ness, Benjamin J. Hindson, Hanlee P. Ji, "Haplotyping germline and cancer genomes with high-throughput linked-read sequencing", Nature Biotechnology, February 1, 2016, 34:303-311, doi: 10.1038/nbt.3432
Haplotyping of human chromosomes is a prerequisite for cataloguing the full repertoire of genetic variation. We present a microfluidics-based, linked-read sequencing technology that can phase and haplotype germline and cancer genomes using nanograms of input DNA. This high-throughput platform prepares barcoded libraries for short-read sequencing and computationally reconstructs long-range haplotype and structural variant information. We generate haplotype blocks in a nuclear trio that are concordant with expected inheritance patterns and phase a set of structural variants. We also resolve the structure of the EML4-ALK gene fusion in the NCI-H2228 cancer cell line using phased exome sequencing. Finally, we assign genetic aberrations to specific megabase-scale haplotypes generated from whole-genome sequencing of a primary colorectal adenocarcinoma. This approach resolves haplotype information using up to 100 times less genomic DNA than some methods and enables the accurate detection of structural variants.
Kristopher A. Standish, Tristan M. Carland, Glenn K. Lockwood, Wayne Pfeiffer, Mahidhar Tatineni, C Chris Huang, Sarah Lamberth, Yauheniya Cherkas, Carrie Brodmerkel, Ed Jaeger, Lance Smith, Gunaretnam Rajagopal, Mark E. Curran, Nicholas J. Schork, "Group-based variant calling leveraging next-generation supercomputing for large-scale whole-genome sequencing studies", BMC Bioinformatics, September 2015, 16, doi: 10.1186/s12859-015-0736-4
Next-generation sequencing (NGS) technologies have become much more efficient, allowing whole human genomes to be sequenced faster and cheaper than ever before. However, processing the raw sequence reads associated with NGS technologies requires care and sophistication in order to draw compelling inferences about phenotypic consequences of variation in human genomes. It has been shown that different approaches to variant calling from NGS data can lead to different conclusions. Ensuring appropriate accuracy and quality in variant calling can come at a computational cost.
We describe our experience implementing and evaluating a group-based approach to calling variants on large numbers of whole human genomes. We explore the influence of many factors that may impact the accuracy and efficiency of group-based variant calling, including group size, the biogeographical backgrounds of the individuals who have been sequenced, and the computing environment used. We make efficient use of the Gordon supercomputer cluster at the San Diego Supercomputer Center by incorporating job-packing and parallelization considerations into our workflow while calling variants on 437 whole human genomes generated as part of a large association study.
We ultimately find that our workflow resulted in high-quality variant calls in a computationally efficient manner. We argue that studies like ours should motivate further investigations combining hardware-oriented advances in computing systems with algorithmic developments to tackle emerging ‘big data’ problems in biomedical research brought on by the expansion of NGS technologies.
Glenn K. Lockwood, Stephen H. Garofalini, "Proton dynamics at the water-silica interface via dissociative molecular dynamics", Journal of Physical Chemistry C, December 26, 2014, 118:29750-2975, doi: 10.1021/jp507640y
A robust and accurate dissociative potential that reproduces the structural and dynamic properties of bulk and nanoconfined water, and proton transport similar to ab initio calculations in bulk water, is used for reactive molecular dynamics simulations of the proton dynamics at the silica/water interface. The simulations are used to evaluate the lifetimes of protonated sites at the interfaces of water with planar amorphous silica surfaces and cylindrical pores in amorphous silica with different densities of water confined in the pores. In addition to lifetimes, the donor/acceptor sites are evaluated and discussed in terms of local atomistic structure. The results of the lifetimes of the protonated sites, including H3O+, SiOH, SiOH2+, and Si–(OH+)–Si sites, are considered. The lifetime of the hydronium ion, H3O+, is considerably shorter near the interface than in bulk water, as are the lifetimes of the other protonated sites. The results indicate the beneficial effect of the amorphous silica surface in enhancing proton transport in wet silica as seen in electrochemical studies and provide the specific molecular mechanisms.
Michael Kagan, Glenn K. Lockwood, Stephen H. Garofalini, "Reactive simulations of the activation barrier to dissolution of amorphous silica in water", Physical Chemistry Chemical Physics, May 28, 2014, 16:9294-9301, doi: 10.1039/c4cp00030g
Molecular dynamics simulations employing reactive potentials were used to determine the activation barriers to the dissolution of the amorphous SiO2 surface in the presence of a 2 nm overlayer of water. The potential of mean force calculations of the reactions of water molecules with 15 different starting Q4 sites (Qi is the Si site with i bridging oxygen neighbors) to eventually form the dissolved Q0 site were used to obtain the barriers. Activation barriers for each step in the dissolution process, from the Q4 to Q3 to Q2 to Q1 to Q0, were obtained. Relaxation runs between each reaction step enabled redistribution of the water above the surface in response to the new Qi site configuration. The rate-limiting step observed in the simulations was in both the Q32 reaction (a Q3 site changing to a Q2 site) and the Q21 reaction, each with an average barrier of ∼14.1 kcal mol⁻¹. However, the barrier for the overall reaction from the Q4 site to a Q0 site, averaged over the maximum barrier for each of the 15 samples, was 15.1 kcal mol⁻¹. This result is within the lower end of the experimental data, which varies from 14-24 kcal mol⁻¹, while ab initio calculations using small cluster models obtain values that vary from 18-39 kcal mol⁻¹. Constraints between the oxygen bridges from the Si site and the connecting silica structure, the presence of pre-reaction strained siloxane bonds, and the location of the reacting Si site within slight concave surface contours all affected the overall activation barriers.
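The abstract above reports two different averages: the mean barrier of each individual step, and an overall barrier obtained by taking the largest barrier in each sample and then averaging those maxima. The following is a minimal sketch of why the second number can exceed every per-step mean. The barrier values, the three-sample data set, and the function names are hypothetical and chosen only for illustration; they are not data or code from the paper.

```python
from statistics import mean

# Hypothetical barriers in kcal/mol for three samples (illustrative only,
# not values from the paper). Keys are the dissolution steps Q4 -> ... -> Q0.
samples = [
    {"Q4->Q3": 10.0, "Q3->Q2": 16.0, "Q2->Q1": 12.0, "Q1->Q0": 9.0},
    {"Q4->Q3": 11.0, "Q3->Q2": 12.0, "Q2->Q1": 17.0, "Q1->Q0": 8.0},
    {"Q4->Q3": 15.0, "Q3->Q2": 13.0, "Q2->Q1": 12.0, "Q1->Q0": 10.0},
]


def per_step_means(samples):
    """Average each step's barrier across all samples."""
    steps = samples[0].keys()
    return {step: mean(s[step] for s in samples) for step in steps}


def overall_barrier(samples):
    """Take the largest (rate-limiting) barrier per sample, then average."""
    return mean(max(s.values()) for s in samples)


step_means = per_step_means(samples)
overall = overall_barrier(samples)

for step, value in step_means.items():
    print(f"{step}: {value:.2f} kcal/mol")
print(f"overall (mean of per-sample maxima): {overall:.2f} kcal/mol")
```

Because the rate-limiting step differs from sample to sample, the mean of the per-sample maxima is never smaller than the largest per-step mean, which is consistent with the ordering of the two averages quoted in the abstract.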
Glenn K. Lockwood, Stephen H. Garofalini, "Lifetimes of excess protons in water using a dissociative water potential", Journal of Physical Chemistry B, April 8, 2013, 117:4089-4097, doi: 10.1021/jp310300x
Molecular dynamics simulations using a dissociative water potential were applied to study transport of excess protons in water and determine the applicability of this potential to describe such behavior. While originally developed for gas-phase molecules and bulk liquid water, the potential is transferable to nanoconfinement and interface scenarios. Applied here, it shows proton behavior consistent with ab initio calculations and empirical models specifically designed to describe proton transport. Both Eigen and Zundel complexes are observed in the simulations showing the Eigen–Zundel–Eigen-type mechanism. In addition to reproducing the short-time rattling of the excess proton between the two oxygens of Zundel complexes, a picosecond-scale lifetime was also found. These longer-lived H3O+ ions are caused by the rapid conversion of the local solvation structure around the transferring proton from a Zundel-like form to an Eigen-like form following the transfer, effectively severing the path along which the proton can rattle. The migration of H+ over long times (>100 ps) deviates from the conventional short-time multiexponentially decaying lifetime autocorrelation model and follows the t^(−3/2) power-law behavior. The potential function employed here matches many of the features of proton transport observed in ab initio molecular dynamics simulations as well as the highly developed empirical valence bond models, yet is computationally very efficient, enabling longer time and larger systems to be studied.
Glenn K. Lockwood, Stephen H. Garofalini, "Reactions between water and vitreous silica during irradiation", Journal of Nuclear Materials, November 30, 2012, 430:239-245, doi: 10.1016/j.jnucmat.2012.07.004
Molecular dynamics simulations were conducted to determine the response of a vitreous silica surface in contact with water to radiation damage. The defects caused by radiation damage create channels that promote high H+ mobility and result in significantly higher concentration and deeper penetration of H+ in the silica subsurface. These subsurface H+ hop between acidic sites such as SiOH2+ and Si–(OH)–Si until subsequent radiation ruptures siloxane bridges and forms subsurface non-bridging oxygens (NBOs); existing excess H+ readily bonds to these NBO sites to form SiOH. The high temperature caused by irradiation also promotes the diffusion of molecular H2O into the subsurface, and although H2O does not penetrate as far as H+, it readily reacts with ruptured bridges to form 2SiOH. These SiOH sites are thermally stable and inhibit the reformation of bridges that would otherwise occur in the absence of water. In addition to this reduction of self-healing, the presence of water during the self-irradiation of silica may cause an increase in the glass’s proton conductivity.
Ying Ma, Glenn K. Lockwood, Stephen H. Garofalini, "Development of a transferable variable charge potential for the study of energy conversion materials FeF2 and FeF3", Journal of Physical Chemistry C, November 18, 2011, 115:24198-2420, doi: 10.1021/jp207181s
A variable charge potential is developed that is suitable for the simulations of energy conversion materials FeF2 and FeF3. Molecular dynamics simulations using this potential show that the calculated structural and elastic properties of both FeF2 and FeF3 are in good agreement with experimental data. The transferability of this potential rests on the fact that the difference in the bond characteristic between FeF2 and FeF3 is properly accounted for by the variable charge approach. The calculated equilibrium charges are also in excellent agreement with first-principles Bader charges. Surface energies obtained by the variable charge method are closer to the first-principles data than are those from fixed charge models, indicating the importance of the variable charge method for the simulations of the surface. A significant decrease in atomic charges is observed only for the outermost one or two layers, which is also observed in the first-principles calculations.
Glenn K. Lockwood, Stephen H. Garofalini, "Effect of moisture on the self-healing of vitreous silica under irradiation", Journal of Nuclear Materials, February 16, 2010, 400:73-78, doi: 10.1016/j.jnucmat.2010.02.012
Although it is widely understood that water interacts extensively with vitreous silicates, atomistic simulations of the response of these materials to ballistic radiation, such as neutron or ion radiation, have excluded moisture. In this study, molecular dynamics simulations were used to simulate the collision cascades and defect formation that would result from such irradiation of silica in the presence of moisture. Using an interatomic potential that allows for the dissociation of water, it was found that the reaction between molecular water or pre-dissociated water (as OH− and H+) and the ruptured Si–O–Si bonds that result from the collision cascade inhibits a significant amount of the structural recovery that was previously observed in atomistic simulations of irradiation in perfectly dry silica. The presence of moisture not only resulted in a greater accumulation of non-bridging oxygen defects, but reduced the local density of the silica and altered the distribution of ring sizes. The results imply that an initial presence of moisture in the silica during irradiation could increase the propensity for further ingress of moisture via the low density pathways and increased defect concentration.
Glenn K. Lockwood, Stephen H. Garofalini, "Bridging oxygen as a site for proton adsorption on the vitreous silica surface", Journal of Chemical Physics, August 21, 2009, 131:074703, doi: 10.1063/1.3205946
Molecular dynamics computer simulations were used to study the protonation of bridging oxygen (Si-O-Si) sites present on the vitreous silica surface in contact with water using a dissociative water potential. In contrast to first-principles calculations based on unconstrained molecular analogs, such as H7Si2O7+ molecules, the very limited flexibility of neighboring SiO4 tetrahedra when embedded in a solid surface means that there is a relatively minor geometric response to proton adsorption, requiring sites predisposed to adsorption. Simulation results indicate that protonation of bridging oxygen occurs at predisposed sites with bridging angles in the 125°-135° range, well below the bulk silica mean of approximately 150°, consistent with various ab initio calculations, and that a small fraction of such sites are present in all ring sizes. The energy differences between dry and protonated bridges at various angles observed in the simulations coincide completely with quantum calculations over the entire range of bridging angles encountered in the vitreous silica surface. Those sites with bridging angles near 130° support adsorbed protons more stably, resulting in the proton remaining adsorbed for longer periods of time. Vitreous silica has the necessary distribution of angular strain over all ring sizes to allow protons to adsorb onto bridging oxygen at the surface, forming acidic surface groups that serve as ideal intermediate steps in proton transfer near the surface. In addition to hydronium formation and water-assisted proton transfer in the liquid, protons can rapidly move across the water-silica interface via strained bridges that are predisposed to transient proton adsorption. Thus, an excess proton at any given location on a silica surface can move by either water-assisted or strained bridge-assisted diffusion depending on the local environment. The result of this would be net migration that is faster than it would be if only one mechanism were possible.
These simulation results indicate the importance of performing large size and time scale simulations of the structurally heterogeneous vitreous silica exposed to water to describe proton transport at the interface between water and the silica surface.
Glenn K. Lockwood, Shenghong Zhang, Stephen H. Garofalini, "Anisotropic dissolution of α-alumina (0001) and (112̄0) surfaces into adjoining silicates", Journal of the American Ceramic Society, October 24, 2008, 91:3536-3541, doi: 10.1111/j.1551-2916.2008.02715.x
The dissolutions of the (0001) and (112̄0) orientations of α-Al2O3 into calcium silicate, aluminosilicate, and calcium aluminosilicate melts were modeled using molecular dynamics simulations. In all cases, it was found that the (112̄0) surface of the crystal destabilizes and melts at a lower temperature than does the (0001) surface. This anisotropy in dissolution counters the anisotropy in grain growth, in which the outward growth of the (112̄0) surface occurs more rapidly than that on the (0001) surface, causing platelets. However, anisotropic dissolution occurred only within a certain temperature range, above which dissolution behavior was isotropic. The presence of calcium in the contacting silicate melt plays an important role in this anisotropic dissolution, similar to its role in anisotropic grain growth observed previously. However, anisotropic dissolution also occurs in the silicate melts not containing calcium, indicating the importance of the different surface energies. In combination with previous simulations of anisotropic grain growth in alumina, these simulations reveal a complex kinetic competition between preferential adsorption and growth versus preferential dissolution of the (112̄0) orientation in comparison with the (0001) orientation as a function of temperature and local composition. This, in turn, indicates potential processing variations in which to design morphology in alumina.
Glenn K. Lockwood, Shane Snyder, Suren Byna, Philip Carns, Nicholas J. Wright, "Understanding Data Motion in the Modern HPC Data Center", 2019 IEEE/ACM Fourth International Parallel Data Systems Workshop (PDSW), Denver, CO, USA, IEEE, 2019, 74-83, doi: 10.1109/PDSW49588.2019.00012
Sudheer Chunduri, Taylor Groves, Peter Mendygral, Brian Austin, Jacob Balma, Krishna Kandalla, Kalyan Kumaran, Glenn Lockwood, Scott Parker, Steven Warren, Nathan Wichmann, Nicholas Wright, "GPCNeT: Designing a Benchmark Suite for Inducing and Measuring Contention in HPC Networks", International Conference on High Performance Computing, Networking, Storage and Analysis (SC'19), November 16, 2019
Network congestion is one of the biggest problems facing HPC systems today, affecting system throughput, performance, user experience and reproducibility. Congestion manifests as run-to-run variability due to contention for shared resources like filesystems or routes between compute endpoints. Despite its significance, current network benchmarks fail to proxy the real-world network utilization seen on congested systems. We propose a new open-source benchmark suite called the Global Performance and Congestion Network Tests (GPCNeT) to advance the state of the practice in this area. The guiding principles used in designing GPCNeT are described and the methodology employed to maximize its utility is presented. The capabilities of GPCNeT are evaluated by analyzing results from several of the world’s largest HPC systems, including an evaluation of congestion management on a next-generation network. The results show that systems of all technologies and scales are susceptible to congestion and this work motivates the need for congestion control in next-generation networks.
Glenn K. Lockwood, Kirill Lozinskiy, Lisa Gerhardt, Ravi Cheema, Damian Hazen, Nicholas J. Wright, "Designing an All-Flash Lustre File System for the 2020 NERSC Perlmutter System", Proceedings of the 2019 Cray User Group, Montreal, January 1, 2019
New experimental and AI-driven workloads are moving into the realm of extreme-scale HPC systems at the same time that high-performance flash is becoming cost-effective to deploy at scale. This confluence poses a number of new technical and economic challenges and opportunities in designing the next generation of HPC storage and I/O subsystems to achieve the right balance of bandwidth, latency, endurance, and cost. In this paper, we present the quantitative approach to requirements definition that resulted in the 30 PB all-flash Lustre file system that will be deployed with NERSC's upcoming Perlmutter system in 2020. By integrating analysis of current workloads and projections of future performance and throughput, we were able to constrain many critical design space parameters and quantitatively demonstrate that Perlmutter will not only deliver optimal performance, but effectively balance cost with capacity, endurance, and many modern features of Lustre.
Glenn K. Lockwood, Shane Snyder, Teng Wang, Suren Byna, Philip Carns, Nicholas J. Wright, "A Year in the Life of a Parallel File System", Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, Dallas, TX, IEEE Press, November 11, 2018, 71:1-74:1
I/O performance is a critical aspect of data-intensive scientific computing. We seek to advance the state of the practice in understanding and diagnosing I/O performance issues through investigation of a comprehensive I/O performance data set that captures a full year of production storage activity at two leadership-scale computing facilities. We demonstrate techniques to identify regions of interest, perform focused investigations of both long-term trends and transient anomalies, and uncover the contributing factors that lead to performance fluctuation.
We find that a year in the life of a parallel file system is comprised of distinct regions of long-term performance variation in addition to short-term performance transients. We demonstrate how systematic identification of these performance regions, combined with comprehensive analysis, allows us to isolate the factors contributing to different performance maladies at different time scales. From this, we present specific lessons learned and important considerations for HPC storage practitioners.
Teng Wang, Shane Snyder, Glenn K. Lockwood, Philip Carns, Nicholas Wright, Suren Byna, "IOMiner: Large-Scale Analytics Framework for Gaining Knowledge from I/O Logs", 2018 IEEE International Conference on Cluster Computing (CLUSTER), Belfast, UK, IEEE, 2018, 466-476, doi: 10.1109/CLUSTER.2018.00062
Modern HPC systems are collecting large amounts of I/O performance data. The massive volume and heterogeneity of this data, however, have made timely performance of in-depth integrated analysis difficult. To overcome this difficulty and to allow users to identify the root causes of poor application I/O performance, we present IOMiner, an I/O log analytics framework. IOMiner provides an easy-to-use interface for analyzing instrumentation data, a unified storage schema that hides the heterogeneity of the raw instrumentation data, and a sweep-line-based algorithm for root cause analysis of poor application I/O performance. IOMiner is implemented atop Spark to facilitate efficient, interactive, parallel analysis. We demonstrate the capabilities of IOMiner by using it to analyze logs collected on a large-scale production HPC system. Our analysis techniques not only uncover the root cause of poor I/O performance in key application case studies but also provide new insight into HPC I/O workload characterization.
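The abstract above mentions a sweep-line-based algorithm but does not describe it. As a general illustration of the sweep-line technique only, and not of IOMiner's actual implementation, the sketch below finds the moment at which the most jobs were performing I/O at once, which is one way contention could be located in a log. The job intervals and the function name `max_concurrency` are hypothetical.

```python
def max_concurrency(intervals):
    """Return (peak_count, time_of_peak) for half-open intervals [start, end).

    Generic sweep-line: turn each interval into a +1 event at its start and
    a -1 event at its end, sort the events by time, and scan once while
    tracking how many intervals are currently open.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))
        events.append((end, -1))
    # At equal timestamps, ends (-1) sort before starts (+1), so an interval
    # ending at t does not overlap one starting at t.
    events.sort(key=lambda event: (event[0], event[1]))

    active = 0
    peak = 0
    peak_time = None
    for time, delta in events:
        active += delta
        if active > peak:
            peak, peak_time = active, time
    return peak, peak_time


# Hypothetical (start, end) times, in seconds, during which jobs did I/O.
jobs = [(0, 10), (5, 15), (8, 12), (15, 20)]
peak, when = max_concurrency(jobs)
print(f"peak of {peak} concurrent jobs starting at t={when}")
```

Sorting dominates the cost, so the scan runs in O(n log n) time for n intervals, compared with O(n^2) for checking every pair of jobs for overlap.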
Glenn K. Lockwood, Nicholas J. Wright, Shane Snyder, Philip Carns, George Brown, Kevin Harms, "TOKIO on ClusterStor: Connecting Standard Tools to Enable Holistic I/O Performance Analysis", Proceedings of the 2018 Cray User Group, Stockholm, SE, May 24, 2018
At present, I/O performance analysis requires different tools to characterize individual components of the I/O subsystem, and institutional I/O expertise is relied upon to translate these disparate data into an integrated view of application performance. This process is labor-intensive and not sustainable as the storage hierarchy deepens and system complexity increases. To address this growing disparity, we have developed the Total Knowledge of I/O (TOKIO) framework to combine the insights from existing component-level monitoring tools and provide a holistic view of performance across the entire I/O stack.
A reference implementation of TOKIO, pytokio, is presented here. Using monitoring tools included with Cray XC and ClusterStor systems alongside commonly deployed community-supported tools, we demonstrate how pytokio provides a lightweight foundation for holistic I/O performance analyses on two Cray XC systems deployed at different HPC centers. We present results from integrated analyses that allow users to quantify the degree of I/O contention that affected their jobs and probabilistically identify unhealthy storage devices that impacted their performance. We also apply pytokio to inspect the utilization of NERSC’s DataWarp burst buffer and demonstrate how pytokio can be used to identify users and applications who may stand to benefit most from migrating their workloads from Lustre to the burst buffer.
Glenn K. Lockwood, Wucherl Yoo, Suren Byna, Nicholas J. Wright, Shane Snyder, Kevin Harms, Zachary Nault, Philip Carns, "UMAMI: a recipe for generating meaningful metrics through holistic I/O performance analysis", Proceedings of the 2nd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISCS'17), Denver, CO, ACM, November 2017, 55-60, doi: 10.1145/3149393.3149395
I/O efficiency is essential to productivity in scientific computing, especially as many scientific domains become more data-intensive. Many characterization tools have been used to elucidate specific aspects of parallel I/O performance, but analyzing components of complex I/O subsystems in isolation fails to provide insight into critical questions: how do the I/O components interact, what are reasonable expectations for application performance, and what are the underlying causes of I/O performance problems? To address these questions while capitalizing on existing component-level characterization tools, we propose an approach that combines on-demand, modular synthesis of I/O characterization data into a unified monitoring and metrics interface (UMAMI) to provide a normalized, holistic view of I/O behavior.
We evaluate the feasibility of this approach by applying it to a month-long benchmarking study on two distinct large-scale computing platforms. We present three case studies that highlight the importance of analyzing application I/O performance in context with both contemporaneous and historical component metrics, and we provide new insights into the factors affecting I/O performance. By demonstrating the generality of our approach, we lay the groundwork for a production-grade framework for holistic I/O analysis.
Jaehyun Han, Donghun Koo, Glenn K. Lockwood, Jaehwan Lee, Hyeonsang Eom, Soonwook Hwang, "Accelerating a Burst Buffer via User-Level I/O Isolation", Proceedings of the 2017 IEEE International Conference on Cluster Computing (CLUSTER), Honolulu, HI, IEEE, September 2017, 245-255, doi: 10.1109/CLUSTER.2017.60
Burst buffers tolerate I/O spikes in High-Performance Computing environments by using a non-volatile flash technology. Burst buffers are commonly located between parallel file systems and compute nodes, handling bursty I/Os in the middle. In this architecture, burst buffers are shared resources. The performance of an SSD is significantly reduced when it is used excessively because of garbage collection, and we have observed that SSDs in a burst buffer become slow when many users simultaneously use the burst buffer. To mitigate the performance problem, we propose a new user-level I/O isolation framework in a High-Performance Computing environment using a multi-streamed SSD. The multi-streamed SSD allocates the same flash block for I/Os in the same stream. We assign a different stream to each user; thus, the user can use the stream exclusively. To evaluate the performance, we have used open-source supercomputing workloads and I/O traces from real workloads in the Cori supercomputer at the National Energy Research Scientific Computing Center. Via user-level I/O isolation, we have obtained up to a 125% performance improvement in terms of I/O throughput. In addition, our approach reduces the write amplification in the SSDs, leading to improved SSD endurance. This user-level I/O isolation framework could be applied to deployed burst buffers without having to make any user interface changes.
Jialin Liu, Quincey Koziol, Houjun Tang, François Tessier, Wahid Bhimji, Brandon Cook, Brian Austin, Suren Byna, Bhupender Thakur, Glenn K. Lockwood, Jack Deslippe, Prabhat, "Understanding the IO Performance Gap Between Cori KNL and Haswell", Proceedings of the 2017 Cray User Group, Redmond, WA, May 10, 2017
The Cori system at NERSC has two compute partitions with different CPU architectures: a 2,004 node Haswell partition and a 9,688 node KNL partition, which ranked as the 5th most powerful supercomputer on the November 2016 Top 500 list. The compute partitions share a common storage configuration, and understanding the IO performance gap between them is important, impacting not only NERSC/LBNL users and other national labs, but also the relevant hardware vendors and software developers. In this paper, we have analyzed performance of single core and single node IO comprehensively on the Haswell and KNL partitions, and have discovered the major bottlenecks, which include CPU frequencies and memory copy performance. We have also extended our performance tests to multi-node IO and revealed the IO cost difference caused by network latency, buffer size, and communication cost. Overall, we have developed a strong understanding of the IO gap between Haswell and KNL nodes, and the lessons learned from this exploration will guide us in designing optimal IO solutions in the many-core era.
C.S. Daley, D. Ghoshal, G.K. Lockwood, S. Dosanjh, L. Ramakrishnan, N.J. Wright, "Performance Characterization of Scientific Workflows for the Optimal Use of Burst Buffers", Workflows in Support of Large-Scale Science (WORKS-2016), CEUR-WS.org, 2016, 1800:69-73
Shane Snyder, Philip Carns, Kevin Harms, Robert Ross, Glenn K. Lockwood, Nicholas J. Wright, "Modular HPC I/O characterization with Darshan", Proceedings of the 5th Workshop on Extreme-Scale Programming Tools (ESPT'16), Salt Lake City, UT, November 13, 2016, 9-17, doi: 10.1109/ESPT.2016.9
Contemporary high-performance computing (HPC) applications encompass a broad range of distinct I/O strategies and are often executed on a number of different compute platforms in their lifetime. These large-scale HPC platforms employ increasingly complex I/O subsystems to provide a suitable level of I/O performance to applications. Tuning I/O workloads for such a system is nontrivial, and the results generally are not portable to other HPC systems. I/O profiling tools can help to address this challenge, but most existing tools only instrument specific components within the I/O subsystem that provide a limited perspective on I/O performance. The increasing diversity of scientific applications and computing platforms calls for greater flexibility and scope in I/O characterization.
In this work, we consider how the I/O profiling tool Darshan can be improved to allow for more flexible, comprehensive instrumentation of current and future HPC I/O workloads. We evaluate the performance and scalability of our design to ensure that it is lightweight enough for full-time deployment on production HPC systems. We also present two case studies illustrating how a more comprehensive instrumentation of application I/O workloads can enable insights into I/O behavior that were not previously possible. Our results indicate that Darshan’s modular instrumentation methods can provide valuable feedback to both users and system administrators, while imposing negligible overheads on user applications.
W. Bhimji, D. Bard, M. Romanus, D. Paul, A. Ovsyannikov, B. Friesen, M. Bryson, J. Correa, G. K. Lockwood, V. Tsulaia, S. Byna, S. Farrell, D. Gursoy, C. Daley, V. Beckner, B. Van Straalen, D. Trebotich, C. Tull, G. Weber, N. J. Wright, K. Antypas, Prabhat, "Accelerating Science with the NERSC Burst Buffer Early User Program", Cray User Group, May 11, 2016, LBNL LBNL-1005736,
NVRAM-based Burst Buffers are an important part of the emerging HPC storage landscape. The National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory recently installed one of the first Burst Buffer systems as part of its new Cori supercomputer, collaborating with Cray on the development of the DataWarp software. NERSC has a diverse user base comprised of over 6500 users in 700 different projects spanning a wide variety of scientific computing applications. The use-cases of the Burst Buffer at NERSC are therefore also considerable and diverse. We describe here performance measurements and lessons learned from the Burst Buffer Early User Program at NERSC, which selected a number of research projects to gain early access to the Burst Buffer and exercise its capability to enable new scientific advancements. To the best of our knowledge this is the first time a Burst Buffer has been stressed at scale by diverse, real user workloads and therefore these lessons will be of considerable benefit to shaping the developing use of Burst Buffers at HPC centers.
Glenn K. Lockwood, Rick Wagner, Mahidhar Tatineni, "Storage utilization in the long tail of science", Proceedings of the 2015 XSEDE Conference, July 26, 2015, doi: 10.1145/2792745.2792777
The increasing expansion of computations in non-traditional domain sciences has resulted in an increasing demand for research cyberinfrastructure that is suitable for small- and mid-scale job sizes. The computational aspects of these emerging communities are coming into focus and being addressed through the deployment of several new XSEDE resources that feature easy on-ramps, customizable software environments through virtualization, and interconnects optimized for jobs that only use hundreds or thousands of cores; however, the data storage requirements for these emerging communities remain much less well characterized.
To this end, we examined the distribution of file sizes on two of the Lustre file systems within the Data Oasis storage system at the San Diego Supercomputer Center (SDSC). We found that there is a very strong preference for small files among SDSC's users, with 90% of all files being less than 2 MB in size. Furthermore, 50% of all file system capacity is consumed by files under 2 GB in size, and these distributions are consistent on both scratch and projects storage file systems. Because parallel file systems like Lustre and GPFS are optimized for parallel IO to large, wide-stripe files, these findings suggest that parallel file systems may not be the most suitable storage solutions when designing cyberinfrastructure to meet the needs of emerging communities.
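The two kinds of statistics quoted in this abstract (the share of files below a size threshold, and the share of capacity held by files below a threshold) can be computed from a plain list of file sizes. The following is a minimal sketch and not the authors' analysis code; the function names and the sample sizes are hypothetical and chosen only for illustration.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def fraction_of_files_below(sizes, threshold):
    """Fraction of files, by count, that are strictly smaller than threshold bytes."""
    if not sizes:
        return 0.0
    return sum(1 for s in sizes if s < threshold) / len(sizes)


def fraction_of_capacity_below(sizes, threshold):
    """Fraction of total bytes held by files strictly smaller than threshold bytes."""
    total = sum(sizes)
    if total == 0:
        return 0.0
    return sum(s for s in sizes if s < threshold) / total


# Hypothetical sample: nine 4 KiB files and one 4 GiB file.
sample_sizes = [4096] * 9 + [4 * GIB]

files_share = fraction_of_files_below(sample_sizes, 2 * MIB)
capacity_share = fraction_of_capacity_below(sample_sizes, 2 * GIB)
print(f"files under 2 MiB: {files_share:.1%}")
print(f"capacity in files under 2 GiB: {capacity_share:.4%}")
```

In this made-up sample most files are small while nearly all capacity sits in one large file, which is the kind of skew that makes the count-based and capacity-based views worth reporting separately.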
Dong Ju Choi, Glenn K. Lockwood, Robert S. Sinkovits, Mahidhar Tatineni, "Performance of applications using dual-rail InfiniBand 3D torus network on the Gordon supercomputer", Proceedings of the 2014 XSEDE Conference, July 13, 2014, doi: 10.1145/2616498.2616541
Multi-rail InfiniBand networks provide options to improve bandwidth, increase reliability, and lower latency for multi-core nodes. The Gordon supercomputer at SDSC, with its dual-rail InfiniBand 3-D torus network, is used to evaluate the performance impact of using multiple rails. The study was performed using the OSU micro-benchmarks, the P3FFT application kernel, and scientific applications LAMMPS and AMBER. The micro-benchmarks confirmed the bandwidth and latency performance benefits. At the application level, performance improvements depended on the communication level and profile.
Glenn K. Lockwood, Mahidhar Tatineni, Rick Wagner, "SR-IOV: Performance benefits for virtualized interconnects", Proceedings of the 2014 XSEDE Conference, July 13, 2014, doi: 10.1145/2616498.2616537
The demand for virtualization within high-performance computing is rapidly growing as new communities, driven by both new application stacks and new computing modalities, continue to grow and expand. While virtualization has traditionally come with significant penalties in I/O performance that have precluded its use in mainstream large-scale computing environments, new standards such as Single Root I/O Virtualization (SR-IOV) are emerging that promise to diminish the performance gap and make high-performance virtualization possible. To this end, we have evaluated SR-IOV in the context of both virtualized InfiniBand and virtualized 10 gigabit Ethernet (GbE) using micro-benchmarks and real-world applications. We compare the performance of these interconnects on non-virtualized environments, Amazon's SR-IOV-enabled C3 instances, and our own SR-IOV-enabled InfiniBand cluster and show that SR-IOV significantly reduces the performance losses caused by virtualization. InfiniBand demonstrates less than 2% loss of bandwidth and less than 10% increase in latency when virtualized with SR-IOV. Ethernet also benefits, although less dramatically, when SR-IOV is enabled on Amazon's cloud.
Jeff A. Tracey, James K. Sheppard, Glenn K. Lockwood, Amit Chourasia, Mahidhar Tatineni, Robert N. Fisher, Robert S. Sinkovits, "Efficient 3D movement-based kernel density estimator and application to wildlife ecology", Proceedings of the 2014 XSEDE Conference, San Diego, CA, July 13, 2014, doi: 10.1145/2616498.2616541
We describe an efficient implementation of a 3D movement-based kernel density estimator for determining animal space use from discrete GPS measurements. This new method provides more accurate results, particularly for species that make large excursions in the vertical dimension. The downside of this approach is that it is much more computationally expensive than simpler, lower-dimensional models. Through a combination of code restructuring, parallelization and performance optimization, we were able to reduce the time to solution by up to a factor of 1000, thereby greatly improving the applicability of the method.
Glenn K. Lockwood, Kirill Lozinskiy, Lisa Gerhardt, Ravi Cheema, Damian Hazen, Nicholas J. Wright, "A Quantitative Approach to Architecting All-Flash Lustre File Systems", ISC High Performance 2019: High Performance Computing, edited by Michele Weiland, Guido Juckeland, Sadaf Alam, Heike Jagode, (Springer International Publishing: 2019) Pages: 183--197 doi: 10.1007/978-3-030-34356-9_16
New experimental and AI-driven workloads are moving into the realm of extreme-scale HPC systems at the same time that high-performance flash is becoming cost-effective to deploy at scale. This confluence poses a number of new technical and economic challenges and opportunities in designing the next generation of HPC storage and I/O subsystems to achieve the right balance of bandwidth, latency, endurance, and cost. In this work, we present quantitative models that use workload data from existing, disk-based file systems to project the architectural requirements of all-flash Lustre file systems. Using data from NERSC’s Cori I/O subsystem, we then demonstrate the minimum required capacity for data, capacity for metadata and data-on-MDT, and SSD endurance for a future all-flash Lustre file system.
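As a rough illustration of what an SSD endurance requirement can look like, a commonly used figure is drive writes per day (DWPD): the daily write volume, scaled by a write amplification factor, divided by the usable capacity. The sketch below uses that generic definition under stated assumptions; it is not the model from the paper, and the function name, parameters, and numbers are hypothetical.

```python
def required_dwpd(daily_write_bytes, capacity_bytes, write_amplification=1.0):
    """Estimate the drive writes per day (DWPD) a file system would need.

    daily_write_bytes: bytes written by applications per day (assumed input).
    capacity_bytes: usable flash capacity in bytes (assumed input).
    write_amplification: ratio of flash writes to host writes; 1.0 means none.
    """
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return daily_write_bytes * write_amplification / capacity_bytes


PB = 10 ** 15

# Hypothetical numbers: 3 PB written per day to a 30 PB file system,
# with an assumed write amplification factor of 2.
print(required_dwpd(3 * PB, 30 * PB, write_amplification=2.0))
```

A result well below the rated DWPD of a candidate drive would suggest endurance headroom; measured workload data would replace the placeholder inputs in practice.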
Greg Butler, Ravi Cheema, Damian Hazen, Kristy Kallback-Rose, Rei Lee, Glenn Lockwood, NERSC Community File System, March 4, 2020,
Glenn K. Lockwood, Kirill Lozinskiy, Kristy Kallback-Rose, NERSC's Perlmutter System: Deploying 30 PB of all-NVMe Lustre at scale, Lustre BoF at SC19, November 19, 2019,
- Download File: 2019-11-Lustre-BOF-talk-KKR-no-notes.pptx (pptx: 3 MB)
Update at SC19 Lustre BoF on collaborative work with Cray on deploying an all-flash Lustre tier for NERSC's Perlmutter Shasta system.
Kirill Lozinskiy, Glenn K. Lockwood, Lisa Gerhardt, Ravi Cheema, Damian Hazen, Nicholas J. Wright, A Quantitative Approach to Architecting All‐Flash Lustre File Systems, Lustre User Group (LUG) 2019, May 15, 2019,
Robert Ross, Lee Ward, Philip Carns, Gary Grider, Scott Klasky, Quincey Koziol, Glenn K. Lockwood, Kathryn Mohror, Bradley Settlemyer, Matthew Wolf, "Storage Systems and I/O: Organizing, Storing, and Accessing Data for Scientific Discovery", 2018, doi: 10.2172/1491994
In September, 2018, the Department of Energy, Office of Science, Advanced Scientific Computing Research Program convened a workshop to identify key challenges and define research directions that will advance the field of storage systems and I/O over the next 5–7 years. The workshop concluded that addressing these combined challenges and opportunities requires tools and techniques that greatly extend traditional approaches and require new research directions. Key research opportunities were identified.
Glenn K. Lockwood, Damian Hazen, Quincey Koziol, Shane Canon, Katie Antypas, Jan Balewski, Nicholas Balthaser, Wahid Bhimji, James Botts, Jeff Broughton, Tina L. Butler, Gregory F. Butler, Ravi Cheema, Christopher Daley, Tina Declerck, Lisa Gerhardt, Wayne E. Hurlbert, Kristy A. Kallback-Rose, Stephen Leak, Jason Lee, Rei Lee, Jialin Liu, Kirill Lozinskiy, David Paul, Prabhat, Cory Snavely, Jay Srinivasan, Tavia Stone Gibbins, Nicholas J. Wright, "Storage 2020: A Vision for the Future of HPC Storage", October 20, 2017,
- Download File: Storage-2020-A-Vision-for-the-Future-of-HPC-Storage.pdf (pdf: 3.6 MB)