
Rollin Thomas, Ph.D.
Senior Computing Engineer
Programming Environments and Models Group
1 Cyclotron Road
Mailstop: 59-4010A
Berkeley, CA 94720 USA

Biographical Sketch

Rollin is interested in interactive, real-time, and urgent supercomputing for science. He manages NERSC's JupyterHub service and the JupyterLab deployments on Cori and Perlmutter. Prior to joining the Programming Environments and Models Group, Rollin worked in the Data Department at NERSC in the Data and Analytics Services Group and the Data Science Engagement Group.

Journal Articles

Rollin Thomas, Shreyas Cholia, "Interactive Supercomputing With Jupyter", Computing in Science & Engineering, March 26, 2021, 23:93-98, doi: 10.1109/MCSE.2021.3059037

Rich user interfaces like Jupyter have the potential to make interacting with a supercomputer easier and more productive, consequently attracting new kinds of users and helping to expand the application of supercomputing to new science domains. For the scientist user, the ideal rich user interface delivers a familiar, responsive, introspective, modular, and customizable platform upon which to build, run, capture, document, re-run, and share analysis workflows. From the provider or system administrator perspective, such a platform would also be easy to configure, deploy securely, update, customize, and support. Jupyter checks most if not all of these boxes. But from the perspective of leadership computing organizations that provide supercomputing power to users, such a platform should also make the unique features of a supercomputer center more accessible to users and more composable with high-performance computing (HPC) workflows. Project Jupyter's (https://jupyter.org/about) core design philosophy of extensibility, abstraction, and agnostic deployment has allowed HPC centers like the National Energy Research Scientific Computing Center (NERSC) to bring in advanced supercomputing capabilities that can extend the interactive notebook environment. This has enabled a rich scientific discovery platform, particularly for experimental facility data analysis and machine learning problems.
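As a concrete, hypothetical illustration of the deployment pattern described above, the sketch below shows how a JupyterHub instance can hand each user's notebook server off to a batch scheduler using the community batchspawner package. The partition name and resource requests are assumptions for illustration, not NERSC's actual configuration.

```python
# jupyterhub_config.py -- a minimal sketch, not NERSC's production setup.
# JupyterHub injects the configuration object `c` into this file's namespace.

# batchspawner submits each user's single-user Jupyter server as a batch
# job on the HPC system instead of running it on the hub host.
c.JupyterHub.spawner_class = "batchspawner.SlurmSpawner"

# Illustrative Slurm resource requests; these values are assumptions.
c.SlurmSpawner.req_partition = "interactive"
c.SlurmSpawner.req_runtime = "04:00:00"
c.SlurmSpawner.req_nprocs = "4"
```

Because the spawner is a pluggable class, a center can swap in its own subclass to add site-specific authentication or resource policy without modifying Jupyter itself, which is the extensibility argument the article makes.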

Conference Papers

Daniel Margala, Laurie Stephey, Rollin Thomas, Stephen Bailey, "Accelerating Spectroscopic Data Processing Using Python and GPUs on NERSC Supercomputers", Proceedings of the 20th Python in Science Conference, August 1, 2021, 33-39, doi: 10.25080/majora-1b6fd038-004

The Dark Energy Spectroscopic Instrument (DESI) will create the most detailed 3D map of the Universe to date by measuring redshifts in the light spectra of over 30 million galaxies. The extraction of 1D spectra from 2D spectrograph traces in the instrument output is one of the main computational bottlenecks of the DESI data processing pipeline, which is predominantly implemented in Python. The new Perlmutter supercomputer system at the National Energy Research Scientific Computing Center (NERSC) will feature over 6,000 NVIDIA A100 GPUs across 1,500 nodes. The new heterogeneous CPU-GPU computing capability at NERSC opens the door to improved performance for science applications that are able to leverage the high-throughput computation enabled by GPUs. We have ported the DESI spectral extraction code to run on GPU devices, achieving a 20x improvement in per-node throughput compared to the current state of the art on the CPU-only Haswell partition of the Cori supercomputer system at NERSC.
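The paper itself details the ported kernels; as a rough sketch of the general porting pattern, CuPy mirrors the NumPy API closely enough that array code written against NumPy can often be retargeted to the GPU by swapping the array module. The function and shapes below are toy stand-ins, not the DESI extraction code.

```python
# Toy NumPy-to-CuPy pattern; extract_flux is a hypothetical stand-in
# for a spectral-extraction-style weighted linear solve.
import numpy as np
import cupy as cp

def extract_flux(xp, traces, weights):
    # Normal equations for a weighted least-squares fit of flux
    # coefficients to 2D detector traces; xp is numpy or cupy.
    design = xp.einsum("ij,ik->jk", traces * weights[:, None], traces)
    rhs = xp.einsum("ij,i->j", traces, weights)
    return xp.linalg.solve(design, rhs)

traces = np.random.rand(4096, 32)
weights = np.random.rand(4096)

flux_cpu = extract_flux(np, traces, weights)  # CPU path
flux_gpu = extract_flux(cp, cp.asarray(traces), cp.asarray(weights))  # GPU path
assert np.allclose(flux_cpu, cp.asnumpy(flux_gpu), rtol=1e-5)
```

The actual port involves more than a module swap, but the pattern above captures the starting point for moving NumPy-heavy pipeline stages onto the GPU.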

Rollin Thomas, Laurie Stephey, Annette Greiner, Brandon Cook, "Monitoring Scientific Python Usage on a Supercomputer", SciPy 2021: Proceedings of the 20th Python in Science Conference, August 1, 2021, 123-131, doi: 10.25080/majora-1b6fd038-010

In 2021, more than 30% of users at the National Energy Research Scientific Computing Center (NERSC) used Python on the Cori supercomputer. To determine this, we developed and open-sourced a simple, minimally invasive monitoring framework that leverages standard Python features to capture Python imports and other job data via a package called Customs. To analyze the data we collect via Customs, we have developed a Jupyter-based analysis framework designed to be interactive, shareable, extensible, and publishable via a dashboard. Our stack includes Papermill to execute parameterized notebooks, Dask-cuDF for multi-GPU processing, and Voila to render our notebooks as web-based dashboards. We report preliminary findings from Customs data collection and analysis. This work demonstrates that our monitoring framework can capture insightful and actionable data, including top Python libraries, preferred user software stacks, and correlated libraries, leading to a better understanding of user behavior and affording us the opportunity to make increasingly data-driven decisions regarding Python at NERSC.
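The Customs package itself is described in the paper; as a minimal sketch of the underlying standard-library idea, one can record which packages of interest a job actually imported by inspecting sys.modules at interpreter exit. The package list and the stderr "transport" below are assumptions for illustration.

```python
# Minimal import-monitoring sketch in the spirit of (not identical to)
# Customs. Placed in sitecustomize.py, it runs in every Python process.
import atexit
import sys

PACKAGES_OF_INTEREST = {"numpy", "scipy", "mpi4py", "dask"}  # assumed list

def report_imports():
    # At exit, sys.modules holds everything the process imported;
    # intersect its top-level names with the monitored set.
    top_level = {name.split(".")[0] for name in sys.modules}
    seen = sorted(PACKAGES_OF_INTEREST & top_level)
    if seen:
        # A real deployment would ship this record to central storage;
        # printing to stderr stands in for that transport here.
        print(f"python-usage: {seen}", file=sys.stderr)

atexit.register(report_imports)
```

Because the hook only reads sys.modules once at shutdown, the overhead on user jobs is negligible, which is what "minimally invasive" monitoring requires.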

B Enders, D Bard, C Snavely, L Gerhardt, J Lee, B Totzke, K Antypas, S Byna, R Cheema, S Cholia, M Day, A Gaur, A Greiner, T Groves, M Kiran, Q Koziol, K Rowland, C Samuel, A Selvarajan, A Sim, D Skinner, R Thomas, G Torok, "Cross-facility science with the Superfacility Project at LBNL", IEEE/ACM 2nd Annual Workshop on Extreme-scale Experiment-in-the-Loop Computing (XLOOP), November 12, 2020, 1-7, doi: 10.1109/XLOOP51963.2020.00006

Laurie Stephey, Rollin Thomas, Stephen Bailey, "Optimizing Python-Based Spectroscopic Data Processing on NERSC Supercomputers", Proceedings of the 18th Python in Science Conference, August 13, 2019, 69-76, doi: 10.25080/Majora-7ddc1dd1-00a

We present a case study of optimizing a Python-based cosmology data processing pipeline designed to run in parallel on thousands of cores using supercomputers at the National Energy Research Scientific Computing Center (NERSC).

The goal of the Dark Energy Spectroscopic Instrument (DESI) experiment is to better understand dark energy by making the most detailed 3D map of the universe to date. Over a five-year period starting this year (2019), around 1000 CCD frames per night (30 per exposure) will be read out from the instrument and transferred to NERSC for processing and analysis on the Cori and Perlmutter supercomputers in near-real time. This fast turnaround helps DESI monitor survey progress and update the next night's observing schedule.

The DESI spectroscopic pipeline for processing these data is written almost exclusively in Python. Using Python allows DESI scientists to write very readable and maintainable scientific code in a relatively short amount of time, which is important due to limited DESI developer resources. However, the drawback is that Python can be substantially slower than more traditional high performance computing languages like C, C++, and Fortran.

The goal of this work is to improve the performance of DESI's spectroscopic data processing pipeline at NERSC while satisfying the collaboration's productivity requirement that the software remain in Python. Within this space we have obtained specific (per node-hour) throughput improvements of over 5x and 6x on the Cori Haswell and Knights Landing partitions, respectively. Several profiling techniques were used to determine potential areas for improvement, including Python's cProfile and line_profiler packages, and other tools like Intel VTune and Tau. Once we identified expensive kernels, we used two techniques: 1) JIT-compiling hotspots using Numba and 2) restructuring the code to lessen the impact of calling expensive functions. Additionally, we seriously considered replacing MPI parallelism with Dask, a more flexible and robust alternative, but found that once a code has been designed with MPI in mind, it is non-trivial to transition it to another kind of parallelism. We will also show initial considerations for transitioning DESI spectroscopic extraction to GPUs (coming in the next NERSC system, Perlmutter, in 2020).
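As an illustration of the first technique, the snippet below shows the general Numba pattern of JIT-compiling an explicit numerical loop; the kernel is a hypothetical stand-in, not DESI's actual hotspot.

```python
# Hedged Numba illustration: an explicit loop that would be slow in
# pure Python compiles to machine code under @njit.
import numpy as np
from numba import njit

@njit(cache=True)
def weighted_residuals(data, model, ivar):
    # Elementwise (data - model) * sqrt(ivar); written as a loop to
    # show the kind of code Numba accelerates.
    out = np.empty_like(data)
    for i in range(data.size):
        out[i] = (data[i] - model[i]) * np.sqrt(ivar[i])
    return out

n = 1_000_000
res = weighted_residuals(np.random.rand(n), np.random.rand(n), np.random.rand(n))
```

The first call pays a one-time compilation cost (amortized by cache=True); subsequent calls run at compiled-loop speed, which is why JIT compilation pays off only in genuine hotspots.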

Zahra Ronaghi, Rollin Thomas, Jack Deslippe, Stephen Bailey, Doga Gursoy, Theodore Kisner, Reijo Keskitalo, Julian Borrill, "Python in the NERSC Exascale Science Applications Program for Data", PyHPC'17: Proceedings of the 7th Workshop on Python for High-Performance and Scientific Computing, ACM, November 12, 2017, 1-10, doi: 10.1145/3149869.3149873

We describe a new effort at the National Energy Research Scientific Computing Center (NERSC) in performance analysis and optimization of scientific Python applications targeting the Intel Xeon Phi (Knights Landing, KNL) manycore architecture. The Python-centered work outlined here is part of a larger effort called the NERSC Exascale Science Applications Program (NESAP) for Data. NESAP for Data focuses on applications that process and analyze high-volume, high-velocity data sets from experimental or observational science (EOS) facilities supported by the US Department of Energy Office of Science. We present three case study applications from NESAP for Data that use Python. These codes vary in terms of "Python purity" from applications developed in pure Python to ones that use Python mainly as a convenience layer for scientists without expertise in lower level programming languages like C, C++ or Fortran. The science case, requirements, constraints, algorithms, and initial performance optimizations for each code are discussed. Our goal with this paper is to contribute to the larger conversation around the role of Python in high-performance computing today and tomorrow, highlighting areas for future work and emerging best practices.

Keith Jackson, Lavanya Ramakrishnan, Karl Runge, and Rollin Thomas, "Seeking Supernovae in the Clouds: A Performance Study", HPDC '10: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, ACM, June 2010, 421–429, doi: 10.1145/1851476.1851538

Best Paper, ScienceCloud 2010

Today, our picture of the Universe radically differs from that of just over a decade ago. We now know that the Universe is not only expanding as Hubble discovered in 1929, but that the rate of expansion is accelerating, propelled by mysterious new physics dubbed "Dark Energy." This revolutionary discovery was made by comparing the brightness of nearby Type Ia supernovae (which exploded in the past billion years) to that of much more distant ones (from up to seven billion years ago). The reliability of this comparison hinges upon a very detailed understanding of the physics of the nearby events. As part of its effort to further this understanding, the Nearby Supernova Factory (SNfactory) relies upon a complex pipeline of serial processes that execute various image processing algorithms in parallel on ~10 TB of data.

This pipeline has traditionally been run on a local cluster. Cloud computing offers many features that make it an attractive alternative. The ability to completely control the software environment in a Cloud is appealing when dealing with a community developed science pipeline with many unique library and platform requirements. In this context we study the feasibility of porting the SNfactory pipeline to the Amazon Web Services environment. Specifically we: describe the tool set we developed to manage a virtual cluster on Amazon EC2, explore the various design options available for application data placement, and offer detailed performance results and lessons learned from each of the above design options.

Sarah S. Poon, Rollin C. Thomas, Cecilia R. Aragon, Brian Lee, "Context-linked virtual assistants for distributed teams: an astrophysics case study", CSCW '08: Proceedings of the 2008 ACM conference on Computer supported cooperative work, ACM, November 8, 2008, 361–370, doi: 10.1145/1460563.1460623

Best Paper Honorable Mention, CSCW'08

There is a growing need for distributed teams to analyze complex and dynamic data streams and make critical decisions under time pressure. Via a case study, we discuss potential guidelines for the design of software tools to facilitate such collaborative decision-making. We introduce the term context-linked to characterize systems where both task and context information are included in a shared space. We describe a novel, lightweight, context-linked event notification/virtual assistant system developed to aid a cross-cultural, geographically distributed team of astrophysicists to remotely maneuver a custom-built instrument under challenging operational conditions, where critical decisions must be made in as little as 45 seconds. The system has been in use since 2005 by a major international astrophysics collaboration. We describe the design and implementation of the event notification system and then present a case study, based on event log analysis and user interviews, of its effectiveness in substantially improving user performance during time-critical science tasks. Finally, we discuss the implications of context linking for supporting common ground in distributed teams.