2022 NERSC Summer Internship Projects
Data and Analytics
Fusion Energy Workflow Development on Perlmutter
**This position is now filled**
Science/CS domain: scientific workflows and workflow tools, container infrastructure, fusion energy, mixed CPU and GPU analysis, reproducibility
Project Description: One of the NERSC superfacility projects aims to develop realtime processing for the KSTAR tokamak in Korea. They have a complex software stack required for the data transfer and job orchestration. Part of this project will include scripting and testing the complex software build process for reproducibility. Another goal is to move to using containers for all parts of the data transfer and data analysis, so the build script would likely target container infrastructure. The KSTAR team has been experimenting with the Ray workflow engine, so the final part of this project will include testing this workflow engine on our new system Perlmutter and evaluating its suitability for the complex KSTAR workflow needs. This will likely include running heterogeneous CPU and GPU analysis workloads.
Desired Skills/Background: Python, Bash, Docker/container technologies, Ray workflow tool, Git, CI
NERSC/DAS mentor: Laurie Stephey ([email protected])
Understanding usage of I/O libraries in supercomputer production workloads
Science/CS domain: I/O
Project Description: Efficiently storing and retrieving data in supercomputers is a tricky problem due to inter-dependencies among multiple layers of input/output (I/O) software, including high-level I/O libraries (e.g., HDF5, netCDF, ROOT), MPI-IO, POSIX, and file systems. Although profiling tools such as Darshan transparently collect metrics, it is unclear how those I/O libraries are used in production workloads and how their I/O accesses are translated before reaching the storage system. In this project, we seek to dive into a year’s worth of data to understand the usage of I/O libraries and the I/O characteristics of the applications relying on those high-level libraries. This would lead us to identify I/O performance bottlenecks and devise tuning strategies, as well as design new features in I/O libraries that are beneficial to the ever-changing application landscape on supercomputers.
Desired Skills/Background: Python, Jupyter Notebook, and interest in data analytics. Nice: knowledge of statistics and plotting tools.
NERSC/DAS mentor: Suren Byna ([email protected]) and Alberto Chiusole ([email protected])
Machine Learning
AI for cosmological simulations
This position has been filled for Summer 2022.
Science/CS domain: Deep learning & computational cosmology
Project Description: Cosmological simulations allow researchers to model the formation of structure in the universe, but can require extreme computational resources. Recently, AI-driven approaches have shown great promise in modeling these physical systems, reducing the amount of compute resources required while still providing accurate estimates of important quantities associated with dark matter and gas in the universe. This project aims to further investigate deep learning models as a surrogate for ab initio hydrodynamical cosmological simulations, with the goal of improving the accuracy and generalization ability of the model. This will involve training on large, state-of-the-art 3D datasets, potentially developing new model architectures or optimization techniques, and using resources from NERSC’s Perlmutter supercomputer, a cutting-edge system well-suited for AI workloads.
Desired Skills/Background: Python, deep learning, experience with PyTorch or TensorFlow
NERSC/DAS mentor: Peter Harrington ([email protected]), co-mentor: Zarija Lukić ([email protected])
Self-supervised & physics-informed ML for climate
This position has been filled for Summer 2022.
Science/CS domain: Scientific machine learning, weather/climate modeling
Project Description: Forecasting atmospheric variables accurately provides enormous contributions to various public and private sectors throughout the world economy, especially in the era of climate change. Recent times have seen a growing interest in data-driven approaches to modeling atmospheric physics, as such models can overcome modeling biases by learning the physics from data and also boast vast computational gains through efficient deep learning models, enabling better probabilistic forecasts. This project aims to improve upon state-of-the-art deep learning weather models by exploring self-supervision strategies and physics-based constraints to enhance model accuracies and generalization capability. This will involve adapting ideas from current leading models in computer vision, NLP, physics-informed ML, etc., and testing them on high-resolution weather datasets using resources from Perlmutter to handle large-scale AI workloads.
Desired Skills/Background: Python, deep learning, PyTorch/TensorFlow
NERSC/DAS mentor: Shashank Subramanian ([email protected]), Peter Harrington ([email protected])
Generative modeling for high-resolution atmospheric variables
This position has been filled for Summer 2022.
Science/CS domain: Scientific machine learning, weather/climate modeling
Project Description: In climate and weather models, predicting high-impact variables like surface winds, temperature, and precipitation is critical to our ability to prepare for climate change. However, accurately capturing these fields and their extreme events requires modeling high-resolution features, which is a computational challenge for many physical and data-driven models. AI-driven generative modeling presents an exciting opportunity for capturing fine-scale features at low computational cost, as such models can provide deterministic or variational predictions by learning directly from the patterns in observational and simulated data. This project aims to develop and apply such generative models to important climate variables, with the goal of resolving fine-scale details and capturing extreme events accurately. This will involve exploration of model architectures and training procedures that can effectively incorporate physical information from the predictions of state-of-the-art deep learning weather models and datasets, and using resources from NERSC’s Perlmutter supercomputer to train models.
Desired Skills/Background: Python, deep learning, PyTorch/TensorFlow
NERSC/DAS mentor: Shashank Subramanian ([email protected]), Peter Harrington ([email protected])
Scientific AI benchmarking and performance analysis
CS domain: deep learning, benchmarking
Project Description: Scientific AI/ML/DL applications are a transformative emerging workload for supercomputers, and it is critical for HPC centers to have robust methodologies and benchmarks for characterizing these new workloads, evaluating system performance, and driving innovation in hardware and system design. MLPerf benchmarks and related efforts are pushing on this front with state-of-the-art applications and performance measurements for HPC science. We are looking for an enthusiastic intern to analyze and optimize the performance of scientific AI benchmarks at scale on our new Perlmutter supercomputer, a powerful system featuring over 6,000 NVIDIA A100 GPUs, which debuted as the #5 system on the Top500 in 2021 and had leading results on MLPerf HPC v1.0.
Desired Skills/Background: Required: Python, machine learning, experience with PyTorch or TensorFlow. Nice: distributed deep learning, GPU profiling, hyperparameter optimization, model parallelism
NERSC/DAS mentor: Steven Farrell ([email protected]), Hai Ah Nam ([email protected])
Scientific AI workflow development
CS domain: AI/ML/DL, workflows, hyperparameter optimization, interactivity
Project Description: Developing and applying AI solutions to scientific problems often requires sophisticated workflows including distributed model training, hyperparameter optimization (HPO), and/or inference on massive datasets. These workflows need large scale compute resources, interactivity, and visualization to enable productive research and rapid evaluation of ideas. HPC systems such as our new Perlmutter system at NERSC have the required capabilities, but utilizing these systems effectively can still be challenging and many users orchestrate their workflows manually. As an intern on this project, you will evaluate and develop tools (e.g. in Jupyter) to enhance and automate the deployment of large scale distributed training, HPO, and inference workloads on Perlmutter.
Desired Skills/Background: Experience with one or more of the following: deep learning frameworks/tools, workflow tools, Jupyter, HPO
NERSC/DAS mentor: Steven Farrell ([email protected])
Estimating Particle Properties with Normalizing Flows
This position has been filled for Summer 2022.
Science/CS domain: Machine Learning
Project Description: In particle physics experiments (like those at the Large Hadron Collider), the simulation of particles with calorimeters is often the most computationally expensive step for inference. Deep generative models have shown great promise to accelerate slow physics-based simulations. This project will explore how normalizing flows can be used to model these complex processes and how the resulting probability density can be exploited for inference.
Desired Skills/Background: Required: Python (command line and Jupyter notebooks), interest (but not necessarily experience) in particle physics. Nice: Deep learning tools (TensorFlow/PyTorch)
NERSC/DAS mentor: Vinicius Mikuni ([email protected]), Wahid Bhimji ([email protected])
Generative Models to Advance Bioengineering
Science/CS domain: Deep learning & computational biology
Project Description: Recent advances in genomic sequencing have produced several thousand full genomes of pseudomonads, a genus of bacteria important in many science areas ranging from biogeochemical cycling in the environment to bacterial pneumonia in humans. Using these high-quality data sets, combined with tens of thousands of somewhat lower-quality metagenomically assembled genomes, this project seeks to create a generative model for pseudomonad genomes: a model able to generate genomes within a given set of parameters, for example, “Generate a genome that is root associated, drought resistant, salt tolerant that will produce this natural product”. This will involve training on large biological datasets, novel applications of machine learning, and using resources from NERSC’s Perlmutter supercomputer, a cutting-edge system well-suited for AI workloads.
Desired Skills/Background: Python, deep learning, experience with PyTorch or TensorFlow
NERSC/DAS mentor: Shane Canon ([email protected]), co-mentor: Paramvir Dehal ([email protected])
Generative Models to Create Novel Sensors
Science/CS domain: Deep learning & computational biology
Project Description: Through billions of years of selection, bacteria have fine tuned the set of chemical sensors that each species must respond to in order to optimize their fitness. Over one million such sensor domains exist in the gene databases. This project seeks to create a model for generating new sensor domains, ones which can be tested against chemicals/stimuli that are more relevant to humans. Ultimately, these novel sensors could be used to create biosensors capable of alerting us when toxins are present in the environment. This will involve training on large, biological datasets, novel applications of machine learning, and using resources from NERSC’s Perlmutter supercomputer, a cutting-edge system well-suited for AI workloads.
Desired Skills/Background: Python, deep learning, experience with PyTorch or TensorFlow
NERSC/DAS mentor: Shane Canon ([email protected]), co-mentor: Paramvir Dehal ([email protected])
Software Stack Testing
This position has been filled for Summer 2022.
CS domain: Software Reproducibility
Project Description: The Extreme-Scale Scientific Software Stack (E4S) is an HPC software stack comprising 80+ scientific software packages that are commonly used in HPC. NERSC has installed quarterly E4S releases on the Cori and Perlmutter systems; these are accessible to our user community and described in our E4S user documentation. In this role you will test the E4S stack by focusing on a subset of E4S applications and developing and running tests for our deployment stacks. We use buildtest, a Python-based HPC testing framework that helps automate the build and execution of tests. Our tests for the E4S deployments live in the public test repository https://github.com/buildtesters/buildtest-cori, and you will contribute to this repository to increase test coverage of the E4S stack. We will also collaborate with application teams to acquire and integrate tests at NERSC for the E4S deployment. The candidate should have experience in Python, Git, and CI/CD and a keen interest in testing, and will be required to troubleshoot build and test errors and refactor tests to resolve failures. The candidate should also be capable of writing technical documentation, following best practices during code development (refactoring code, writing unit tests), and working through a code-review process.
Desired Skills/Background: Python, Testing, Git, GitLab CI, Build Tools (CMake, Make)
NERSC mentor: Shahzeb Siddiqui ([email protected])
Smart Job Script Generator
THIS POSITION HAS BEEN FILLED AND IS NO LONGER ACCEPTING APPLICATIONS.
Science/CS domain: scientific workflows and workflow tools, HPC scientific performance, optimization, UI design
While computational scientists regularly perform studies of the effect of parallelization parameters on supercomputers, research scientists are often only concerned with the results of their simulations; how their scalable jobs are launched is less critical than getting their results as quickly as possible.
This project will begin to create the basics for a “Smart” job script generator that gives users a suggested job launch strategy. It will combine NERSC’s current job script generator with data from the Queue Wait Time plot and a few pieces of data from the user’s application to suggest a more optimal number of nodes (N) and walltime (t).
Tasks will include learning how jobs are scheduled and run on modern supercomputers, working with a team to develop algorithms that predict the optimal configuration, and writing code (Python and JavaScript) to calculate and report the best configuration. Depending on success and interest, this may be expanded to include different target metrics, more complicated job launching strategies, and different methods of reporting the smart configuration to users.
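As a rough illustration of the kind of calculation such a generator might perform, the sketch below picks the node count that minimizes estimated time to result (queue wait plus run time) and pads the run time to get a walltime request. The Amdahl-style runtime model, the `padding` factor, and every number here are hypothetical placeholders, not NERSC data or the project's actual algorithm.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    nodes: int
    walltime_hours: float
    total_hours: float


def estimate_runtime_hours(single_node_hours, nodes, serial_fraction):
    """Amdahl's-law runtime estimate (an assumed model, for illustration only)."""
    return single_node_hours * (serial_fraction + (1.0 - serial_fraction) / nodes)


def suggest(single_node_hours, serial_fraction, wait_hours_by_nodes, padding=1.2):
    """Return the candidate node count with the smallest wait + run time.

    wait_hours_by_nodes maps node count -> expected queue wait in hours.
    In practice this would come from queue wait time data; here it is made up.
    """
    best = None
    for nodes, wait in sorted(wait_hours_by_nodes.items()):
        run = estimate_runtime_hours(single_node_hours, nodes, serial_fraction)
        total = wait + run
        if best is None or total < best.total_hours:
            # Request somewhat more walltime than the estimate as a safety margin.
            best = Suggestion(nodes, run * padding, total)
    return best


# Hypothetical queue waits (hours) for each candidate node count.
example_waits = {1: 0.5, 4: 1.0, 16: 3.0, 64: 12.0}
example = suggest(single_node_hours=40.0, serial_fraction=0.05,
                  wait_hours_by_nodes=example_waits)
print(f"nodes={example.nodes} walltime={example.walltime_hours:.2f}h "
      f"time-to-result={example.total_hours:.2f}h")
```

With these invented inputs the largest job is not the fastest route to a result, because its longer queue wait outweighs the shorter run time; that trade-off is the point of the suggested configuration.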
Desired Skills/Background: Python, JavaScript, Applied Mathematics
NERSC mentor: Kevin Gott, Zhengji Zhao
Network Analysis and Performance
THIS POSITION HAS BEEN FILLED AND IS NO LONGER ACCEPTING APPLICATIONS.
Science/CS domain: network performance, HPC testing
Network efficiency is extremely important to scientific supercomputers. An appropriately tuned network and user code can save a tremendous amount of time and money when running scientific simulations. However, network performance is a vast, highly diverse field of study: there are multiple library options, each with a large number of tuning parameters, as well as many ways of running codes and many hardware choices.
This position seeks to improve our understanding of network use and performance on NERSC supercomputers so that we can better tune the network, capture its performance, and teach users how to write the parallel parts of their codes effectively. We are seeking motivated students to work on a wide variety of networking tasks, focusing primarily on MPI. We are interested in research and development in a variety of directions in this field. Some examples are given below, but student-defined projects that match our interests will also be considered:
Autotune MPI
Tuning a network’s communication libraries and features is a huge undertaking that involves hundreds if not thousands of variables, switches and features that must be properly set for the underlying workload. This project would begin the work of trying to categorize and quantify network performance towards the effort of auto-tuning networks in the future.
Students would begin by attempting to define a small number of “global” metrics that describe the performance of a collection of MPI tests (say, collective calls). They would then tune the available MPI, network and implementation variables to optimize performance and see how well their metrics describe it. This process will be iterated (adjusting the metrics, re-optimizing the variables, etc.) to identify a solid strategy for auto-tuning MPI performance on a given supercomputer.
MPI Performance consistency
The MPI standard aspires to consistency across its hierarchy of operations such that no combined MPI function should be slower than the sum of its parts. For example, an MPI_Bcast to N processes should have the same or better performance than N pairs of MPI_Send and MPI_Recv operations. This project would seek to create a set of tests to confirm consistency for MPI implementations. Students would start by building and implementing an initial test and then expanding to create a suite of the most commonly used MPI calls. They will also test their suite on NERSC’s systems to identify any current issues with our available MPI packages.
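The consistency rule in the example above reduces to a simple comparison once timings are in hand. The sketch below is a minimal illustration under stated assumptions: the function name, the 5% `tolerance`, and all timings are hypothetical. In a real suite the numbers would come from timed MPI_Bcast calls and timed MPI_Send/MPI_Recv pairs.

```python
def check_consistency(collective_s, pair_s, n_pairs, tolerance=0.05):
    """Return True if one collective is no slower than n_pairs send/recv pairs.

    collective_s: measured time of the combined operation (e.g. MPI_Bcast).
    pair_s:       measured time of one MPI_Send/MPI_Recv pair.
    tolerance:    allowed relative slack for measurement noise (an assumption).
    """
    baseline = pair_s * n_pairs
    return collective_s <= baseline * (1.0 + tolerance)


# Made-up timings in seconds: (label, collective, one pair, number of pairs).
measurements = [
    ("bcast_8B", 12e-6, 2e-6, 63),
    ("bcast_1MiB", 9.0e-3, 0.12e-3, 63),
]

violations = [
    label
    for label, collective_s, pair_s, n_pairs in measurements
    if not check_consistency(collective_s, pair_s, n_pairs)
]
print("inconsistent cases:", violations)
```

Here the invented large-message broadcast is slower than its 63 point-to-point equivalents, so it would be flagged for further investigation.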
NERSC OSU-like microbenchmark suite
The OSU MPI-microbenchmark test suite is a go-to, industry standard test suite for simple tests. However, it has some deficiencies when being used on today’s modern supercomputers. This project would work towards improving the test suite to better fit NERSC’s network testing needs. Target improvements include, but are not limited to:
- Building scaling tests inside a single srun.
- Reporting better statistics (arbitrary percentiles vs min, avg, max).
- Setting up better defaults (i.e. very large max message sizes).
- Making the suite open source and community driven (i.e. on github and accepts PRs).
- Adding data-friendly output formats (e.g. json).
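Two of the target improvements above, percentile statistics and JSON output, can be sketched with the Python standard library alone. This is only an illustration: the field names, the choice of percentiles, and the latency samples are assumptions, not the format of the OSU benchmarks or of any existing NERSC tool.

```python
import json
import statistics


def summarize(latencies_us, percentiles=(50, 90, 99)):
    """Summarize per-iteration latencies with min/avg/max plus chosen percentiles.

    percentiles must be integers between 1 and 99.
    """
    # quantiles(n=100) returns the 99 cut points P1..P99.
    cuts = statistics.quantiles(latencies_us, n=100, method="inclusive")
    summary = {
        "count": len(latencies_us),
        "min": min(latencies_us),
        "avg": statistics.fmean(latencies_us),
        "max": max(latencies_us),
    }
    for p in percentiles:
        summary[f"p{p}"] = cuts[p - 1]
    return summary


# Made-up latency samples in microseconds, including two slow outliers.
samples = [1.9, 2.0, 2.0, 2.1, 2.1, 2.2, 2.3, 2.4, 9.8, 35.0]
report = {
    "benchmark": "latency",
    "message_bytes": 8,
    "unit": "us",
    "stats": summarize(samples),
}
print(json.dumps(report, indent=2))
```

With outliers like the two in this sample, the average lands well above the median, which is why percentiles give a more useful picture than min, avg and max alone.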
Desired Skills/Background: MPI, Parallel Computing, C++/Fortran, HPC computing experience
NERSC mentor: Kevin Gott, Brandon Cook
Spack Infrastructure - Delivering HPC Software Stack to users
THIS POSITION HAS BEEN FILLED AND IS NO LONGER ACCEPTING APPLICATIONS.
Science/CS domain: HPC computing experience, Software Deployment
Project Description: NERSC supports 8000+ users with hundreds of software packages installed on our systems (Cori, Perlmutter). We use Spack, a package manager that installs scientific software optimized for the target system. Spack can support multiple versions of a package along with multiple build configurations of each.
In this role, the candidate will be responsible for supporting the Spack Infrastructure project that contains our spack configuration used for our software stack deployments. The spack infrastructure project leverages Gitlab to automate deployments of the Extreme-Scale Scientific Software Stack (E4S) which is a community effort to provide open source software packages for HPC platforms.
In this role, the candidate will be responsible for building the software stack via Spack, troubleshooting build errors, monitoring CI pipelines in GitLab, and writing end-user documentation (https://docs.nersc.gov/). The candidate will also help develop Spack training material for end users, which will be made available in the Spack Infrastructure documentation (http://nersc-spack-infrastructure.readthedocs.io/).
We plan to deploy E4S 22.05 on Perlmutter; the release is scheduled for May 2022 and will be available at https://github.com/E4S-Project/e4s. The candidate will take part in the deployment process for this E4S stack, which we anticipate will be available to end users in July 2022.
Desired Skills/Background: Spack, Git, Gitlab CI, environment-modules/Lmod, Tcl, Lua, Build Tools (CMake, Make)
NERSC mentor: Shahzeb Siddiqui ([email protected])