NERSC Summer Student Puts MPI Under the Microscope
October 25, 2022
By Elizabeth Ball
The supercomputers at the National Energy Research Scientific Computing Center (NERSC) support all kinds of research across the scientific spectrum, but sometimes they’re also the subject of research in their own right, or part of the question to be answered. This summer, Muna Tageldin, a Ph.D. candidate in electrical and computer engineering at Marquette University, collaborated with NERSC staff as part of the Berkeley Lab Computing Sciences Summer Program, developing a microbenchmark to analyze variation in Message Passing Interface (MPI) performance on NERSC systems and looking for the best statistical methods to characterize the results.
MPI is a standard protocol commonly used by parallel applications to send and receive data over high-speed internal networks on supercomputers, and analyzing its performance is essential for understanding applications’ scalability and portability. Tageldin worked with NERSC application performance specialists Kevin Gott and Brandon Cook, and NERSC User Engagement Group lead Rebecca Hartman-Baker to investigate some complexities of MPI performance.
“We find that the MPI all-to-all collective performance measurements form a multimodal distribution on a system running a production workload,” said Tageldin. “My project is understanding MPI performance variation and also finding statistical tests that can correctly describe this variation.”
According to Tageldin, frequently used summary statistics like minimum, maximum, median, and mean don’t always capture the intrinsic characteristics of MPI performance variations, which can take quite complex forms. To find statistical methods that more accurately characterize the distributions of run times, she spent her summer recording collective MPI_Alltoall measurements on both NERSC systems, Cori and Perlmutter, using the microbenchmark she developed. Her process included iteratively adjusting MPI parameters like the size of the message transmitted and the communication size (the number of processors involved in the communication). She found that multimodal distributions were common across many configurations, and she worked to find statistical methods, such as time domain features, that can describe the variance in MPI performance data. Though summer has come to an end, her goal is to analyze multimodal distributions using statistics and associate those distributions with different MPI configurations, work that may continue at NERSC even after she’s returned to Marquette.
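To see why summary statistics can mislead on multimodal data, consider the minimal Python sketch below. It is an illustration only, not Tageldin’s microbenchmark or analysis method: the timings are synthetic values made up for the example (two clusters near 1 ms and 3 ms), and the function names, bin width, and threshold are all assumptions. The mean of the sample lands in a gap where no measurement actually falls, while a simple histogram check finds the two modes.

```python
import random
import statistics


def synthetic_timings(seed=0):
    """Return fabricated, illustrative timings in ms (NOT NERSC data).

    Two clusters stand in for a bimodal distribution of run times.
    """
    rng = random.Random(seed)
    fast = [rng.gauss(1.0, 0.05) for _ in range(500)]
    slow = [rng.gauss(3.0, 0.10) for _ in range(500)]
    return fast + slow


def count_modes(samples, bin_width=0.25, min_fraction=0.1):
    """Count contiguous runs of well-populated histogram bins.

    A bin is "populated" if it holds at least min_fraction of the
    count in the fullest bin. This is a crude heuristic chosen for
    the sketch, not a rigorous test for multimodality.
    """
    lo = min(samples)
    n_bins = int((max(samples) - lo) / bin_width) + 1
    counts = [0] * n_bins
    for t in samples:
        counts[int((t - lo) / bin_width)] += 1

    threshold = min_fraction * max(counts)
    modes, in_run = 0, False
    for c in counts:
        populated = c >= threshold
        if populated and not in_run:
            modes += 1  # a new run of populated bins starts here
        in_run = populated
    return modes


if __name__ == "__main__":
    samples = synthetic_timings()
    print("min   :", round(min(samples), 3))
    print("max   :", round(max(samples), 3))
    print("median:", round(statistics.median(samples), 3))
    print("mean  :", round(statistics.mean(samples), 3))
    print("modes :", count_modes(samples))
```

In this synthetic case, the four summary numbers give no hint that the run times split into two distinct groups, which is the kind of structure the article describes as common in the measured MPI data.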
Tageldin says the Summer Student Program has been an opportunity to branch out, explore, and gain experience working on topics slightly outside of her primary area of research. “My dissertation is on analyzing high performance computing (HPC) systems performance using probabilistic models,” she said. “And in this internship, I’m tackling HPC performance from a statistics and coding perspective. It’s kind of interesting because MPI is an area I haven’t delved into in detail, and now we’re doing detailed performance analysis.”
Additionally, she notes that her probabilistic modeling and statistics background is turning out to be more connected to hands-on research than she previously expected, and she hopes to leverage that link in the future. With the summer over, she’ll finish her dissertation and consider how she wants to apply her skills and experience, possibly in a research career.
Tageldin’s experience in the Summer Student Program wasn’t just a benefit to her; according to NERSC staff, working with Tageldin and students like her is an infusion of energy and new ways of thinking about their work.
“The summer student program has been an amazing experience, providing me with fresh perspectives and new insights into the research being done at the Lab and its place in the broader scientific community,” said Gott, who served as Tageldin’s mentor this summer. “My summer interns have added a lot to my understanding of HPC today. I hope they’ve learned as much here as I have from them.”
About NERSC and Berkeley Lab
The National Energy Research Scientific Computing Center (NERSC) is a U.S. Department of Energy Office of Science User Facility that serves as the primary high-performance computing center for scientific research sponsored by the Office of Science. Located at Lawrence Berkeley National Laboratory, the NERSC Center serves more than 7,000 scientists at national laboratories and universities researching a wide range of problems in combustion, climate modeling, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines. Berkeley Lab is a DOE national laboratory located in Berkeley, California. It conducts unclassified scientific research and is managed by the University of California for the U.S. Department of Energy.