AI-based Approach Speeds Diagnosis of I/O Performance Bottlenecks in HPC

Science Highlight

March 18, 2024

By Kathy Kincade
Contact: cscomms@lbl.gov

Science Breakthrough

Researchers from Lawrence Berkeley National Laboratory (Berkeley Lab) have developed a novel AI-based method for diagnosing input/output (I/O) performance bottlenecks in high performance computing (HPC) that automatically identifies these bottlenecks at the job level and offers potential solutions. In modeling experiments run at the National Energy Research Scientific Computing Center (NERSC), this approach – dubbed “AIIO” (Artificial Intelligence for I/O) – demonstrated that real applications and even unseen applications (those not used in the training model) can use these diagnostic results to improve I/O performance on HPC systems.

A high-level overview of the AIIO approach to applying AI and its interpretation technologies to diagnose I/O performance bottlenecks for a job. (Credit: Bin Dong)


This work demonstrates how AI prediction-based performance functions, combined with new AI interpretation technologies, could be used to calculate the impact of various factors on I/O performance. It also lays the foundation for using AI to automatically identify and address I/O performance issues across multiple scientific applications and their communication, memory access, and computing capabilities.

Science Background

Efficient I/O management is critical for minimizing data transfer times and optimizing overall system performance for large-scale scientific simulations and data-intensive applications. But the complex parallel I/O software and hardware stack of HPC platforms makes it difficult for end users to achieve optimal I/O performance and to understand the root causes of the I/O bottlenecks they encounter along the way. It is therefore important for users to be able to quickly identify the causes of I/O performance bottlenecks in HPC applications, because this information can significantly reduce I/O costs and shorten runtimes.

Manually diagnosing I/O bottlenecks has long been the norm, but this approach is tedious and error-prone and requires domain scientists to have deep knowledge of complex HPC storage systems. While some automated diagnostic methods do exist, they too have limitations; in particular, the analysis is confined to the platform or group level rather than the job (application) level, so the diagnostic results cannot be applied to an individual job.

These challenges prompted data management researchers in Berkeley Lab's Scientific Data Division to spend the last decade-plus investigating a variety of approaches to better understand I/O performance bottlenecks and address these bottlenecks automatically. Initially, the team tried different methods – including classical statistical methods, analytical models, data mining approaches, and a relatively new visualization tool (Drishti) – to obtain multiple I/O performance logs and use them to identify the root causes of poor performance. But ultimately they realized that AI tools might help identify the parameters that most affect I/O performance, and that using these technologies would enable them to focus on analyzing a single application’s I/O logs rather than multiple application logs. This approach is at the core of AIIO.

Science Breakdown

The Berkeley Lab team is not the first to apply AI techniques to I/O performance analysis, but – to the best of their knowledge – AIIO is the first to use AI and its interpretation technologies to automatically diagnose I/O performance bottlenecks at the job level.

Through their research, they identified key factors that affect performance and diagnostic issues in this process, leading to the incorporation of both an AI prediction-based performance function and an AI interpretation-based diagnosis function in the AIIO software:

To build the performance function for a single job, AIIO uses multiple AI models depending on the job and domain; these currently include MLP (a neural network), XGBoost (a gradient boosting method used to build machine learning models), LightGBM (a machine learning model that uses a gradient boosting decision tree), CatBoost (which also uses gradient boosting), and TabNet (a deep neural network).
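The idea of keeping several candidate models and choosing among them per job can be sketched as follows. This is a minimal illustration under stated assumptions, not AIIO's implementation: the two toy predictors stand in for trained models, the counter names (`mean_write_kib`, `seeks_per_k_ops`) are hypothetical, and selecting by smallest absolute error against the job's measured bandwidth is an assumed rule.

```python
from typing import Callable, Dict, Tuple

# One job's I/O counters. Field names are hypothetical, not Darshan's schema.
Job = Dict[str, float]
Model = Callable[[Job], float]


def toy_model_a(job: Job) -> float:
    """Toy bandwidth predictor (MB/s); stands in for a trained model."""
    return 50.0 + 0.5 * job["mean_write_kib"] - 2.0 * job["seeks_per_k_ops"]


def toy_model_b(job: Job) -> float:
    """Second toy predictor with different coefficients."""
    return 80.0 + 0.2 * job["mean_write_kib"] - 1.0 * job["seeks_per_k_ops"]


def select_model(models: Dict[str, Model], job: Job, measured: float) -> Tuple[str, float]:
    """Return (name, prediction) of the model with the smallest absolute error."""
    name = min(models, key=lambda m: abs(models[m](job) - measured))
    return name, models[name](job)


candidates = {"toy_model_a": toy_model_a, "toy_model_b": toy_model_b}
job = {"mean_write_kib": 64.0, "seeks_per_k_ops": 10.0}

# Assumed selection rule: keep the model that best reproduces this job's
# measured bandwidth, then use it as the job's performance function.
best_name, best_prediction = select_model(candidates, job, measured=60.0)
print(best_name, best_prediction)
```

Whichever model fits the job best then serves as the performance function that the diagnosis step interprets.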

AIIO’s AI interpretation-based diagnosis function relies on methods such as SHapley Additive exPlanations (SHAP), a game-theory-based interpretation technique that unifies other interpretation methods such as LIME, PDP, and DeepLIFT.
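To make the game-theoretic idea concrete, the sketch below computes exact Shapley values for a toy performance function by enumerating every coalition of features. It is an illustration only: the predictor, the counter names, and the single "healthy" baseline job are assumptions, and the SHAP library itself averages over background data and uses faster approximations rather than brute-force enumeration.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict

Job = Dict[str, float]


def predict(x: Job) -> float:
    """Toy bandwidth model (MB/s) with one interaction term; purely illustrative."""
    return (300.0 + 0.5 * x["mean_write_kib"] - 3.0 * x["seeks_per_k_ops"]
            - 0.1 * x["seeks_per_k_ops"] * x["small_write_pct"])


def shapley_values(f: Callable[[Job], float], job: Job, baseline: Job) -> Dict[str, float]:
    """Exact Shapley values; features outside a coalition take baseline values."""
    names = list(job)
    n = len(names)

    def value(coalition: frozenset) -> float:
        return f({k: job[k] if k in coalition else baseline[k] for k in names})

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for r in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for subset in combinations(others, r):
                s = frozenset(subset)
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi


# Hypothetical counters: a well-behaved reference job and the job under diagnosis.
baseline = {"mean_write_kib": 1024.0, "seeks_per_k_ops": 0.0, "small_write_pct": 0.0}
job = {"mean_write_kib": 4.0, "seeks_per_k_ops": 20.0, "small_write_pct": 90.0}

phi = shapley_values(predict, job, baseline)
bottleneck = min(phi, key=phi.get)  # most negative contribution to bandwidth
print(phi, bottleneck)
```

The contributions sum to the gap between the job's predicted bandwidth and the baseline's, so the most negative entry points at the factor that costs the job the most performance.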

The team evaluated AIIO using synthetic and real application workloads from diverse domains and 40 months of logs from the Darshan I/O log database on NERSC’s Cori system. Additionally, they used Jupyter and Spin on Cori to train their model and provide the AIIO service, respectively. Finally, they tested it on six different I/O patterns on three DOE applications currently in use: E2E, OpenPMD, and DASSA. Guided by AIIO’s diagnoses, I/O performance on these applications improved by 1.8x on E2E, 2.1x on OpenPMD, and 146x on DASSA.

The researchers are now investigating how AIIO could enable runtime systems – not just humans – to identify what is going on in an application’s I/O performance environment at any given time.

Research Lead

Bin Dong, Scientific Data Division, Berkeley Lab

Co-authors

Jean Luca Bez, Scientific Data Division, Berkeley Lab

Suren Byna, The Ohio State University; Scientific Data Division, Berkeley Lab

Publications

AIIO: Using Artificial Intelligence for Job-Level and Automatic I/O Performance Bottleneck Diagnosis. HPDC '23: Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, August 2023, 155–167.

Funding

Exascale Computing Project (ExaIO sub-project), ASCR

User Facilities

NERSC


About NERSC and Berkeley Lab
The National Energy Research Scientific Computing Center (NERSC) is a U.S. Department of Energy Office of Science User Facility that serves as the primary high performance computing center for scientific research sponsored by the Office of Science. Located at Lawrence Berkeley National Laboratory, NERSC serves almost 10,000 scientists at national laboratories and universities researching a wide range of problems in climate, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines. Berkeley Lab is a DOE national laboratory located in Berkeley, California. It conducts unclassified scientific research and is managed by the University of California for the U.S. Department of Energy.