TOKIO: Total Knowledge of I/O
The Total Knowledge of I/O (TOKIO) project is developing algorithms and a software framework to analyze I/O performance and workload data from production HPC resources at multiple system levels. This holistic I/O characterization framework gives application scientists, facility operators, and computer science researchers a clearer view of system behavior and of the causes of deleterious behavior. TOKIO is a collaboration between Lawrence Berkeley National Laboratory and Argonne National Laboratory, funded by the DOE Office of Science through the Office of Advanced Scientific Computing Research. Its reference implementation is available for download and open for contributions on GitHub.
The framework combines many component-level I/O characterization utilities to continuously monitor I/O at several levels, including application profiling with Darshan and back-end storage server monitoring with file system-specific tools.
Data from these component-level monitoring tools is retained on disk in its native format, and TOKIO normalizes and indexes the data across the different component-level monitoring outputs to minimize the need for expert understanding of how each tool expresses its view of the I/O subsystem components.
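The normalization and indexing step described above can be illustrated with a minimal sketch. It is not pytokio's API: the record layouts, metric names, and helper functions below are hypothetical, chosen only to show two differently shaped tool outputs being mapped onto one shared time index.

```python
from collections import defaultdict
from datetime import datetime, timezone

BIN_SECONDS = 60  # assumed width of one bin in the shared time index

# Hypothetical native-format records. Neither layout is the real output
# of Darshan or of any file system monitoring tool.
app_records = [
    {"end_time": "2017-03-01T00:00:42Z", "bytes_written": 4 * 2**30},
    {"end_time": "2017-03-01T00:01:10Z", "bytes_written": 1 * 2**30},
]
server_records = [
    (1488326400, "ost0", 3.0),  # (epoch seconds, server name, GiB written)
    (1488326460, "ost0", 1.5),
]


def to_bin(epoch_seconds):
    """Round a timestamp down to the start of its time bin."""
    return int(epoch_seconds) // BIN_SECONDS * BIN_SECONDS


def normalize_app(records):
    """Yield (bin, metric, value) from the application-side layout."""
    for rec in records:
        ts = datetime.strptime(rec["end_time"], "%Y-%m-%dT%H:%M:%SZ")
        ts = ts.replace(tzinfo=timezone.utc)
        yield to_bin(ts.timestamp()), "app_gib_written", rec["bytes_written"] / 2**30


def normalize_server(records):
    """Yield (bin, metric, value) from the server-side layout."""
    for epoch, _server, gib in records:
        yield to_bin(epoch), "server_gib_written", gib


def build_index(*streams):
    """Merge normalized streams into {time bin: {metric: summed value}}."""
    index = defaultdict(lambda: defaultdict(float))
    for stream in streams:
        for ts, metric, value in stream:
            index[ts][metric] += value
    return index


index = build_index(normalize_app(app_records), normalize_server(server_records))
for ts in sorted(index):
    print(ts, dict(index[ts]))
```

Once every source is reduced to the same (time bin, metric, value) form, a consumer can query any metric by time without knowing how the originating tool expressed it.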
TOKIO provides a simple API into these indexed views. Analysis modules use that API to present the correlated data to users and applications through standard query interfaces. For example, TOKIO includes the tools necessary to create Unified Monitoring and Metrics Interfaces (UMAMI), which provide a simple visualization of how different components of the I/O subsystem were behaving on a day of interest.
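As a rough illustration of a day-of-interest view, the sketch below places each metric's value for the day of interest relative to the quartiles of the preceding days. The metric names and values are invented for the example, and comparing against a historical distribution is an assumption of this sketch rather than a description of how the TOKIO tools build a UMAMI.

```python
import statistics

# Hypothetical daily values per metric; the last entry is the day of interest.
history = {
    "app_write_gibs": [10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 4.1],
    "server_cpu_load_pct": [20.0, 22.0, 19.0, 21.0, 23.0, 20.0, 55.0],
}


def summarize(values):
    """Place the final value relative to the quartiles of the earlier ones."""
    *past, today = values
    q1, median, q3 = statistics.quantiles(past, n=4)
    if today < q1:
        position = "below"
    elif today > q3:
        position = "above"
    else:
        position = "within"
    return {"today": today, "median": median, "position": position}


summary = {metric: summarize(values) for metric, values in history.items()}
for metric, row in summary.items():
    print(f"{metric}: today={row['today']} is {row['position']} the interquartile range")
```

Viewing several metrics side by side in this way shows whether a slow application run coincided with unusual behavior elsewhere in the I/O subsystem.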
Similar analyses can be quickly built upon the Python implementation of the TOKIO framework, pytokio, available on GitHub.
- Nicholas J. Wright (LBNL) - Lead Principal Investigator
- Philip Carns (ANL) - Institutional Principal Investigator
- Suren Byna (LBNL) - Co-investigator
- Rob Ross (ANL) - External collaborator
- Prabhat (LBNL) - External collaborator
- Glenn K. Lockwood (LBNL)
- Shane Snyder (ANL)
- Teng Wang (LBNL)
Publications and Presentations
- Philip Carns, Julian Kunkel, Glenn K. Lockwood, Ross Miller, Eugen Betke, Wolfgang Frings. "Analyzing Parallel I/O." Birds of a Feather session, International Conference for High Performance Computing, Networking, Storage and Analysis (SC17), Denver, USA. November 2017.
- Glenn K. Lockwood, Wucherl Yoo, Suren Byna, Nicholas J. Wright, Shane Snyder, Kevin Harms, Zachary Nault, Philip Carns. "UMAMI: a recipe for generating meaningful metrics through holistic I/O performance analysis." In Proceedings of the 2nd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISCS'17), 2017.
- Philip Carns. "Characterizing data-intensive scientific applications with Darshan." CS/NERSC Data Seminar, National Energy Research Scientific Computing Center. June 2017.
- Philip Carns. "Characterizing HPC I/O: from Applications to Systems." ZIH Colloquium at Technische Universität Dresden, Dresden. April 2017.
- Philip Carns. "TOKIO: Using Lightweight Holistic Characterization to Understand, Model, and Improve HPC I/O Performance." SIAM Conference on Computational Science and Engineering, Atlanta GA. March 2017.
- Shane Snyder. "Leveraging Holistic Characterization for Insights into HPC I/O Behavior." 2017 Understanding I/O Performance Behavior (UIOP) Workshop, DKRZ, Hamburg. March 2017.
- Shane Snyder, Philip Carns, Kevin Harms, Robert Ross, Glenn K. Lockwood, Nicholas J. Wright. "Modular HPC I/O Characterization with Darshan." In Proceedings of 5th Workshop on Extreme-scale Programming Tools (ESPT 2016), 2016.
- Cong Xu, Suren Byna, Vishwanath Venkatesan, Robert Sisneros, Omkar Kulkarni, Mohamad Chaarawi, and Kalyana Chadalavada. "LIOProf: Exposing Lustre File System Behavior for I/O Middleware." 2016 Cray User Group, London. May 2016.
- Glenn K. Lockwood, Nicholas J. Wright. "Understanding I/O performance on burst buffers through holistic I/O characterization." MCS Seminar, Argonne National Laboratory. May 2016.
- Glenn K. Lockwood. "Developing a holistic understanding of I/O workloads on future architectures." 2016 SIAM Conference on Parallel Processing for Scientific Computing, Paris. April 2016.
- Julian Kunkel, Philip Carns, Shane Snyder, Huong Luu, Matthieu Dorier, Wolfgang Frings, and Glenn K. Lockwood. "Analyzing Parallel I/O." Birds of a Feather session, International Conference for High Performance Computing, Networking, Storage and Analysis (SC15), Austin. November 2015.
- Wahid Bhimji, Debbie Bard, Melissa Romanus, et al. "Accelerating science with the NERSC burst buffer early user program." 2016 Cray User Group, London. May 2016.