NERSC: Powering Scientific Discovery Since 1974

TOKIO: Total Knowledge of I/O

The Total Knowledge of I/O (TOKIO) project is developing algorithms and a software framework to analyze I/O performance and workload data from production HPC resources at multiple system levels. This holistic I/O characterization framework gives application scientists, facility operators, and computer science researchers a clearer view of system behavior and of the causes of deleterious behavior. TOKIO is a collaboration between Lawrence Berkeley National Laboratory and Argonne National Laboratory and is funded by the DOE Office of Science through the Office of Advanced Scientific Computing Research. Its reference implementation is available for download and open to contributions on GitHub.

TOKIO Architecture

The framework combines many component-level I/O characterization utilities to continuously monitor I/O at several levels, including application profiling with Darshan and back-end storage server monitoring with file system-specific tools.

Figure: TOKIO's component-level monitoring and scalable collection framework

Data from these component-level monitoring tools is retained on disk in its native format. TOKIO normalizes and indexes the data across the different tools' outputs, so that users need little expert understanding of how each tool expresses its view of the I/O subsystem components. The complete TOKIO architecture is described in a paper presented at the 2018 Cray User Group meeting.
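The idea of normalizing and indexing heterogeneous monitoring output can be illustrated with a minimal sketch. It does not use TOKIO or pytokio; the record layouts, field names, and the helpers `normalize_app`, `normalize_server`, and `build_index` are all hypothetical and exist only to show how two tools that describe the same file system differently might be mapped onto one schema and looked up by day and file system.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical native records. Neither layout is a real Darshan or
# file system monitoring format; both are invented for illustration.
app_records = [
    {"job_end": "2018-03-01T12:00:00+00:00", "fs": "scratch3",
     "bytes_written": 4 * 2**30},
]
server_records = [
    {"epoch": 1519905600, "target": "scratch3",
     "write_bytes": 6 * 2**30, "read_bytes": 1 * 2**30},
]


def normalize_app(rec):
    """Yield common-schema rows from one application-level record."""
    when = datetime.fromisoformat(rec["job_end"]).astimezone(timezone.utc)
    yield {"time": when, "file_system": rec["fs"], "source": "application",
           "metric": "bytes_written", "value": rec["bytes_written"]}


def normalize_server(rec):
    """Yield common-schema rows from one server-side record."""
    when = datetime.fromtimestamp(rec["epoch"], tz=timezone.utc)
    # Map this tool's field names onto the shared metric names.
    renames = {"write_bytes": "bytes_written", "read_bytes": "bytes_read"}
    for native, common in renames.items():
        yield {"time": when, "file_system": rec["target"], "source": "server",
               "metric": common, "value": rec[native]}


def build_index(rows):
    """Group rows by (UTC date, file system) so one lookup spans all tools."""
    index = defaultdict(list)
    for row in rows:
        key = (row["time"].date().isoformat(), row["file_system"])
        index[key].append(row)
    return index


rows = [row for rec in app_records for row in normalize_app(rec)]
rows += [row for rec in server_records for row in normalize_server(rec)]
index = build_index(rows)

# Everything known about scratch3 on the day of interest, from both tools.
for row in index[("2018-03-01", "scratch3")]:
    print(row["source"], row["metric"], row["value"])
```

The native records are never modified in this sketch; only the derived rows use the shared field names, which mirrors the description above of keeping data in its native format and normalizing on top of it.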

TOKIO also provides a simple, portable API that allows sophisticated visualization and analysis to be developed on top of these component-level monitoring tools. For example, TOKIO includes the tools necessary to create Unified Monitoring and Metrics Interfaces (UMAMI), which provide a simple visualization of how different components of the I/O subsystem behaved on a day of interest.
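As a rough illustration of the kind of comparison such a view supports, the sketch below places each metric's value on a day of interest against that metric's recent history and labels it as below, within, or above the typical range. This is not the UMAMI implementation and does not call pytokio; the metric names, the sample values, the interquartile-range rule, and the helpers `classify` and `summarize` are assumptions made for this example.

```python
from statistics import quantiles

# Hypothetical daily history per metric; all names and values are invented.
history = {
    "job_bandwidth_gibs": [10, 12, 11, 13, 12, 11, 12, 10],
    "server_load_pct": [30, 35, 32, 31, 34, 33, 36, 30],
    "fs_fullness_pct": [60, 61, 62, 63, 64, 65, 66, 67],
}

# Hypothetical measurements on the day of interest.
day_of_interest = {
    "job_bandwidth_gibs": 4,
    "server_load_pct": 90,
    "fs_fullness_pct": 64,
}


def classify(past_values, value):
    """Label a value relative to the interquartile range of past values."""
    q1, _median, q3 = quantiles(past_values, n=4)
    if value < q1:
        return "below"
    if value > q3:
        return "above"
    return "typical"


def summarize(past, today):
    """Return one label per metric for the day of interest."""
    return {name: classify(past[name], today[name]) for name in today}


summary = summarize(history, day_of_interest)
for name, label in summary.items():
    print(f"{name}: {day_of_interest[name]} ({label})")
```

Reading the labels side by side is the point of the exercise: a job whose bandwidth falls below its usual range on a day when server load is above its usual range suggests contention rather than a problem in the application itself.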

Figure: Unified Monitoring and Metrics Interface (UMAMI) of an anomalously performing HACC job on Edison's scratch3 file system

Similar analyses can be built quickly on pytokio, the Python implementation of the TOKIO framework, which is available on GitHub. To try pytokio, you can download all of the code and data needed to reproduce "A Year in the Life of a Parallel File System," a paper presented at SC'18.
