Agentic AI for HPC Operations and Scientific AI Infrastructure

Science/CS domains

AI systems, agentic AI, HPC operations, software engineering, data systems

Project description

High-performance computing (HPC) environments are increasingly complex, and scientific AI workloads demand faster, more reliable operations support.

This internship project will design and prototype agentic AI capabilities that improve HPC operational efficiency at NERSC, including natural-language interfaces for operational data, intelligent assistance for troubleshooting and incident triage, and automated synthesis of system-health and performance insights.

This work will support the HPC ecosystem that underpins the DOE Genesis Mission AI-for-Science workloads.

Project tasks

The intern will integrate AI agents with relevant operational tools and data sources (e.g., scheduler data, incident/workflow systems, and internal knowledge channels), then evaluate the agents’ usefulness, correctness, and performance in realistic workflows.

Desired skills/background

Required

  • Python
  • Strong software engineering fundamentals
  • Familiarity with APIs and data integration
  • Interest in AI agents

Nice to have

  • LLM systems
  • Retrieval/tool-using agents
  • Observability/monitoring stacks
  • HPC systems experience

Project mentors

Steven Farrell

Group Lead (Acting)

National Energy Research Scientific Computing Center (NERSC)

Science Engagement & Workflows Dept.

Data & AI Services Group

Mah Kadidia Konate

Data Scientist

National Energy Research Scientific Computing Center (NERSC)

Science Engagement & Workflows Dept.

Data & AI Services Group
