Metascheduling for High-Throughput Applications and Workflow Tools
Science/CS domains
HPC schedulers, high-throughput computing, Slurm, HTCondor, bioinformatics, high-energy physics, nuclear physics, workflow tools
Project description
In large HPC environments, schedulers often prioritize large block allocations, which helps HPC centers maximize system utilization. This can leave smaller jobs sitting in the backfill queue, running only when spare capacity happens to be available.
For projects made up of many smaller tasks, a workflow tool can instead allocate a large block of resources for the project and then use that block to run the many smaller tasks that make up the workflow. This is often accomplished with a metascheduler: a scheduler controlled by the project that maintains high throughput and efficient resource allocation within the currently allocated block of nodes.
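To make the pattern concrete, here is a minimal sketch of this kind of intra-allocation metascheduling in Python, assuming Slurm; the task list, script name, and concurrency limit are illustrative placeholders, not part of JAWS:

```python
#!/usr/bin/env python3
"""Minimal metascheduler sketch: pack many small tasks into one Slurm
block allocation. Assumes it runs inside an existing sbatch allocation;
the task list and script name are illustrative placeholders."""

import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical set of small, independent tasks to run within the block.
TASKS = [["./analyze_sample.sh", f"sample_{i}"] for i in range(200)]

# How many tasks to keep in flight at once; in practice this would be
# derived from the allocation size (e.g., SLURM_JOB_NUM_NODES).
MAX_CONCURRENT = 32

def run_task(cmd):
    # Launching as a job step (srun) places the task on free resources
    # inside the existing allocation instead of going back through the
    # system-wide queue.
    return subprocess.run(["srun", "-N1", "-n1", *cmd]).returncode

with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
    results = list(pool.map(run_task, TASKS))

print(f"{results.count(0)}/{len(results)} tasks succeeded")
```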
The problems come when we need to allocate these blocks of nodes:
- How large should our block of nodes be? (a naive sizing sketch follows this list)
- How long should we keep our block of nodes?
- What do we do when we are done with the work?
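As a rough starting point for the first question, here is a naive back-of-the-envelope sizing heuristic; every name and number below is hypothetical, and finding something better than this is part of the project:

```python
import math

def naive_block_size(n_tasks, task_minutes, tasks_per_node,
                     target_walltime_minutes, max_nodes):
    """Naive answer to 'how large should our block be?': request enough
    nodes to finish the queued tasks within a target walltime. All
    parameters are illustrative; real answers depend on queue policy,
    task runtime variance, and per-system limits."""
    node_minutes_needed = n_tasks * task_minutes / tasks_per_node
    nodes = math.ceil(node_minutes_needed / target_walltime_minutes)
    return min(max(nodes, 1), max_nodes)

# Example: 5000 ten-minute tasks, 8 per node, target finish in 4 hours.
print(naive_block_size(5000, 10, 8, 240, 512))  # -> 27
```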
The goal of the internship is to tackle these questions on a real-world scientific problem, using production bioinformatics workflows from the Joint Genome Institute’s JAWS workflow tool.
The project will focus on determining the fastest and most resource-efficient way to schedule and start jobs on large supercomputers.
Desired skills/background
- Python
- REST APIs
- Slurm/HTCondor or other HPC scheduling systems
- Knowledge of AI agents, or curiosity about agentic workflows
Apply to join this project
To apply or ask a question about this project, reach out to the project mentors listed below.
Project mentors
Nicholas Tyler
Computer Systems Engineer 3
National Energy Research Scientific Computing Center (NERSC)
Science Engagement & Workflows Dept.
Data & AI Services Group
Dani Cassol
Bioinformatics Computing Consultant
DOE Joint Genome Institute (JGI)