A Brief Primer on the SLURM Scheduler
October 17, 2016 by Rebecca Hartman-Baker
Edison is the only NERSC supercomputer currently available to users, so it is unusually oversubscribed. This has led to long wait times for many users, and a number of questions to NERSC consulting about the poor throughput many users are getting. Users often wonder whether the scheduler is working properly.
We can confirm that the scheduler is working well, and the machine is achieving nearly perfect utilization. But we thought that NERSC users would be interested in the inner workings of the SLURM scheduler!
How the Scheduler Works
Every five minutes or so, the scheduler performs two passes through the priority-ordered list of jobs. On the first pass, it creates a schedule from the top-priority jobs, reaching four days into the future. On the second pass, it performs backfill, filling the gaps left in the first-pass schedule with smaller jobs.
In the first pass, the scheduler looks at the priority-ordered list of jobs and tries to fit each one into the next 96 hours. Because NERSC has many thousands of jobs in the queue at any given time, in this pass the scheduler examines only the top several hundred jobs that are above a certain priority threshold.
The schedule created in the first pass inevitably has gaps in it. In the second pass, the scheduler looks through the list of all jobs and determines whether it can place any of them into these gaps without reordering or changing the schedule from the first pass.
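To make the two passes concrete, here is a minimal toy sketch in Python. It is not SLURM's actual implementation: the `Job` class, the `two_pass_schedule` function, the whole-hour time slots, and the `top_n` cutoff (a stand-in for the priority threshold) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

HORIZON_HOURS = 96  # the four-day planning window described above


@dataclass
class Job:
    name: str
    priority: int  # larger means higher priority (assumption of this sketch)
    nodes: int
    hours: int  # requested walltime, in whole hours


def earliest_start(free: List[int], job: Job) -> Optional[int]:
    """Return the first hour at which the job fits for its entire walltime."""
    for start in range(len(free) - job.hours + 1):
        if all(free[h] >= job.nodes for h in range(start, start + job.hours)):
            return start
    return None  # does not fit anywhere in the planning window


def reserve(free: List[int], job: Job, start: int) -> None:
    """Subtract the job's nodes from every hour it occupies."""
    for h in range(start, start + job.hours):
        free[h] -= job.nodes


def two_pass_schedule(jobs: List[Job], total_nodes: int, top_n: int) -> Dict[str, int]:
    """Map job name to planned start hour; jobs that do not fit are omitted."""
    free = [total_nodes] * HORIZON_HOURS
    starts: Dict[str, int] = {}
    ordered = sorted(jobs, key=lambda j: j.priority, reverse=True)

    # Pass 1: build the schedule from the top-priority jobs only.
    for job in ordered[:top_n]:
        start = earliest_start(free, job)
        if start is not None:
            reserve(free, job, start)
            starts[job.name] = start

    # Pass 2 (backfill): place any remaining job into leftover capacity.
    # Existing reservations are never moved, so pass 1 is left unchanged.
    for job in ordered:
        if job.name in starts:
            continue
        start = earliest_start(free, job)
        if start is not None:
            reserve(free, job, start)
            starts[job.name] = start

    return starts
```

On a hypothetical 10-node machine with `top_n=2`, a low-priority 2-node, 1-hour job slips in at hour 0 alongside an 8-node top-priority job, while the second top-priority job (6 nodes) has to wait until hour 4.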
In the first pass, the scheduler does a pretty good job of packing the schedule. Generally, any gaps are about an hour at most (sometimes two hours in particularly difficult scheduling situations). Interestingly, as long as you are requesting under 2000 nodes on Edison, it generally doesn't matter how many nodes you request; it's really the walltime that determines whether your job can be used as backfill.
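That rule of thumb can be written as a small check. This is only a sketch: the function name and its default values are hypothetical, taken from the rough figures quoted above (gaps of about an hour, fewer than 2000 nodes), not from any scheduler setting.

```python
def likely_backfill_candidate(
    nodes: int,
    walltime_hours: float,
    typical_gap_hours: float = 1.0,  # assumed: gaps are about an hour at most
    node_limit: int = 2000,          # assumed: the Edison figure quoted above
) -> bool:
    """Rough heuristic: below the node limit, only the walltime matters."""
    return nodes < node_limit and walltime_hours <= typical_gap_hours
```

For example, a 512-node job requesting 30 minutes passes the check, while the same job requesting 6 hours does not.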
Strategies for NERSC Users
There are two ways that a job can run: either by gaining enough priority to be scheduled, or by backfilling. Backfilling is generally quicker than waiting to gain priority, so if it's possible to reduce your walltime to an hour, we recommend this approach. Otherwise, a job will have to gain enough priority to be scheduled. In the current oversubscribed environment, even premium jobs can take many days to be scheduled.
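If your application can checkpoint and restart (an assumption that will not hold for every code), one way to get under the one-hour mark is to split a long run into several shorter jobs. The sketch below uses hypothetical helper names and ignores per-job startup overhead; it divides a total runtime into equal chunks of at most an hour and formats each as an hours:minutes:seconds string of the kind accepted by the `--time` option of `sbatch`.

```python
import math
from typing import List


def format_slurm_time(minutes: int) -> str:
    """Format whole minutes as an hours:minutes:seconds walltime string."""
    hours, mins = divmod(minutes, 60)
    return f"{hours:02d}:{mins:02d}:00"


def split_walltime(total_minutes: int, chunk_minutes: int = 60) -> List[str]:
    """Split a run into equal walltime requests no longer than chunk_minutes."""
    if total_minutes <= 0 or chunk_minutes <= 0:
        raise ValueError("times must be positive")
    n_chunks = math.ceil(total_minutes / chunk_minutes)
    # Round up so the chunks together cover the full runtime.
    per_chunk = math.ceil(total_minutes / n_chunks)
    return [format_slurm_time(per_chunk)] * n_chunks
```

For a run that needs 150 minutes in total, `split_walltime(150)` yields three requests of `00:50:00` each, every one of which is short enough to fit the typical gap.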
We recognize how difficult it is right now for users to get throughput at NERSC, so we are working diligently to return Cori to service at the earliest date possible. In the meantime, we are grateful for your patience and understanding!