Edison Queues and Policies
Users submit jobs to a QOS (Quality of Service) and wait in line until nodes become available to run a job. NERSC's queue structures are intended to be fair and to allow jobs of various sizes to run efficiently. Note that the intended use of each system differs. Edison's purpose is to run large jobs, so the queue policy significantly favors large jobs using more than 682 nodes. If your workload requires smaller jobs (using fewer than 682 nodes), we encourage you to run on Cori Phase I, which is intended for smaller and/or data-intensive jobs.
The following is the current queue structure on Edison. Because Edison has only recently migrated to Slurm as its workload manager, the queue configuration may need to be adjusted as we gain more insight into how Slurm works for NERSC workloads. Please send questions, feedback, or concerns about the queue structures to the consultants.
Note that Edison no longer supports the low QOS as of March 1, 2017. The large job discount was also reduced to 20% (from 40%) on the same day, and will be removed completely from Edison on August 1, 2017.
The Edison queue configuration was simplified at the start of AY 2018.
| QOS 1) | Nodes | Physical Cores | Max Wallclock | Run Limit | Submit Limit | Relative Priority (lower is higher priority) | NERSC Hours Charged per Node per Hour 2) |
| shared 4) | 1 | 1-12 | 48 hrs | - | 10,000 | 3 | 2 x (no. of cores used) |
| realtime 5) | custom | custom | custom | custom | custom | 1 (special permission) | -- |
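As a rough illustration of the shared QOS charging rule in the table, the sketch below computes a charge as 2 x (cores used) x (wallclock hours). The function name `shared_charge` is hypothetical and not a NERSC tool, and whole hours are assumed because shell arithmetic is integer-only.

```shell
#!/bin/bash
# Hypothetical helper (not a NERSC tool): estimate the NERSC hours charged
# to a shared-QOS job, using the table's rate of 2 x (no. of cores used)
# per hour. Assumes whole hours, since shell arithmetic is integer-only.
shared_charge() {
    local cores="$1" hours="$2"
    # The shared QOS allows 1-12 physical cores on a single node.
    if [ "$cores" -lt 1 ] || [ "$cores" -gt 12 ]; then
        echo "cores must be between 1 and 12" >&2
        return 1
    fi
    echo $(( 2 * cores * hours ))
}

# Example: 6 cores for 3 hours
shared_charge 6 3   # prints 36
```

Under this rule, a job using all 12 cores for the full 48-hour limit would be charged 2 x 12 x 48 = 1,152 NERSC hours.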
3) Up to 5,586 nodes may be available, depending on the state of the system. The closer the request is to this number, the higher the probability is that the job will take a (possibly very) long time to start.
The "realtime" queue request form can be found here. The "realtime" QOS is available only to groups with special approval.
a zero or low individual MPP balance (e.g., if running the job would make the repo's MPP balance negative). Jobs in the scavenger QOS have the lowest priority and will wait in the queue longer. Note that scavenger jobs must be submitted with a minimum time (#SBATCH --time-min=2:00:00), so that they can run with a time limit anywhere between the minimum time and the specified time limit. See here for a sample job script for submitting scavenger QOS jobs. Scavenger jobs are intended for applications that can checkpoint and restart.
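A minimal sketch of a scavenger submission follows. It is not the official NERSC sample script. The node count, the 6-hour time limit, and the `./my_app` executable are placeholders chosen for illustration; only the `--time-min=2:00:00` value comes from the text above.

```shell
#!/bin/bash
# Write a sample scavenger-QOS batch script to scavenger_job.sh.
# Node count, time limit, and ./my_app are placeholders, not NERSC defaults.
cat > scavenger_job.sh <<'EOF'
#!/bin/bash
#SBATCH --qos=scavenger
#SBATCH --nodes=2
#SBATCH --time=06:00:00
#SBATCH --time-min=2:00:00

# The scheduler may grant any limit between --time-min and --time, so the
# application should checkpoint regularly and be able to restart.
srun ./my_app
EOF
```

The generated file would then be submitted with `sbatch scavenger_job.sh`.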
Notes about queue policies
- The debug QOS is to be used for code development, testing, and debugging. Production runs are not permitted in the debug QOS. User accounts are subject to suspension if they are determined to be using the debug QOS for production computing. In particular, job "chaining" in the debug QOS is not allowed. Chaining is defined as using a batch script to submit another batch script.
- The intent of the premium QOS is to allow for faster turnaround before conferences and urgent project deadlines. It should be used with care, since it is charged at twice the rate of the normal QOS.
- The intent of the scavenger QOS is to allow users with a zero or negative balance in one of their repositories to continue to run on Edison. The scavenger QOS is not available for jobs submitted against a repository with a positive balance. The charging rate for this QOS is 0 and it has the lowest priority on all systems.
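The relative charging described in these notes can be sketched as a simple multiplier on a base rate. The function name `qos_charge`, the label `normal`, and the `base_rate` argument are hypothetical. The base rate (NERSC hours per node per hour for the normal QOS) is not stated in this section, so the caller must supply it.

```shell
#!/bin/bash
# Hypothetical helper (not a NERSC tool): scale a base charge by QOS.
# Per the notes above, premium costs twice the normal QOS and scavenger
# is charged at 0. base_rate is supplied by the caller because its value
# is not given in this section. Whole hours are assumed.
qos_charge() {
    local qos="$1" nodes="$2" hours="$3" base_rate="$4"
    local factor
    case "$qos" in
        normal)    factor=1 ;;
        premium)   factor=2 ;;
        scavenger) factor=0 ;;
        *)
            echo "unsupported QOS: $qos" >&2
            return 1
            ;;
    esac
    echo $(( factor * nodes * hours * base_rate ))
}

# Example with a placeholder base rate of 10:
qos_charge normal  100 2 10   # prints 2000
qos_charge premium 100 2 10   # prints 4000
```

The shared QOS is deliberately left out here, because it is charged per core rather than per node.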
Tips for getting your job through the queue faster
- Submit shorter jobs. If your application has the capability to checkpoint and restart, consider submitting your job for shorter time periods. On a system as large as Edison there are many opportunities for backfilling jobs. Backfill is a technique the scheduler uses to keep the system busy. If there is a large job at the top of the queue, the system must drain resources in order to schedule that job. While those resources drain, short jobs can run on the idle nodes. Jobs that request short walltimes are good candidates for backfill.
- Make sure the wall clock time you request is accurate. As noted above, shorter jobs are easier to schedule. Many users unnecessarily enter the largest wall clock time possible as a default.
- Run jobs before a scheduled system maintenance. The system must drain all jobs before maintenance begins, so shorter jobs have a good chance of quick turnaround.
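One way to act on the advice about accurate wall clock requests is to derive the request from a measured runtime plus a modest margin. The sketch below is illustrative only: the function name `walltime_request` is hypothetical, and the 20% default margin is an assumption, not NERSC guidance.

```shell
#!/bin/bash
# Hypothetical helper: turn a measured runtime (in seconds) into a
# --time request with a safety margin, formatted as HH:MM:SS.
# The 20% default margin is an illustrative assumption, not NERSC guidance.
walltime_request() {
    local elapsed_s="$1" margin_pct="${2:-20}"
    # Add the margin, then round up to the next whole minute.
    local padded_s=$(( elapsed_s * (100 + margin_pct) / 100 ))
    local minutes=$(( (padded_s + 59) / 60 ))
    printf '%02d:%02d:00\n' $(( minutes / 60 )) $(( minutes % 60 ))
}

# Example: a run that took 1 h 30 min (5400 s)
walltime_request 5400   # prints 01:48:00
```

The result can be passed to Slurm directly, for example `#SBATCH --time=01:48:00`.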
Reserving a Dedicated Time Slot for Running Jobs
You can request dedicated access to a pool of nodes, up to the size of the entire machine, on Edison by filling out the
Please submit your request at least 72 hours in advance. Your account will be charged for all the nodes dedicated to your reservation for the entire duration of the reservation.