Queues and Scheduling Policies
Edison's queue configuration is subject to change throughout the testing period. We will increase the wall clock limits for the production queues (regular and ccm_queue) and add more queue classes as needed once the machine becomes more stable. Currently the queue structure is simple: users should submit jobs to the regular, debug, or ccm_queue queues (ccm_queue is for cluster compatibility mode jobs only). The machine is free of charge until further notice.
|Submit Queue||Execution Queue||Nodes||Physical||Max Wallclock||Relative Priority||Run Limit||Queued Limit||Queue Charge Factor|
Note: on Edison you can run the qstat -Qf command for a more detailed view of the queue configuration.
Notes about queue policies
- Do NOT submit scripts directly to an execution queue. Always use the submit queue name.
- If you have reached the run limit in an execution queue, then the queued limit becomes zero.
- There is a limit of 500 submitted jobs per execution queue per user.
- The debug queue is to be used for code development, testing, and debugging. Production runs are not permitted in the debug queue. User accounts are subject to suspension if they are determined to be using the debug queue for production computing. In particular, job "chaining" in the debug queue is not allowed. Chaining is defined as using a batch script to submit another batch script.
- 64 nodes are reserved for debugging jobs from 5am - 6pm Pacific Time.
Fairshare on Edison
Fairshare is a mechanism that allows historical resource utilization information to be incorporated into job feasibility and priority decisions. On Edison we are deploying the fairshare feature of the batch scheduler (Moab) to calculate job priorities, in addition to the credentials based on aging and other queue configurations. The Edison Phase 1 system is shared between the DARPA mission partners and NERSC. Currently, 25% of the machine hours are allocated to the DARPA mission partners, and the remaining 75% are shared among all NERSC users. Within the NERSC share, we have applied fairshare targets to each DOE office based on its allocation.
DARPA: 25%
NERSC: 75%, comprising:
- ASCR: 4%
- BER: 13%
- BES: 22%
- FES: 13%
- HEP: 10%
- NP: 8%
- Overhead (staff): 5%
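As a quick arithmetic check on the targets listed above, the office targets plus staff overhead add up to the 75% NERSC share. The minimal sketch below also converts each target into a fraction of the NERSC share alone. The names `OFFICE_SHARES` and `share_within_nersc` are illustrative and are not part of any NERSC or Moab tool.

```python
# Fairshare targets as percentages of the whole machine, taken from the list above.
DARPA_SHARE = 25
NERSC_SHARE = 75
OFFICE_SHARES = {
    "ASCR": 4,
    "BER": 13,
    "BES": 22,
    "FES": 13,
    "HEP": 10,
    "NP": 8,
    "Overhead (staff)": 5,
}


def share_within_nersc(office):
    """Return an office's target as a percentage of the NERSC share only."""
    return 100.0 * OFFICE_SHARES[office] / NERSC_SHARE


if __name__ == "__main__":
    # The office targets should account for the entire NERSC share.
    assert sum(OFFICE_SHARES.values()) == NERSC_SHARE
    for office in OFFICE_SHARES:
        print(f"{office}: {share_within_nersc(office):.1f}% of the NERSC share")
```

For example, the 22% BES target corresponds to a little under a third of the hours available to NERSC users.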
Fairshare is new to our major computing platforms. We are still experimenting with the fairshare parameters and will likely continue to do so through the pre-production stage (while the machine is free of charge). Your jobs may start with a lower priority in the queue if members of your fairshare group have used your share extensively (run many jobs) around the time you submit. If you think there is a problem with job priorities, please let us know (email consult at nersc dot gov).
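To make the effect of recent group usage more concrete, here is a simplified model of a decay-weighted fairshare calculation. It is a sketch under stated assumptions, not Edison's actual configuration: the window contents, the decay factor, and the function names are all hypothetical, and the scheduler's real priority formula includes other terms.

```python
def effective_usage(group_usage, total_usage, decay):
    """Decay-weighted fraction of delivered hours consumed by one group.

    group_usage and total_usage list hours per accounting window, most
    recent window first. Older windows are weighted by decay**age, so
    recent usage counts more. All values here are hypothetical.
    """
    weighted_group = sum(u * decay ** age for age, u in enumerate(group_usage))
    weighted_total = sum(t * decay ** age for age, t in enumerate(total_usage))
    if weighted_total == 0:
        return 0.0
    return weighted_group / weighted_total


def fairshare_delta(target, group_usage, total_usage, decay=0.5):
    """Target share minus effective usage.

    A negative value means the group is over its target, which in this
    simplified model would lower the priority of its newly submitted jobs.
    """
    return target - effective_usage(group_usage, total_usage, decay)


if __name__ == "__main__":
    # A group with a 22% target that recently used 40% of the machine.
    delta = fairshare_delta(0.22, [400, 200, 100], [1000, 1000, 1000])
    print(f"fairshare delta: {delta:+.2f}")
```

In this example the group's decay-weighted usage is 30% against a 22% target, so the delta is negative and its jobs would start lower in the queue until the usage ages out.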
Tips for getting your job through the queue faster
- Submit shorter jobs. If your application can checkpoint and restart, consider submitting your job for shorter time periods. On a system as large as Hopper there are many opportunities for backfilling jobs. Backfill is a technique the scheduler uses to keep the system busy. If there is a large job at the top of the queue, the system needs to drain resources in order to schedule that job. During that time, short jobs can run. Jobs that qualify for reg_short are good candidates for backfill.
- Make sure the wall clock time you request is accurate. As noted above, shorter jobs are easier to schedule. Many users unnecessarily enter the largest wall clock time possible as a default.
- Run jobs before a system maintenance. The system must drain all jobs before maintenance begins, so shorter jobs have a good chance of quick turnaround.
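The backfill idea described in the tips above can be sketched as a simple eligibility test: a waiting job can run in the gap if it fits on the currently idle nodes and its requested wall clock time ends before the large job (or the maintenance) is due to start. This is an illustrative simplification, not the scheduler's actual algorithm, and the function and parameter names are hypothetical.

```python
def can_backfill(job_nodes, job_walltime_min, idle_nodes, minutes_until_reservation):
    """Return True if a job fits in the idle gap before a reserved start time.

    job_walltime_min is the wall clock time requested at submission, which is
    why an accurate (not maximal) request makes a job easier to schedule.
    """
    fits_on_nodes = job_nodes <= idle_nodes
    finishes_in_time = job_walltime_min <= minutes_until_reservation
    return fits_on_nodes and finishes_in_time


if __name__ == "__main__":
    # Hypothetical situation: 100 nodes idle, large job starts in 90 minutes.
    print(can_backfill(64, 30, idle_nodes=100, minutes_until_reservation=90))
    print(can_backfill(64, 720, idle_nodes=100, minutes_until_reservation=90))
```

In this hypothetical case the 30-minute request qualifies for the gap, while the same job submitted with a 12-hour request does not, even if it would actually finish in 30 minutes.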
Compute node reservations
- You can request dedicated time on Edison for interactive debugging by filling out the Compute Reservation Request Form. Please submit your request at least 24 hours in advance.