Running Jobs on Cori FAQ
Below are some frequently asked questions and answers for running jobs on Cori.
Q: What are some possible causes for the "Job submit/allocate failed: Unspecified" error message at job submission time?
A: This error can occur if a user has no active repository on Cori. Please make sure your NERSC account has been renewed with an active allocation.
This error message can also occur if you are on Edison instead of Cori (users sometimes confuse the two) and submit a batch script with -C haswell or -C knl,..., which are not supported node features on Edison.
Q: What are some possible causes for the "Job submit/allocate failed: Invalid qos specification" error message at job submission time?
A: This error mostly happens when a user does not have access to a certain Slurm QOS. For example, when only NESAP users had access to the QOS allowing jobs in the regular partition to run for more than 2 hours on KNL nodes, non-NESAP users would see this error when submitting a job longer than 2 hours.
Q: How do I check how many free nodes are available in each partition?
A: Below is a sample Slurm "sinfo" command with selected output fields. Column 1 shows the partition name, Column 2 shows the status of the partition, Column 3 shows the maximum wall time limit for the partition, and Column 4 shows the number of nodes that are Allocated/Idle/Other/Total in the partition.
cori11% sinfo -o "%.10P %.10a %.15l %.20F"
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T)
debug* up 30:00 1864/70/2/1936
regular up 4-00:00:00 1775/31/2/1808
regularx up 2-00:00:00 1864/70/2/1936
special up 14-00:00:00 1885/115/4/2004
realtime up 12:00:00 1885/115/4/2004
shared up 2-00:00:00 17/41/2/60
knl up 1-12:00:00 9646/17/25/9688
knl_regula up 1-12:00:00 9646/17/25/9688
knl_reboot up 1-12:00:00 3385/0/15/3400
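If you need these counts in a script, the NODES(A/I/O/T) field can be parsed directly. Below is a minimal Python sketch (the helper name is illustrative, not a NERSC-provided tool) that extracts the idle-node count per partition from the "sinfo" output above:

```python
def idle_nodes(sinfo_output):
    """Parse `sinfo -o "%.10P %.10a %.15l %.20F"` output and return a
    dict mapping partition name -> number of idle nodes.  The fourth
    column is Allocated/Idle/Other/Total."""
    counts = {}
    for line in sinfo_output.strip().splitlines()[1:]:  # skip header row
        partition, _avail, _limit, aiot = line.split()
        _allocated, idle, _other, _total = map(int, aiot.split("/"))
        counts[partition.rstrip("*")] = idle  # "*" marks the default partition
    return counts

# Two rows taken from the sample output above.
sample = """\
 PARTITION      AVAIL       TIMELIMIT       NODES(A/I/O/T)
    debug*         up           30:00         1864/70/2/1936
   regular         up      4-00:00:00         1775/31/2/1808
"""
print(idle_nodes(sample))  # {'debug': 70, 'regular': 31}
```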
Q: How do I check how many KNL nodes are idle now in each cluster mode?
A: Below is a sample Slurm "sinfo" command with selected output fields. Column 1 shows the active compute node features (Haswell or the various KNL cluster modes), and Column 2 shows the number of nodes that are Allocated/Idle/Other/Total with each feature combination.
cori11% sinfo -o "%.20b %.20F"
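To total the idle nodes for a particular cluster mode from this output, you can filter on the feature column. Below is a minimal Python sketch; the feature strings and node counts in the sample are illustrative, not real Cori data:

```python
def idle_in_mode(sinfo_output, mode):
    """Given output of `sinfo -o "%.20b %.20F"` (active features and
    Allocated/Idle/Other/Total counts), sum the idle nodes whose
    feature list includes `mode`."""
    idle = 0
    for line in sinfo_output.strip().splitlines()[1:]:  # skip header row
        features, aiot = line.split()
        _a, i, _o, _t = map(int, aiot.split("/"))
        if mode in features.split(","):
            idle += i
    return idle

# Illustrative rows only; actual feature names and counts will differ.
sample = """\
     ACTIVE_FEATURES       NODES(A/I/O/T)
             haswell        1885/115/4/2004
       knl,flat,quad            120/8/0/128
      knl,cache,quad        9500/9/25/9534
"""
print(idle_in_mode(sample, "quad"))  # 17
```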
Q: What does the “CF” state mean in the "squeue" or "sqs" output? Why does my job seem to not be doing anything while in this state?
A: "CF" stands for "CONFIGURING". It is one of the job states reported by the "squeue" or "sqs" commands, and means the job has been allocated resources but is waiting for them to become ready for use (e.g., nodes are booting).
When you request a specific KNL cluster mode, such as -C knl,quad,flat, and there are not enough KNL nodes already in that mode in the pool of nodes allocated to your job, Slurm will reboot some nodes from other cluster modes. While the nodes are rebooting, the job stays in the "CF" state. Rebooting can take about 25 minutes or longer, and from your job's point of view nothing is happening, since the job has not actually started yet.
Note that time spent rebooting is exempted from the job's walltime limit, but the reboot time is charged to your account. For example, say you requested:
#SBATCH -N 4
#SBATCH -t 60
#SBATCH -C knl,quad,flat
srun -n 4 ./myexec
The nodes need rebooting, so you see something like:
JOBID ST REASON USER NAME NODES USED REQUESTED SUBMIT PARTITION RANK_P RANK_BF
9094931 CF None elvis myjob 4 12:19 1:00:00 2017-07-28T10:00:00 knl N/A
It appears that your job has used more than 12 minutes, yet it has not started. Say the nodes finish rebooting after 31:00 minutes; Slurm will then allow the job to run for 1 hour and 31 minutes in total before killing it with a TIMEOUT, so you still get the requested 60 minutes of walltime from when the nodes become ready. However, the job is charged for the entire reboot plus run time, i.e., 1:31:00.
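The walltime and charging arithmetic above can be sketched as follows (times in minutes; the 31-minute reboot matches the example):

```python
# Sketch of walltime vs. charging when KNL nodes must reboot.
reboot = 31        # minutes spent in the CF state while nodes reboot
walltime = 60      # requested walltime (#SBATCH -t 60)
nodes = 4          # requested node count (#SBATCH -N 4)

# The walltime clock starts only after the reboot finishes, so the job
# may occupy the nodes for up to reboot + walltime minutes in total...
max_occupancy = reboot + walltime            # 91 minutes, i.e. 1:31:00

# ...and the charge covers the entire occupancy, on every node.
charged_node_minutes = nodes * max_occupancy
print(max_occupancy, charged_node_minutes)   # 91 364
```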
Q: Could I run jobs using both Haswell and KNL compute nodes?
A: This capability is currently available only to certain NERSC and Cray staff for benchmarking. We will evaluate whether (and, if so, how) to make it available to general NERSC users.
Q: Is it possible to choose to wait longer in the queue in order to avoid KNL cluster mode reboot?
A: We upgraded Slurm from version 17.02 to 17.11 on Jan 9, 2018. Slurm 17.05 introduced a feature, which we have not yet configured, that allows users to specify a maximum time a job will wait for nodes already in the requested mode instead of rebooting nodes immediately. More details will be provided once this feature is available.
Q: How do I improve my I/O performance?
A: Consider using the Burst Buffer - this is a tier of SSDs that sits inside the Cori high-speed network, providing high-performance I/O. See this page for more details. If you are using the $SCRATCH file system, take a look at this page for I/O optimization tips.