
Using SLURM PDSF Batch - sbatch

PDSF-specific SLURM commands in a nutshell

Update: this web page is no longer maintained. NERSC documentation has moved to MkDocs and the latest PDSF instructions are now HERE

To execute the SLURM equivalent of 'qsub jobscript.sh', do

$ ssh -X pdsf.nersc.gov
$ sbatch -p shared-chos -t 24:00:00 jobscript.sh
>>> Submitted batch job 102992

Note 1: your default chos will be used to run jobscript.sh
Note 2: do not run sbatch and qsub in the same X terminal, because the SLURM module corrupts qsub
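For reference, here is a minimal sketch of what jobscript.sh could contain; the job name, log file, and payload script are hypothetical placeholders, and the partition and time limit can instead be given on the sbatch command line as above:

#!/bin/bash
#SBATCH -p shared-chos        # partition
#SBATCH -t 24:00:00           # wall-clock time limit
#SBATCH -J myjob              # job name (hypothetical)
#SBATCH -o myjob-%j.out       # stdout log; %j expands to the job ID

# everything below runs on the worker node, in your default chos
echo "running on $(hostname) as job $SLURM_JOB_ID"
./my_analysis.sh              # hypothetical payload script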

Start an interactive session on a SLURM worker node with

salloc -p shared-chos -t 1:00:00
>>> salloc: Granted job allocation 93574
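Depending on the configuration, the interactive shell may still sit on the login node; srun always executes on the allocated worker node. A quick check:

$ srun hostname        # prints the allocated worker node
$ exit                 # release the allocation when done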

Check whether you are allowed to run jobs in PDSF SLURM

sacctmgr show assoc where user=$USER
>>> Cluster Account User Share
>>> pdsf1 lz balewski 10

List all your queued and running jobs with sqs, no arguments. sqs can also list jobs for other users; see 'sqs --help', e.g.

pdsf8 $ sqs -u dybspade
JOBID ST REASON USER NAME NODES USED REQUESTED SUBMIT PARTITION RANK_P RANK_BF
20105 R None dybspade rmq_pdsf_kup 1 19:51:39 24:00:00 2017-06-22T13:39:51 shared N/A N/A
20106 R None dybspade rmq_pdsf_kup 1 19:50:02 24:00:00 2017-06-22T13:41:28 shared N/A N/A

To get more information about one job you can use Rebecca's line:

$ sacct --format=job,user,submit,start,end,exitcode,nnodes,alloccpus,timelimit,cputime,state%20,maxvmsize,qos,maxrss -j 21115

JobID        User      Submit              Start               End                 ExitCode NNodes AllocCPUS Timelimit CPUTime  State     MaxVMSize QOS
------------ --------- ------------------- ------------------- ------------------- -------- ------ --------- --------- -------- --------- --------- ------
21115        kkrizka   2017-06-23T13:23:53 2017-06-23T13:23:53 2017-06-23T13:32:54 0:0      1      1         00:25:00  00:09:01 COMPLETED           normal
21115.batch            2017-06-23T13:23:53 2017-06-23T13:23:53 2017-06-23T13:32:54 0:0      1      1                   00:09:01 COMPLETED 130940K
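If you use this line often, you can wrap it in a shell function; 'jobinfo' is a hypothetical name, a minimal sketch for your ~/.bashrc:

jobinfo() {
  # print the accounting summary for one job ID, as in Rebecca's line above
  sacct --format=job,user,submit,start,end,exitcode,nnodes,alloccpus,timelimit,cputime,state%20,maxvmsize,qos,maxrss -j "$1"
}

Then 'jobinfo 21115' reproduces the table above.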

Why is my job not starting?

pdsf8 $ scontrol show job 28547_300
JobId=28547 ArrayJobId=28547 ArrayTaskId=300 JobName=atlas-chos
Priority=1802 Nice=0 Account=atlas QOS=normal
JobState=PENDING Reason=Resources Dependency=(null)
Partition=shared-chos AllocNode:Sid=pdsf8:15532
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=3008,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=2 MinMemoryNode=3008M MinTmpDiskNode=0 
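The Reason field explains the wait: Resources means the job is waiting for free CPUs/memory, Priority means other jobs are ahead of it in the queue. For SLURM's current estimate of the start time you can try squeue's --start option (output varies by SLURM version):

pdsf8 $ squeue --start -j 28547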

How can I get example code for experiment-specific SLURM jobs?

ssh pdsf
git clone https://bitbucket.org/balewski/tutorNersc
cd tutorNersc/2017-05-pdsf3.0
ls

Who else is running jobs on PDSF SLURM now?

pdsf6 $ slusers
Current SLURM usage summed over all PDSF users, ver 1.2

Rjob  Rcpu  Rcpu*h  PDjob  PDcpu  user:account:partition
 126   126   177.7      0      0  balewski star shared
   2     2     0.1   5657   5657  luxprod lux shared-cho

Rjob  Rcpu  Rcpu*h  PDjob  PDcpu  account:partition
   2     2     0.1   5657   5657  lux shared-cho
 126   126   177.7      0      0  star shared

 128   128   177.8   5657   5657  TOTAL
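slusers is a PDSF-specific wrapper; if it is unavailable you can get a rough per-user summary with plain squeue (a sketch, not the exact slusers algorithm):

pdsf6 $ squeue -h -o "%u %a %P %T" | sort | uniq -c

This counts jobs per user/account/partition/state rather than summing CPUs, so the numbers will differ from slusers.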

How can I kill a job array?

$ sqs
JOBID ST REASON USER
1525642_[0-31] PD Priority rsooraj
$ scancel 1525642_*
# OR
$ scancel 1525642
$ sqs
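To cancel only some tasks of the array, you can give an explicit task ID or a bracketed range to scancel (adjust the IDs to your own job):

$ scancel 1525642_7            # cancel a single array task
$ scancel 1525642_[10-20]      # cancel a range of array tasks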
