NERSC: Powering Scientific Discovery Since 1974

Using SLURM PDSF Batch - sbatch

PDSF-specific SLURM commands in a nutshell

To run the SLURM equivalent of 'qsub jobscript.sh', do:

$ ssh -X pdsf.nersc.gov
$ sbatch -p shared-chos -t 24:00:00 jobscript.sh
>>>Submitted batch job 102992

Note 1: your default chos will be used to run jobscript.sh.
Note 2: do not run sbatch and qsub in the same x-terminal, because the SLURM module corrupts qsub.
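The jobscript passed to sbatch is an ordinary shell script; SLURM reads the #SBATCH comment lines as submission options. A minimal sketch (the partition and time limit mirror the sbatch example above; the job name is a made-up example):

```shell
#!/bin/bash
#SBATCH -p shared-chos      # partition, as in the sbatch example above
#SBATCH -t 24:00:00         # wall-clock limit
#SBATCH -J myjob            # job name, shown by sqs

# your payload goes here; this sketch just reports where it ran
echo "job ${SLURM_JOB_ID:-unset} running on $(hostname)"
```

Options given on the sbatch command line override the matching #SBATCH lines in the script.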

Start an interactive session on a SLURM worker node with:

salloc -p shared-chos -t 1:00:00
>>> salloc: Granted job allocation 93574

Check whether you are allowed to run jobs in PDSF SLURM:

sacctmgr show assoc where user=$USER
>>> Cluster Account User Share
>>> pdsf1 lz balewski 10

List all your queued and running jobs with sqs (no arguments). sqs can also list jobs for other users; see 'sqs --help'. E.g.

pdsf8 $ sqs -u dybspade
JOBID ST REASON USER NAME NODES USED REQUESTED SUBMIT PARTITION RANK_P RANK_BF
20105 R None dybspade rmq_pdsf_kup 1 19:51:39 24:00:00 2017-06-22T13:39:51 shared N/A N/A
20106 R None dybspade rmq_pdsf_kup 1 19:50:02 24:00:00 2017-06-22T13:41:28 shared N/A N/A

To see more information about a single job, you can use Rebecca's line:

$ sacct --format=job,user,submit,start,end,exitcode,nnodes,alloccpus,timelimit,cputime,state%20,maxvmsize,qos,maxrss -j 21115

JobID User Submit Start End ExitCode NNodes AllocCPUS Timelimit CPUTime State MaxVMSize QOS
------------ --------- ------------------- ------------------- ------------------- -------- -------- ---------- ---------- ---------- ---------- ---------- ----------
21115 kkrizka 2017-06-23T13:23:53 2017-06-23T13:23:53 2017-06-23T13:32:54 0:0 1 1 00:25:00 00:09:01 COMPLETED normal
21115.batch 2017-06-23T13:23:53 2017-06-23T13:23:53 2017-06-23T13:32:54 0:0 1 1 00:09:01 COMPLETED 130940K
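If you use that long format string often, a small shell function saves retyping it. A sketch (jobinfo is a made-up name, not a PDSF command):

```shell
# jobinfo: hypothetical wrapper around the sacct line above;
# takes one argument, the job id
jobinfo () {
    sacct --format=job,user,submit,start,end,exitcode,nnodes,alloccpus,timelimit,cputime,state%20,maxvmsize,qos,maxrss -j "$1"
}

# usage: jobinfo 21115
```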

Why is my job not starting?

pdsf8 $ scontrol show job 28547_300
JobId=28547 ArrayJobId=28547 ArrayTaskId=300 JobName=atlas-chos
Priority=1802 Nice=0 Account=atlas QOS=normal
JobState=PENDING Reason=Resources Dependency=(null)
Partition=shared-chos AllocNode:Sid=pdsf8:15532
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=3008,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=2 MinMemoryNode=3008M MinTmpDiskNode=0 
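In this example the scheduler is simply waiting for free resources (Reason=Resources). To see the reason for all of your pending jobs at once you can use squeue's reason field; a sketch wrapped in a function (whypending is a made-up name):

```shell
# whypending: hypothetical helper listing your pending jobs and why they wait
# %i=jobid  %P=partition  %T=state  %r=scheduler reason (Resources, Priority, ...)
whypending () {
    squeue -u "$USER" -t PENDING -o "%.14i %.12P %.8T %r"
}
```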

How can I get example code for experiment-specific SLURM jobs?

ssh pdsf
git clone https://bitbucket.org/balewski/tutorNersc
cd tutorNersc/2017-05-pdsf3.0
ls

Who else is running jobs on PDSF SLURM right now?

pdsf6 $ slusers
Current SLURM usage summed over all PDSF users, ver 1.2

Rjob Rcpu Rcpu*h PDjob PDcpu user:account:partition
126 126 177.7 0 0 balewski star shared
2 2 0.1 5657 5657 luxprod lux shared-cho

Rjob Rcpu Rcpu*h PDjob PDcpu account:partition
2 2 0.1 5657 5657 lux shared-cho
126 126 177.7 0 0 star shared

128 128 177.8 5657 5657 TOTAL

How can I kill a job array?

$ sqs
JOBID ST REASON USER
1525642_[0-31] PD Priority rsooraj
$ scancel 1525642_*
$ sqs
