NERSC: Powering Scientific Discovery Since 1974

Monitoring and Managing Jobs

| Action | How to do it | Comment |
|---|---|---|
| Get a summary of all batch jobs | sgeusers | Shows a tally of all jobs of all users, including their states. This script parses the output of qstat and is maintained by PDSF staff (located in /common/usg/bin). Run "sgeusers -h" for usage info. |
| Get a listing of your jobs and their states | qstat -u user_name | If you omit the -u option, you'll get all the jobs on PDSF. |
| | qstat_long -u user_name | Regular qstat truncates job names to 10 characters. If you need the full name, use qstat_long. |
| Get detailed info about a specific job | qstat -j job_ID | You can get the job_ID by listing your jobs as described above. |
| See how much CPU time a job has used | qstat -j job_ID | Look at the next-to-last line, or grep the output for "usage". Note that in the memory usage, "GBs" stands for gigabyte-seconds. |
| Kill a specific job | qdel job_ID | If qdel doesn't work, try qdel -f job_ID. |
| Kill all your jobs | qdel -u user_name | |
| Select a job to run first | qalter -js NN job_ID (NN is some positive number) | In SGE you control the relative priority of your jobs by adjusting their job shares. A larger job share results in a higher priority. |
| Use multiple job slots for your job | qalter -pe single NN job_ID (NN is some positive number) | Set NN to the number of job slots your job needs, to prevent overloading the node. For example, if you are running a multithreaded job, set NN to the number of threads. |
| Clear jobs in Eqw state | qmod -cj job_ID | The Eqw state means the job started but there was some error. Check the error with "qstat -j job_ID". It will be listed near the end of the output. Fix it if necessary before clearing the job, or it will just go back into the Eqw state again. |
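
The CPU-time and Eqw checks described above both come down to picking one line out of the "qstat -j job_ID" output, so they can be wrapped in small shell filters. The sketch below is illustrative only: the function names (usage_line, cpu_time, error_reason) are made up for this example, and it assumes the output contains a line starting with "usage" that carries a "cpu=" field, and, for jobs in Eqw, a line starting with "error reason". Check the actual output on your system before relying on it.

```shell
#!/bin/sh
# Hypothetical helper filters; each reads `qstat -j job_ID` output on stdin.

# Print the usage line(s). Assumes they start with the word "usage".
usage_line() {
    grep '^usage'
}

# Print only the value of the cpu= field from the usage line(s).
# Assumes fields are comma-separated, e.g. "cpu=00:01:23, mem=...".
cpu_time() {
    usage_line | sed -n 's/.*cpu=\([^,]*\).*/\1/p'
}

# Print the error line(s) for a job in the Eqw state.
# Assumes they start with "error reason".
error_reason() {
    grep '^error reason'
}

# Intended use on a node with SGE (12345 is a placeholder job ID):
#   qstat -j 12345 | cpu_time
#   qstat -j 12345 | error_reason
#   qmod -cj 12345    # only after the cause of the error has been fixed
```

Reading from standard input keeps the filters independent of the batch system, so they can be tried on a saved copy of the qstat output first.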