NERSC: Powering Scientific Discovery Since 1974

Monitoring and Managing Jobs

Commonly Used Commands

| Action | How to do it | Comment |
|---|---|---|
| Get a summary of all batch jobs | sgeusers | Shows a tally of all jobs for all users including their states. This is a script that parses the output of qstat and is maintained by PDSF staff (located in /common/usg/bin). Do "sgeusers -h" for usage info. |
| Get a listing of your jobs and their states | qstat -u user_name | If you skip the -u option, you'll get all the jobs on PDSF. |
| Get detailed info about a specific job | qstat -j job_ID | You can get job_ID by listing your jobs as described above. |
| See how much cputime a job has used | qstat -j job_ID | Look in the next to the last line or grep the output on "usage". Note that in the memory usage GBs stands for Gigabyte-seconds. |
| Kill a specific job | qdel job_ID | If qdel doesn't work try qdel -f job_ID. |
| Kill all your jobs | qdel -u user_name | |
| Select a job to run first | qalter -js NN job_ID | NN is some positive number. You can control the relative priority of your jobs by adjusting their job shares. A larger job share results in a higher priority. |
| Clear jobs in Eqw state | qmod -cj job_ID | The Eqw state means the job started but there was some error. Check the error with "qstat -j job_ID". It will be listed near the end of the output. You must fix whatever caused the error before clearing the job or it will just go back into the Eqw state again. |
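
Two of the lookups above involve reading qstat output by eye, and they can be scripted. The following is a minimal sketch, not PDSF tooling: the function names job_usage and eqw_jobs are hypothetical, and the code assumes the usual qstat listing layout (two header lines, job state in the fifth column) and that the usage line in "qstat -j" output begins with the word "usage".

```shell
#!/bin/sh
# Hypothetical helpers (not part of UGE or PDSF). Both read qstat output
# on stdin, so they work on saved output as well as on a live pipe.

# Print the resource-usage line(s) from `qstat -j job_ID` output.
# Assumption: the usage line starts with the word "usage".
job_usage() {
    grep '^usage'
}

# Print the IDs of jobs in the Eqw state from `qstat -u user_name` output.
# Assumption: two header lines, job ID in column 1, state in column 5.
eqw_jobs() {
    awk 'NR > 2 && $5 == "Eqw" { print $1 }'
}
```

Typical use would be "qstat -j job_ID | job_usage" and "qstat -u user_name | eqw_jobs". The second only lists the job IDs; the cause of each error still has to be fixed before running qmod -cj on them.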

The qacct command can be used to access the UGE accounting information about your completed jobs. This information is saved to a file every night, so unless you use the -f option (see below) you will only get information about your jobs in the current accounting period.

| Action | How to do it | Comment |
|---|---|---|
| Check on your finished jobs | qacct -o user_name -j | If you don't specify the -o option you'll get a summary of all the jobs run by all users during the last accounting period. If you don't specify the -j option you'll just get a summary report. |
| Check on older jobs | qacct -o user_name -j -f accounting_file | To check on older jobs you need to specify an accounting file corresponding to the day your job finished. On PDSF the accounting files are kept at $SGE_ROOT/default/common. |
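
If you are not sure which accounting file covers the day your job finished, one option is to try the files from newest to oldest. This is only a sketch: the helper name accounting_files is hypothetical, and it assumes the saved files are named accounting.* inside the directory you pass in.

```shell
#!/bin/sh
# Hypothetical helper: list the accounting files in a directory,
# newest first (sorted by modification time).
# Assumption: the saved files are named accounting.*; adjust the
# pattern if the naming at your site differs.
accounting_files() {
    dir=$1
    ls -1t "$dir"/accounting.* 2>/dev/null
}

# Example use (needs qacct and $SGE_ROOT, so it is left commented out):
# for f in $(accounting_files "$SGE_ROOT/default/common"); do
#     qacct -o user_name -j -f "$f"
# done
```

The loop can print a lot of output, so in practice you would probably stop at the first file that contains the job you are looking for.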

You can also access information about your completed jobs by querying the PDSF Completed Jobs Database.

Where Did My Job Go?

If your job exceeds the requested memory, UGE will automatically and silently kill it. This can sometimes lead to jobs apparently vanishing from the queue. To help with diagnosing this, we have the "wheres_my_job" command, which will let you know if your job has been killed by the batch system for exceeding memory in the last week. You invoke it with

wheres_my_job <user_name>

The job information is populated by a cron job that runs hourly, so if you don't see your job ID, check again in an hour. You can also query the extended records about your job with the qacct command:

qacct -o <user_name> -j

If your job ran more than 1 day ago, you may also need to add the option "-f <name_of_file>", where the file name usually looks something like $SGE_ROOT/default/common/accounting.6_2u2_1.2015.03.02.03_10_02.
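
The qacct records are long, so it can help to condense them to a few fields per job. The sketch below is an illustration, not a PDSF tool: qacct_summary is a hypothetical name, and it assumes each record in "qacct -j" output is introduced by a line of "=" characters and contains the fields jobnumber, failed, exit_status and maxvmem.

```shell
#!/bin/sh
# Hypothetical helper: condense `qacct -j` output (read from stdin) to
# one line per job record.
# Assumptions: records are separated by a line consisting only of "="
# characters, and each record has the fields jobnumber, failed,
# exit_status and maxvmem as "name value" lines.
qacct_summary() {
    awk '
        function emit() {
            if (job != "")
                print job, "failed=" failed, "exit_status=" status, "maxvmem=" maxvmem
            job = ""; failed = ""; status = ""; maxvmem = ""
        }
        /^=+$/              { emit() }        # a new record starts
        $1 == "jobnumber"   { job = $2 }
        $1 == "failed"      { failed = $2 }
        $1 == "exit_status" { status = $2 }
        $1 == "maxvmem"     { maxvmem = $2 }
        END                 { emit() }        # print the last record
    '
}
```

For example, "qacct -o <user_name> -j | qacct_summary". Comparing maxvmem with the memory you requested may help confirm whether a job was near or over its limit.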

Why Don't My Jobs Run?

If your jobs are just sitting in the "qw" state and not starting, you cannot ask UGE directly why they aren't running (that service is turned off for scalability), so you have to do some detective work. Things to consider include:

- Is the cluster full? Sometimes you just need to be patient, especially if your project doesn't have any shares (jobs from projects without shares will only run if the cluster isn't full). Jobs on PDSF are subject to a 24-hour time limit, so in the worst case your job will start tomorrow.

- Check your job's resource requirements. It might be that you are incorrectly specifying some resource or requesting something that is not available. For example you might be specifying "-l eliza18io=1" and eliza18 might be down or the IO units are all used up. You can use "sgeusers -i" to check how many IO resources are in use. See the IO Resources page for more details.
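
To review what a waiting job actually requested, you can pull the resource list out of its "qstat -j" output. This is a minimal sketch: job_resources is a hypothetical helper name, and it assumes the requests appear on a line beginning with "hard resource_list:".

```shell
#!/bin/sh
# Hypothetical helper: print the hard resource requests from
# `qstat -j job_ID` output (read from stdin).
# Assumption: the requests are on a line that starts with
# "hard resource_list:" followed by a comma-separated list.
job_resources() {
    sed -n 's/^hard resource_list:[[:space:]]*//p'
}
```

For example, "qstat -j job_ID | job_resources" would print something like "eliza18io=1" if that is what the job asked for, which you can then compare against the output of "sgeusers -i".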

If you can't see a reason for your jobs to be sitting in the queue, feel free to ask us to look into it.