Monitoring and Managing Jobs
Commonly Used Commands
|Action||How to do it||Comment|
|Get a summary of all batch jobs||sgeusers||Shows a tally of all jobs for all users including their states. This is a script that parses the output of qstat and is maintained by PDSF staff (located in /common/usg/bin). Do "sgeusers -h" for usage info.|
|Get a listing of your jobs and their states||qstat -u user_name||If you skip the -u option, you'll get all the jobs on PDSF.|
|Get detailed info about a specific job||qstat -j job_ID||You can get job_ID by listing your jobs as described above.|
|See how much cputime a job has used||qstat -j job_ID||Look at the next-to-last line, or grep the output for "usage". Note that in the memory usage, "GBs" stands for gigabyte-seconds.|
|Kill a specific job||qdel job_ID||If qdel doesn't work, try qdel -f job_ID.|
|Kill all your jobs||qdel -u user_name|
|Select a job to run first||qalter -js NN job_ID (NN is some positive number)||You can control the relative priority of your jobs by adjusting their job shares. A larger job share results in a higher priority.|
|Clear jobs in Eqw state||qmod -cj job_ID||The Eqw state means the job started but there was some error. Check the error with "qstat -j job_ID". It will be listed near the end of the output. You must fix whatever caused the error before clearing the job or it will just go back into the Eqw state again.|
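The "grep the output on usage" tip from the table can be wrapped in a small helper. This is only a sketch: the function name job_usage and the job ID 12345 are placeholders, not part of the PDSF tooling, and the filter simply keeps every line containing the word "usage".

```shell
# Filter `qstat -j` output (read from stdin) down to its usage line(s).
# job_usage is a hypothetical helper name, not a PDSF or UGE command.
job_usage() {
    grep 'usage'
}

# On a node where UGE is available (12345 is a placeholder job ID):
#   qstat -j 12345 | job_usage
```

Reading from stdin keeps the helper independent of how the qstat output was produced, so it also works on output saved to a file.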
The qacct command can be used to access the UGE accounting information about your completed jobs. This information is saved in a file every night, so unless you use the -f option (see below), you will only get information about your jobs in the current accounting period.
|Action||How to do it||Comment|
|Check on your finished jobs||qacct -o user_name -j||If you don't specify the -o option, you'll get a summary of all the jobs run by all the users during the last accounting period. If you don't specify the -j option, you'll just get a summary report.|
|Check on older jobs||qacct -o user_name -j -f accounting_file||To check on older jobs you need to specify an accounting file corresponding to the day your job finished. On PDSF the accounting files are kept at $SGE_ROOT/default/common.|
You can also access information about your completed jobs by querying the PDSF Completed Jobs Database.
Where Did My Job Go?
If your job exceeds the requested memory, UGE will automatically and silently kill it. This can sometimes lead to jobs apparently vanishing from the queue. To help with diagnosing this, we have the "wheres_my_job" command, which will let you know if your job has been killed by the batch system for exceeding memory in the last week. You invoke it with
Information about jobs is populated by a cron job that runs hourly, so if you don't see your job ID, check again in an hour. You can also query the extended records about your job with the qacct command:
qacct -o <user_name> -j
If your job ran more than one day ago, you may also need to add "-f <name_of_file>", where the file name usually looks something like $SGE_ROOT/default/common/accounting.6_2u2_1.2015.03.02.03_10_02.
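Finding the right accounting file for an older job can be scripted. The sketch below assumes that all accounting files follow the same naming pattern as the single example above (accounting.<version>.<YYYY.MM.DD>.<HH_MM_SS>); that pattern is inferred from one file name and may not hold for every file. The function name accounting_files_for_date is hypothetical.

```shell
# List accounting files whose names contain the given date (YYYY.MM.DD).
# Assumption: files are named accounting.<version>.<YYYY.MM.DD>.<HH_MM_SS>,
# as in the one example file name shown in the text.
accounting_files_for_date() {
    for f in "${SGE_ROOT}/default/common"/accounting.*."$1".*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# Possible use on a node where UGE is available:
#   for f in $(accounting_files_for_date 2015.03.02); do
#       qacct -o "$USER" -j -f "$f"
#   done
```

If the function prints nothing, no file matched that date, and it is worth listing the directory by hand to check the actual file names.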
Why Don't My Jobs Run?
If your jobs are just sitting in the "qw" state and not starting, you cannot ask UGE directly why they aren't running (that service is turned off for scalability), so you have to do some detective work. Things to consider include:
- Is the cluster full? Sometimes you just need to be patient, especially if your project doesn't have any shares (jobs from projects without shares will only run if the cluster isn't full). Jobs on PDSF are subject to a 24-hour time limit, so in the worst case your job will start tomorrow.
- Check your job's resource requirements. It might be that you are incorrectly specifying some resource or requesting something that is not available. For example you might be specifying "-l eliza18io=1" and eliza18 might be down or the IO units are all used up. You can use "sgeusers -i" to check how many IO resources are in use. See the IO Resources page for more details.
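As a starting point for this detective work, it can help to pull out just the IDs of your waiting jobs so you can inspect each one with "qstat -j". The sketch below assumes the default qstat listing layout, in which the job ID is the first column and the state is the fifth; if your site customizes the qstat output, the column numbers will need adjusting. The function name waiting_job_ids is hypothetical.

```shell
# Print the IDs of jobs in the "qw" state from a qstat listing on stdin.
# Assumption: default qstat layout (column 1 = job ID, column 5 = state).
waiting_job_ids() {
    awk '$5 == "qw" { print $1 }'
}

# Possible use on a node where UGE is available:
#   for id in $(qstat -u "$USER" | waiting_job_ids); do
#       qstat -j "$id"    # check the resource requests of each waiting job
#   done
```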
If you can't see a reason for your jobs to be sitting in the queue, feel free to ask us to look into it.