Monitoring and Managing Jobs
| Action | How to do it | Comment |
|---|---|---|
| Get a summary of all batch jobs | sgeusers | Shows a tally of all jobs of all users including their states. This is a script that parses the output of qstat and is maintained by PDSF staff (located in /common/usg/bin). Do "sgeusers -h" for usage info. |
| Get a listing of your jobs and their states | qstat -u user_name | If you skip the -u option, you'll get all the jobs on PDSF. |
| | qstat_long -u user_name | Regular qstat truncates job names to 10 characters. If you need the full name, use qstat_long. |
| Get detailed info about a specific job | qstat -j job_ID | You can get job_ID by listing your jobs as described above. |
| See how much cputime a job has used | qstat -j job_ID | Look at the next-to-last line or grep the output for "usage". Note that in the memory usage figure, GBs stands for gigabyte-seconds. |
| Kill a specific job | qdel job_ID | If qdel doesn't work, try qdel -f job_ID. |
| Kill all your jobs | qdel -u user_name | |
| Select a job to run first | qalter -js NN job_ID | NN is some positive number. In SGE you control the relative priority of your jobs by adjusting their job shares. A larger job share results in a higher priority. |
| Use multiple job slots for your job | qalter -pe single NN job_ID | NN is some positive number. Set NN to the number of job slots your job needs, to prevent overloading the node. For example, if you are running a multithreaded job, set NN to the number of threads. |
| Clear jobs in Eqw state | qmod -cj job_ID | The Eqw state means the job started but there was some error. Check the error with "qstat -j job_ID". It will be listed near the end of the output. Fix it if necessary before clearing the job or it will just go back into the Eqw state again. |
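
Several of the actions above come down to picking one field out of qstat output. The following is a minimal sketch of two helper functions for that, not a PDSF-provided tool. The function names `eqw_job_ids` and `job_cputime` are hypothetical, and the sketch assumes the usual SGE layout: two header lines in the job listing, with the job ID in column 1 and the state in column 5, and a "usage" line in the `qstat -j` output that contains `cpu=...`. Check these assumptions against your own output before relying on them.

```shell
#!/bin/sh
# Sketch only: the column layout and the "usage" line format are assumptions
# about typical SGE qstat output, not something guaranteed by this page.

# Print the IDs of jobs in the Eqw state.
# Reads "qstat -u user_name" output on stdin; assumes two header lines,
# the job ID in column 1 and the state in column 5.
eqw_job_ids() {
    awk 'NR > 2 && $5 == "Eqw" { print $1 }'
}

# Print the cpu time from the "usage" line of "qstat -j job_ID" output.
# Reads that output on stdin; assumes a line of the form
# "usage  1:  cpu=HH:MM:SS, mem=..., ...".
job_cputime() {
    sed -n 's/^usage.*cpu=\([^,]*\),.*/\1/p'
}

# Example use (left as comments so sourcing this file runs nothing):
#   qstat -u user_name | eqw_job_ids
#   qstat -j job_ID | job_cputime
```

Both functions only read text from standard input, so they can be tried on saved qstat output. Inspect each Eqw job with "qstat -j job_ID" and fix the underlying error before clearing it with qmod -cj.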


