Monitoring and Managing Batch Jobs
These are some basic commands for monitoring and modifying batch jobs while they are queued or running.
NERSC has developed qs, a new tool for monitoring and viewing the state of batch jobs on Genepool. For details, please read "Monitoring jobs with qs".
| Action | How to do it | Comment |
|---|---|---|
| Get a listing of your jobs and their states | qs -u | If you skip the -u option, you'll get all the jobs on Genepool/Phoebe. |
| | qstat -u user_name | If you skip the -u option, you'll only see jobs for your username. |
| | qstat_long -u user_name | Regular qstat truncates job names to 10 characters. If you need the full name, use qstat_long. |
| Get detailed info about a specific job | qstat -j job_ID | You can get job_ID by listing your jobs as described above. |
| See how much CPU time a job has used | qstat -j job_ID | Look at the next-to-last line, or grep the output for "usage". Note that in the memory usage field, GBs stands for gigabyte-seconds. |
| Kill a specific job | qdel job_ID | If qdel doesn't work, try qdel -f job_ID. |
| Kill all your jobs | qdel -u user_name | |
| Select a job to run first | qalter -js NN job_ID (NN is some positive number) | In UGE you control the relative priority of your jobs by adjusting their job shares. A larger job share results in a higher priority. |
| Use multiple job slots for your job | qalter -pe pe_slots NN job_ID (NN is some positive number) | Set NN to the number of job slots your job needs, to prevent overloading the node. For example, if you are running a multithreaded job, set NN to the number of threads. |
| Clear jobs in Eqw state | qmod -cj job_ID | The Eqw state means the job hit an error when the scheduler tried to start it. Check the error with "qstat -j job_ID"; it is listed near the end of the output. Fix the cause if necessary before clearing the job, or it will just go back into the Eqw state. This can only be done from the Genepool login nodes (not the gpints). |
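
The "usage" line mentioned in the table can be split into its individual fields with standard shell tools. The following is a minimal sketch, assuming the usage line uses the comma-separated key=value layout commonly seen in Grid Engine output (cpu=..., mem=... GBs, and so on). The helper name usage_field and all sample values are hypothetical, so compare against the real output on your system before relying on it.

```shell
# Sample text standing in for `qstat -j job_ID` output.
# All values are made up for illustration; the layout is an assumption.
sample_output='job_number:                 1234567
job_name:                   my_long_job_name
usage    1:                 cpu=00:12:34, mem=45.67890 GBs, io=0.12345, vmem=1.234G, maxvmem=2.345G
scheduling info:            (Collecting of scheduler job information is turned off)'

# usage_field (hypothetical helper): read qstat -j output on stdin and
# print the value of one field from the usage line.
# $1 is the field name: cpu, mem, io, vmem or maxvmem.
usage_field() {
    awk -v key="$1" '
        /^usage/ {
            sub(/^usage[^:]*:[ \t]*/, "")    # drop the "usage 1:" label
            n = split($0, parts, ", *")      # fields are comma-separated
            for (i = 1; i <= n; i++) {
                split(parts[i], kv, "=")     # kv[1] is the name, kv[2] the value
                if (kv[1] == key) { print kv[2]; exit }
            }
        }'
}

# Demonstration on the sample text; prints 00:12:34
printf '%s\n' "$sample_output" | usage_field cpu
```

On a login node the helper would read live output instead of the sample, for example `qstat -j job_ID | usage_field cpu`. Matching on the exact field name avoids confusing mem with vmem or maxvmem, which a plain grep for "mem" would not distinguish.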


