NERSC: Powering Scientific Discovery Since 1974

Best Practices - and Practices to Avoid

Best Practices - Dos and Don'ts on the cluster

Please keep the following in mind when running/submitting jobs on Genepool:

Use qstat and qmod interactively.  Please DO NOT use these commands in scripts or tight loops.  Please DO NOT clear job errors from scripts.  To determine whether or not a job has finished, use job dependencies, empty files (flags), logs, signal traps, etc.  You can also set up your job so that you receive an email when the job starts running and when it completes.
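As a sketch of the flag-file approach, a job script can record its state in small marker files that downstream scripts or interactive checks can look for, instead of polling qstat.  The step command and file names below are placeholders, not a NERSC-prescribed convention:

```shell
#!/bin/bash
# Hypothetical job script using flag files to signal progress.
# Submit with something like:  qsub -m e -M you@example.gov job.sh
# (-m e requests email on job end; -M gives the address)
STEP_CMD=${STEP_CMD:-true}          # stands in for the real pipeline step
touch job.running                   # flag: the job has started
if $STEP_CMD > step.log 2>&1; then
    touch job.done                  # downstream scripts watch for this file
else
    touch job.failed                # flag: the step exited nonzero
fi
rm -f job.running
```

A dependent script then only needs to test for job.done (or job.failed) rather than repeatedly querying the scheduler.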
Use qs instead of qstat.  qs is a cached version of the qstat data; the data is refreshed every two minutes.  qs makes the queued job data more accessible than qstat.
Request 12 hours or less.  There are many more nodes dedicated to short jobs than to long jobs.  It is worth taking a bit of time to work out how to get your production jobs to run in 12 hours or less if at all possible!
Checkpoint your pipelines!  If there are breaks in your workflow, write the intermediate output to a checkpoint file.  This will save you from having to restart the calculation from the beginning if the job is killed or dies before completion.  This also applies to pipeline testing: do not expect the debug queue to run long enough for a complete pipeline to finish.
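One minimal way to checkpoint a shell pipeline is to touch a marker file after each stage succeeds and skip any stage whose marker already exists on restart.  The stage names and commands below are illustrative, not part of any real pipeline:

```shell
#!/bin/bash
set -e
# run_stage runs its command only if the stage's checkpoint file is
# absent, and records completion on success.
run_stage () {
    local name=$1; shift
    if [ -f "ckpt.$name" ]; then
        echo "skipping $name (checkpoint found)"
        return 0
    fi
    "$@"                            # the actual work for this stage
    touch "ckpt.$name"              # checkpoint: stage completed
}
run_stage filter sh -c 'echo filtered > data.tmp'
run_stage align  sh -c 'echo aligned  > data.tmp'
```

If the job is killed partway through, resubmitting the same script reruns only the stages that have no checkpoint file.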
Add 'ulimit -c 0' to your job scripts to disable core dumps.  Otherwise, the fileserver may become overwhelmed when hundreds of core dump files are written to the same location.  We are currently trying to make this the default behavior.
Use array jobs instead of many individual jobs whenever possible.  This reduces load on the scheduler and reduces the number of scripts that you need to maintain.
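A sketch of the array-job pattern: one script is submitted once (e.g. 'qsub -t 1-100 array_job.sh'), and Grid Engine runs a copy per task with $SGE_TASK_ID set, which the script uses to pick its input.  The input file naming here is hypothetical:

```shell
#!/bin/bash
# Hypothetical task-array job script.
# Submit roughly as:  qsub -t 1-100 array_job.sh
TASK_ID=${SGE_TASK_ID:-1}           # set by the scheduler for each task
INPUT="input.${TASK_ID}.txt"        # each task picks its own input file
# ... the real per-task work on $INPUT would go here ...
echo "task ${TASK_ID} handles ${INPUT}" > "task.${TASK_ID}.out"
```

One submission then replaces a hundred nearly identical scripts, and the scheduler tracks the whole set as a single job.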
Don't run short jobs on the cluster!  If the jobs require less than a minute to complete, consider combining them into longer jobs.  Scheduling such short jobs will cost more than the jobs themselves!  We recommend that jobs take at least 20 minutes to complete, but 3-4 hours per task would be a better target.
Do not overestimate the runtime a job needs by too much.  Shorter requested runtimes allow your job to be scheduled and run more quickly.  It might even be faster to request the lower bound on the runtime for most of the jobs and re-run any that don't complete.  This is another place where checkpointing your pipelines helps: if a job runs out of time and has to be re-run, it is far more efficient not to start over from the beginning.
Request soft limits that are slightly smaller than the hard limits for consumable resources, and trap the resulting signals so you know why a job was killed.
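As a sketch of this pattern (limit values here are illustrative): pair a soft runtime limit just under the hard one, e.g. 'qsub -l s_rt=11:50:00,h_rt=12:00:00 job.sh'.  Grid Engine sends SIGUSR1 when the soft limit expires, which gives the script a chance to log the reason (and checkpoint) before the hard limit kills it:

```shell
#!/bin/bash
# Log why the job is about to die when the soft runtime limit expires.
trap 'echo "soft runtime limit reached" >> job.status' USR1
: > job.status                      # start with an empty status log
# ... the real work would go here ...
kill -USR1 $$                       # simulate the scheduler's warning
true                                # command boundary so the trap fires
```

In a real script the trap handler would typically also write a checkpoint and exit cleanly instead of waiting for SIGKILL.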
Use the '-w e' option with qsub.  This prevents unschedulable jobs from entering the queue.  This may become the default option in the future.