Many users run jobs that read datasets too large to transfer to the /scratch directory. When large numbers of these jobs run simultaneously, they can have an adverse effect on the performance of the disks serving the data, causing your jobs to run very inefficiently. We have configured resources in the batch system, called IO resources, that help manage the amount of concurrent access. They depend on the bandwidth available from the storage hardware.
All jobs should have appropriate IO resource flags set. PDSF scans running jobs and alerts the sysadmins to offending jobs. A user can be disabled for batch if the sysadmins think their jobs might cause system overload. Disabled users will be notified by email and a ticket will be created.
Submit Jobs With IO Resource
Submit your jobs as follows for files on the various elizas:
qsub -l eliza<#>io=1 <job script>
The <#> should be replaced with a number corresponding to the volume that contains your data. For example, if your data resides on /eliza15, use eliza15io as your resource. All PDSF volumes have these resources as well as NGF (projectio, gscratchio) and HPSS (hpssio). We recommend setting your resource to 1, as shown in the example above, and if that turns out to be too low it can be raised.
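As an illustration of the naming rule above, the resource name can be derived from the data path. This is only a sketch: the function name eliza_io_resource and the example path are hypothetical, and it assumes the volume is the first path component.

```shell
# Hypothetical helper (not part of the batch system): print the IO
# resource name for a path on an eliza volume, e.g. /eliza15/... -> eliza15io.
eliza_io_resource() {
    case "$1" in
        /eliza[0-9]*)
            vol=${1#/eliza}      # drop the leading "/eliza"
            vol=${vol%%/*}       # keep only the text before the next "/"
            case "$vol" in
                *[!0-9]*) return 1 ;;   # e.g. /eliza1x is not a volume number
            esac
            printf 'eliza%sio\n' "$vol" ;;
        *)
            return 1 ;;          # not an eliza path
    esac
}
```

For a path such as /eliza15/mydata/run1 the function prints eliza15io, which can then be passed to qsub -l with a value of 1.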
If you use /project there is also an IO resource for that, used as follows:
qsub -l projectio=1 <job script>
And if you use both an eliza and /project you should specify both:
qsub -l projectio=1,eliza<#>io=1 <job script>
If your job uses global scratch (aka $GSCRATCH or /global/scratch2/sd/<username>) you must use the global scratch IO resource:
qsub -l gscratchio=1 <job script>
Global scratch is only mounted on the newer compute nodes, so using this flag ensures your job will be routed to the proper compute nodes. Without it, there's a good chance your jobs will fail.
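When a job reads from several file systems, the resource requests have to be joined into one comma-separated list with no spaces. The following sketch builds that list from resource names; the function name io_request is hypothetical, and it assumes one unit per resource, as recommended above.

```shell
# Hypothetical helper: turn a list of IO resource names into the value
# passed to "qsub -l", requesting 1 unit of each and skipping duplicates.
io_request() {
    req=
    for name in "$@"; do
        case ",${req}," in
            *",${name}=1,"*) ;;                      # already in the list
            *) req=${req:+${req},}${name}=1 ;;       # append, no spaces
        esac
    done
    printf '%s\n' "$req"
}
```

For example, io_request projectio eliza15io gscratchio prints projectio=1,eliza15io=1,gscratchio=1, which can be used as qsub -l "$(io_request projectio eliza15io gscratchio)" <job script>.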
projectio and projectlio flags
If your job only "touches" global project, i.e. your software is located there or you are directing stdout to project, you can use the projectlio flag (note the "l" for light).
If your job analyzes data located on project, you need to use the projectio flag.
Note that you HAVE TO use projectio or projectlio flags if your job uses Global Project in any way. Otherwise your job may crash.
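The choice between the two flags can be written down as a small lookup. This is a sketch only: the function name project_flag and the keywords data and light are hypothetical labels for the two cases described above.

```shell
# Hypothetical helper: pick the Global Project flag from the kind of access.
#   data  - the job analyzes data located on project
#   light - the job only loads software from project or writes stdout there
project_flag() {
    case "$1" in
        data)  echo projectio ;;
        light) echo projectlio ;;
        *)     echo "usage: project_flag data|light" >&2
               return 1 ;;
    esac
}
```

The function prints only the flag name. The value to request with projectlio is not stated here, so it is left out rather than assumed.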
Modify Existing Jobs To Add IO Resource Flags
For example, suppose you have a job with number 4057136. Display its details:
qstat -j 4057136
Then look for the line that starts with hard resource_list.
Assume it says:
hard resource_list: h_vmem=1100M,h_stack=10240K
Do the following:
qalter -l h_vmem=1100M,h_stack=10240K,eliza8io=1 4057136
(The trick is to copy the old list and append eliza8io=1. qalter -l replaces the whole resource list, so anything you leave out is dropped.)
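The same steps can be sketched as a shell function that reads qstat -j output and prints the matching qalter command instead of running it, so the result can be checked first. The function name build_qalter is hypothetical, and the sketch assumes the hard resource_list line has the layout shown above.

```shell
# Hypothetical helper: read "qstat -j <job>" output on stdin and print a
# qalter command that keeps the existing hard resources and adds one more.
# Usage: qstat -j 4057136 | build_qalter 4057136 eliza8io=1
build_qalter() {
    job_id=$1
    new_resource=$2
    # Take the text after "hard resource_list:" (first match only).
    current=$(sed -n 's/^hard resource_list:[[:space:]]*//p' | head -n 1)
    if [ -n "$current" ]; then
        printf 'qalter -l %s,%s %s\n' "$current" "$new_resource" "$job_id"
    else
        # No hard resources were set, so request only the new one.
        printf 'qalter -l %s %s\n' "$new_resource" "$job_id"
    fi
}
```

The function only prints the command. Review the output, then run it by hand or pipe it to sh.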
If this seems too complicated, you can delete your old jobs and resubmit them. You will not lose your place in the queue, no matter how long the jobs were waiting.
Observe Available IO Resources
The following command can be used to see how many IO units are assigned to your resource and how many are being used at any given time: