NERSC: Powering Scientific Discovery Since 1974

Best Practices

The general guiding principle of working on PDSF is that the bulk of the workload should be shifted to the compute nodes.

When you first log into PDSF, you land on one of four computers: pdsf1, pdsf2, pdsf3, or pdsf4. These four computers are called the interactive nodes. A job that requires a lot of computational resources can slow the node down for everyone using it. Whenever possible, jobs should be run on the compute nodes.

Run your job on the compute nodes if:

  • It will consume more than 10% of the CPU
  • It will take more than 1 hour to finish

If you want to know how much CPU your job is taking, log into the same interactive node in a new window. You can do this by typing ssh followed by the node name (one of pdsf1 through pdsf4) in a new terminal. Once you are logged in, you can see how much CPU your job is taking by typing "top". If your job is taking too much CPU, please kill it and restart it on the compute nodes.
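The two thresholds above can be written down as a small check. This is only a sketch: the function name should_use_compute_node is made up for illustration, and it assumes you read the CPU percentage off top yourself and estimate the run time in minutes.

```shell
#!/bin/bash
# Sketch of the rule of thumb above. The function name is hypothetical,
# and the inputs are values you supply (for example the %CPU column in top).

# Succeeds (exit status 0) when a job belongs on the compute nodes:
# more than 10% CPU, or more than 60 minutes of expected run time.
should_use_compute_node() {
    cpu_percent="$1"   # CPU usage in percent, decimals allowed
    minutes="$2"       # expected run time in minutes
    # awk handles the decimal comparison; it exits 0 when either limit is exceeded.
    awk -v c="$cpu_percent" -v m="$minutes" 'BEGIN { exit !(c > 10 || m > 60) }'
}

if should_use_compute_node 35.5 20; then
    echo "run on a compute node"
else
    echo "fine on an interactive node"
fi
```

If top shows too many processes to find yours, typing top -u <your_user_name> limits the display to your own processes.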

Running on the Compute Nodes

Jobs are run on the compute nodes using the batch system. Most users write scripts and submit them to the batch system. A submitted job waits in the queue until a slot opens up on one of the compute nodes, and then the job runs on that compute node. You can find instructions on how to submit jobs to the batch system here.

However, there are some cases where you might want to run a job that requires interaction (maybe you're doing exploratory plotting in ROOT or something similar). It's generally better to run these on the compute nodes as well, especially if they will be 'heavy' (see the above section). You can do this by submitting an 'interactive' job to the queue (instructions are here). This gives you a terminal that acts just like one on the interactive nodes, except that it runs on a compute node, so you can run jobs that require a lot of CPU without slowing things down for everyone else. You could even run jobs longer than 1 hour, but for longer jobs you may want to think about writing a script and submitting it to the batch queue.
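For the scripted case, a job script is an ordinary shell script. The sketch below shows the general shape only. Nothing in it is specific to the PDSF batch system: the scheduler options and the submit command are covered by the batch instructions mentioned above, and the scratch location, file names, and placeholder workload are all assumptions for illustration.

```shell
#!/bin/bash
# Minimal job script sketch. Scheduler options and the submit command are
# deliberately left out; see the batch system instructions for those.

# Record where and when the job ran; this helps when reading the log later.
echo "Job started on $(uname -n) at $(date)"

# Work in a private scratch directory. Using ${TMPDIR:-/tmp} is an
# assumption; use whatever scratch area is recommended for your group.
workdir=$(mktemp -d "${TMPDIR:-/tmp}/myjob.XXXXXX") || exit 1

# Placeholder workload: replace this line with your real program.
echo "placeholder result" > "$workdir/output.txt" || exit 1

echo "Job finished at $(date); output is in $workdir"
```

Logging the host name and the start and end times costs nothing and makes it much easier to work out afterwards which compute node a job ran on and how long it took.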

Data Transfers

Any time you are working with data from the eliza disks, it is better to use one of PDSF's data transfer nodes (their names are pdsfdtn1 and pdsfdtn2). This includes any time you are doing a cp, mv, or rm with data on the eliza disks. It also includes doing a remote copy (via scp) from another system. The data transfer nodes have much faster connections, so your data will get where you want it much faster. Running any of these operations on an interactive node instead slows that node down for everyone else.
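One way to get into the habit is a small dry-run helper that shows what the command would look like when it is run on a data transfer node. This is a sketch under assumptions: the helper name on_dtn is made up, the paths are placeholders, and it assumes the short host name pdsfdtn1 resolves from wherever you type the command. It only prints the command and does not run anything.

```shell
#!/bin/bash
# Hypothetical helper: print (do not run) a file command so that it
# executes on a data transfer node instead of an interactive node.
on_dtn() {
    dtn="pdsfdtn1"    # pdsfdtn2 can be used in the same way
    # "$*" joins all arguments with spaces into a single command string.
    printf 'ssh %s %s\n' "$dtn" "$*"
}

# The paths below are placeholders, not real eliza locations.
on_dtn cp /path/on/eliza/run1.root /path/on/eliza/backup/
```

Once the printed command looks right, you can copy and run it, or simply log into the data transfer node first and type the cp, mv, rm, or scp there.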

Deleting Your Jobs

Suppose one of your jobs is accidentally taking up too much memory and you need to kill it. You can find its process ID by typing

ps -efl | grep <your_user_name>

This command should be run on the same interactive node where the job is running. You'll see a line that looks like this:

0 S usgweb   <pid> 31189  0  80   0 -  6247 poll_s 18:08 pts/15   00:00:00 python

Here <pid> is a number that represents the process ID of the job. You can kill this job by typing:

kill -9 <pid>
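Because kill -9 cannot be caught, the process gets no chance to clean up after itself. A common variation, sketched below, asks the process to exit first and uses kill -9 only if it is still running after a short wait. The function name stop_job and the 5-second default are illustrative choices, not part of any PDSF tooling.

```shell
#!/bin/bash
# Sketch: send the default signal (SIGTERM) first, then fall back to
# SIGKILL (kill -9) only if the process survives the grace period.
stop_job() {
    pid="$1"
    grace="${2:-5}"                          # seconds to wait, default 5
    # If the first kill fails, the process is already gone or is not yours.
    kill "$pid" 2>/dev/null || return 0
    i=0
    while [ "$i" -lt "$grace" ]; do
        # kill -0 sends no signal; it only tests whether the process exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done
    # Still running after the grace period: force it to stop.
    kill -9 "$pid" 2>/dev/null
}
```

You would call it as stop_job <pid>, or stop_job <pid> 10 to wait up to 10 seconds before forcing the kill.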