Running Jobs on Jacquard

All parallel jobs on Jacquard must be run through the batch system. The batch software is PBS Pro.
Note: Do not set group (or world) write permission on your home directory. If you do, PBS cannot run your jobs.
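The permission problem described in the note can be checked before submitting a job. The following is a minimal sketch, assuming a POSIX shell and an `ls` that supports `-L`; the function name `dir_is_group_or_world_writable` is hypothetical and is not a NERSC-provided tool.

```shell
#!/bin/sh
# Hypothetical helper (not a NERSC tool): succeeds if the given directory
# has group or world write permission set.
dir_is_group_or_world_writable() {
    # Take the 10-character mode string, e.g. "drwxr-xr-x".
    perms=$(ls -ldL "$1" 2>/dev/null | cut -c1-10)
    case "$perms" in
        # Character 6 is group write, character 9 is world write.
        ?????w*|????????w*) return 0 ;;
        *) return 1 ;;
    esac
}

if dir_is_group_or_world_writable "$HOME"; then
    echo "WARNING: $HOME is group- or world-writable; PBS cannot run your jobs."
    echo "One possible fix: chmod go-w \"$HOME\""
else
    echo "OK: $HOME is not group- or world-writable."
fi
```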
Interactive jobs

Parallel jobs can be run from the command line by starting an interactive batch job with the following command:

    jacquard% qsub -I -q interactive -A reponame -l nodes=N:ppn=[1|2]

N is the number of nodes desired, and ppn should be specified as either 1 or 2 processors per node. The interactive queue limits listed below will apply. If the reponame is not specified, charges will be applied against the user's default repo.

The preceding command will start a new shell, from which you can launch jobs with the mpirun or mpiexec command (see below). The directory the job was submitted from is defined in the environment variable $PBS_O_WORKDIR.

You can also use mpirun directly from the command line:
    jacquard% mpirun -np number_of_tasks executable

In this case a wrapper script issues the appropriate qsub command on your behalf. Depending on the availability of nodes for interactive use, interactive jobs may at times stall or time out.

Batch jobs

A batch script, a text file containing PBS directives and job commands, is required to submit jobs. PBS directive lines, which tell the batch system how to run your job, begin with #PBS. A minimal script for Jacquard will be very similar to the following example:

    #PBS -l nodes=8:ppn=2,walltime=00:30:00
    #PBS -N jobname
    #PBS -o job.out
    #PBS -e job.err
    #PBS -A repo
    #PBS -q debug
    #PBS -V

    cd $PBS_O_WORKDIR
    mpirun -np 16 ./a.out

To use mpiexec instead, replace the last two lines as follows:

    cd $PBS_O_WORKDIR
    mpiexec -n 16 ./a.out

Replace repo with the repository you want to charge the job against.

Currently mpirun does not propagate environment variables to all the tasks in the parallel job. Using mpiexec to launch the job is one way to accomplish this. You can also use mpirun to launch a shell script that defines the variables you need. mpiexec supports many useful features that NERSC staff are working to see implemented in the standard mpirun. To accommodate the needs of as many users as possible, for now we provide both launchers. See below for more information about these job launchers.

Notes:
Jobs that read or write large files should be executed in the $SCRATCH file system. In the sample script above, the line cd $PBS_O_WORKDIR changes the current working directory to the directory from which the script was submitted. The easiest way to run a job using $SCRATCH is to submit the job from a $SCRATCH directory. You may also cd to your $SCRATCH directory in place of cd $PBS_O_WORKDIR.
All options may be specified either (1) as qsub command-line options or (2) as directives in the batch script, in the form #PBS option.

Parallel job launch commands

The standard MPI job launch command for MPICH/MVAPICH programs is called mpirun. mpirun uses SSH to execute non-interactive remote commands on the compute nodes, and therefore does not propagate environment variables into the parallel compute environment.

    jacquard% mpirun -np number_of_tasks executable

An alternative job launch program installed by NERSC is mpiexec. Its advantage over mpirun is that environment variables are propagated from the batch script environment into the parallel run environment.

    jacquard% mpiexec -n number_of_tasks executable

The mpiexec launcher talks directly to PBS services as opposed to using remote shells. More information on mpiexec is available here. You are particularly encouraged to launch your job with mpiexec if you are using a large number of nodes. See below for more information.
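For users who stay with mpirun, the approach of launching a shell script that defines the needed variables can be sketched as follows. This is an illustration only: the file name `run_with_env.sh` and the variable `MY_APP_MODE` are hypothetical, and whether a wrapped executable starts correctly under mpirun on a given system is an assumption that should be verified with a small test job.

```shell
#!/bin/sh
# Sketch: create a wrapper that defines variables, then runs the real program.
# run_with_env.sh and MY_APP_MODE are hypothetical names used for illustration.
wrapper_dir=$(mktemp -d)
wrapper="$wrapper_dir/run_with_env.sh"

cat > "$wrapper" <<'EOF'
#!/bin/sh
# Define the variables here, because mpirun does not pass them on
# from the batch script environment.
MY_APP_MODE=production
export MY_APP_MODE
# Replace this shell with the real program, passing all arguments through.
exec "$@"
EOF
chmod +x "$wrapper"

# Local check without MPI: the wrapped command sees the variable.
sh "$wrapper" sh -c 'echo "MY_APP_MODE=$MY_APP_MODE"'

# In a batch script the launch line would then look like this (not run here):
#   mpirun -np 16 ./run_with_env.sh ./a.out
```

Because the wrapper ends with `exec "$@"`, the program replaces the wrapper shell instead of running as its child, so no extra shell process is left around for each task.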
Account (repo) charging

Jobs are charged against your default repository unless otherwise specified. (See Accounts and Charging on Jacquard for more information.) The NIM web interface is used to view and change your default repo. You can specify the repo to be charged in your PBS script with this directive:

    #PBS -A repo_name

or use the -A reponame option to qsub. Interactive and debug jobs are charged at the regular priority rate.

Batch queues

There are four submit queues on Jacquard. The submit queues will route your job to the correct execution queue based on its requirements. The lower the relative priority of a queue, the higher the actual priority of its jobs as far as the scheduler is concerned. Other things being equal, a job in a queue with a relative priority of n will be scheduled ahead of one in a queue with a relative priority of n+1.
Notes:
1. There is a maximum of 4 running jobs per user over the whole system.
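The ordering rule for relative priorities described above can be illustrated with a sort. This is only a sketch: the job names, priority values, and the use of submission order as a tie-breaker are hypothetical, and the real scheduler weighs other factors that are ignored here.

```shell
#!/bin/sh
# Sketch of "a lower relative priority number is scheduled first".
# Input columns: job name, relative priority of its queue, submission order.
# All values are hypothetical; the real scheduler considers other factors too.
schedule_order() {
    # Sort by relative priority (ascending), then by submission order,
    # and print only the job names.
    sort -k2,2n -k3,3n | awk '{ print $1 }'
}

printf '%s\n' \
    'jobA 3 1' \
    'jobB 1 2' \
    'jobC 2 3' \
    'jobD 1 4' | schedule_order
```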
Running large (129 nodes or more) jobs

All large jobs on Jacquard, particularly those using 129 nodes or more, are encouraged to use mpiexec to launch their jobs instead of mpirun. The reason for this recommendation is that mpirun is actually a batch script, and beyond a certain node count jobs that use this script to launch may run into shell line length limitations.

STDOUT, STDERR buffering

PBS stages standard output and standard error to temporary files that are not written into a user's disk space until the job has completed. You can redirect STDOUT and STDERR from the command line into a file that is visible to you during the run, but this scheme may not work in all situations. NERSC is investigating ways to make this redirection more reliable.

STDIN, STDOUT redirection

If your code requires you to redirect stdin or stdout on the command line, you may wish to try putting the command-line part of the mpirun command in quotes:

    jacquard% mpirun -np number_of_tasks "executable <inputfile >outputfile"

Sample batch script

This is a sample Jacquard PBS batch script that runs a 64-node, 128-processor job with a 5-hour wall clock limit. The executable is called hello; the job's standard output goes into a file named hello.out and any standard error into a file named hello.err.

    #PBS -l nodes=64:ppn=2,walltime=05:00:00
    #PBS -N hello
    #PBS -o hello.out
    #PBS -e hello.err
    #PBS -q batch
    #PBS -A repo
    #PBS -V

    cd $PBS_O_WORKDIR
    mpirun -np 128 ./hello

repo is the repository name against which to charge the job. The nodes keyword gives the number of nodes on which the job will run, and ppn gives the number of processors on each node, which on Jacquard can only be 1 or 2. If a queue is not specified in the script, the job will run in the batch queue. The "./" before the name of the executable is required when invoking mpirun, even if "." is in $PATH.

Submitting a job

To submit a job for execution, type

    jacquard% qsub batchscript

where batchscript is the name of the batch script.
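When preparing a batch script for submission, the task count given to mpirun has to agree with the nodes and ppn request (64 x 2 = 128 in the sample script). The sketch below derives one from the other and prints a script in the same layout as the sample. The helper names `total_tasks` and `make_batch_script` are hypothetical, and the generated script assumes the batch queue and an executable in the submission directory.

```shell
#!/bin/sh
# Sketch: derive the mpirun task count from the nodes/ppn request.
# total_tasks and make_batch_script are hypothetical helper names.

# Print nodes * ppn; ppn must be 1 or 2 on Jacquard.
total_tasks() {
    nodes=$1
    ppn=$2
    case "$ppn" in
        1|2) ;;
        *) echo "ppn must be 1 or 2" >&2; return 1 ;;
    esac
    echo $((nodes * ppn))
}

# Print a batch script to standard output.
# Arguments: nodes ppn walltime jobname repo executable
make_batch_script() {
    np=$(total_tasks "$1" "$2") || return 1
    cat <<EOF
#PBS -l nodes=$1:ppn=$2,walltime=$3
#PBS -N $4
#PBS -o $4.out
#PBS -e $4.err
#PBS -q batch
#PBS -A $5
#PBS -V

cd \$PBS_O_WORKDIR
mpirun -np $np ./$6
EOF
}

# Reproduce the sample script: 64 nodes, 2 processors per node, 5 hours.
make_batch_script 64 2 05:00:00 hello repo hello
```

The output could be redirected to a file (for example a hypothetical `hello.pbs`) and that file passed to qsub.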
The output of the qsub command will include the jobid. Users should record this information, as it is very useful in debugging job failures.

Deleting a job

To delete a previously submitted job, type

    jacquard% qdel jobid

where jobid is the job's identification, produced by the qsub command.

Job monitoring

Job progress can be monitored on the web, or on Jacquard with the PBS command qstat or the NERSC-provided command qs. The NERSC qs command gives queue status information tailored to Jacquard. Output from qs is a terminal-formatted summary of running and queued jobs.

    jacquard% qs
    JOBID  ST USER   NAME        NDS  REQ       USED      SUBMIT
    57839  R  user5  calcoastNH   16  00:30:00  00:04:30  Aug 1 16:42:51
    57676  R  user1  fspack       64  05:00:00  03:30:52  Aug 1 10:23:32
    57666  R  user1  fspack       32  06:00:00  03:01:15  Aug 1 09:45:23
    57677  R  user1  fspack       32  05:00:00  01:49:11  Aug 1 10:24:58
    57801  R  user6  b             1  06:00:00  01:10:49  Aug 1 15:30:03
    57803  R  user6  c             1  06:00:00  01:10:48  Aug 1 15:31:05
    57809  R  user6  d             1  06:00:00  01:02:44  Aug 1 15:45:04
    57814  R  user6  e             1  06:00:00  00:54:48  Aug 1 15:52:50
    57824  R  user4  test77        4  06:00:00  00:40:43  Aug 1 16:08:05
    57836  R  user1  fspack       32  06:00:00  00:12:40  Aug 1 16:35:17
    57591  H  user7  tftr79100_   64  06:00:00  -         Jul 31 19:02:54
    57675  H  user3  prot_10      64  05:00:00  -         Aug 1 10:21:50
    57697  H  user7  jt6032841_   64  05:58:00  -         Aug 1 13:24:39
    57815  H  user8  a_to_cen_    64  03:00:00  -         Aug 1 15:55:02
    57584  H  user7  jt6032844_   64  06:00:00  -         Jul 31 18:35:55
    57838  H  user1  fspack       64  06:00:00  -         Aug 1 16:35:25
    57680  H  user1  fspack       32  06:00:00  -         Aug 1 10:30:20
    57817  H  user8  x0_39_       64  04:00:00  -         Aug 1 15:56:14
    57678  H  user3  prot17_      64  05:00:00  -         Aug 1 10:26:25
    57792  H  user2  J128         64  06:00:00  -         Aug 1 15:15:38
    57818  H  user8  hole_38_     64  04:00:00  -         Aug 1 15:56:57
    57816  H  user8  a0_43_       64  04:00:00  -         Aug 1 15:55:47
    57698  H  user7  tftr79128_   64  05:58:00  -         Aug 1 13:27:21
    57679  H  user1  fspack       32  06:00:00  -         Aug 1 10:30:04

The qs script includes -u username and -w options that allow decreasing or increasing the amount of
information reported.

Output

Your standard output file will contain a system-provided header before your actual job output and a system-provided footer after your output. The header will look like this:

    PBS Leader node is jaccn150
    Job setup time: Thu Mar 3 12:13:05 PST 2005
    Setting up security
    Job startup at Thu Mar 3 12:13:05 PST 2005
    -------------------------------------------------------------------
    Warning: no access to tty (Bad file descriptor).
    Thus no job control in this shell.

The last two lines do not indicate an error. The system-provided footer will follow your job output, and it will look like this:

    ----------------------------------------------------------------
    Job testbar1/21451.jacin03 completed Thu Mar 3 12:13:05 PST 2005
    Submitted by mstewart/mstewart using null
    Job Limits: ncpus=2,neednodes=1:ppn=2,walltime=00:02:00
    Job Resources used: cpupercent=0,cput=00:00:00,mem=1488kb,ncpus=2,
    vmem=6000kb,walltime=00:00:00
    Nodes used: jaccn150
    Killing any leftover processes...
    Job completed.

Using script variables

The suggested mpirun job launch script does not propagate variables from the script environment into the parallel run environment, except for the LD_LIBRARY_PATH variable. This means that codes that need variables defined only in the script will fail at runtime. This includes any module commands that appear only in the script. If your job needs the environment variables defined in the batch script environment, try using mpiexec (see above) instead of mpirun.

VAPI_RETRY_EXC_ERR error

You may run into this error, particularly when running large node count jobs on Jacquard:

    Got completion with error, code=VAPI_RETRY_EXC_ERR, vendor code=81

You can eliminate the problem by setting the following environment variables in your batch script before the mpirun/mpiexec job launch command.
bash login shell:

    export VIADEV_DEFAULT_RETRY_COUNT=7
    export VIADEV_DEFAULT_TIME_OUT=21

csh/tcsh login shell:

    setenv VIADEV_DEFAULT_RETRY_COUNT 7
    setenv VIADEV_DEFAULT_TIME_OUT 21

For assistance or to report problems, contact consult@nersc.gov.
Page last modified: Mon, 04 Feb 2008 23:42:30 GMT
Page URL: http://www.nersc.gov/nusers/resources/jacquard/running_jobs.php
Web contact: webmaster@nersc.gov
Computing questions: consult@nersc.gov