STAT and ATP
STAT (the Stack Trace Analysis Tool) is a highly scalable, lightweight tool that gathers and merges stack traces from all of the processes of a parallel application. The results are then presented graphically as a call tree showing the location that each process is executing.
It supports only distributed-memory parallel programming models, such as MPI, Coarray Fortran, and UPC (Unified Parallel C).
STAT needs to be run on a MOM (batch management) node. The steps are:
1. Find the MOM node your job is on by running the following command:
% qstat -f <jobid> | grep login_node_id
% qstat -f 722272 | grep login_node_id
    login_node_id = nid02051
2. Log in to the MOM node that launched your application:
% ssh -XY nid02051
3. Find the process id of the aprun command for your application:
% ps -fH
UID        PID  PPID  C STIME TTY          TIME CMD
wyang    23953 16045  0 Feb01 pts/0    00:00:00 aprun -n 4 ./jacobi_mpi
wyang    23961 23953  0 Feb01 pts/0    00:00:00   aprun -n 4 ./jacobi_mpi
The PID of the aprun command is 23953. The second aprun process (PID 23961) is its child, as its PPID shows.
4. Load the stat module:
% module load stat
5. Run the STAT command on this process. You may want to use the -i flag to gather source line numbers, too:
% STAT -i 23953
Attaching to application...
Attached!
...
Results written to /scratch1/scratch/wyang/parallel_jacobi/stat_results/jacobi_mpi.0010
STAT takes several backtrace samples after attaching to the running processes. The result file is created in the 'stat_results' subdirectory under the current working directory. This subdirectory contains a further subdirectory, named after your parallel application's executable, that holds the merged stack trace file in DOT format.
6. Run the GUI command 'STATview' on the generated *.dot file to visualize the stack backtrace information:
% STATview stat_results/jacobi_mpi.0010/jacobi_mpi.0010.3D.dot
The above call tree diagram reveals that rank 0 is in the 'init_fields' routine (line 172 of jacobi_mpi.f90), rank 3 in the 'set_bc' routine (line 214 of the same source file), and the other ranks (1 and 2) are in the MPI_Sendrecv function. If this pattern persists, it means that the code hangs in these locations. With this information, you may want to use a full-fledged parallel debugger such as DDT or TotalView to find out why your code is stuck in these places.
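Steps 1 and 3 above involve reading values out of command output by eye. The following is a minimal bash sketch, not part of STAT itself, of how those two lookups might be scripted. The function names mom_node_from_qstat and aprun_pid_from_ps are hypothetical. Both read the command output from standard input and assume the 'qstat -f' and 'ps -fH' output formats shown in the steps above.

```shell
#!/bin/bash
# Hypothetical helpers; they assume the output formats shown in steps 1 and 3.

# Print the MOM node name from `qstat -f <jobid>` output read on stdin.
mom_node_from_qstat() {
    awk -F' = ' '/login_node_id/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Print the PID of the top-level aprun process from `ps -fH` output on stdin.
# The top-level aprun is the one whose parent is not itself an aprun process.
# Assumes the default `ps -f` columns: UID PID PPID C STIME TTY TIME CMD.
aprun_pid_from_ps() {
    awk '
        $8 == "aprun" { seen[$2] = 1; parent[$2] = $3; order[++n] = $2 }
        END {
            for (i = 1; i <= n; i++) {
                p = order[i]
                if (!(parent[p] in seen)) { print p; exit }
            }
        }'
}

# Usage on the system (not run here):
#   node=$(qstat -f 722272 | mom_node_from_qstat)   # on a login node
#   pid=$(ps -fH | aprun_pid_from_ps)               # on the MOM node
```

Reading from standard input keeps the parsing separate from the commands themselves, so the helpers can be checked against saved output before being used on a live job.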
Another useful tool in the same vein is ATP (Abnormal Termination Processing), developed by Cray. ATP gathers stack backtraces when the code crashes, by running STAT before the application exits.
The 'atp' module is loaded by default on Cray systems, but it is not enabled. To enable it so that it generates stack backtrace information upon a failure, set the following environment variable before your aprun command in your batch script:
setenv ATP_ENABLED 1      # for csh/tcsh
export ATP_ENABLED=1      # for bash/sh/ksh
In addition, you need to set the 'FOR_IGNORE_EXCEPTIONS' environment variable if you're using Fortran and you have built with the Intel compiler:
setenv FOR_IGNORE_EXCEPTIONS true      # for csh/tcsh
export FOR_IGNORE_EXCEPTIONS=true      # for bash/sh/ksh
If your Fortran code is built with the GNU compiler, you will need to link with the '-fno-backtrace' option.
When the atp module is loaded, no core file is generated by default. However, you can get core dumps (core.atp.<apid>.<rank>) by setting coredumpsize to unlimited:
unlimit coredumpsize      # for csh/tcsh
ulimit -c unlimited       # for bash/sh/ksh
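The settings above can be collected in one place. The following is a minimal bash sketch. The function name atp_setup and its 'intel' and 'cores' options are hypothetical, while the variable names and values are the ones given above. It assumes a bash batch script, and it treats a failure to raise the core file size limit as non-fatal, since that depends on the limits in effect on the system.

```shell
#!/bin/bash
# Hypothetical helper: apply the ATP-related settings described above.
# Usage: atp_setup [intel] [cores]
#   intel  also set FOR_IGNORE_EXCEPTIONS (Fortran built with the Intel compiler)
#   cores  also request core dumps by lifting the core file size limit
atp_setup() {
    export ATP_ENABLED=1
    local opt
    for opt in "$@"; do
        case "$opt" in
            intel)
                export FOR_IGNORE_EXCEPTIONS=true
                ;;
            cores)
                ulimit -c unlimited 2>/dev/null ||
                    echo "warning: could not raise core file size limit" >&2
                ;;
        esac
    done
}

# Usage in a bash batch script, before the aprun line (not run here):
#   atp_setup intel cores
#   aprun -n 4 ./jacobi_mpi
```

The function must be called in the batch script's own shell rather than in a subshell, so that the exported variables are visible to aprun.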
More information can be found in the man page: type 'man intro_atp' or, simply, 'man atp'.
The following tests ATP using an example code from the ATP distribution package:
% cp $ATP_HOME/demos/testMPIApp.c .
% cc -o testMPIApp testMPIApp.c
% cat runit
#!/bin/csh
#PBS -l mppwidth=24
#PBS -l walltime=5:00
#PBS -q debug
#PBS -j oe

cd $PBS_O_WORKDIR
setenv ATP_ENABLED 1
aprun -n 8 ./testMPIApp 1 4
% qsub runit
714152.edique02
% cat runit.o714152
…
Application 2885291 is crashing. ATP analysis proceeding...
Stack walkback for Rank 4 starting:
  _start@start.S:113
  firstname.lastname@example.org:226
  main@0x400ffb
  email@example.com:41
Stack walkback for Rank 4 done
Process died with signal 4: 'Illegal instruction'
View application merged backtrace tree with: statview atpMergedBT.dot
You may need to: module load stat
…
ATP creates merged backtrace files in DOT format, atpMergedBT.dot and atpMergedBT_line.dot. The latter also shows source line numbers. To view the collected backtraces, load the 'stat' module and run 'STATview':
% module load stat
% STATview atpMergedBT.dot      # or statview atpMergedBT_line.dot
ATP can also be useful for debugging a hung application. You can force ATP to generate backtraces for a hung application by killing it. For this to work, the necessary preparation, such as setting the ATP_ENABLED environment variable, must already have been done in the batch script for the job in question.
% apstat                      # find the apid
…
   Apid  ResId  User  PEs Nodes   Age State Command
2885161 140092 wyang    4     1 0h02m   run jacobi_mpi
…
% apkill 2885161              # kill the application
% cat runit.o714080
…
aprun: Apid 2885161: Caught signal Terminated, sending to application
…
Process died with signal 15: 'Terminated'
View application merged backtrace tree with: statview atpMergedBT.dot
…
The above example uses SIGTERM to kill the application. ATP accepts other signals as well. For details, see the atp and apkill man pages.
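Looking up the apid by eye can also be scripted. The following is a minimal bash sketch, assuming the apstat column layout shown above (Apid in the first column, Command in the last). The function name apid_for_command is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the Apid of every application whose Command
# column equals the given name. Reads `apstat` output on stdin and assumes
# the layout shown above (Apid in column 1, Command in the last column).
apid_for_command() {
    awk -v cmd="$1" '$1 ~ /^[0-9]+$/ && $NF == cmd { print $1 }'
}

# Usage on the system (not run here):
#   apid=$(apstat | apid_for_command jacobi_mpi)
#   apkill "$apid"      # sends SIGTERM, as in the example above
```

If several applications with the same command name are running, the function prints one apid per line, so check the result before passing it to apkill.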