NERSC: Powering Scientific Discovery Since 1974

STAT and ATP

STAT

STAT (the Stack Trace Analysis Tool) is a highly scalable, lightweight tool that gathers and merges stack traces from all of the processes of a parallel application. The results are then presented graphically as a call tree showing the location that each process is executing.

It supports only distributed-memory parallel programming models such as MPI, Coarray Fortran, and UPC (Unified Parallel C).

STAT needs to be run on a MOM (batch management) node.  The steps are:

1. Find the MOM node your job is on by running the following command:

% qstat -f <jobid> | grep login_node_id 

For example:

% qstat -f 722272 | grep login_node_id
login_node_id = nid02051

2. Log in to the MOM node that launched your application:

% ssh -XY nid02051

3. Find the process id of the aprun command for your application:

% ps -fH
UID        PID  PPID  C STIME TTY          TIME CMD
wyang    23953 16045  0 Feb01 pts/0    00:00:00 aprun -n 4 ./jacobi_mpi
wyang    23961 23953  0 Feb01 pts/0    00:00:00   aprun -n 4 ./jacobi_mpi

So you see that the PID for the aprun command is 23953.

4. Load the 'stat' module:

% module load stat

5. Run the STAT command on this process. You may want to use the -i flag to gather source line numbers, too:

% STAT -i 23953
Attaching to application...
Attached!
...
Results written to /scratch1/scratch/wyang/parallel_jacobi/stat_results/jacobi_mpi.0010

STAT takes several backtrace samples after attaching to the running processes. The results are written to the 'stat_results' subdirectory of the current working directory. Within it, a subdirectory named after your parallel application's executable holds the merged stack trace file in DOT format.

6. Then run the GUI command 'STATview' on the generated *.dot file to visualize the stack backtrace information:

% STATview stat_results/jacobi_mpi.0010/jacobi_mpi.0010.3D.dot

In this example, the resulting call tree diagram reveals that rank 0 is in the 'init_fields' routine (line 172 of jacobi_mpi.f90), rank 3 is in the 'set_bc' routine (line 214 of the same source file), and the other ranks (1 and 2) are in the MPI_Sendrecv function. If this pattern persists across samples, the code is likely hung at these locations. With this information, you may want to use a full-fledged parallel debugger such as DDT or TotalView to find out why your code is stuck in these places.
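Steps 1 and 3 above can be partly scripted. The following is a minimal sketch, assuming the 'qstat -f' and 'ps -fH' output formats shown in the examples; the function names mom_node_from_qstat and top_aprun_pid are hypothetical helpers and are not part of STAT or the batch system.

```shell
#!/bin/bash
# Hypothetical helpers (names are illustrative, not part of STAT).

# Print the MOM node name from `qstat -f <jobid>` output read on stdin.
# Assumes a line of the form "login_node_id = nid02051".
mom_node_from_qstat() {
    awk -F' = ' '/login_node_id/ { gsub(/[[:space:]]/, "", $2); print $2; exit }'
}

# Print the PID of the top-level aprun from `ps -f` / `ps -fH` output read
# on stdin, i.e. the aprun whose parent is not itself an aprun process.
# Assumes the columns UID PID PPID C STIME TTY TIME CMD.
top_aprun_pid() {
    awk '$8 ~ /(^|\/)aprun$/ { parent[$2] = $3 }
         END { for (p in parent) if (!(parent[p] in parent)) print p }'
}

# Possible usage (commented out because it needs a running job):
#   node=$(qstat -f 722272 | mom_node_from_qstat)
#   ssh -XY "$node"
#   # then, on the MOM node:
#   module load stat
#   STAT -i "$(ps -fH | top_aprun_pid)"
```

If more than one application is running under the same terminal, top_aprun_pid prints one PID per line, so check the output before passing it to STAT.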

ATP

Another useful tool in the same vein is ATP (Abnormal Termination Processing), developed by Cray. ATP gathers stack backtraces when the code crashes by running STAT before the application exits.

The 'atp' module is loaded by default on Cray systems, but it is not enabled. To enable it so that it generates stack backtrace information upon a failure, set the following environment variable before the aprun command in your batch script:

setenv ATP_ENABLED 1          # for csh/tcsh

export ATP_ENABLED=1          # for bash/sh/ksh

In addition, you need to set the 'FOR_IGNORE_EXCEPTIONS' environment variable if you're using Fortran and you have built with the Intel compiler:

setenv FOR_IGNORE_EXCEPTIONS true   # for csh/tcsh

export FOR_IGNORE_EXCEPTIONS=true   # for bash/sh/ksh

If your Fortran code is built with the GNU compiler, you will need to link with the '-fno-backtrace' option.

When atp is loaded, no core file will be generated. However, you can get core dumps (core.atp.<apid>.<rank>) if you set coredumpsize to unlimited:

unlimit coredumpsize   # for csh/tcsh

ulimit -c unlimited    # for bash/sh/ksh

More information can be found in the man page: type 'man intro_atp' or, simply, 'man atp'.
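Taken together, the settings described above might look as follows in a bash batch script. This is only a sketch: the FOR_IGNORE_EXCEPTIONS line applies only to Fortran codes built with the Intel compiler, and the fallback message on the ulimit line is an addition made here for illustration.

```shell
#!/bin/bash
# Sketch of the ATP-related settings described above, for bash/sh/ksh.

# Enable ATP so that it produces stack backtraces when the application fails.
export ATP_ENABLED=1

# Only needed for Fortran codes built with the Intel compiler.
export FOR_IGNORE_EXCEPTIONS=true

# Optional: allow core.atp.<apid>.<rank> core dumps. Raising the limit can
# fail if the system's hard limit is lower, so report that and continue.
ulimit -c unlimited 2>/dev/null || echo "could not raise core dump size limit" >&2

# The aprun command for your application would follow here.
```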

The following example tests ATP using a sample code available in the ATP distribution package:

% cp $ATP_HOME/demos/testMPIApp.c .
% cc -o testMPIApp testMPIApp.c

% cat runit
#!/bin/csh
#PBS -l mppwidth=24
#PBS -l walltime=5:00
#PBS -q debug
#PBS -j oe
cd $PBS_O_WORKDIR
setenv ATP_ENABLED 1
aprun -n 8 ./testMPIApp 1 4

% qsub runit
714152.edique02

% cat runit.o714152
…
Application 2885291 is crashing. ATP analysis proceeding...

Stack walkback for Rank 4 starting:
  _start@start.S:113
  __libc_start_main@libc-start.c:226
  main@0x400ffb
  raise@pt-raise.c:41
Stack walkback for Rank 4 done
Process died with signal 4: 'Illegal instruction'
View application merged backtrace tree with: statview atpMergedBT.dot
You may need to: module load stat
…

ATP creates merged backtrace files in DOT format, atpMergedBT.dot and atpMergedBT_line.dot. The latter also shows source line numbers. To view the collected backtrace result, load the 'stat' module and run 'STATview':

% module load stat
% STATview atpMergedBT.dot   # or STATview atpMergedBT_line.dot
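To check quickly whether ATP reported anything in a job output file before opening the GUI, the messages can be filtered with grep. The following is a minimal sketch, assuming the message wording shown in the sample output above; atp_summary is a hypothetical helper name, not an ATP command.

```shell
#!/bin/bash
# Hypothetical helper: print the ATP status lines found in a job output file.
# The patterns are taken from the sample output shown above.
atp_summary() {
    grep -E 'ATP analysis proceeding|Stack walkback for Rank [0-9]+ starting|Process died with signal' "$1"
}

# Possible usage:
#   atp_summary runit.o714152
```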

ATP can also be useful for debugging a hung application. You can force ATP to generate backtraces for a hung application by killing it. For this to work, the batch script for the job in question must already contain the necessary preparation, such as setting the ATP_ENABLED environment variable.

% apstat               # find the apid
…
 Apid  ResId     User  PEs Nodes    Age State       Command
2885161 140092    wyang    4     1  0h02m   run    jacobi_mpi
…
% apkill 2885161       # kill the application
% cat runit.o714080
…
aprun: Apid 2885161: Caught signal Terminated, sending to application
…
Process died with signal 15: 'Terminated'
View application merged backtrace tree with: statview atpMergedBT.dot
…

The above example uses SIGTERM to kill the application. ATP accepts other signals as well. For details, please read the atp and apkill man pages.
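Looking up the apid in the apstat output can also be scripted. The following is a minimal sketch, assuming the apstat column layout shown above (Apid in the first column, command name in the last); apid_for_command is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: print the Apid(s) of a given command from `apstat`
# output read on stdin. Header and other non-numeric lines are skipped.
apid_for_command() {
    awk -v cmd="$1" '$1 ~ /^[0-9]+$/ && $NF == cmd { print $1 }'
}

# Possible usage (commented out because it needs a running application):
#   apid=$(apstat | apid_for_command jacobi_mpi)
#   apkill "$apid"
```

If several applications share the same command name, the helper prints one apid per line, so check the result before passing it to apkill.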

Availability

Package  Platform  Category              Version  Module        Install Date  Date Made Default
stat     hopper    applications/general  2.1.0.1  stat/2.1.0.1  2014-06-12    2014-06-12

Package  Platform  Category                Version  Module     Install Date  Date Made Default
atp      edison    applications/debugging  1.7.1    atp/1.7.1  2013-12-13    2013-12-10
atp      edison    applications/debugging  1.7.2    atp/1.7.2  2014-04-24    2014-04-24
atp      hopper    applications/general    1.5.0    atp/1.5.0  2012-07-11    2012-08-15
atp      hopper    applications/general    1.5.2    atp/1.5.2  2012-11-29
atp      hopper    applications/general    1.6.0    atp/1.6.0  2013-01-15    2013-02-27
atp      hopper    applications/general    1.6.2    atp/1.6.2  2013-05-24    2013-06-20
atp      hopper    applications/general    1.7.0    atp/1.7.0  2013-09-25    2013-12-11
atp      hopper    applications/general    1.7.1    atp/1.7.1  2013-12-18
atp      hopper    applications/general    1.7.3    atp/1.7.3  2014-06-12