NERSC: Powering Scientific Discovery Since 1974

GDB and ATP

GDB and ATP are two tools for quickly finding out where a code was executing when it crashed. GDB examines the core file produced by the crash, while ATP prints a stack backtrace as the application terminates. Both give only an approximate answer.

Run gdb with the following syntax:

gdb executable-file core-file

After gdb starts, use the "backtrace" command. For example:

% gdb mpi-hello core
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-suse-linux"...
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `./mpi-hello'.
Program terminated with signal 8, Arithmetic exception.
#0 0x0000000000400453 in MAIN () at ./mpi-hello.f:6
6     CALL MPI_INIT( ierr )
(gdb) backtrace
#0 0x0000000000400453 in MAIN () at ./mpi-hello.f:6
#1 0x0000000000400330 in main ()
Current language: auto; currently fortran

Keep in mind that the line numbers reported after a crash are imprecise. The line reported above, CALL MPI_INIT(), is probably not the line at which the arithmetic (floating point) exception actually occurred. Consult the gdb man page for more information on this utility.
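
The same backtrace can be collected without entering the interactive prompt, which is convenient at the end of a batch script. The following is a minimal sketch, assuming a gdb version that supports the -batch and -ex options (older releases may not); the function name core_backtrace is hypothetical and is not part of gdb.

```shell
# Hypothetical helper (not part of gdb): print a backtrace from a core file
# and exit, instead of typing "backtrace" at the (gdb) prompt.
core_backtrace() {
    exe=$1
    core=$2
    # Refuse to start gdb if either file is missing or unreadable.
    if [ ! -r "$exe" ] || [ ! -r "$core" ]; then
        echo "usage: core_backtrace executable-file core-file" >&2
        return 2
    fi
    # -batch: run the given commands, then exit
    # -ex:    execute one gdb command, here "backtrace"
    gdb -batch -ex backtrace "$exe" "$core"
}
```

For the session shown above, the call would be: core_backtrace mpi-hello core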

Another useful debugging tool is "atp" (Abnormal Termination Processing), developed by Cray. As with gdb, the line numbers it reports are not accurate. To use it:

1) module load atp, if necessary; this module is loaded by default on Hopper
2) compile and link your code from scratch (relink is required)
3) in your batch script, add the following before your aprun command:
     setenv ATP_ENABLED 1


When ATP is enabled, no core file is generated; instead, a backtrace is written to stderr. More information can be found in the man page: type 'man intro_atp' or simply 'man atp'.
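
The setenv line in step 3 is csh/tcsh syntax. For batch scripts written in sh or bash, the equivalent can be sketched as below. The wrapper name run_with_atp is hypothetical and only illustrates setting the variable for a single launch; it is not something ATP provides.

```shell
# sh/bash counterpart of the csh line "setenv ATP_ENABLED 1" would be:
#   export ATP_ENABLED=1

# Hypothetical wrapper (not part of ATP): run one command with ATP_ENABLED=1
# in its environment, leaving the rest of the script's environment unchanged.
run_with_atp() {
    env ATP_ENABLED=1 "$@"
}

# In a batch script this would stand in for the plain aprun line, e.g.:
#   run_with_atp aprun -n 24 ./segfault
```

Setting the variable per command rather than with export keeps ATP from being enabled for other launches in the same script; either form should work for a single aprun.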

The following example uses an interactive session on a "MOM" node.

% setenv ATP_ENABLED 1
% aprun -n 24 ./segfault
Application 6243573 is crashing. ATP analysis proceeding...

Stack walkback for Rank 2 starting:
  _start@start.S:113
  __libc_start_main@libc-start.c:226
  main@0x40071f
  MAIN_@segfault.f:9
  sub_@segfault.f:21
  pghpf_rnum@0x4d1f95
  prng_loop_r_lf@0x4d1cd3
Stack walkback for Rank 2 done
Process died with signal 11: 'Segmentation fault'
View application merged backtrace tree file 'atpMergedBT.dot' with 'statview'
You may need to 'module load stat'.

_pmiu_daemon(SIGCHLD): [NID 02313] [c12-0c0s4n1] [Wed Mar  7 17:30:23 2012] PE RANK 1 exit signal Killed
[NID 02313] 2012-03-07 09:30:23 Apid 6243573: initiated application termination
Application 6243573 exit codes: 137
Application 6243573 resources: utime ~0s, stime ~1s
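
In a walkback like the one above, only some frames carry a source file and line number (name@file:line); the rest are raw addresses (name@0x...). The sketch below filters saved stderr output down to the frames with source locations, so the deepest user-code frame is easy to spot. It assumes the output format shown in this example; the function name atp_source_frames is hypothetical.

```shell
# Hypothetical helper (not part of ATP): from ATP's stderr output, print only
# the walkback frames that name a source file and line number.
# Reads the files given as arguments, or standard input if there are none.
atp_source_frames() {
    awk '
        /^Stack walkback for Rank [0-9]+ starting:/ { inside = 1; next }
        /^Stack walkback for Rank [0-9]+ done/      { inside = 0 }
        # Keep frames of the form name@file:line; skip name@0x... frames.
        inside && /@[^@]*:[0-9]+[ \t]*$/ { sub(/^[ \t]+/, ""); print }
    ' "$@"
}

# Example, assuming stderr was saved to a file named atp.err:
#   atp_source_frames atp.err
```

For the walkback above, the last frame printed would be sub_@segfault.f:21, which is the deepest frame in the user's own source file. The reported line is still approximate, as noted earlier.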