Debugging and Profiling
NERSC provides many popular debugging and profiling tools. Some of them are general-purpose tools and others are geared toward more specific tasks.
A quick guideline on when to use which debugging tool is as follows:
- DDT and TotalView: general-purpose parallel debuggers that let users interactively control the pace of a program's execution through a graphical user interface
- gdb: serial command-line debugger; useful for quickly examining core files to see where a code crashed (DDT and TotalView can be used for this purpose, too)
- STAT: obtains call backtraces for all parallel tasks from a live parallel application and displays them graphically as a call tree showing where each task is executing; useful for debugging a hung application
- ATP: generates call backtraces for all parallel tasks when a code crashes; a good starting point when a code crashes with little hint left behind
- CCDB and lgdb: their distinguishing feature is comparative debugging, running two versions of a code side by side (e.g., a working version and an incorrect one, or the same code run with two different numbers of tasks) to find where the two runs start to generate diverging results
- Valgrind: a suite of debugging and profiling tools; the best-known tool is Memcheck, which detects memory errors and memory leaks; other tools include a cache profiler, a heap profiler, and more
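STAT and ATP both gather one backtrace per task; much of their value comes from merging many near-identical stacks so that the few tasks doing something different stand out. The shell sketch below shows that grouping idea in a much reduced form. It is not STAT's or ATP's actual output format or interface: the input layout (one line per task with the rank and a `;`-joined stack, outermost frame first) and the helper name `group_stacks` are hypothetical.

```shell
# group_stacks (hypothetical helper): read lines of the form
#   <rank> <frame1;frame2;...>      (outermost frame first)
# on stdin and print one line per distinct stack:
#   <number of tasks> <stack> <comma-separated ranks>
# with the most common stack first.
group_stacks() {
    LC_ALL=C awk '
        {
            count[$2]++
            # Append this rank to the list for its stack, in input order.
            if (count[$2] == 1) ranks[$2] = $1
            else ranks[$2] = ranks[$2] "," $1
        }
        END {
            for (s in count) print count[s], s, ranks[s]
        }
    ' | LC_ALL=C sort -k1,1nr -k2,2
}

# Example: a made-up hang where rank 2 never reaches the collective call.
printf '%s\n' \
    "0 main;solver;MPI_Allreduce" \
    "1 main;solver;MPI_Allreduce" \
    "2 main;solver;compute_flux" \
    "3 main;solver;MPI_Allreduce" | group_stacks
# 3 main;solver;MPI_Allreduce 0,1,3
# 1 main;solver;compute_flux 2
```

In this example three ranks wait in `MPI_Allreduce` while rank 2 is still in `compute_flux`, which is the kind of outlier worth inspecting first.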
A quick guideline on when to use which performance analysis tool:
- IPM: a low-overhead, easy-to-use tool for collecting hardware counter data, MPI function timings, and memory usage
- CrayPat: a suite of sophisticated Cray tools for more detailed performance analysis, showing per-routine hardware counter data, MPI message statistics, I/O statistics, etc.; in addition to collecting performance data by sampling, it can trace selected routines (including library routines) for a better understanding of the performance statistics associated with them
- MAP: a sampling tool for performance metrics; time series of the data collected over the entire run are displayed graphically, and source code lines are annotated with performance metrics
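Sampling tools such as MAP (and CrayPat when it collects data by sampling) periodically record where the program is; the share of samples that land in a routine estimates the share of run time spent there. The sketch below shows only that arithmetic, assuming a hypothetical input of one sampled routine name per line and a known total run time. The helper name `estimate_profile` is illustrative and does not correspond to any tool's interface.

```shell
# estimate_profile TOTAL_SECONDS (hypothetical helper): read one sampled
# routine name per line on stdin and print, per routine,
#   <routine> <samples> <share of samples> <estimated seconds>
# with the most frequently sampled routine first.
estimate_profile() {
    LC_ALL=C awk -v total="$1" '
        { n[$1]++; all++ }
        END {
            for (f in n)
                printf "%s %d %.1f%% %.2fs\n", f, n[f], 100 * n[f] / all, total * n[f] / all
        }
    ' | LC_ALL=C sort -k2,2nr
}

# Example: 10 samples from a 20-second run (made-up numbers).
printf '%s\n' compute compute compute compute compute compute \
    MPI_Waitall MPI_Waitall MPI_Waitall write | estimate_profile 20
# compute 6 60.0% 12.00s
# MPI_Waitall 3 30.0% 6.00s
# write 1 10.0% 2.00s
```

With few samples, the estimate for rarely hit routines is statistically noisy, so a routine seen only once or twice should not be read too precisely.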
For more information about how to use a tool, see the relevant item below.
Parallel Debugging with lgdb
lgdb (Cray Line Mode Parallel Debugger) is a GDB-based parallel debugger developed by Cray. It lets programmers either launch an application or attach to an already-running application that was launched with aprun, and debug the parallel code in command-line mode. These features can be useful, but you will probably want to use a more powerful GUI-based debugger instead. Below is an example of running lgdb for a parallel application:
% qsub -IV -lmppwidth=24 -q…
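The comparative debugging offered by CCDB and lgdb compares two runs side by side to find where their results start to diverge. A much cruder offline analog, sketched below, assumes both runs print one number per line (for example, a residual per iteration) to log files and finds the first line where the two outputs differ by more than a tolerance, which hints at where to start setting breakpoints. The helper name `first_divergence`, the log file names, and the default tolerance of `1e-12` are hypothetical choices, not part of lgdb.

```shell
# first_divergence FILE_A FILE_B [TOL] (hypothetical helper): both files hold
# one number per line. Print the first line number where |a - b| > TOL
# (default 1e-12); print nothing if the runs agree on every line the two
# files share. Assumes FILE_A is non-empty.
first_divergence() {
    LC_ALL=C awk -v tol="${3:-1e-12}" '
        # First file: remember each value by line number.
        NR == FNR { a[FNR] = $1; next }
        # Second file: stop once it runs past the end of the first file.
        !(FNR in a) { exit }
        {
            d = $1 - a[FNR]
            if (d < 0) d = -d
            if (d > tol + 0) { print FNR; exit }
        }
    ' "$1" "$2"
}

# Usage (hypothetical file names):
#   first_divergence good_run.log bad_run.log 1e-8
```

Unlike CCDB or lgdb, this only sees what each run chose to print, so it narrows down the iteration where results diverge rather than the exact variable.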
Allinea MAP is a parallel profiler with a simple GUI. It can be run with up to 512 processors. perf-report, a new tool from Allinea that may be available for a limited time, characterizes code performance by the percentage of wall time spent in each of several performance-metric categories.