CCDB and lgdb
lgdb (Cray Line Mode Parallel Debugger)
lgdb is Cray's line-mode parallel debugger. It also supports Cray's comparative debugging feature.
From Cray's release notes: lgdb is a GDB-based parallel debugger used to debug applications compiled with CCE, PGI, and GNU Fortran, C and C++ compilers. It allows programmers to either launch an application or attach to an already-running application that was launched with aprun. Additionally, preliminary comparative debugger technology has been introduced that enables programmers to compare data structures between two executing applications. Some features of lgdb include:
- Command line parallel debugger allows for launching/attaching applications via aprun.
- Utilizes process sets to operate on a subset of application ranks.
- gdb-like feel; also implements a gdbmode to enable a true parallel gdb.
- Prototype of Cray Comparative Debugging Technology.
- OpenACC debugging support.
- Workload manager supported with launch command.
To use, first load the module:
% module load cray-lgdb
Then see the man page (man lgdb) for usage information. Cray also has a web page that documents the comparative debugging feature.
CCDB (Cray Comparative Debugger)
CCDB is a GUI tool for comparative debugging, which runs on top of lgdb. It allows users to compare data structures between two executing applications.
% qsub -IV -lmppwidth=48    # request enough nodes for launching two applications
% cd $PBS_O_WORKDIR
% module load cray-ccdb
% ccdb
Launch two applications from the CCDB window.
To compare data between two applications, CCDB needs to know the name of the variable, the location where the comparison is to be made, and how the data is distributed over MPI processes. For this, CCDB uses three entities:
- PE Set: A set of MPI processes
- Decomposition: How a variable is distributed over the MPI processes in the PE set
- Assertion script: A collection of mathematical relationships (e.g., equality) to be tested
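The way these three entities fit together can be illustrated with a small Python sketch. This is not CCDB syntax or any Cray API. It is a conceptual model only, and it assumes a simple block decomposition in which each rank holds a contiguous chunk of the global array. The names gather_block and compare_variable, and the sample data, are hypothetical.

```python
import math

def gather_block(chunks_by_rank, pe_set):
    """Rebuild the global array from per-rank chunks.

    Assumes a block decomposition: concatenating the chunks in
    ascending rank order over the PE set yields the global array.
    """
    global_array = []
    for rank in sorted(pe_set):
        global_array.extend(chunks_by_rank[rank])
    return global_array

def compare_variable(name, app0, pe_set0, app1, pe_set1, rel_tol=0.0):
    """Assert equality of one variable between two applications.

    Returns a list of (global_index, value_in_app0, value_in_app1)
    for every element that differs; an empty list means the
    assertion holds.
    """
    a = gather_block(app0[name], pe_set0)
    b = gather_block(app1[name], pe_set1)
    if len(a) != len(b):
        raise ValueError("global sizes differ: %d vs %d" % (len(a), len(b)))
    return [
        (i, x, y)
        for i, (x, y) in enumerate(zip(a, b))
        if not math.isclose(x, y, rel_tol=rel_tol, abs_tol=0.0)
    ]

# Hypothetical data: the same 4-element variable "x", held by
# 2 ranks in one application and by 4 ranks in the other.
app0 = {"x": {0: [1.0, 2.0], 1: [3.0, 4.0]}}
app1 = {"x": {0: [1.0], 1: [2.0], 2: [3.5], 3: [4.0]}}

mismatches = compare_variable("x", app0, {0, 1}, app1, {0, 1, 2, 3})
for index, v0, v1 in mismatches:
    print("x[%d] differs: %r vs %r" % (index, v0, v1))
```

The sketch shows why a decomposition is needed in addition to a PE set: the two applications may run on different numbers of processes, so the per-rank pieces have to be mapped back to a common global index space before an assertion such as equality can be tested.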
Below is an assertion script which tests whether six variables have the same values in both applications at line 418 of HPL_pdtest.c. It shows that resid0 and XnormI have different values between the applications, and therefore both applications have stopped at line 418.