From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon May 26 2008 - 12:35:34 PDT
Parviz,
BLCR is not able to save/restore the association between the debugger
and the executable, making what you are trying slightly difficult (but
hopefully not impossible). For that reason, in the 0.7.0 release (due
out soon) the default behavior will be to refuse to checkpoint while a
debugger is attached (an additional option will need to be specified to
allow the checkpoint in such a case). In neither the 0.6.x nor the 0.7.0
release will checkpointing gdb and the debugged process together (as a
process group, process tree, etc.) work. If it did, your task would have
been much easier (just "cr_checkpoint <pid-of-gdb>").
The Trace/BPT trap you see is the restarted executable executing a
breakpoint (bpt) trap instruction that the debugger inserted. Since at
restart time no debugger is attached, the trap is a fatal error. The
problem is that any breakpoint trap instruction written by the first gdb
is still present in the checkpointed process, having replaced
instruction(s) in the process. When gdb wrote that instruction into
process memory, it would have saved the original instruction byte in its
own memory (to restore when executing past the breakpoint, or when
removing it). However that information was lost when the first gdb
exited. This doesn't appear to have a good solution other than deleting
all breakpoints before you take the checkpoint. If you consult a gdb
expert (I am not one) you may be able to get gdb to print all the
breakpoint data in a form that can be fed back into the new gdb (or
perhaps you only have one at this stage). So, I recommend the following
steps:
1) Run under control of gdb until it stops at your "safe" breakpoint
2) Delete all breakpoints/watchpoints
3) checkpoint the process (may require you to "c" in response to the
BLCR-generated signal)
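
To illustrate why a leftover breakpoint is fatal at restart, here is a
minimal Python sketch. It does not use BLCR or gdb at all. It only
assumes a POSIX system and stands in for the restarted executable by
having a child process raise SIGTRAP on itself with no debugger
attached. The function name run_child_that_traps is made up for this
example.

```python
import signal
import subprocess
import sys


def run_child_that_traps():
    """Run a child that raises SIGTRAP on itself; return its exit status."""
    # Stand-in for "the restarted executable hits a breakpoint trap":
    # the child sends itself SIGTRAP, and no debugger is there to catch it.
    child_code = "import os, signal; os.kill(os.getpid(), signal.SIGTRAP)"
    proc = subprocess.run([sys.executable, "-c", child_code])
    return proc.returncode


if __name__ == "__main__":
    rc = run_child_that_traps()
    # subprocess reports "killed by signal N" as a return code of -N.
    if rc == -signal.SIGTRAP:
        print("child was killed by SIGTRAP (no debugger attached)")
    else:
        print("unexpected return code:", rc)
```

With a debugger attached, the same signal would be intercepted by the
debugger rather than terminating the process, which is why the trap is
harmless in the original gdb session but fatal after restart.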
At restart time there is the question of attaching gdb "soon enough" to
regain control before the buggy code runs. Since we had to remove all
the breakpoints, there seems to be nothing preventing the code from
executing normally, bugs and all. If you are restarting from a point
early enough (say 1 minute or more) before your suspected bug then you
can probably just restart and then attach gdb "fast enough". If you are
too slow it costs you little to try again. However, it might not be
possible to do that in general. To deal with that one can try passing
"--stop" to the cr_restart command, which will freeze the executable
(with a SIGSTOP) immediately on restart (before returning control to the
point where BLCR interrupted execution). That should allow you to
attach a debugger, which then may need to send SIGCONT to the process to
resume execution. However, I am not sure that gdb will correctly attach
to a STOPed process. In my experiments there were some cases where "gdb
<executable> <pid>" appeared to hang when the process was STOPed in this
manner. If so, try sending a SIGCONT from another window/terminal
("kill -CONT <pid>"); hopefully that will resolve it, but it didn't
always do so for me. I think this depends on the gdb and/or kernel
release. In short, my recommendation if "attach gdb fast enough" isn't
possible is:
1) Restart with the "--stop" command line option to freeze the process
2) Attach gdb to the restarted-but-stopped process
3) Send SIGCONT, either from gdb (if it attached OK) or from a command
line (if gdb looks "stuck").
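
The freeze/resume mechanics behind those steps can be tried without
BLCR. The Python sketch below is only an analogy under stated
assumptions: a sleeping child process stands in for the restarted
executable, SIGSTOP stands in for what "--stop" does, and the SIGCONT
is the equivalent of "kill -CONT <pid>". It assumes a POSIX system
where waitpid supports WUNTRACED and WCONTINUED. The function name
stop_and_continue is made up for this example.

```python
import os
import signal
import subprocess
import sys


def stop_and_continue():
    """Freeze a child with SIGSTOP, resume it with SIGCONT, report both."""
    # Stand-in for the restarted executable: a child that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    try:
        # Freeze the child, as a stopped-on-restart process would be.
        os.kill(child.pid, signal.SIGSTOP)
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        was_stopped = os.WIFSTOPPED(status)

        # This is the window in which a debugger would be attached.

        # Resume the child; same effect as "kill -CONT <pid>".
        os.kill(child.pid, signal.SIGCONT)
        _, status = os.waitpid(child.pid, os.WCONTINUED)
        was_continued = os.WIFCONTINUED(status)
    finally:
        # Clean up the stand-in process.
        child.kill()
        child.wait()
    return was_stopped, was_continued


if __name__ == "__main__":
    stopped, continued = stop_and_continue()
    print("stopped:", stopped, "continued:", continued)
```

The sketch says nothing about whether a given gdb release attaches
cleanly to a stopped process; it only shows that the process stays
frozen until something sends it SIGCONT.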
Hope this helps. Let us know if the instructions above do or do not
work for you. Perhaps you'd be interested in helping to write up a
"mini howto" based on your experiences?
-Paul
Parviz Fariborz wrote:
>
> Hi,
>
> I am trying to use blcr to shorten the debug time for a large
> executable. I have described the approach that I have taken and the
> issues that I ran into below. Perhaps someone in this mailing list has
> done the same and can give me some guidance.
>
> When debugging a long running executable in gdb (multiple hours), I
> want to use blcr to checkpoint the running executable at a breakpoint
> close to the problem area where I can safely assume things are in good
> state. In the next round of debugging, instead of running the
> executable in gdb, I want to re-start the checkpoint and attach the
> gdb to running process. This gets me to the point of interest a lot
> faster.
>
> My questions are : Is it possible to stop a running process in gdb at
> a breakpoint and create a checkpoint? I tried it and was able to
> create the checkpoint file, But the re-start always failed with the
> following message :
>
> .Trace/BPT trap
>
> Also, is there a better approach? If so, please describe it.
>
> Thanks in advance for your help
>
> -Parviz
--
Paul H. Hargrove PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900