From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Apr 01 2008 - 12:07:43 PST
Yuan,
Sorry for the delay in getting back to you. I had to ask a colleague
to install R for me and then I left on travel about the time that was
finished.
I tried today with BLCR 0.6.5 and was able to checkpoint and restart
the script you provided. I verified that
/usr/lib64/gconv/gconv-modules.cache was mmapped (it was not when I had
LANG=C in my environment, but changing it to LANG=en_US.UTF-8 caused it to
be mmapped).
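
The mapping check described above can be scripted. What follows is a minimal
sketch, not part of BLCR: is_mapped is a hypothetical helper name, and it
assumes the Linux /proc/PID/maps layout in which the pathname is the last
whitespace-separated field (so paths containing spaces will not match).

```shell
# is_mapped MAPS_FILE TARGET_PATH  (hypothetical helper, for illustration only)
# Succeeds (exit status 0) if TARGET_PATH appears as the pathname of some
# mapping listed in MAPS_FILE, e.g. /proc/<pid>/maps; fails otherwise.
is_mapped() {
    awk -v target="$2" '
        $NF == target { found = 1 }   # last field of a maps line is the path
        END { exit !found }           # exit 0 only if a match was seen
    ' "$1"
}

# Example use against the current shell:
#   is_mapped /proc/$$/maps /usr/lib64/gconv/gconv-modules.cache && echo mapped
```

Whether a given program maps the cache at all depends on the program and on
the locale settings, so a negative result for one process says nothing about
another.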
Since I cannot reproduce your problem, I am not sure what I can do at
this point to help you. If you have any ideas about what makes your
system different, please let me know.
While not related to a "permission denied" error, it is worth noting
that your test script looks at wallclock time, which BLCR does not
"virtualize". So if I restart more than 180 seconds after the original
program began, then I get only a single ">" line as output. Not exactly
a problem, but I was confused by it initially.
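
To make the wallclock point concrete: the test script compares two readings
of the system clock, so time spent between checkpoint and restart counts as
elapsed. The sketch below is a hypothetical variant of the loop (not the
original script, and written in shell rather than R) that tracks progress
with a counter held in process memory instead. run_ticks and its parameters
are names made up for illustration; each tick is only approximately one
second, and the sketch makes no claim about how a checkpoint taken during
sleep behaves at restart.

```shell
# run_ticks TOTAL STEP [DELAY]  (hypothetical helper, for illustration only)
# Counts from 0 to TOTAL, pausing DELAY seconds (default 1) per tick and
# printing every STEP-th tick. Progress is the value of the counter i, which
# is ordinary process state, rather than a difference of wallclock readings.
run_ticks() {
    total=$1
    step=$2
    delay=${3:-1}
    i=0
    while [ "$i" -le "$total" ]; do
        if [ $((i % step)) -eq 0 ]; then
            echo "$i"
        fi
        sleep "$delay"
        i=$((i + 1))
    done
}

# Roughly comparable to the original test (print every 10, stop after 180):
#   run_ticks 180 10
```

Because the counter is saved and restored along with the rest of the
process, a restart should resume from the tick at which the checkpoint was
taken, however long the gap was.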
-Paul
Yuan Wan wrote:
>
>
> Paul,
> --------------------------------------------------------------------------------------
>
> $ ls -l /usr/lib64/gconv/gconv-modules.cache
> -rw-r--r-- 1 root root 21546 Oct 2 14:51 /usr/lib64/gconv/gconv-modules.cache
> $ tcsh -c 'cat /proc/$$/maps' | grep gconv
> 2a9892f000-2a98935000 r--s 00000000 08:01 522135 /usr/lib64/gconv/gconv-modules.cache
> ---------------------------------------------------------------------------------------
>
>
> I cannot see any difference in the permissions.
>
> Can you restart my test script from a checkpoint on your machine?
>
> -------------------------------------------
> #!/bin/sh
>
> PATHTOR=/usr/bin
> # Below, the phrase "EOF" marks the beginning and end of the HERE document.
> $PATHTOR/R --no-save <<EOF
> mod<-function (x, y)
> {
> x1 <- trunc(trunc(x/y) * y)
> z <- trunc(x) - x1
> z
> }
>
> z0 <- unclass(Sys.time())
>
> repeat{
>
> z1<-unclass(Sys.time())
> secs<-floor(z1-z0)
> if (mod(secs, 10)==0) print(secs)
> if ((secs)>180) break
>
> }
> EOF
>
> -------------------------------------------
>
>
>
> --Yuan
>
>
>
> On Fri, 14 Mar 2008, Paul H. Hargrove wrote:
>
>> Yuan,
>>
>> What do you get if you run the following two commands?
>> $ ls -l /usr/lib64/gconv/gconv-modules.cache
>> $ tcsh -c 'cat /proc/$$/maps' | grep gconv
>>
>> What I see is a world readable file and a shared read-only mmap in tcsh:
>> $ ls -l /usr/lib64/gconv/gconv-modules.cache
>> -rw-r--r-- 1 root root 21514 Jun 3 2005 /usr/lib64/gconv/gconv-modules.cache
>> $ tcsh -c 'cat /proc/$$/maps' | grep gconv
>> 2b8e36967000-2b8e3696d000 r--s 00000000 00:0f 9486631 /usr/lib64/gconv/gconv-modules.cache
>>
>> So, there shouldn't be a problem unless there is something different
>> about your system.
>>
>> -Paul
>>
>> Paul H. Hargrove wrote:
>>> Yuan,
>>>
>>> I've not seen that particular failure before, but some quick research
>>> indicates that gconv-modules.cache is a part of glibc and I suspect that
>>> it is getting mapped in much the same way as the NSCD file is. I will
>>> continue to look into the problem to see what BLCR might be able to do
>>> differently.
>>>
>>> -Paul
>>>
>>> Yuan Wan wrote:
>>>
>>>> Hi Paul,
>>>>
>>>> Thanks for replying.
>>>> The error message I got from /var/log/messages is the following:
>>>>
>>>> vmadump: mmap failed: /usr/lib64/gconv/gconv-modules.cache
>>>> thaw_threads returned error, aborting. -13
>>>>
>>>> The failure does not seem to be caused by NSCD. What do you think?
>>>>
>>>> --Yuan
>>>>
>>>>
>>>> On Mon, 10 Mar 2008, Paul H. Hargrove wrote:
>>>>
>>>>
>>>>> Yuan,
>>>>>
>>>>> The most likely cause is that the restart failed to open one of the
>>>>> files that was open()ed or mmap()ed at the time the checkpoint was
>>>>> taken.
>>>>> Based on the fact that you see this w/ a shell script, but not C code,
>>>>> my best guess is that you are encountering a problem with the file
>>>>> that the Name Service Cache Daemon (NSCD) uses. Please see the
>>>>> following FAQ entry for more detail (including what to look for in
>>>>> the system logs):
>>>>> http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#nscd
>>>>> The only known work-around is to remove NSCD from your system.
>>>>>
>>>>> -Paul
>>>>>
>>>>> Yuan Wan wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'm trying to restart my shell script jobs (bash and R) with BLCR, but
>>>>>> the restart fails with the following error:
>>>>>>
>>>>>> "Restart failed: Permission denied"
>>>>>>
>>>>>> I can checkpoint the job and get a context file. The restart
>>>>>> succeeds if executed by root but fails if run by a normal user. The
>>>>>> context file does belong to me, so I'm wondering where the
>>>>>> permission is required. I can also restart a C program as a regular
>>>>>> user without a problem.
>>>>>>
>>>>>> Does anyone know the possible reason? Thanks
>>>>>>
>>>>>> --Yuan
>>>>>>
>>>>>> Yuan Wan
>>>>>>
>>>>>
>>>>>
>>>
>>>
>>>
>>
>>
>>
>
--
Paul H. Hargrove PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900