
Open Issues

[PATCHED] python/2.7.4 gzip package fails

September 24, 2013 by Doug Jacobsen

The modules version of python (python/2.7.4) had a bug in the default gzip package, introduced in python 2.7.4 and fixed in python 2.7.5.  The symptom was a TypeError or struct.error raised when opening and reading a gzip-compressed file.  This has been corrected by installing the python 2.7.5 version of gzip.py into our python distribution.
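As a quick sanity check (a sketch only; the test filename shown here is hypothetical), you can confirm which gzip.py the module loads and that reading a compressed file no longer raises an error:

$ module load python/2.7.4
$ python -c "import gzip; print(gzip.__file__)"
$ python -c "import gzip; print(len(gzip.open('somefile.txt.gz', 'rb').read()))"

With the patched module the last command prints the uncompressed size; before the fix it could raise a TypeError or struct.error instead.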


[FIXED] JGI data loss in /projectb/sandbox area [purge]

August 19, 2013 by Kjiersten Fagnan

We have discovered a serious bug in our purge scripts on /global/projectb. The */global/projectb/sandbox* areas are supposed to be exempt from the purge (like the project directories); however, a bug in the purge script caused some files to be deleted if they had not been touched for 90+ days (as happens with data in the scratch directories). *The sandbox areas are not backed up*, so if this data was not in more than one location on disk or in HPSS, it has been lost. We have found the bug and have suspended the purge script until it is fixed.
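If you want a rough list of the sandbox files that would have matched the 90-day criterion, a find command along these lines can help (a sketch only; substitute your own sandbox path, and note that the purge may key on access rather than modification time):

$ find /global/projectb/sandbox/yourgroup -type f -atime +90 -ls

Anything on that list that exists nowhere else on disk or in HPSS is worth copying somewhere safe while the purge remains suspended.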


[FIXED] perl 5.16.0 File::Glob() causes crashes

August 15, 2013 by Doug Jacobsen

There is an issue with the default modules installation of perl in which the glob() function can crash the perl executable.  This happens when multiple (space-separated) patterns are matched in a single glob() call.
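For example (the file patterns here are purely illustrative), the first command below can crash the interpreter under perl/5.16.0, while splitting the patterns into separate glob() calls works around the problem:

$ perl -e 'my @files = glob("*.fastq *.fasta"); print "@files\n";'   # single call, multiple patterns: can crash
$ perl -e 'my @files = (glob("*.fastq"), glob("*.fasta")); print "@files\n";'   # one pattern per call: safe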


RESOLVED: Projectb filesystem outage July 9, 2012

July 9, 2012

 

The projectb filesystem had a hardware failure that potentially generated I/O errors.  The filesystem logs indicate that the earliest abnormal event on the filesystem occurred at 9:19AM and the filesystem was taken down for maintenance at 10:42AM.  The filesystem returned to service at 11:20AM.  Jobs running on the cluster would not have been able to read from or write to the projectb filesystem between 10:42AM and 11:20AM.
 
Between 9:19AM and 10:42AM, one of the 20 GPFS controllers on projectb was down and did not fail over as it should have.
 
This means that roughly 1 in 20 file I/O operations could have failed between 9:19AM and 10:42AM.
 
If your job was performing a large number of short reads and writes, there is a greater chance that it was affected.
 
Any data that was successfully written (to a complete file) is good.
 
Any data that was written with random I/O (e.g. fseek/fwrite) could be suspect and should be examined with care.
 
Please check any data that was written between 9:19AM and 10:42AM; one way to list candidate files is sketched below.
Please check any jobs that were performing I/O on projectb during this window.
Please file a ticket with NERSC if you need help.  (http://help.nersc.gov)
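If it helps, a command along these lines will list your files that were modified during the affected window (a sketch only; it requires GNU find's -newermt option, and you should substitute your own directory on projectb):

$ find /global/projectb/yourdir -type f -newermt "2012-07-09 09:19" ! -newermt "2012-07-09 10:42"

Files that turn up can then be spot-checked, for example by re-running a validation step or comparing against a copy in HPSS.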

Sincerely,
Doug Jacobsen
NERSC Bioinformatics Computing Consultant

 


Important notice about using /house

July 6, 2012

Description

There have been a number of issues recently with NFS hangs on the gpint machines.
The hangs have been traced to a defect in the Isilon filesystem software, triggered when a file that is being written is simultaneously opened for reading on the same host.
 
This most frequently happens when people tail a file that is being written on the same machine.
 
E.g.:  DO NOT DO THIS:
gpint17 $ ./somewritingProcess > outfile &
gpint17 $ tail -f outfile
 
This very common (and desirable) operation can hang the filesystem on the host that is reading and writing the file.  We are working with the vendor to correct this, but in the meantime a workaround is to read the file from a different machine than the one writing it:
 
E.g.: THIS IS OK
gpint17 $ ./somewritingProcess > outfile &
gpint17 $ ssh gpintXX
gpintXX $ tail -f outfile
 
Please note that these problems are not limited to the gpint machines; they can affect any machine that mounts /house.  Please write back with any questions, and file a ticket at http://help.nersc.gov if you run into any trouble.

 
