NERSC: Powering Scientific Discovery Since 1974

Open Issues

RESOLVED: Projectb filesystem outage July 9, 2012

July 9, 2012

 

The projectb filesystem had a hardware failure that may have caused I/O errors.  According to the filesystem logs, the earliest abnormal event occurred at 9:19AM, and the filesystem was taken down for maintenance at 10:42AM.  It returned to service at 11:20AM.  Jobs running on the cluster could not read from or write to the projectb filesystem between 10:42AM and 11:20AM.
 
Between 9:19AM and 10:42AM, one of the 20 GPFS controllers on projectb was down and did not fail over as it should have.
 
This means that 1 in 20 file I/O operations could have failed between 9:19AM and 10:42AM.
 
If your job was performing a large number of short reads and writes, there is a greater chance that you were affected.
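
As a rough illustration of why many small operations raise the risk: if we assume, purely for the sake of the estimate, that each I/O operation failed independently with probability 1/20, then the chance that at least one of n operations failed is 1 - (19/20)^n. The independence assumption and the helper name chance_affected are ours and not part of the incident analysis; real failures depend on which controller served each request.

```shell
# Sketch only: probability that at least one of N operations failed,
# ASSUMING each fails independently with probability 1/20.
# The function name is hypothetical.
chance_affected() {
    # $1 = number of I/O operations performed during the window
    awk -v n="$1" 'BEGIN { printf "%.3f\n", 1 - (19/20)^n }'
}

# Example:
#   chance_affected 10
```

Under this assumption, a job that performed only 10 operations in the window already has about a 40% chance of having hit at least one failure.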
 
Any data that was successfully written (to a complete file) is good.
 
Any data that was written with random-access I/O (e.g. fseek/fwrite) could be suspect and should be checked carefully.
 
Please check any data that was written between 9:19AM and 10:42AM.
Please check any jobs that were running during this period if they performed I/O on projectb.
Please file a ticket with NERSC if you need help.  (http://help.nersc.gov)
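
One possible way to find candidate files is to list everything whose last-modified time falls inside the window. The sketch below assumes GNU find (for -newermt) and assumes that the machine's local timezone matches the times quoted above. The function name and the example path are hypothetical, so substitute your own projectb directory.

```shell
# Sketch only: list regular files under a directory whose last-modified
# time falls inside the incident window. Assumes GNU find and that local
# time on this machine matches the times in the notice.
files_written_in_window() {
    dir="$1"                            # directory to scan
    start="${2:-2012-07-09 09:19:00}"   # earliest abnormal event
    end="${3:-2012-07-09 10:42:00}"     # filesystem taken down
    # -newermt START : modified strictly after START
    # ! -newermt END : modified at or before END
    find "$dir" -type f -newermt "$start" ! -newermt "$end" -print
}

# Example (placeholder path, not a real location):
#   files_written_in_window /path/to/your/projectb/dir
```

A file that was modified again after 10:42AM will not be listed, because only the most recent modification time is recorded. Treat the output as a starting point, not a complete list.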

Sincerely,
Doug Jacobsen
NERSC Bioinformatics Computing Consultant

 


Important notice about using /house

July 6, 2012

Description

The gpint machines have recently had many NFS hangs.
The cause has been traced to a defect in the Isilon filesystem software. The hang occurs when a file that is being written is simultaneously opened for reading on the same host.
 
This most frequently happens when people tail files being written by the same machine.
 
E.g.:  DO NOT DO THIS:
gpint17 $ ./somewritingProcess > outfile &
gpint17 $ tail -f outfile
 
This very common (and desirable) operation has been determined to hang the filesystem on the host reading/writing the file.  We are working with the vendor to try to correct this situation, but in the meantime a work-around is to read the file from a different machine than the writer:
 
E.g.: THIS IS OK
gpint17 $ ./somewritingProcess > outfile &
gpint17 $ ssh gpintXX
gpintXX $ tail -f outfile
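
The same work-around can be wrapped in a small helper that refuses to tail on the local host. This is only a sketch: the function name tail_from_other_host and the SSH_CMD override are hypothetical, and it assumes you run it on the machine that is writing the file.

```shell
# Sketch only: follow a file from a host other than this one.
# ASSUMES this machine is the writer; names here are hypothetical.
tail_from_other_host() {
    reader="$1"               # host to read from, e.g. another gpint node
    file="$2"                 # path to follow (no spaces; see note below)
    local_host=$(uname -n)
    # Compare short host names so "gpint17" matches "gpint17.example.org".
    if [ "${reader%%.*}" = "${local_host%%.*}" ]; then
        echo "refusing: $reader is this host; choose a different reader" >&2
        return 1
    fi
    # SSH_CMD can be overridden (e.g. for a dry run); defaults to ssh.
    ${SSH_CMD:-ssh} "$reader" tail -f "$file"
}

# Example:
#   ./somewritingProcess > "$PWD/outfile" &
#   tail_from_other_host gpintXX "$PWD/outfile"
```

Use an absolute path for the file, since the remote command starts in your home directory instead of the directory you launched it from. Paths containing spaces would need extra quoting, which this sketch does not handle.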
 
Please note that these problems are not limited to the gpint machines; they affect any machine connected to /house.  Please reply to me with any questions, and file a ticket at http://help.nersc.gov if you run into any trouble.

 
