RESOLVED: Projectb filesystem outage July 9, 2012

July 9, 2012


The projectb filesystem had a hardware failure that may have caused I/O errors.  The filesystem logs show that the earliest abnormal event occurred at 9:19AM, and the filesystem was taken down for maintenance at 10:42AM.  It returned to service at 11:20AM.  Jobs running on the cluster could not read from or write to the projectb filesystem between 10:42AM and 11:20AM.
Between 9:19AM and 10:42AM, one of the 20 GPFS controllers on projectb was down and did not fail over as it should have.
This means:
Roughly 1 in 20 file I/O operations could have failed between 9:19AM and 10:42AM.
If your job performed a large number of short reads and writes, it is more likely to have been affected.
Any data that was successfully written (to a complete file) is good.
Any data that was written with random I/O (e.g., fseek/fwrite) could be suspect and should be checked carefully.
Please check any data that was written between 9:19AM and 10:42AM.
Please check any jobs that were running during these time periods if they performed I/O on projectb.
Please file a ticket with NERSC if you need help.
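
One way to find data to check is to list files whose last-modification time falls inside the 9:19AM to 10:42AM window. The following is a minimal Python sketch, not a NERSC-provided tool. The function name and the directory path in the usage note are hypothetical, and the times are assumed to be in the local time zone of the machine running the script, since this notice does not state a time zone.

```python
import os
from datetime import datetime

# Incident window from the notice. Assumption: times are in the local
# time zone of the machine running this script.
WINDOW_START = datetime(2012, 7, 9, 9, 19)
WINDOW_END = datetime(2012, 7, 9, 10, 42)


def files_modified_in_window(root, start=WINDOW_START, end=WINDOW_END):
    """Yield paths under root whose last-modification time is in [start, end]."""
    lo, hi = start.timestamp(), end.timestamp()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if lo <= mtime <= hi:
                yield path
```

For example, `for p in files_modified_in_window("/path/to/your/projectb/directory"): print(p)` prints the candidate files (the path is a placeholder). A file that was modified again after the window will not be listed, because only the most recent modification time is recorded, so this is a starting point rather than a complete check.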

Doug Jacobsen
NERSC Bioinformatics Computing Consultant