NERSC: Powering Scientific Discovery Since 1974

RESOLVED: Projectb filesystem outage July 9, 2012

July 9, 2012

 

The projectb filesystem had a hardware failure that may have generated I/O errors.  The filesystem logs indicate that the earliest abnormal event occurred at 9:19AM, and the filesystem was taken down for maintenance at 10:42AM.  It returned to service at 11:20AM.  Jobs running on the cluster could not read from or write to the projectb filesystem between 10:42AM and 11:20AM.
 
Between 9:19AM and 10:42AM, one of the 20 GPFS controllers on projectb was down and did not fail over as it should have.
 
This means:
Roughly 1 in 20 file I/O operations could have failed between 9:19AM and 10:42AM.
 
If your job was performing a large number of short reads and writes, then there is a greater chance you were affected.
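
To illustrate why many short operations raise the odds, here is a rough back-of-the-envelope sketch. It assumes each I/O operation independently had a 1-in-20 chance of hitting the failed controller, which is a simplification and not a description of how GPFS actually distributes I/O. The function name chance_affected is made up for this example.

```python
def chance_affected(n_ops: int, p_fail: float = 1 / 20) -> float:
    """Probability that at least one of n_ops operations failed.

    Assumes every operation fails independently with probability p_fail
    (a simplifying assumption, not a statement about GPFS internals).
    """
    return 1.0 - (1.0 - p_fail) ** n_ops


if __name__ == "__main__":
    # Compare a job doing a single operation with jobs doing many.
    for n in (1, 10, 100):
        print(f"{n:>4} operations: {chance_affected(n):.1%} chance of at least one failure")
```

Under this simple model the chance of being affected climbs quickly with the number of operations, which matches the guidance above.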
 
Any data that was successfully written (to a complete file) is good.
 
Any data that was written with random I/O (e.g. fseek/fwrite) could be suspect and should be checked carefully.
 
Please check any data that was written between 9:19AM and 10:42AM.
Please check any jobs that were running during this period if they were performing I/O on projectb.
Please file a ticket with NERSC if you need help.  (http://help.nersc.gov)
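
One possible way to find candidate files is to list everything under a directory whose modification time falls inside the window. The sketch below is only a starting point: it assumes the times above are in the local time zone of the machine running the script, and the function name files_modified_in_window is invented for this example.

```python
import os
from datetime import datetime

# Window from the announcement. Assumption: these are local times on the
# machine where the script runs.
WINDOW_START = datetime(2012, 7, 9, 9, 19)
WINDOW_END = datetime(2012, 7, 9, 10, 42)


def files_modified_in_window(root, start=WINDOW_START, end=WINDOW_END):
    """Yield paths under root whose last-modification time is in [start, end]."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = datetime.fromtimestamp(os.stat(path).st_mtime)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if start <= mtime <= end:
                yield path
```

You could call it as `for path in files_modified_in_window(your_directory): print(path)`, where your_directory is whichever projectb directory your jobs write to. Note that a modification time only records the last write, so a file touched during the window and rewritten afterwards will not be listed.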

Sincerely,
Doug Jacobsen
NERSC Bioinformatics Computing Consultant

 

