Email Announcement Archive

[Users] NERSC Data Corruption Bug Has Been Mitigated

Author: Rebecca Hartman-Baker <rjhartmanbaker_at_lbl.gov>
Date: 2020-03-21 07:46:18

Dear NERSC Users,

At the beginning of March, NERSC asked for your help tracking down an I/O bug that I described as “a rare occurrence that manifests as a missing block or a block of null characters” when writing to the Community File System. Thanks to your help, not only were we able to better understand the issue, but we found a signature in the Cori error logs, identified all the user jobs impacted by it, and deployed a fix that has been in place since March 12.

If your job was impacted by this issue, you should have received a notification from us in the past two weeks. If you did not receive notice, that means your jobs were all in the clear.

The user input helped us, in collaboration with our vendor, to find the underlying cause of the issue. Once we found and understood it, we were able to deploy a patch to the compute nodes. Since it was a fairly minor change, we were able to perform a rolling upgrade to the compute nodes, which began on March 10 and completed two days later. While we believed that the patch would fix the problem, we performed extensive testing and observed user jobs for a time to be sure that it had worked before making this announcement.

Without the assistance of the NERSC community, it would have taken us a lot longer to find and fix this problem. Thank you for all your help in identifying the issue, and your patience as we tracked down this tricky bug!

Regards,
-Rebecca

--
Rebecca Hartman-Baker, Ph.D
User Engagement Group Leader
National Energy Research Scientific Computing Center | Berkeley Lab
rjhartmanbaker@lbl.gov | phone: (510) 486-4810 fax: (510) 486-6459
Pronouns: she/her/hers
_______________________________________________
Users mailing list
Users@nersc.gov
