NERSC Sitewide Power Outage, January 27-February 1, 2025

By Rebecca Hartman-Baker

At least every three years, NERSC must perform complete maintenance of its power systems to ensure that they are operating safely and efficiently. The maintenance process, which includes visual inspections of medium-voltage switchgear equipment and the replacement of any faulty or aging parts, requires a prolonged shutdown of the facility’s primary power infrastructure. This means that NERSC must bring down the majority of its systems, including Perlmutter, for the duration of the maintenance, which is scheduled for the week of January 27 through February 1, 2025. We have also decided to use this period, during which most systems are already unavailable, to perform a lengthy file-system check on the Community File System (CFS).

We are performing this file-system check to address an error that could indicate data corruption in two files on CFS. The files had previously been deleted, and the error occurred while freeing data blocks during a routine file-system action. We have not seen evidence of any additional data corruption, but out of an abundance of caution we have elected to run a full scan of CFS to determine whether any other files have been damaged. Running the check during the full-facility downtime, when Perlmutter would already be unavailable, reduces the total time that NERSC systems will be unavailable to users by at least three days.

On the morning of Monday, January 27, NERSC will prepare all systems for the power shutdown and begin the first phase of the file-system check on CFS. At 9:00 a.m., we will begin bringing down Perlmutter and the Spin workloads that use CFS. CFS will be unmounted from the DTNs at 10:00 a.m., and the first phase of the file-system check will begin shortly after that.

On the morning of Tuesday, January 28, power systems will be de-energized for maintenance. Several systems will remain up and will be available for use:

  • HPSS will remain up on generator power, but on-site transfers between it and CFS or Perlmutter scratch will not be possible.
  • DTNs will remain up on generator power, in a degraded state, without access to CFS or Perlmutter scratch.
  • Spin will remain up on generator power, and all workloads that do not use CFS will be available.
  • Iris will remain up on generator power and accessible to users for password changes, new user accounts, PI operations, etc.
  • The Superfacility API will remain up on generator power and will reflect the center status.
  • NERSC websites (including www.nersc.gov and docs.nersc.gov) and the help.nersc.gov help-desk system (ServiceNow) are externally cloud hosted and will be available.

CFS will stay up on generator power as well, but it will remain unavailable throughout the power outage as we continue to perform the file-system check. Perlmutter will remain de-energized through the three days of power maintenance. All data on NERSC file systems will remain intact throughout the maintenance.

On Friday, January 31, the power maintenance will be complete, and NERSC will begin restoring all systems once the CFS file-system check has finished. We anticipate that all systems will be restored to users by 10:00 p.m. on Saturday, February 1, 2025. In a best-case scenario, systems could return as early as Friday. In a worst-case scenario, where we discover a significant issue on CFS, restoration could take additional time; however, at this time we believe a Saturday return to users is most likely. We will update the timeline as the file-system check proceeds and we learn more about its time requirements and the severity of the issue.

We appreciate your understanding as we ensure the safety of the electrical systems required to operate large-scale resources at NERSC and the integrity of CFS, and we look forward to continuing to serve you throughout 2025.