# System Status Definitions

## Resources

### Perlmutter
| Resource | Description |
|---|---|
| Compute Nodes | High-performance nodes on which user jobs submitted through the batch system run |
| Container Runtimes | Tools such as Shifter and Podman that enable running container-based jobs on compute and login nodes |
| Scratch | The high-performance Scratch File System recommended for use by running jobs ($SCRATCH) |
| Slurm Commands | The capability to run commands that submit, view, and manage jobs on the system, as well as commands to see historical job information |
| Urgent Jobs | Interactive jobs, jobs running in the Realtime queue, and Jupyter jobs running on compute nodes |

### Storage
| Resource | Description |
|---|---|
| Community File System | The Community File System, the recommended storage system for shared files within a project |
| Global Common | The global file system mounted at /global/common on most NERSC hosts, recommended for storing software shared within a project |
| Global Homes | The global file system mounted at /global/homes, which contains user home directories |
| HPSS Archive | The primary archival storage system for NERSC users |
| Regent | A legacy archival storage system still in use by a small subset of NERSC users |

### Websites
| Resource | Description |
|---|---|
| help.nersc.gov | NERSC’s user help system |
| iris.nersc.gov | User accounting portal |
| jupyter.nersc.gov | Jupyter notebook service |
| portal.nersc.gov | Science gateway service that serves files from project /www directories on CFS |
| www.nersc.gov | NERSC’s front-door public website |

### Services
| Resource | Description |
|---|---|
| Authentication | Login and password handling services, including multifactor authentication |
| Data Transfer Nodes | Dedicated servers for moving data between NERSC storage systems and to or from other sites |
| Globus | High-performance, large-scale data transfers into, out of, and within NERSC via Globus, including data sharing via Globus collections |
| Registry | The NERSC-managed registry for container images |
| Network | Wide area network connecting on-site NERSC systems to the Internet, and local area “data center” networks connecting NERSC systems to each other |
| Science Databases | Database services managed by NERSC staff on behalf of users |
| Spin | Container-based edge services hosting for users |
| Superfacility API | Web API for programmatic access to user-facing resources, including current system status (see the example after this table) |
| ThinLinc | Support for access to high-performance systems via the ThinLinc graphical client |
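
The Superfacility API listed above can also report the status of these resources programmatically. Below is a minimal sketch in Python, assuming a public, unauthenticated status endpoint that returns a list of resource records; verify the URL and response schema against the Superfacility API documentation.

```python
import json
import urllib.request

# Assumed public status endpoint; check the Superfacility API documentation
# for the current base URL, version, and response schema.
STATUS_URL = "https://api.nersc.gov/api/v1.2/status"

def fetch_status():
    """Return the current status of NERSC resources as parsed JSON."""
    with urllib.request.urlopen(STATUS_URL, timeout=30) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Assumes the endpoint returns a list of records with "name" and
    # "status" fields; adjust to the actual schema as needed.
    for resource in fetch_status():
        print(f"{resource.get('name')}: {resource.get('status')}")
```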

## Resource states

### Down
A resource will be marked as “Down” if users are not able to perform expected functions with it due to a malfunction of hardware, software, or other infrastructure, or due to planned maintenance. Some resources have more explicit definitions that govern when we consider them to be down:
- Compute nodes are down if utilization drops below 50% due to a malfunction of hardware, software, or other infrastructure.
- Login nodes are down if users are unable to log into the system and perform normal functions due to a malfunction of the login node hardware or software. Normal functions include compiling codes, submitting batch jobs or querying the batch system, transferring data into and out of the system, and using the primary scratch file system.
- Slurm Commands are down if users are unable to submit batch jobs, view jobs in the queue, or manage their submitted jobs.
- Resources used in data transfer, such as data transfer nodes, networking, and Globus, are down if users are unable to use them to move data due to a malfunction or planned maintenance.
- The Scratch file system is down if users are unable to use it from login or compute nodes.
- The HPSS file system is down if HPSS logins or transfers are unavailable for all users via all supported clients.
- Any file system is down if it is inaccessible from all of NERSC’s access points due to a network, hardware, or power failure.
- Spin is down if all workloads are inaccessible for over 15 minutes.
- Any resource can be marked as down if NERSC management determines that it fails to adequately fulfill its intended function (e.g., if the compute nodes experience a high job failure rate).

### Degraded
A resource will be marked as “Degraded” if users can perform expected functions with it but some aspect of its use is impaired due to a malfunction of hardware, software, or other infrastructure. Some resources have more explicit definitions that govern when we consider them degraded:
- Compute nodes are degraded if system utilization is less than 90% and greater than 50% due to a malfunction of hardware, software, or other infrastructure. They are also degraded if partitions are down for more than 15 minutes. (These utilization thresholds, together with the “Down” criterion above, are sketched in code after this list.)
- Any mounted file system, such as CFS, Scratch, or HPSS, is degraded if it is experiencing reduced performance reading, writing, listing, creating, or deleting files. A file system can also be considered degraded if service for some users and/or some data in the system is unavailable, or if it is unstable due to network issues.
- HPSS is degraded if it is available but at a reduced level of performance.
- The NERSC network is degraded if the wide area or local area network is experiencing significant drops in performance or connectivity, or if bandwidth decreases on the order of 50% due to an outage or planned maintenance.
- Spin is degraded if all workloads are down for a brief period of time (under 15 minutes); if a subset of workloads is inaccessible; if performance is below normal levels; if the Rancher UI is inaccessible (preventing new deployments or updates of existing ones); or in other circumstances at the judgment of the Spin POC.
- Any resource can be marked degraded if NERSC management determines that it significantly impacts user productivity but does not meet the criteria for being down.
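
Taken together, the compute-node utilization criteria under “Down” and “Degraded” reduce to simple thresholds. The following Python sketch is illustrative only: it assumes the utilization shortfall is due to a malfunction, and it ignores the partition-outage and management-judgment criteria.

```python
def compute_node_status(utilization: float) -> str:
    """Map compute-node utilization (percent, 0-100) to a status using the
    thresholds above. Boundary handling at exactly 50% and 90% is an
    assumption; partition outages and management judgment are not modeled."""
    if utilization < 50:
        return "Down"      # "drops below 50%" due to a malfunction
    if utilization < 90:
        return "Degraded"  # "less than 90% and greater than 50%"
    return "Up"            # meets neither the Down nor the Degraded criterion
```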

### Up
Any resource not meeting the criteria for being down or degraded and not under active investigation for a potential malfunction will be marked as “Up.” A system that is marked as “Investigating” is also considered to be up unless and until it is marked as “Down” or “Degraded.”

### Investigating
When a system is up, users may still encounter issues that may or may not turn out to be the result of a malfunction. If user reports suggest a possible malfunction, NERSC engineers will look into the problem to determine whether that is the case. During the investigative period, the status of “Investigating” will be reported for the resource.
This status is never used when we are looking into a problem that is already known to be causing a resource to be down or degraded. It applies to any situation where the system is up and the following are both true:
- NERSC engineers are aware of user reports describing issues that could be the result of a malfunction of hardware, software, or other infrastructure affecting system utilization.
- Engineers are actively investigating such an issue, but have not yet determined whether it meets the criteria for being down or degraded.

## Scheduled outages
For an outage to be considered a scheduled outage, the user community must be notified of the need for a maintenance event window no less than 24 hours in advance of the outage (in the case of emergency fixes). Users will be notified of regularly scheduled maintenance (i.e., scheduled outages that repeat at relatively consistent time intervals) further in advance: no less than 72 hours prior to the event, and preferably as much as seven calendar days prior. If a regularly scheduled maintenance is not needed, users will be informed of its cancellation in a timely manner. Any interruption of service that does not meet the minimum notification window is categorized as an unscheduled outage.
Outages that extend past a scheduled maintenance window by 4 hours or less are considered part of the original scheduled maintenance. After 4 hours, the downtime becomes a new event, an unscheduled outage.
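
The notification and overrun rules above amount to simple time arithmetic. The sketch below illustrates them in Python; the function names and dates are illustrative and not part of any NERSC tooling.

```python
from datetime import datetime, timedelta

def outage_is_scheduled(notice_sent: datetime, outage_start: datetime,
                        regularly_scheduled: bool = False) -> bool:
    """True if the notification meets the minimum window above: 72 hours
    for regularly scheduled maintenance, 24 hours otherwise (e.g., emergency
    fixes)."""
    required = timedelta(hours=72) if regularly_scheduled else timedelta(hours=24)
    return outage_start - notice_sent >= required

def overrun_is_same_event(window_end: datetime, actual_end: datetime) -> bool:
    """True if an overrun of the maintenance window is 4 hours or less and
    therefore still counts as the original scheduled maintenance."""
    return actual_end - window_end <= timedelta(hours=4)

# Example: notice sent exactly 3 days before a regularly scheduled window.
print(outage_is_scheduled(datetime(2024, 1, 1, 9, 0),
                          datetime(2024, 1, 4, 9, 0),
                          regularly_scheduled=True))  # True (72 hours)
```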