# System Status Definitions

## Resources

### Perlmutter
| Resource | Description |
|---|---|
| Compute Nodes | High-performance nodes on which user jobs submitted through the batch system run |
| Container Runtimes | Tools such as Shifter and Podman that enable running container-based jobs on compute and login nodes |
| Scratch | The high-performance Scratch File System recommended for use by running jobs ($SCRATCH) |
| Slurm Commands | The capability to run commands that submit, view, and manage jobs on the system, as well as commands to see historical job information |
| Urgent Jobs | Interactive jobs, jobs running in the Realtime queue, and Jupyter jobs running on compute nodes |

### Storage
| Resource | Description |
|---|---|
| Community File System | The Community File System, the recommended storage system for shared files within a project |
| Global Common | The global file system mounted at /global/common on most NERSC hosts, recommended for storing software shared within a project |
| Global Homes | The global file system mounted at /global/homes, which contains user home directories |
| HPSS Archive | The primary archival storage system for NERSC users |
| Regent | A legacy archival storage system still in use by a small subset of NERSC users |

### Websites
| Resource | Description |
|---|---|
| help.nersc.gov | NERSC’s user help system |
| iris.nersc.gov | User accounting portal |
| jupyter.nersc.gov | Jupyter notebook service |
| portal.nersc.gov | Science gateway service that serves files from project /www directories on CFS |
| www.nersc.gov | NERSC’s front-door public website |

### Services
| Resource | Description |
|---|---|
| Authentication | Login and password handling services, including multifactor authentication |
| Data Transfer Nodes | Dedicated servers for moving data between NERSC storage systems and to or from other sites |
| Globus | High-performance, large-scale data transfers into, out of, and within NERSC via Globus, including data sharing via Globus collections |
| Registry | The NERSC-managed registry for container images |
| Network | Wide area network connecting on-site NERSC systems to the Internet, and local area “data center” networks connecting NERSC systems to each other |
| Science Databases | Database services managed by NERSC staff on behalf of users |
| Spin | Container-based edge services hosting for users |
| Superfacility API | Web API for programmatic access to user-facing resources, including current system status (see the example after this table) |
| ThinLinc | Support for access to high-performance systems via the ThinLinc graphical client |
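
The Superfacility API listed above can also report the status of these resources programmatically. Below is a minimal sketch in Python, assuming a public, unauthenticated status endpoint that returns a list of resource records; verify the URL and response schema against the Superfacility API documentation.

```python
import json
import urllib.request

# Assumed public status endpoint; check the Superfacility API documentation
# for the current base URL, version, and response schema.
STATUS_URL = "https://api.nersc.gov/api/v1.2/status"

def fetch_status():
    """Return the current status of NERSC resources as parsed JSON."""
    with urllib.request.urlopen(STATUS_URL, timeout=30) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Assumes the endpoint returns a list of records with "name" and
    # "status" fields; adjust to the actual schema as needed.
    for resource in fetch_status():
        print(f"{resource.get('name')}: {resource.get('status')}")
```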

## Resource states

### Down
A resource will be marked as “Down” if users are not able to perform expected functions with it due to a malfunction of hardware, software, or other infrastructure, or due to planned maintenance. Some resources have more explicit definitions that govern when we consider them to be down:
- Compute nodes are down if utilization drops below 50% due to a malfunction of hardware, software, or other infrastructure.
- Login nodes are down if users are unable to log into the system and perform normal functions due to a malfunction of the login node hardware or software. Normal functions include compiling codes, submitting batch jobs or querying the batch system, transferring data into and out of the system, and using the primary scratch file system.
- Slurm Commands are down if users are unable to submit batch jobs, view jobs in the queue, or manage their submitted jobs.
- Resources used in data transfer, such as data transfer nodes, networking, and Globus, are down if users are unable to use them to move data due to a malfunction or planned maintenance.
- The Scratch file system is down if users are unable to use it from login or compute nodes.
- The HPSS file system is down if HPSS logins or transfers are unavailable for all users via all supported clients.
- Any file system is down if it is inaccessible from all of NERSC’s access points due to a network, hardware, or power failure.
- Spin is down if all workloads are inaccessible for over 15 minutes.
- Any resource can be marked as down if NERSC management determines that it fails to adequately fulfill its intended function (e.g., if the compute nodes experience a high job failure rate).

### Degraded
A resource will be marked as “Degraded” if users can perform expected functions with it but some aspect of its use is impaired due to a malfunction of hardware, software, or other infrastructure. Some resources have more explicit definitions that govern when we consider them degraded:
- Compute nodes are degraded if system utilization is less than 90% and greater than 50% due to a malfunction of hardware, software, or other infrastructure. They are also degraded if partitions are down for more than 15 minutes. (These utilization thresholds, together with the “Down” criterion above, are sketched in code after this list.)
- Any mounted file system, such as CFS, Scratch, or HPSS, is degraded if it is experiencing reduced performance reading, writing, listing, creating, or deleting files. A file system can also be considered degraded if service for some users and/or some data in the system is unavailable, or if it is unstable due to network issues.
- HPSS is degraded if it is available but at a reduced level of performance.
- The NERSC network is degraded if the wide area or local area network is experiencing significant drops in performance or connectivity, or if bandwidth decreases on the order of 50% due to an outage or planned maintenance.
- Spin is degraded if all workloads are down for a brief period of time (under 15 minutes); if a subset of workloads is inaccessible; if performance is below normal levels; if the Rancher UI is inaccessible (preventing new deployments or updates of existing ones); or in other circumstances at the judgment of the Spin POC.
- Any resource can be marked degraded if NERSC management determines that it significantly impacts user productivity but does not meet the criteria for being down.
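
Taken together, the compute-node utilization criteria under “Down” and “Degraded” reduce to simple thresholds. The following Python sketch is illustrative only: it assumes the utilization shortfall is due to a malfunction, and it ignores the partition-outage and management-judgment criteria.

```python
def compute_node_status(utilization: float) -> str:
    """Map compute-node utilization (percent, 0-100) to a status using the
    thresholds above. Boundary handling at exactly 50% and 90% is an
    assumption; partition outages and management judgment are not modeled."""
    if utilization < 50:
        return "Down"      # "drops below 50%" due to a malfunction
    if utilization < 90:
        return "Degraded"  # "less than 90% and greater than 50%"
    return "Up"            # meets neither the Down nor the Degraded criterion
```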

### Up
Any resource not meeting the criteria for being down or degraded and not under active investigation for a potential malfunction will be marked as “Up.” A system that is marked as “Investigating” is also considered to be up unless and until it is marked as “Down” or “Degraded.”

### Investigating
When a system is up, users may still encounter issues that may or may not turn out to be the result of a malfunction. If user reports suggest a possible malfunction, NERSC engineers will look into the problem to determine whether that is the case. During the investigative period, the status of “Investigating” will be reported for the resource.
This status is never used when we are looking into a problem that is already known to be causing a resource to be down or degraded. It applies to any situation where the system is up and the following are both true:
- NERSC engineers are aware of user reports describing issues that could be the result of a malfunction of hardware, software, or other infrastructure affecting system utilization.
- Engineers are actively investigating such an issue, but have not yet determined whether it meets the criteria for being down or degraded.

## Scheduled outages
For an outage to be considered a scheduled outage, the user community must be notified of the need for a maintenance event window no less than 24 hours in advance of the outage (in the case of emergency fixes). Users will be notified of regularly scheduled maintenance (i.e., scheduled outages that repeat at relatively consistent time intervals) further in advance: no less than 72 hours prior to the event, and preferably as much as seven calendar days prior. If a regularly scheduled maintenance is not needed, users will be informed of its cancellation in a timely manner. Any interruption of service that does not meet the minimum notification window is categorized as an unscheduled outage.
Outages that extend past a scheduled maintenance window by 4 hours or less are considered part of the original scheduled maintenance. After 4 hours, the downtime becomes a new event, an unscheduled outage.
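
The notification and overrun rules above amount to simple time arithmetic. The sketch below illustrates them in Python; the function names and dates are illustrative and not part of any NERSC tooling.

```python
from datetime import datetime, timedelta

def outage_is_scheduled(notice_sent: datetime, outage_start: datetime,
                        regularly_scheduled: bool = False) -> bool:
    """True if the notification meets the minimum window above: 72 hours
    for regularly scheduled maintenance, 24 hours otherwise (e.g., emergency
    fixes)."""
    required = timedelta(hours=72) if regularly_scheduled else timedelta(hours=24)
    return outage_start - notice_sent >= required

def overrun_is_same_event(window_end: datetime, actual_end: datetime) -> bool:
    """True if an overrun of the maintenance window is 4 hours or less and
    therefore still counts as the original scheduled maintenance."""
    return actual_end - window_end <= timedelta(hours=4)

# Example: notice sent exactly 3 days before a regularly scheduled window.
print(outage_is_scheduled(datetime(2024, 1, 1, 9, 0),
                          datetime(2024, 1, 4, 9, 0),
                          regularly_scheduled=True))  # True (72 hours)
```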