NERSC Data Management Strategy and Policies
NERSC provides its users with the means to store, manage, and share their research data products.
In addition to systems specifically tailored for data-intensive computations, we provide a variety of storage resources optimized for different phases of the data lifecycle; tools to enable users to manage, protect, and control their data; high-speed networks for intra-site and inter-site (ESnet) data transfer; gateways and portals for publishing data for broad consumption; and consulting services to help users craft efficient data management processes for their projects.
OSTP/Office of Science Data Management Requirements
Project Principal Investigators are responsible for meeting OSTP (Office of Science and Technology Policy) and DOE Office of Science data management requirements for long-term data sharing and preservation. The OSTP has issued a memorandum on Increasing Access to the Results of Federally Funded Scientific Research (http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf).
NERSC resources are intended for users with active allocations, and as described below, NERSC cannot guarantee long-term data access without a prior, written, service-level agreement. Please carefully consider these policies, including their limitations and restrictions, as you develop your data management plan.
NERSC Global Filesystem (NGF)
NGF is a collection of centerwide file systems, based on IBM’s GPFS, available on nearly all systems at the facility. The several different file systems that comprise NGF include: one providing a common login user environment for all our systems, one for sharing data among collaborators on a science project or team, and one for high bandwidth short term storage across systems at the facility. The main focus of NGF is data sharing, ease of workflow management (i.e., not moving data around or maintaining unnecessary copies of data), and data analysis.
Scratch File Systems
Our two primary computational systems, Hopper and Edison, have dedicated parallel file systems based on Lustre that are optimized for short-term storage of application output and checkpoints.
Archival Storage (HPSS)
HPSS provides long-term archival storage to users at the facility. The main focus of the system is data stewardship or preservation, but it also supports some data sharing needs. NERSC maintains a second archive for backup of the disk-based file systems.
Data Transfer, Data Analysis, and Collaborative Capabilities
Data Transfer Nodes
To enable high speed data transfer, we provide a set of parallel transfer servers tuned primarily for WAN data movement (into and out of the facility), and secondarily for high speed local transfers (e.g., NGF to HPSS). Staff from NERSC and ESnet with extensive data transfer experience are available to aid scientists in organizing, packaging, and moving their data to the most appropriate storage resource at our facility.
Our facility provides numerous endpoints (access points) into our various storage systems. The key benefit of using the GlobusOnline software is its ease of use in providing access to high performance storage resources, including third party transfers and reliable fire-and-forget self-managed transfers.
Direct Web Access to Data
For rapid and simple data sharing, users can enable web access to files in their project file system. See the Getting Started section of the Science Gateways page.
For more complex collaborative data, analytics, and computing projects, NERSC promotes building scientific web apps that connect stakeholders to computing resources, NGF, and HPSS. This includes web architectures for science, rapid data subselection, deep search including machine learning, and simulation science in addition to bulk data movement. Staff with programming, web application, and portal development skills can aid scientists in creating web-based interfaces to their data for sharing, comparison, provenance, and federation. See Science Gateways.
NERSC Web Toolkit (NEWT)
While NERSC provides users with the systems and services that aid them in managing their data, users have ultimate responsibility for managing their data:
- Select the most appropriate resources to meet their individual needs.
- Use shared resources responsibly.
- Set appropriate access control limits.
- Archive and back up critical data.
- Never keep a single copy of critical data; NERSC is not responsible for data loss.
- Follow proper use policies, described in the NERSC User Agreement.
Data Confidentiality and Access Control
The NERSC network is an open research network, intended primarily for fundamental research. We cannot guarantee complete confidentiality for data that resides at NERSC. It is your responsibility to set access controls appropriate to your needs, understanding that your data is stored on a semi-public system. While we take care to secure our systems, security breaches could expose your data to others.
Files are protected only using UNIX file permissions based on NIM user and group IDs. It is the user’s responsibility to ensure that file permissions and umasks are set to match their needs. See Sharing Data for information on how to control file access.
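As an illustration of the permission and umask settings mentioned above, here is a minimal sketch using only the Python standard library. The helper names (describe_mode, restrict_to_group, is_world_accessible) and the file name results.dat are hypothetical, and the 0o027 umask is just one reasonable choice, not a NERSC recommendation.

```python
import os
import stat
import tempfile


def describe_mode(path):
    """Return the permission bits of path as a string such as '-rw-r-----'."""
    return stat.filemode(os.stat(path).st_mode)


def restrict_to_group(path):
    """Set owner read/write, group read-only, and no access for others (0o640)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP)


def is_world_accessible(path):
    """Return True if users outside the owner and group have any access."""
    return bool(os.stat(path).st_mode & stat.S_IRWXO)


if __name__ == "__main__":
    # A umask of 0o027 (an example value) removes group write and all
    # "other" access from files this process creates from now on.
    previous = os.umask(0o027)
    try:
        with tempfile.TemporaryDirectory() as workdir:
            example = os.path.join(workdir, "results.dat")  # hypothetical file
            with open(example, "w") as handle:
                handle.write("example\n")
            print(describe_mode(example), is_world_accessible(example))
            restrict_to_group(example)
            print(describe_mode(example))
    finally:
        os.umask(previous)  # restore the original umask
```

A check such as is_world_accessible can be run over a directory tree before sharing it, to confirm that nothing is readable outside the intended group.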
NERSC system administrators with “root” privileges are not constrained by the file permissions, and they have the ability to open and/or copy all files on the system. They can also assume a user’s identity on the system. NERSC HPC consultants also have the capability to assume a user’s identity to help users troubleshoot application or user environment problems. Vendor support personnel acting as agents of NERSC may also have administrative privileges.
Administrators use these elevated privileges only in certain highly restricted situations and, generally speaking, only when requested or when there is a suspected problem or security issue. The following are specific instances where we might look at your files:
- We keep copies of all error, output, and job log files and may review them to determine if a job failure was due to user error or a system failure.
- If you request our assistance via any mechanism, e.g., help ticket, direct personal email, in person, etc., we interpret that request to be explicit permission to view your files if we think doing so will aid us in resolving your issue.
Users may encrypt data to provide extra measures of privacy if desired.
Under ordinary circumstances, our staff will not copy, expose, discuss, or in any other way communicate your project information to anyone outside of your project or NERSC. There are two key exceptions:
- When an account expires or a user leaves a project, NERSC will honor requests to change file ownership when instructed by the original user or the most recent principal investigator (or designated PI proxy) of the sponsoring project.
- NERSC is required to address, safeguard against, and report misuse, abuse, security violations, and criminal activities. NERSC therefore retains the right, at its discretion, to disclose any and all data files or records of network traffic to appropriate cyber security organizations and law enforcement officials.
Selecting the Appropriate Data Storage Resource
NERSC provides a broad range of storage solutions to address different needs. The following describes each offered storage solution: its intended use (capabilities), data protection (backup), data retention, and available allocations. Users should familiarize themselves with these solutions, and select the most appropriate for their individual needs. Most users will use a combination of all four, depending on usage, performance, and data retention needs.
Home File System
The home file system is intended to hold source code, executables, configuration files, etc., and is optimized for small to medium sized files. It is NOT meant to hold the output from your application runs; the scratch or project file systems should be used for computational output.
This file system has redundancy in the servers and storage, and duplicate copies of metadata, so we expect good stability and the highest availability in this space.
Nightly tape backups allow recovery of files that have existed for more than one day. Backups are kept for 90 days.
Data for active users is not purged from this space. A user is considered inactive if they do not have an active allocation and have not accessed their data for at least one year. All files in your home file system will be archived to tape and maintained on disk for one year from the date your account is deactivated.
Each user is allocated a directory with a 40 GB hard quota in the home file system. This is the default allocation granted to all users.
Project File System
The project file system is primarily intended for sharing data within a team or across computational platforms. The project file systems are parallel file systems optimized for high-bandwidth, large-block-size access to large files. Once active production and analysis are complete and you no longer need regular access to the data (> 1 year), you should either archive the data in the HPSS data archive (below) or transfer it back to your home institution.
There is a separate projectb file system on the Joint Genome Institute's Genepool system. See Genepool File Storage.
These file systems have redundancy in the servers and storage and duplicate copies of metadata, so we expect good stability and reliability. Because demand is high and capacity is shared, we do expect some degradation in availability (the overall availability target is 97%).
Daily backups are performed for project directories that have quotas equal to or below 5 TB, and are kept for 90 days. Restore time is dependent on actual amount of data and can take several days to complete. No backups are performed on project directories with a quota above 5 TB. A user may opt to archive their data in HPSS.
Data for active users is not purged from this space. A user or project will be considered inactive if they do not have an active allocation and have not accessed the data in at least one year. All project directories will be archived to tape and maintained on disk for one year from the date your account is deactivated.
Each repository is allocated a directory in the project file system. The default quota for project directories is 1 TB and the default name of a project directory is the repository name (e.g., m767). These directories are owned by the Principal Investigators (PIs) and are accessible to everyone in the Unix group associated with the repository. PIs and PI Proxies may request additional space via the quota increase request form. If files need to be shared with a group that is different from a repository group, PIs and PI Proxies can request a special project directory by filling out the Project Directory Request form.
Scratch File Systems
Hopper and Edison each have large, local, parallel scratch file systems dedicated to the users of those systems. The global scratch file system, while accessible from all systems, is primarily for use with our midrange systems (Carver, Genepool, PDSF, etc.). The scratch file systems are intended for temporary uses such as storage of checkpoints or application result output. If you need to retain files longer than the purge period (see below), the files should be copied to the project or home file systems, or to HPSS.
These file systems have redundancy in the servers and storage, so we expect good stability and reliability. The extreme high demand placed on these storage systems results in some degradation of availability for the systems (97% is the target overall availability).
Due to the extremely large volume of data and its temporary nature, backups are not performed.
These file systems are purged on a regular interval as communicated to users of the system (e.g., all files not accessed within 12 weeks). All files in scratch are eligible for deletion one week after your account is deactivated.
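To make the purge rule concrete, the following sketch lists files whose last access time is older than a given number of weeks. It assumes the 12-week figure quoted above as an example; the actual interval is whatever is communicated for each system. The function name purge_candidates is hypothetical, and access times reported by a file system may not always reflect real use, depending on how it is mounted.

```python
import os
import time

PURGE_WEEKS = 12  # example interval from the text; the real value may differ
SECONDS_PER_WEEK = 7 * 24 * 3600


def purge_candidates(root, weeks=PURGE_WEEKS, now=None):
    """Yield files under root whose last access is more than `weeks` ago."""
    now = time.time() if now is None else now
    cutoff = now - weeks * SECONDS_PER_WEEK
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                atime = os.lstat(path).st_atime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if atime < cutoff:
                yield path


if __name__ == "__main__":
    # "." is a placeholder; point this at your own scratch directory.
    for path in purge_candidates("."):
        print(path)
```

Files reported by such a scan are the ones to copy to the project or home file systems, or to HPSS, if they need to be kept.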
Each user is allocated a directory in the local scratch file systems and in the global scratch file system. The default local scratch quota (on Hopper and Edison) is 5 TB. On the global scratch file system, the default quota is 20 TB. Users may request an increase in quota using a web form. Quota increase requests are evaluated by the NERSC data group team, which considers the merits and duration of the request, and also must consider the amount of disk space available.
HPSS Data Archive
The HPSS data archive is intended for long-term, offline storage (tape) of results that you wish to retain but either do not need immediate access to or do not have room for in the other file systems (above). The primary access to HPSS is via HSI, or via HTAR for small files. HPSS is also accessible via GridFTP or GlobusOnline. Tape backups of the other file systems are stored in a special partition of the archive.
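The text recommends HTAR for small files; the underlying idea is to aggregate many small files into a single archive before storing them. The sketch below does not call HTAR or HPSS at all. It uses Python's standard tarfile module as a local stand-in to show the aggregation step, and the function name bundle_directory is hypothetical.

```python
import os
import tarfile


def bundle_directory(source_dir, archive_path):
    """Write every file under source_dir into one uncompressed tar archive.

    archive_path should live outside source_dir so the archive does not
    try to include itself. Returns the number of files added.
    """
    count = 0
    base = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(archive_path, "w") as archive:
        for dirpath, _dirnames, filenames in os.walk(source_dir):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                # Store paths relative to the top-level directory name.
                arcname = os.path.join(base, os.path.relpath(path, source_dir))
                archive.add(path, arcname=arcname)
                count += 1
    return count
```

The resulting single tar file can then be transferred with whichever archive tool you normally use, instead of moving thousands of small files individually.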
The HPSS service is configured to be stable and reliable, but is routinely taken offline for short periods for planned maintenance.
By default, a single copy of the data will be written to tape. Data loss due to hardware faults can occur, but is very rare. All tapes are currently stored at the NERSC facility. There is no off-site backup, and in principle there is a chance of permanent data loss in the case of a site disaster such as a fire. Critical data should be manually protected by making an explicit second copy: You can make another copy within the data archive, or you can request an account on our separate backup archive for additional copies. In both cases, you will be “charged” twice for the storage from your HPSS allocation. Alternatively, you can transfer the data to another site.
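When making an explicit second copy of critical data, it is worth confirming that the copy is intact. The following is a minimal local sketch of that check, assuming both copies are reachable as ordinary file paths; it does not interact with HPSS, and the function names sha256_of and copy_and_verify are hypothetical.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, destination):
    """Copy source to destination and confirm the checksums match.

    Returns the shared digest, or raises OSError if the copy differs.
    """
    expected = sha256_of(source)
    shutil.copy2(source, destination)
    actual = sha256_of(destination)
    if actual != expected:
        raise OSError(f"checksum mismatch for {destination}")
    return expected
```

Recording the returned digest alongside the data makes it possible to re-verify either copy later.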
We do not actively remove data from HPSS, and will communicate with users or data owners should a need arise to reclaim space. Note that in NERSC’s history, we have not deleted files in the HPSS archive due to storage limitations. Projects requiring a firm commitment for data retention should contact NERSC.
Projects receive HPSS allocations at the same time that computational resources are allocated. All projects receive an HPSS allocation; the allocation units are called Storage Resource Units.
Requests for Data Storage Beyond Defaults
NERSC establishes defaults for allocations, retention, and backup to serve the needs of the majority of our users, avoid oversubscription, and place practical limits on the costs of storage provided. We are willing to work with individual projects to accommodate special needs. If you need additional space, please fill out the quota increase form or send an email to email@example.com, and we will address your request on a case-by-case basis. Exceptionally large requests may require alternative funding.
Proper Use of Software and Data
Please consult the NERSC User Agreement for the policies regarding appropriate use (and limitations on use) of software and data. NOTE: The following classes of software and data raise red flags, and may be prohibited or restricted in some way. Users should carefully consult the User Agreement before moving such information to NERSC systems:
- Classified or controlled military or defense information
- Software without proper licenses
- Export controlled or ITAR software or data
- Personally identifiable information
- Medical or health information
- Proprietary information