NERSC: Powering Scientific Discovery Since 1974

Data Management

Please remove ALL data from /house!

Do you still have data in /house/homedirs?  If you are not sure, please check now and make a plan for moving that data to the archiver or to one of the NERSC file systems (for more information on these file systems, go to File storage and I/O).
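As a starting point for that check, the following is a minimal sketch of a helper that reports how many files and how much data sit under a directory. The function name summarize_dir is hypothetical, and the directory you pass to it (for example a per-user subdirectory of /house/homedirs) is an assumption about how your data is laid out.

```shell
# Summarize how much data sits under a directory before planning a move.
# Usage: summarize_dir DIR
# (summarize_dir is an illustrative helper, not a NERSC-provided tool.)
summarize_dir() {
    dir=$1
    # Count regular files below DIR.
    files=$(find "$dir" -type f | wc -l | tr -d ' ')
    # Total size in KiB as reported by du.
    kib=$(du -sk "$dir" | cut -f1)
    echo "$dir: $files files, $kib KiB"
}
```

Running it on each directory you own gives a rough idea of how much data has to be moved and whether it is better suited to HPSS or to one of the other file systems.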

Moving data from house to DnA

The DnA file system is primarily for finished projects, data that is ready to be archived, or data that is shared between groups.  It is mounted read-only on the cluster, but you can write to directories on this file system in a few ways:

Moving data from house to Projectb

Projectb is where compute jobs run and write both intermediate files and finished products.  The Projectb file system is NOT backed up.  The SCRATCH section of this file system is subject to a 90-day purge cycle.  The SANDBOX area is not purged and is managed by each JGI group.  If you are actively working on something in /house, this data should be moved to space within Projectb.

  • Data Transfer Nodes until December 1, 2013 (examples here)
  • Any queue on the Genepool cluster until November 15, 2013 (examples here)
  • Gpints and login nodes can be used for data transfer to /house until November 15, 2013

Moving data from house to HPSS

For LEGACY data (owned by a group, not necessarily an individual), please consider using JAMO, the archiver developed by the Sequence Data Management group at the JGI.  This archiving tool was built in conjunction with a queryable metadata system that will make it easier for you to find legacy data years from now!  If you have questions about how to get started with this system, please contact consult at nersc dot gov.

For your own personal data, use the htar and hsi commands detailed below.

Moving data from Projectb to DnA

Once you have completed some analysis in Projectb and you want the data to be either backed up to the archive, or made available to other groups at the JGI, you will need to move the data to DnA (data n' archive).  If you would like to register these files for access with JAMO (simplifies sharing between groups), please contact consult at nersc dot gov.  

To move the data yourself you can either:


Use the fast data transfer nodes to move data quickly

NERSC has set up two fast data transfer nodes to help JGI users move data between file systems and back up data to HPSS. Note that the netapps file system will not be available on Genepool, but the house and new projectb file systems will.  This means that users need to move data out of netapps onto house or projectb.

Log in to the data transfer nodes with one of the following commands:

ssh dtn03.nersc.gov

or

ssh dtn04.nersc.gov

Archiving your data with HPSS

These are some basic examples of data transfer and access with HPSS.

Access Example

Using HSI from a NERSC Production System

All of the NERSC computational systems available to users have the hsi client already installed.  To access the Archive storage system you can type hsi with no arguments:

% hsi

That is, the utility is set up to connect to the Archive system by default.  This is equivalent to typing:

% hsi -h archive.nersc.gov

HSI Usage Example

You can run hsi commands in several different ways:

From a command line: % hsi
Single-line execution: % hsi "mkdir run123;  cd run123; put bigdata.0311"
Read commands from a file: % hsi "in command_file"
Read commands from standard input: % hsi < command_file
Read commands from a pipe: % cat command_file | hsi

Just typing hsi will enter an interactive command shell, placing you in your home directory on the Archive system.  From this shell, you can run the ls command to see your files, cd into storage system subdirectories, put files into the storage system and get files from it.
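For the command-file forms listed above, the file simply contains one hsi command per line. The following is a minimal sketch of a helper that generates such a file; the function name make_hsi_commands and the file names in the usage comment are hypothetical, and the sketch assumes names without spaces or quoting characters.

```shell
# Print hsi commands that create HPSS_DIR and store each named file in it.
# Usage: make_hsi_commands HPSS_DIR FILE...
# (make_hsi_commands is an illustrative helper, not part of hsi.)
make_hsi_commands() {
    hpss_dir=$1
    shift
    echo "mkdir $hpss_dir"
    echo "cd $hpss_dir"
    for f in "$@"; do
        echo "put $f"
    done
}

# Example use on a system where hsi is installed:
#   make_hsi_commands run123 bigdata.0311 bigdata.0312 > command_file
#   hsi "in command_file"
```

Generating the file first makes it easy to review exactly what will be stored before anything is sent to the Archive system.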

Specifying local and HPSS file names when storing or retrieving files

The HSI put command stores files from your local file system into HPSS and the get command retrieves them.  The command:

% put myfile

will store the file named "myfile" from your current local directory into a file of the same name in your current HPSS directory.  So, in order to store "myfile" in the "run123" subdirectory of your home directory in HPSS, you can type:

% hsi
A:/home/j/joeuser-> cd run123
A:/home/j/joeuser-> put myfile

or

% hsi "cd run123; put myfile"

The hsi utility uses a special syntax to specify local and HPSS file names when using the put and get commands:

  1. The local file name is always on the left and the HPSS file name is always on the right.
  2. Use a ":" (colon character) to separate the names.

That is:

% put local_file : hpss_file
% get local_file : hpss_file

This format is convenient if you want to store a file named "foo" in the local directory as "foo_2010_09_21" in HPSS:

% hsi "put foo : foo_2010_09_21"
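The date-stamped name can also be generated rather than typed by hand. The following is a minimal sketch that builds the put command string using the "local : HPSS" syntax; the function name dated_put_command is hypothetical, and the underscore-separated date format is only chosen to match the example above.

```shell
# Print an hsi put command that stores FILE under a date-stamped HPSS name.
# Usage: dated_put_command FILE [YYYY_MM_DD]
# (dated_put_command is an illustrative helper, not part of hsi.)
dated_put_command() {
    file=$1
    # Use the supplied stamp, or today's date if none is given.
    stamp=${2:-$(date +%Y_%m_%d)}
    # Local name on the left, HPSS name on the right, separated by " : ".
    echo "put $file : $(basename "$file")_$stamp"
}

# Example use on a system where hsi is installed:
#   hsi "$(dated_put_command foo)"
```

The helper only prints the command, so you can inspect it before passing it to hsi.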

You can also use this method to specify the full or relative pathnames of files in both the local and HPSS file systems:

% hsi "get /scratch2/scratchdirs/joeuser/analysis/data : bigcalc/hopper/run123/datafile.0211"

Archiving your data with HTAR

HTAR is a command-line utility that creates and manipulates HPSS-resident tar-format archive files.  It is ideal for storing groups of files in HPSS.  Since the tar file is created directly in HPSS, it is generally faster and uses less local space than creating a local tar file and then storing it in HPSS.  However, there is a file size limit of 64GB for an individual file within the archive (archives themselves can be much larger).  So if you need to back up individual files that are larger than 64GB, use hsi for those files.
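To find out in advance which files fall over that limit, something like the following sketch can be used. The function name list_oversized_files is hypothetical, and the default threshold assumes that "64GB" means 64 * 1024^3 bytes; adjust it if the limit on your system is defined differently.

```shell
# List regular files under DIR that are larger than LIMIT_BYTES.
# Usage: list_oversized_files DIR [LIMIT_BYTES]
# (list_oversized_files is an illustrative helper, not part of HTAR.)
list_oversized_files() {
    dir=$1
    # Default: 64 * 1024^3 bytes (an assumed reading of the 64GB limit).
    limit=${2:-68719476736}
    # "-size +Nc" matches files whose size exceeds N bytes.
    find "$dir" -type f -size +"${limit}c"
}
```

Any file it prints is a candidate for storing individually with hsi, while the rest of the directory can go into an HTAR archive.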

Examples of when to use HTAR

HTAR is useful for storing groups of related files that you will probably want to access as a group in the future.  Examples include:

  • archiving a source code directory tree
  • archiving output files from a code simulation run
  • archiving files generated by the run of an experiment

If stored individually, the files will likely be distributed across a collection of tapes, requiring possibly long delays (due to multiple tape mounts) when fetching them from HPSS.  On the other hand, an HTAR archive file will likely be stored on a single tape, requiring only a single tape mount when it comes time to retrieve the data.

HTAR Usage Example

The basic syntax of HTAR is similar to the standard tar utility:

 htar -{c|K|t|x|X} -f tarfile [directories] [files]

As with the standard Unix tar utility, the "-c", "-x", and "-t" options create, extract, and list tar archive files, respectively. The "-K" option verifies an existing tarfile in HPSS, and the "-X" option re-creates the index file for an existing archive.
Please note that you cannot add or append files to an existing archive.

Note: when HTAR creates an archive, it places an additional file (with a strange name) at the end of the archive.  Ignore this file; it is for HTAR internal use and will not be retrieved when you extract the files from the archive.

# Create an archive with directory "nova" and file "simulator"
% htar -cvf nova.tar nova simulator
HTAR: a   nova/                                                                   
HTAR: a   nova/sn1987a
HTAR: a   nova/sn1993j
HTAR: a   nova/sn2005e
HTAR: a   simulator
HTAR: a   /scratch/scratchdirs/joeuser/HTAR_CF_CHK_61406_1285375012
HTAR Create complete for nova.tar. 28,396,544 bytes written for 4 member files, max threads: 4 Transfer time: 0.420 seconds (67.534 MB/s)
HTAR: HTAR SUCCESSFUL      

# Now list the contents
% htar -tf nova.tar
HTAR: drwx------  joeuser/joeuser          0 2010-09-24 14:24  nova/
HTAR: -rwx------  joeuser/joeuser    9331200 2010-09-24 14:24  nova/sn1987a
HTAR: -rwx------  joeuser/joeuser    9331200 2010-09-24 14:24  nova/sn1993j
HTAR: -rwx------  joeuser/joeuser    9331200 2010-09-24 14:24  nova/sn2005e
HTAR: -rwx------  joeuser/joeuser     398552 2010-09-24 17:35  simulator
HTAR: -rw-------  joeuser/joeuser        256 2010-09-24 17:36  /scratch/scratchdirs/joeuser/HTAR_CF_CHK_61406_1285375012
HTAR: HTAR SUCCESSFUL

# Now, as an example, use hsi to remove the nova.tar.idx index file from HPSS
# (Note: you generally do not want to do this)
% hsi "rm nova.tar.idx"
...
rm: /home/j/joeuser/nova.tar.idx (2010/09/24 17:36:53 3360 bytes)

# Now try to list the archive contents without the index file:
% htar -tf nova.tar
ERROR: No such file: nova.tar.idx           
ERROR: Fatal error opening index file: nova.tar.idx
HTAR: HTAR FAILED

# Here is how we can rebuild the index file if it is accidentally deleted
% htar -Xvf nova.tar
HTAR: i nova                         
HTAR: i nova/sn1987a
HTAR: i nova/sn1993j
HTAR: i nova/sn2005e
HTAR: i simulator
HTAR: i /scratch/scratchdirs/joeuser/HTAR_CF_CHK_61406_1285375012
HTAR: Build Index complete for nova.tar, 5 files 6 total objects, size=28,396,544 bytes
HTAR: HTAR SUCCESSFUL

# Listing the archive contents works again with the rebuilt index file
% htar -tf nova.tar
HTAR: drwx------  joeuser/joeuser          0 2010-09-24 14:24  nova/
HTAR: -rwx------  joeuser/joeuser    9331200 2010-09-24 14:24  nova/sn1987a
HTAR: -rwx------  joeuser/joeuser    9331200 2010-09-24 14:24  nova/sn1993j
HTAR: -rwx------  joeuser/joeuser    9331200 2010-09-24 14:24  nova/sn2005e
HTAR: -rwx------  joeuser/joeuser     398552 2010-09-24 17:35  simulator
HTAR: -rw-------  joeuser/joeuser        256 2010-09-24 17:36  /scratch/scratchdirs/joeuser/HTAR_CF_CHK_61406_1285375012
HTAR: HTAR SUCCESSFUL
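The create and verify steps described above can be combined into one small wrapper. This is a minimal sketch, assuming htar is on your PATH and accepts the "-K -f tarfile" form given in the syntax line; the function name htar_create_and_verify is hypothetical.

```shell
# Create an HTAR archive from the given paths, then verify it with -K.
# Usage: htar_create_and_verify TARFILE PATH...
# (htar_create_and_verify is an illustrative helper, not part of HTAR.)
htar_create_and_verify() {
    tarfile=$1
    shift
    # Stop if creation fails so a broken archive is never reported as verified.
    htar -cvf "$tarfile" "$@" || return 1
    # Verify the archive that now exists in HPSS.
    htar -K -f "$tarfile"
}

# Example use, matching the archive created above:
#   htar_create_and_verify nova.tar nova simulator
```

Because files cannot be appended to an existing archive, verifying right after creation is a convenient point to catch problems before the local copies are removed.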

 

For more examples, please go to the HPSS page.