NERSC: Powering Scientific Discovery Since 1974

HSI Tape Ordering

General Procedure

If you are retrieving multiple files from HPSS, it is best to order your retrieval requests in a way that suits how HPSS stores them. In HPSS, files initially go onto a disk cache and migrate to tape as time passes. This means that files that were put into HPSS at the same time could end up spread across multiple tapes. Since each tape must be loaded into the reader, retrieval is fastest if you order your requests so that you pull all the files you need from a tape while it is in the reader. We have written a script to help you do this.

To use the script, you first need a list of fully qualified file path names and/or directory path names. If you do not already have such a list, you can query HPSS using the following command:

hsi -q 'ls -1 <HPSS_files_or_directories_you_want_to_retrieve>' >& temp.txt

(In bash, replace ">&" with "2>".) Once you have the list of files, you can feed it to the sorting program:

hpss_file_sorter.script temp.txt > retrieval_list.txt

The script prints the files, sorted in the optimal retrieval order, to standard output, which the command above redirects to retrieval_list.txt.

Typically the best way to retrieve the files in this list from HPSS is with the "cget" command, which gets a file from HPSS only if it isn't already in the output directory. You should also take advantage of the "hsi in <file_of_hsi_commands.txt>" syntax to run an entire set of HPSS commands in one HPSS session. This avoids a separate HPSS sign-in for each file, which can add up to a significant amount of time if you are retrieving many files. To do this, prefix each line of the retrieval_list.txt file you already generated with "cget":

awk '{print "cget",$1}' retrieval_list.txt > final_retrieval_list.txt
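As a small illustration of this step, the sketch below runs the same awk command on a two-line sample list. The paths are hypothetical, and everything is written to a scratch directory so that no real list is overwritten.

```shell
# Work in a scratch directory so real lists are not overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sorted list, standing in for the output of the sorting script.
printf '%s\n' \
  '/home/u/user/run1/data_b.tar' \
  '/home/u/user/run2/data_a.tar' > retrieval_list.txt

# Same awk command as above: prefix each path with "cget".
awk '{print "cget",$1}' retrieval_list.txt > final_retrieval_list.txt

# Show the resulting HSI command file.
cat final_retrieval_list.txt
```

Note that awk's $1 is only the first whitespace-separated field, so this sketch assumes that none of the paths contain spaces.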

Finally, you can retrieve the files from HPSS with

hsi "in final_retrieval_list.txt"
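Because a long retrieval can take a while, it may be worth checking the command file before starting the session. The following is a minimal sketch of such a check; the function name check_retrieval_list is hypothetical and is not part of HSI or of the sorting script.

```shell
# Hypothetical pre-flight check for the HSI command file.
check_retrieval_list() {
    list=$1

    # The list must exist and be non-empty.
    [ -s "$list" ] || { echo "missing or empty: $list" >&2; return 1; }

    # Count the lines that are not cget commands (grep -c prints 0 if none).
    bad=$(grep -vc '^cget ' "$list" || true)
    [ "$bad" -eq 0 ] || { echo "$bad line(s) do not start with cget" >&2; return 1; }

    # Report how many files are queued for retrieval.
    echo "$(wc -l < "$list" | tr -d ' ') file(s) queued in $list"
}
```

One possible use is to run the retrieval only if the check passes: check_retrieval_list final_retrieval_list.txt && hsi "in final_retrieval_list.txt"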

If you have any further questions about this, please contact consult@nersc.gov.

Additional Considerations

This procedure will return all the files you're retrieving in a single directory. You may want to preserve some of the directory structure you have in HPSS. If so, you can automatically recreate the HPSS subdirectories in your target directory with this command:

sed 's:^'<your_hpss_directory>'/\(.*\):\1:' temp.txt | xargs -I {} dirname {} | sort | uniq | xargs -I {} mkdir -p {}

where <your_hpss_directory> is the root directory you want to harvest subdirectories from, and temp.txt holds the output from your "ls -1" call.
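To see what this pipeline does, the sketch below runs it on a small sample listing in a scratch directory. The root /home/u/user/project and the file names are hypothetical stand-ins for your own HPSS paths.

```shell
# Scratch target directory; the HPSS root below is hypothetical.
target=$(mktemp -d)
cd "$target" || exit 1
hpss_root='/home/u/user/project'

# Hypothetical listing, standing in for temp.txt from the "ls -1" call.
printf '%s\n' \
  "$hpss_root/run1/a.dat" \
  "$hpss_root/run1/b.dat" \
  "$hpss_root/run2/sub/c.dat" > temp.txt

# Strip the root, keep each file's directory, and create the unique ones.
sed 's:^'"$hpss_root"'/\(.*\):\1:' temp.txt | xargs -I {} dirname {} | sort | uniq | xargs -I {} mkdir -p {}

# List the directories that now exist under the target.
find . -type d | sort
```

In this sample, the sed step reduces the three paths to run1/a.dat, run1/b.dat and run2/sub/c.dat, so the pipeline creates run1 and run2/sub (and run2 along the way) relative to the current directory.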

If your individual files are very large and you're using a fast network (i.e., inside NERSC), there are ways to further optimize the file retrieval. Email consult@nersc.gov for details.