
HSI Tape Ordering

General Procedure

If you are retrieving multiple files from HPSS, it is best to order your retrieval requests in a way that makes sense for the HPSS system. In HPSS, files initially go onto a disk cache and migrate to tape as time passes. This means that files that were put into HPSS at the same time could end up spread across multiple tapes. Since each tape must be loaded into a tape drive, retrieval will be fastest if you order your requests so that all the files on one tape are pulled while that tape is mounted. We have written a script to help you do this.

To use the script, you first need a list of the full paths of the files you'd like to retrieve from HPSS. You can generate this list with

hsi -q 'ls -P' <HPSS_directories_you_want_to_retrieve> >& temp.txt

(for bash, replace ">&" with "2>"). Once you have the list of files, you can feed it to the sorting program:

hpss_file_sorter.script temp.txt > retrieval_list.txt

This writes the list of files, sorted into the optimal retrieval order, to retrieval_list.txt.
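
For example, retrieval_list.txt might contain something like the following (hypothetical paths, shown only to illustrate the format of one full HPSS path per line):

/home/j/jdoe/run42/output_003.dat
/home/j/jdoe/run42/output_017.dat
/home/j/jdoe/run41/checkpoint_002.dat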

Typically the best way to retrieve the files in this list from HPSS is with the "cget" command, which gets a file from HPSS only if it isn't already in the output directory. You should also take advantage of "hsi in <file_of_hsi_commands.txt>", which runs an entire set of HPSS commands in one HPSS session. This avoids HPSS performing a sign-in procedure for each file, which can add up to a significant amount of time if you are retrieving many files. To do this, prepend "cget" to each line of the retrieval_list.txt file you already generated:

awk '{print "cget",$1}' retrieval_list.txt > final_retrieval_list.txt
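
final_retrieval_list.txt will then contain one hsi command per line, for example (same hypothetical paths as above):

cget /home/j/jdoe/run42/output_003.dat
cget /home/j/jdoe/run42/output_017.dat
cget /home/j/jdoe/run41/checkpoint_002.dat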

Finally, you can retrieve the files from HPSS with

hsi "in final_retrieval_list.txt"

If you have any further questions about this, please contact consult@nersc.gov.

Additional Considerations

If the list of files is very long, you may want to consider splitting it into several smaller lists and running multiple retrievals in parallel. The following command will split final_retrieval_list.txt into files of 5000 lines each, with names like final_retrieval_list_aaaa:

split --lines=5000 --suffix-length=4 final_retrieval_list.txt final_retrieval_list_

Then you can run several instances of hsi in parallel, each reading a different list file, as in the sketch below.
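
For example, from bash you could launch one background hsi session per list file (a simple sketch; keep the number of simultaneous sessions modest):

# Start one hsi session per split file, in the background
for f in final_retrieval_list_*; do
    hsi "in $f" &
done
# Wait for all retrievals to finish
wait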

Finally, this procedure will place all the retrieved files in a single local directory. You may want to preserve some of the directory structure you have in HPSS. If so, you can automatically recreate the HPSS subdirectories in your target directory with this command:

sed 's:^'<your_hpss_directory>'/\(.*\):\1:' temp.txt | xargs -I {} dirname {} | sort | uniq | xargs -I {} mkdir -p {}

where <your_hpss_directory> is the root directory you want to harvest subdirectories from, and temp.txt holds the output from your "ls -P" call.
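
For example, with a hypothetical HPSS root directory of /home/j/jdoe/run42:

# If temp.txt lists files under /home/j/jdoe/run42/sub1/... and
# /home/j/jdoe/run42/sub2/..., this recreates ./sub1 and ./sub2
# under the current directory
sed 's:^/home/j/jdoe/run42/\(.*\):\1:' temp.txt | xargs -I {} dirname {} | sort | uniq | xargs -I {} mkdir -p {}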

If your individual files are very large and you are using a fast network (i.e., transferring within NERSC), there are ways to further optimize the file retrieval. Email consult@nersc.gov for details.