Mistakes to Avoid
There are a number of workflows that should be avoided as much as possible:
Large tape storage systems do not work well with small files. Retrieval of large numbers of small files from HPSS will incur significant delays due to the characteristics of tape media:
- File retrieval from tape media involves loading tapes into tape drives and positioning the media. This operation can be quite time-consuming.
- Storing large numbers of small files may spread them across dozens or hundreds of tapes.
- Mounting dozens of tapes and then seeking to particular locations on each tape can take a long time and impairs usability for other users.
Instead, store many small files as a single file with htar.
Large htar files end up on fewer tapes, and htar automatically builds an index that speeds up retrieval of member files.
Keep the following file size guidelines in mind:
- File sizes of 1 GB or larger will give the best network performance.
- HPSS is optimized for files of approximately 500 GB in size.
- File sizes greater than 1 TB can be difficult to work with and lead to longer transfer times, which increases the chance of transfer interruptions.
- Please contact email@example.com if you wish to store significantly larger files.
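Before archiving a directory tree, it can help to see how many of its files fall below the 1 GB guideline. The following is a minimal sketch using standard `find`; the function name `count_small_files` is hypothetical, and the default threshold assumes "1 GB" means 2^30 bytes.

```shell
#!/usr/bin/env bash
# Count regular files under a directory that are smaller than a threshold.
# Usage: count_small_files <dir> [threshold_bytes]
# The function name and the 2^30-byte default are illustrative assumptions.
count_small_files() {
    local dir="$1"
    local threshold="${2:-1073741824}"
    # "-size -Nc" matches files strictly smaller than N bytes.
    # Counting NUL terminators keeps the count correct for odd file names.
    find "$dir" -type f -size "-${threshold}c" -print0 | tr -cd '\0' | wc -c | tr -d ' '
}
```

If the count is large, bundling the tree into one htar archive is likely a better fit than storing the files individually.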
Using hsi for recursive storage and retrieval is almost always non-optimal:
- Recursively storing a directory tree is likely to store a lot of small files across a large number of tapes
- Recursive file retrieval is likely to cause excessive tape mount and positioning activity. This is not only slow but ties up the system for other users.
It is better to use htar than recursive hsi.
When retrieving many files, ordering your read requests by tape and by position on the tape is the most efficient method. An example script showing how this might be done can be found in the Usage Examples at HSI Tape Ordering Script.
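The ordering step itself can be sketched with standard `sort`. This is not the HSI Tape Ordering Script referenced above. It assumes you have already produced a text file with one line per file in the hypothetical format `<tape_label> <position> <hpss_path>`, separated by single spaces. How that list is obtained from HSI is outside this sketch.

```shell
#!/usr/bin/env bash
# Print HPSS paths ordered by tape label, then by numeric position on the tape.
# Input format (an assumption for illustration, not HSI output):
#   <tape_label> <position> <hpss_path>
# Usage: order_by_tape <list_file>
order_by_tape() {
    # Key 1: tape label (text); key 2: position (numeric). Keep only the path.
    sort -k1,1 -k2,2n "$1" | cut -d' ' -f3-
}
```

Retrieving files in the printed order means each tape is mounted once and read front to back.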
Htar Member File Limitations
For backwards compatibility reasons, htar is unable to accept bundles containing files with names longer than 99 characters or with directory paths longer than 154 characters. Htar also limits the maximum size of a single file in a bundle to 68 GB (this is not the size of the whole bundle, just the maximum size for any one file inside the bundle). If you wish to archive files that exceed any of these limits, the easiest option is to bundle them first with regular "tar" and use hsi put to store the resulting tarball in HPSS.
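A directory can be checked against these limits before running htar. The sketch below uses only standard shell tools; the function name `check_htar_limits` is hypothetical, "68 GB" is assumed to mean 68 x 2^30 bytes, and lengths are counted in characters of the current locale. Pass the directory the same way you would pass it to htar, since the path lengths depend on that.

```shell
#!/usr/bin/env bash
# Report files that would exceed the htar member limits described above.
# Usage: check_htar_limits <dir>
# Assumption: "68 GB" is treated as 68 * 2^30 bytes.
check_htar_limits() {
    local dir="$1"
    local max_bytes=$((68 * 1024 * 1024 * 1024))
    find "$dir" -type f | while IFS= read -r path; do
        name=${path##*/}      # file name without the directory part
        parent=${path%/*}     # directory part of the path
        if [ "${#name}" -gt 99 ]; then
            echo "name too long: $path"
        fi
        if [ "${#parent}" -gt 154 ]; then
            echo "directory path too long: $path"
        fi
    done
    # "-size +Nc" matches files strictly larger than N bytes.
    find "$dir" -type f -size "+${max_bytes}c" | sed 's/^/file too large: /'
}
```

Empty output suggests the tree fits within the limits; any reported file is a candidate for the tar plus hsi put route.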
Stream Data via Unix Pipelines
Unix pipelines are often used to avoid needing spool space when writing a large archive file. This approach has some weaknesses:
- Pipelines break during transient network issues
- Pipelines fail to notify HPSS of data size
- Data may be stored on non-optimal resources, and/or transfers fail
- Retrieval can be difficult
Recommendations:
- Use global scratch to spool large archive files
- Use htar if spool space is an issue
- If streaming via a pipe is unavoidable, use pftp with the ALLO64 <bytes> hint:
bash-4.0$ pftp archive <<EOF
> quote allo64 7706750976
> put "|tar cf - ./joeuser" /home/j/joeuser/joeuser.tar
> EOF
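The byte count passed to allo64 has to come from somewhere. One way to get it, sketched below, is to run the same tar command once and count the bytes it emits. The function name `tar_stream_bytes` is hypothetical, and this approach reads the data an extra time, so it trades I/O for an exact size.

```shell
#!/usr/bin/env bash
# Print the size in bytes of the tar stream that "tar cf - <path>" would produce.
# Usage: tar_stream_bytes <path>
# Run it from the same directory and with the same path as the real transfer,
# otherwise the stored member names (and thus the size) may differ.
tar_stream_bytes() {
    tar cf - "$1" | wc -c | tr -d ' '
}
```

The printed number would take the place of 7706750976 in the `quote allo64` line above, provided the directory contents do not change between the two runs.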
HSI allows pre-fetching data from tape to the HPSS disk cache. In principle, pre-fetched data should then be available quickly for processing. There are several problems with this approach:
- The disk cache is shared on a first-come, first-served basis
- If the cache is under heavy use by other users, data may be purged before use
- If the data read to cache is larger than the cache, it will be purged before use
- Both situations result in a performance penalty as data is read twice from tape
It is recommended that global scratch, rather than the HPSS disk cache, be used to pre-stage large data volumes.
Large Directories in HPSS
Each HPSS system is backed by a single database instance, so:
- Every user interaction causes some database activity
- One of the most database-intensive commands is the hsi long file listing, i.e., "ls -l"
- Directories containing more than a few thousand files may become difficult to work with interactively
Below is an example of an "hsi ls -l" listing of a directory containing 80k files:
bash-4.0$ time hsi -q 'ls -l /home/n/nickb/tmp/testing/80k-files/' > /dev/null 2>&1
[Figure: "hsi ls -l" performance as a function of the number of files in the directory]
Long-running Transfers
- Long-running transfers can be failure-prone for a variety of reasons, including transient network issues and planned or unplanned maintenance
- hsi and pftp cannot resume interrupted transfers
- Data is not completely migrated to tape from the HPSS disk cache until the transfer is completed
- Keep transfers to 24 hours or less in duration if possible
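A rough duration estimate can show whether a planned transfer fits the 24-hour guideline. The sketch below is plain integer arithmetic; the function name `estimate_transfer_hours` is hypothetical, and the sustained rate is something you must measure or assume yourself, since no rate is stated here.

```shell
#!/usr/bin/env bash
# Estimate transfer duration in whole hours, rounded up.
# Usage: estimate_transfer_hours <size_bytes> <rate_bytes_per_second>
# The rate is an assumption supplied by the caller, not a documented figure.
estimate_transfer_hours() {
    local size="$1" rate="$2"
    local per_hour=$((rate * 3600))
    # Ceiling division so partial hours count as a full hour.
    echo $(( (size + per_hour - 1) / per_hour ))
}
```

For example, at an assumed 100 MB/s a 10 TB transfer comes out above 24 hours, which suggests splitting it into several smaller transfers.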