High Performance & Scientific Computing

User Guide

The Archiving process

It is important to remember that UT-StorR is an archive, not a backup system or an HSM (Hierarchical Storage Manager). Please read the Access and Login page first.

UT-StorR replication policies

UT-StorR offers two replication policies, “onecopy” and “twocopy”. These refer to the number of individual tapes that the data will reside on once migrated. “Twocopy” is the default, for redundancy. “Onecopy” is an option for those who do not need the additional guarantees that two copies provide.

Curating data for archiving

Preparing your data

UT-StorR is designed for the storage, retrieval, and sharing of primary data. Examples include raw genomic sequences, raw microscope images, survey responses, and labeled image sets. Many project groups put shared resources such as Python virtual environments and R packages into their shared directories to process their data. Those shared packages do not need to be archived. Instead, create a list of required packages, or instructions on how to replicate the processing environment, and store it in the archive directory.
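One way to capture the processing environment is to write the package list to a plain-text file that is archived with the data. The following is a minimal sketch, assuming a Python environment managed with pip. The function name `record_environment` and the file name `ENVIRONMENT.txt` are illustrative and not part of UT-StorR.

```shell
# Sketch: record the processing environment in a text file that is archived
# with the data. The function name and output file name are illustrative.
record_environment() {
    local outfile="$1/ENVIRONMENT.txt"
    {
        echo "Recorded on: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
        echo "System: $(uname -sr)"
        echo
        echo "Python packages (pip freeze):"
        if command -v python3 >/dev/null 2>&1; then
            # pip freeze lists installed packages as name==version
            python3 -m pip freeze 2>/dev/null || echo "(pip not available)"
        else
            echo "(python3 not found)"
        fi
    } > "$outfile"
}
```

Run it inside the activated virtual environment, passing the directory that will be archived as the only argument. R users could add the output of `sessionInfo()` to the same file.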

Data structure

Tape archives are designed to work with fewer, larger files. Recommended sizes for files on UT-StorR are in the 20-200 GB range per file, with an allowable range of 1 GB to 1 TB. Figure 1 shows the origin of this recommendation: very large files (>2 TB) can occupy a tape drive for an extended time, which causes congestion. If a researcher has files larger than 1 TB, please contact us and we will help develop a solution.

Size of File	Time
1 GB	2.5 sec
20 GB	1 min
200 GB	10 min
1 TB	45 min
2 TB	1.5 hr
8 TB	5.5 hr

Figure 1: Approximate latency for recalling a file
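To see which files in a directory fall outside the allowable 1 GB to 1 TB range before archiving, a `find` query on file size can help. This is a sketch only: the function name `list_out_of_range` is illustrative, and the default thresholds assume 1 GB = 2^30 bytes and 1 TB = 2^40 bytes, which may differ from how UT-StorR counts sizes.

```shell
# Sketch: list files outside the allowable size range before archiving.
# Sizes are in bytes; the defaults assume 1 GB = 2^30 and 1 TB = 2^40 bytes.
list_out_of_range() {
    local dir="$1"
    local min_bytes="${2:-1073741824}"      # assumed lower bound (1 GiB)
    local max_bytes="${3:-1099511627776}"   # assumed upper bound (1 TiB)
    # The "c" suffix makes -size compare exact byte counts, avoiding the
    # rounding that find applies to larger units
    find "$dir" -type f \( -size "-${min_bytes}c" -o -size "+${max_bytes}c" \) -print
}
```

Files reported as too small are candidates for bundling with tar. Files reported as too large should be discussed with us first.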

Many small files can be accumulated into a single file via the UNIX “tar” command:

$ tar -cpzf name.tar.gz <directory>

We also recommend that the tar files be compressed with gzip, zstd, etc. (the -z flag in the command above applies gzip). Before tar’ing up and compressing the files to be archived, we suggest creating a checksum file that contains cryptographic checksums of all the files in the particular archive file. This can be accomplished with the UNIX ‘find’ command:

$ find <directory> -type f -exec sha1sum {} \; > checksums.sha1
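The checksum and tar steps can be combined, together with a verification step that can be rerun after the archive is recalled and unpacked. The following is a minimal sketch, assuming GNU coreutils `sha1sum` and a tar that supports `-z`. The function names `make_archive` and `verify_archive` are illustrative, and storing `checksums.sha1` inside the directory is a choice made for this example.

```shell
# Sketch: checksum every file in a directory, then bundle it into a tarball.
# Function names are illustrative; assumes GNU coreutils sha1sum and tar.
make_archive() {
    local dir="$1" name="$2"
    # Record checksums with paths relative to the directory, and keep the
    # list inside the directory so it travels with the archived data.
    ( cd "$dir" && find . -type f ! -name checksums.sha1 \
          -exec sha1sum {} + > checksums.sha1 ) || return 1
    # -C switches to the parent so the tarball holds a relative path
    tar -C "$(dirname "$dir")" -cpzf "$name.tar.gz" "$(basename "$dir")"
}

verify_archive() {
    local dir="$1"
    # sha1sum -c re-reads each listed file and reports any mismatch
    ( cd "$dir" && sha1sum -c checksums.sha1 )
}
```

For example, `make_archive ./run42 ./run42` would produce `run42.tar.gz` (the directory name is hypothetical), and `verify_archive ./run42` returns a non-zero status if any file has changed since the checksums were written.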

If a user already has a large file (e.g. a 10 GB genomic sequence), it is acceptable to place that file on the archive with a README description and checksum the single file. It is not necessary to wrap that file in a tarball when moving it to the archive. For example:

$ sha1sum 10gb_file > /lustre/utstorr/research-onecopy/<projectdir>/10gb_file.sha1
$ mv 10gb_file /lustre/utstorr/research-onecopy/<projectdir>/
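A more cautious variant of the commands above copies the file, verifies the copy against its checksum, and only then removes the original. This swaps `mv` for `cp` plus an explicit check, and is a sketch rather than a required procedure. The function name `archive_single_file` is illustrative.

```shell
# Sketch: checksum a file, copy it to the archive directory, verify the copy,
# and only then remove the original. archive_single_file is illustrative.
archive_single_file() {
    local src="$1" destdir="$2"
    local name
    name=$(basename "$src")
    # Write the checksum into the destination, recording the bare file name
    ( cd "$(dirname "$src")" && sha1sum "$name" ) > "$destdir/$name.sha1" || return 1
    # -p preserves timestamps and permissions on the copy
    cp -p "$src" "$destdir/" || return 1
    # Verify the copy before deleting the source
    ( cd "$destdir" && sha1sum -c "$name.sha1" ) || return 1
    rm "$src"
}
```

The destination would be the project directory under `/lustre/utstorr/research-onecopy/` or its twocopy equivalent. The stored `.sha1` file can be checked again with `sha1sum -c` after a later recall.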

We also highly suggest that research groups create READMEs for each directory of files that they have archived. These files should be descriptive enough that someone at a later date can identify exactly which archived filesets they want to recall without needing to recall a file, un-archive it, and peruse the contents. UT-StorR may also fully offline the archived files after a time, so that the only visible files are the READMEs themselves!
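A listing of each tarball's contents is one useful thing to keep next to the README, since it can be read without recalling the tarball itself. The following is a minimal sketch, assuming gzip-compressed tarballs. The function name `write_contents_listing` and the `.contents.txt` suffix are illustrative.

```shell
# Sketch: save a listing of a tarball's contents for inclusion in a README.
# write_contents_listing and the .contents.txt suffix are illustrative.
write_contents_listing() {
    local archive="$1"
    # -t lists members without extracting; -v adds sizes and timestamps
    tar -tzvf "$archive" > "$archive.contents.txt"
}
```

Generate the listing while the tarball is still on disk, since reading a migrated tarball would require a recall from tape.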

An example README template is provided here (UT login required). The same README template is also provided at /lustre/utstorr/examples, along with a list of common archiving and checksum commands.

The migration to tape is fully automated via DMF policies. UT-StorR migrates to tape any file that has been idle for more than 24 hours. This gives some leeway in organizing the archived directories, but we recommend that the directory structure and READMEs be prepared before the data are placed on UT-StorR.

Sharing

Archived data is shared through Globus. PIs who wish to set up a share of their archived data can do so by submitting a UT-StorR service request here.

The University of Tennessee, Knoxville

Office of Innovative Technologies