File Systems on ISAAC-NG
Overview
ISAAC Next Generation has a Lustre file system with a capacity of 3.6 petabytes. ISAAC-NG also has an NFS file system for home directories with a capacity of 4 terabytes.
In High Performance Computing (HPC), a Lustre scratch space is a storage area used to hold intermediate data during the execution of serial and parallel computing jobs. The scratch space is a portion of the Lustre file system (i.e. a directory path on Lustre) reserved for data that is only needed while a job runs. When a job is submitted to an HPC system, it is assigned or has access to a certain set of resources, including CPU cores, memory, and disk storage space. The Lustre scratch space provides storage for the intermediate data generated during the execution of the job, which may include input data, intermediate results, and output data. The scratch space is typically used for data that is too large to fit into the memory of the compute nodes, for data that needs to be shared between multiple compute nodes, and for job results that may serve as input for future computational work. Data is stored in the scratch space during the execution of the job and should be deleted once the job has completed, after the results have been checked and it has been confirmed that the data is not needed for future computational work.
The Lustre scratch space is optimized for high performance and high concurrency, allowing files in Lustre to be accessed simultaneously by multiple compute nodes, whether by many users or by a single user's parallel jobs. This makes Lustre well suited for HPC environments, where large data sets or collections of data sets need to be processed quickly and efficiently. Overall, the Lustre scratch space is an essential component of HPC systems, providing a storage area for the intermediate data generated during the execution of many serial or large parallel computing jobs. Lustre scratch space is separate and distinct from Lustre project space and from long-term archival storage for data sets.
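As an illustration, a batch job will typically create a working directory under scratch, stage its input there, run, and copy only the results worth keeping back to home or project space. The SLURM job script below is a minimal sketch of that workflow; the working directory name, input file, executable, and destination project path are hypothetical placeholders, and $SCRATCHDIR is the environment variable that points to your Lustre scratch directory (see Scratch Directories below).

    #!/bin/bash
    #SBATCH --job-name=scratch-example
    #SBATCH --nodes=1
    #SBATCH --ntasks=4
    #SBATCH --time=01:00:00

    # Create a per-job working directory in Lustre scratch (hypothetical name).
    WORKDIR=$SCRATCHDIR/run_$SLURM_JOB_ID
    mkdir -p "$WORKDIR"
    cd "$WORKDIR"

    # Stage input data from the home directory (placeholder path).
    cp $HOME/inputs/config.dat .

    # Run the application (hypothetical executable path).
    srun $HOME/bin/my_solver config.dat > results.out

    # Keep only the results; remove intermediate files when they are no longer needed.
    cp results.out /lustre/isaac/proj/<project>/results/
    rm -f intermediate_*.tmp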
Two filesystems are available to ISAAC-NG users for storing user files: the Network File System (NFS) and Lustre. NFS holds home directories, while Lustre holds project and scratch directories. Table 1.1 summarizes the available filesystems.
Table 1.1 – File System Purge Policies and Default Quotas

| File System / Purpose | Path | Quota | Purged |
| --- | --- | --- | --- |
| NFS home directory; purpose: environment files; capacity 4 TB | /nfs/home/<username> | 50 GB, 1M files | Not purged |
| Lustre scratch directory; purpose: scratch file space; Lustre capacity 3.6 PB | /lustre/isaac/scratch/<username> | 10 TB, 5M files | Files not accessed or modified within 180 days may be purged (see the purging notice below) |
| Lustre project directory; purpose: project files; Lustre capacity 3.6 PB | /lustre/isaac/proj/<project> | 1 TB, no file limit | Not purged |
Please note that while both NFS and Lustre are generally reliable filesystems, errors and data corruption can still occur. All users are responsible for backing up their own data. To learn about data transfer to/from ISAAC-NG, please review the Data Transfer document. For more information on the Lustre file system, please refer to the Lustre User Guide.
Home directories have a relatively small quota and are intended for user environment files. Lustre scratch directories have a much higher quota, are not backed up, and are intended for files belonging to running or recently run jobs. Scratch works well when large amounts of space are needed while a computation is running or while you are investigating the results of a computational run. Lustre project directories also have a quota; they are used for sharing files among a research group (each project has a corresponding Linux group, shown in the portal for the project) and for more permanent file collections. To obtain a Lustre project directory, an ISAAC project request must be submitted by a UT faculty member (as leader of sponsored or unsponsored research), either by submitting an HPSC service request (i.e. a ticket) or by requesting a project in the portal. ISAAC projects are intended for a workgroup to store and share files, and are also used by workgroups to share and use SLURM partitions with private condo compute resources. Graduate researchers (Masters or PhD) and post-docs must have a faculty advisor submit the project request and then request that the researchers be added as members of the project.
The file systems and researcher directories are intended to be managed by the researchers themselves. From time to time, if OIT HPSC staff see that quotas are overrun or that Lustre is beginning to fill up, staff may ask researchers to manage their files and remove any that are unneeded (usually intermediate results files or copies of files that already exist somewhere else). ISAAC does not have archival storage space, so OIT HPSC staff must work together with researchers to address storage issues as they arise.
Details – Home, Scratch and Project Directories
Home Directories
On the ISAAC-NG cluster, the Network File System (NFS) is used for home directories. Periodic disaster-recovery backups of NFS home directories are planned (targeted for the end of November 2021), but home directories are not currently backed up. Each new account on ISAAC-NG receives a home directory on NFS. This is each account's storage location for a small number of files; it is where you can store job scripts, virtual environments, and other types of files and data. In a Linux environment, you can refer to your home directory with the environment variable $HOME or the tilde (~) character.
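For example (the job_scripts directory below is a hypothetical placeholder):

    echo $HOME          # print the path of your home directory, e.g. /nfs/home/<username>
    cd $HOME            # change to your home directory
    ls ~/job_scripts    # list a directory under your home directory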
By default, your home directory is limited to 50 GB of storage space. It is not intended to store large amounts of project or job-related data. For job-related data, please use your scratch directory. For project data that you do not want to be purged, request and use project space.
Scratch Directories
Lustre scratch directories have a much larger quota than home directories (10 TB by default), are not backed up, and are intended for files belonging to currently or recently running jobs. Scratch works well when you need large amounts of space while a computation is running, while you are investigating the results of a computational run, or if you have a project directory but need short-term storage that will not fit within your project directory quota.
Scratch directories on ISAAC-NG are available to all users on the Lustre filesystem. Approximately 2.9 petabytes of Lustre storage space is available on /lustre/isaac, which is shared between scratch directories and project directories.
Important: Lustre scratch directories are NOT backed up.
Important Purging Notice: Lustre scratch space may be purged on approximately the 3rd Monday of each month. Files in Lustre scratch directories (only /lustre/isaac/scratch/{username} directories) are deleted by the purging process if they have not been accessed or modified within 180 days. In general, users have many intermediate files that are no longer needed once a job completes and results are returned. These files, along with other orphaned and unneeded files, often are not deleted by users; they accumulate in scratch directories and can fill the file system, which is detrimental to all users. Purging email notices are sent to all active users before each purge of scratch space. The notice explains the purging process, how to request a purge exemption, and the process to request project space (project directories are exempt from purging). To request a purge exemption, submit an HPSC service request with “purge exemption request” in the subject.
To transfer data out of your scratch space see the ISAAC-NG Data Transfer documentation.
Each user has access to a scratch directory in Lustre, located at /lustre/isaac/scratch/{username}. For convenience, use the $SCRATCHDIR environment variable to refer to your Lustre scratch directory.
If you wish to determine which files are eligible to be purged from your Lustre scratch space, execute:

    lfs find $SCRATCHDIR -mtime +180 -type f

Files that will be purged from Lustre scratch space are those that have not been modified or accessed for 180 days. If you wish to view your total usage of Lustre space, execute:

    lfs quota -u <user> /lustre/isaac
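If files flagged by this command are still needed, one option is to archive them into project space (which is not purged) before the purge runs. The following is a minimal sketch, assuming you have a project directory; the list file, archive name, and project path are placeholders:

    # List purge-eligible files, bundle them into an archive, and move it to project space.
    cd $SCRATCHDIR
    lfs find $SCRATCHDIR -mtime +180 -type f > purge_candidates.txt
    tar -czf old_files.tar.gz -T purge_candidates.txt
    mv old_files.tar.gz /lustre/isaac/proj/<project>/archives/
    # After verifying the archive, the originals can be removed, e.g.:
    #   xargs rm -f < purge_candidates.txt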
Any attempt to circumvent purging, such as using the touch command on all files in a user's scratch directory, will be considered a violation of the ISAAC-NG acceptable use policy. Instead of taking the time to circumvent purging, consider requesting a project with corresponding project space. As we are all Tennessee Volunteers, our research community is improved by positive user actions, such as cleaning up unneeded files or requesting a project, rather than circumventing the ISAAC-NG file purging policy. This also allows staff to manage the file system more easily and ensure its continued availability for all users.
Project Directories
In many cases a researcher or research group will want an ISAAC project created so that there is a Linux project group and a project directory on the Lustre file system where the research group can share files and coordinate work. If a project is requested, a project directory on Lustre is created; this is in addition to a researcher's home directory and Lustre scratch directory (see ISAAC-NG File Systems for more information). To make a request, log in to the User Portal and select the button under the Projects heading that says “click here to request a new project …”. Project directories get a default quota of 1 terabyte. Additional space can be requested via a service request. Requests over 10 terabytes may be required by the HPSC Director to be reviewed by the Faculty Advisory Board before being granted.
Lustre project directories do have a quota, are used for sharing files among a research group (each project has a corresponding Linux group; see the group name in the portal for the project), and are intended for more permanent file collections.
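For example, files placed in the project directory can be made readable and writable by the project's Linux group so that all project members can work with them. A minimal sketch, using a hypothetical project directory, group name, and subdirectory:

    cd /lustre/isaac/proj/myproject      # hypothetical project directory
    chgrp -R myproject shared_data/      # assign the project's Linux group (placeholder name)
    chmod -R g+rwX shared_data/          # grant the group read/write access and directory traversal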
File System Quotas
Project directories have a group/project quota that applies to the entire directory tree, in addition to quotas for individual users. Usage is cumulative across users and projects. Suppose the project has a 20 TB quota, you use 10 TB, and another project member uses 10 TB and puts those files under the project directory tree. In that case, the next set of files placed into that directory by either user will exceed the project quota.
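To check how much of the project quota has been used, the lfs quota command can also be run against the project's Linux group, assuming the project quota is applied via that group; the group and user names below are placeholders:

    lfs quota -h -g <project_group> /lustre/isaac    # usage and quota for the whole project group
    lfs quota -h -u <username> /lustre/isaac         # usage and quota for an individual user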
If you want to see your home directory quota, use the quota -s command. The amount of space used appears in the space column, the soft quota value in the quota column, and the hard quota limit in the limit column.
When you exceed the soft quota, a grace period begins that gives you time to reduce your storage usage. If you do not reduce your usage during this period, the soft limit shown in the quota column will be enforced.
    $ quota -s
    Disk quotas for user netid (uid 00000):
         Filesystem   space   quota   limit   grace   files   quota   limit   grace
    nfs.isaac.utk.edu:/data/nfs/home
                       937M  46080M  51200M           9161    900k   1000k
Figure 2.1 – Output of quota -s
Additionally, the checkquotas command, which is available to all users, prints the Lustre quota for all project directories the user has access to, as well as the user's Lustre user quota and NFS quota.
    $ checkquotas
    ----My Lustre Project Quotas----
    ----My Lustre User Quota----
    Disk quotas for usr netid (uid 00000):
         Filesystem    used   quota   limit   grace   files    quota    limit   grace
      /lustre/isaac  851.2M      9T     10T       -   20306  4500000  5000000       -
    ----My NFS Quota----
    Disk quotas for user netid (uid 00000):
         Filesystem   space   quota   limit   grace   files   quota   limit   grace
    nfs.isaac.utk.edu:/data/nfs/home
                       937M  46080M  51200M           9161    900k   1000k
Figure 2.2 – output of checkquotas
To see how much space you have consumed in the Lustre file system overall (anywhere on /lustre), use the lfs quota command; see the example below. The output shows the storage space used in the used column and the number of files in the files column. The soft quota is in the quota column and the hard limit is in the limit column.
    $ lfs quota -h -u netid /lustre/isaac
    Disk quotas for usr netid (uid 00000):
         Filesystem    used   quota   limit   grace   files    quota    limit   grace
      /lustre/isaac  851.2M      9T     10T       -   20306  4500000  5000000       -
Figure 2.3 – output of lfs quota