Running Jobs With SLURM on ISAAC-Legacy
INTRODUCTION
This document explains how to use the SLURM scheduler to submit, monitor, and alter jobs on ISAAC-Legacy. It covers how to run a job interactively on a compute node (an “interactive job”) and how to submit a job script that uses SBATCH directives (a “batch job”). The cluster is set up to maximize utilization of the shared resources. To learn about responsible node sharing and its implementation, please review the Responsible Node Sharing on ISAAC webpage. A collection of complete sample job scripts is also available at /lustre/haven/examples/jobs.
Quality of Service (QOS) | Run Time Limit (Hours) | Valid Partitions |
campus | 24 | campus-* |
campus-gpu | 24 | campus-beacon-gpu |
condo | 720 [30 days] | condo-* |
long | 144 [6 days] | campus-beacon-long,campus-monster-long |
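The two job types described in the introduction map onto two SLURM commands: `srun --pty` for an interactive job and `sbatch` for a batch job. The sketch below only prints the two command lines so they can be reviewed before use. The account, partition, and QoS names are taken from this page; the one-hour wall time, the function names, and the script name `job.sh` are illustrative assumptions. The requested time must stay within the QoS limit in the table above.

```shell
#!/bin/sh
# Illustrative values: the account, partition, and QoS names come from this
# page; the one-hour wall time and the script name job.sh are assumptions.
ACCOUNT="ACF-UTK0011"
PARTITION="campus-sigma"
QOS="campus"
WALLTIME="01:00:00"

# Print the command for an interactive shell on one compute node.
interactive_cmd() {
    echo "srun --account=$ACCOUNT --partition=$PARTITION --qos=$QOS" \
         "--nodes=1 --ntasks=1 --time=$WALLTIME --pty bash -i"
}

# Print the command that submits a batch script to the same resources.
batch_cmd() {
    echo "sbatch --account=$ACCOUNT --partition=$PARTITION --qos=$QOS" \
         "--time=$WALLTIME job.sh"
}

interactive_cmd
batch_cmd
```

Running the printed `srun` line on a login node starts the interactive shell once the resources are granted; the `sbatch` line queues `job.sh` and returns immediately.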
SLURM Batch Scripts
Partitions
The SLURM scheduler organizes similar sets of nodes and features into groups, each of which is referred to as a “partition”. Each partition has hard limits on, among other things, the maximum wall clock time, the maximum job size, and the number of nodes a user can request. The partitions available to all ISAAC-Legacy users are “campus-beacon-gpu”, “campus-beacon-long”, “campus-monster-long”, “campus-sigma”, and “campus-sigma_bigcore”. For more information on the resources available under each partition, please refer to our System Overview page.
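The limits of a partition can also be inspected from a login node with `sinfo`. The following is a minimal sketch, assuming the SLURM client tools are on the `PATH`; the helper name `partition_summary` and the variable `CAMPUS_PARTITIONS` are hypothetical. The format string asks `sinfo` for the partition name (`%P`), time limit (`%l`), node count (`%D`), and node state (`%t`).

```shell
#!/bin/sh
# Partition names as listed on this page.
CAMPUS_PARTITIONS="campus-beacon-gpu campus-beacon-long campus-monster-long campus-sigma campus-sigma_bigcore"

# Hypothetical helper: print name, time limit, node count, and node state
# for one partition. Requires the SLURM client tools (a login node).
partition_summary() {
    if ! command -v sinfo >/dev/null 2>&1; then
        echo "sinfo not found; run this on a cluster login node" >&2
        return 1
    fi
    sinfo --partition="$1" --format="%P %l %D %t"
}

# Summarize every campus partition; stop at the first failure.
for p in $CAMPUS_PARTITIONS; do
    partition_summary "$p" || break
done
```

For every configured field of a single partition, `scontrol show partition campus-sigma` gives a more detailed listing.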
Condos
The ISAAC-Legacy cluster is divided into logical units referred to as “condos”. In the context of the cluster, a condo is a logical group of compute nodes that acts as an independent cluster for scheduling jobs. There are “institutional” and individual “private” condos within the ISAAC-Legacy cluster. All users belong to their respective institutional condo, whether that be UTK or UTHSC. Private condos are owned and used by individual investors and their associated projects (discussed below). With a private condo, the investor and the associated project have exclusive use of the compute nodes in the condo. Investors can choose to share their private condo with the entire cluster under the Responsible Node Sharing implementation, but this is not a requirement. In SLURM, each condo is implemented as a “partition”. Most private condos also have a corresponding “overflow” partition, which combines nodes from institutional and private condos, giving investors the option of running larger jobs than can be run on either type of condo alone.
Project Accounts
A project combines an account used in the batch system with a related Linux group used for file access control on the system. Projects and their associated batch-system accounts are a key component of the cluster’s condo-based scheduling model: they control access to institutional and private condos. For UTK institutional users, the default project account is ACF-UTK0011. For UTHSC institutional users, the default project account is ACF-UTHSC0001. Other projects follow the same project identifier format. To determine which projects you belong to, log in to the User Portal and look under the “Projects” header. Always ensure you use the correct project account when submitting jobs with a SLURM directive.
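As an illustration of setting the project account with a SLURM directive, the sketch below writes a minimal batch script and lists its `#SBATCH` lines. The account, partition, and QoS names are taken from this page; the job name, resource sizes, wall time, and file names are assumptions chosen for the example and should be replaced with values that match your project.

```shell
#!/bin/sh
# Write a minimal batch script. The account is the default UTK institutional
# project from this page; the job name, resource sizes, wall time, and file
# names are illustrative assumptions.
cat > hello_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --account=ACF-UTK0011
#SBATCH --partition=campus-sigma
#SBATCH --qos=campus
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --output=hello_%j.out

echo "Job $SLURM_JOB_ID running on $(hostname)"
EOF

# Show the directives that will be passed to the scheduler.
grep '^#SBATCH' hello_job.sh
```

The script is submitted with `sbatch hello_job.sh`. An `--account` option given on the `sbatch` command line takes precedence over the directive inside the script.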
Quality of Service (QoS)
A Quality of Service (QoS) is a set of restrictions on the resources accessible to a job. You can always get the current list of QoS specifications with the following command:
sacctmgr show qos format=name,maxwall,maxtres,maxtrespu,maxjobspu,maxjobspa,maxsubmitpa
The QoS available to users are listed in the following table:
TABLE 2.2: ALLOWED QOS/ACCOUNTS BY PARTITION
QoS | Max Wall Time Per Job | Max Resource Usage Per Job | Max Resource Usage Per User (across all running jobs) | Max Running/Submitted Jobs Per User | Max Running/Submitted Jobs in Queue Per Account |
campus | 1 day | 24 Nodes | 24 Nodes | 48/96 | -/- |
campus-gpu | 1 day | 2 Nodes | 2 GPUs | 3/6 | -/- |
campus-bigmem | 1 day | | | 4/8 | -/- |
condo | 30 days | None (defaults to size of condo) | None | 150/250 | -/- |
long | 6 days | 128 Cores | | 12/18 | -/- |
condo-hicap | 30 days | | | -/500 | -/- |
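Because these limits can change, it may be useful to query a single QoS directly rather than rely on the table. The following is a minimal sketch, assuming the SLURM client tools are available; the helper name `qos_limits` is hypothetical, and the field list is a subset of the `sacctmgr` command shown earlier in this section.

```shell
#!/bin/sh
# Hypothetical helper: print the limits of one QoS as pipe-delimited fields
# without a header line, which is easier to parse than the default table.
qos_limits() {
    if ! command -v sacctmgr >/dev/null 2>&1; then
        echo "sacctmgr not found; run this on a cluster login node" >&2
        return 1
    fi
    sacctmgr --noheader --parsable2 show qos name="$1" \
        format=name,maxwall,maxtres,maxtrespu,maxjobspu
}

# Query the campus QoS; ignore the failure when SLURM is not installed.
qos_limits campus || true
```

The `--parsable2` option separates fields with `|` and omits the trailing delimiter, so the output can be split with tools such as `cut -d'|'`.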