High Performance & Scientific Computing

Running Jobs With SLURM on ISAAC-Legacy

INTRODUCTION

This document explains how to use the SLURM scheduler to submit, monitor, and alter jobs on the ISAAC-Legacy cluster. It covers how to run a job interactively on a compute node (an “interactive job”) and how to submit a job using SBATCH directives (a “batch job”). The cluster is set up to maximize utilization of the shared resources. To learn about responsible node sharing and its implementation, please review the Responsible Node Sharing on ISAAC webpage. A collection of complete sample job scripts is also available at /lustre/haven/examples/jobs.

Quality of Service (QOS)	Run Time Limit (Hours)	Valid Partitions
campus	24	campus-*
campus-gpu	24	campus-beacon-gpu
condo	720 [30 days]	condo-*
long	144 [6 days]	campus-beacon-long,campus-monster-long

SLURM Batch Scripts

Partitions

The SLURM scheduler organizes similar sets of nodes and features into groups, each of which is referred to as a “partition”. Each partition has hard limits on the maximum wall clock time, the maximum job size, the number of nodes a user may request, and so on. The partitions available to all ISAAC-Legacy users include “campus-beacon-gpu”, “campus-beacon-long”, “campus-monster-long”, “campus-sigma”, and “campus-sigma_bigcore”. For more information on the resources available under each partition, please refer to our System Overview page.

Condos

The ISAAC-Legacy cluster is divided into logical units referred to as “condos”. In the context of the cluster, a condo is a logical group of compute nodes that acts as an independent cluster to schedule jobs effectively. There are “institutional” and individual “private” condos within the ISAAC-Legacy cluster. All users belong to their respective institutional condo, whether that be UTK or UTHSC. Private condos are owned and used by individual investors and their associated projects (discussed below). With a private condo, the investor and their project have exclusive use of the compute nodes in the condo. Investors can choose to share their private condo with the entire cluster under the Responsible Node Sharing implementation, but this is not a requirement. In SLURM, each condo is implemented as a “partition”. Most private condos also have a corresponding “overflow” partition, which combines nodes from institutional and private condos, giving investors the option of running larger jobs than can be run on either type of condo alone.

Project Accounts

A project combines an account used in the batch system with a related Linux group used for file access control on the system. Projects and their associated batch system accounts are a key component of the cluster’s condo-based scheduling model: projects control access to institutional and private condos. For UTK institutional users, the default project account is ACF-UTK0011. For UTHSC institutional users, the default project is ACF-UTHSC0001. Other projects follow the same project identifier format. To determine which projects you belong to, log in to the User Portal and look under the “Projects” header. Always ensure you use the correct project account when submitting jobs with a SLURM directive.

Quality of Service (QOS)

A Quality of Service (QoS) is a set of restrictions on resources accessible to a job. You can always get the current list of QoS specifications via

sacctmgr show qos format=name,maxwall,maxtres,maxtrespu,maxjobspu,maxjobspa,maxsubmitpa

The QoS levels available to users are listed in the following table:

TABLE 2.2: ALLOWED QOS/ACCOUNTS BY PARTITION

QoS	Max Wall Time Per Job	Max Resource Usage Per Job	Max Resource Usage Per User (across all running jobs)	Max Running/Submitted Jobs Per User	Max Running/Submitted Jobs in Queue Per Account
campus	1 day	24 Nodes	24 Nodes	48/96	-/-
campus-gpu	1 day	2 Nodes	2 GPUs	3/6	-/-
campus-bigmem	1 day			4/8	-/-
condo	30 days	None (defaults to size of condo)	None	150/250	-/-
long	6 days		128 Cores	12/18	-/-
condo-hicap	30 days			-/500	-/-
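
The wall time limits in the table above can be checked before a job is submitted. The following is a minimal sketch, not an ISAAC-provided tool: the function names walltime_to_seconds, qos_limit_seconds, and walltime_fits_qos are hypothetical, the limits are copied from the table above, and only the days-hh:mm:ss and hh:mm:ss time forms are handled.

```shell
#!/bin/bash
# Hypothetical helpers (not part of SLURM or ISAAC): convert a wall time in
# days-hh:mm:ss or hh:mm:ss form to seconds and compare it with a QoS limit.

walltime_to_seconds() {
    t=$1
    days=0
    case $t in
        *-*) days=${t%%-*}; t=${t#*-} ;;   # split off the optional "days-" prefix
    esac
    # awk reads "08" as decimal 8, which avoids shell octal pitfalls.
    echo "$days $t" | awk '{ split($2, p, ":"); print $1*86400 + p[1]*3600 + p[2]*60 + p[3] }'
}

# Limits copied from the QoS table above (1 day, 30 days, 6 days).
qos_limit_seconds() {
    case $1 in
        campus|campus-gpu|campus-bigmem) echo $((1 * 86400)) ;;
        condo|condo-hicap)               echo $((30 * 86400)) ;;
        long)                            echo $((6 * 86400)) ;;
        *) echo "unknown QoS: $1" >&2; return 1 ;;
    esac
}

# Succeeds when the requested wall time ($2) is within the limit of the QoS ($1).
walltime_fits_qos() {
    limit=$(qos_limit_seconds "$1") || return 2
    [ "$(walltime_to_seconds "$2")" -le "$limit" ]
}

if walltime_fits_qos long 2-12:00:00; then
    echo "2-12:00:00 fits within the long QoS"
fi
```

The limits are hard-coded here for illustration only; the sacctmgr show qos command remains the authoritative source for current values.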

Job Specifications

Each user has a list of valid account/partition/QoS combinations they may specify, and each partition has a list of valid accounts which may submit jobs to that partition. For example, to submit a job under the campus partition, use the below batch header:

#SBATCH --account=ACF-UTK0011 (or #SBATCH -A ACF-UTK0011)
#SBATCH --partition=campus-<partition-name> (or #SBATCH -p campus-<partition-name>)
#SBATCH --qos=campus (or #SBATCH -q campus)

To submit a job under a private condo, use the below batch header:

#SBATCH --account ACF-UTK<project-number>
#SBATCH --partition=condo-<condo-name>
#SBATCH --qos=condo

Important Note: It is the combination of a partition, QoS, and project account which the SLURM scheduler program uses to allocate computing resources from the cluster for each job. If a user does not appropriately specify all three, the SLURM utility may not be able to process the user’s job as submitted due to a lack of information from the user.
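
Because all three settings are needed, it can help to check a job script for them before submitting. The sketch below is a hypothetical pre-submission check, not an ISAAC tool: the function names has_directive and check_job_script are made up for illustration, and the check only looks at #SBATCH lines in the file (options passed on the sbatch command line are not considered).

```shell
#!/bin/bash
# Hypothetical pre-submission check: verify that a batch script names an
# account, a partition, and a QoS in its #SBATCH lines.

# $1 = script file, $2 = long option name, $3 = short option letter
has_directive() {
    grep -Eq "^#SBATCH[[:space:]]+(--$2|-$3)[=[:space:]]" "$1"
}

check_job_script() {
    script=$1
    missing=""
    has_directive "$script" account A   || missing="$missing account"
    has_directive "$script" partition p || missing="$missing partition"
    has_directive "$script" qos q       || missing="$missing qos"
    if [ -n "$missing" ]; then
        echo "missing directives:$missing" >&2
        return 1
    fi
    echo "all three directives present"
}

# Demonstration on a temporary script that uses the institutional defaults.
demo=$(mktemp)
cat > "$demo" <<'EOF'
#!/bin/bash
#SBATCH --account=ACF-UTK0011
#SBATCH --partition=campus
#SBATCH --qos=campus
hostname
EOF
check_job_script "$demo"
rm -f "$demo"
```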

The isaac-sinfo command displays the partitions, and the node information for each partition, on a system running Slurm. It is a customized wrapper that allows easier access to information about certain classes of partitions. For example, if you type isaac-sinfo help, it will show:

Usage: isaac-sinfo <campus,gpu,bigmem,long,condo>

Typing isaac-sinfo campus then shows information only about the campus partitions:

campus	up infinite	3	mix clr[0738,0821,0824]
campus	up infinite	26	alloc clr[0707-0709,0724-0737,0822-0823,0825,0829-0832,1241,1244]
campus	up infinite	5	idle clr[0826-0828,1242-1243]
campus	up infinite	1	down clr0710
campus-bigmem	up 30-00:00:0	3	idle ilm[0833,0835,0837]
campus-gpu	up 30-00:00:0	1	mix clrv0701
campus-gpu	up 30-00:00:0	2	idle clrv[0703,0801]
campus-gpu-bigmem	up 30-00:00:0	1	idle ilpa1209
campus-gpu-large	up 30-00:00:0	8	idle clrv[1101,1103,1105,1107,1201,1203,1205,1207]

If a user specifies an account or QoS that is not valid for a partition, or requests an invalid account/QoS combination for their NetID, the job will be rejected. Detailed information about all the partitions and their respective nodes on ISAAC-Legacy can be viewed using the following command:

 $ sinfo -Nel

For more information on the flags used by commands, please refer to the Slurm documentation. There are a handful of partitions that will be returned by the above command which are not documented below. These are reserved for administrative purposes, and end users should ignore them.

You can always see the valid accounts for a partition with

$ scontrol show partition <partition-name> | grep AllowAccounts

You can always see your netid’s valid account/qos combinations with

$ sacctmgr show assoc | grep <netid>
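
The unfiltered listing can be long. A minimal sketch of a more targeted query follows, assuming the standard sacctmgr options -n (no header), -P (pipe-delimited output), the user= filter, and the format= field list. The function name expand_assoc and the sample line fed to it are illustrative only and are not real output from ISAAC-Legacy.

```shell
#!/bin/bash
# Hypothetical formatter: turn "account|partition|qos,qos" lines, as printed by
#   sacctmgr -nP show assoc user=<netid> format=account,partition,qos
# into one "account partition qos" line per QoS. An empty partition field is
# shown as "-".
expand_assoc() {
    awk -F'|' '{
        n = split($3, q, ",")
        for (i = 1; i <= n; i++)
            print $1, ($2 == "" ? "-" : $2), q[i]
    }'
}

# Illustrative input only; on a login node the input would come from sacctmgr.
printf 'ACF-UTK0011|campus|campus,campus-gpu\n' | expand_assoc
```

On a login node the same function would be used as sacctmgr -nP show assoc user=<netid> format=account,partition,qos | expand_assoc.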

However, this is not usually necessary. As a general rule, the valid QoS and account combinations for each partition are as follows:

Condo(s)	Allowed QoS	Allowed Account(s)
campus-gpu	campus-gpu	All
campus-beacon-long	campus, long	All
campus-monster-long	campus, long	All
campus-rho	campus	All
campus-bigmem	campus	All
campus-sigma	campus	All
campus-sigma_bigcore	campus	All
campus-beacon-gpu	campus-gpu	All

The general association between allowed accounts and QoS will be as follows:

Account Type	Example Account(s)	Valid QoS
Institutional Account	ACF-UTK0011, ACF-UTHSC0001	campus, campus-gpu, campus-bigmem
Private/Project/Class Account	ACF-UTK<project-number>, ACF-UTHSC<project-number>	condo, campus, campus-gpu, campus-bigmem

TABLE 2.3: ALLOWED QOS BY ACCOUNT

You can always determine your current active accounts or request access to new accounts in the user portal.

Please note that in addition to the above directives (project account, partition, and QoS), users also need to specify the type of nodes, number of nodes, number of cores, and estimated maximum wall time, either as parameters or as optional SBATCH directives, when submitting a job script.

Once a job is submitted, the SLURM scheduler checks for available resources and allocates them to the job to launch its tasks.

Important Note: Users can choose to run a job exclusively on an entire node by using the exclusive flag while also using srun to distribute tasks among different CPUs. See Table 3.3 for how to use this flag.

The order in which jobs are run depends on the following factors:

Number of nodes requested – jobs that request more nodes get a higher priority.
Queue wait time – a job’s priority increases along with its queue wait time (not counting blocked jobs as they are not considered “queued.”)
Number of jobs – a maximum of ten jobs per user, at a time, will be eligible to run.

In certain special cases, the priority of a job may be manually increased upon request. To request a priority change, contact the OIT HelpDesk with the job ID and the reason for the request.

SLURM Commands/Variables

The table below lists a few important SLURM commands, with descriptions, that are most often used when working with the SLURM scheduler and can be run on the ISAAC-Legacy login nodes.

Command	Description
sbatch jobscript.sh	Submits the job script to request resources
squeue	Displays the status of all jobs
squeue -u username	Displays the status and other information for all of a user’s jobs
squeue -j jobid	Displays the status and information of a particular job
scancel jobid	Cancels the job with the given jobid
scontrol show jobid/partition value	Displays detailed information about a job or a partition
scontrol update	Alters the resources of a pending job
salloc	Allocates resources for an interactive job

TABLE 3.1: COMMON SLURM COMMANDS AND ASSOCIATED DESCRIPTIONS

SLURM Variables

The table below lists a few important Slurm variables that ISAAC-Legacy users may find useful.

Variable	Description
SLURM_SUBMIT_DIR	The directory from where the job is submitted
SLURM_JOBID	The job identifier of the submitted job
SLURM_NODELIST	List of nodes allocated to a job
SLURM_NTASKS	The total number of tasks in the job (by default, one CPU core per task)

TABLE 3.2: SLURM VARIABLES USED TO GET THE INFORMATION OF AN ISAAC-Legacy RUNNING JOB
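
As an illustration of how these variables might be used inside a job script, the sketch below prints a short summary of the job. The function name job_summary and the placeholder text are arbitrary choices; outside a SLURM job the variables are unset, so the placeholder is printed instead.

```shell
#!/bin/bash
# Print the variables from Table 3.2. Each one falls back to a placeholder
# when it is unset, so the function also runs outside a SLURM job.
job_summary() {
    echo "Job ID:           ${SLURM_JOBID:-<not set>}"
    echo "Submit directory: ${SLURM_SUBMIT_DIR:-<not set>}"
    echo "Node list:        ${SLURM_NODELIST:-<not set>}"
    echo "Number of tasks:  ${SLURM_NTASKS:-<not set>}"
}

job_summary
```

Placing a call like this near the top of a batch script records in the job's output file where and how the job ran.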

SBATCH Flags

Jobs are submitted to the ISAAC-Legacy cluster with the sbatch command, which passes the resource requests in the job script to the SLURM scheduler. Resources are requested in the job script using “#SBATCH” directives. Note that the SLURM scheduler accepts most SBATCH flags in two formats, a long form (for example, --nodes=1) and a short form (for example, -N 1); users may use either. The description of each of the SBATCH flags is given below:

Flags	Description
#SBATCH -J Jobname	Name of the job
#SBATCH --account=ProjectAccount (or -A ProjectAccount)	Project account to which the time will be charged
#SBATCH --time=days-hh:mm:ss (or -t days-hh:mm:ss)	Requested wall time for the job
#SBATCH --nodes=1 (or -N 1)	Number of nodes needed
#SBATCH --ntasks=48 (or -n 48)	Total number of cores requested
#SBATCH --ntasks-per-node=48	Number of cores requested per node
#SBATCH --partition=campus (or -p campus)	Selects the partition or queue
#SBATCH --output=Jobname.o%j (or -o Jobname.o%j)	The file where the terminal output is written
#SBATCH --error=Jobname.e%j (or -e Jobname.e%j)	The file where run time errors are written
#SBATCH --exclusive	Requests exclusive access to the node(s)
#SBATCH --array=index (or -a index)	Used to run multiple jobs with identical parameters
#SBATCH --chdir=directory	Used to change the working directory. The default working directory is the one from which the job is submitted
#SBATCH --qos=campus	The Quality of Service level for the job

TABLE 3.3: SLURM FLAGS USED IN CREATING JOB SCRIPTS WITH DESCRIPTIONS

Submitting Jobs with SLURM

On ISAAC-Legacy, batch jobs can be submitted in two ways: (i) interactive batch mode, and (ii) non-interactive batch mode.

Interactive Batch Mode

Interactive batch jobs give users interactive access to compute nodes. In interactive batch mode, users request that the SLURM scheduler allocate compute node resources directly from the terminal’s command line. A common use for interactive batch jobs is to debug a calculation or a program before submitting non-interactive batch jobs for production runs. This section demonstrates how to run interactive jobs through the batch system and provides common usage tips.

Interactive batch mode can be invoked on an ISAAC-Legacy login node by using the salloc command followed by specific SBATCH flags to request specific resources to which the user would like access. The different SBATCH flags are given in table 3.3.

$ salloc -A projectaccount --nodes=1 --ntasks=1 --partition=campus --time=01:00:00
or 
$ salloc -A projectaccount -N 1 -n 1 -p campus -t 01:00:00

The salloc command passes the user’s resource request to the SLURM scheduler. In the example command above, we requested that the SLURM scheduler allocate one node and one CPU core for a total time of 1 hour using the “campus” partition. Note that if the salloc command is executed without specifying the resources (i.e., number of nodes, number of CPUs, tasks, wall clock time, etc.), then the SLURM scheduler will allocate the default resources.

Important Note: The default resources on ISAAC-Legacy are: one processor (1 CPU on 1 node) located under the campus partition, with a maximum wall clock time of 1 hour.

When the SLURM scheduler allocates the resources, the user who submitted the job gets a message on the terminal (as shown below) containing the information about the jobid and the hostname of the compute node where the resources are allocated. Notice that in the below example the jobid is “1234” and the hostname of the compute node is “nodename.”

 $ salloc --nodes=1 --ntasks=1 --time=01:00:00 --partition=campus
  salloc: Granted job allocation 1234
  salloc: Waiting for resource configuration
  salloc: Nodes nodename are ready for job
 $

Once the interactive job starts, the user needs to ssh to the allocated node (or to one of the allocated nodes, for jobs requesting more than one node) and should change the working directory to a Lustre file system project space or scratch space to run computationally intensive applications. It is best to run large jobs in Lustre scratch space rather than Lustre project space, because there is no limit on the amount of data a user may place in Lustre scratch space, whereas Lustre project space has a limit. Please visit File System for more information. Users with questions about how best to run their interactive jobs are welcome to contact HPSC staff by submitting an HPSC Service Request.

To run the parallel executable, we recommend using srun followed by the executable as shown below:

 $ srun executable
or
 $ srun -n <nprocs> executable

Important Note: Users do not necessarily need to specify the number of processors when calling srun to run a parallel executable. The SLURM wrapper srun will execute the calculation in parallel on the number of processors requested in the user’s job script. Serial (that is, non-parallel) applications can be run both with and without srun.

Non-Interactive Batch Mode

In non-interactive batch mode, the set of resources as well as the commands for the application to be run are written in a text file, which is referred to as a “batch file” or “batch script”. A user’s batch script for a particular job is submitted to the SLURM scheduler by using the sbatch command. Batch scripts are very useful for users who want to run production jobs, because they allow users to work on a cluster non-interactively. In batch jobs, users submit a group of commands to SLURM and then simply check the job’s status and the output of the commands from time to time. However, sometimes it is very useful to run a job interactively (primarily for debugging). See the Interactive Batch Mode section above for how to run jobs interactively. A typical example of a non-interactive batch script/job script is given below:

 #!/bin/bash
 # This file is a submission script to request the ISAAC resources from SLURM
 #SBATCH -J job                        # The name of the job
 #SBATCH -A ACF-UTK0011                # The project account to be charged
 #SBATCH --nodes=1                     # Number of nodes
 #SBATCH --ntasks-per-node=48          # CPUs per node
 #SBATCH --partition=campus            # If not specified then default is "campus"
 #SBATCH --time=0-01:00:00             # Wall time (days-hh:mm:ss)
 #SBATCH --error=job.e%J               # The file where run time errors will be dumped
 #SBATCH --output=job.o%J              # The file where the output of the terminal will be dumped
 #SBATCH --qos=campus                  # The Quality of Service level for the job

 # Now list your executable command/commands.
 # Example for code compiled with a software module:
 module load example/test

 hostname
 sleep 100
 srun executable

The above job script can be divided into three sections:

Shell interpreter (one line)
- The first line of the script specifies the script’s interpreter. The syntax of this line is #!/bin/shellname (sh, bash, csh, ksh, zsh).
- This line is essential. If it is missing, the scheduler will print an error.
SLURM submission options
- The second section consists of the lines starting with ‘#SBATCH’.
- These lines are not comments, even though they start with ‘#’.
- #SBATCH is a SLURM directive that communicates the resources requested by the user in the batch script file.
- #SBATCH options that appear after the first non-comment line are ignored by the SLURM scheduler.
- Each of the flags is described in Table 3.3.
- The sbatch command is used on the terminal to submit the non-interactive batch script.
Shell commands
- The shell commands follow the last #SBATCH line.
- These are the commands or tasks that the user wants to run, including loading any software modules needed to access a particular application.
- To run a parallel application, it is recommended to use srun followed by the full path and name of the executable if the executable’s path is not loaded into the Slurm environment when the script is submitted.

For a quick start, we have also provided a collection of complete sample job scripts on the ISAAC-Legacy cluster at /lustre/isaac/examples/jobs.
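
When a batch script is submitted from another script, it is often useful to capture the job ID for later monitoring. The following is a minimal sketch, assuming the standard sbatch --parsable option, which prints only the job ID (optionally followed by a semicolon and a cluster name). The wrapper name submit_job and the file name jobscript.sh are placeholders.

```shell
#!/bin/bash
# Hypothetical wrapper: submit a script with sbatch and print only the job ID.
# "sbatch --parsable" prints "<jobid>" or "<jobid>;<cluster>".
submit_job() {
    out=$(sbatch --parsable "$@") || return 1
    echo "${out%%;*}"   # drop the optional ";<cluster>" suffix
}

# Only attempt a submission where sbatch is actually installed.
if command -v sbatch >/dev/null 2>&1; then
    jobid=$(submit_job jobscript.sh) && squeue -j "$jobid"
else
    echo "sbatch not found; run this on an ISAAC-Legacy login node"
fi
```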

Job Arrays

SLURM offers job arrays for users whose batch jobs require identical resources. Using the array flag in a job script, users can submit multiple jobs with a single sbatch command. Although the job script is submitted only once, the individual jobs in the array are scheduled independently. All jobs in the array share one array job identifier ($SLURM_ARRAY_JOB_ID), and each individual job is distinguished by SLURM’s environment variable $SLURM_ARRAY_TASK_ID. To understand this variable, consider the example SLURM script given below:

#!/bin/bash
#SBATCH -J myjob
#SBATCH -A ACF-UTK0011
#SBATCH -N 1
#SBATCH --ntasks-per-node=30    # Use --ntasks instead to define the total number of processors
#SBATCH --time=01:00:00
#SBATCH --partition=campus
##SBATCH -e myjob.e%j           # Disabled by the extra '#'; errors would be written to this file
#SBATCH -o myjob%A_%a.out       # Separate output file for each array job; %A is replaced by the jobid and %a by the array index
#SBATCH --qos=campus
#SBATCH --array=1-30            # Submit an array of jobs numbered 1 to 30

###########   Perform some simple commands   ########################
set -x
###########   Create the script file needed by this array job (30 files in total across the array)   ###############
cp sleep_test.sh "sleep_test${SLURM_ARRAY_TASK_ID}.sh"

###########   Run your executable   ###############
sh "sleep_test${SLURM_ARRAY_TASK_ID}.sh"

In the above example, 30 executable files named sleep_test$index.sh are created, differentiated by an index. These 30 files can be run either by submitting 30 individual jobs or by using a SLURM array. A SLURM array treats these files as an array named sleep_test[1-30].sh and executes them. The variable SLURM_ARRAY_TASK_ID is set to the array index value (between 1 and 30, with 30 being the total number of jobs in the array in this example), and the index range is defined in the SLURM script above using the #SBATCH directive

#SBATCH --array=1-30

The number of jobs from an array that run simultaneously can also be limited by appending %n to the value of the --array flag. For example, to run only 5 jobs at a time in a SLURM array, users can include the SLURM directive

#SBATCH --array=1-30%5

To create a separate output file for each job submitted through a SLURM array, use %A and %a, which represent the jobid and the job array index, as shown in the above example.
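
Another common way to use the array index is to select one input per array job from a list file, instead of creating a numbered copy of a script for each index. The sketch below assumes a list with one input name per line; the function name input_for_task and the input names are hypothetical, and the index defaults to 1 so that the example also runs outside a job array.

```shell
#!/bin/bash
# Return line N of a list file, where N is the array index.
input_for_task() {
    sed -n "${1}p" "$2"
}

# Outside a job array SLURM_ARRAY_TASK_ID is unset; default to 1 for the demo.
task_id=${SLURM_ARRAY_TASK_ID:-1}

# Hypothetical list of inputs, one per array index.
list=$(mktemp)
printf '%s\n' alpha.dat beta.dat gamma.dat > "$list"

echo "array index $task_id would process $(input_for_task "$task_id" "$list")"
rm -f "$list"
```

With a list of three inputs, the matching directive in the job script would be #SBATCH --array=1-3.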

Exclusive Access to Nodes

As explained in the Scheduling Policy, jobs that are submitted by the same user can share computational nodes. However, users can request in their job scripts that whole node(s) be reserved for their jobs without sharing them with other jobs. To do so, use the options below:

Interactive batch mode:

 $ salloc -A projectaccount --nodes=1 --ntasks=1 --partition=campus --time=01:00:00 --exclusive --qos=campus

Non-Interactive batch mode:

Add the below line in your job script

 #SBATCH --exclusive

Monitoring Job Status

Users can check the status of their jobs at any time by using the squeue command.

$ squeue -u <netID>
              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              1202    campus     Job3 username PD       0:00      2 (Resources)
              1201    campus     Job1 username  R       0:05      2 node[001-002]
              1200    campus     Job2 username  R       0:10      2 node[004-005]

The description of each of the columns of the output from squeue command is given below:

Name of Column	Description
JOBID	The unique identifier of each job
PARTITION	The partition/queue from which the resources are to be allocated to the job
NAME	The name of the job specified in the SLURM script using #SBATCH -J option. If the -J option is not used, SLURM will use the name of the batch script.
USER	The login name of the user submitting the job
ST	Status of the job. SLURM scheduler uses short notation to give the status of the job. The meaning of these short notations is given in the table below.
TIME	The wall time used by the job so far
NODES	The number of nodes requested by or allocated to the job
NODELIST(REASON)	The names of the allocated nodes if the job is running, or the reason the job is waiting if it is pending

TABLE 3.4: SQUEUE COMMAND OUTPUTS

Description of the output

When a user submits a job, it passes through various states. The squeue command reports the current state of a job under the ST column. The possible values under the ST column are given below:

Status Value	Meaning	Description
CG	Completing	Job is about to complete.
PD	Pending	Job is waiting for the resources to be allocated
R	Running	Job is running on the allocated resources
S	Suspended	Job was allocated resources but the execution got suspended due to some problem and CPUs are released for other jobs
NF	Node Failure	Job terminated due to failure of one or more allocated nodes

TABLE 3.5: POSSIBLE STATES OF A SLURM JOB
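
The short state codes can be translated back into their meanings in a script. The sketch below encodes only the states listed in Table 3.5; the function names squeue_state_meaning and describe_jobs are hypothetical, and the sample input mirrors the squeue example shown earlier. On a login node, comparable input could be produced with squeue -h -u <netID> -o '%i %t', where %i is the job ID and %t is the short state code.

```shell
#!/bin/bash
# Map a short squeue state code to its meaning (states from Table 3.5 only).
squeue_state_meaning() {
    case $1 in
        CG) echo "Completing" ;;
        PD) echo "Pending" ;;
        R)  echo "Running" ;;
        S)  echo "Suspended" ;;
        NF) echo "Node Failure" ;;
        *)  echo "Unknown ($1)" ;;
    esac
}

# Read "jobid state" pairs, one per line, and print a readable summary.
describe_jobs() {
    while read -r jobid state; do
        echo "$jobid: $(squeue_state_meaning "$state")"
    done
}

# Sample input matching the squeue example above.
printf '1202 PD\n1201 R\n1200 R\n' | describe_jobs
```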

Altering Batch Jobs

Users are allowed to change the attributes of their jobs until the jobs start running. This section describes how to alter batch jobs, with examples.

Remove a Job from Queue

Users can remove any of their own jobs, in any state, using the scancel command.

To remove a job with a JOB ID 1234, use the command:

scancel 1234

Modifying the Job Details

Users can use the SLURM command scontrol to alter a variety of SLURM parameters. Most scontrol commands can only be executed by an ISAAC-Legacy System Administrator; however, users are permitted to use scontrol on jobs they have submitted, provided the jobs are not yet running.

Release/Hold a job

scontrol release jobid
scontrol hold jobid

Modify the name of the job

scontrol update JobID=jobid JobName=any_new_name

Modify the total number of tasks

scontrol update JobID=jobid NumTasks=Total_tasks

Modify the number of CPUs per node

scontrol update JobID=jobid MinCPUsNode=CPUs

Modify the wall time of the job

scontrol update JobID=jobid TimeLimit=days-hh:mm:ss
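
A mistyped job ID or time string is easy to catch before the command is run. The following is a minimal dry-run sketch: the function name new_timelimit_command is hypothetical, it accepts only plain numeric job IDs and the days-hh:mm:ss or hh:mm:ss time forms, and it only prints the scontrol command rather than executing it.

```shell
#!/bin/bash
# Hypothetical dry-run helper: validate a job ID and a new wall time, then
# print the scontrol command that would apply the change to a pending job.
new_timelimit_command() {
    jobid=$1
    limit=$2
    case $jobid in
        ''|*[!0-9]*) echo "job ID must be numeric: $jobid" >&2; return 1 ;;
    esac
    if ! printf '%s\n' "$limit" | grep -Eq '^([0-9]+-)?[0-9]{1,2}:[0-9]{2}:[0-9]{2}$'; then
        echo "expected days-hh:mm:ss or hh:mm:ss, got: $limit" >&2
        return 1
    fi
    echo "scontrol update JobID=$jobid TimeLimit=$limit"
}

new_timelimit_command 1234 0-02:00:00
```

After reviewing the printed command, it can be run as shown on a login node.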

The University of Tennessee, Knoxville

Office of Innovative Technologies
