Running Jobs on ISAAC ORI
**The compute resources of ISAAC-ORI are managed using the SLURM scheduler and workload manager.**
Please note: The login nodes should ONLY be used for basic tasks such as: file editing, file management, code compilation, and job submission.
Introduction
The ISAAC-ORI cluster offers access to its compute resources in two ways: (i) submission of a SLURM job via the command line, or (ii) web-based job submission via Open OnDemand (see the ISAAC-ORI web page about Open OnDemand for further details). The compute resources on the ISAAC-ORI cluster are managed by the SLURM workload manager. This page describes how to use the compute resources of ISAAC-ORI efficiently from the command line and how to submit, monitor, and manage computational jobs for production work using the SLURM scheduler.
A collection of example job scripts is available at /lustre/haven/examples/jobs.
The table below shows the SLURM Quality of Service (QOS) categories, their run time limits, and the partitions that can be selected for each QOS.
Quality of Service (QOS) | Run Time Limit (Hours) | Valid Partitions |
campus | 24 | campus |
campus-gpu | 24 | campus-gpu |
condo | 720 [30 days] | condo-* |
General Information
As explained on the Access and Login page, when you log in to the ISAAC-ORI HPSC cluster, you will be directed to one of the login nodes, from which you can access the cluster to do your research work. Please note that the login nodes should only be used for basic tasks such as file editing, code compilation, and job submission. Running production jobs on the login nodes is highly discouraged, and any production job running on a login node is subject to termination.
If a user is not sure how to determine whether their job(s) are running on a login node, the user is encouraged to reach out to the High Performance & Scientific Computing (HPSC) group for assistance by submitting a help request ticket using the “Submit HPSC Service Request” button at the top right corner of any of the HPSC website pages at oit.utk.edu/hpsc.
As discussed above, login nodes should not be used to run production jobs. Any kind of production work should be performed on the system's compute resources. The compute resources of the ISAAC-ORI cluster are managed and allocated by the SLURM scheduler (Simple Linux Utility for Resource Management), which uses several partitions to allocate resources efficiently to different jobs.
This page provides information for getting started with the batch facilities of the SLURM scheduler as well as basic job execution. Before getting started with job submission, however, it is important to understand the directories on the ISAAC-ORI cluster from which jobs can be submitted.
Job Directories
By default, the SLURM scheduler uses the directory from which a job is submitted as the working directory. It is recommended that jobs be submitted from within a directory in the Lustre file system. The following storage spaces are available for users to choose from when deciding the directory from which to run their job(s):
- /lustre/haven/proj/<project-name> – The project directory for the project under which the job is being run. Users can create their own directory, named with their username, under this project directory to store data. This storage space is allocated for a specific project and is made available upon request by the Principal Investigator of the project.
- /lustre/haven/scratch/<netid> – All ISAAC-ORI users have access to a scratch directory, which can be accessed by specifying the path or by using the $SCRATCHDIR environment variable.
Please refer to the File Systems page for more details on home, project, and scratch directories.
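For illustration, a job script can change into one of these directories before running its workload. The sketch below is a minimal example; projectaccount and ./executable are placeholders, and $SCRATCHDIR is set by the cluster environment:

```shell
#!/bin/bash
#SBATCH -J scratch-example
#SBATCH -A projectaccount        # placeholder project account
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --partition=campus
#SBATCH --qos=campus
#SBATCH --time=00:30:00

# Run from Lustre scratch space; $SCRATCHDIR points to
# /lustre/haven/scratch/<netid> on ISAAC-ORI.
cd "$SCRATCHDIR"
srun ./executable                # placeholder executable
```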
SLURM Batch Scripts
In this section, we will explain how to request the SLURM scheduler to allocate the resources for a job and how to submit those jobs.
Partitions
SLURM scheduler organizes similar sets of nodes and job features into individual groups, which are individually referred to as a “partition”. Each partition has hard limits for the maximum wall clock time, maximum job size, the upper limit to the number of nodes a user can request, etc. The partitions available for all ISAAC-ORI users include “campus” and “campus-gpu” under which a number of CPU and GPU cores are available. The default partition is “campus”. For more information on available resources under each partition, please refer to our System Overview page.
Scheduling Policy
ISAAC-ORI cluster uses a variety of mechanisms to determine how to schedule jobs. Most of the resource request mechanisms can be manipulated by users to ensure that their jobs are scheduled and executed in a reasonable time period.
Slurm Commands
The table below lists a few important Slurm commands, with a description of each, that are most often used when working with the Slurm scheduler; all of them can be run on the ISAAC-ORI login nodes.
Command | Description |
sbatch jobscript.sh | Submits the job script to request the resources |
squeue | Displays the status of all jobs |
squeue -u username | Displays the status and other information for all of a user's jobs |
squeue -j jobid | Displays the status and information of a particular job |
scancel jobid | Cancels the job with the given jobid |
scontrol show job/partition value | Displays information about a job or a partition |
scontrol update | Alters the resources of a pending job |
salloc | Allocates resources for an interactive job |
Slurm Variables
A few important Slurm variables that ISAAC-ORI users may find useful are tabulated below.
Variable | Description |
SLURM_SUBMIT_DIR | The directory from which the job was submitted |
SLURM_JOBID | The job identifier of the submitted job |
SLURM_NODELIST | The list of nodes allocated to the job |
SLURM_NTASKS | The total number of tasks requested for the job |
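For example, a job script can print these variables to its output file for record keeping. The snippet below is a sketch; the fallback values after `:-` are only so that it can be run outside a Slurm job, where the variables are unset:

```shell
#!/bin/bash
# Report the Slurm-provided job information. Inside a job these
# variables are set by the scheduler; the ":-" fallbacks are only
# for running the snippet outside Slurm.
echo "Submitted from : ${SLURM_SUBMIT_DIR:-$PWD}"
echo "Job ID         : ${SLURM_JOBID:-none}"
echo "Node list      : ${SLURM_NODELIST:-none}"
echo "Total tasks    : ${SLURM_NTASKS:-1}"
```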
SBATCH Flags
Jobs are submitted to the ISAAC-ORI cluster using the sbatch command, which passes the resource requests in the user's job script to the Slurm scheduler. Resources are requested in the job script using the "#SBATCH" directive. Note that the SLURM scheduler accepts SBATCH directives in two formats: a long form (e.g., --nodes) and a short form (e.g., -N). Users can choose either format at their discretion. The description of each SBATCH flag is given below:
Flags | Description |
#SBATCH -J Jobname | Name of the job |
#SBATCH --account=ProjectAccount (or -A) | Project account to which the time will be charged |
#SBATCH --time=days-hh:mm:ss (or -t) | Requested wall time for the job |
#SBATCH --nodes=1 (or -N) | Number of nodes needed |
#SBATCH --ntasks=48 (or -n) | Total number of cores requested |
#SBATCH --ntasks-per-node=48 | Requested number of cores per node |
#SBATCH --partition=campus (or -p) | Selects the partition or queue |
#SBATCH --output=Jobname.o%j (or -o) | The file to which the terminal output is written |
#SBATCH --error=Jobname.e%j (or -e) | The file to which run-time errors are written |
#SBATCH --exclusive | Allocates exclusive access to the node(s) |
#SBATCH --array=index (or -a) | Used to run multiple jobs with identical parameters |
#SBATCH --chdir=directory | Changes the working directory. The default working directory is the one from which the job was submitted |
#SBATCH --qos=campus | The Quality of Service level for the job |
#SBATCH --constraint=hardware,feature,list (or -C) | Comma-separated list of hardware features for the job. The possible features can be displayed via the isaac-sinfo command |
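To illustrate the two directive formats, the fragments below request the same resources in long and short form (Jobname and ProjectAccount are placeholders):

```shell
# Long-form directives:
#SBATCH --job-name=Jobname
#SBATCH --account=ProjectAccount
#SBATCH --nodes=1
#SBATCH --ntasks=48
#SBATCH --time=0-01:00:00
#SBATCH --partition=campus

# Equivalent short-form directives:
#SBATCH -J Jobname
#SBATCH -A ProjectAccount
#SBATCH -N 1
#SBATCH -n 48
#SBATCH -t 0-01:00:00
#SBATCH -p campus
```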
You can request specific CPU types via the --constraint=<cputype> flag, which can be useful if you are running an application compiled for a specific architecture. The available options for this flag are documented in the table below.
Flag | CPU Type |
--constraint=intel | Any Intel CPU |
--constraint=amd | Any AMD CPU |
--constraint=cascadelake | Intel Cascade Lake CPU |
--constraint=icelake | Intel Ice Lake CPU |
--constraint=sapphirerapids | Intel Sapphire Rapids CPU |
--constraint=rome | AMD Rome CPU |
--constraint=milan | AMD Milan CPU |
--constraint=genoa | AMD Genoa CPU |
--constraint=bergamo | AMD Bergamo CPU |
--constraint=avx512 | Any CPU supporting the AVX-512 Instruction set (all Intel and a subset of AMD CPUs). |
The available feature list will occasionally change depending on the node composition of the cluster. The most current version of this list can always be found using the isaac-sinfo command. More details about the specifications of the various CPU types can be found on the System Overview page.
Submitting Jobs with Slurm
On ISAAC-ORI, batch jobs can be submitted in two ways: (i) interactive batch mode, and (ii) non-interactive batch mode.
Interactive Batch mode
Interactive batch jobs give users interactive access to compute nodes. In interactive batch mode, users request that the Slurm scheduler allocate compute node resources directly from the terminal's command line interface (CLI). A common use for interactive batch jobs is to debug a calculation or a program before submitting non-interactive batch jobs for production runs. This section demonstrates how to run interactive jobs through the batch system and provides common usage tips.
Interactive batch mode can be invoked on an ISAAC-ORI login node by using the salloc command followed by specific SBATCH flags to request specific resources to which the user would like access. The different SBATCH flags are given in table 3.3.
$ salloc -A projectaccount --nodes=1 --ntasks=1 --partition=campus --time=01:00:00
or
$ salloc -A projectaccount -N 1 -n 1 -p campus -t 01:00:00
The salloc command passes the user's request to the SLURM scheduler and thereby allows the user to request resources. In the example command above, we requested that the Slurm scheduler allocate one node and one CPU core for a total time of 1 hour in the "campus" partition. Note that if the salloc command is executed without specifying the resources (i.e., number of nodes, number of CPUs/tasks, wall clock time, etc.), then the Slurm scheduler will allocate the default resources.
Important Note: The default resources on ISAAC-ORI are: one processor (1 CPU on 1 node) located under the campus partition, with a maximum wall clock time of 1 hour.
When the Slurm scheduler allocates the resources, the user who submitted the job gets a message on the terminal (as shown below) containing the information about the jobid and the hostname of the compute node where the resources are allocated. Notice that in the below example the jobid is “1234” and the hostname of the compute node is “nodename.”
$ salloc --nodes=1 --ntasks=1 --time=01:00:00 --partition=campus
salloc: Granted job allocation 1234
salloc: Waiting for resource configuration
salloc: Nodes nodename are ready for job
$
Once the interactive job starts, the user needs to ssh to the allocated node (or one of the allocated nodes, for jobs requesting more than one node) and should change the working directory to a Lustre project space or scratch space to run computationally intensive applications. It is best to run large jobs in Lustre scratch space rather than Lustre project space: there is no size limit on the amount of data a user may place in scratch space, whereas the amount of data a user may place in project space is limited. Please visit the File Systems page for more information. If users have questions about how best to run their interactive jobs, they are welcome to send their questions to HPSC staff by submitting an HPSC Service Request.
To run the parallel executable, we recommend using srun followed by the executable as shown below:
$ srun executable
or
$ srun -n <nprocs> executable
Important Note: Users do not necessarily need to specify the number of processors when calling srun. The Slurm wrapper srun will execute the calculation in parallel on the number of processors requested in the user's job script. Serial (non-parallel) applications can be run with or without srun.
Non-interactive batch mode
In non-interactive batch mode, the set of resources as well as the commands for the application to be run are written in a text file, which is referred to as a "batch file" or "batch script". A user's batch script for a particular job is submitted to the Slurm scheduler using the sbatch command. Batch scripts are very useful for users who want to run production jobs, because they allow users to work on the cluster non-interactively. In batch jobs, users submit a group of commands to Slurm and then simply check the job's status and output from time to time. However, it is sometimes very useful to run a job interactively, primarily for debugging; see the Interactive Batch mode section above. A typical example of a non-interactive batch script is given below:
#!/bin/bash
#This file is a submission script to request the ISAAC resources from Slurm
#SBATCH -J job #The name of the job
#SBATCH -A ACF-UTK0011 # The project account to be charged
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks-per-node=48 # cpus per node
#SBATCH --partition=campus # If not specified then default is "campus"
#SBATCH --time=0-01:00:00 # Wall time (days-hh:mm:ss)
#SBATCH --error=job.e%J # The file where run time errors will be dumped
#SBATCH --output=job.o%J # The file where the output of the terminal will be dumped
#SBATCH --qos=campus
# Now list your executable command/commands.
# Example for code compiled with a software module:
module load example/test
hostname
sleep 100
srun executable
The above job script can be divided into three sections:
- Shell interpreter (one line)
- The first line of the script specifies the script’s interpreter. The syntax of this line is #!/bin/shellname (sh, bash, csh, ksh, zsh)
- This line is important and essential. If it is omitted, the scheduler will print an error.
- SLURM submission options
- The second section contains a bunch of lines starting with ‘#SBATCH’.
- These lines are not comments; #SBATCH directives simply begin with '#' by format.
- #SBATCH is a Slurm directive which communicates information regarding the resources requested by the user in the batch script file.
- #SBATCH options after the first non-comment line are ignored by the Slurm scheduler.
- The description of each flag is given in table 3.3.
- The command sbatch on the terminal is used to submit the non-interactive batch script.
- Shell commands
- The shell commands follow the last #SBATCH line.
- These commands are the set of tasks a user wants to run, including loading any software modules needed to access a particular application.
- To run a parallel application, it is recommended to use srun followed by the full path and name of the executable, if the executable's path is not loaded into the Slurm environment when the script is submitted.
For a quick start, we have also provided a collection of complete sample job scripts on the ISAAC-ORI cluster at /lustre/haven/examples/jobs.
Checking for Available Resources
In order to best utilize resources and avoid jobs waiting in the queue, users should check what Slurm resources are available before deciding on job submission parameters and submitting jobs. The Slurm commands sinfo and isaac-sinfo are offered for users to query the system for the status of available resources.
The sinfo command outputs the status of available nodes organized by partition and node type, while isaac-sinfo is a modified version that includes only the partitions directly usable by ISAAC users, with cleaner output. If a user is searching for a GPU node to run a GPU-accelerated job, for example, they may execute the following:
$ isaac-sinfo | grep gpu | egrep 'idle|mixed'
The above outputs nodes in GPU partitions that are either idle and totally free or are mixed where only some resources are taken by other users’ jobs. This filters out nodes in the ‘allocated’ state since those nodes’ resources are entirely utilized by current running jobs. A sample output line may look like the following:
campus-gpu intel,cascadelake,avx512 up 30-00:00:0 1-infinite no NO all 1 mixed clrv0701
Utilizing this information, users can tailor their Slurm job requests to the available resources. For example, knowing that there are resources left available on node "clrv0701" (shown as 'mixed' in the previous isaac-sinfo output), a user may request the following:
$ salloc --account=ACF-UTK0011 --nodes=1 --ntasks=8 --partition=campus-gpu --qos=campus-gpu --time=1:00:00 --gpus=1 --nodelist=clrv0701
It is important that the number of nodes, cores, and GPUs requested matches what is currently available, based on the output of the aforementioned commands; otherwise the job will wait in the queue until the requested resources become available. To ensure a submitted job requests an appropriate amount of resources for the desired or available target node(s), users should compare the resources requested with the resources available. Checking available resources applies to all jobs, whether they are submitted interactively, non-interactively, or through Open OnDemand.
Job Arrays
Slurm offers the useful option of submitting jobs with array flags, for users whose batch jobs require identical resources. Using the array flag in a job script, users can submit multiple jobs with a single sbatch command. Although the job script is submitted only once using the sbatch command, the individual jobs in the array are scheduled independently; they share a common array job identifier ($SLURM_ARRAY_JOB_ID), and each individual job can be differentiated using Slurm's environment variable $SLURM_ARRAY_TASK_ID. To understand this variable, consider the example Slurm script given below:
#!/bin/bash
#SBATCH -J myjob
#SBATCH -A ACF-UTK0011
#SBATCH -N 1
#SBATCH --ntasks-per-node=30 ###-ntasks is used when we want to define total number of processors
#SBATCH --time=01:00:00
#SBATCH --partition=campus #####
##SBATCH -e myjob.e%j ## Errors will be written in this file
#SBATCH -o myjob%A_%a.out ## Separate output file will be created for each array. %A will be replaced by jobid and %a will be replaced by array index
#SBATCH --qos=campus
#SBATCH --array=1-30
# Submit array of jobs numbered 1 to 30
########### Perform some simple commands ########################
set -x
########### Below code is used to create 30 script files needed to submit the array of jobs ###############
for i in {1..30}; do cp sleep_test.sh 'sleep_test'$i'.sh';done
########### Run your executable ###############
sh sleep_test$SLURM_ARRAY_TASK_ID.sh
In the above example, 30 sleep_test$index.sh script files were created, with names differentiated by an index. These 30 files could be run either by submitting 30 individual jobs or by using a Slurm array. The Slurm array treats these files as the array sleep_test[1-30].sh and executes them. The variable SLURM_ARRAY_TASK_ID is set to the array index value (between 1 and 30, with 30 being the total number of jobs in the array in this example), and the index range is defined in the Slurm script above using the #SBATCH directive
#SBATCH --array=1-30
The number of jobs running simultaneously in a job array can be limited by using a %n suffix with the --array flag. For example, to run only 5 jobs at a time in a Slurm array, users can include the SLURM directive
#SBATCH --array=1-30%5
In order to create a separate output file for each of the submitted jobs in a Slurm array, use %A and %a, which represent the job ID and the job array index, as shown in the example above.
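As a local sketch of how %A and %a expand (not run under Slurm; the array job ID 1234 is a placeholder), the directive -o myjob%A_%a.out produces one output file per array task:

```shell
#!/bin/bash
# Simulate Slurm's expansion of "-o myjob%A_%a.out":
# %A -> the array job ID, %a -> the array task index.
jobid=1234                      # placeholder array job ID
for task in 1 2 3; do
    echo "myjob${jobid}_${task}.out"
done
```

Running the loop prints myjob1234_1.out, myjob1234_2.out, and myjob1234_3.out, matching the files Slurm would create for tasks 1 through 3.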
Exclusive Access to Nodes
As explained in the Scheduling Policy, jobs submitted by the same user can share compute nodes. However, if desired, users can request in their job scripts that whole node(s) be reserved to run their jobs without sharing them with other jobs. To do so, use the commands below:
Interactive batch mode:
$ salloc -A projectaccount --nodes=1 --ntasks=1 --partition=campus --time=01:00:00 --exclusive --qos=campus
Non-Interactive batch mode:
Add the below line in your job script
#SBATCH --exclusive
Monitoring Job Status
Users can regularly check the status of their jobs by using the squeue command.
$ squeue -u <netID>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1202 campus Job3 username PD 0:00 2 (Resources)
1201 campus Job1 username R 0:05 2 node[001-002]
1200 campus Job2 username R 0:10 2 node[004-005]
The description of each of the columns of the output from squeue command is given below:
Name of Column | Description |
JOBID | The unique identifier of each job |
PARTITION | The partition/queue from which the resources are to be allocated to the job |
NAME | The name of the job specified in the Slurm script using #SBATCH -J option. If the -J option is not used, Slurm will use the name of the batch script. |
USER | The login name of the user submitting the job |
ST | Status of the job. Slurm scheduler uses short notation to give the status of the job. The meaning of these short notations is given in the table below. |
TIME | The wall clock time the job has used so far |
NODES | The requested number of nodes on which the job is running along with the node names if resources are already allocated |
When a user submits a job, it passes through various states. The state of a job is reported by the squeue command under the ST column. The possible values of the ST column are given below:
Status Value | Meaning | Description |
CG | Completing | Job is about to complete. |
PD | Pending | Job is waiting for the resources to be allocated |
R | Running | Job is running on the allocated resources |
S | Suspended | Job was allocated resources but the execution got suspended due to some problem and CPUs are released for other jobs |
NF | Node Failure | Job terminated due to failure of one or more allocated nodes |
Altering Batch Jobs
Users are allowed to change the attributes of their jobs until the job starts running. In this section, we describe how to alter your batch jobs, with examples.
Remove a Job from Queue
Users can remove any of their own jobs, in any state, using the scancel command.
To remove a job with a JOB ID 1234, use the command:
scancel 1234
Modifying the Job Details
Users can make use of the Slurm command scontrol, which is used to alter a variety of Slurm parameters. Most scontrol commands can only be executed by an ISAAC System Administrator; however, users are granted some permissions to use scontrol on the jobs they have submitted, provided the jobs are not yet running.
Release/Hold a job
scontrol hold jobid
scontrol release jobid
Modify the name of the job
scontrol update JobID=jobid JobName=any_new_name
Modify the total number of tasks
scontrol update JobID=jobid NumTasks=Total_tasks
Modify the number of CPUs per node
scontrol update JobID=jobid MinCPUsNode=CPUs
Modify the wall time of the job
scontrol update JobID=jobid TimeLimit=days-hh:mm:ss
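Putting these together, a typical sequence for one of your own pending jobs might be to hold it, extend its wall time, and release it. This is an illustrative sketch (job ID 1234 is a placeholder) and must be run on the cluster:

```shell
$ scontrol hold 1234
$ scontrol update JobID=1234 TimeLimit=0-02:00:00
$ scontrol release 1234
```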