High Performance & Scientific Computing

Writing Job Scripts on ISAAC ORI

New Cluster: NOTE ABOUT JOB DIRECTORIES

By default, the SLURM scheduler uses the directory from which a job is submitted as the working directory of that job.

We recommend submitting jobs from a directory in the Lustre file system. Directories in the Lustre file system have a quota of 5TB, so users can store significantly more data there (if needed), while each user's home directory (/nfs/home/<user-id>) has a maximum storage limit of 50GB. The Lustre path for each user is /lustre/haven/user/<user-id>. If you are a member of a project associated with your research, the Lustre path for the project space is /lustre/haven/proj/{project name} (without the “ACF-” or “ISAAC-” in front of the project name).
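As a rough illustration of the recommendation above, the following sketch builds the per-user Lustre path and switches to it before a job is submitted. The variable names user_id and lustre_dir are illustrative only, and the sketch assumes that the login name is the same as the <user-id> in the path.

```shell
#!/bin/bash
# Sketch: move to the Lustre user directory before running sbatch.
# Assumption: the login name is the same as <user-id> in the Lustre path.
user_id="${USER:-$(id -un)}"
lustre_dir="/lustre/haven/user/${user_id}"

if [ -d "$lustre_dir" ]; then
    cd "$lustre_dir" || echo "Could not enter $lustre_dir" >&2
    echo "Submitting from: $PWD"
else
    # Outside the cluster (or before the directory exists) this branch runs.
    echo "Lustre directory $lustre_dir not found; staying in $PWD" >&2
fi
```

Any sbatch command issued after the cd would then use the Lustre directory as the job's working directory.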

SLURM BATCH SCRIPTS

This section explains how to request resources for one or more jobs from the SLURM scheduler, and how to submit those jobs for processing on the ISAAC compute nodes.

Partition

The SLURM scheduler organizes similar sets of nodes and job features into a group called a “partition”. Each partition has hard limits on maximum wall-clock time, job size, number of nodes, and so on. The default partition for all users is named “campus”. At present, the “campus” partition has 18 nodes with a total of 864 cores available to users.

Scheduling Policy

The Secure Enclave has been divided into logical units known as “condos”. There are institutional and individual private condos in the ISAAC Open Enclave, which is housed in the Kingston Pike Building (KPB), and each condo has associated project accounts. CLICK HERE for more information on condos and project accounts. Institutional condos are available to any faculty, staff, or student at the institution, whereas individual private condos are available to projects that have invested in the ISAAC Open Enclave. CLICK HERE for more information on investing in ISAAC.

SLURM’s scheduling policy requires that the nodes in each condo belong to a partition. Each condo therefore has a unique partition associated with its project account, and both the partition and the project account must be specified when requesting resources on ISAAC.

NOTE: It is the combination of partition and project account which the SLURM scheduler uses to allocate the resources for each job a user requests.

For example, suppose an institutional condo has the campus partition associated with its project, and the project is named UTK0011. SLURM constrains every job submitted under that project and partition to the limits of the resources available to the project, such as the maximum walltime and the maximum number of nodes that can be requested. To submit a job under this partition, add the following directives to the job script:

 #SBATCH -A UTK0011
 #SBATCH --partition=campus 

or we could use:

 #SBATCH -A UTK0011
 #SBATCH -p campus

Detailed information about all the partitions and their respective nodes on ISAAC ORI can be viewed using the command:

 $ sinfo -Nel

For more information on the flags used by this command, refer to the Slurm documentation.

NOTE: In addition to the two directives above, a job script must also specify the number of nodes, the number of cores, and the walltime, along with any optional SBATCH directives. Detailed information on how to submit a job using SBATCH directives is provided in the section Submitting Jobs with Slurm.

Once a job is submitted, the scheduler checks for available resources and allocates them to the job in order to launch its tasks. Currently, the SLURM scheduler is configured so that nodes are not shared among jobs submitted by different users. However, SLURM can allocate the same node to multiple jobs from the same user, provided the total resources requested by all of those jobs do not exceed the resources available on the node. NOTE: Users can choose to run a job exclusively on an entire node by using the exclusive flag when requesting resources. See Exclusive Access to Nodes for how to use the “exclusive” flag.

The order in which jobs are run on ISAAC ORI’s resources depends on the following factors:

Number of nodes requested – jobs that request more nodes get a higher priority.
Queue wait time – a job’s priority increases along with its queue wait time (not counting blocked jobs, as blocked jobs are not considered “queued”).
Number of jobs – a maximum of ten jobs per user, at a time, will be eligible to run.

Currently, single core jobs by the same user will get scheduled on the same node.

In certain special cases, the priority of a job may be manually increased upon request. To request a priority change, contact the OIT HelpDesk and provide the job ID and the reason for the requested priority increase.

Slurm Commands/Variables

Slurm Commands

The table below lists the Slurm commands most often used on the login nodes, along with a description of each.

Command	Description
sbatch jobscript.sh	Submits the job script to request the resources
squeue	Displays the status of all jobs
squeue -u username	Displays the status and other information for all of a user’s jobs
squeue -j jobid	Displays the status and information of a particular job
scancel jobid	Cancels the job with the given jobid
scontrol show jobid/partition value	Shows information about a job or a partition
scontrol update	Alters the resources of a pending job
salloc	Allocates resources for an interactive job run

TABLE 6.1: BASIC SLURM COMMANDS AND VARIABLES TO SUBMIT, MONITOR AND DELETE THE JOBS ON SECURE ENCLAVE

Slurm Variables

The table below lists a few important Slurm environment variables that are useful to ISAAC users.

Variable	Description
SLURM_SUBMIT_DIR	The directory from which the job was submitted
SLURM_JOBID	The job identifier of the submitted job
SLURM_NODELIST	List of nodes allocated to the job
SLURM_NTASKS	Total number of tasks (CPU cores) requested for the job

TABLE 6.2: DIFFERENT SLURM VARIABLES TO GET THE INFORMATION OF THE RUNNING JOB
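To illustrate how these variables might be used inside a job script, here is a minimal sketch that prints them. The function name report_job_env and the fallback values after ":-" are assumptions made for illustration; the fallbacks only take effect when the code runs outside of a Slurm job.

```shell
#!/bin/bash
# Sketch: print the Slurm variables listed in the table above.
# The fallback values after ":-" are illustrative; they only apply
# when the script is run outside of a Slurm job.
report_job_env() {
    echo "submit_dir=${SLURM_SUBMIT_DIR:-$PWD}"
    echo "job_id=${SLURM_JOBID:-none}"
    echo "nodelist=${SLURM_NODELIST:-none}"
    echo "ntasks=${SLURM_NTASKS:-1}"
}

report_job_env
```

Placed near the top of a job script, a report like this records in the job's output file where the job ran and how many tasks it was given.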

SBATCH Flags

Jobs on the Secure Enclave are submitted with the sbatch command, which passes the resource requests in the job script to the Slurm scheduler. Resources are requested in the job script with the “#SBATCH” directive. Note that most flags have two forms, a long form (for example --nodes=1) and a short form (for example -N 1), and users can choose either one. Each of the SBATCH flags is described below:

Flags	Description
#SBATCH -J Jobname	Name of the job
#SBATCH --account=ProjectAccount (or -A ProjectAccount)	Project account to which the time will be charged
#SBATCH --time=days-hh:mm:ss (or -t days-hh:mm:ss)	Requested wall time for the job
#SBATCH --nodes=1 (or -N 1)	Number of nodes needed
#SBATCH --ntasks=48 (or -n 48)	Total number of cores requested
#SBATCH --ntasks-per-node=48	Number of cores requested per node
#SBATCH --constraint=nosecurespace	Submit job to non-secure queue without authentication
#SBATCH --partition=campus (or -p campus)	Selects the partition or queue
#SBATCH --output=Jobname.o%j (or -o Jobname.o%j)	The file to which the terminal output is written
#SBATCH --error=Jobname.e%j (or -e Jobname.e%j)	The file to which run time errors are written
#SBATCH --exclusive	Requests exclusive access to the allocated node(s)
#SBATCH --array=index (or -a index)	Used to run multiple jobs with identical parameters
#SBATCH --chdir=directory	Changes the working directory. The default working directory is the one from which the job is submitted

TABLE 6.3: DIFFERENT SLURM FLAGS USED IN CREATING THE JOB SCRIPT ALONG WITH THEIR DESCRIPTION

Submitting Jobs with Slurm

On the ISAAC Secure Enclave, batch jobs can be submitted in two ways: (i) interactive batch mode and (ii) non-interactive batch mode.

Interactive Batch mode:

Interactive batch jobs give users interactive access to compute nodes. In this mode, the user asks the Slurm scheduler to allocate compute-node resources directly from the terminal. A common use of interactive batch jobs is to debug a calculation or program before submitting non-interactive batch jobs for production runs. This section demonstrates how to run interactive jobs through the batch system and provides common usage tips.

Interactive batch mode is invoked on the login node with the salloc command, followed by the sbatch flags that request the different resources. The sbatch flags are listed in Table 6.3.

$ salloc -A projectaccount --nodes=1 --ntasks=1 --partition=campus --time=01:00:00
or 
$ salloc -A projectaccount -N 1 -n 1 -p campus -t 01:00:00

The salloc command passes the user’s request for resources to the Slurm scheduler. The commands above ask the scheduler to allocate one node and one CPU for a total time of 1 hour in the campus partition. Note that if salloc is executed without specifying resources such as nodes, tasks, and wall-clock time, the scheduler allocates the default resources: one processor in the campus partition with a wall-clock time of 1 hour.

When the scheduler allocates the resources, the user gets a message on the terminal as shown below with the information about the jobid and the hostname of the compute node where the resources are allocated.

 $ salloc --nodes=1 --ntasks=1 --time=01:00:00
  salloc: Granted job allocation 1234
  salloc: Waiting for resource configuration
  salloc: Nodes nodename are ready for job
 $

Once the interactive job starts, change the working directory to Lustre project or scratch space before running computationally intensive applications. To run a parallel executable, we recommend using srun followed by the executable, as shown below:

 $ srun executable

Note that you do not need to specify the number of processors when calling srun. The Slurm wrapper srun runs your executable in parallel on the requested number of processors. Serial applications can be run with or without srun.

Non-interactive batch mode:

In this mode, the requested resources and the commands for the application are written in a text file called a batch file or batch script. The batch script is submitted to the Slurm scheduler with the sbatch command. Batch scripts are most useful for production jobs because they let users work on the cluster non-interactively: the user submits a group of commands to SLURM and checks their status and output from time to time. Sometimes, however, it is useful to run a job interactively (primarily for debugging); see Interactive Batch mode above. A typical example of a job script is given below:

 #!/bin/bash
 # This file is a submission script to request ISAAC resources from Slurm
 #SBATCH -J job                     # The name of the job
 #SBATCH -A ACF-UTK0011             # The project account to be charged
 #SBATCH --nodes=1                  # Number of nodes
 #SBATCH --ntasks-per-node=48       # CPUs per node
 #SBATCH --partition=campus         # If not specified then default is "campus"
 #SBATCH --time=0-01:00:00          # Wall time (days-hh:mm:ss)
 #SBATCH --error=job.e%j            # The file where run time errors will be dumped
 #SBATCH --output=job.o%j           # The file where the output of the terminal will be dumped

 # Now list your executable command/commands.
 # Example for code compiled with a software module:
 module load example/test

 hostname
 sleep 100
 srun executable

The above job script can be divided into three sections:

Shell interpreter (one line)
- The first line of the script specifies the script’s interpreter. The syntax of this line is #!/bin/shellname (sh, bash, csh, ksh, zsh).
- This line is essential. If it is missing, the scheduler prints an error.
SLURM submission options
- The second section contains a set of lines starting with ‘#SBATCH’.
- Although they begin with ‘#’, these lines are not treated as comments by Slurm.
- #SBATCH is a Slurm directive that communicates the resources requested by the user in the batch script file.
- #SBATCH options that appear after the first non-comment line are ignored by the Slurm scheduler.
- Each of the flags is described in Table 6.3.
- The sbatch command is used on the terminal to submit the non-interactive batch script.
Shell commands
- The shell commands follow the last #SBATCH line.
- These are the commands or tasks that the user wants to run, including loading any software modules needed to access a particular application.
- To run a parallel application, we recommend using srun followed by the executable. Give the full path to the executable if its location is not available in the SLURM environment when the script is submitted.
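The structure described above can be checked before submission. The following is a minimal sketch, not a Slurm tool: the helper name check_sbatch_script is hypothetical, and it only tests the two rules stated in the list, namely that the first line names an interpreter and that no #SBATCH line appears after the first command.

```shell
#!/bin/bash
# Sketch: check the layout of a job script before submitting it.
# check_sbatch_script is a hypothetical helper, not part of Slurm.
check_sbatch_script() {
    local file="$1" line trimmed lineno=0 seen_cmd=0 rc=0
    while IFS= read -r line || [ -n "$line" ]; do
        lineno=$((lineno + 1))
        # Rule 1: the first line must name the shell interpreter.
        if [ "$lineno" -eq 1 ] && [[ "$line" != '#!'* ]]; then
            echo "line 1: missing #! interpreter line"
            rc=1
        fi
        # Strip leading whitespace to recognise blank and comment lines.
        trimmed="${line#"${line%%[![:space:]]*}"}"
        if [[ "$line" == '#SBATCH'* ]]; then
            # Rule 2: directives after the first command are ignored.
            if [ "$seen_cmd" -eq 1 ]; then
                echo "line $lineno: #SBATCH directive after the first command is ignored"
                rc=1
            fi
        elif [[ -n "$trimmed" && "$trimmed" != '#'* ]]; then
            seen_cmd=1   # first line that is neither blank nor a comment
        fi
    done < "$file"
    return "$rc"
}
```

A possible use is check_sbatch_script jobscript.sh && sbatch jobscript.sh, so that the script is submitted only when both checks pass.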

For a quick start, a collection of complete sample job scripts is also available on the ISAAC Next Gen cluster at /lustre/examples/jobs.

Job Arrays

Slurm’s array flag is useful for users whose batch jobs require identical resources. With this flag in the job script, users can submit multiple jobs with a single sbatch command. Although the job script is submitted only once, the individual jobs in the array are scheduled independently. All of them share the same job array identifier ($SLURM_ARRAY_JOB_ID), and each individual job is distinguished by SLURM’s environment variable $SLURM_ARRAY_TASK_ID. To understand this variable, consider the example Slurm script below:

#!/bin/bash
#SBATCH -J myjob
#SBATCH -A ACF-UTK0011
#SBATCH -N 1
#SBATCH --ntasks-per-node=30   # --ntasks is used when we want to define the total number of processors
#SBATCH --time=01:00:00
#SBATCH --partition=campus
##SBATCH -e myjob.e%j          # Disabled by the extra '#'; if enabled, errors are written to this file
#SBATCH -o myjob%A_%a.out      # Separate output file for each array job: %A is the jobid, %a is the array index
#SBATCH --array=1-30           # Submit an array of jobs numbered 1 to 30

# Print each command as it is executed
set -x

# Create the 30 script files needed to submit the array of jobs
for i in {1..30}; do cp sleep_test.sh "sleep_test${i}.sh"; done

# Run the script that belongs to this array index
sh "sleep_test${SLURM_ARRAY_TASK_ID}.sh"

In the above example, we create 30 executable files named sleep_test$index.sh, whose names differ only by an index. We could run them either by submitting 30 individual jobs or, more simply and efficiently, by using a Slurm array, which treats these files as the array sleep_test[1-30].sh. In each job of the array, the variable SLURM_ARRAY_TASK_ID is set to that job’s index value in the range 1-30, which is defined in the Slurm script above using the #SBATCH directive:

#SBATCH --array=1-30

The number of jobs in an array that run simultaneously can be limited by adding %n to the --array flag, where n is the maximum number of jobs running at once. For example, to run only 5 jobs at a time in the Slurm array, users can include the SLURM directive:

#SBATCH --array=1-30%5

To create a separate output file for each job submitted through a Slurm array, use %A and %a, which represent the jobid and the job array index, as shown in the example above.
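As an alternative to copying a script once per index, the array index can be used to pick one entry from a list. The following is a sketch under assumptions: pick_input and inputs.txt are hypothetical names, and the list file is assumed to hold one input name per line, with line n belonging to array index n.

```shell
#!/bin/bash
# Sketch: map an array index to one line of a list file.
# pick_input and inputs.txt are hypothetical names used for illustration.
pick_input() {
    local list_file="$1"
    # Fall back to index 1 when run outside of a job array.
    local index="${2:-${SLURM_ARRAY_TASK_ID:-1}}"
    sed -n "${index}p" "$list_file"
}

# Inside a job script submitted with "#SBATCH --array=1-30", each job in
# the array would then process a different line of the list, for example:
#   input=$(pick_input inputs.txt)
#   srun executable "$input"
```

With this approach the job script itself stays identical for every index, and only the selected input changes.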

Exclusive Access to Nodes

As explained under Scheduling Policy, jobs submitted by the same user can share nodes. However, users can request whole node(s) for their jobs so that the nodes are not shared with other jobs. To do that, use the exclusive flag as shown below:

Interactive batch mode:

 $ salloc -A projectaccount --nodes=1 --ntasks=1 --partition=campus --time=01:00:00 --exclusive

Non-Interactive batch mode:

Add the line below to your job script:

 #SBATCH --exclusive

Monitoring the Jobs Status

Users can regularly check the status of their jobs by using the squeue command.

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              1202    campus     Job3 username PD       0:00      2 (Resources)
              1201    campus     Job1 username  R       0:05      2 node[001-002]
              1200    campus     Job2 username  R       0:10      2 node[004-005]

The description of each of the columns of the output from squeue command is given below:

Name of Column	Description
JOBID	The unique identifier of each job
PARTITION	The partition/queue from which the resources are to be allocated to the job
NAME	The name of the job specified in the Slurm script using #SBATCH -J option. If the -J option is not used, Slurm will use the name of the batch script.
USER	The login name of the user submitting the job
ST	Status of the job. Slurm scheduler uses short notation to give the status of the job. The meaning of these short notations is given in the table below.
TIME	The time used by the job so far
NODES	The number of nodes requested by the job
NODELIST(REASON)	The names of the allocated nodes if the job is running, or the reason the job is waiting if resources are not yet allocated

DESCRIPTION OF THE SQUEUE OUTPUT

After a job is submitted, it passes through various states. The squeue command reports the current state of a job in the ST column. The possible values of the ST column are given below:

Status Value	Meaning	Description
CG	Completing	Job is in the process of completing
PD	Pending	Job is waiting for the resources to be allocated
R	Running	Job is running on the allocated resources
S	Suspended	Job was allocated resources, but its execution has been suspended and its CPUs have been released for other jobs

DIFFERENT STATES OF A SLURM JOB
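The state codes above can be translated in a script. The following is a small sketch; describe_state is a hypothetical helper rather than a Slurm command, and it covers only the four codes listed in the table.

```shell
#!/bin/bash
# Sketch: translate the short codes in the ST column into their meanings.
# describe_state is a hypothetical helper, not a Slurm command.
describe_state() {
    case "$1" in
        CG) echo "Completing" ;;
        PD) echo "Pending" ;;
        R)  echo "Running" ;;
        S)  echo "Suspended" ;;
        *)  echo "Unknown ($1)" ;;
    esac
}

describe_state PD
describe_state R
```

On a system with Slurm, the codes could be taken from squeue itself, for example with squeue -h -u username -o %t, where -h suppresses the header line and %t prints the compact state code.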

Altering Batch Jobs

Users can change the attributes of their jobs until the jobs start running. This section describes, with examples, how to alter batch jobs.

Remove a Job from Queue

Users can remove their own jobs, in any state, with the scancel command.

To remove a job with a JOB ID 1234, use the command:

scancel 1234

Modifying the Job Details

Users can alter a variety of Slurm parameters with the scontrol command. Most scontrol commands can only be executed by the System Administrator, but users are permitted to use scontrol on the jobs they have submitted, provided those jobs are not yet running.

Release/Hold a job

scontrol hold jobid
scontrol release jobid

Modify the name of the job

scontrol update JobID=jobid JobName=any_new_name

Modify the total number of tasks

scontrol update JobID=jobid NumTasks=Total_tasks

Modify the number of CPUs per node

scontrol update JobID=jobid MinCPUsNode=CPUs

Modify the Wall time of the job

scontrol update JobID=jobid TimeLimit=day-hh:mm:ss
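The day-hh:mm:ss format of TimeLimit can be produced from a number of minutes. The following sketch shows the arithmetic; to_timelimit is a hypothetical helper name, and the job ID 1234 in the comment is only an example.

```shell
#!/bin/bash
# Sketch: convert a number of minutes into the day-hh:mm:ss format.
# to_timelimit is a hypothetical helper, not a Slurm command.
to_timelimit() {
    local minutes="$1"
    local days=$(( minutes / 1440 ))           # 1440 minutes per day
    local hours=$(( (minutes % 1440) / 60 ))   # whole hours left in the day
    local mins=$(( minutes % 60 ))             # minutes left in the hour
    printf '%d-%02d:%02d:00\n' "$days" "$hours" "$mins"
}

# Example use with a pending job (the job ID is illustrative):
#   scontrol update JobID=1234 TimeLimit="$(to_timelimit 90)"
to_timelimit 90
```

For 90 minutes the helper prints 0-01:30:00, which can be passed directly as the TimeLimit value.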

The University of Tennessee, Knoxville

Office of Innovative Technologies
