Writing Job Scripts on ISAAC ORI
New Cluster: NOTE ABOUT JOB DIRECTORIES
By default, the SLURM scheduler uses the directory from which a job is submitted as the job's working directory.
We recommend submitting jobs from a directory in the Lustre file system. Lustre directories have a quota of 5TB, so users can store significantly more data there if needed, whereas each user's home directory (/nfs/home/<user-id>) is limited to 50GB. The Lustre path for each user is /lustre/haven/user/<user-id>. If you are a member of a research project, the Lustre path for the project space is /lustre/haven/proj/{project name} (without the “ACF-” or “ISAAC-” in front of the project name).
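As a minimal sketch of the recommendation above, the following shell fragment derives the two Lustre paths and moves to the user's Lustre directory before any job is submitted. The account name ACF-UTK0011 is only an illustration, and the prefix stripping simply mirrors the naming rule described above; it is not a site-provided tool.

```shell
# Per-user Lustre directory, following the path pattern given above.
user_id="${USER:-$(id -un)}"
lustre_dir="/lustre/haven/user/${user_id}"

# Illustrative project account; drop a leading "ACF-" or "ISAAC-"
# to obtain the name used for the project directory.
project_account="ACF-UTK0011"
project_name="${project_account#ACF-}"
project_name="${project_name#ISAAC-}"
project_dir="/lustre/haven/proj/${project_name}"

# Move to Lustre when it is mounted; otherwise stay put and say so.
if [ -d "$lustre_dir" ]; then
    cd "$lustre_dir" || echo "Could not enter $lustre_dir" >&2
else
    echo "Lustre directory $lustre_dir not found; staying in $PWD" >&2
fi

echo "User space:    $lustre_dir"
echo "Project space: $project_dir"
```

Running sbatch after this fragment makes the Lustre directory the job's working directory, because the scheduler uses the submission directory by default.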
SLURM BATCH SCRIPTS
This section explains how to request resources for one or more jobs from the SLURM scheduler and how to submit those jobs for processing on the ISAAC compute nodes.
Partition
The SLURM scheduler organizes similar sets of nodes and job features into one group, which is called a “partition”. Each partition has hard limits on the maximum wallclock time, the job size, the number of nodes, and so on. The default partition for all users is named “campus”. At present, there are 18 nodes in the “campus” partition, with a total of 864 cores available to users.
Scheduling Policy
The Secure Enclave has been divided into logical units known as “condos”. There are institutional and individual private condos in the ISAAC Open Enclave, which is housed in the Kingston Pike Building (KPB), and each condo has associated project accounts. CLICK HERE for more information on condos and project accounts. Institutional condos are available to any faculty, staff, or student at the institution, whereas individual private condos are available to projects that have invested in the ISAAC Open Enclave. CLICK HERE for more information on investing in ISAAC.
SLURM’s scheduling policy requires the nodes in each condo to belong to a partition. Each condo therefore has a unique partition associated with its project account, and both the partition and the project account must be specified when requesting resources on ISAAC.
NOTE: It is the combination of partition and project account which the SLURM scheduler uses to allocate the resources for each job a user requests.
For example, suppose an institutional condo has the campus partition associated with its project, and the project is named UTK0011. The SLURM scheduler then applies the limits that belong to that project, such as the maximum walltime and the maximum number of nodes that can be requested. To submit a job under this partition, we can use the following directives:
#SBATCH -A UTK0011
#SBATCH --partition=campus
or we could use:
#SBATCH -A UTK0011
#SBATCH -p campus
Detailed information about all the partitions and their respective nodes on ISAAC ORI can be viewed using the command:
$ sinfo -Nel
For more information on the flags used by this command, refer to the Slurm documentation.
NOTE: In addition to the two directives above, a job script must also specify the number of nodes, the number of cores, and the amount of walltime, along with any optional SBATCH directives. Detailed information on how to submit a job using SBATCH directives is provided in the section Submitting Jobs with Slurm.
Once a job is submitted, the scheduler checks for available resources and allocates them to the job in order to launch its tasks. Currently, the SLURM scheduler is configured so that nodes are not shared among jobs submitted by different users. However, SLURM can place multiple jobs from the same user on one node, provided the total resources requested by those jobs do not exceed the resources available on the node.
NOTE: Users can choose to run a job exclusively on an entire node by using the exclusive flag, with srun distributing the tasks among the CPUs on that node. See Exclusive Access to Nodes for how to use the “exclusive” flag.
The order in which jobs are run on ISAAC ORI’s resources depends on the following factors:
- Number of nodes requested – jobs that request more nodes get a higher priority.
- Queue wait time – a job’s priority increases along with its queue wait time (not counting blocked jobs, as blocked jobs are not considered “queued”).
- Number of jobs – a maximum of ten jobs per user, at a time, will be eligible to run.
Currently, single core jobs by the same user will get scheduled on the same node.
In certain special cases, the priority of a job may be manually increased upon request. To request a priority change, contact the OIT HelpDesk and provide the job ID and the reason for the requested priority increase.
Slurm Commands
The table below lists the Slurm commands most often used on the login nodes, along with their descriptions.
Command | Description |
sbatch jobscript.sh | Submits the job script to request resources |
squeue | Displays the status of all jobs |
squeue -u username | Displays the status and other information for all of a user’s jobs |
squeue -j jobid | Displays the status and information of a particular job |
scancel jobid | Cancels the job with the given job ID |
scontrol show job jobid | Shows detailed information about a job |
scontrol show partition name | Shows detailed information about a partition |
scontrol update | Alters the resources of a pending job |
salloc | Allocates resources for an interactive job |
TABLE 6.1: BASIC SLURM COMMANDS AND VARIABLES TO SUBMIT, MONITOR AND DELETE THE JOBS ON SECURE ENCLAVE
Slurm Variables
Below we have tabulated a few important Slurm variables which will be useful to the ISAAC users.
Variable | Description |
SLURM_SUBMIT_DIR | The directory from where the job is submitted |
SLURM_JOBID | The job identifier of the submitted job |
SLURM_NODELIST | List of nodes allocated to a job |
SLURM_NTASKS | The total number of tasks requested for the job |
TABLE 6.2: DIFFERENT SLURM VARIABLES TO GET THE INFORMATION OF THE RUNNING JOB
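The variables in Table 6.2 can be read like any other environment variable. The sketch below prints them in one line; the fallback values are assumptions added only so that the fragment also runs outside a job (for example on a login node), where Slurm does not set these variables.

```shell
# Inside a job these variables are set by Slurm. The fallbacks after ":-"
# are illustrative placeholders used when the fragment runs outside a job.
submit_dir="${SLURM_SUBMIT_DIR:-$PWD}"
job_id="${SLURM_JOBID:-not-in-a-job}"
node_list="${SLURM_NODELIST:-unknown}"
ntasks="${SLURM_NTASKS:-1}"

summary="job=${job_id} nodes=${node_list} tasks=${ntasks} dir=${submit_dir}"
echo "$summary"
```

Placing such a line near the top of a job script records where and how the job ran in the job's output file.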
SBATCH Flags
Jobs on the Secure Enclave are submitted with the sbatch command, which passes the resource requests in the job script to the Slurm scheduler. The resources in the job script are requested using “#SBATCH” directives. Note that Slurm accepts most options in two forms, a long form (for example --partition=campus) and a short form (for example -p campus); users may use either. The description of each of the SBATCH flags is given below:
Flags | Description |
#SBATCH -J jobname | Name of the job |
#SBATCH --account=project (or -A project) | Project account to which the time will be charged |
#SBATCH --time=days-hh:mm:ss (or -t days-hh:mm:ss) | Requested wall time for the job |
#SBATCH --nodes=1 (or -N 1) | Number of nodes needed |
#SBATCH --ntasks=48 (or -n 48) | Total number of tasks (cores) requested |
#SBATCH --ntasks-per-node=48 | Number of tasks (cores) requested per node |
#SBATCH --constraint=nosecurespace | Submit job to non-secure queue without authentication |
#SBATCH --partition=campus (or -p campus) | Selects the partition or queue |
#SBATCH --output=Jobname.o%j (or -o Jobname.o%j) | The file to which terminal output is written |
#SBATCH --error=Jobname.e%j (or -e Jobname.e%j) | The file to which run-time errors are written |
#SBATCH --exclusive | Requests exclusive access to the node(s) |
#SBATCH --array=index (or -a index) | Used to run multiple jobs with identical parameters |
#SBATCH --chdir=directory | Sets the working directory. The default working directory is the one from which the job is submitted |
TABLE 6.3: DIFFERENT SLURM FLAGS USED IN CREATING THE JOB SCRIPT ALONG WITH THEIR DESCRIPTION
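The days-hh:mm:ss form used by the time flag is easy to get wrong for long jobs. As a small illustration, the hypothetical helper to_slurm_time below (not part of Slurm) converts a number of seconds into that form.

```shell
# Convert a number of seconds into the days-hh:mm:ss form used by --time.
to_slurm_time() {
    total=$1
    days=$(( total / 86400 ))
    hours=$(( (total % 86400) / 3600 ))
    minutes=$(( (total % 3600) / 60 ))
    seconds=$(( total % 60 ))
    printf '%d-%02d:%02d:%02d\n' "$days" "$hours" "$minutes" "$seconds"
}

to_slurm_time 3600     # one hour: prints 0-01:00:00
to_slurm_time 90000    # one day and one hour: prints 1-01:00:00
```

The printed value can be used directly, for example as #SBATCH --time=1-01:00:00.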
Submitting Jobs with Slurm
On ISAAC Secure Enclave, batch jobs can be submitted in two ways: (i) interactive batch mode and (ii) non-interactive batch mode.
Interactive Batch mode:
Interactive batch jobs give users interactive access to compute nodes. In this mode, a user asks the Slurm scheduler to allocate compute node resources directly from the terminal. A common use for interactive batch jobs is to debug a calculation or program before submitting non-interactive batch jobs for production runs. This section demonstrates how to run interactive jobs through the batch system and provides common usage tips.
The interactive batch mode can be invoked on the login node by using the salloc command followed by the flags that request the different resources. These flags are given in Table 6.3.
$ salloc -A projectaccount --nodes=1 --ntasks=1 --partition=campus --time=01:00:00
or
$ salloc -A projectaccount -N 1 -n 1 -p campus -t 01:00:00
The salloc command passes the user’s resource request to the Slurm scheduler. In the above command, we asked the scheduler to allocate one node and one CPU for 1 hour in the campus partition. Note that if salloc is executed without specifying resources such as nodes, tasks, and wall clock time, the scheduler allocates the default resources: one processor in the campus partition with a wall clock time of 1 hour.
When the scheduler allocates the resources, the user gets a message on the terminal, as shown below, with the job ID and the hostname of the compute node where the resources are allocated.
$ salloc --nodes=1 --ntasks=1 --time=01:00:00
salloc: Granted job allocation 1234
salloc: Waiting for resource configuration
salloc: Nodes nodename are ready for job
$
Once the interactive job starts, the user should change the working directory to Lustre project or scratch space before running computationally intensive applications. To run a parallel executable, we recommend using srun followed by the executable, as shown below:
$ srun executable
Note that you do not need to specify the number of processors when calling srun. The Slurm wrapper srun runs your executable in parallel on the requested number of processors. Serial applications can be run with or without srun.
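Before calling srun it can help to confirm that the shell really is inside an allocation. The sketch below relies on SLURM_JOB_ID, which Slurm sets in the shell started by salloc; the variable name in_allocation is only an illustration.

```shell
# SLURM_JOB_ID is set by Slurm inside an allocation and is empty
# in an ordinary login shell.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    in_allocation="yes"
    echo "Inside job ${SLURM_JOB_ID}; running hostname on the allocated resources"
    srun hostname
else
    in_allocation="no"
    echo "No allocation found; request one with salloc first" >&2
fi
```

On a login node without an allocation the fragment only prints the reminder and does not call srun.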
Non-interactive batch mode:
In this mode, the requested resources and the commands for the application to be run are written in a text file called a batch file or batch script. The batch script is submitted to the Slurm scheduler with the sbatch command. Batch scripts are well suited to production jobs because they let users work on the cluster non-interactively: users submit a group of commands to SLURM and check the status and the output of the commands from time to time. However, it is sometimes useful to run a job interactively (primarily for debugging); see the Interactive Batch mode section above. A typical example of a job script is given below:
#!/bin/bash
#This file is a submission script to request the ISAAC resources from Slurm
#SBATCH -J job #The name of the job
#SBATCH -A ACF-UTK0011 # The project account to be charged
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks-per-node=48 # cpus per node
#SBATCH --partition=campus # If not specified then default is "campus"
#SBATCH --time=0-01:00:00 # Wall time (days-hh:mm:ss)
#SBATCH --error=job.e%J # The file where run time errors will be dumped
#SBATCH --output=job.o%J # The file where the output of the terminal will be dumped
# Now list your executable command/commands.
# Example for code compiled with a software module:
module load example/test
hostname
sleep 100
srun executable
The above job script can be divided into three sections:
- Shell interpreter (one line)
  - The first line of the script specifies the script’s interpreter. The syntax of this line is #!/bin/shellname (sh, bash, csh, ksh, zsh).
  - This line is essential. If it is missing, sbatch rejects the script with an error.
- SLURM submission options
  - The second section consists of lines starting with ‘#SBATCH’.
  - These lines are not ordinary comments.
  - #SBATCH is a Slurm directive which communicates information regarding the resources requested by the user in the batch script file.
  - #SBATCH options that appear after the first non-comment line are ignored by the Slurm scheduler.
  - The description of each of the flags is given in Table 6.3.
  - The sbatch command is used on the terminal to submit the non-interactive batch script.
- Shell commands
  - The shell commands follow the last #SBATCH line.
  - They are the set of commands or tasks which a user wants to run, including any software modules needed to access a particular application.
  - To run a parallel application, it is recommended to use srun followed by the executable. Give the full path to the executable if its location is not already in the job's environment when the script is submitted.
For a quick start, a collection of complete sample job scripts is available on the ISAAC Next Gen cluster at /lustre/examples/jobs.
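The three sections described above can be sketched as a fragment that writes a very small job script and then shows how it would be submitted. The file name demo_job.sh, the job name, and the SUBMIT switch are hypothetical; the account and partition are the ones used in the examples above and should be replaced with your own.

```shell
# Write a small job script to a temporary directory (hypothetical file name).
work_dir="$(mktemp -d)"
job_script="${work_dir}/demo_job.sh"

cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH -J demo
#SBATCH -A ACF-UTK0011
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --partition=campus
#SBATCH --time=0-00:05:00
#SBATCH --output=demo.o%j
#SBATCH --error=demo.e%j

# Shell commands start here; directives placed below this point are ignored.
hostname
EOF

# Count the directives that precede the first command.
directive_count="$(grep -c '^#SBATCH' "$job_script")"
echo "Wrote $job_script with $directive_count #SBATCH directives"

# Submit only when explicitly requested and sbatch is available.
if [ "${SUBMIT:-0}" = "1" ] && command -v sbatch >/dev/null 2>&1; then
    sbatch "$job_script"
else
    echo "Dry run: sbatch $job_script"
fi
```

By default the fragment performs a dry run; setting SUBMIT=1 on a login node hands the script to sbatch.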
Job Arrays
Slurm job arrays are useful for users whose batch jobs require identical resources. With the array flag in the job script, users can submit multiple jobs with a single sbatch command. Although the job script is submitted only once, the individual jobs in the array are scheduled independently. All of them share the same array job ID ($SLURM_ARRAY_JOB_ID), and each one is distinguished by its own value of SLURM’s environment variable $SLURM_ARRAY_TASK_ID. To understand this variable, consider the example Slurm script below:
#!/bin/bash
#SBATCH -J myjob
#SBATCH -A ACF-UTK0011
#SBATCH -N 1
#SBATCH --ntasks-per-node=30 ## Tasks per node; use --ntasks instead to set the total number of tasks
#SBATCH --time=01:00:00
#SBATCH --partition=campus
##SBATCH -e myjob.e%j ## Disabled by the extra "#"; remove it to write errors to this file
#SBATCH -o myjob%A_%a.out ## One output file per array task: %A is the array job ID, %a is the array index
#SBATCH --array=1-30
# Submit an array of jobs numbered 1 to 30
########### Perform some simple commands ########################
set -x
########### Create the script file needed by this array task ###############
# Each task copies only its own file, so concurrent tasks do not overwrite each other's scripts
cp sleep_test.sh "sleep_test${SLURM_ARRAY_TASK_ID}.sh"
########### Run your executable ###############
sh "sleep_test${SLURM_ARRAY_TASK_ID}.sh"
In the above example, the array works with 30 files named sleep_test$index.sh, which differ only by an index. We could run them either by submitting 30 individual jobs or, more simply and efficiently, with a Slurm array that treats these files as the array sleep_test[1-30].sh. In each job of the array, the variable SLURM_ARRAY_TASK_ID is set to one of the index values 1 to 30, as defined in the Slurm script above by the #SBATCH directive
#SBATCH --array=1-30
The number of array jobs that run simultaneously can be limited by appending %n to the value of the --array flag. For example, to run only 5 jobs at a time in the Slurm array, users can include the SLURM directive
#SBATCH --array=1-30%5
In order to create a separate output file for each job in a Slurm array, use %A and %a in the output file name; they represent the array job ID and the array index, as shown in the above example.
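To make the naming and the %n limit concrete, the sketch below uses two hypothetical helper functions. array_output_name reproduces the file name that the -o myjob%A_%a.out directive yields, and array_batches estimates how many successive rounds a limited array needs, assuming every task takes about the same time. The default IDs 1234 and 7 are placeholders for use outside a job.

```shell
# Output file name produced by "-o myjob%A_%a.out" for a given
# array job ID ($1) and array index ($2).
array_output_name() {
    printf 'myjob%s_%s.out\n' "$1" "$2"
}

# Rounds needed when at most $2 of $1 tasks run at once (rounded up).
array_batches() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Outside a job the IDs are unset, so illustrative defaults are used.
job_id="${SLURM_ARRAY_JOB_ID:-1234}"
task_id="${SLURM_ARRAY_TASK_ID:-7}"

echo "Task ${task_id} runs sleep_test${task_id}.sh and writes $(array_output_name "$job_id" "$task_id")"
echo "--array=1-30%5 needs about $(array_batches 30 5) rounds of 5 tasks"
```

In practice Slurm starts the next array task as soon as a running one finishes, so the round count is only a rough guide to the total run time.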
Exclusive Access to Nodes
As explained under Scheduling Policy, jobs submitted by the same user can share nodes. However, users can request whole node(s) for a job so that the nodes are not shared with other jobs. To do so, use the exclusive flag as shown below:
Interactive batch mode:
$ salloc -A projectaccount --nodes=1 --ntasks=1 --partition=campus --time=01:00:00 --exclusive
Non-Interactive batch mode:
Add the below line in your job script
#SBATCH --exclusive
Monitoring the Jobs Status
Users can check the status of their jobs at any time with the squeue command. Typical output looks like this:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1202 campus Job3 username PD 0:00 2 (Resources)
1201 campus Job1 username R 0:05 2 node[001-002]
1200 campus Job2 username R 0:10 2 node[004-005]
The description of each of the columns of the output from squeue command is given below:
Name of Column | Description |
JOBID | The unique identifier of each job |
PARTITION | The partition/queue from which the resources are to be allocated to the job |
NAME | The name of the job specified in the Slurm script using #SBATCH -J option. If the -J option is not used, Slurm will use the name of the batch script. |
USER | The login name of the user submitting the job |
ST | Status of the job. Slurm scheduler uses short notation to give the status of the job. The meaning of these short notations is given in the table below. |
TIME | The time the job has used so far |
NODES | The number of nodes allocated to the job, or requested by it if it is still pending |
NODELIST(REASON) | The names of the nodes on which the job is running or, for a job that is not running, the reason it is waiting |
DESCRIPTION OF THE SQUEUE OUTPUT
When a user submits a job, it passes through various states. The squeue command reports the current state of a job in the ST column. The possible values are given below:
Status Value | Meaning | Description |
CG | Completing | Job is in the process of completing; some of its processes may still be active |
PD | Pending | Job is waiting for the resources to be allocated |
R | Running | Job is running on the allocated resources |
S | Suspended | Job has an allocation, but its execution has been suspended and its CPUs have been released for other jobs |
DIFFERENT STATES OF A SLURM JOB
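As an illustration of reading the ST column, the sketch below counts jobs per state in the sample squeue listing shown earlier. The helper names describe_state and count_state are hypothetical. On the cluster, the sample text would be replaced by the real output of squeue -u username.

```shell
# Sample squeue output copied from the listing above.
sample_output='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1202 campus Job3 username PD 0:00 2 (Resources)
1201 campus Job1 username R 0:05 2 node[001-002]
1200 campus Job2 username R 0:10 2 node[004-005]'

# Translate a short state code into its meaning.
describe_state() {
    case "$1" in
        CG) echo "Completing" ;;
        PD) echo "Pending" ;;
        R)  echo "Running" ;;
        S)  echo "Suspended" ;;
        *)  echo "Unknown" ;;
    esac
}

# Count the jobs whose ST column (field 5) matches the given code,
# skipping the header line.
count_state() {
    printf '%s\n' "$sample_output" |
        awk -v st="$1" 'NR > 1 && $5 == st { n++ } END { print n + 0 }'
}

for state in PD R S; do
    echo "$(describe_state "$state"): $(count_state "$state")"
done
```

For the sample listing this reports one pending job, two running jobs, and no suspended jobs.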
Altering Batch Jobs
Users can change the attributes of their jobs until the jobs start running. This section describes how to alter batch jobs, with examples.
Remove a Job from Queue
Users can remove any of their own jobs, in any state, with the scancel command.
To remove a job with a JOB ID 1234, use the command:
scancel 1234
Modifying the Job Details
The Slurm command scontrol can alter a variety of Slurm parameters. Most scontrol commands can only be executed by the System Administrator, but users are permitted to use scontrol on their own jobs, provided the jobs are not yet running.
Release/Hold a job
scontrol hold jobid
scontrol release jobid
Modify the name of the job
scontrol update JobID=jobid JobName=any_new_name
Modify the total number of tasks
scontrol update JobID=jobid NumTasks=Total_tasks
Modify the number of CPUs per node
scontrol update JobID=jobid MinCPUsNode=CPUs
Modify the wall time of the job (regular users can normally only decrease it)
scontrol update JobID=jobid TimeLimit=days-hh:mm:ss
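Because these changes apply only to jobs that have not started, a script can check the job state first. The sketch below uses hypothetical helper functions and a hard-coded state so that it runs anywhere. On the cluster, the state could be read with squeue -h -j jobid -o %t, which prints the short state code for one job.

```shell
# Succeed only for the pending state code.
is_pending() {
    [ "$1" = "PD" ]
}

# Build (but do not run) the scontrol command that renames a job.
rename_command() {
    printf 'scontrol update JobID=%s JobName=%s\n' "$1" "$2"
}

job_id="1234"    # illustrative job ID
state="PD"       # on the cluster: state="$(squeue -h -j "$job_id" -o %t)"

if is_pending "$state"; then
    echo "Would run: $(rename_command "$job_id" new_name)"
else
    echo "Job $job_id is in state $state; leaving its attributes unchanged" >&2
fi
```

Printing the command instead of running it keeps the sketch harmless; replacing the echo with the command itself applies the change.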