High Performance & Scientific Computing

Monitoring Jobs on ISAAC ORI


Jobs submitted to the cluster can encounter a variety of issues that slow or even halt execution. In some cases, users can diagnose and repair these problems themselves. In this document, you will learn how to monitor your submitted jobs and troubleshoot common job issues.

Monitoring Utilities

Use the squeue command to list your jobs and their current state. Replace <netID> with your own NetID:

$ squeue -u <netID>
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              1202    campus     Job3 username PD       0:00      2 (Resources)
              1201    campus     Job1 username  R       0:05      2 node[001-002]
              1200    campus     Job2 username  R       0:10      2 node[004-005]

Each column of the squeue output is described below:

JOBID      The unique identifier of each job
PARTITION  The partition/queue from which resources are allocated to the job
NAME       The name of the job, set in the Slurm script with the #SBATCH -J option. If the -J option is not used, Slurm uses the name of the batch script.
USER       The login name of the user who submitted the job
ST         The status of the job, in Slurm's short notation. The meaning of each short notation is given in the table below.
TIME       The wall time the job has used so far
NODES      The number of nodes allocated to the job, or requested if the job is still pending. The last column of the output lists the allocated node names, or the reason a pending job is waiting.
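
As a rough sketch of how this output can be summarized, the snippet below counts jobs per state. The function name count_jobs_by_state is made up for this example, and it assumes the default column order shown above (ST in the fifth column) and job names without spaces. The sample lines are copied from the squeue output above, so the snippet runs without access to the scheduler.

```shell
#!/bin/bash
# Hypothetical helper: count jobs per state from squeue-style text on stdin.
# Assumption: default column order, so the state code (ST) is field 5.
count_jobs_by_state() {
    # Only lines whose first field starts with a digit are job lines;
    # this skips a header row if one is present.
    awk '$1 ~ /^[0-9]/ { count[$5]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Sample data copied from the squeue output shown earlier.
count_jobs_by_state <<'EOF'
              1202    campus     Job3 username PD       0:00      2 (Resources)
              1201    campus     Job1 username  R       0:05      2 node[001-002]
              1200    campus     Job2 username  R       0:10      2 node[004-005]
EOF
# Prints:
# PD 1
# R 2
```

On the cluster, the same function could be fed live data with squeue -u <netID> | count_jobs_by_state.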

When a job is submitted, it passes through various states. The squeue command reports the current state of each job in the ST column. Common values of the ST column are given below:

Value  Meaning       Description
CG     Completing    Job is in the process of completing
PD     Pending       Job is waiting for resources to be allocated
R      Running       Job is running on the allocated resources
S      Suspended     Job was allocated resources, but execution has been suspended and its CPUs have been released for other jobs
NF     Node Failure  Job terminated due to failure of one or more allocated nodes
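
For use in scripts, the table above can be encoded as a small lookup. This is only an illustrative sketch: describe_state is a hypothetical function name, and it covers just the five codes listed here, not every state Slurm can report.

```shell
#!/bin/bash
# Hypothetical helper: expand a short ST code into the meaning from the table.
describe_state() {
    case "$1" in
        CG) echo "Completing" ;;
        PD) echo "Pending" ;;
        R)  echo "Running" ;;
        S)  echo "Suspended" ;;
        NF) echo "Node Failure" ;;
        # Slurm defines more codes than the table lists; flag those.
        *)  echo "Not listed in the table ($1)" ;;
    esac
}

describe_state PD   # prints: Pending
```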

The isaac-sinfo and showpartitions commands show the available partitions, their features, and the current availability of computing resources.

$ isaac-sinfo
campus                    up   infinite      4   idle skk[101001-101004]
campus-gpu                up 30-00:00:0      1   idle skvk1131

$ showpartitions
Partition statistics for cluster ori at Thu Jun 13 08:22:24 EDT 2024
                  Partition     #Nodes     #CPU_cores  Cores_pending   Job_Nodes MaxJobTime Cores Mem/Node
                  Name State Total  Idle  Total   Idle Resorc  Other   Min   Max  Day-hr:mn /node     (GB)
                campus    up     4     4    160    160      0      0     1 infin   infinite    40     190
            campus-gpu    up     1     0     40      0      0      0     1 infin   30-00:00    40     190
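
To pick out partitions that currently have free nodes, the showpartitions output can be filtered. The sketch below is built only from the sample output above: it assumes the partition name, state, total nodes, and idle nodes are the first four fields, which may not hold for other versions of the tool. The function name partitions_with_idle_nodes is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print "<partition> <idle nodes>" for every partition
# that is up and has at least one idle node.
# Assumption: fields are Name, State, Total nodes, Idle nodes, ... as in the
# sample output; header lines are skipped because their 2nd field is not "up".
partitions_with_idle_nodes() {
    awk '$2 == "up" && $4 > 0 { print $1, $4 }'
}

# Sample data copied from the showpartitions output shown earlier.
partitions_with_idle_nodes <<'EOF'
Partition statistics for cluster ori at Thu Jun 13 08:22:24 EDT 2024
                  Partition     #Nodes     #CPU_cores  Cores_pending   Job_Nodes MaxJobTime Cores Mem/Node
                  Name State Total  Idle  Total   Idle Resorc  Other   Min   Max  Day-hr:mn /node     (GB)
                campus    up     4     4    160    160      0      0     1 infin   infinite    40     190
            campus-gpu    up     1     0     40      0      0      0     1 infin   30-00:00    40     190
EOF
# Prints:
# campus 4
```

On the cluster, this could be run as showpartitions | partitions_with_idle_nodes.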

The command sacct -B -j {jobid} prints the batch script that was submitted for a job, which lets you check the job submission parameters.

$ sacct -B -j 1201
Batch Script for 1201
#SBATCH -J Condo-utia   #Job name
#SBATCH -A acf-utk0XYZ  #Write your project account associated to utia condo
#SBATCH -p condo-utia
#SBATCH --ntasks-per-node=2  #--ntasks is used when we want to define total number of processors
#SBATCH --time=03:00:00
#SBATCH -o Condo-utia.o%j     #Dump output and error in one file
#SBATCH --qos=condo

###########   Perform some simple commands   ########################
set -x
###########   Run your parallel executable with srun   ###############
srun ./hello_mpi
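
When only the submission parameters are of interest, the #SBATCH lines can be pulled out of the script text. The following is a minimal sketch, assuming directives start at the beginning of a line and that any trailing text after a "#" is a comment (an option value that itself contains "#" would be cut short). The function name list_sbatch_options is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list the options given on #SBATCH lines in a batch
# script read from stdin, without the "#SBATCH" prefix or trailing comments.
list_sbatch_options() {
    sed -n 's/^#SBATCH[[:space:]]*//p' | sed 's/[[:space:]]*#.*$//'
}

# Sample data: a few lines copied from the batch script shown above.
list_sbatch_options <<'EOF'
Batch Script for 1201
#SBATCH -J Condo-utia   #Job name
#SBATCH -p condo-utia
#SBATCH --ntasks-per-node=2  #--ntasks is used when we want to define total number of processors
#SBATCH --time=03:00:00
set -x
srun ./hello_mpi
EOF
# Prints:
# -J Condo-utia
# -p condo-utia
# --ntasks-per-node=2
# --time=03:00:00
```

On the cluster, this could be run as sacct -B -j {jobid} | list_sbatch_options.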

The command scontrol show job {jobid} shows the details of a job, including its state, time limits, and requested resources.

$ scontrol show job 1201
JobId=1201 JobName=Condo-utia
   UserId=username(10577) GroupId=tug2106(3319) MCS_label=N/A
   Priority=4600 Nice=0 Account=acf-utk0XYZ QOS=condo
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=03:00:00 TimeMin=N/A
   SubmitTime=2024-01-22T08:48:41 EligibleTime=2024-01-22T08:48:41
   StartTime=2024-02-10T13:18:33 EndTime=2024-02-10T16:18:33 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-01-22T16:40:18 Scheduler=Main
   Partition=condo-utia AllocNode:Sid=localhost:1184486
   ReqNodeList=(null) ExcNodeList=(null)
   NumNodes=1-1 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=2:0:*:* CoreSpec=*
   MinCPUsNode=2 MinMemoryCPU=3800M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
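
Because this output is a series of Key=Value pairs, a single field such as JobState or Reason can be extracted with standard text tools. The sketch below assumes values contain no spaces, which holds for the fields shown above but not necessarily for every field scontrol can print. The function name job_field is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the value of one Key=Value pair from
# "scontrol show job" style text on stdin. $1 is the key to look up.
# Assumption: values do not contain spaces.
job_field() {
    # Put each Key=Value pair on its own line, then print the first match.
    tr -s ' ' '\n' |
        awk -F= -v key="$1" '!found && $1 == key { sub(/^[^=]*=/, ""); print; found = 1 }'
}

# Sample data: a few lines copied from the scontrol output shown above.
sample='JobId=1201 JobName=Condo-utia
   JobState=PENDING Reason=Priority Dependency=(null)
   RunTime=00:00:00 TimeLimit=03:00:00 TimeMin=N/A'

printf '%s\n' "$sample" | job_field JobState   # prints: PENDING
printf '%s\n' "$sample" | job_field Reason     # prints: Priority
```

On the cluster, this could be run as scontrol show job {jobid} | job_field Reason to see why a pending job is waiting.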