Monitoring Jobs on ISAAC ORI
Introduction
Jobs submitted to the cluster can encounter a variety of issues that slow or even halt execution. In some cases, users can diagnose and repair these problems themselves. In this document, you will learn how to monitor your submitted jobs and troubleshoot common job issues.
Monitoring Utilities
The squeue command can be used to facilitate job monitoring.
$ squeue -u <netID>
 JOBID PARTITION  NAME     USER ST  TIME NODES NODELIST(REASON)
  1202    campus  Job3 username PD  0:00     2 (Resources)
  1201    campus  Job1 username  R  0:05     2 node[001-002]
  1200    campus  Job2 username  R  0:10     2 node[004-005]
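If you prefer to refresh this view automatically instead of re-running the command by hand, the standard Linux watch utility can be wrapped around squeue. This is a minimal sketch; it assumes watch is available on the login node and that <netID> is replaced with your own username:

$ watch -n 30 squeue -u <netID>   # re-runs squeue every 30 seconds; press Ctrl+C to exit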
The description of each of the columns of the output from the squeue command is given below:
Column | Description |
JOBID | The unique identifier of each job |
PARTITION | The partition/queue from which the resources are to be allocated to the job |
NAME | The name of the job specified in the Slurm script using the #SBATCH -J option. If the -J option is not used, Slurm uses the name of the batch script. |
USER | The login name of the user submitting the job |
ST | Status of the job. The Slurm scheduler uses a short notation for the job status; the meaning of each value is given in the table below. |
TIME | The wall time used by the job so far (0:00 for jobs that have not started) |
NODES | The number of nodes allocated to the job, or requested if the job is still pending |
NODELIST(REASON) | The list of nodes allocated to the job or, for a pending job, the reason it is still waiting (for example, Resources or Priority) |
When a job is submitted, it passes through various states. The state of a job is shown by the squeue command under the ST column. The possible values of the ST column are given below:
Value | Meaning | Description |
CG | Completing | Job is in the process of completing |
PD | Pending | Job is waiting for the resources to be allocated |
R | Running | Job is running on the allocated resources |
S | Suspended | Job was allocated resources, but execution was suspended and the CPUs were released for other jobs |
NF | Node Failure | Job terminated due to failure of one or more allocated nodes |
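squeue can also filter on these state codes and, for pending jobs, report the scheduler's current estimate of the start time. The examples below use the standard Slurm options -t/--states and --start; substitute your own <netID> and <jobid>:

$ squeue -u <netID> -t PENDING   # list only your pending jobs
$ squeue -j <jobid> --start      # show the estimated start time of a pending job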
The commands isaac-sinfo and showpartitions can be used to check partitions, features, and the current availability of the requested computing resources.
$ isaac-sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
campus         up   infinite      4   idle skk[101001-101004]
campus-gpu     up 30-00:00:0      1   idle skvk1131
$ showpartitions
Partition statistics for cluster ori at Thu Jun 13 08:22:24 EDT 2024
      Partition     #Nodes     #CPU_cores  Cores_pending   Job_Nodes MaxJobTime Cores Mem/Node
      Name State Total  Idle  Total   Idle  Resorc  Other   Min   Max  Day-hr:mn /node     (GB)
     campus    up     4     4    160    160      0      0     1 infin   infinite    40      190
 campus-gpu    up     1     0     40      0      0      0     1 infin   30-00:00    40      190
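For more detail about the nodes behind a particular partition, such as CPU cores, memory, and GPUs, the stock Slurm commands sinfo and scontrol can also be queried. The lines below are a sketch using standard Slurm format options against the campus-gpu partition shown above:

$ sinfo -p campus-gpu -N -o "%N %c %m %G"   # node name, CPUs, memory (MB), generic resources (GPUs)
$ scontrol show partition campus-gpu        # full partition definition, including limits and QOS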
The command sacct -B -j {jobid} can be used to check the job submission parameters.
$ sacct -B -j 1201
Batch Script for 1201
--------------------------------------------------------------------------------
#!/bin/bash
#SBATCH -J Condo-utia          #Job name
#SBATCH -A acf-utk0XYZ         #Write your project account associated to utia condo
#SBATCH -p condo-utia
#SBATCH -N 1
#SBATCH --ntasks-per-node=2    #--ntasks is used when we want to define total number of processors
#SBATCH --time=03:00:00
#SBATCH -o Condo-utia.o%j      #Dump output and error in one file
#SBATCH --qos=condo

########### Perform some simple commands ########################
set -x

########### Run your parallel executable with srun ###############
srun ./hello_mpi
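In addition to printing the batch script, sacct can report accounting information for running and completed jobs, which is useful for checking the exit code, elapsed time, and memory usage after a job finishes. The fields below are standard Slurm accounting fields; substitute your own <jobid>:

$ sacct -j <jobid> --format=JobID,JobName,Partition,State,ExitCode,Elapsed,MaxRSS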
The command scontrol show job {jobid} can be used to check details of the requested resources for the job.
$ scontrol show job 1201
JobId=1201 JobName=Condo-utia
   UserId=username(10577) GroupId=tug2106(3319) MCS_label=N/A
   Priority=4600 Nice=0 Account=acf-utk0XYZ QOS=condo
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=03:00:00 TimeMin=N/A
   SubmitTime=2024-01-22T08:48:41 EligibleTime=2024-01-22T08:48:41
   AccrueTime=2024-01-22T08:48:41
   StartTime=2024-02-10T13:18:33 EndTime=2024-02-10T16:18:33 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-01-22T16:40:18 Scheduler=Main
   Partition=condo-utia AllocNode:Sid=localhost:1184486
   ReqNodeList=(null) ExcNodeList=(null) NodeList=
   NumNodes=1-1 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=2,mem=7600M,node=1,billing=2
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=2:0:*:* CoreSpec=*
   MinCPUsNode=2 MinMemoryCPU=3800M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/lustre/isaac/scratch/username/jobs/condo_utia.sh
   WorkDir=/lustre/isaac/scratch/username/jobs
   StdErr=/lustre/isaac/scratch/username/jobs/Condo-utia.o1201
   StdIn=/dev/null
   StdOut=/lustre/isaac/scratch/username/jobs/Condo-utia.o1201
   Power=
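scontrol can also modify a job that is still pending. For example, a held job can be released, or the time limit can be reduced so the scheduler may backfill the job sooner. The commands below are standard Slurm commands, shown here as a sketch against job 1201; note that regular users can typically only decrease a job's time limit, not increase it:

$ scontrol hold 1201                             # prevent the pending job from starting
$ scontrol release 1201                          # allow the held job to be scheduled again
$ scontrol update JobId=1201 TimeLimit=01:30:00  # lower the requested wall time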