Monitoring Jobs on ISAAC-NG
Introduction
Jobs submitted to the cluster can encounter a variety of issues that slow or even halt execution. In some cases, users can diagnose and repair these problems themselves. In this document, you will learn how to monitor your submitted jobs and troubleshoot common job issues.
Monitoring Utilities
The squeue command can be used to monitor your submitted jobs.
$ squeue -u <netID>
 JOBID  PARTITION  NAME  USER      ST  TIME  NODES  NODELIST(REASON)
  1202  campus     Job3  username  PD  0:00  2      (Resources)
  1201  campus     Job1  username  R   0:05  2      node[001-002]
  1200  campus     Job2  username  R   0:10  2      node[004-005]
Each column of the squeue output is described below:
Column | Description |
JOBID | The unique identifier of each job |
PARTITION | The partition/queue from which resources are allocated to the job |
NAME | The name of the job, specified in the Slurm script with the #SBATCH -J option. If -J is not used, Slurm uses the name of the batch script. |
USER | The login name of the user who submitted the job |
ST | The state of the job, given as a short code. The meaning of these codes is listed in the table below. |
TIME | The wall time the job has used so far (0:00 for jobs that are still pending) |
NODES | The number of nodes requested by the job |
NODELIST(REASON) | The nodes allocated to the job, or the reason the job is still pending (for example, Resources or Priority) |
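The default squeue output does not include a job's requested time limit. If you want to see both the elapsed time and the limit, the output can be customized with squeue's standard --Format option; the field names below are standard Slurm output fields and can be adjusted as needed:

$ squeue -u <netID> --Format=JobID,Partition,Name,UserName,State,TimeUsed,TimeLimit,NumNodes,ReasonList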
When a job is submitted, it passes through various states. The current state of a job is reported by the squeue command in the ST column. The possible values of the ST column are given below:
Value | Meaning | Description |
CG | Completing | Job is in the process of completing; some processes may still be active on the allocated nodes |
PD | Pending | Job is waiting for resources to be allocated |
R | Running | Job is running on the allocated resources |
S | Suspended | Job has an allocation, but execution has been suspended and its CPUs have been released for other jobs |
NF | Node Failure | Job terminated due to the failure of one or more allocated nodes |
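Jobs in a particular state can also be listed directly. For example, to show only your pending jobs, or to ask Slurm for its current estimate of when a pending job will start, the standard -t/--states and --start options of squeue can be used (job ID 1202 is taken from the squeue example above):

$ squeue -u <netID> -t PENDING
$ squeue -j 1202 --start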
The isaac-sinfo and showpartitions commands can be used to check partitions, node features, and the current availability of the requested computing resources.
$ isaac-sinfo
PARTITION          AVAIL_FEATURES            AVAIL  TIMELIMIT   JOB_SIZE    ROOT  OVERSUBS  GROUPS  NODES  STATE      RESERVATION  NODELIST
campus             amd,bergamo,avx512        up     infinite    1-infinite  no    NO        all     3      mixed                   ber[1413,1415,1430]
campus             intel,cascadelake,avx512  up     infinite    1-infinite  no    NO        all     4      mixed                   clr[0726,0817,0824,0832]
campus             amd,bergamo,avx512        up     infinite    1-infinite  no    NO        all     1      allocated               ber1432
campus             intel,icelake,avx512      up     infinite    1-infinite  no    NO        all     4      allocated               il[1228-1229,1234-1235]
campus-bigmem      intel,cascadelake,avx512  up     30-00:00:0  1-infinite  no    NO        all     1      mixed                   clrm1219
campus-bigmem      intel,icelake,avx512      up     30-00:00:0  1-infinite  no    NO        all     1      mixed                   ilm0833
campus-bigmem      intel,cascadelake,avx512  up     30-00:00:0  1-infinite  no    NO        all     1      allocated               clrm1218
campus-bigmem      intel,icelake,avx512      up     30-00:00:0  1-infinite  no    NO        all     1      allocated               ilm0835
long-bigmem        intel,cascadelake,avx512  up     30-00:00:0  1-infinite  no    NO        all     1      mixed                   clrm1219
long-bigmem        intel,cascadelake,avx512  up     30-00:00:0  1-infinite  no    NO        all     1      allocated               clrm1218
long-bigmem        intel,icelake,avx512      up     30-00:00:0  1-infinite  no    NO        all     1      allocated               ilm0835
campus-gpu         intel,cascadelake,avx512  up     30-00:00:0  1-infinite  no    NO        all     1      down$      clrv1149_wi  clrv1149
campus-gpu         intel,cascadelake,avx512  up     30-00:00:0  1-infinite  no    NO        all     1      mixed                   clrv0701
campus-gpu         intel,cascadelake,avx512  up     30-00:00:0  1-infinite  no    NO        all     1      allocated               clrv0703
campus-gpu-bigmem  intel,icelake,avx512      up     30-00:00:0  1-infinite  no    NO        all     1      mixed                   ilpa1209
campus-gpu-large   intel,cascadelake,avx512  up     30-00:00:0  1-infinite  no    NO        all     2      mixed      rsoni_paper  clrv[1203,1205]
campus-gpu-large   intel,cascadelake,avx512  up     30-00:00:0  1-infinite  no    NO        all     4      allocated               clrv[1101,1103,1105,1107]
campus-gpu-large   intel,cascadelake,avx512  up     30-00:00:0  1-infinite  no    NO        all     1      idle                    clrv1201
long-gpu           intel,cascadelake,avx512  up     30-00:00:0  1-infinite  no    NO        all     4      allocated               clrv[1101,1103,1105,1107]
long               intel,cascadelake,avx512  up     30-00:00:0  1-infinite  no    NO        all     6      allocated               clr[0732-0737]
short              intel,cascadelake,avx512  up     30-00:00:0  1-infinite  no    NO        all     1      mixed                   clr0716
$ showpartitions
Partition statistics for cluster isaac at Mon Jan 22 11:15:50 EST 2024
 Partition                 #Nodes       #CPU_cores    Cores_pending  Job_Nodes  MaxJobTime  Cores  Mem/Node
 Name              State  Total  Idle   Total   Idle  Resorc  Other   Min  Max   Day-hr:mn  /node      (GB)
 campus             up       45     0    2896    204       0   4960     1  infin  infinite     48      190+
 all                up      212    72   12784   5943       0      0     1  infin  30-00:00     48      128+
 campus-bigmem      up        4     0     208     40       0      0     1  infin  30-00:00     48     1546+
 long-bigmem        up        3     0     152     32       0     96     1  infin  30-00:00     48     1546+
 campus-gpu         up        3     0     144      8       0     20     1  infin  30-00:00     48       190
 campus-gpu-bigmem  up        1     0      64      8       0      0     1  infin  30-00:00     64      1020
 campus-gpu-large   up        7     3     336    180       0    360     1  infin  30-00:00     48       770
 long-gpu           up        4     2     192     96       0     96     1  infin  30-00:00     48       770
 long               up        6     0     288     16       0      1     1  infin  30-00:00     48       190
 short              up       25    22    1480   1375       0      0     1  infin  30-00:00     48       190+
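The features listed in the AVAIL_FEATURES column can be requested explicitly with Slurm's standard --constraint (-C) option. As a sketch, a job that must run on Cascade Lake nodes in the campus partition could include the following directives; the feature name is taken from the isaac-sinfo output above:

#SBATCH -p campus
#SBATCH --constraint=cascadelake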
The command sacct -B -j {jobid} can be used to check the job submission parameters (it prints the batch script that was submitted).
$ sacct -B -j 1201
Batch Script for 1201
--------------------------------------------------------------------------------
#!/bin/bash
#SBATCH -J Condo-utia               # Job name
#SBATCH -A acf-utk0XYZ              # Replace with the project account associated with your UTIA condo
#SBATCH -p condo-utia
#SBATCH -N 1
#SBATCH --ntasks-per-node=2         # --ntasks can be used instead to define the total number of tasks
#SBATCH --time=03:00:00
#SBATCH -o Condo-utia.o%j           # Write output and error to the same file
#SBATCH --qos=condo

########### Perform some simple commands ########################
set -x

########### Run your parallel executable with srun ###############
srun ./hello_mpi
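sacct is also useful after a job has finished. Its standard --format option selects accounting fields for the job and its steps; the fields shown below are standard sacct field names and can be adjusted as needed:

$ sacct -j 1201 --format=JobID,JobName,Partition,State,ExitCode,Elapsed,MaxRSS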
The command scontrol show job {jobid} can be used to check details of the requested resources for the job.
$ scontrol show job 1201
JobId=1201 JobName=Condo-utia
   UserId=username(10577) GroupId=tug2106(3319) MCS_label=N/A
   Priority=4600 Nice=0 Account=acf-utk0XYZ QOS=condo
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=03:00:00 TimeMin=N/A
   SubmitTime=2024-01-22T08:48:41 EligibleTime=2024-01-22T08:48:41
   AccrueTime=2024-01-22T08:48:41
   StartTime=2024-02-10T13:18:33 EndTime=2024-02-10T16:18:33 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-01-22T16:40:18 Scheduler=Main
   Partition=condo-utia AllocNode:Sid=localhost:1184486
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=
   NumNodes=1-1 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=2,mem=7600M,node=1,billing=2
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=2:0:*:* CoreSpec=*
   MinCPUsNode=2 MinMemoryCPU=3800M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/lustre/isaac/scratch/username/jobs/condo_utia.sh
   WorkDir=/lustre/isaac/scratch/username/jobs
   StdErr=/lustre/isaac/scratch/username/jobs/Condo-utia.o1201
   StdIn=/dev/null
   StdOut=/lustre/isaac/scratch/username/jobs/Condo-utia.o1201
   Power=
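Besides displaying job details, scontrol can also act on a job that has not yet started. For example, the standard hold, release, and update subcommands can be used as sketched below; the new time limit is only an illustration and must stay within the partition's maximum:

$ scontrol hold 1201
$ scontrol release 1201
$ scontrol update JobId=1201 TimeLimit=01:00:00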