Monitoring Jobs on ISAAC-NG
Introduction
Jobs submitted to the cluster can encounter a variety of issues that slow or even halt execution. In some cases, users can diagnose and repair these problems themselves. In this document, you will learn how to monitor your submitted jobs and troubleshoot common job issues.
Monitoring Utilities
The squeue command can be used to monitor your submitted jobs.
$ squeue -u <netID>
 JOBID  PARTITION  NAME  USER      ST  TIME  NODES  NODELIST(REASON)
  1202  campus     Job3  username  PD  0:00  2      (Resources)
  1201  campus     Job1  username  R   0:05  2      node[001-002]
  1200  campus     Job2  username  R   0:10  2      node[004-005]
Each column of the squeue output is described below:
Column | Description |
JOBID | The unique identifier of each job |
PARTITION | The partition/queue from which resources are allocated to the job |
NAME | The name of the job, specified in the Slurm script with the #SBATCH -J option. If -J is not used, Slurm uses the name of the batch script. |
USER | The login name of the user who submitted the job |
ST | The state of the job, given as a short code. The meaning of these codes is listed in the table below. |
TIME | The wall time the job has used so far (0:00 for jobs that are still pending) |
NODES | The number of nodes requested by the job |
NODELIST(REASON) | The nodes allocated to the job, or the reason the job is still pending (for example, Resources or Priority) |
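The default squeue output does not include a job's requested time limit. If you want to see both the elapsed time and the limit, the output can be customized with squeue's standard --Format option; the field names below are standard Slurm output fields and can be adjusted as needed:

$ squeue -u <netID> --Format=JobID,Partition,Name,UserName,State,TimeUsed,TimeLimit,NumNodes,ReasonList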
When a job is submitted, it passes through various states. The current state of a job is reported by the squeue command in the ST column. The possible values of the ST column are given below:
Value | Meaning | Description |
CG | Completing | Job is in the process of completing; some processes may still be active on the allocated nodes |
PD | Pending | Job is waiting for resources to be allocated |
R | Running | Job is running on the allocated resources |
S | Suspended | Job has an allocation, but execution has been suspended and its CPUs have been released for other jobs |
NF | Node Failure | Job terminated due to the failure of one or more allocated nodes |
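Jobs in a particular state can also be listed directly. For example, to show only your pending jobs, or to ask Slurm for its current estimate of when a pending job will start, the standard -t/--states and --start options of squeue can be used (job ID 1202 is taken from the squeue example above):

$ squeue -u <netID> -t PENDING
$ squeue -j 1202 --start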
The isaac-sinfo and showpartitions commands can be used to check partitions, node features, and the current availability of the requested computing resources.
$ isaac-sinfo
PARTITION          AVAIL_FEATURES            AVAIL  TIMELIMIT   JOB_SIZE    ROOT  OVERSUBS  GROUPS  NODES  STATE      RESERVATION  NODELIST
campus             amd,bergamo,avx512        up     infinite    1-infinite  no    NO        all     3      mixed                   ber[1413,1415,1430]
campus             intel,cascadelake,avx512  up     infinite    1-infinite  no    NO        all     4      mixed                   clr[0726,0817,0824,0832]
campus             amd,bergamo,avx512        up     infinite    1-infinite  no    NO        all     1      allocated               ber1432
campus             intel,icelake,avx512      up     infinite    1-infinite  no    NO        all     4      allocated               il[1228-1229,1234-1235]
campus-bigmem      intel,cascadelake,avx512  up     30-00:00:0  1-infinite  no    NO        all     1      mixed                   clrm1219
campus-bigmem      intel,icelake,avx512      up     30-00:00:0  1-infinite  no    NO        all     1      mixed                   ilm0833
campus-bigmem      intel,cascadelake,avx512  up     30-00:00:0  1-infinite  no    NO        all     1      allocated               clrm1218
campus-bigmem      intel,icelake,avx512      up     30-00:00:0  1-infinite  no    NO        all     1      allocated               ilm0835
long-bigmem        intel,cascadelake,avx512  up     30-00:00:0  1-infinite  no    NO        all     1      mixed                   clrm1219
long-bigmem        intel,cascadelake,avx512  up     30-00:00:0  1-infinite  no    NO        all     1      allocated               clrm1218
long-bigmem        intel,icelake,avx512      up     30-00:00:0  1-infinite  no    NO        all     1      allocated               ilm0835
campus-gpu         intel,cascadelake,avx512  up     30-00:00:0  1-infinite  no    NO        all     1      down$      clrv1149_wi  clrv1149
campus-gpu         intel,cascadelake,avx512  up     30-00:00:0  1-infinite  no    NO        all     1      mixed                   clrv0701
campus-gpu         intel,cascadelake,avx512  up     30-00:00:0  1-infinite  no    NO        all     1      allocated               clrv0703
campus-gpu-bigmem  intel,icelake,avx512      up     30-00:00:0  1-infinite  no    NO        all     1      mixed                   ilpa1209
campus-gpu-large   intel,cascadelake,avx512  up     30-00:00:0  1-infinite  no    NO        all     2      mixed      rsoni_paper  clrv[1203,1205]
campus-gpu-large   intel,cascadelake,avx512  up     30-00:00:0  1-infinite  no    NO        all     4      allocated               clrv[1101,1103,1105,1107]
campus-gpu-large   intel,cascadelake,avx512  up     30-00:00:0  1-infinite  no    NO        all     1      idle                    clrv1201
long-gpu           intel,cascadelake,avx512  up     30-00:00:0  1-infinite  no    NO        all     4      allocated               clrv[1101,1103,1105,1107]
long               intel,cascadelake,avx512  up     30-00:00:0  1-infinite  no    NO        all     6      allocated               clr[0732-0737]
short              intel,cascadelake,avx512  up     30-00:00:0  1-infinite  no    NO        all     1      mixed                   clr0716
$ showpartitions
Partition statistics for cluster isaac at Mon Jan 22 11:15:50 EST 2024
 Partition                 #Nodes       #CPU_cores    Cores_pending  Job_Nodes  MaxJobTime  Cores  Mem/Node
 Name              State  Total  Idle   Total   Idle  Resorc  Other   Min  Max   Day-hr:mn  /node      (GB)
 campus             up       45     0    2896    204       0   4960     1  infin  infinite     48      190+
 all                up      212    72   12784   5943       0      0     1  infin  30-00:00     48      128+
 campus-bigmem      up        4     0     208     40       0      0     1  infin  30-00:00     48     1546+
 long-bigmem        up        3     0     152     32       0     96     1  infin  30-00:00     48     1546+
 campus-gpu         up        3     0     144      8       0     20     1  infin  30-00:00     48       190
 campus-gpu-bigmem  up        1     0      64      8       0      0     1  infin  30-00:00     64      1020
 campus-gpu-large   up        7     3     336    180       0    360     1  infin  30-00:00     48       770
 long-gpu           up        4     2     192     96       0     96     1  infin  30-00:00     48       770
 long               up        6     0     288     16       0      1     1  infin  30-00:00     48       190
 short              up       25    22    1480   1375       0      0     1  infin  30-00:00     48       190+
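The features listed in the AVAIL_FEATURES column can be requested explicitly with Slurm's standard --constraint (-C) option. As a sketch, a job that must run on Cascade Lake nodes in the campus partition could include the following directives; the feature name is taken from the isaac-sinfo output above:

#SBATCH -p campus
#SBATCH --constraint=cascadelake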
The command sacct -B -j {jobid} can be used to check the job submission parameters (it prints the batch script that was submitted).
$ sacct -B -j 1201
Batch Script for 1201
--------------------------------------------------------------------------------
#!/bin/bash
#SBATCH -J Condo-utia               # Job name
#SBATCH -A acf-utk0XYZ              # Replace with the project account associated with your UTIA condo
#SBATCH -p condo-utia
#SBATCH -N 1
#SBATCH --ntasks-per-node=2         # --ntasks can be used instead to define the total number of tasks
#SBATCH --time=03:00:00
#SBATCH -o Condo-utia.o%j           # Write output and error to the same file
#SBATCH --qos=condo

########### Perform some simple commands ########################
set -x

########### Run your parallel executable with srun ###############
srun ./hello_mpi
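sacct is also useful after a job has finished. Its standard --format option selects accounting fields for the job and its steps; the fields shown below are standard sacct field names and can be adjusted as needed:

$ sacct -j 1201 --format=JobID,JobName,Partition,State,ExitCode,Elapsed,MaxRSS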
The command scontrol show job {jobid} can be used to check details of the requested resources for the job.
$ scontrol show job 1201
JobId=1201 JobName=Condo-utia
   UserId=username(10577) GroupId=tug2106(3319) MCS_label=N/A
   Priority=4600 Nice=0 Account=acf-utk0XYZ QOS=condo
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=03:00:00 TimeMin=N/A
   SubmitTime=2024-01-22T08:48:41 EligibleTime=2024-01-22T08:48:41
   AccrueTime=2024-01-22T08:48:41
   StartTime=2024-02-10T13:18:33 EndTime=2024-02-10T16:18:33 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-01-22T16:40:18 Scheduler=Main
   Partition=condo-utia AllocNode:Sid=localhost:1184486
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=
   NumNodes=1-1 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=2,mem=7600M,node=1,billing=2
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=2:0:*:* CoreSpec=*
   MinCPUsNode=2 MinMemoryCPU=3800M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/lustre/isaac/scratch/username/jobs/condo_utia.sh
   WorkDir=/lustre/isaac/scratch/username/jobs
   StdErr=/lustre/isaac/scratch/username/jobs/Condo-utia.o1201
   StdIn=/dev/null
   StdOut=/lustre/isaac/scratch/username/jobs/Condo-utia.o1201
   Power=
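Besides displaying job details, scontrol can also act on a job that has not yet started. For example, the standard hold, release, and update subcommands can be used as sketched below; the new time limit is only an illustration and must stay within the partition's maximum:

$ scontrol hold 1201
$ scontrol release 1201
$ scontrol update JobId=1201 TimeLimit=01:00:00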