Monitoring Jobs on ISAAC ORI
Introduction
Jobs submitted to the cluster can encounter a variety of issues that slow or even halt execution. In some cases, users can diagnose and repair these problems themselves. In this document, you will learn how to monitor your submitted jobs and troubleshoot common job issues.
Monitoring Utilities
The squeue command can be used to facilitate job monitoring.
$ squeue -u <netID>
 JOBID PARTITION  NAME     USER ST  TIME NODES NODELIST(REASON)
  1202    campus  Job3 username PD  0:00     2 (Resources)
  1201    campus  Job1 username  R  0:05     2 node[001-002]
  1200    campus  Job2 username  R  0:10     2 node[004-005]
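If you prefer to refresh this view automatically instead of re-running the command by hand, the standard Linux watch utility can be wrapped around squeue. This is a minimal sketch; it assumes watch is available on the login node and that <netID> is replaced with your own username:

$ watch -n 30 squeue -u <netID>   # re-runs squeue every 30 seconds; press Ctrl+C to exit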
The description of each of the columns of the output from the squeue command is given below:
Column | Description |
JOBID | The unique identifier of each job |
PARTITION | The partition/queue from which the resources are to be allocated to the job |
NAME | The name of the job specified in the Slurm script using the #SBATCH -J option. If the -J option is not used, Slurm uses the name of the batch script. |
USER | The login name of the user submitting the job |
ST | Status of the job. The Slurm scheduler uses a short notation for the job status; the meaning of each value is given in the table below. |
TIME | The wall time used by the job so far (0:00 for jobs that have not started) |
NODES | The number of nodes allocated to the job, or requested if the job is still pending |
NODELIST(REASON) | The list of nodes allocated to the job or, for a pending job, the reason it is still waiting (for example, Resources or Priority) |
When a job is submitted, it passes through various states. The state of a job is shown by the squeue command under the ST column. The possible values of the ST column are given below:
Value | Meaning | Description |
CG | Completing | Job is in the process of completing |
PD | Pending | Job is waiting for the resources to be allocated |
R | Running | Job is running on the allocated resources |
S | Suspended | Job was allocated resources, but execution was suspended and the CPUs were released for other jobs |
NF | Node Failure | Job terminated due to failure of one or more allocated nodes |
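squeue can also filter on these state codes and, for pending jobs, report the scheduler's current estimate of the start time. The examples below use the standard Slurm options -t/--states and --start; substitute your own <netID> and <jobid>:

$ squeue -u <netID> -t PENDING   # list only your pending jobs
$ squeue -j <jobid> --start      # show the estimated start time of a pending job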
The commands isaac-sinfo and showpartitions can be used to check partitions, features, and the current availability of the requested computing resources.
$ isaac-sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
campus         up   infinite      4   idle skk[101001-101004]
campus-gpu     up 30-00:00:0      1   idle skvk1131
$ showpartitions
Partition statistics for cluster ori at Thu Jun 13 08:22:24 EDT 2024
      Partition     #Nodes     #CPU_cores  Cores_pending   Job_Nodes MaxJobTime Cores Mem/Node
      Name State Total  Idle  Total   Idle  Resorc  Other   Min   Max  Day-hr:mn /node     (GB)
     campus    up     4     4    160    160      0      0     1 infin   infinite    40      190
 campus-gpu    up     1     0     40      0      0      0     1 infin   30-00:00    40      190
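For more detail about the nodes behind a particular partition, such as CPU cores, memory, and GPUs, the stock Slurm commands sinfo and scontrol can also be queried. The lines below are a sketch using standard Slurm format options against the campus-gpu partition shown above:

$ sinfo -p campus-gpu -N -o "%N %c %m %G"   # node name, CPUs, memory (MB), generic resources (GPUs)
$ scontrol show partition campus-gpu        # full partition definition, including limits and QOS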
The command sacct -B -j {jobid} can be used to check the job submission parameters.
$ sacct -B -j 1201
Batch Script for 1201
--------------------------------------------------------------------------------
#!/bin/bash
#SBATCH -J Condo-utia          #Job name
#SBATCH -A acf-utk0XYZ         #Write your project account associated to utia condo
#SBATCH -p condo-utia
#SBATCH -N 1
#SBATCH --ntasks-per-node=2    #--ntasks is used when we want to define total number of processors
#SBATCH --time=03:00:00
#SBATCH -o Condo-utia.o%j      #Dump output and error in one file
#SBATCH --qos=condo

########### Perform some simple commands ########################
set -x

########### Run your parallel executable with srun ###############
srun ./hello_mpi
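In addition to printing the batch script, sacct can report accounting information for running and completed jobs, which is useful for checking the exit code, elapsed time, and memory usage after a job finishes. The fields below are standard Slurm accounting fields; substitute your own <jobid>:

$ sacct -j <jobid> --format=JobID,JobName,Partition,State,ExitCode,Elapsed,MaxRSS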
The command scontrol show job {jobid} can be used to check details of the requested resources for the job.
$ scontrol show job 1201
JobId=1201 JobName=Condo-utia
   UserId=username(10577) GroupId=tug2106(3319) MCS_label=N/A
   Priority=4600 Nice=0 Account=acf-utk0XYZ QOS=condo
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=03:00:00 TimeMin=N/A
   SubmitTime=2024-01-22T08:48:41 EligibleTime=2024-01-22T08:48:41
   AccrueTime=2024-01-22T08:48:41
   StartTime=2024-02-10T13:18:33 EndTime=2024-02-10T16:18:33 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-01-22T16:40:18 Scheduler=Main
   Partition=condo-utia AllocNode:Sid=localhost:1184486
   ReqNodeList=(null) ExcNodeList=(null) NodeList=
   NumNodes=1-1 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=2,mem=7600M,node=1,billing=2
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=2:0:*:* CoreSpec=*
   MinCPUsNode=2 MinMemoryCPU=3800M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/lustre/isaac/scratch/username/jobs/condo_utia.sh
   WorkDir=/lustre/isaac/scratch/username/jobs
   StdErr=/lustre/isaac/scratch/username/jobs/Condo-utia.o1201
   StdIn=/dev/null
   StdOut=/lustre/isaac/scratch/username/jobs/Condo-utia.o1201
   Power=
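scontrol can also modify a job that is still pending. For example, a held job can be released, or the time limit can be reduced so the scheduler may backfill the job sooner. The commands below are standard Slurm commands, shown here as a sketch against job 1201; note that regular users can typically only decrease a job's time limit, not increase it:

$ scontrol hold 1201                             # prevent the pending job from starting
$ scontrol release 1201                          # allow the held job to be scheduled again
$ scontrol update JobId=1201 TimeLimit=01:30:00  # lower the requested wall time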