High Performance & Scientific Computing

Monitoring and Modifying Jobs on ISAAC Legacy

Introduction

Jobs submitted to the cluster can encounter a variety of issues that slow or even halt execution. In some cases, users can diagnose and repair these problems themselves. In this document, you will learn how to monitor your submitted jobs, make modifications to them, and troubleshoot common job issues.

Monitoring Utilities

Job monitoring enables users to determine the status of their submitted or active jobs. Tracking the status of these jobs can help identify issues that require remediation. Several tools on the cluster facilitate job monitoring, beginning with qstat.

qstat outputs a summary of the jobs a user has submitted to the cluster, both interactive and non-interactive. To review a brief summary, execute qstat -a. Figure 2.1 shows the output of qstat with this option. To review a verbose summary, execute qstat -f. Most situations will not require verbose output, but it can be useful for troubleshooting scenarios.

[user-x@acf-login5 user-x]$ qstat -a

apollo-acf: 
                                                                                      Req'd     Req'd        Elap
Job ID                  Username    Queue    Jobname          SessID   NDS    TSK    Memory      Time S      Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
2804765.apollo-acf      user-x      batch    sample_job.sh        --     1      1        --  01:00:00 H        --

Figure 2.1 – Output of qstat -a

In total, qstat -a outputs eleven columns. Most of the columns are self-explanatory. Table 2.1 describes the columns whose purpose is less obvious.

qstat Column	Description
NDS	Shows the number of nodes requested by the job. qsub’s `-l nodes=<num-nodes>` option influences this number.
TSK	Shows the number of cores requested by the job. qsub’s `-l nodes=<num-nodes>:ppn=<num-cores>` option influences this number.
S	Shows the job’s status. Table 2.2 defines the possible job statuses.

Job Status	Description
C	Job has completed execution. Please note that this does not indicate that the job completed successfully, only that it was processed by the scheduler and executed.
E	Job is exiting after having run.
H	Job is held. It will not execute until it is released.
Q	Job is queued and eligible to execute. When sufficient resources become available, the job will execute.
R	Job is currently running.
T	Job is being moved to a new location.
W	Job is waiting for its execution time to be reached.
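
As a minimal sketch, the status column can be extracted from qstat -a output with awk. The helper name job_states is hypothetical, and the sample text is taken from Figure 2.1; on the cluster, you would pipe qstat -a into the function instead.

```shell
# job_states: hypothetical helper that reads `qstat -a` output on stdin
# and prints each job ID followed by its one-letter status.
# Job lines start with a digit; the status (S) is the second-to-last field.
job_states() {
    awk '/^[0-9]/ { print $1, $(NF - 1) }'
}

# Sample input taken from Figure 2.1.
# On the cluster, use: qstat -a | job_states
job_states <<'EOF'
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory      Time    S   Time
2804765.apollo-acf      user-x      batch    sample_job.sh       --      1      1       --   01:00:00 H       --
EOF
```

For the sample above, the helper prints the job identifier followed by H, which Table 2.2 defines as a held job. The sketch assumes job names contain no spaces.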

Because qstat -a outputs summarized information on every job a user has submitted, it is not well-suited to finding detailed information on individual jobs. To monitor an individual job, use either qstat -f <job-ID> or the checkjob utility. Figure 2.2 shows the output of checkjob. Figure 2.3 presents the syntax for the checkjob utility. The <job-ID> argument accepts the job’s numeric identifier followed by the batch server’s hostname. In Figure 2.2, the job’s identifier is 2804750.apollo-acf. When you submit a job, the scheduler outputs its identifier. You may also use qstat -a to quickly find job identifiers.

[user-x@acf-login5 user-x]$ checkjob 2804750.apollo-acf
job 2804750

AName: sample_job.sh
State: Running 
Creds:  user:user-x  group:tug0001  account:ACF-UTK0011  class:batch  qos:campus
WallTime:   00:00:00 of 1:00:00
SubmitTime: Tue Mar 31 13:16:36
  (Time Queued  Total: 00:00:23  Eligible: 00:00:23)

StartTime: Tue Mar 31 13:16:59
TemplateSets:  DEFAULT
NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 1

Req[0]  TaskCount: 1  Partition: beacon
Opsys: ---  Arch: ---  Features: batch
Dedicated Resources Per Task: PROCS: [ALL]
NodeCount:  1

Allocated Nodes:

SystemID: Moab
SystemJID: 2804750
Notification Events: JobFail

IWD: /lustre/haven/user/user-x
StartCount: 1
Partition List: beacon,general
Flags: RESTARTABLE,ALLPROCS
Attr: checkpoint
StartPriority: 102998
IterationJobRank: 0
PE: 0.00
Reservation '2804750' (-00:00:16 -> 00:59:44 Duration: 1:00:00)

Figure 2.2 – Output of checkjob

checkjob <job-ID>

Figure 2.3 – Syntax of checkjob
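
When only the job's state is of interest, the State line can be filtered out of the checkjob output. The following is a sketch under that assumption; the helper name checkjob_state is hypothetical, and the sample lines are taken from Figure 2.2.

```shell
# checkjob_state: hypothetical helper that reads checkjob output on stdin
# and prints the value of the "State:" line.
checkjob_state() {
    awk '$1 == "State:" { print $2 }'
}

# Sample lines taken from Figure 2.2.
# On the cluster, use: checkjob <job-ID> | checkjob_state
checkjob_state <<'EOF'
job 2804750

AName: sample_job.sh
State: Running
EOF
```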

Another useful job monitoring utility is showq. When it is executed without any options or arguments, it shows the cluster’s entire queue. With the -w user=<username> option, it shows only the jobs that belong to the specified user. Figure 2.4 shows the output of showq with this option. Observe that showq divides its output into active, eligible, and blocked jobs. This provides a convenient way to determine the current status of all your submitted jobs. You can also pass group=<group-name> to the -w option to show the queue for a specific group on the cluster.

[user-x@acf-login5 user-x]$ showq -w user=$USER

active jobs------------------------
JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME


0 active jobs            0 of 7192 processors in use by local jobs (0.00%)
                          0 of 277 nodes active      (0.00%)

eligible jobs----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME


0 eligible jobs   

blocked jobs-----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME

2809097               user-x   UserHold     8     1:00:00  Wed Apr  1 15:45:42

1 blocked job    

Total job:  1

Figure 2.4 – Output of showq -w user=$USER
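
The two forms of the -w option can be wrapped in small shell functions, as in the sketch below. The function names my_queue and group_queue are hypothetical conveniences and are not part of showq.

```shell
# Hypothetical wrappers around showq's -w option.
# my_queue shows the jobs of one user (defaults to the current user).
my_queue() {
    showq -w user="${1:-$USER}"
}

# group_queue shows the jobs of one group.
group_queue() {
    showq -w group="$1"
}
```

For example, group_queue tug0001 would run showq -w group=tug0001, using the group name that appears in Figure 2.2.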

Modifying Jobs

Some jobs must be altered after submission before they can execute. Syntax errors in a job script prevent the job from being submitted, but invalid or unsuitable values in the script may not. Users can correct these problems themselves while the job is still queued. If a running job requires modification, an administrator must make the change. Please contact the OIT Help Desk (see https://help.utk.edu/) if you need a running job to be modified.

In general, the process for modifying a job in the queue is as follows.

1. Use one of the utilities described in the Monitoring Utilities section to gather information about the jobs that require alteration. You will need each job’s identifier and the parameters that require modification.
2. Use the qhold command to place a hold on the job(s) so that the scheduler will not start them. Figure 3.1 shows the syntax for this command. Replace <job-ID> with the job’s numeric identifier followed by the hostname of the batch server (apollo-acf).
3. Use the qalter command to modify the job. For instance, to modify the number of nodes requested by the job, execute the command shown in Figure 3.2.
4. Use the qrls command to release the hold and make the job eligible to run again. Figure 3.3 shows the syntax for this command.
5. Continue to monitor the job and make additional modifications, if necessary.
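
The hold, alter, and release sequence above can be sketched as a single shell function. The function name modify_queued_job and its argument order are hypothetical; the sketch assumes the change is a resource-list change made with qalter -l.

```shell
# modify_queued_job: hypothetical helper that holds a queued job, changes
# its resource list, and then releases it. Each step runs only if the
# previous one succeeded.
#   $1 = job identifier, for example 1112223.apollo-acf
#   $2 = resource list for qalter -l, for example nodes=2:ppn=8
modify_queued_job() {
    job_id=$1
    resources=$2
    qhold "$job_id" &&
        qalter -l "$resources" "$job_id" &&
        qrls "$job_id"
}
```

A call such as modify_queued_job 1112223.apollo-acf nodes=2:ppn=8 would run the three commands in order. If qalter fails, the job is left held so that the request can be corrected before the job is released with qrls.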

If you determine that the job must be completely removed from the queue, use the qdel command. The job will be completely purged from the queue. Figure 3.4 shows how to use the qdel command.

qdel <job-ID>

Figure 3.4 – Syntax of qdel

Common Job Issues

Various issues can affect the submission or execution of jobs. In these situations, users can ordinarily remedy the problems themselves without requiring assistance. This section covers the most common problems users encounter with their jobs and provides instructions to resolve them.

Job Script Errors

Syntax errors and invalid arguments are frequent in job scripts. If the job cannot be submitted, verify that the job script conforms to the appropriate syntax for the interpreter and uses proper arguments for each PBS option. For example, consider the error in Figure 4.1. In this case, the user provided the ppn value after a colon (:) rather than an equals sign (=). The scheduler will not permit a job with this error to be queued.

#PBS -l nodes=1:ppn:4

Figure 4.1 – Syntax Error with the ppn Option
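
For comparison, the sketch below shows the corrected directive, followed by a simple check that can be run on a job script before submission. The function name find_bad_ppn is hypothetical, and the check only looks for the specific mistake shown in Figure 4.1.

```shell
# Corrected directive: the ppn value follows an equals sign.
#PBS -l nodes=1:ppn=4

# find_bad_ppn: hypothetical pre-submission check. Prints every line of
# the job script ($1) where ppn is followed by a colon, with line numbers.
# Returns non-zero when no such line is found.
find_bad_ppn() {
    grep -n 'ppn:' "$1"
}
```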

Another possible error is shown in Figure 4.2. Here, the user provided two environment variables with incorrect syntax. The leading dollar sign ($) should be removed to ensure that the variables are passed to the job. Additionally, the variables should be separated by commas, not spaces.

#PBS -v $SCRATCHDIR $LD_LIBRARY_PATH

Figure 4.2 – Syntax Error with the -v Option
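
Applying both corrections gives the directive shown in the sketch below. The helper pbs_var_list is hypothetical; it simply joins variable names with commas to build the argument for the -v option.

```shell
# Corrected directive: variable names only, separated by commas.
#PBS -v SCRATCHDIR,LD_LIBRARY_PATH

# pbs_var_list: hypothetical helper that joins its arguments with commas,
# producing a list suitable for the -v option.
pbs_var_list() {
    ( IFS=,; printf '%s\n' "$*" )
}

pbs_var_list SCRATCHDIR LD_LIBRARY_PATH
```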

Resource Limitations

Jobs that occupy the queue for extended periods of time have likely requested more resources than the cluster has available. In these situations, modify the resource requirements of the job to reflect what the job needs to execute. For instance, a job that requests four nodes may wait a long time for that many nodes to become free. If the job can run on fewer nodes, it should instead request two nodes with a ppn (processors per node) value that indicates how many cores the job requires on each node. Figure 4.3 shows the alteration that should be made. Review the process for altering jobs outlined in the Modifying Jobs section before attempting to make any modifications.

qalter -l nodes=2:ppn=8 1112223.apollo-acf

Figure 4.3 – Changing a Job’s Resource Request

If the job specifies the -n option to the qsub command or in the job script, it should be removed unless the job requires an entire node. Figure 4.4 shows how to unset the node-exclusive option from a submitted job. The output of qstat -f will reflect the change under the “node_exclusive” key.

qalter -n n 3334445.apollo-acf

Figure 4.4 – Removing the Node-Exclusive Option from a Job

Using the Wrong Partition or Feature

Jobs may request a partition that does not exist. Figure 4.5 illustrates an invalid partition specification. “beacon_gpu” is not a partition; it is a feature. The scheduler will not queue this job with this partition selected. It should be changed to the valid partition “beacon” so that it will be queued.

#PBS -l partition=beacon_gpu

Figure 4.5 – Invalid Partition Specification

Other jobs may request a partition to which the submitting user has no access. In these cases, the partition should be changed to one that the user can access. Figure 4.6 shows how to use the qalter command to change the job’s partition. Remember to follow the process for altering jobs outlined in the Modifying Jobs section before altering a job.

qalter -l partition=general 9998881.apollo-acf

Figure 4.6 – Changing a Job’s Partition

Though less likely, errors involving features can occur. Requesting features like “gpu” or “bigcore” will prevent the job from being queued. Features should only be used to specify the exact node set on which the jobs should run. Figure 4.7 demonstrates a change to a job that targets the general partition and the sigma feature, which will place the job on a Sigma node.

qalter -l partition=general,feature=sigma 5554443.apollo-acf

Figure 4.7 – Changing a Job’s Partition and Feature

For more information about partitions and features, please review the Running Jobs document.

Using the Wrong Project or Reservation

If the wrong project or reservation is used to submit a job, the scheduler may refuse to place it in the queue or it may occupy the queue for an indefinite time. Before submitting a job, review the project(s) to which you belong in the User Portal. Use the project’s identifier to submit your jobs to the scheduler. By default, all UTK users belong to the ACF-UTK0011 project, while all UTHSC users belong to the ACF-UTHSC0001 project. Figure 4.8 shows how to modify the project used by a submitted job.

qalter -A ACF-UTK0011 3339991.apollo-acf

Figure 4.8 – Changing a Job’s Project

Reservations are generally only available for a limited time. For example, a class may have a two-day reservation to have exclusive access to some of the nodes in the cluster. The reservation identifier should be provided to all the users who require access to the reservation. The showres command can also be used to output the reservations to which you belong. If you use the wrong reservation or mistype the identifier, use the command in Figure 4.9 to change it.

qalter -l advres=<res-ID> 4445553.apollo-acf

Figure 4.9 – Changing a Job’s Reservation

Some projects have access to private condos. In these cases, the QoS (quality of service) attribute must be specified in order to access the condo. Review the Writing Job Scripts and Running Jobs documents for more information. If you need to change the QoS attribute of a queued job, use the command in Figure 4.10 to modify it.

qalter -l qos=condo 1119992.apollo-acf

Figure 4.10 – Changing a Job’s QoS Attribute

The University of Tennessee, Knoxville

Office of Innovative Technologies