Monitoring and Modifying Jobs on ISAAC Legacy
Introduction
Jobs submitted to the cluster can encounter a variety of issues that slow or even halt execution. In some cases, users can diagnose and repair these problems themselves. In this document, you will learn how to monitor your submitted jobs, make modifications to them, and troubleshoot common job issues.
Monitoring Utilities
Job monitoring enables users to determine the status of their submitted or active jobs. Tracking the status of these jobs can help identify issues that require remediation. Several tools on the cluster facilitate job monitoring. qstat is one of these tools.
qstat outputs a summary of the jobs a user has submitted to the cluster, both interactive and non-interactive. To review a brief summary, execute qstat -a. Figure 2.1 shows the output of qstat with this option. To review a verbose summary, execute qstat -f. Most situations will not require verbose output, but it can be useful in troubleshooting scenarios.
[user-x@acf-login5 user-x]$ qstat -a

apollo-acf:
                                                                          Req'd     Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID NDS   TSK    Memory    Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
2804765.apollo-acf      user-x      batch    sample_job.sh        --     1      1    --      01:00:00 H    --
Figure 2.1 – Output of qstat -a
In total, qstat -a outputs eleven columns. Most of the columns are self-explanatory. Table 2.1 describes the purpose of columns that are not readily obvious.
qstat Column | Description |
---|---|
NDS | Shows the number of nodes requested by the job. qsub’s -l nodes=<num-nodes> option influences this number. |
TSK | Shows the number of cores requested by the job. qsub’s -l nodes=<num-nodes>:ppn=<num-cores> option influences this number. |
S | Shows the job’s status. Table 2.2 defines the possible job statuses. |
Table 2.1 – Descriptions of Select qstat Columns

Job Status | Description |
---|---|
C | Job has completed execution. Please note that this does not indicate the job successfully completed, only that it was processed by the scheduler and executed. |
E | Job has exited after execution. |
H | Job is held. It will not execute until it is released. |
Q | Job is queued and eligible to execute. When sufficient resources become available, the job will execute. |
R | Job is currently running. |
T | Job is being moved to a new location. |
W | Job is waiting for its execution time to be reached. |
Table 2.2 – Job Status Codes
Because qstat -a outputs summarized information on every job a user has submitted, it is not well-suited to finding detailed information on individual jobs. To monitor individual jobs, use either qstat -f <job-ID> or the checkjob utility. Figure 2.2 shows the output of checkjob. Figure 2.3 presents the syntax for the checkjob utility. The <job-ID> argument accepts the job’s numeric identifier followed by the batch server’s hostname. In Figure 2.2, the job’s identifier is 2804750.apollo-acf. When you submit a job, the scheduler outputs its identifier. You may also use qstat -a to quickly find job identifiers.
[user-x@acf-login5 user-x]$ checkjob 2804750.apollo-acf
job 2804750

AName: sample_job.sh
State: Running
Creds:  user:user-x  group:tug0001  account:ACF-UTK0011  class:batch  qos:campus
WallTime:   00:00:00 of 1:00:00
SubmitTime: Tue Mar 31 13:16:36
  (Time Queued  Total: 00:00:23  Eligible: 00:00:23)

StartTime: Tue Mar 31 13:16:59
TemplateSets:  DEFAULT
NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 1

Req[0]  TaskCount: 1  Partition: beacon
Opsys: ---  Arch: ---  Features: batch
Dedicated Resources Per Task: PROCS: [ALL]
NodeCount:  1

Allocated Nodes:
SystemID:   Moab
SystemJID:  2804750
Notification Events: JobFail

IWD:            /lustre/haven/user/user-x
StartCount:     1
Partition List: beacon,general
Flags:          RESTARTABLE,ALLPROCS
Attr:           checkpoint
StartPriority:  102998
IterationJobRank: 0
PE: 0.00
Reservation '2804750' (-00:00:16 -> 00:59:44  Duration: 1:00:00)
Figure 2.2 – Output of checkjob
checkjob <job-ID>
Figure 2.3 – Syntax of checkjob
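For instance, to review the full scheduler attributes of the job shown in Figure 2.2, pass its identifier to qstat -f:
qstat -f 2804750.apollo-acf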
Another useful job monitoring utility is showq. When it is executed without any options or arguments, it shows the cluster’s entire queue. With the -w user=<username> option, it shows only the queue for the specified user. Figure 2.4 shows the output of showq with this option. Observe that showq divides its output into active, eligible, and blocked jobs. This provides a convenient way to determine the current status of all your submitted jobs. You can also use the -w option with group=<group-name> to review the queue for a specific group on the cluster, as illustrated after Figure 2.4.
[user-x@acf-login5 user-x]$ showq -w user=$USER

active jobs------------------------
JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME

0 active jobs      0 of 7192 processors in use by local jobs (0.00%)
                   0 of 277 nodes active      (0.00%)

eligible jobs----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME

0 eligible jobs

blocked jobs-----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME

2809097             user-x     UserHold     8     1:00:00  Wed Apr 1 15:45:42

1 blocked job

Total job: 1
Figure 2.4 – Output of showq -w user=$USER
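The group filter works the same way. For example, assuming your group were tug0001 (the group shown in the checkjob output in Figure 2.2), you could list that group’s queue with:
showq -w group=tug0001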
Modifying Jobs
Submitted jobs may require alteration before they can execute. While syntax errors in a job script will prevent the job from being submitted, errors in the values provided in the job script may not. Users can remedy these situations themselves. Please note that if a job is already running and requires modification, an administrator must make those changes. Please contact the OIT Help Desk (see https://help.utk.edu/) if you need a running job to be modified.
In general, the process for modifying a job in the queue is as follows. The syntax for each command appears in the figures after the list.
- Use one of the utilities described in the Monitoring Utilities section to gather information about the jobs that require alteration. You will need the job’s identifier and the parameters that require modification.
- Use the qhold command to place a hold on the job(s) so the scheduler will not start them. Figure 3.1 shows the syntax for this command. Replace <job-ID> with the job’s numeric identifier followed by the hostname of the batch server (apollo-acf).
- Use the qalter command to modify the job. For instance, to modify the number of nodes requested by the job, execute the command shown in Figure 3.2.
- Use the qrls command to release the job back into the scheduler’s queue. Figure 3.3 shows the syntax for this command.
- Continue to monitor the job and make additional modifications, if necessary.
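qhold <job-ID>
Figure 3.1 – Syntax of qhold
qalter -l nodes=<num-nodes> <job-ID>
Figure 3.2 – Changing a Job’s Node Request with qalter
qrls <job-ID>
Figure 3.3 – Syntax of qrls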
If you determine that a job must be removed from the queue entirely, use the qdel command, which purges the job from the queue. Figure 3.4 shows how to use the qdel command.
qdel <job-ID>
Figure 3.4 – Syntax of qdel
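As a worked sketch, suppose the blocked job 2809097.apollo-acf from Figure 2.4 needed a two-hour walltime instead of one. The hold, alter, and release cycle would be:
qhold 2809097.apollo-acf
qalter -l walltime=02:00:00 2809097.apollo-acf
qrls 2809097.apollo-acf
If the job were no longer needed at all, qdel 2809097.apollo-acf would purge it from the queue instead.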
Common Job Issues
Various issues can affect the submission or execution of jobs. Users can ordinarily remedy these problems themselves. This section covers the most common problems users encounter with their jobs and provides instructions to resolve them.
Job Script Errors
Syntax errors and invalid arguments are frequent in job scripts. If the job cannot be submitted, verify that the job script conforms to the appropriate syntax for the interpreter and uses proper arguments for each PBS option. For example, consider the error in Figure 4.1. In this case, the user provided the ppn value after a colon rather than an equal (=) sign. The scheduler will not permit a job with this error to be queued.
#PBS -l nodes=1:ppn:4
Figure 4.1 – Syntax Error with the ppn Option
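The corrected directive uses an equal sign to assign the ppn value:
#PBS -l nodes=1:ppn=4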
Another possible error is shown in Figure 4.2. Here, the user provided two environment variables with the incorrect syntax. The leading dollar sign ($) should be removed to ensure that the variables are passed to the job. Additionally, there should be no spaces between variables, only commas.
#PBS -v $SCRATCHDIR $LD_LIBRARY_PATH
Figure 4.2 – Syntax Error with the -v Option
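With the dollar signs removed and the variables separated by a comma, the directive becomes:
#PBS -v SCRATCHDIR,LD_LIBRARY_PATH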
Resource Limitations
Jobs that occupy the queue for extended periods of time have likely requested more resources than the cluster has available. In these situations, modify the job’s resource requirements to reflect what it actually needs to execute. For instance, a job that requests four nodes is unlikely to execute in a timely fashion. Instead, it should request two nodes with a ppn (processors per node) value that indicates how many cores per node the job requires. Figure 4.3 shows the alteration that should be made. Review the process for altering jobs outlined in the Modifying Jobs section before attempting any modifications.
qalter -l nodes=2:ppn=8 1112223.apollo-acf
Figure 4.3 – Changing a Job’s Resource Request
If the job specifies the -n option to the qsub command or in the job script, it should be removed unless the job requires an entire node. Figure 4.4 shows how to unset the node-exclusive option from a submitted job. The output of qstat -f will reflect the change under the “node_exclusive” key.
qalter -n n 3334445.apollo-acf
Figure 4.4 – Removing the Node-Exclusive Option from a Job
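To confirm that the change took effect, the node_exclusive attribute can be inspected directly; a quick sketch using the job identifier from Figure 4.4:
qstat -f 3334445.apollo-acf | grep node_exclusive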
Using the Wrong Partition or Feature
Jobs may request a partition that does not exist. Figure 4.5 illustrates an invalid partition specification: “beacon_gpu” is not a partition; it is a feature. The scheduler will not queue a job with this partition selected. It should be changed to the valid partition “beacon” so that the job will be queued.
#PBS -l partition=beacon_gpu
Figure 4.5 – Invalid Partition Specification
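Changed to the valid partition, the directive reads:
#PBS -l partition=beacon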
Other jobs may request a partition to which the submitting user has no access. In these cases, the partition should be changed to one that the user can access. Figure 4.6 shows how to use the qalter command to change the job’s partition. Remember to follow the process for altering jobs outlined in the Modifying Jobs section before altering a job.
qalter -l partition=general 9998881.apollo-acf
Figure 4.6 – Changing a Job’s Partition
Though less likely, errors involving features can occur. Requesting nonexistent features such as “gpu” or “bigcore” will prevent the job from being queued. Features should only be used to specify the exact set of nodes on which the job should run. Figure 4.7 demonstrates a change to a job that targets the general partition and the sigma feature, which will place the job on a Sigma node.
qalter -l partition=general,feature=sigma 5554443.apollo-acf
Figure 4.7 – Changing a Job’s Partition and Feature
For more information about partitions and features, please review the Running Jobs document.
Using the Wrong Project or Reservation
If the wrong project or reservation is used to submit a job, the scheduler may refuse to place it in the queue or it may occupy the queue for an indefinite time. Before submitting a job, review the project(s) to which you belong in the User Portal. Use the project’s identifier to submit your jobs to the scheduler. By default, all UTK users belong to the ACF-UTK0011 project, while all UTHSC users belong to the ACF-UTHSC0001 project. Figure 4.8 shows how to modify the project used by a submitted job.
qalter -A ACF-UTK0011 3339991.apollo-acf
Figure 4.8 – Changing a Job’s Project
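To verify the new project, check the verbose job attributes. In standard TORQUE output, the project set with -A appears under the Account_Name attribute (attribute name assumed here); a sketch using the job identifier from Figure 4.8:
qstat -f 3339991.apollo-acf | grep Account_Name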
Reservations are generally only available for a limited time. For example, a class may have a two-day reservation that grants exclusive access to some of the nodes in the cluster. The reservation identifier should be provided to all the users who require access to the reservation. The showres command can also be used to output the reservations to which you belong. If you use the wrong reservation or mistype the identifier, use the command in Figure 4.9 to change it.
qalter -l advres=<res-ID> 4445553.apollo-acf
Figure 4.9 – Changing a Job’s Reservation
Some projects have access to private condos. In these cases, the QoS (quality of service) attribute must be specified in order to access the condo. Review the Writing Job Scripts and Running Jobs documents for more information. If you need to change the QoS attribute of a queued job, use the command in Figure 4.10 to modify it.
qalter -l qos=condo 1119992.apollo-acf
Figure 4.10 – Changing a Job’s QoS Attribute
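Once the job is released, checkjob should reflect the new value in the qos field of its Creds line; a sketch using the job identifier from Figure 4.10:
checkjob 1119992.apollo-acf | grep qos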