Jobs submitted to the cluster can encounter a variety of issues that slow or even halt execution. In some cases, users can diagnose and repair these problems themselves. In this document, you will learn how to monitor your submitted jobs, make modifications to them, and troubleshoot common job issues.
Job monitoring enables users to determine the status of their submitted or active jobs. Tracking the status of these jobs can help identify issues that require remediation. Several tools on the cluster facilitate job monitoring.
qstat is one of these tools.
qstat outputs a summary of the jobs a user has submitted to the cluster, both interactive and non-interactive. To review a brief summary, execute qstat -a. Figure 2.1 shows the output of qstat with this option. To review a verbose summary, execute qstat -f. Most situations will not require verbose output, but it can be useful for troubleshooting scenarios.
[user-x@acf-login5 user-x]$ qstat -a
apollo-acf:
                                                                                  Req'd     Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID NDS   TSK    Memory    Time      S Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
2804765.apollo-acf      user-x      batch    sample_job.sh    --     1     1      --        01:00:00  H --
Figure 2.1 – Output of qstat -a
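When many jobs are listed, it can help to filter the summary by status. The following is a minimal sketch, not part of the cluster's tooling: jobs_in_state is a hypothetical helper that reads qstat -a output and prints the identifiers of jobs in a given state, assuming the row layout shown in Figure 2.1.

```shell
#!/bin/bash
# Print the job ID of every row in `qstat -a` output whose status (S) column
# matches the requested code. Input is read from stdin.
# Assumption: job rows follow the layout in Figure 2.1, so the status is the
# second-to-last whitespace-separated field.
jobs_in_state() {
    local state="$1"
    awk -v s="$state" '
        # Job rows begin with a numeric identifier such as 2804765.apollo-acf;
        # header and separator lines do not, so they are skipped.
        $1 ~ /^[0-9]+\./ && $(NF-1) == s { print $1 }
    '
}

# Hypothetical usage on a login node:
#   qstat -a | jobs_in_state H
```

Reading from standard input keeps the helper independent of the scheduler, so it can be tried against saved output before it is used on live data.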
In total, qstat -a outputs eleven columns. Most of the columns are self-explanatory. Table 2.1 describes the columns whose purpose is less obvious.
|NDS||Shows the number of nodes requested by the job. qsub’s |
|TSK||Shows the number of cores requested by the job. qsub’s |
|S||Shows the job’s status. Table 2.2 defines the possible job statuses.|
|C||Job has completed execution. Please note that this does not indicate that the job completed successfully, only that it was processed by the scheduler and executed.|
|E||Job is exiting after having run.|
|H||Job is held. It will not execute until it is released.|
|Q||Job is queued and eligible to execute. When sufficient resources become available, the job will execute.|
|R||Job is currently running.|
|T||Job is being moved to a new location.|
|W||Job is waiting for its execution time to be reached.|
Because qstat -a outputs summarized information on every job a user has submitted, it is not well suited to finding detailed information on individual jobs. To monitor an individual job, use either qstat -f <job-ID> or the checkjob utility. Figure 2.2 shows the output of checkjob, and Figure 2.3 presents its syntax. The <job-ID> argument accepts the job’s numeric identifier followed by the batch server’s hostname. In Figure 2.2, the job’s identifier is 2804750.apollo-acf. When you submit a job, the scheduler outputs its identifier. You may also use qstat -a to quickly find job identifiers.
[user-x@acf-login5 user-x]$ checkjob 2804750.apollo-acf
job 2804750
AName: sample_job.sh
State: Running
Creds: user:user-x group:tug0001 account:ACF-UTK0011 class:batch qos:campus
WallTime: 00:00:00 of 1:00:00
SubmitTime: Tue Mar 31 13:16:36
  (Time Queued Total: 00:00:23 Eligible: 00:00:23)
StartTime: Tue Mar 31 13:16:59
TemplateSets: DEFAULT
NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 1
Req TaskCount: 1 Partition: beacon
Opsys: --- Arch: --- Features: batch
Dedicated Resources Per Task: PROCS: [ALL]
NodeCount: 1
Allocated Nodes:
SystemID: Moab
SystemJID: 2804750
Notification Events: JobFail
IWD: /lustre/haven/user/user-x
StartCount: 1
Partition List: beacon,general
Flags: RESTARTABLE,ALLPROCS
Attr: checkpoint
StartPriority: 102998
IterationJobRank: 0
PE: 0.00
Reservation '2804750' (-00:00:16 -> 00:59:44 Duration: 1:00:00)
Figure 2.2 – Output of checkjob
Figure 2.3 – Syntax of checkjob
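For scripted checks, the State line is usually the most relevant part of the checkjob report. The sketch below assumes the "State: <value>" line format seen in Figure 2.2; job_state is a hypothetical helper name, not a cluster utility.

```shell
#!/bin/bash
# Print the value of the first "State:" field in checkjob output read on stdin.
# Assumption: the field appears on its own line as "State: <value>", as in
# Figure 2.2.
job_state() {
    awk '!found && $1 == "State:" { print $2; found = 1 }'
}

# Hypothetical usage on a login node:
#   checkjob 2804750.apollo-acf | job_state
```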
Another useful job monitoring utility is showq. When it is executed without any options or arguments, it shows the cluster’s entire queue. With the -w user=<username> option, it shows only the queue for the specified user. Figure 2.4 shows the output of showq with this option. Observe that showq divides its output into active, eligible, and blocked jobs. This provides a convenient way to determine the current status of all your submitted jobs. You can also use the -w option with group=<group-name> to view the queue for a specific group on the cluster.
[user-x@acf-login5 user-x]$ showq -w user=$USER
active jobs------------------------
JOBID     USERNAME   STATE     PROCS   REMAINING   STARTTIME
0 active jobs
0 of 7192 processors in use by local jobs (0.00%)
0 of 277 nodes active (0.00%)
eligible jobs----------------------
JOBID     USERNAME   STATE     PROCS   WCLIMIT     QUEUETIME
0 eligible jobs
blocked jobs-----------------------
JOBID     USERNAME   STATE     PROCS   WCLIMIT     QUEUETIME
2809097   user-x     UserHold  8       1:00:00     Wed Apr 1 15:45:42
1 blocked job
Total job: 1
Figure 2.4 – Output of showq -w user=$USER
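The three sections of the showq report can be condensed into a short tally. This is a sketch under the assumption that each section ends with a summary line such as "0 active jobs" or "1 blocked job", as in Figure 2.4; queue_summary is a hypothetical helper.

```shell
#!/bin/bash
# Reduce showq output (read on stdin) to one "<section> <count>" line per
# section, using the summary lines such as "0 active jobs" or "1 blocked job".
queue_summary() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^(active|eligible|blocked)$/ && $3 ~ /^jobs?$/ {
        print $2, $1
    }'
}

# Hypothetical usage on a login node:
#   showq -w user=$USER | queue_summary
```

A nonzero blocked count is a prompt to inspect the affected job with checkjob.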
Jobs may require alteration to execute after they are submitted. While syntax errors in a job script will prevent the job from being submitted, errors with the values provided in the job script may not. These situations can be remedied by users. Please note that if the job is running and requires modification, an administrator must make those changes. Please contact the OIT Help Desk (see https://help.utk.edu/) if you need a running job to be modified.
In general, the process for modifying a job in the queue is as follows.
Use the qhold command to place the job(s) on hold so that the scheduler will not start them. Figure 3.1 shows the syntax for this command. Replace <job-ID> with the job’s numeric identifier followed by the hostname of the batch server (apollo-acf).
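The hold, alter, and release sequence can be sketched as a small wrapper. This is an illustration only: modify_queued_job and the DRY_RUN switch are hypothetical names, and the sketch assumes that qrls is the command used to release the hold once qalter has applied the change.

```shell
#!/bin/bash
# Hold a queued job, apply one qalter change, then release the hold.
# Assumption: qrls is the release step; DRY_RUN is a hypothetical switch for
# this sketch. With DRY_RUN=1 (the default) the commands are only printed.
modify_queued_job() {
    local job_id="$1"
    shift
    local run=""
    if [ "${DRY_RUN:-1}" = "1" ]; then
        run="echo"
    fi
    # Stop at the first failing step; a job whose qalter fails stays held.
    $run qhold "$job_id" &&
        $run qalter "$@" "$job_id" &&
        $run qrls "$job_id"
}

# Example (prints the three commands without contacting the scheduler):
#   modify_queued_job 1112223.apollo-acf -l nodes=2:ppn=8
```

Printing the commands first makes it easy to review the job identifier and the qalter arguments before running them with DRY_RUN=0.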
If you determine that the job must be completely removed from the queue, use the qdel command, which purges the job from the queue. Figure 3.4 shows how to use the qdel command.
Figure 3.4 – Syntax of qdel
Various issues can affect the submission or execution of jobs. In these situations, users can ordinarily remedy the problems themselves without requiring assistance. This section covers the most common problems users encounter with their jobs and provides instructions to resolve them.
Syntax errors and invalid arguments are frequent in job scripts. If the job cannot be submitted, verify that the job script conforms to the appropriate syntax for the interpreter and uses proper arguments for each PBS option. For example, consider the error in Figure 4.1. In this case, the user provided the ppn value after a colon (:) rather than an equals sign (=). The scheduler will not permit a job with this error to be queued.
#PBS -l nodes=1:ppn:4
Figure 4.1 – Syntax Error with the ppn Option
Another possible error is shown in Figure 4.2. Here, the user provided two environment variables with the incorrect syntax. The leading dollar sign ($) should be removed to ensure that the variables are passed to the job. Additionally, there should be no spaces between variables, only commas.
#PBS -v $SCRATCHDIR $LD_LIBRARY_PATH
Figure 4.2 – Syntax Error with the -v Option
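Both mistakes are easy to catch before submission. The following is a minimal sketch of such a check: lint_pbs is a hypothetical helper that looks only for the two errors in Figures 4.1 and 4.2, and it assumes each directive is written as "#PBS", the option, and its value separated by whitespace. The corrected directives appear in the closing comments.

```shell
#!/bin/bash
# Flag the two mistakes from Figures 4.1 and 4.2 in a job script read on stdin.
# This is a hypothetical helper with deliberately narrow checks; it is not a
# full validator for PBS directives.
lint_pbs() {
    awk '
        $1 == "#PBS" && $2 == "-l" && $3 ~ /ppn:/ {
            print "line " NR ": write ppn=<count>, not ppn:<count>"
        }
        $1 == "#PBS" && $2 == "-v" {
            if (index($0, "$") > 0)
                print "line " NR ": remove the leading $ from variable names"
            if (NF > 3)
                print "line " NR ": separate variables with commas, not spaces"
        }
    '
}

# Corrected forms of the two directives:
#   #PBS -l nodes=1:ppn=4
#   #PBS -v SCRATCHDIR,LD_LIBRARY_PATH
#
# Hypothetical usage:
#   lint_pbs < sample_job.sh
```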
Jobs that occupy the queue for extended periods of time have likely requested more resources than the cluster has available. In these situations, modify the resource requirements of the job to reflect what the job needs to execute. For instance, a job that requests four nodes is unlikely to execute in a timely fashion. Instead, it should request two nodes with a ppn (processors per node) value that indicates how many cores the job requires. Figure 4.3 shows the alteration that should be made. Review the process for altering jobs outlined in the Modifying Jobs section before attempting to make any modifications.
qalter -l nodes=2:ppn=8 1112223.apollo-acf
Figure 4.3 – Changing a Job’s Resource Request
If the job specifies the -n option, either on the qsub command line or in the job script, remove the option unless the job requires an entire node. Figure 4.4 shows how to unset the node-exclusive option from a submitted job. The output of qstat -f will reflect the change under the “node_exclusive” key.
qalter -n n 3334445.apollo-acf
Figure 4.4 – Removing the Node-Exclusive Option from a Job
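To confirm that the change took effect, a single attribute can be pulled out of the verbose listing. The sketch below assumes qstat -f prints attributes as "key = value" lines; qstat_field is a hypothetical helper, and it does not rejoin values that wrap onto continuation lines.

```shell
#!/bin/bash
# Print the value of one "key = value" attribute from `qstat -f` output read
# on stdin. Assumption: the value fits on a single line; wrapped continuation
# lines are not rejoined by this sketch.
qstat_field() {
    local key="$1"
    awk -v k="$key" '
        !found && $1 == k && $2 == "=" {
            sub(/^[^=]*=[ ]*/, "")   # drop the key, the "=", and the spaces after it
            print
            found = 1
        }
    '
}

# Hypothetical usage on a login node:
#   qstat -f 3334445.apollo-acf | qstat_field node_exclusive
```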
Jobs may request a partition that does not exist. Figure 4.5 illustrates an invalid partition specification. “beacon_gpu” is not a partition; it is a feature. The scheduler will not queue this job with this partition selected. It should be changed to the valid partition “beacon” so that it will be queued.
#PBS -l partition=beacon_gpu
Figure 4.5 – Invalid Partition Specification
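A corrected version of the directive might look like the fragment below. The first directive follows directly from the text. The commented-out second line is an assumption: it supposes that beacon_gpu is requested through the feature resource in the same partition,feature form used elsewhere in this document, which should be confirmed for the cluster before use.

```shell
#!/bin/bash
# Valid replacement for the directive in Figure 4.5: name the partition only.
#PBS -l partition=beacon

# Assumption: if the beacon_gpu nodes are the goal, the name would be given as
# a feature alongside the partition. Enable only after confirming the feature
# name for the cluster.
##PBS -l partition=beacon,feature=beacon_gpu
```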
Other jobs may request a partition to which the submitting user has no access. In these cases, the partition should be changed to one that the user can access. Figure 4.6 shows how to use the qalter command to change the job’s partition. Remember to follow the process for altering jobs outlined in the Modifying Jobs section before altering a job.
qalter -l partition=general 9998881.apollo-acf
Figure 4.6 – Changing a Job’s Partition
Though less likely, errors involving features can occur. Requesting features like “gpu” or “bigcore” will prevent the job from being queued. Features should only be used to specify the exact node set on which the jobs should run. Figure 4.7 demonstrates a change to a job that targets the general partition and the sigma feature, which will place the job on a Sigma node.
qalter -l partition=general,feature=sigma 5554443.apollo-acf
Figure 4.7 – Changing a Job’s Partition and Feature
For more information about partitions and features, please review the Running Jobs document.
If the wrong project or reservation is used to submit a job, the scheduler may refuse to place it in the queue or it may occupy the queue for an indefinite time. Before submitting a job, review the project(s) to which you belong in the User Portal. Use the project’s identifier to submit your jobs to the scheduler. By default, all UTK users belong to the ACF-UTK0011 project, while all UTHSC users belong to the ACF-UTHSC0001 project. Figure 4.8 shows how to modify the project used by a submitted job.
qalter -A ACF-UTK0011 3339991.apollo-acf
Figure 4.8 – Changing a Job’s Project
Reservations are generally only available for a limited time. For example, a class may have a two-day reservation to have exclusive access to some of the nodes in the cluster. The reservation identifier should be provided to all the users who require access to the reservation. The showres command can also be used to output the reservations to which you belong. If you use the wrong reservation or mistype the identifier, use the command in Figure 4.9 to change it.
qalter -l advres=<res-ID> 4445553.apollo-acf
Figure 4.9 – Changing a Job’s Reservation
Some projects have access to private condos. In these cases, the QoS (quality of service) attribute must be specified in order to access the condo. Review the Writing Job Scripts and Running Jobs documents for more information. If you need to change the QoS attribute of a queued job, use the command in Figure 4.10 to modify it.
qalter -l qos=condo 1119992.apollo-acf