A new user or project usually starts by logging in to one of the login nodes or by using the web-based Open OnDemand interface (see our separate web page on Open OnDemand). The login nodes should only be used for basic tasks such as file editing and management, code compilation, and job submission. This document explains how to create jobs to perform work and how to use the Torque and Moab services to submit, monitor, and alter jobs on the cluster. It also covers how to run an interactive job on a compute node, which is called an “interactive batch job.” The cluster is set up to maximize utilization of the shared resources. To learn about our responsible node sharing and its implementation, please review the Responsible Node Sharing on ISAAC webpage.
Do not run production jobs on the login nodes. If you run computational work interactively on a login node, those processes are subject to termination by system administrators. Login nodes are shared by all users and should not be used for computational work. Instead, use the cluster’s compute resources as described below.
The Torque scheduler priority of a job influences how quickly it is assigned resources to begin execution on the cluster’s compute node resources. The general factors that affect priority are listed below.
The cluster uses a variety of mechanisms to determine how to schedule jobs. Most of the resource request mechanisms can be manipulated by users to ensure that their jobs are scheduled and executed in a reasonable time period.
In the context of the cluster, a condo is a logical group of compute nodes that acts as an independent cluster to effectively schedule jobs. Institutional and private condos exist on the cluster. All users belong to their respective institutional condo, whether that be UTK or UTHSC. Private condos are owned and used by individual investors and their associated projects. With a private condo, the investor and their project have exclusive use of the compute nodes in the condo. Investors can choose to share their private condo with the entire cluster under the Responsible Node Sharing implementation, but this is not a requirement.
A project combines an account used in the batch system with a Linux group used for file access control on the system. Projects and their associated batch-system accounts are a key component of the cluster’s condo-based scheduling model. Projects control access to institutional and private condos. For UTK institutional users, the default project account is ACF-UTK0011. For UTHSC institutional users, the default project is ACF-UTHSC0001. Other projects follow the same project identifier format. To determine which projects you belong to, log in to the User Portal and look under the “Projects” header. Always ensure you use the correct project account when submitting jobs with a PBS directive. The following example specifies the UTK institutional project account as a PBS directive in a job script:
#PBS -A ACF-UTK0011
Reservations are special allocations granted to a project. A reservation grants exclusive access to a set of nodes for a specific time period. For researchers who invest in equipment for their exclusive use on the system, these reservations are called private condos. To use a private condo, specify in the PBS directives the project account associated with the private condo and set the Quality of Service (qos) to condo. You can also specify the reservation ID in the PBS directives, but it should not be necessary to do so. An example set of PBS directives for a private condo job submission is:
#PBS -A ACF-UTK0003
#PBS -l qos=condo
The above would still need PBS directives to specify the requested nodes, cores, and walltime, any optional partition or features that relate to the private condo, and any job naming, emailing, or other optional PBS directives. A collection of complete sample job scripts is available at /lustre/haven/examples/jobs.
Also, see the output of acfreservations for a list of the resources available to a project account. For example, to show which resources account ACF-UTK0003 has access to, run the following command:
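As a minimal sketch, the two private-condo directives could be combined with a resource request as shown below. The job name, the nodes=1:ppn=4 request, and the two-hour walltime are illustrative assumptions, not site recommendations; substitute the values and project account that apply to your condo.

```shell
#!/bin/bash
#PBS -S /bin/bash
# Project account tied to the private condo, and the condo QoS
#PBS -A ACF-UTK0003
#PBS -l qos=condo
# Hypothetical job name and resource request; adjust for your work
#PBS -N condo_example
#PBS -l nodes=1:ppn=4
#PBS -l walltime=02:00:00

# PBS_O_WORKDIR holds the directory qsub was run from; fall back to the
# current directory so the script also runs outside of a batch job.
workdir="${PBS_O_WORKDIR:-$PWD}"
cd "$workdir" || exit 1
echo "Running in $workdir on $(hostname)"
```

Comments are kept on their own lines rather than after a directive so that they cannot be mistaken for part of the directive's arguments.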
$ acfreservations ACF-UTK0003
The scheduler uses queues to organize jobs. At the time of this writing, the cluster uses the batch queue and the debug queue. The former queue is the default queue to which all jobs are submitted. Users are not required to specify this queue when they submit jobs. The debug queue must be specified in the job script or on the command line. Figure 2.1 shows how to use this queue for job scripts, while Figure 2.2 shows how to use this queue for interactive jobs. In general, the debug queue should only be used for interactive work and short sessions of testing. All jobs submitted to the debug queue are limited to an hour of walltime.
#PBS -q debug
Figure 2.1 – Specifying the Debug Queue in a Job Script
qsub -I -q debug
Figure 2.2 – Specifying the Debug Queue for an Interactive Job
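Because the debug queue is limited to an hour of walltime, it can help to check a requested walltime before choosing the queue. The sketch below assumes walltimes are written as HH:MM:SS; the function names and the job.sh script name are hypothetical, and the final command is only printed, not run.

```shell
# Convert an HH:MM:SS walltime string to seconds.
walltime_to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Succeed only if the walltime fits within the one-hour debug queue limit.
fits_debug_queue() {
    [ "$(walltime_to_seconds "$1")" -le 3600 ]
}

wt="00:30:00"
if fits_debug_queue "$wt"; then
    # Print the submission command instead of running it.
    echo "qsub -q debug -l walltime=$wt job.sh"
else
    echo "walltime $wt exceeds the debug queue limit; use the batch queue"
fi
```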
Similar nodes are grouped into partitions. In addition to grouping like nodes, partitions enable users to specify the node set they wish to use. The current partitions are listed below. At the time of this writing, jobs default to using the general, beacon, and rho partitions. Be aware that because Rho is included in the default partition, you will only receive 2GB of memory per core if you do not specify a partition or ppn value. Thus, ensure that your job uses the appropriate partition. To learn more about the nodes within each partition, review the System Overview document. To learn how to target specific partitions, please review the Writing Job Scripts document.
Some partitions consist of multiple node sets. To target specific node sets within a partition, use a feature attribute. For instance, to target the Beacon GPU nodes, use the beacon_gpu feature. The available features are listed below. More information on how to use feature attributes is in the Writing Job Scripts document. Additionally, the Targeting GPU Nodes section of this document shows how the feature attribute is used in a job script.
QoS (Quality of Service) attributes define node allocations and wallclock limitations. At the time of this writing, opportunistic users are limited to 48 jobs, 24 nodes, and 24 hours of walltime for their jobs. Table 2.1 outlines the available QoS attributes.
|QoS Attribute||Min. Allocation||Max. Allocation||Wall Clock Limit|
|Condo||1 Node||Condo Max.||28 Days|
|Campus||1 Node||24 Nodes||24 Hours|
|Overflow||1 Node||24 Nodes||24 Hours|
|Long (UTHSC Projects Only)||1 Node||24 Nodes||6 Days|
|Long (UTK Projects Only)||1 Node||24 Nodes||3 Days|
By default, jobs that run in the institutional condos use the campus QoS attribute. Users can specify the long QoS attribute if their project has access to it. With the long QoS attribute, jobs can run for up to six days under UTHSC projects and up to three days under UTK projects. The overflow QoS attribute allows users to run their jobs in a condo and spill over into the user’s institutional condo if necessary. The condo QoS attribute permits users with private condos on the cluster to run their jobs on those nodes. If you wish to run a job in a private condo, use the condo QoS attribute in your job script for non-interactive jobs or with your qsub command for interactive jobs.
In general, it is best to specify the QoS attribute that applies to your situation rather than rely on the defaults. Please review the Writing Job Scripts document to learn how to use these QoS attributes in your jobs.
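The wall clock limits in Table 2.1 can be captured in a small lookup, which is one way to sanity-check a walltime request before adding a directive such as #PBS -l qos=campus. This is only a sketch: the function name and the utk/uthsc argument are hypothetical, and the limits are copied from the table as it stands at the time of this writing.

```shell
# Print the wall clock limit, in hours, for a QoS attribute (values from Table 2.1).
# The second argument (utk or uthsc) only matters for the long QoS.
qos_walltime_hours() {
    case "$1" in
        condo)           echo $(( 28 * 24 )) ;;
        campus|overflow) echo 24 ;;
        long)
            case "${2:-}" in
                uthsc) echo $(( 6 * 24 )) ;;
                utk)   echo $(( 3 * 24 )) ;;
                *)     return 1 ;;
            esac ;;
        *) return 1 ;;
    esac
}

qos_walltime_hours campus        # 24
qos_walltime_hours long uthsc    # 144
```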
Responsible node sharing on the cluster has altered the process for job submission. Previously, a single-node job would consume an entire node’s resources. Now, multiple jobs can share the same node. This increases the cluster’s throughput and resource utilization, benefiting users and administration. More information is available in the Responsible Node Sharing document. Rather than deal with what node sharing is, this section deals with how it affects the resources allocated to jobs.
In order to facilitate node sharing, the resource manager implements default resource allocations to jobs that do not specify what they require. By default, a job that does not specify the resources it requires will use the default partition, which consists of the beacon, rho, and general partitions. The job will receive a single core and 2GB of memory. Job scripts that previously worked on the cluster will likely not function in the same way due to these changes. When the job is submitted, the resource manager informs the user about these default values and provides guidance on requesting additional resources. Figure 3.1 shows this output from a submitted interactive job that does not specify its resource requirements.
[user-x@acf-login5 user-x]$ qsub -I -l nodes=1,walltime=1:00:00
Notice: Using default project ACF-UTK0011
Notice: Using default nodes=1:ppn=1
Notice: Using default partition ['beacon', 'rho', 'general']
Notice: Setting vmem=2048MB due to selected partitions and number of cores.
If you need more memory, please resubmit your job requesting more cores.
If you need more memory per core, please resubmit requesting node types with higher memory per core.
Notice: Using default node access policy: shared
Figure 3.1 – Submitting a Job with No Resource Specifications
In Figure 3.1, observe that the resource manager allocates a single core to the job. This value can be changed by users with the ppn (processors per node) option. It is specified on the same line as the -l nodes=<num-nodes> option. Figure 3.2 shows how the ppn option is used in the context of a job script. The usage is the same in an interactive job minus the #PBS directive prefix.
#PBS -l nodes=1:ppn=8
Figure 3.2 – Usage of the ppn Option in a Job Script
The value specified for the ppn option directly determines the amount of memory allocated to the job. The higher the ppn value is, the more memory the job receives. Users cannot manually specify their memory requirements in job scripts or with the qsub command. They must provide a higher ppn value to obtain more memory for the job. If they attempt to specify the amount of memory they desire, the resource manager will reject the job. In Figure 3.2, the job would receive eight cores’ worth of memory, which will vary with the partition and feature. Another important factor to consider with the ppn option is the partition and feature in use by the job. A job that runs on the beacon partition receives 16GB of memory per core, while the same job on the rho partition receives 2GB of memory per core. The general partition has additional caveats because three node sets belong to it. The formula that calculates the memory allocation a job receives is depicted in Figure 3.3. For reference purposes, Table 3.1 documents the amount of memory available per core on each node set in the cluster.
Total memory = (cores requested / total cores on the node) × total memory on the node
Figure 3.3 – Formula to Calculate the Amount of Memory Allocated to a Job
|Partition||Feature||Memory per Core (MB)|
Use Table 3.1 when determining which partitions and features to use with your job. Note that the default partition, which is used when no partition is specified, uses Rho’s per-core memory amount of 2048MB. The general partition uses sigma_bigcore’s per-core memory amount of 4682MB. If you wish to receive more memory, specify a partition and feature that provides a higher per-core memory amount. Additional information is available in the Partitions and Features section of this document. The hardware resources available to each node set are documented in the System Overview document.
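Because memory is granted per core, the allocation for a job can be estimated as the ppn value multiplied by the per-core amount of the partition in use. The sketch below uses the per-core figures quoted in this section (2048MB for rho and the default partition, 16GB taken as 16384MB for beacon, and 4682MB for general); the function name is hypothetical.

```shell
# Estimate the memory allocated to a job, in MB.
# Arguments: cores requested (ppn), memory per core in MB.
job_memory_mb() {
    echo $(( $1 * $2 ))
}

job_memory_mb 8 2048     # ppn=8 on rho or the default partition: 16384
job_memory_mb 8 16384    # ppn=8 on beacon: 131072
job_memory_mb 8 4682     # ppn=8 on general: 37456
```

Working backwards from a memory requirement is the more common use: divide the memory you need by the per-core amount and round up to obtain the ppn value to request.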
In scenarios where your job requires all the resources available to a node, the resource manager provides the option to make the job node-exclusive. Node-exclusivity allocates an entire node to the job so that it does not share resources with other jobs. Be aware that using this option may delay the execution of your job. Figure 3.4 shows the PBS directive to include in a job script to specify node-exclusivity. The option is the same for interactive jobs minus the #PBS directive prefix.
Figure 3.4 – Making a Job Node-Exclusive
For more information on writing job scripts that include these options, please review the Writing Job Scripts document. To learn how to monitor and troubleshoot jobs with problems related to node sharing, please refer to the Monitoring and Modifying Jobs document.
Non-interactive batch jobs are submitted to the scheduler using job scripts. These scripts contain PBS directives and shell commands. To learn more about job scripts, please visit the Writing Job Scripts document. Please be aware that if you set your job to be node-exclusive, it may take longer for your job to run depending on resource availability.
If you already have a job script ready, then follow the process outlined below to submit your job to the cluster.
Use qstat -a to verify that the job was submitted and is queued. A “Q” should appear under the “S” column of qstat. For more information on monitoring your jobs, please review the Monitoring and Modifying Jobs document.
If you need to pass arguments to your job, use the qsub -F option. Figure 4.3 shows the syntax for this option. For more information about command-line arguments, please review the Writing Job Scripts document.
qsub -F "arg_1 arg_2" /path/to/script
Figure 4.3 – Passing Arguments to a Job Script
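On the receiving side, a job script can read the values given to qsub -F like any other script arguments. The sketch below assumes Torque hands them to the script as ordinary positional parameters; the report_args function and the resource request are illustrative only.

```shell
#!/bin/bash
#PBS -A ACF-UTK0011
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:10:00

# Report how many arguments were received and list them one per line.
report_args() {
    echo "received $# argument(s)"
    for arg in "$@"; do
        echo "  $arg"
    done
}

# "$@" expands to the arguments supplied with qsub -F (none if -F was omitted).
report_args "$@"
```

Quoting "$@" keeps each argument intact even if it contains characters the shell would otherwise split on.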
Interactive jobs enable users to directly manipulate the cluster’s compute resources. Rather than drafting a job script to submit to the cluster, the user uses the qsub command with the appropriate options to submit the job. In general, the options for interactive jobs are the same options used with job scripts. The difference is that the options are not specified as PBS directives, but as options for the executable. Table 4.1 lists the pertinent options for interactive jobs. Figure 4.4 shows the syntax to use for a basic interactive job.
|Submits an interactive job to the scheduler.|
|Runs the interactive job under the project specified in <account>.|
|Instructs the scheduler to make the job node-exclusive. The job will consume the entire node. Interactive jobs should not require an entire node. If your job does, consider using a non-interactive job script.|
|Passes the specified variables to the interactive job. Provide the list of variables in a comma-delimited format.|
|Specifies the queue in which the job should be placed. At the time of this writing, only the |
|Defines the resources required by jobs. Refer to Table 3.2 in the Writing Job Scripts document for more information on the arguments this option accepts.|
qsub -I -l nodes=1,walltime=1:00:00
Figure 4.4 – Syntax for Submitting an Interactive Job
Please note that the first option in Figure 4.4 is an upper-case “i.” The second option is a lower-case “l.”
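To make the option order concrete, the sketch below assembles an interactive qsub command from a project account and a resource request and prints it rather than running it. The function name is hypothetical, and the values shown are examples only.

```shell
# Print (do not run) a qsub command for an interactive job.
# Arguments: project account, node count, cores per node, walltime.
interactive_cmd() {
    echo "qsub -I -A $1 -l nodes=$2:ppn=$3,walltime=$4"
}

interactive_cmd ACF-UTK0011 1 4 1:00:00
# prints: qsub -I -A ACF-UTK0011 -l nodes=1:ppn=4,walltime=1:00:00
```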
Once you submit an interactive job, the scheduler will queue the job and execute it when resources are available. Generally, a small, hour-long interactive job should begin within five minutes of submission; however, if the cluster is experiencing high resource utilization, it could take longer.
When you finish your work in the interactive job, issue the exit command to complete the job and return to the login node.
The mpirun command facilitates the execution of MPI programs. These programs execute in parallel across multiple nodes to enhance performance and resource utilization. When you use mpirun, you can specify the total number of ranks you want the program to use, in addition to the number of processes you wish to run on each node. By specifying the number of ranks and processes, you have greater control over the execution of your jobs on the Open Enclave.
Before you use mpirun in your job, please review the System Overview document to familiarize yourself with the core counts of each node. Understanding the number of cores at your disposal is critical to using mpirun correctly.
To specify the number of ranks for your MPI program, use the -n option of mpirun. For instance, if you execute mpirun -n 16 ./test_job on a single Beacon node, one rank will be placed on each core because one Beacon node has a total of sixteen cores between two processors.
mpirun is not limited to one rank per core, however; nodes can be oversubscribed. To oversubscribe a node is to specify more ranks than the node has cores. By default, additional ranks will not be placed until all the cores on each node are filled. To illustrate this process, consider a job that has requested four Rho nodes. Each Rho node has sixteen cores, so the job has 64 cores allocated to it. If this job executes mpirun -n 256 ./rho_job, the first 64 ranks will be placed one per core across all four nodes. After all 64 cores have received a rank, another 64 ranks will be placed, again one per core. This process will continue until every rank has been placed, which leaves four ranks on each core.
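The arithmetic behind oversubscription can be checked with a short calculation. The sketch below assumes ranks are spread evenly over every allocated core, as in the Rho example, and rounds up when the division is uneven; the function name is hypothetical.

```shell
# Print the largest number of ranks that end up on any one core.
# Arguments: total ranks, number of nodes, cores per node.
ranks_per_core() (
    cores=$(( $2 * $3 ))
    # Integer ceiling of ranks / cores.
    echo $(( ($1 + cores - 1) / cores ))
)

ranks_per_core 256 4 16   # four Rho nodes, 64 cores: 4
ranks_per_core 16 1 16    # one Beacon node, 16 cores: 1
```

A result greater than one means the request oversubscribes the allocation.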
If the number of ranks is smaller than the number of available cores on a node, the ranks are evenly spread across processors. As mentioned previously, one Beacon node has sixteen cores between two processors. If a job executes mpirun -n 8 on one of these nodes, four ranks will be placed on the first processor and four ranks will be placed on the second processor.
For greater control over rank placement, mpirun provides the -ppn option. ppn (processes per node) defines how many ranks should execute on each node. By default, ranks are placed based upon the number of cores each node contains. As an example, using mpirun -n 45 -ppn 15 ./ppn_job across three Beacon nodes would place sixteen ranks on each of the first two nodes and thirteen ranks on the last one. To override this behavior, pass the -f $PBS_NODEFILE option to mpirun so that it can apply the -ppn option properly. If you execute mpirun -n 45 -ppn 15 -f $PBS_NODEFILE ./ppn_job, it will place fifteen ranks on each of the three Beacon nodes.
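The two placements described above differ only in the per-node cap: sixteen (the core count) by default, or fifteen when -ppn is honored. The sketch below reproduces both distributions by filling nodes in order; it is a simplified model with a hypothetical function name, and it does not model oversubscription (ranks beyond the total capacity are simply not placed).

```shell
# Print how many ranks land on each node when nodes are filled in order.
# Arguments: total ranks, maximum ranks per node, number of nodes.
distribute_ranks() (
    remaining=$1
    node=1
    out=""
    while [ "$node" -le "$3" ]; do
        # Place up to the per-node cap, or whatever is left.
        if [ "$remaining" -gt "$2" ]; then placed=$2; else placed=$remaining; fi
        out="$out${out:+ }$placed"
        remaining=$(( remaining - placed ))
        node=$(( node + 1 ))
    done
    echo "$out"
)

distribute_ranks 45 16 3   # default placement on Beacon: 16 16 13
distribute_ranks 45 15 3   # with -ppn 15 honored:        15 15 15
```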
Before you attempt to run an MPI program, verify that you have loaded the appropriate compiler and MPI implementation with the module list command. By default, Intel’s MPI implementation is loaded into your environment. You can switch to other implementations with the module swap command. Please refer to the Modules document for more information on how to use the module commands. If you intend to use a Python MPI program, load the mpi4py module.
The ACF Rome node is a single AMD node with 32 cores. To use it, specify partition=amd and feature=rome during job submission.
Most ACF software should run on this node without modification. However, Intel compilers may produce code that is optimized specifically for Intel processors if certain compile options are used. These options include -xHost and related flags such as -xCORE-AVX2, -xAVX, and -xSSE4.2. Software built for Intel processors may give errors and need to be rebuilt to run on the AMD Rome node. In this case, please contact the OIT HelpDesk (https://oit.utk.edu/help/) and specify the software and the error encountered. Modules that are rebuilt for the AMD node will have “*_AMD” appended to their names.
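A job script that targets the Rome node might look like the sketch below. The partition and feature values come from this section; the project account, the ppn value, and the walltime are placeholders, and the cpu_vendor helper is a hypothetical addition that reads /proc/cpuinfo to confirm the job landed on an AMD processor before running Intel-sensitive software.

```shell
#!/bin/bash
# Replace the project account with one that has access to the node
#PBS -A ACF-UTK0011
#PBS -l partition=amd
#PBS -l feature=rome
#PBS -l nodes=1:ppn=4
#PBS -l walltime=01:00:00

# Print the CPU vendor string from a cpuinfo-style file (default: /proc/cpuinfo).
cpu_vendor() {
    awk -F': ' '/^vendor_id/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}

# Warn on stderr if the processor is not an AMD part.
if [ -r /proc/cpuinfo ] && [ "$(cpu_vendor)" != "AuthenticAMD" ]; then
    echo "warning: not running on an AMD processor" >&2
fi
```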
If your jobs require GPUs (graphics processing units), the process for job submission differs. For non-interactive jobs, you specify a partition and feature set that contains GPU nodes in your batch script. For interactive jobs, you specify these options with the qsub -I (upper-case “i”) command. You must also load the relevant modules that will use the GPUs, such as tensorflow-gpu. For more information on modules and the commands to manipulate them, please refer to the Working with Modules document.
If you intend to use the Beacon GPU nodes, use the ACF-UTK0011 or ACF-UTHSC0001 project for both interactive and non-interactive jobs. At the time of this writing, the Beacon GPU nodes are the only GPU nodes available to all Open Enclave users. Otherwise, specify a project to which you belong that provides access to GPU nodes. Please refer to the System Overview document for more information on which condos have GPUs.
Figure 5.1 shows a sample batch script that targets GPUs on the Open Enclave using the -A option. If necessary, replace the tensorflow-gpu module with the modulefile you require. For the other options, please refer to the Writing Job Scripts document for more information. Note that ./gpu_job refers to the code that will execute on the nodes allocated to your job. Verify that it is in the same directory as the batch script if you use the example as-is. Please note that the line numbers are for reference purposes and should not be included in your job script.
1 #PBS -S /bin/bash
2 #PBS -A ACF-UTK0011
3 #PBS -l partition=beacon
4 #PBS -l feature=beacon_gpu
5 #PBS -l nodes=1
6 #PBS -l walltime=24:00:00
7
8 module load tensorflow-gpu
9
10 cd $PBS_O_WORKDIR
11 mpirun -n 16 ./gpu_job
Figure 5.1 – Sample Batch Script to Target GPU Nodes
To target GPU nodes with an interactive job, follow the general process described in the Interactive Jobs section of this document. For the -A option, specify ACF-UTK0011 or ACF-UTHSC0001 if you intend to use Beacon GPU nodes. If not, specify a project to which you belong that provides GPU resources.
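For an interactive session, the same partition and feature values from the sample batch script can be given on the qsub command line. The sketch below prints such a command rather than running it; the function name is hypothetical, the one-hour walltime is an example, and passing each resource in its own -l option is an assumption that mirrors the separate directives in the batch script.

```shell
# Print (do not run) a qsub command for an interactive Beacon GPU session.
# Argument: project account (ACF-UTK0011 or ACF-UTHSC0001 for Beacon GPU nodes).
gpu_interactive_cmd() {
    echo "qsub -I -A $1 -l partition=beacon -l feature=beacon_gpu -l nodes=1,walltime=1:00:00"
}

gpu_interactive_cmd ACF-UTK0011
```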
After the interactive job starts, load the relevant modulefiles with the module load command so that you can utilize the allocated GPUs. Please refer to the Working with Modules document for more information. To query for the GPU and its available driver, execute the nvidia-smi command after the interactive job starts. Figure 5.2 shows the output of this command.
[...@acf-bk003 ~]$ nvidia-smi
Tue Dec 31 16:30:13 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20Xm         On   | 00000000:81:00.0 Off |                    0 |
| N/A   18C    P8    29W / 235W |      0MiB /  5700MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Figure 5.2 – Sample Output of the nvidia-smi Command
For short, small jobs, users can use backfill resources. These resources are otherwise idle and will enable jobs to execute quickly. To see which resources are currently available in the backfill, execute the showbf command. Figure 6.1 shows the output of showbf if resources are available.
[user-x@acf-login5 user-x]$ showbf
Partition   Tasks  Nodes      Duration   StartOffset       StartDate
---------  ------  -----  ------------  ------------  --------------
ALL            24      1       1:00:00      00:00:00  16:19:49_04/01
ALL            24      1      INFINITY      00:00:00  16:19:49_04/01
monster        24      1       1:00:00      00:00:00  16:19:49_04/01
monster        24      1      INFINITY      00:00:00  16:19:49_04/01
Figure 6.1 – Output of showbf
In the case of Figure 6.1, the user could write a job script that targeted the monster partition and expect to quickly land on the node because it is considered a backfill resource. Be aware that backfill resources may or may not be available depending on the cluster’s current resource utilization.
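When showbf lists many rows, a small filter can summarize which partitions currently have backfill capacity. The sketch below assumes the column layout shown in Figure 6.1 (partition name first, task count second); the function name is hypothetical, and the sample input is the figure's own data.

```shell
# Read showbf-style output on stdin and print each named partition with the
# largest task count reported for it. Header, separator, and ALL rows are skipped.
backfill_partitions() {
    awk '$2 ~ /^[0-9]+$/ && $1 != "ALL" && ($2 + 0) > max[$1] { max[$1] = $2 + 0 }
         END { for (p in max) print p, max[p] }'
}

backfill_partitions <<'EOF'
Partition   Tasks  Nodes      Duration   StartOffset       StartDate
---------  ------  -----  ------------  ------------  --------------
ALL            24      1       1:00:00      00:00:00  16:19:49_04/01
ALL            24      1      INFINITY      00:00:00  16:19:49_04/01
monster        24      1       1:00:00      00:00:00  16:19:49_04/01
monster        24      1      INFINITY      00:00:00  16:19:49_04/01
EOF
```

On the cluster, the same filter could be fed directly with showbf | backfill_partitions.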