High Performance & Scientific Computing

Running Jobs [DEPRECATED]



IMPORTANT

The SLURM transition on ISAAC Legacy finished on June 30, 2022. This transition involved several changes to the partition structure on ISAAC Legacy. Most notably, the ‘campus’ partition has been split into ‘campus-rho’, ‘campus-sigma’, ‘campus-beacon’, and ‘campus-skylake’. Users must change ‘campus’ in their job scripts to one of these partition names.

Introduction

Usually, a new user or new work project starts by logging into one of the login nodes or by using the web-based Open OnDemand interface (see our separate web page on Open OnDemand). The login nodes should only be used for basic tasks such as file editing and management, code compilation, and job submission. This document explains how to create jobs to perform work and how to use the Torque and Moab services to submit, monitor, and alter jobs on the cluster. It also covers how to run an interactive job on a compute node, which is called an “interactive batch job.” The cluster is set up to maximize utilization of the shared resources. To learn about responsible node sharing and its implementation, please review the Responsible Node Sharing on ISAAC webpage.

1 Scheduling Policies

1.1 Job Priority

A job’s priority in the Torque scheduler influences how quickly it is assigned resources and begins execution on the cluster’s compute nodes. The general factors that affect priority are listed below.

  1. Jobs that request more nodes receive a higher initial priority.
  2. The longer a job waits in the queue, the higher its priority becomes over time.
  3. The number of jobs an individual user submits affects the priority of those jobs. Currently, only twenty jobs submitted by the same user are considered in each scheduling pass; all other jobs from that user are blocked from consideration. This limitation only applies to jobs submitted to the institutional condos.

1.2 Job Access Control

The cluster uses a variety of mechanisms to determine how to schedule jobs. Most of the resource request mechanisms can be manipulated by users to ensure that their jobs are scheduled and executed in a reasonable time period.

1.2.1 Condos

In the context of the cluster, a condo is a logical group of compute nodes that acts as an independent cluster for scheduling purposes. Both institutional and private condos exist on the cluster. All users belong to their respective institutional condo, whether that be UTK or UTHSC. Private condos are owned and used by individual investors and their associated projects. With a private condo, the investor and their projects have exclusive use of the compute nodes in the condo. Investors can choose to share their private condo with the entire cluster under the Responsible Node Sharing implementation, but this is not a requirement.

1.2.2 Project Accounts and Reservations

A project combines an account used in the batch system with a related Linux group used on the system for file access control. Projects and their associated batch system accounts are a key component of the cluster’s condo-based scheduling model. Projects control access to institutional and private condos. For UTK institutional users, the default project account is ACF-UTK0011. For UTHSC institutional users, the default project is ACF-UTHSC0001. Other projects follow the same project identifier format. To determine which projects you belong to, log in to the User Portal and look under the “Projects” header. Always ensure you use the correct project account when submitting jobs with a PBS directive. The following is an example of a PBS directive in a job script that specifies the UTK institutional project account:

#PBS -A ACF-UTK0011

Reservations are special allocations granted to a project. A reservation grants exclusive access to a set of nodes for a specific time period. For researchers who invest in equipment for their exclusive use on the system, these reservations are called private condos. To make use of a private condo, specify in the PBS directives the project account that is associated with the private condo and set the Quality of Service (qos) to condo. One can also specify the reservation ID in the PBS directives, but it should not be necessary to do so. An example PBS directive set for a private condo job submission is:

#PBS -A ACF-UTK0003
#PBS -l qos=condo

The above would still need PBS directives to specify the requested nodes, cores, and walltime; any optional partition or feature that relates to the private condo; and any job naming, emailing, or other optional PBS directives. A collection of complete sample PBS/Torque job scripts is also available at /lustre/haven/examples/Torque/jobs. A sketch of a complete private condo script is shown below.
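
For illustration, here is a minimal sketch of a complete private condo job script. The project account ACF-UTK0003, the node, core, and walltime values, the job name, the email address, and the executable name ./my_program are all placeholders; substitute the values that apply to your own condo and workload.

#PBS -S /bin/bash
#PBS -A ACF-UTK0003
#PBS -l qos=condo
#PBS -l nodes=1:ppn=8,walltime=2:00:00
# Optional directives: job name and email notifications (abort/begin/end)
#PBS -N condo_example
#PBS -m abe
#PBS -M your_email@utk.edu

cd $PBS_O_WORKDIR
./my_program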

Also, see the output of acfreservations for a list of the resources available to a project account.

$ acfreservations -u <userID> ACF-<project-account>

For example, to show what resources the ACF-UTK0003 account has access to, one would run the following command:

$ acfreservations -u user1234 ACF-UTK0003

1.2.3 Queues

The scheduler uses queues to organize jobs. At the time of this writing, the cluster uses the batch queue and the debug queue. The batch queue is the default queue to which all jobs are submitted. Users are not required to specify this queue when they submit jobs. The debug queue must be specified in the job script or on the command line. Figure 1.1 shows how to use this queue for job scripts, while Figure 1.2 shows how to use this queue for interactive jobs. In general, the debug queue is used for interactive work and short sessions of one hour or less for testing. All jobs submitted to the debug queue are limited to an hour of walltime.

#PBS -q debug

Figure 1.1 – Specifying the Debug Queue in a Job Script

qsub -I -q debug

Figure 1.2 – Specifying the Debug Queue for an Interactive Job

1.2.4 Partitions

Similar nodes are grouped into partitions. In addition to grouping like nodes, partitions enable users to specify the node set they wish to use. The current partitions are listed below. At the time of this writing, jobs default to using the general, beacon, and rho partitions. Be aware that because Rho is included in the default partition set, you will only receive 2GB of memory per core if you do not specify a partition or ppn value. Thus, ensure that your job uses the appropriate partition. To learn more about the nodes within each partition, review the System Overview document. To learn how to target specific partitions, please review the Writing Job Scripts document.

  • general
  • beacon
  • rho
  • monster
  • amd
  • bigmem

1.2.5 Features

Some partitions consist of multiple node sets. To target specific node sets within a partition, use a feature attribute. For instance, to target the Beacon GPU nodes, use the beacon_gpu feature. The available features are listed below, followed by a brief directive example. More information on how to use feature attributes is in the Writing Job Scripts document. Additionally, the Targeting GPU Nodes section of this document shows how the feature attribute is used in a job script.

  • sigma (general partition)
  • sigma_bigcore (general partition)
  • rho (general partition)
  • skylake (general partition)
  • beacon_gpu (beacon partition)
  • rome (amd partition)
  • cascadelake_bigmem (bigmem partition)
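
As a brief illustration, following the directive pattern used in Figure 5.1 later in this document, the following directives request the Beacon GPU node set; treat the partition and feature values as placeholders for whichever node set you actually need:

#PBS -l partition=beacon
#PBS -l feature=beacon_gpu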

1.2.6 QoS (Quality of Service) Attributes

QoS (Quality of Service) attributes define node allocations and wallclock limitations. At the time of this writing, opportunistic users are limited to 48 jobs, 24 nodes, and 24 hours of walltime for their jobs. Table 1.1 outlines the available QoS attributes.

QoS Attribute (-l qos={value})   Min. Allocation   Max. Allocation   Wall Clock Limit
condo                            1 Node            Condo Max.        28 Days
campus                           1 Node            24 Nodes          24 Hours
overflow                         1 Node            24 Nodes          24 Hours
long (UTHSC Projects Only)       1 Node            24 Nodes          6 Days
long-utk (UTK Projects Only)     1 Node            18 Nodes          3 Days

Table 1.1

Jobs that run in the institutional condos should use the campus QoS attribute; if it is not specified, the qsub submit filter adds it by default. UTHSC users can specify the long QoS attribute if their project has access to it (see the project in the portal for the valid QoS values for each project). With the long QoS attribute, UTHSC jobs can run for up to six days in the UTHSC institutional condo. The overflow QoS attribute allows a job to run either in the associated private condo or in the user’s institutional condo, whichever can be scheduled first; the job will not span both condos. The condo QoS attribute permits users with private condos on the cluster to run their jobs on the nodes reserved for those condos. If you wish to run a job in a private condo, use the condo QoS attribute in your job script for non-interactive jobs or with your qsub command for interactive jobs. If a project has the condo QoS listed first in the portal and no QoS is specified, the job submission wrapper sets the QoS to condo. A brief example follows.
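
For instance, a hypothetical sketch of the directives for a job that should spill over from a private condo into the institutional condo (the ACF-UTK0003 account is a placeholder for your own private condo project):

#PBS -A ACF-UTK0003
#PBS -l qos=overflow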

Condo users: click here to see examples of the different scenarios in which a job with a specific project account and QoS attribute can be submitted so that it lands on a private or institutional condo.

In general, it is best to specify the QoS attribute that applies to your situation rather than rely on the defaults. Please review the Writing Job Scripts document to learn how to use these QoS attributes in your jobs. Also, several examples are shown in the /lustre/haven/examples/Torque/jobs directory.

2 Job Resource Requirements

Responsible node sharing on the cluster has altered the process for job submission. Previously, a single-node job would consume an entire node’s resources. Now, multiple jobs can share the same node. This increases the cluster’s throughput and resource utilization, benefiting users and administration. More information is available in the Responsible Node Sharing document. Rather than explain what node sharing is, this section covers how it affects the resources allocated to jobs.

To facilitate node sharing, the resource manager applies default resource allocations to jobs that do not specify what they require. By default, a job that does not specify its resources will use the default partition set, which consists of the beacon, rho, and general partitions, and will receive a single core and 2GB of memory. Job scripts that previously worked on the cluster will likely not function in the same way due to these changes. When the job is submitted, the resource manager informs the user about these default values and provides guidance on requesting additional resources. Figure 2.1 shows this output from a submitted interactive job that does not specify its resource requirements.

[user-x@acf-login5 user-x]$ qsub -I -l nodes=1,walltime=1:00:00

Notice: Using default project ACF-UTK0011

Notice: Using default nodes=1:ppn=1 

Notice: Using default partition ['beacon', 'rho', 'general']

Notice: Setting vmem=2048MB due to selected partitions and number of cores.
	If you need more memory, please resubmit your job requesting more cores.
	If you need more memory per core, please resubmit requesting node types with higher memory per core.
Notice: Using default node access policy: shared

Figure 2.1 – Submitting a Job with No Resource Specifications

In Figure 2.1, observe that the resource manager allocates a single core to the job. This value can be changed by users with the ppn (processors per node) option. It is specified on the same line as the -l nodes=<num-nodes> option. Figure 2.2 shows how the ppn option is used in the context of a job script. The usage is the same in an interactive job minus the #PBS directive prefix.

#PBS -l nodes=1:ppn=8

Figure 2.2 – Usage of the ppn Option in a Job Script

It is important to note that the value specified for the ppn option directly influences the amount of memory allocated to the job: the higher the ppn value, the more memory the job receives. Users cannot manually specify their memory requirements in job scripts or with the qsub command; they must provide a higher ppn value to obtain more memory for the job. If they attempt to specify the amount of memory they desire, the resource manager will reject the job. In Figure 2.2, the job would receive eight cores’ worth of memory, which varies with the partition and feature. Another important factor to consider with the ppn option is the partition and feature in use by the job. A job that runs on the beacon partition receives 16GB of memory per core, while the same job on the rho partition receives 2GB of memory per core. The general partition has additional caveats because three node sets belong to it. The formula that calculates the memory allocation a job receives is depicted in Figure 2.3. For reference purposes, Table 2.1 documents the amount of memory available per core on each node set in the cluster.

Total memory = (cores requested / total cores on the node) × total memory on the node

Figure 2.3 – Formula to Calculate the Amount of Memory Allocated to a Job
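
As a worked example using the per-core values from Table 2.1, requesting eight cores on a sigma_bigcore node, which provides 4682MB per core, yields roughly 8 × 4682MB = 37456MB for the job. A hypothetical directive set for that request, mirroring the partition and feature directives shown later in Figure 5.1, might look like the following:

#PBS -l nodes=1:ppn=8
#PBS -l partition=general
#PBS -l feature=sigma_bigcore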

Partition       Feature          Memory per Core (MB)
beacon          beacon           16384
beacon          beacon_gpu       16384
rho             rho              2048
general         sigma            5462
general         sigma_bigcore    4682
general         skylake          4916
monster         monster          43691
skylake_volta   skylake_volta    4916

Table 2.1

Use Table 2.1 when determining which partitions and features to use with your job. Note that the default partition set, which is used when no partition is specified, uses Rho’s per-core memory amount of 2048MB. The general partition uses sigma_bigcore’s per-core memory amount of 4682MB. If you wish to receive more memory, specify a partition and feature that provides a higher per-core memory amount. Additional information is available in the Partitions and Features sections of this document. The hardware resources available to each node set are documented in the System Overview document.

In scenarios where your job requires all the resources available to a node, the resource manager provides the option to make the job node-exclusive. Node-exclusivity allocates an entire node to the job so that it does not share resources with other jobs. Be aware that using this option may delay the execution of your job. Figure 2.4 shows the PBS directive to include in a job script to specify node-exclusivity. The option is the same for interactive jobs minus the #PBS directive prefix.

#PBS -n

Figure 2.4 – Making a Job Node-Exclusive

For more information on writing job scripts that include these options, please review the Writing Job Scripts document. To learn how to monitor and troubleshoot jobs with problems related to node sharing, please refer to the Monitoring and Modifying Jobs document.

3 Submitting Jobs

3.1 Job Scripts

Non-interactive batch jobs are submitted to the scheduler using job scripts. These scripts contain PBS directives and shell commands. To learn more about job scripts, please visit the Writing Job Scripts document. Please be aware that if you set your job to be node-exclusive, it may take longer for your job to run depending on resource availability.

If you already have a job script ready, then follow the process outlined below to submit your job to the cluster.

  1. Change your directory to Lustre scratch space. Figure 3.1 shows the command that will place you in this space. All non-interactive batch jobs should be submitted from this space for the best performance.
  2. Use the qsub command to submit the job. Figure 3.2 shows the syntax for this command.
  3. If successful, a job identifier should appear to indicate that the job was submitted. Execute qstat -a to verify that the job was submitted and is queued. A “Q” should appear under the “S” column of qstat. For more information on monitoring your jobs, please review the Monitoring and Modifying Jobs document.

If you need to pass arguments to your job, use the qsub -F option. Figure 3.3 shows the syntax for this option. For more information about command-line arguments, please review the Writing Job Scripts document.

cd /lustre/haven/user/<your-user-ID>

Figure 3.1 – Change directory to lustre scratch space

qsub /path/to/job-submission-script

Figure 3.2 – The qsub command to submit the job

qsub -F "arg_1 arg_2" /path/to/job-submission-script

Figure 3.3 – Passing Arguments to a Job Script

3.2 Interactive Jobs

Interactive jobs enable users to directly manipulate the cluster’s compute resources. Rather than drafting a job script to submit to the cluster, the user uses the qsub command with the appropriate options to submit the job. In general, the options for interactive jobs are the same options used with job scripts. The difference is that the options are not specified as PBS directives, but as options for the executable. Table 3.1 lists the pertinent options for interactive jobs. Figure 3.4 shows the syntax to use for a basic interactive job.

qsub Option                           Description
qsub -I                               Submits an interactive job to the scheduler.
qsub -A <account>                     Runs the interactive job under the project specified in <account>.
qsub -n                               Instructs the scheduler to make the job node-exclusive. The job will
                                      consume the entire node. Interactive jobs should not require an entire
                                      node. If your job does, consider using a non-interactive job script.
qsub -v <comma-delimited-variables>   Passes the specified variables to the interactive job. Provide the
                                      list of variables in a comma-delimited format.
qsub -q <queue>                       Specifies the queue in which the job should be placed. At the time of
                                      this writing, only the batch and debug queues are available to users.
qsub -l <resources>                   Defines the resources required by jobs. Refer to Table 3.2 in the
                                      Writing Job Scripts document for more information on the arguments
                                      this option accepts.

Table 3.1

qsub -I -l nodes=1,walltime=1:00:00

Figure 3.4 – Syntax for Submitting an Interactive Job

Please note that the first option in Figure 3.4 is an upper-case “i.” The second option is a lower-case “l.”
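
As a further illustration, the options from Table 3.1 can be combined on one command line. The project account, queue, and resource values below are placeholders; adjust them to your own project and needs:

qsub -I -A ACF-UTK0011 -q debug -l nodes=1:ppn=4,walltime=1:00:00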

Once you submit an interactive job, the scheduler will queue the job and execute it when resources are available. Generally, a small, hour-long interactive job should begin within five minutes of submission; however, if the cluster is experiencing high resource utilization, it could take longer.

When you finish your work in the interactive job, issue the exit command to complete the job and return to the login node.

3.3 Parallel Jobs with mpirun

The mpirun command facilitates the execution of MPI programs. These programs execute in parallel across multiple nodes to enhance performance and resource utilization. When you use mpirun, you can specify the total number of ranks you want the program to use, in addition to the number of processes you wish to run on each node. By specifying the number of ranks and processes, you have greater control over the execution of your jobs on the Open Enclave.

Before you use mpirun in your job, please review the System Overview document to familiarize yourself with the core counts of each node. Understanding the number of cores at your disposal is critical to using mpirun correctly.

To specify the number of ranks for your MPI program, use the -n option of mpirun. For instance, if you execute mpirun -n 16 ./test_job on a single Beacon node, one rank will be placed on each core because one Beacon node has a total of sixteen cores between two processors.

mpirun is not limited to one rank per core, however; nodes can be oversubscribed. To oversubscribe a node is to specify more ranks than the node has cores. By default, additional ranks will not be placed until all the cores on each node are filled. To illustrate this process, consider a job that has requested four Rho nodes. Each Rho node has sixteen cores, so the job has 64 cores allocated to it. If this job executes mpirun -n 256 ./rho_job, the first 64 ranks are placed one per core across the four nodes. Once every core has received a rank, another round of 64 ranks is placed, one more per core. This process continues until all 256 ranks have been allocated.

If the number of ranks is fewer than the available cores on a node, the ranks are spread evenly across processors. As mentioned previously, one Beacon node has sixteen cores between two processors. If a job executes mpirun -n 8 on one of these nodes, four ranks will be placed on the first processor and four ranks will be placed on the second processor.

For greater control over rank placement, mpirun provides the -ppn option. ppn (processes per node) defines how many ranks should execute on each node. By default, ranks are placed based upon the number of cores each node contains. For example, using mpirun -n 45 -ppn 15 ./ppn_job across three Beacon nodes without any other options would place sixteen ranks on each of the first two nodes and thirteen ranks on the last one. To override this behavior, pass the -f $PBS_NODEFILE option to mpirun so that it can honor the -ppn option. If you execute mpirun -n 45 -ppn 15 -f $PBS_NODEFILE ./ppn_job, fifteen ranks will be placed on each of the three Beacon nodes. A sketch of this usage inside a job script appears below.
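
Here is a minimal sketch of that invocation inside a job script, assuming three Beacon nodes and an MPI executable named ./ppn_job (the node count, walltime, and executable name are placeholders):

#PBS -A ACF-UTK0011
#PBS -l nodes=3,walltime=1:00:00
#PBS -l partition=beacon

cd $PBS_O_WORKDIR
# 45 total ranks, 15 per node, honoring the node file provided by Torque
mpirun -n 45 -ppn 15 -f $PBS_NODEFILE ./ppn_job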

Before you attempt to run an MPI program, verify that you have loaded the appropriate compiler and MPI implementation with the module list command. By default, Intel’s MPI implementation is loaded into your environment. You can switch to other implementations with the module swap command. Please refer to the Modules document for more information on how to use the module commands. If you intend to use a Python MPI program, load the mpi4py module.

4 Using the AMD Rome node

To use the ACF Rome node, which is a single AMD node with 32 cores, partition=amd and feature=rome must be specified during job submission, as shown below.
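
A brief sketch of the relevant directives, following the same pattern as Figure 5.1 (the ppn and walltime values are placeholders):

#PBS -l nodes=1:ppn=8
#PBS -l partition=amd
#PBS -l feature=rome
#PBS -l walltime=1:00:00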

Most ACF software should run on this node without modification. However, Intel compilers may produce code optimized for Intel processors if specific compile options are used. These options include -xHost and related flags, for example -xCORE-AVX2, -xAVX, and -xSSE4.2. Software built for Intel processors may give errors and need to be rebuilt to run on the AMD Rome node. In this case, please contact the OIT HelpDesk (https://oit.utk.edu/help/), specifying the software and the error encountered. Modules that are rebuilt for the AMD node will have “_AMD” appended to their names.

5 Targeting GPU Nodes

If your job(s) require GPUs (graphics processing units), the process for job submission differs. For non-interactive jobs, you specify a partition and feature set that contains GPU nodes in your batch script. For interactive jobs, you specify these options on the command line with qsub -I (upper-case “i”). You must also load the relevant modules that will use the GPUs, such as tensorflow-gpu. For more information on modules and the commands to manipulate them, please refer to the Working with Modules document.

If you intend to use the Beacon GPU nodes, use the ACF-UTK0011 or ACF-UTHSC0001 project for both interactive and non-interactive jobs. At the time of this writing, the Beacon GPU nodes are the only GPU nodes available to all Open Enclave users. Otherwise, specify a project to which you belong that provides access to GPU nodes. Please refer to the System Overview document for more information on which condos have GPUs.

Figure 5.1 shows a sample batch script that targets GPUs on the Open Enclave using the -A option. If necessary, replace the tensorflow-gpu module with the modulefile you require. For the other options, please refer to the Writing Job Scripts document for more information. Note that ./gpu_job refers to the code that will execute on the nodes allocated to your job. Verify that it is in the same directory as the batch script if you use the example as-is. Please note that the line numbers are for reference purposes and should not be included in your job script.

1	#PBS -S /bin/bash
2	#PBS -A ACF-UTK0011
3	#PBS -l partition=beacon
4	#PBS -l feature=beacon_gpu
5	#PBS -l nodes=1
6	#PBS -l walltime=24:00:00
7
8	module load tensorflow-gpu
9
10	cd $PBS_O_WORKDIR
11	mpirun -n 16 ./gpu_job

Figure 5.1 – Sample Batch Script to Target GPU Nodes

To target GPU nodes with an interactive job, follow the general process described in the Interactive Jobs section of this document. For the -A option, specify ACF-UTK0011 or ACF-UTHSC0001 if you intend to use Beacon GPU nodes. If not, specify a project to which you belong that provides GPU resources.
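
For example, a hypothetical interactive submission targeting the Beacon GPU nodes might look like the following; the project account, walltime, and node request are placeholders to adjust for your own situation:

qsub -I -A ACF-UTK0011 -l nodes=1,walltime=1:00:00 -l partition=beacon -l feature=beacon_gpu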

After the interactive job starts, load the relevant modulefiles with the module load command so that you can utilize the allocated GPUs. Please refer to the Working with Modules document for more information. To query for the GPU and its available driver, execute the nvidia-smi command after the interactive job starts. Figure 5.2 shows the output of this command.

[...@acf-bk003 ~]$ nvidia-smi
Tue Dec 31 16:30:13 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20Xm         On   | 00000000:81:00.0 Off |                    0 |
| N/A   18C    P8    29W / 235W |      0MiB /  5700MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
 
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Figure 5.2 – Sample Output of the nvidia-smi Command

6 Targeting Backfill Resources

For short, small jobs, users can use backfill resources. These resources are otherwise idle and will enable jobs to quickly execute. To see which resources are currently available in the backfill, execute the showbf command. Figure 6.1 shows the output of showbf if resources are available.

[user-x@acf-login5 user-x]$ showbf
Partition      Tasks  Nodes      Duration   StartOffset       StartDate
---------     ------  -----  ------------  ------------  --------------
ALL               24      1       1:00:00      00:00:00  16:19:49_04/01
ALL               24      1      INFINITY      00:00:00  16:19:49_04/01
monster           24      1       1:00:00      00:00:00  16:19:49_04/01
monster           24      1      INFINITY      00:00:00  16:19:49_04/01

Figure 6.1 – Output of showbf

In the case of Figure 6.1, the user could write a job script that targets the monster partition and expect the job to land on that node quickly because it is considered a backfill resource; a sketch of such a request follows. Be aware that backfill resources may or may not be available depending on the cluster’s current resource utilization.
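
Based on the availability shown in Figure 6.1, a hypothetical set of directives that could take advantage of that window (the core count is a placeholder, and the walltime is kept within the advertised one-hour duration):

#PBS -l nodes=1:ppn=4,walltime=1:00:00
#PBS -l partition=monster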

Page last updated: June 25, 2021