In the context of the cluster, job scripts define non-interactive batch jobs that are submitted to the scheduler for execution. These scripts generally contain PBS directives that indicate the account under which the job should run, the number of nodes the job requires, and the amount of time allocated to the job. In addition, they contain the commands that should be executed on the node(s) allocated to the job. This document covers the basics of scripting and how to write job scripts for use on the cluster. If you already have a job script prepared for submission, review the Running Jobs document to learn how to submit it.
To understand how job scripts work, you should first understand how scripts work in general. Be aware that the information in this section describes scripts in a general way; it is not specific to the cluster. At its core, a script is simply a group of commands organized in a logical manner. An interpreter executes these commands one-by-one to perform some useful function. The interpreter dictates how the script is written because it is the program that parses, or analyzes, the script. At the time of this writing, there are five valid interpreters for job scripts on the cluster. Table 2.1 lists these interpreters. For most users, /bin/bash is the easiest and most familiar interpreter to use.
Be aware that scripts do not require special editors. They can be created and modified with standard UNIX/Linux text editors such as vi, nano, and emacs, as well as Notepad on Windows. Integrated scripting environment (ISE) programs are available, but they are not necessary for job scripts on the cluster.
Variables are another important concept in scripts. They hold a value that can be referenced multiple times and, if necessary, changed. When a variable is set within a script, it is said to be declared. Variable declarations vary from interpreter to interpreter, but their overall purpose remains the same. Environment variables also exist. These special variables define important information about your user session, such as your working directory, username, and so forth. Execute the env command to see these variables.
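As a quick illustration, the commands below print a few environment variables that are standard in most Linux sessions (the exact values will differ from user to user):

```shell
# Print a few standard environment variables; the :-unset fallback guards
# against a variable that happens to be undefined.
echo "User:              ${USER:-unset}"
echo "Home directory:    ${HOME:-unset}"
echo "Working directory: $(pwd)"

# The full env listing is long; show only the first few entries.
env | head -n 5
```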
Command-line arguments allow you to send information to the script. For instance, you could use these arguments to provide an input file for your code and an output file for the information processed by your code. Another case may require you to specify certain options and arguments, such as the type of operation to run or how many runs should be performed by the application.
Figure 2.1 shows a basic bash script that accepts two command-line arguments, also called positional parameters, and outputs them to the user’s terminal. The lines are numbered for reference purposes.
1 #!/bin/bash
2 # Test script for demonstration purposes
3
4 test_var=$1
5 test_var_2=$2
6
7 echo "$test_var, $test_var_2"
Figure 2.1 – Sample bash Script
Line one specifies the script’s interpreter. In this case, bash was chosen for the interpreter. That means the syntax of the script, or the way it is structured and written, will conform to what bash requires. Line two represents a comment. Comments are not interpreted. They exist to inform others about the purpose of the script and what it does. On lines four and five, two variables are declared: test_var and test_var_2. test_var holds the value of the first command-line argument, while test_var_2 holds the value of the second command-line argument. Finally, line seven prints the value of these variables to the terminal for review. You could save this script, make it executable, and then use the command in Figure 2.2 to output the words “Hello” and “world.”
./test_script.sh Hello world
Figure 2.2 – Executing the bash Script
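For completeness, the sequence below is a sketch that recreates the Figure 2.1 script in a file, makes it executable with chmod (a step Figure 2.2 assumes has already happened), and runs it:

```shell
# Write the Figure 2.1 script to a file; the quoted 'EOF' delimiter keeps
# $1 and $2 from being expanded while the file is written.
cat > test_script.sh <<'EOF'
#!/bin/bash
# Test script for demonstration purposes

test_var=$1
test_var_2=$2

echo "$test_var, $test_var_2"
EOF

chmod u+x test_script.sh                  # grant the owner execute permission
result=$(./test_script.sh Hello world)    # run it with two arguments
echo "$result"                            # prints "Hello, world"
```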
Job scripts on the cluster use PBS (Portable Batch System) directives to instruct the scheduler what to do with the submitted jobs. Each directive is preceded by “#PBS”, which informs the scheduler that these items dictate how to execute the job. PBS directives go at the top of the script before any other item. Table 3.1 lists the pertinent PBS directives that most scripts will require. Any combination of the PBS directives listed in Table 3.1 can be used in a job script, provided that the arguments given to the directives are valid. Further explanation of these directives follows the sample job scripts in Figures 5.1, 5.2, and 5.3.
|Directive|Purpose|
|---|---|
|#PBS -S <interpreter>|Defines the interpreter that will parse the job script. Table 2.1 lists the valid interpreters to use.|
|#PBS -A <account>|Specifies the account, or project, identifier to use for the job. It will generally appear as ACF-<institution><id-num>, such as ACF-UTK0011.|
|#PBS -M <email>|Provides the email address that will receive messages from the scheduler concerning the job’s execution.|
|#PBS -m <events>|Defines when to receive messages from the scheduler concerning the job’s execution: a stands for aborted, b stands for begun, and e stands for ended. All three arguments can be provided.|
|#PBS -o <file>|Specifies to which file output from the job should go.|
|#PBS -e <file>|Specifies to which file errors from the job should go.|
|#PBS -N <name>|Supplies a custom name for the job.|
|#PBS -n|Instructs the scheduler to treat this job as node-exclusive. The job will consume an entire node rather than sharing it with other jobs.|
|#PBS -l <resources>|Defines the resources required by the job. Refer to Table 3.2 for more information about the arguments this directive accepts.|
|#PBS -v <variables>|Passes variables from your environment into the job. Supply the variables in a comma-delimited format.|
|#PBS -j oe|Joins the output (stdout) and error (stderr) streams together.|
|#PBS -q <queue>|Specifies the queue in which the job should be placed. At the time of this writing, only the|
|#PBS -t <array-request>|Specifies the number of jobs to run in a job array. Job arrays execute the same job with multiple runs. The -t option accepts a single integer (e.g. 10), a range of integers (e.g. 5-10), or multiple integers (e.g. 5,10,20,50). Users can control how many tasks execute concurrently with the %<jobs-to-execute> flag. For example, to execute five jobs at once in a 20-job array, use #PBS -t 1-20%5.|
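To illustrate the -t directive, a hypothetical array job script might look like the sketch below. It assumes Torque's $PBS_ARRAYID variable, which holds the current task's index; the fallback value of 1 exists only so the sketch can run outside the scheduler.

```shell
#PBS -S /bin/bash
#PBS -A ACF-UTK0011
#PBS -l nodes=1
#PBS -l walltime=00:30:00
#PBS -t 1-20%5

# $PBS_ARRAYID is assumed to hold this task's index (1 through 20 here);
# the fallback lets the script run outside the scheduler.
TASK_ID=${PBS_ARRAYID:-1}

# Each array task processes its own (hypothetical) input file.
echo "Processing input_${TASK_ID}.dat"
```

Each of the twenty tasks runs this same script with a different index, with at most five tasks executing at once.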
The #PBS -l <resources> directive accepts many different arguments. At the time of this writing, the essential arguments are nodes=<num-nodes> and walltime=<hh:mm:ss>; the other arguments are optional. Table 3.2 provides the arguments that the #PBS -l directive supports.
Most of these arguments require an understanding of both the hardware on which the jobs will run and the software used by the jobs. The System Overview document provides information on the cluster’s hardware resources. Before you decide which hardware resources should be allocated to your job, please consult that document. To determine the requirements of the software used by your job, consult the relevant documentation for each package. It should indicate the type of hardware resources necessary to carry out the functions of the software, such as whether it requires GPUs (graphics processing units), how much memory it consumes, and the recommended number of processors. Please note that users cannot change the memory allocations given to their jobs. If you require more memory for your jobs, specify a higher ppn value.
|Argument|Purpose|
|---|---|
|nodes=<num-nodes>|Specifies the number of nodes the job will use. For multinode jobs, this option must be used with a valid ppn value. If a ppn value is not specified, the job will be rejected.|
|nodes=<num-nodes>:ppn=<cores-per-node>|Specifies the number of nodes the job will use and the number of processors to use on each node. By default, users receive one core for their jobs if they do not specify a ppn value.|
|advres=<reservation>|Specifies the reservation to use for the job. The reservation identifier follows the general format <reservation-name>.<reservation-ID>.|
|walltime=<hh:mm:ss>|Specifies how long the job will run. QoS values influence the maximum amount of time available to jobs. For more information, review the Running Jobs document.|
|partition=<partition>|Defines the node set on which the job should run. Separate multiple partitions with colons (:). For instance, to use the general and rho partitions, use partition=general:rho.|
|feature=<feature>|Defines a specific feature of a node set that jobs should use. Review the Running Jobs document for more information.|
|qos=<qos>|Specifies the Quality of Service (QoS) attribute that the job should use. Review the Running Jobs document for the list of QoS attributes. If you wish to run your job in a private condo, specify the condo QoS attribute.|
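Tying these arguments together, a resource request for a two-node job with sixteen cores per node and a four-hour walltime might look like the sketch below (the partition value is illustrative). The arguments can also be combined on one line with commas:

```shell
#PBS -l nodes=2:ppn=16
#PBS -l partition=general
#PBS -l walltime=04:00:00

# Equivalent single-line form of the nodes and walltime request:
#PBS -l nodes=2:ppn=16,walltime=04:00:00
```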
In addition to PBS directives, job scripts can also use several PBS variables. Table 4.1 provides a list of the most useful variables.
|Variable|Purpose|
|---|---|
|$PBS_JOBID|Job identifier assigned by the scheduler.|
|$PBS_JOBNAME|Job name specified by the user; only useful with the #PBS -N directive.|
|$PBS_NODEFILE|Name of the file that contains the list of nodes assigned to the job.|
|$PBS_O_WORKDIR|Directory from which the job was submitted with the qsub command. Jobs should always be submitted from within Lustre space.|
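A sketch of how these variables are commonly used inside a job script follows; the fallback values exist only so the snippet can run outside the scheduler, where the PBS variables are unset.

```shell
# Return to the directory the job was submitted from (falls back to the
# current directory when run outside the scheduler).
cd "${PBS_O_WORKDIR:-$PWD}"

# $PBS_NODEFILE lists one line per allocated core, so counting its lines
# gives the total process count available to the job.
NODEFILE=${PBS_NODEFILE:-/dev/null}
NPROCS=$(wc -l < "$NODEFILE")

echo "Job ${PBS_JOBID:-interactive} (${PBS_JOBNAME:-unnamed}) has $NPROCS slots"
```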
To see how the fundamentals of scripting, PBS directives, and PBS variables come together to create job scripts, consider the job scripts in Figures 5.1, 5.2, and 5.3. Explanations follow each example. Line numbers are used for reference purposes and should not be included in a job script you intend to submit. To learn how to submit job scripts to the cluster’s scheduler, please consult the Running Jobs document.
1 #PBS -S /bin/bash
2 #PBS -A ACF-UTK0011
3 #PBS -l nodes=1
4 #PBS -l walltime=1:00:00
5
6 module load python3
7 python3 ~/sample_python_code.py
Figure 5.1 – Simple Job Script
In this brief job script, three PBS directives are specified. The user supplies /bin/bash as the script’s interpreter and ACF-UTK0011 as the account under which the job will run. The job requests one node for one hour of walltime. Observe that the #PBS -l directive is split across two lines. Both the nodes and walltime arguments could be specified on the same line, but they are divided here for the sake of readability and organization. The job itself loads Python 3 and executes Python code. In this case, the output and error files from the job would appear in the directory from which the user submitted it, which should be Lustre scratch space.
1 #PBS -S /bin/bash
2 #PBS -A ACF-UTHSC0001
3 #PBS -m abe
4 #PBS -M email@example.com
5 #PBS -o $HOME/job_data/job-out.txt
6 #PBS -e $HOME/job_data/job-err.txt
7 #PBS -l nodes=1
8 #PBS -l partition=general:beacon
9 #PBS -l qos=long
10 #PBS -l walltime=72:00:00
11
12 cd $PBS_O_WORKDIR
13
14 module load r
15 Rscript $HOME/job_data/r_code.r
Figure 5.2 – Job Script with Additional PBS Options
More advanced PBS options are used in Figure 5.2. The -m and -M directives specify that the user wishes to be alerted by email when the job aborts, begins, or ends. The user specifies that both the output and error files from the job should be written to the job_data directory within the home directory, which is given with the environment variable $HOME. The -l directives request one node in the general or beacon partition using the long QoS attribute with a walltime of 72 hours, or three days. Please note that the long QoS attribute is only available to UTHSC users. For more information, please consult the Running Jobs document.
When the job begins, it changes directories to the directory from which the job was submitted, which should have been Lustre scratch space. It loads the R statistical language module and executes some R code. Again, both the output and error files from this job will appear in the job_data directory within the user’s home directory once the job completes.
1 #PBS -S /bin/bash
2 #PBS -A ACF-UTK9999
3 #PBS -N conda_job
4 #PBS -o $HOME/job_data/$PBS_JOBID-out.txt
5 #PBS -e $HOME/job_data/$PBS_JOBID-err.txt
6 #PBS -v SCRATCHDIR,LD_LIBRARY_PATH,CONDA_ENV_PATH
7 #PBS -l nodes=2:ppn=40
8 #PBS -n
9 #PBS -l partition=general
10 #PBS -l feature=skylake
11 #PBS -l qos=condo
12 #PBS -l walltime=240:00:00
13
14 cd $SCRATCHDIR
15 is_mpi=$1
16
17 module load anaconda3/4.4.0
18 conda activate $CONDA_ENV_PATH
19 if [ "$is_mpi" = "true" ]; then
20     mpirun -n 80 python3 $SCRATCHDIR/job_data/mpi_code.py
21 else
22     python3 $SCRATCHDIR/job_data/job_code.py
23 fi
Figure 5.3 – Job Script with Advanced PBS Options
In Figure 5.3, the job script is specific and complex. The script makes extensive use of environment variables, in addition to providing the precise resource requirements of the job. It requests all forty cores from two Skylake nodes within the user’s private condo for a total of eighty cores. The #PBS -n directive indicates that the job should have exclusive use of both nodes.
The job itself will have the SCRATCHDIR, LD_LIBRARY_PATH, and CONDA_ENV_PATH environment variables available. The first two environment variables are present for all users; the CONDA_ENV_PATH variable is user-defined.
The job executes from within an Anaconda environment stored in Lustre scratch space. It uses mpirun to execute 80 tasks of the Python code across the two Skylake nodes, which equals one task per core. Observe the is_mpi variable declared on line fifteen. This variable derives its value from the first command-line argument given to the job script. It expects a boolean (true / false) value. In this example, it determines whether the job executes the Python code with MPI or without parallelization. More information on passing arguments to job scripts can be found in the Running Jobs document.