High Performance & Scientific Computing

Job Chaining (Multiple Job Submissions)

A Must Read

Before getting into scripting details, it is very important to understand that a little knowledge can be a dangerous thing. This section describes how to semi-automate one’s research. Full automation is not recommended, because it creates a divide between computational scientists and their data. For instance, if a simulation begins to show an energy drift and the researcher is unaware of this for a week, the study and the allocation could be wasted. Hence the term “semi-automated” is used. In addition, because of system maintenance, updates, and the sharing of the resource with other researchers, one should not chain simulations for longer than a week.

Important: Run your simulations without job chaining first.

If one submits hundreds of jobs to the system, it will cause delays and possible system failures. If this happens, your jobs could be killed, you will be contacted by us, and a job submission stop could be placed on your account. It is of the utmost importance to first run your simulations without job chaining, and then to create a simple, short test chain, before you bring this into your production cycle. If you have any questions or concerns, please do not hesitate to contact us at help@xsede.org.

Understanding Job Dependency

If one needs a longer wall time or would like to save time by semi-automating jobs on a supercomputer, an effective solution is to set up a job chaining script. This method is useful if your simulation is checkpointable and ready for production runs. First, some definitions are needed.

Chaining PBS jobs requires knowing how to create a proper PBS job script and how to use the dependency keywords. One can learn how to construct a PBS submission script by reading the Batch Scripts section of the Running Jobs page. Any PBS job that is dependent upon another PBS job is placed in a hold state. The dependency list is set up from the independent job’s ID number and the dependency keywords.

Dependency	Definition
after	Execute current job after listed jobs have begun
afterok	Execute current job after listed jobs have terminated without error
afternotok	Execute current job after listed jobs have terminated with an error
afterany	Execute current job after listed jobs have terminated for any reason
before	Listed jobs can be run after current job begins execution
beforeok	Listed jobs can be run after current job terminates without error
beforenotok	Listed jobs can be run after current job terminates with an error
beforeany	Listed jobs can be run after current job terminates for any reason

The most commonly used dependency keyword is afterok. A dependency is specified with the construct “-W depend=dependency expression”.

The dependency expression contains the dependency keyword with one or more job ID numbers (colon separated list). For example, see below:

qsub -W depend=afterok:1187721 my_script.pbs

Or included in your PBS script:

#PBS -W depend=before:1187723:1187724
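
As an illustration of how the dependency expression is put together, the following is a minimal sketch of a shell helper that joins a dependency keyword and one or more job IDs with colons. The function name build_depend is hypothetical and is not part of PBS; it only prints the string that would be passed to “qsub -W”.

```shell
#!/bin/bash
# build_depend is a hypothetical helper, not a PBS command.
# Usage: build_depend KEYWORD JOBID [JOBID...]
# Prints the argument for "qsub -W", e.g. depend=afterok:1187721
build_depend() {
    if [ "$#" -lt 2 ]; then
        echo "usage: build_depend KEYWORD JOBID [JOBID...]" >&2
        return 1
    fi
    local keyword=$1
    shift
    local IFS=:            # "$*" joins the remaining arguments with colons
    echo "depend=${keyword}:$*"
}
```

With this helper defined, a submission could be written as: qsub -W "$(build_depend afterok 1187721)" my_script.pbs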

How to create a Job Chain

In order to chain PBS jobs, one creates the necessary PBS submission script and a shell script like those shown below. Note: after creating the shell script, give it execute permission with “chmod u+x script.sh”.

It is highly suggested to first practice with a simple PBS job chain that performs a basic calculation, like the helloworld program.

It is highly recommended that one knows how their calculation performs and scales BEFORE semi-automating the process.

A ‘flat’ chain example

A ‘flat’ chain can be used if one would like to submit a sequential series of calculations that consists of various pre/post-production runs. For this, one needs the various PBS submission scripts that handle the separate calculations (named calc1.pbs and calc2.pbs below).

#!/bin/bash
one=$(qsub calc1.pbs)
echo $one
two=$(qsub -W depend=afterok:$one calc2.pbs)
echo $two

One can continue chaining jobs by continuing the numbering convention ($three, $four, …). To execute these scripts, and place them in the PBS queue, one issues the command “./script_name”.
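
Rather than adding a new variable for every job, the same flat chain can be expressed as a loop over a list of submission scripts. The following is a minimal sketch, assuming that qsub prints the new job’s ID on standard output (as the example above relies on); the function name submit_flat_chain is hypothetical.

```shell
#!/bin/bash
# submit_flat_chain is a hypothetical helper. It submits each PBS script
# given as an argument so that it starts only after the previous one has
# terminated without error (afterok).
submit_flat_chain() {
    local prev="" id script
    for script in "$@"; do
        if [ -z "$prev" ]; then
            # First job in the chain: no dependency.
            id=$(qsub "$script") || return 1
        else
            id=$(qsub -W depend=afterok:"$prev" "$script") || return 1
        fi
        echo "$id"         # report each job ID as it is submitted
        prev=$id
    done
}
```

It would be called as, for example: submit_flat_chain calc1.pbs calc2.pbs. The function stops at the first failed submission, so a typo in a script name does not leave later jobs chained to nothing.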

A ‘looped’ chain example

A ‘looped’ chain is useful for a single PBS job that can be re-submitted multiple times. One has to be careful here to ensure the integrity of the looped chain. For instance, suppose a researcher’s molecular dynamics simulation can produce a 4 nanosecond trajectory in 24 hours, and they would like a trajectory totaling 12 nanoseconds. The maximum wall time on Kraken is only 24 hours, so submitting this same job 3 times would do the trick. However, running right up to the 24-hour wall time limit poses the risk of the job being killed before it finishes. In addition, did the simulation write the restart file? If not, the chain could be broken. If the simulation was set up to write the restart file every 6 hours, but the last one was not written, then the next simulation would start from the 18-hour mark. This is why it is a good idea to run one’s calculation once or twice before semi-automating the process. Benchmarking to understand the simulation’s scaling behavior is also a good idea.
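
One way to guard against a missing restart file is to check for it at the end of the PBS script and exit with a non-zero status if it is absent, so that the job does not count as having terminated without error for afterok. The following is a minimal sketch; the function name check_restart and the file names in the usage comments are hypothetical and would need to match one’s own simulation.

```shell
#!/bin/bash
# check_restart is a hypothetical helper for the end of a PBS job script.
# It succeeds only if the restart file exists, is non-empty, and is newer
# than a marker file that was created when the job started.
check_restart() {
    local restart=$1 marker=$2
    [ -s "$restart" ] && [ "$restart" -nt "$marker" ]
}

# Intended use inside the PBS script (file names are assumptions):
#   touch job_start.marker
#   ... run the simulation ...
#   check_restart restart.rst job_start.marker || exit 1
```

Comparing against a marker file catches the case where an old restart file from the previous run is still present but the current run never wrote a new one.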

Below is the ‘looped’ job chain script:

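#!/bin/bash
# Submit the first job; it has no dependency.
one=$(qsub submit.pbs)
echo $one
# Submit jobs 2 through 4; each is held until the previous job ends without error.
for id in $(seq 2 4); do
 two=$(qsub -W depend=afterok:$one submit.pbs)
 one=$two
done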

In the above example, the script submits the PBS submission script “submit.pbs” four times. If one needs a different number of loops, change the 4 in the for-loop to whatever you require (remember, do not submit a lot of jobs; see the A Must Read section). Viewing these jobs in the queue will show the first submitted job’s state (S column) as ‘Q’ for Queued. The succeeding ones will have a job state of ‘H’ for Held, because each one is dependent on the job submitted before it.
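
The number of submissions can also be passed as an argument instead of being edited in the script. The following is a minimal sketch, again assuming that qsub prints the new job’s ID; the function name submit_looped_chain is hypothetical.

```shell
#!/bin/bash
# submit_looped_chain is a hypothetical helper.
# Usage: submit_looped_chain SCRIPT COUNT
# Submits SCRIPT a total of COUNT times, each job after the first being
# held until the previous one has terminated without error.
submit_looped_chain() {
    local script=$1 count=$2 prev id i
    case $count in
        ''|*[!0-9]*) echo "COUNT must be a positive integer" >&2; return 1 ;;
    esac
    [ "$count" -ge 1 ] || { echo "COUNT must be a positive integer" >&2; return 1; }
    prev=$(qsub "$script") || return 1
    echo "$prev"
    i=2
    while [ "$i" -le "$count" ]; do
        id=$(qsub -W depend=afterok:"$prev" "$script") || return 1
        echo "$id"
        prev=$id
        i=$((i + 1))
    done
}
```

For example, “submit_looped_chain submit.pbs 3” would cover the 12 nanosecond case described earlier. Keep the count small, for the reasons given in the A Must Read section.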

The University of Tennessee, Knoxville

Office of Innovative Technologies