Rolling Outage of the ISAAC Next Gen Cluster
Scheduled Maintenance
The HPSC group at the Office of Innovative Technologies has scheduled a rolling outage of the ISAAC NG cluster from October 7th at 9:00 AM EDT to October 9th at 5:00 PM EDT. The purpose of this outage is to upgrade the operating system (OS) of the entire cluster infrastructure to RHEL 8.9. Because the system OS will be upgraded, software that depends on system libraries may be affected. As a precautionary measure, we have already upgraded one compute node to RHEL 8.9 so that users can test their software and applications on it. See the Perform Testing section below to learn how to target this particular compute node.
Note: During this period, at least one login node and one data transfer node (DTN) will always remain available, and no more than half of the nodes in the campus partition will be offline at any time. Users will therefore be able to run jobs of sufficiently short duration and perform post-processing work on the cluster at all times. We plan to upgrade login2/dtn2 on Monday, followed by login1/dtn1 on Tuesday. Users are advised to log out of the login nodes/DTNs the evening before their scheduled maintenance. All other partitions, including private condos, will be upgraded in their entirety on either Monday or Tuesday.
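If you are not sure which login node or DTN your current session is on (and therefore on which day it will be taken offline), a quick check from the shell is:
hostname
The output should indicate whether you are connected to login1, login2, dtn1, or dtn2.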
Impact on Jobs
- Jobs whose requested run time overlaps the maintenance window will remain queued and will start once the requested nodes are back online.
- Users are advised to adjust the wall clock time in their job scripts so that submitted jobs do not run into the maintenance window; see the example below this list for one way to check.
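One way to check whether a pending job would collide with the window is to look at the scheduler's reservations and at your jobs' estimated start times. The sketch below assumes the outage is implemented as a SLURM maintenance reservation, which is typical but not stated above:
# List current and upcoming reservations (the maintenance window should appear here)
scontrol show reservation
# Show estimated start times for your pending jobs
squeue --start -u $USER
If the estimated start time plus the requested wall time runs into October 7th-9th, shorten the --time request or resubmit after the outage.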
Perform Testing
As mentioned above, the OS upgrade may affect software and applications that depend on system libraries. We therefore created a reservation on one node, separate from the reservation covering the whole cluster, and upgraded that node to RHEL 8.9. Although we have tested a few of the software packages installed in the software tree, we encourage users to test the software and applications they use on this node with their own job scripts.
To target this node, add one additional SLURM directive to your job script:
#SBATCH --reservation=rhel_8_9_testing
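If you prefer not to edit the script, the same reservation can also be passed on the sbatch command line, or used for a short interactive test with salloc. In the sketch below, myjob.sh is a placeholder for your own script, and the account, partition, and QOS values are copied from the examples that follow:
sbatch --reservation=rhel_8_9_testing myjob.sh
salloc --reservation=rhel_8_9_testing --account=ACF-UTK0011 --partition=campus-gpu --qos=campus-gpu --nodes=1 --ntasks=1 --time=00:30:00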
Below is an example of a job script that targets the upgraded compute node for testing:
#!/bin/bash
#SBATCH --account=ACF-UTK0011
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=00-00:10:00
#SBATCH --partition=campus-gpu
#SBATCH --qos=campus-gpu
#SBATCH --output=out.%j
#SBATCH --error=err.%j
#SBATCH --reservation=rhel_8_9_testing
########## Load the module, if any ###########
module purge
module load R/4.1.2
############ command Execution ###############
Rscript example1.R
In the above example, you can change the module name from R/4.1.2 to the module you use for your research computations.
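To run the test, save the script and submit it from a login node; the file name below is only an example. The out.JOBID and err.JOBID files named in the directives will contain the results:
sbatch rhel89_R_test.sh   # submit the test job (example file name)
squeue -u $USER           # check whether the job is pending or running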
Below is an example of a job script launching MPI tasks on the compute node:
#!/bin/bash
#SBATCH --account=ACF-UTK0011
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=00-00:10:00
#SBATCH --partition=campus-gpu
#SBATCH --qos=campus-gpu
#SBATCH --output=out.%j
#SBATCH --error=err.%j
#SBATCH --reservation=rhel_8_9_testing
########## Load the module, if any ###########
module purge
module load software/version
############ command Execution ###############
srun <executable_name> inputfile
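Whichever script you use, a quick way to confirm that the job actually ran on the upgraded node is to print the OS release from within the job, for example by adding the following line after the module commands:
cat /etc/redhat-release   # should report Red Hat Enterprise Linux release 8.9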
If your computation involves GPUs, you also need to request GPUs in the script by adding a --gres directive, as in the example below:
#!/bin/bash
#SBATCH --account=ACF-UTK0011
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:1
#SBATCH --time=00-00:10:00
#SBATCH --partition=campus-gpu
#SBATCH --qos=campus-gpu
#SBATCH --output=out.%j
#SBATCH --error=err.%j
#SBATCH --reservation=rhel_8_9_testing
########## Load the module, if any ###########
module purge
module load software/version
############ command Execution ###############
<executable_name> inputfile
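For GPU jobs, it is also worth confirming that the GPU is visible after the OS upgrade. Assuming the NVIDIA driver stack is installed on the GPU nodes as usual, adding the following line before your executable will list the allocated device in the job output:
nvidia-smi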
To learn the meanings of the different SBATCH directives, please refer to the Running Jobs page under the ISAAC Next Gen drop-down menu.
