AI Tennessee Initiative Resources and Access
The Office of Innovative Technologies (OIT) High Performance & Scientific Computing (HPSC) division is pleased to announce that high performance computing resources with current-generation NVIDIA GPUs are now available in the ISAAC Next Generation (ISAAC NG) cluster. These resources are provided to the University of Tennessee (UT) community by the University of Tennessee, Knoxville (UTK) Office of Research, Innovation, and Economic Development (ORIED) and the AI Tennessee Initiative. OIT HPSC worked with the Director of the AI Tennessee Initiative to co-host the May 17, 2023 Artificial Intelligence Research Computing Symposium, at which faculty from across UTK presented their AI research and discussed AI resource needs. Following the Symposium, OIT HPSC specified a collection of hardware that was reviewed by the Director of the AI Tennessee Initiative and leading UT AI researchers. An order for this hardware was placed with Dell Technologies in July 2023, with funding for the equipment provided by the AI Tennessee Initiative.
Resources
Our vendor partner, Dell Technologies, delivered the equipment by September 2023, and OIT Operations and OIT HPSC staff prepared it for operation. As of October 1, 2023, the equipment is in operation as part of the ISAAC NG cluster. It consists of seven Dell PowerEdge XE8640 rack-mounted, air-cooled servers, each of which can hold up to four NVIDIA H100 Tensor Core 700W SXM GPUs, fully interconnected with NVIDIA NVLink technology. The hardware is described in the table below.
Initially, all nodes will be part of the ISAAC Next Generation cluster, with Linux installed and access to all ISAAC NG cluster resources. One of the AI Tennessee Initiative GPU servers will be converted to use Windows Server 2022 as its operating system and will then no longer be an integrated part of the ISAAC Next Generation cluster. Staff will develop and test the Windows Server installation procedure on an alternate node until the procedure is complete. We will post the date on which one of the AI Tennessee Initiative servers will transition to Windows as soon as the procedure is complete and we have an estimated schedule.
Node Type | OS | CPU Type | GPU Type | Nodes | Cores/Node | GPUs/Node | RAM/Node | NVMe/Node | Interconnect |
---|---|---|---|---|---|---|---|---|---|
Dell PowerEdge XE8640 | Red Hat Enterprise 8.7 | Intel Platinum 8463Y+ | NVIDIA H100, fully NVLink connected | 7 | 64 | 4 | 1,024 GB | 51.2 TB | HDR (EDR Lustre) |
The current NVIDIA GPU driver version is 535.86.10, with CUDA 12.2 (as of Jan 27, 2024).
One XE8640 may be converted to the Windows operating system. If an AI Tennessee server is converted to Windows, a notification will be sent at least two weeks before the transition.
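Because driver versions change over time, it can be useful to confirm what a node actually reports from inside a job. The following is a minimal sketch, not an official HPSC tool: it assumes `nvidia-smi` is on the `PATH` of the GPU node, and the helper names `parse_gpu_report` and `query_gpus` are hypothetical. The expected driver string is the one quoted above.

```python
import shutil
import subprocess

# Driver version quoted in the text above (as of Jan 27, 2024).
EXPECTED_DRIVER = "535.86.10"


def parse_gpu_report(text):
    """Parse 'name, driver_version' CSV rows into (name, driver) tuples."""
    rows = []
    for line in text.splitlines():
        if "," not in line:
            continue  # skip blank or unexpected lines
        name, driver = (field.strip() for field in line.rsplit(",", 1))
        rows.append((name, driver))
    return rows


def query_gpus():
    """Ask nvidia-smi for GPU names and driver versions; [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version",
         "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_report(result.stdout)


for gpu_name, gpu_driver in query_gpus():
    relation = "matches" if gpu_driver == EXPECTED_DRIVER else "differs from"
    print(f"{gpu_name}: driver {gpu_driver} {relation} {EXPECTED_DRIVER}")
```

On a machine without NVIDIA tools the script prints nothing, so it is safe to run on a login node as well as inside a GPU job.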
Access
The AI Tennessee Initiative resources are accessed through the ISAAC Next Generation cluster. One of the servers may be converted to run as a standalone system under the Microsoft Windows Server 2022 operating system. See below for how to access each of the two types.
ISAAC NG Node Access
To gain access to the AI Tennessee Initiative resources in the ISAAC NG cluster, each faculty member, staff member, researcher, or student must have registered for an ISAAC account (see Requesting an Account), and an AI Tennessee Initiative research project must be requested and created on the ISAAC NG system. All AI Tennessee Initiative projects must be led by a UT faculty member, who is identified as the project Principal Investigator (PI). The information needed to create the project is the PI information, the project member information, and a short description (50 characters or less) of the AI research to be performed. This information allows OIT HPSC to provide AI Tennessee Initiative resource usage reports to the AI Tennessee Initiative Director. To request an AI Tennessee Initiative research project, place an HPSC service request (Submit HPSC Service Request) with the Principal Investigator’s (faculty member’s) name and a short description (<26 characters) of the AI research project. The PI and all project members must have registered for an ISAAC account.
Once an AI Tennessee Initiative project is created, it will be enabled in SLURM to use the AI Tennessee Initiative resources on ISAAC NG. The SLURM partition for all the AI Tennessee Initiative resources is named ai-tenn, and the quality of service (qos) used to access them is also named ai-tenn. At least one GPU must be requested in the SLURM batch job or Open OnDemand request to gain access to resources in the ai-tenn partition. An example AI Tennessee Initiative batch job will be created in the /lustre/isaac/examples/jobs directory, and an example Jupyter Notebook specification for an Open OnDemand job that uses the AI Tennessee Initiative resources will be available on the Open OnDemand documentation web page.
For more information on running SLURM batch jobs on ISAAC NG, see the ISAAC NG Running Jobs webpage. For more information on using Open OnDemand on ISAAC NG, see the ISAAC NG Open OnDemand webpage.
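Until the official example job is published, the requirements above can be illustrated with a small script generator. This is a minimal sketch under stated assumptions, not the HPSC-provided example: the function name `render_ai_tenn_job`, the placeholder account `YOUR-AI-TENN-PROJECT`, and the CPU, time, and command defaults are all hypothetical. Only the `ai-tenn` partition and qos names, the one-GPU minimum, and the four GPUs per node come from this page.

```python
def render_ai_tenn_job(account, gpus=1, cpus=8, time_limit="01:00:00",
                       command="nvidia-smi"):
    """Return the text of a single-node SLURM batch script for ai-tenn.

    `account` is the AI Tennessee Initiative project name (placeholder
    here); the cpus, time_limit, and command defaults are illustrative.
    """
    # The ai-tenn partition requires at least one GPU per job.
    if gpus < 1:
        raise ValueError("ai-tenn jobs must request at least one GPU")
    # Each XE8640 node holds four H100 GPUs, so a one-node job caps at 4.
    if gpus > 4:
        raise ValueError("a single ai-tenn node has only four GPUs")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --account={account}",
        "#SBATCH --partition=ai-tenn",
        "#SBATCH --qos=ai-tenn",
        "#SBATCH --nodes=1",
        "#SBATCH --ntasks=1",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --gpus-per-node={gpus}",
        f"#SBATCH --time={time_limit}",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


# Print a script that could be saved to a file and submitted with sbatch.
print(render_ai_tenn_job("YOUR-AI-TENN-PROJECT", gpus=1))
```

Redirecting the output to a file (for example `job.sh`) and running `sbatch job.sh` would submit it. Replace the placeholder account with the project name assigned when the AI Tennessee Initiative project was created.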
Windows Server Node Access
We will post the procedure for requesting access to the AI Tennessee Initiative server running Windows as soon as that node is configured with Windows and the procedure is specified.