Responsible node sharing is a capability of the condos within the Open Research resources of the Infrastructure for Scientific Applications and Advanced Computing (ISAAC). It allows a user to submit a batch job that the job scheduler can place on either the institutional or the private condos, and allows that job to share a single compute node with other jobs. Running several jobs from different users, and even from different workgroups, on the same compute node increases both the throughput of the facility and the efficiency with which its resources are used. Under responsible node sharing, the job scheduler manages all of the jobs on a compute node and admits only those jobs whose combined processor core and memory requests fit on that node. Because the facility is made up of node types with different numbers of processor cores and different amounts of memory, the scheduler must match each of these diverse job requests to a node that can provide the resources it needs. Responsible node sharing will become the default behavior for the institutional condos. Private condo owners will be asked whether their condos can participate in responsible node sharing to benefit users outside their workgroup; participation is not required.
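The core idea, admitting jobs onto a node only while their combined core and memory requests fit, can be illustrated with a small sketch. This is a simplified first-fit model under assumed numbers (a hypothetical node type with 48 cores and 192 GB), not the actual placement algorithm that SLURM uses on ISAAC; the class and function names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int
    mem_gb: float


@dataclass
class Node:
    cores: int
    mem_gb: float


def fits(node: Node, jobs: list[Job]) -> bool:
    """Return True if the combined core and memory requests of jobs fit on node."""
    total_cores = sum(j.cores for j in jobs)
    total_mem = sum(j.mem_gb for j in jobs)
    return total_cores <= node.cores and total_mem <= node.mem_gb


def pack_first_fit(node: Node, queue: list[Job]) -> list[Job]:
    """Greedily place queued jobs on one node, skipping any job that would not fit."""
    placed: list[Job] = []
    for job in queue:
        if fits(node, placed + [job]):
            placed.append(job)
    return placed


node = Node(cores=48, mem_gb=192)  # hypothetical node type
queue = [Job("a", 16, 64), Job("b", 32, 160), Job("c", 24, 96), Job("d", 8, 32)]
placed = pack_first_fit(node, queue)
print([j.name for j in placed])  # ['a', 'c', 'd']
```

Job "b" is skipped because, together with "a", it would exceed the node's memory, while "c" and "d" still fit alongside "a". Jobs from different users share the node only as long as both limits hold.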
The scheduling of the institutional condos will be changed on April 15, 2020 to implement responsible node sharing. Private condo owners will be asked whether responsible node sharing may be enabled on their condos; for each owner who agrees, the scheduler will then be modified to allow sharing of that condo's compute nodes, with a maximum time limit of three hours. Once sharing is approved, jobs run on a private condo by users outside the owner's project membership (workgroup) will be limited to three hours. As a result, a job from a workgroup member will have to wait at most three hours for outside jobs on the condo's nodes to finish. The three-hour limit applies only to private condos; it does not apply to the institutional condos.
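The private condo rule can be expressed as a simple admission check. This is a minimal sketch that models only the three-hour cap described above; the group names and function are hypothetical, and the real scheduler applies many other limits that are not shown here.

```python
from dataclasses import dataclass

# Three-hour cap, in minutes, for jobs from outside the owning workgroup
# on private condos that have opted in to responsible node sharing.
OUTSIDE_LIMIT_MIN = 3 * 60


@dataclass
class Request:
    user_group: str    # workgroup of the submitting user (hypothetical field)
    walltime_min: int  # requested wallclock time in minutes


def allowed_on_private_condo(req: Request, owner_group: str, sharing_enabled: bool) -> bool:
    """Return True if req may run on a private condo owned by owner_group.

    Only the three-hour cap is modeled; other scheduler limits are ignored.
    """
    if req.user_group == owner_group:
        return True  # workgroup members are not subject to the three-hour cap
    if not sharing_enabled:
        return False  # the owner has not opted in, so outsiders cannot run here
    return req.walltime_min <= OUTSIDE_LIMIT_MIN


print(allowed_on_private_condo(Request("lab-b", 180), "lab-a", True))  # True
print(allowed_on_private_condo(Request("lab-b", 240), "lab-a", True))  # False
```

Because no outside job can request more than three hours, a workgroup member's job never waits longer than that for outside jobs to vacate the condo's nodes.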
For responsible node sharing to be effective, users need to specify their job resources accurately. At the time of this writing, users must specify the number of nodes their job requires and the wallclock time for the job. Users should also specify the number of cores their job requires with the correct SLURM option; if this option is not specified, the scheduler will allocate a single core to the job. Memory is allocated to jobs based on the number of cores requested, so a job that needs more memory than its cores provide must request additional cores.
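Because memory follows the core count, the number of cores to request can be worked out from the job's memory needs. The sketch below assumes that memory is divided evenly per core (node memory divided by node cores) and uses a hypothetical 48-core, 192 GB node, which gives 4 GB per core; actual per-core amounts on ISAAC depend on the node type and scheduler configuration.

```python
import math


def cores_for_memory(mem_needed_gb: float, node_cores: int, node_mem_gb: float,
                     min_cores: int = 1) -> int:
    """Cores to request so that the per-core memory allocation covers mem_needed_gb.

    Assumes memory is handed out evenly: node_mem_gb / node_cores per core.
    min_cores is the number of cores the job needs for computation; it
    defaults to 1, matching the scheduler's single-core default.
    """
    mem_per_core = node_mem_gb / node_cores
    cores = max(min_cores, math.ceil(mem_needed_gb / mem_per_core))
    if cores > node_cores:
        raise ValueError("request exceeds one node; ask for more nodes or a larger node type")
    return cores


# A job that computes on 4 cores but needs 40 GB must ask for 10 cores
# on a node with 4 GB per core.
print(cores_for_memory(40, node_cores=48, node_mem_gb=192, min_cores=4))  # 10
```

Requesting the extra cores, rather than relying on the default single core, keeps the scheduler's memory accounting honest so that other jobs sharing the node are not starved.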
Users always have the option to request exclusive use of a compute node for their job. To set up this requirement for exclusive node allocation and access, add the SLURM “--exclusive” option to the batch job.
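As an illustration of how these requests appear in a batch script, the helper below writes a minimal SLURM header with `--nodes`, `--time`, a core count and, optionally, `--exclusive`. Using `--ntasks` for the core count is an assumption for this sketch; the option ISAAC expects may differ (for example `--ntasks-per-node` or `--cpus-per-task`), and site-required options such as account or partition are omitted.

```python
def sbatch_header(nodes: int, walltime: str, ntasks: int, exclusive: bool = False) -> str:
    """Build a minimal SLURM batch-script header.

    walltime uses SLURM's HH:MM:SS form. --ntasks stands in for the
    core-count option (an assumption); account/partition lines are omitted.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --time={walltime}",
        f"#SBATCH --ntasks={ntasks}",
    ]
    if exclusive:
        # Ask for the whole node so no other jobs share it.
        lines.append("#SBATCH --exclusive")
    return "\n".join(lines) + "\n"


print(sbatch_header(1, "02:00:00", 8, exclusive=True))
```

Leaving `exclusive` off produces a header that lets the scheduler place the job on a shared node, which is the default under responsible node sharing.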
Questions about responsible node sharing can be sent to the OIT Help Desk (see https://help.utk.edu) or submitted through the “Submit HPSC Service Request” link in the menu to the left on this page.