Improving HPC System Performance by Predicting Job Resources via Supervised Machine Learning

Mohammed Tanash; Brandon Dunn; Daniel Andresen; William Hsu; Huichen Yang; Adedolapo Okanlawon

doi:10.1145/3332186.3333041

PEARC19 (2019). 2019 Jul:2019:69. doi: 10.1145/3332186.3333041. Epub 2019 Jul 28.

Authors

Mohammed Tanash¹, Brandon Dunn¹, Daniel Andresen¹, William Hsu¹, Huichen Yang¹, Adedolapo Okanlawon¹

Affiliation

¹ Kansas State University, Manhattan, Kansas.

Abstract

High-Performance Computing (HPC) systems are resources used for data capture, sharing, and analysis. Most of our HPC users come from disciplines other than Computer Science. HPC users, including computer scientists, find it difficult to decide how many resources their submitted jobs require and do not feel proficient enough to do so. Consequently, users are encouraged to over-estimate the resources for their submitted jobs so that the jobs will not be killed due to insufficient resources. This over-estimation wastes HPC resources and leads to inefficient cluster utilization. We created a supervised machine learning model and integrated it into the Slurm resource manager simulator to predict the amount of memory a job requires and the time required to run the computation. Our model uses several different machine learning algorithms. Our goal is to integrate and test the proposed supervised machine learning model on Slurm. We used over 10,000 tasks selected from our HPC log files to evaluate the performance and accuracy of our integrated model. The purpose of our work is to increase the performance of Slurm by predicting the memory and the time required for each job, thereby improving the utilization of the HPC system. Our results indicate that, for larger jobs, our model dramatically reduces computational turnaround time (from five days to ten hours for large jobs), substantially increases utilization of the HPC system, and decreases the average waiting time for submitted jobs.
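The abstract does not name the algorithms or input features used, so the following is only a minimal sketch of the general idea: learn from past jobs what a request of a given shape actually consumed, then suggest a tighter memory and time request. A simple nearest-neighbour average stands in for the authors' model here. The feature layout, the `HISTORY` records, the `predict_resources` function, and the 20% safety margin are all hypothetical and are not taken from the paper or from Slurm.

```python
from math import dist

# Hypothetical historical job records (invented for illustration, not log data):
# (requested_mem_gb, requested_hours, cores) -> (used_mem_gb, used_hours)
HISTORY = [
    ((64.0, 48.0, 16), (8.0, 5.0)),
    ((64.0, 48.0, 16), (10.0, 6.0)),
    ((32.0, 24.0, 8), (4.0, 2.0)),
    ((128.0, 120.0, 32), (40.0, 30.0)),
    ((128.0, 120.0, 32), (44.0, 34.0)),
]


def predict_resources(request, history=HISTORY, k=2, margin=1.2):
    """Suggest (mem_gb, hours) for a new request.

    Averages the actual usage of the k past jobs whose requests are closest
    (Euclidean distance) to the new one, pads the result by a safety margin,
    and never suggests more than the user originally asked for.
    """
    nearest = sorted(history, key=lambda rec: dist(rec[0], request))[:k]
    mem = sum(used[0] for _, used in nearest) / len(nearest)
    hours = sum(used[1] for _, used in nearest) / len(nearest)
    return (min(request[0], mem * margin), min(request[1], hours * margin))


if __name__ == "__main__":
    # A user asks for 64 GB and 48 hours on 16 cores.
    mem_gb, hours = predict_resources((64.0, 48.0, 16))
    print(f"suggested request: {mem_gb:.1f} GB, {hours:.1f} h")
```

In this toy setting the suggested request is far smaller than the original one, which is the effect the abstract attributes to prediction: smaller, more accurate requests leave more of the cluster free for other jobs. A real deployment would need a trained and validated model and a policy for jobs that exceed their predicted limits.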

Keywords: HPC; Performance; Scheduling; Slurm; Supervised Machine Learning; User Modeling.

Grants and funding

P20 GM113109/GM/NIGMS NIH HHS/United States