Energy-Efficient Runtime in HPC Systems with Machine Learning

Energy consumption accounts for a significant fraction of the budget for running large High Performance Computing (HPC) systems, and this cost is likely to increase in the future. Adapting operational modes to the characteristics of the jobs submitted by users (e.g., by changing clock frequencies) has a strong effect on power draw and is therefore crucial for saving energy and money. Analysing the performance and sensor data collected during job executions can improve understanding of the system workload behaviour and offers opportunities to define optimal operating modes for HPC platforms.

In this project you will analyze performance and sensor data using machine learning techniques and, in collaboration with Intel, develop optimizers to be integrated into the GEOPM runtime framework (https://geopm.github.io/). Intel GEOPM is an open-source runtime solution that optimizes HPC jobs based on the characteristics of the application, cluster, partition and CPUs, as well as the dynamic characteristics of the hardware components in the job's environment. Optimization goals and algorithms can be selected via the extensible plugin infrastructure provided by the framework. The developed models will contribute to the efficient management of users’ jobs, potentially allowing for optimisation of resource allocation, reduction of energy consumption and containment of operational costs.
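
To make the trade-off concrete, the following is a minimal Python sketch of the kind of decision such an optimizer has to make: given measurements of one job at several clock frequencies, choose the frequency with the lowest energy whose runtime stays within a tolerated slowdown. It is an illustration only. The Sample type, the pick_frequency function, the max_slowdown parameter and all numbers are hypothetical; it is a simple selection rule rather than a learned model, and it does not use the GEOPM API.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One hypothetical measurement of a job run at a fixed clock frequency."""
    freq_ghz: float     # CPU clock frequency used for the run
    runtime_s: float    # measured job runtime in seconds
    avg_power_w: float  # average power reported by sensors, in watts


def energy_joules(sample: Sample) -> float:
    """Energy-to-solution: average power multiplied by runtime."""
    return sample.avg_power_w * sample.runtime_s


def pick_frequency(samples: list[Sample], max_slowdown: float = 0.10) -> Sample:
    """Return the lowest-energy sample whose runtime is within
    max_slowdown (as a fraction) of the fastest observed run."""
    fastest = min(s.runtime_s for s in samples)
    allowed = [s for s in samples if s.runtime_s <= fastest * (1 + max_slowdown)]
    return min(allowed, key=energy_joules)


# Made-up measurements, for illustration only.
samples = [
    Sample(freq_ghz=1.2, runtime_s=150.0, avg_power_w=180.0),
    Sample(freq_ghz=1.6, runtime_s=118.0, avg_power_w=210.0),
    Sample(freq_ghz=2.0, runtime_s=104.0, avg_power_w=250.0),
    Sample(freq_ghz=2.4, runtime_s=100.0, avg_power_w=300.0),
]

best = pick_frequency(samples)
print(f"chosen frequency: {best.freq_ghz} GHz, energy: {energy_joules(best):.0f} J")
```

With these made-up numbers, relaxing max_slowdown admits slower but more frugal runs, which is the energy versus performance trade-off the project aims to exploit. A model trained on real performance and sensor data would replace the measured table with predictions for jobs that have not yet been run at every frequency.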

Results: The results of this project are explained in detail in the final documentation and presentation.