Minisymposium Presentation
Datacenter Power Monitoring and Management Using MERIC SW Suite
Presenter
Ondrej Vysocky, Ph.D. (male) is a senior researcher in the Infrastructure Research Lab at IT4Innovations, where he leads the Energy-efficient HPC research group. He was an investigator in the Horizon 2020 READEX project, which addressed the energy efficiency of parallel applications through dynamic tuning. Since then, he has developed the MERIC energy-efficient software suite for datacenter power monitoring and management. With the MERIC tools, he is an investigator in several Horizon 2020 and Horizon Europe projects, including EuroHPC Centers of Excellence. He is also a member of ETP4HPC (where he leads the Energy efficiency & sustainability WG), the EE HPC WG, and the HPC PowerStack initiative, which bring together experts from around the world to address power consumption challenges in HPC.
Description
An HPC system can be optimized for energy efficiency at several levels, but the most dynamic control comes from power management of computing components at the job level. Complex parallel applications have hardware requirements that change during execution. Energy-efficient runtime systems give administrators and users a friendly way to manage hardware power knobs dynamically without requiring a deep understanding of the topic. The MERIC energy-efficient HPC software suite is a package of software tools for monitoring datacenters and optimizing their power and energy consumption. The package includes cluster monitoring, job energy budgeting, power management, and cluster power capping that applies a dynamic power limit adjusted according to the recent history of power consumption. The suite also includes MERIC, a flagship code of the EuroHPC Center of Excellence Performance Optimisation and Productivity (POP). MERIC is a job-level runtime system designed to provide a detailed analysis of application behavior, identify the optimal hardware settings with respect to energy consumption and runtime, and tune those settings dynamically while the application runs. Thanks to comprehensive coverage of execution time by regions of interest, tuning granularity at the level of tens of milliseconds, and a large set of controlled power knobs, it pushes the achievable energy savings to the limit.
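As a rough illustration of a dynamic power limit adjusted according to the recent history of power consumption, the following Python sketch derives a cap from a sliding window of power samples. It is not taken from the MERIC suite: the class name `DynamicPowerCap`, the parameters `budget_w`, `floor_w`, `window`, and `headroom`, and the mean-plus-headroom policy are all assumptions chosen for the example.

```python
from collections import deque


class DynamicPowerCap:
    """Hypothetical sliding-window power cap (illustrative, not MERIC code)."""

    def __init__(self, budget_w, floor_w, window=5, headroom=1.1):
        self.budget_w = budget_w          # hard upper bound for the cluster, in watts
        self.floor_w = floor_w            # lowest cap we are willing to apply, in watts
        self.headroom = headroom          # multiplier applied to the recent mean
        self.history = deque(maxlen=window)  # keeps only the most recent samples

    def update(self, sample_w):
        """Record one power sample and return the new cap in watts."""
        self.history.append(sample_w)
        recent_mean = sum(self.history) / len(self.history)
        proposed = recent_mean * self.headroom
        # Clamp the proposed limit between the floor and the overall budget.
        return max(self.floor_w, min(self.budget_w, proposed))


if __name__ == "__main__":
    cap = DynamicPowerCap(budget_w=1000.0, floor_w=200.0, window=3)
    for sample in (500.0, 700.0, 900.0, 1100.0, 1300.0):
        print(f"sample={sample:7.1f} W  cap={cap.update(sample):7.1f} W")
```

In this sketch the cap follows the recent mean with some headroom, so it tightens when consumption drops and relaxes when it rises, while never exceeding the overall budget. A production policy would likely also account for job schedules and hardware limits.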