Minisymposium

MS6A - Improving Energy Efficiency of HPC Systems through SW

Wednesday, June 18, 2025, 14:00 - 16:00 CEST
Room 5.0A52

Description

Energy and power challenges increase as High-Performance Computing and AI scale to meet rapidly growing industry and research demands. These challenges include higher CO2 emissions, increased energy costs, and strain on the power infrastructure. HPC centers are looking to reduce energy consumption and enhance energy efficiency by optimizing resource utilization and managing their workloads more efficiently. Efforts to improve energy efficiency often focus on hardware advancements, such as microarchitectures, intra-core parallelism, vectorization, and accelerators for critical workloads. These innovations have reduced idle power and improved execution but have also introduced challenges such as rapid power variations. Data center infrastructure, rack design, and cooling techniques have also progressed. Liquid cooling, especially direct hot-water cooling, has gained traction for its cost-saving potential. Although such hardware improvements are impressive, they cannot fully address energy challenges due to their limited adaptability to workloads. Complementary software solutions provide a global view of system status and energy usage, support dynamic adaptation across the stack, enable long-term predictions of resource use, and deliver actionable insights on workload optimizations to users. Research on power-steering runtimes and monitoring tools has contributed to user-facing analytics tools. The rapid progress of AI techniques opens additional opportunities for energy efficiency and optimization in HPC systems.

Presentations

14:00 - 14:30 CEST
Monitoring and Analysis of Energy Consumption in HPC Systems

Energy efficiency is a critical challenge in modern data centers facing an ever-growing scale and complexity. This talk presents energy monitoring and analysis strategies employed in the data center of TU Dresden. The focus is on how energy consumption and other metrics are measured and analyzed across different levels, from the building infrastructure through HPC clusters and racks down to individual nodes. We will provide insights into practical challenges and solutions for monitoring HPC systems and offer a perspective on how such tools and techniques contribute to improving energy efficiency and sustainability in large-scale computing environments. We outline the methodology behind capturing comprehensive measurement data to better understand consumption patterns and enable system-level optimizations. This process includes integrating sensor data and metrics from various sources within the data center to provide a complete view of energy usage. A key component of our approach is MetricQ, a highly scalable, distributed metric data processing framework developed in-house. MetricQ supports scalable, high-resolution data collection and real-time visualization, allowing us to analyze trends and identify inefficiencies accurately. Its responsiveness facilitates iterative exploration of many long-running data sets.

Mario Bielert (ZIH, CIDS, TU Dresden)

14:30 - 15:00 CEST
From Operational Data Monitoring to Operational Data Analytics Chatbots

With generative artificial intelligence challenging the computational demand supremacy of scientific computing, data centers are experiencing unprecedented growth in both scale and volume. Computing efficiency has never been more critical to humankind, the economy, and society. Operational Data Analytics (ODA) collects and stores data center telemetry in time-series databases for real-time visualization and post-mortem analysis. In this talk, I will introduce EXASAGE, the first ODA co-pilot to leverage a Knowledge Graph (KG)-based approach, addressing limitations of LLMs and simplifying data retrieval tasks in data center facilities through a prototype implementation of a conversational LLM agent.

Andrea Bartolini (University of Bologna)

15:00 - 15:30 CEST
Datacenter Power Monitoring and Management Using MERIC SW Suite

An HPC system can be optimized for energy efficiency at several levels, with the highest level of dynamicity coming from power management of computing components controlled at the job level. Complex parallel applications show different hardware requirements during their execution. Energy-efficient runtime systems provide administrator- and user-friendly ways to manage hardware power knobs dynamically without requiring a deep understanding of the topic. The MERIC energy-efficient HPC software suite is a package of software tools for monitoring datacenters and optimizing their power and energy consumption. The package includes cluster monitoring, job energy budgeting, power management, and cluster power capping that applies a dynamic power limit adjusted according to the recent history of power consumption. The suite also includes MERIC, a flagship code of the EuroHPC Center of Excellence Performance Optimisation and Productivity (POP). MERIC is a job-level runtime system designed to provide a detailed analysis of application behavior, identify the optimal hardware settings with respect to energy consumption and runtime, and provide dynamic tuning during the application runtime. Thanks to comprehensive coverage of execution time by regions of interest, tuning granularity at the level of tens of milliseconds, and a large set of controlled power knobs, it pushes the achievable energy savings to the limit.

Ondrej Vysocky (IT4Innovations National Supercomputing Center)

15:30 - 16:00 CEST
SMART Energy Efficiency with EAR Software

Current Data Centres require sophisticated software tools for monitoring, management, optimization and data analytics. Several projects are currently addressing the topic of designing the required software stack to provide all these features. Having all of them in a single tool is very complex because it involves topics and expertise from many different areas: from deep knowledge of architectural details to AI models for predictive maintenance, from parallel applications to job scheduling, etc. However, some projects have succeeded in providing part of these services. This presentation introduces how the EAR software provides the core features and support on the path toward a full ODA environment. We will present our conceptual model and services, how they fit the required features, and how they can be used to build the remaining ones on top of them.

Julita Corbalan (Barcelona Supercomputing Center, Energy Aware Solutions); Marco D'Amico (Energy Aware Solutions); and Oriol Visal (Barcelona Supercomputing Center, Energy Aware Solutions)