Minisymposium

MS3F - Accelerating Sustainable Development through Coupled HPC Simulations and AI

Tuesday, June 17, 2025

11:30 - 13:30 CEST

Room 5.2D02

Session Chair

Argonne National Laboratory

Description

High-performance computing (HPC) has a long history of driving scientific discovery through advances in hardware and numerical algorithms, but the adoption of artificial intelligence (AI) and machine learning (ML) is transforming this landscape. By integrating traditional simulations with AI/ML training and inference tasks into complex workflows, computational scientists are unlocking new HPC applications, from AI-driven design and optimization to online model fine-tuning and learning of dynamical systems, and revolutionizing how we tackle many of the UN’s sustainable development goals. However, building and efficiently deploying large-scale coupled workflows on HPC systems still poses significant software and hardware challenges, including managing massive datasets on distributed systems, making efficient use of the interconnect and local memory to avoid I/O bottlenecks, and ensuring reproducibility and provenance. In this minisymposium, speakers from leading hardware vendors, HPC centers, and universities share the latest software innovations, new learning methodologies developed, and successful practices adopted to address the issues faced by coupled simulation and AI workflows on modern HPC systems. Through applications in fields such as drug discovery and climate modeling, the talks will discuss lessons learned and the remaining challenges in adopting large-scale coupled workflows for scientific discovery in the exascale era of supercomputing.

Presentations

11:30 - 12:00 CEST

Five Years of SmartSim: The Fast Evolution of AI-Enhanced HPC Workflows

When SmartSim was first released in 2020, it was intended to support workflows in which large numerical software needed an ML boost. The ML models involved were small, fit on a single GPU, and could easily be retrained and replaced at run time. Today, ML models are huge, span several accelerators, and require ad hoc deployment techniques so that they do not hold back the execution of large workflows. SmartSim is evolving, with its implementation moving down the stack to harness all the resources offered by modern heterogeneous systems. In this talk, we will look at recent success stories, new developments, and how SmartSim is morphing into a more open and collaborative project.

Alessandro Rigazzi and Andrew Shao (HPE)

12:00 - 12:30 CEST

MLDocking: Accelerated Drug Discovery with Transformer-Based Surrogate Models and In-Memory Workflows on Heterogeneous HPC Systems

The use of AI in drug discovery workflows has accelerated the task of screening billions of molecules to identify top candidates for binding to particular proteins. Typically, these workflows are composed of distinct tasks run sequentially on HPC clusters to iteratively screen the list of compounds, identify top candidates, perform molecular dynamics simulations, and fine-tune the AI surrogate. The sequential nature of these offline workflows results in multiple job submissions with long queue times and heavy use of the parallel file system. In this talk, we present MLDocking, an automated drug discovery workflow that leverages Dragon, a novel distributed run-time specifically designed to manage dynamic processes, memory, and data on HPC systems. MLDocking automates the identification of top candidates by executing all workflow components concurrently, efficiently distributing tasks across the CPU and GPU resources available on current heterogeneous HPC systems. Moreover, it limits the use of the file system by performing all data sharing operations through an in-memory distributed dictionary that uses local memory or fast RDMA transfers across the system’s interconnect. The talk will cover results obtained scaling the workflow on the Aurora supercomputer and lessons learned in managing large datasets for in-situ workflows.

Riccardo Balin, Christine Simpson, Harikrishna Tummalapalli, Archit Vasan, and Venkat Vishwanath (Argonne National Laboratory) and Kent Lee, Yian Chen, Nick Hill, Colin Wahl, and Pete Mendygral (HPE)
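
The data-sharing pattern described in the abstract above, where concurrent workers exchange results through an in-memory dictionary rather than the file system, can be illustrated with a minimal single-node sketch. This is not the MLDocking code and does not use the Dragon API: `InMemoryStore`, `surrogate_score`, `screen`, `top_candidates`, and `run_workflow` are hypothetical names, the store is a thread-safe Python dict standing in for a distributed dictionary, and the scoring function is a toy placeholder for surrogate inference. Only the screening stage runs concurrently here; the actual workflow is described as running all components concurrently.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class InMemoryStore:
    """Hypothetical stand-in for a distributed in-memory dictionary."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def items(self):
        with self._lock:
            return list(self._data.items())


def surrogate_score(molecule: str) -> float:
    """Toy deterministic score; a placeholder for AI surrogate inference."""
    return (sum(ord(c) for c in molecule) % 100) / 100.0


def screen(store: InMemoryStore, molecules):
    """Score a chunk of molecules and share results through the store."""
    for molecule in molecules:
        store.put(molecule, surrogate_score(molecule))


def top_candidates(store: InMemoryStore, k: int):
    """Rank everything in the store by score (ties broken by name)."""
    ranked = sorted(store.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]


def run_workflow(molecules, n_workers=4, k=3):
    """Screen chunks concurrently, then pick the top-k without file I/O."""
    store = InMemoryStore()
    chunks = [molecules[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(screen, store, chunk) for chunk in chunks]
        for future in futures:
            future.result()  # re-raise any worker exception
    return top_candidates(store, k)


if __name__ == "__main__":
    print(run_workflow(["CCO", "CCN", "C1CCCCC1", "O=C=O", "N#N"]))
```

On a real system the store would be backed by memory spread across nodes, and the workers would be processes placed on CPUs and GPUs rather than threads in one interpreter.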

12:30 - 13:00 CEST

Sustainable, Trustworthy Coupled HPC+AI for Molecular Simulation and Materials Design: Energy Consumption, Correctness, and Efficient Training on Leadership Platforms

The promise of accelerating and advancing molecular simulation and materials design with coupled HPC and deep learning (DL) workflows has motivated an explosion of approaches. In particular, leadership computing facilities have supported a diverse set of large-scale efforts in this area. But as models grow and advanced active learning workflows for training arise in response to the need for more accurate and reliable model predictions, concerns emerge about excessive energy consumption and the sustainability of HPC+AI simulation efforts for science. In this talk, I will describe experiences developing, deploying, and assessing the results of leadership-scale HPC+DL modeling efforts for molecular and materials sciences, from biosciences to advanced materials and nuclear energy, using several different national leadership supercomputing resources. I will describe successes, pain points, and lessons learned, along with tools being developed to help monitor robustness, correctness, and reproducibility, as well as power and energy metrics, across software stack layers and parallel resources.

Ada Sedova (Oak Ridge National Laboratory)

13:00 - 13:30 CEST

RAIN: Reinforcement Algorithms for Improving Numerical Weather and Climate Models

This study explores integrating reinforcement learning (RL) with idealised climate models to address key parameterisation challenges in climate science. Current climate models rely on complex mathematical parameterisations to represent sub-grid scale processes, which can introduce substantial uncertainties. RL offers capabilities to enhance these parameterisation schemes, including direct interaction, handling sparse or delayed feedback, continuous online learning, and long-term optimisation. We evaluate the performance of eight RL algorithms on two idealised environments: one for temperature bias correction, another for radiative-convective equilibrium (RCE) imitating real-world computational constraints. Results show that different RL approaches excel in different climate scenarios, with exploration algorithms performing better in bias correction and exploitation algorithms proving more effective for RCE. These findings support the potential of RL-based parameterisation schemes to be integrated into global climate models, improving accuracy and efficiency in capturing complex climate dynamics. Overall, this work represents an important first step towards leveraging RL to enhance climate model accuracy, critical for improving climate understanding and predictions. Code is accessible at https://github.com/p3jitnath/climate-rl.

Pritthijit Nath, Henry Moss, and Emily Shuckburgh (University of Cambridge) and Mark Webb (Met Office)
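
To give a flavour of the bias-correction setting mentioned in the abstract above, here is a minimal sketch of an agent learning a temperature correction from reward alone. It is not taken from the authors' repository and is not one of the eight algorithms they evaluate: it is a plain epsilon-greedy bandit, far simpler than a full RL algorithm, and the environment (a model temperature equal to the observation plus a constant bias and Gaussian noise), the function name `train_bias_corrector`, and all parameter values are assumptions made for illustration.

```python
import random


def train_bias_corrector(
    bias=1.0,            # assumed constant model bias (arbitrary units)
    noise_std=0.1,       # assumed noise on the model temperature
    corrections=(-2.0, -1.0, 0.0, 1.0, 2.0),  # candidate actions
    steps=2000,
    epsilon=0.1,         # probability of exploring a random action
    seed=0,
):
    """Epsilon-greedy bandit that learns which correction cancels the bias."""
    rng = random.Random(seed)
    q = [0.0] * len(corrections)      # running mean reward per action
    counts = [0] * len(corrections)

    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(corrections))                  # explore
        else:
            action = max(range(len(corrections)), key=lambda i: q[i])  # exploit

        # Toy environment: error of the corrected temperature vs. observation.
        error = bias + rng.gauss(0.0, noise_std) + corrections[action]
        reward = -error ** 2

        # Incremental update of the sample-average reward estimate.
        counts[action] += 1
        q[action] += (reward - q[action]) / counts[action]

    best = max(range(len(corrections)), key=lambda i: q[i])
    return corrections[best], q


if __name__ == "__main__":
    best_correction, estimates = train_bias_corrector()
    print(best_correction, estimates)
```

With the default settings the agent should settle on the correction that offsets the assumed bias. The `epsilon` parameter controls the balance between exploration and exploitation that the abstract contrasts across environments.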
