Minisymposium
MS3F - Accelerating Sustainable Development through Coupled HPC Simulations and AI
Description
High-performance computing (HPC) has a long history of driving scientific discovery through advances in hardware and numerical algorithms, but the adoption of artificial intelligence (AI) and machine learning (ML) is transforming this landscape. By integrating traditional simulations with AI/ML training and inference tasks into complex workflows, computational scientists are unlocking new HPC applications, from AI-driven design and optimization to online model fine-tuning and learning of dynamical systems, and revolutionizing how we tackle many of the UN’s sustainable development goals. However, building and efficiently deploying large-scale coupled workflows on HPC systems still poses significant software and hardware challenges, including managing massive datasets on distributed systems, making efficient use of the interconnect and local memory to avoid I/O bottlenecks, and ensuring reproducibility and provenance. In this minisymposium, speakers from leading hardware vendors, HPC centers, and universities share the latest software innovations, new learning methodologies developed, and successful practices adopted to address the issues faced by coupled simulation and AI workflows on modern HPC systems. Through applications in fields such as drug discovery and climate modeling, the talks will discuss lessons learned and the remaining challenges in adopting large-scale coupled workflows for scientific discovery in the exascale era of supercomputing.
Presentations
When SmartSim was first released in 2020, the intention was to support workflows in which large numerical software needed an ML boost. The ML models involved were small, fit on a single GPU, and could be easily retrained and replaced at run-time. Nowadays, ML models are huge, span several accelerators, and require ad hoc deployment techniques to avoid holding back the execution of large workflows. SmartSim is evolving, with its implementation moving down the stack to harness all the resources offered by modern heterogeneous systems. In this talk, we will look at recent success stories and new developments, and at how SmartSim is morphing into a more open and collaborative project.
The use of AI in drug discovery workflows has accelerated the task of screening billions of molecules to identify top candidates for binding to particular proteins. Typically, these workflows are composed of distinct tasks run sequentially on HPC clusters to iteratively screen the list of compounds, identify top candidates, perform molecular dynamics simulations, and fine-tune the AI surrogate. The sequential nature of these offline workflows results in multiple job submissions with long queue times and heavy use of the parallel file system. In this talk, we present MLDocking, an automated drug discovery workflow that leverages Dragon, a novel distributed run-time specifically designed to manage dynamic processes, memory, and data on HPC systems. MLDocking automates the identification of top candidates by executing all workflow components concurrently and efficiently distributing tasks across the CPU and GPU resources available on current heterogeneous HPC systems. Moreover, it limits the use of the file system by performing all data sharing operations through an in-memory distributed dictionary that relies on local memory or fast RDMA transfers across the system’s interconnect. The talk will cover results obtained by scaling the workflow on the Aurora supercomputer and lessons learned in managing large datasets for in-situ workflows.
The promise of accelerating and advancing molecular simulation and materials design efforts with coupled HPC and deep learning (DL) workflows has motivated an explosion of approaches. In particular, leadership computing facilities have supported a diverse set of large-scale efforts in this area. But as models grow and advanced active learning workflows for training emerge in response to the need for more accurate and reliable model predictions, concerns arise about excessive energy consumption and the sustainability of HPC+AI simulation efforts for science. In this talk, I will describe experiences developing, deploying, and assessing the results of leadership-scale HPC+DL efforts in modeling for molecular and materials sciences, from biosciences to advanced materials and nuclear energy, using several national leadership supercomputing resources. I will describe successes, pain points, and lessons learned, along with tools being developed to help monitor robustness, correctness, and reproducibility, as well as power and energy metrics across software stack layers and parallel resources.
This study explores integrating reinforcement learning (RL) with idealised climate models to address key parameterisation challenges in climate science. Current climate models rely on complex mathematical parameterisations to represent sub-grid scale processes, which can introduce substantial uncertainties. RL offers capabilities to enhance these parameterisation schemes, including direct interaction, handling sparse or delayed feedback, continuous online learning, and long-term optimisation. We evaluate the performance of eight RL algorithms on two idealised environments: one for temperature bias correction, and another for radiative-convective equilibrium (RCE) imitating real-world computational constraints. Results show that different RL approaches excel in different climate scenarios, with exploration algorithms performing better in bias correction and exploitation algorithms proving more effective for RCE. These findings support the potential of RL-based parameterisation schemes to be integrated into global climate models, improving accuracy and efficiency in capturing complex climate dynamics. Overall, this work represents an important first step towards leveraging RL to enhance climate model accuracy, which is critical for improving climate understanding and predictions. The code is available at https://github.com/p3jitnath/climate-rl.