Minisymposium Presentation
Reimagining Performance and Reproducibility in the Post-Moore Era: Innovations in Checkpointing and Workflow Management
Description
In the post-Moore era, the quest for better performance and reproducibility is more critical than ever. As researchers and engineers in high-performance computing (HPC) and scientific computing, we must rethink key areas such as algorithms, hardware architecture, and software to drive progress. In this talk, we explore how performance engineering is evolving, focusing on checkpointing and the management of intermediate data in scientific workflows. We first discuss the shift from traditional low-frequency checkpointing to modern high-frequency approaches, which must retain complete checkpoint histories while using memory efficiently. By breaking data into chunks, hashing each chunk so that only modified data is stored, and organizing the hashes in Merkle-tree structures, we improve efficiency, scalability, and GPU utilization while addressing challenges such as sparse data updates and limited I/O bandwidth. We then examine the balance between performance and data persistence in workflows, where cloud infrastructures often sacrifice reproducibility for speed. To overcome this, we propose a persistent, scalable architecture that makes node-local data shareable across nodes. By rethinking checkpointing and cloud data architectures, we show how innovations in algorithms, hardware, and software can significantly advance both performance and reproducibility in the post-Moore era.
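The chunk-and-hash idea mentioned above can be illustrated with a minimal CPU-only sketch. It is not the speakers' implementation: the function names, the 4-byte chunk size, the choice of SHA-256, and the rule of duplicating the last node on odd tree levels are all assumptions made for illustration. The sketch hashes fixed-size chunks, stores only chunks whose hash changed since the previous checkpoint, and folds the chunk hashes into a Merkle root so that two checkpoints can be compared with a single hash.

```python
import hashlib

CHUNK_SIZE = 4  # bytes per chunk; deliberately tiny, for illustration only


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Hash each fixed-size chunk of the buffer (the Merkle-tree leaves)."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]


def merkle_root(leaves: list) -> bytes:
    """Fold leaf hashes pairwise into a single root hash."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    level = leaves
    while len(level) > 1:
        if len(level) % 2 == 1:
            level = level + [level[-1]]  # assumed rule: duplicate last node
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


def incremental_checkpoint(prev_hashes: list, data: bytes,
                           chunk_size: int = CHUNK_SIZE):
    """Return (new_hashes, changed); changed maps chunk index -> chunk bytes
    for every chunk that is new or differs from the previous checkpoint."""
    new_hashes = chunk_hashes(data, chunk_size)
    changed = {
        i: data[i * chunk_size:(i + 1) * chunk_size]
        for i, h in enumerate(new_hashes)
        if i >= len(prev_hashes) or prev_hashes[i] != h
    }
    return new_hashes, changed


# Two hypothetical application states that differ in one chunk (a sparse update).
state_v0 = b"AAAABBBBCCCCDDDD"
state_v1 = b"AAAABBBBXXXXDDDD"

h0, delta0 = incremental_checkpoint([], state_v0)   # first checkpoint: all chunks
h1, delta1 = incremental_checkpoint(h0, state_v1)   # second: only chunk 2

print(sorted(delta0))                        # [0, 1, 2, 3]
print(delta1)                                # {2: b'XXXX'}
print(merkle_root(h0) == merkle_root(h1))    # False
```

In this sketch the second checkpoint stores 4 bytes instead of 16, and a matching Merkle root would allow the per-chunk comparison to be skipped entirely. A GPU implementation would presumably also use the interior tree nodes to locate changed regions hierarchically, which this sketch does not attempt.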