Minisymposium
MS6C - Fostering Sustainable Workflows in High Energy Physics: Developing Common Interfaces at Leadership Facilities to Enable Cross-Site Portability
Description
Next-generation High Energy Physics (HEP) experiments, such as those at the High-Luminosity Large Hadron Collider (HL-LHC) and the Deep Underground Neutrino Experiment (DUNE), will require significantly more computational resources to analyze orders-of-magnitude larger volumes of data over the next decade. This means that the experiments may need to tap into the large-scale computational resources offered at diverse supercomputing sites traditionally designed for high performance computing (HPC) workloads, rather than the high-throughput computing (HTC) sites that HEP experiments are accustomed to. In addition to the challenges of adapting HEP workflows to run on HPC systems, issues related to authentication/authorization, access policies, and reproducibility also need to be addressed. This minisymposium will focus on the current status of and challenges in developing common interfaces at large-scale computing facilities to enable cross-site workflow portability. Issues such as establishing standardized protocols and tools for data management, workflow execution, and resource allocation will be discussed. We intend to use this minisymposium as a forum to foster conversations and collaborations between the high energy physics and computer science communities towards portable workflow execution across computing sites.
Presentations
High Energy Physics (HEP) experiments, like ATLAS and DUNE, generate massive amounts of data requiring advanced simulation and data processing workflows. The increasing computational needs have necessitated the exploration of running these workflows on leadership-class HPC facilities (Perlmutter, Polaris, Frontier). However, the transition presents significant portability challenges, because each facility has its own unique architecture and software stack. This talk explores strategies and tools for achieving cross-facility portability of large-scale HEP workflows, using DUNE and ATLAS simulation workflows as examples. Our main focus is on understanding how to coordinate the standards emerging from various facility-specific APIs (e.g., the Superfacility API, Globus Compute, and others). Additionally, we examine detailed portability issues arising from heterogeneous CPU/GPU architectures, diverse library dependencies, and varying shared file systems, to understand possible solutions for sustainable workflows. Our experiences have highlighted various portability challenges and revealed the need for closer collaboration between HEP application developers, HPC centers, and broader software communities. This study also underlines the crucial role of the Integrated Research Infrastructure (IRI), whose development can benefit greatly from the experience gained with HEP workflows.
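As an illustration of the facility-API layer discussed above, the following minimal Python sketch dispatches a placeholder task to a remote facility endpoint through the Globus Compute SDK. The endpoint UUID and the simulate() payload are hypothetical stand-ins, not part of the DUNE or ATLAS workflows themselves; the point is that, in principle, retargeting the run to another facility only requires swapping the endpoint identifier.

    # Minimal sketch (not the speakers' implementation): shipping a toy task to a
    # remote facility endpoint with the Globus Compute SDK.
    # The endpoint UUID is a hypothetical placeholder; simulate() stands in for a
    # real DUNE/ATLAS simulation payload.
    from globus_compute_sdk import Executor

    def simulate(n_events):
        # Imports live inside the function because it is serialized and
        # executed on the remote endpoint.
        import platform
        return f"simulated {n_events} events on {platform.node()}"

    FACILITY_ENDPOINT = "00000000-0000-0000-0000-000000000000"  # placeholder UUID

    with Executor(endpoint_id=FACILITY_ENDPOINT) as gce:
        future = gce.submit(simulate, 1000)  # returns a concurrent.futures-style future
        print(future.result())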
With exaflop systems already here, application communities are eager to leverage these large, heterogeneous, and complex systems. Tools that simplify the development, execution, and reuse of workflows are needed to support the reproducibility, portability, and ease of use of complex workflows. COMPSs is a task-based programming environment for distributed computing that supports the easy development of workflows. Thanks to features such as task requirements specification, fault tolerance, and task cancellation, workflows can adapt to heterogeneous infrastructures and change their behaviour at runtime. COMPSs also includes the capability to record details of an application's execution as metadata (workflow provenance). With workflow provenance, one is able to share not only the workflow application (i.e., the source code) but also the actual details of the workflow run (i.e., the datasets used as inputs, the outputs generated as results, and details on the environment of the run). Provenance is generated in COMPSs using a lightweight approach that does not introduce overhead to the workflow execution. This feature enables FAIR workflows to be shared in public repositories, supporting their reproducibility. In addition, the COMPSs Reproducibility Service, a tool that enables the automatic re-execution of previously shared experiments, will also be described.
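For readers unfamiliar with the task-based model, here is a minimal PyCOMPSs sketch with toy task bodies (not an application from the talk): functions decorated with @task become asynchronously scheduled tasks, and compss_wait_on synchronizes on their results.

    # Minimal PyCOMPSs sketch; the task bodies are toy placeholders.
    from pycompss.api.task import task
    from pycompss.api.api import compss_wait_on

    @task(returns=1)
    def generate(seed):
        # Stand-in for a data-generation step.
        return [seed * i for i in range(10)]

    @task(returns=1)
    def reduce_sum(block):
        # Stand-in for a per-block reduction.
        return sum(block)

    if __name__ == "__main__":
        partials = [reduce_sum(generate(s)) for s in range(4)]  # tasks execute asynchronously
        partials = compss_wait_on(partials)                     # synchronize on the results
        print(sum(partials))

Launching such a script with runcompss and, in recent COMPSs releases, its --provenance option is what records the workflow provenance (packaged as an RO-Crate) mentioned above; the exact invocation here is an assumption based on the public COMPSs documentation.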
The Integrated Research Infrastructure (IRI) program stands out for its innovative strategy to more effectively enable science across the United States Department of Energy (DOE) user facilities at scale. It is set to radically accelerate discovery and innovation within the DOE with a unique approach that empowers scientists to seamlessly and securely combine DOE research tools, infrastructure, and user facilities into their orchestrated workflows. A key component of this approach is the introduction of collaborative interfaces for users and orchestration tools, enabling workflows that run seamlessly across multiple user facilities. Today, despite the use of common toolchains and similar HPC providers, the landscape of novel HPC interfaces remains fragmented, with no common interface that functions across all user facilities. This is a major hurdle for scientists aiming to build resilient workflows with multiple facilities in mind, locking them into one specific computing and data environment and exposing them to risks such as major interruptions. Hence, the success of IRI hinges on the delivery of collaborative interfaces that scientists can use to build resilient and performant cross-facility workflows. This presentation will provide an overview of the DOE IRI program, with a focus on the activities of the interfaces subcommittee.
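As a concrete example of today's facility-specific interfaces, the sketch below queries the publicly documented NERSC Superfacility API for system status. The base URL and response fields are assumptions based on public documentation and may differ between API versions; other facilities expose different, incompatible interfaces, which is precisely the fragmentation described above.

    # Hedged sketch: one facility-specific interface (NERSC Superfacility API).
    # Base URL and response fields are assumptions from public documentation
    # and may differ between API versions; other facilities use other APIs.
    import requests

    BASE_URL = "https://api.nersc.gov/api/v1.2"  # assumed public base URL

    resp = requests.get(f"{BASE_URL}/status", timeout=30)
    resp.raise_for_status()
    for system in resp.json():
        print(system.get("name"), "->", system.get("status"))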
As High Energy Physics (HEP) enters an era of unprecedented dataset sizes, ensuring data analysis reproducibility and preservation is a growing concern in the community. HEP physicists often manually manage complex analysis workflows, including job submission and data management. This manual approach is both labor-intensive and prone to errors. Ultimately, it results in undocumented dependencies between different analysis steps, making analysis coordination, sharing, and reproducibility challenging. To address these issues, workflow management tools such as Snakemake, the Common Workflow Language, and Luigi Analysis Workflows have been adopted in the HEP community. A HEP-applicable workflow management system must be transparent, configurable, portable, and scalable to support the increasing use of HPC resources in data analysis. However, workflow tools are often perceived merely as documentation rather than as integral components of the analysis process, making adoption challenging for many physicists. In this talk, we will present our experience with workflow management tools in HEP analysis, highlighting the features of a workflow manager that are essential for HEP applications. We describe typical issues physicists face when developing their workflows, based on feedback from analysis reproducibility training for HEP professionals. We emphasize the risk of blindly re-executing inherited workflows and advocate for integrated testing within workflow design.
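To make the contrast with manually managed analyses concrete, here is a minimal Python sketch using Luigi (the foundation of Luigi Analysis Workflows). The file names, event model, and selection cut are toy placeholders, not material from the talk; the point is that the explicit requires()/output() structure turns hidden dependencies between analysis steps into documented, testable ones.

    # Minimal Luigi sketch of a two-step "skim then histogram" analysis chain.
    # All inputs and the selection logic are toy placeholders.
    import json
    import luigi

    class Skim(luigi.Task):
        def output(self):
            return luigi.LocalTarget("skimmed.json")

        def run(self):
            events = [{"pt": 10.0 * i} for i in range(100)]    # stand-in for real input data
            selected = [e for e in events if e["pt"] > 200.0]  # placeholder selection cut
            with self.output().open("w") as f:
                json.dump(selected, f)

    class Histogram(luigi.Task):
        def requires(self):
            return Skim()  # the dependency is explicit, not an undocumented convention

        def output(self):
            return luigi.LocalTarget("histogram.json")

        def run(self):
            with self.input().open() as f:
                selected = json.load(f)
            with self.output().open("w") as f:
                json.dump({"n_selected": len(selected)}, f)

    if __name__ == "__main__":
        luigi.build([Histogram()], local_scheduler=True)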