Minisymposium Presentation
Portable Analysis Workflows for Data Reproducibility
Description
As High Energy Physics (HEP) enters an era of unprecedented dataset sizes, ensuring the reproducibility and preservation of data analyses is a growing concern in the community. HEP physicists often manage complex analysis workflows manually, including job submission and data management. This manual approach is both labor-intensive and error-prone. Ultimately, it leaves dependencies between analysis steps undocumented, making analysis coordination, sharing, and reproducibility challenging. To address these issues, workflow management tools such as Snakemake, Common Workflow Language, and Luigi Analysis Workflows have been adopted in the HEP community. A HEP-applicable workflow management system must be transparent, configurable, portable, and scalable to support the increasing use of HPC resources in data analysis. However, workflow tools are often perceived as mere documentation rather than as integral components of the analysis process, which makes adoption challenging for many physicists. In this talk, we present our experience with workflow management tools in HEP analysis, highlighting the workflow manager features essential for HEP applications. We describe typical issues physicists face when developing their workflows, based on feedback from analysis reproducibility training for HEP professionals. We emphasize the risk of blindly re-executing inherited workflows and advocate for integrated testing within workflow design.