Paper
Scalable Genomic Context Analysis with GCsnap2 on HPC Clusters

Presenter
Since November 2024, I have been a research assistant in the High Performance Computing (HPC) group and a Ph.D. student in the PhD Program Data Science at the University of Basel. I work on improving scheduling in HPC systems using machine learning, and I am also responsible for the μ-Cluster. In 2024, I received my M.Sc. in Computer Science, with a major in Machine Intelligence, from the University of Basel. My master's thesis focused on improving the performance of the genomic context analysis tool GCsnap. I received my Bachelor's degree from the University of Basel in 2022; my bachelor's thesis was on benchmarking DAPHNE, an integrated data analysis pipeline for large-scale data management, HPC, and machine learning. Before turning to computer science, I earned an M.Sc. in Business and Economics and worked as a regional economic forecaster.
Description
GCsnap2 Cluster is a scalable, high-performance tool for genomic context analysis, developed to overcome the limitations of its predecessor, GCsnap1 Desktop. Leveraging distributed computing with mpi4py.futures, GCsnap2 Cluster achieved a 22× improvement in execution time and can now perform genomic context analysis for hundreds of thousands of input sequences on HPC clusters. Its modular architecture enables the creation of task-specific workflows and flexible deployment in various computational environments, making it well suited for bioinformatics studies of large-scale datasets. This work highlights the potential for applying similar approaches to solve scalability challenges in other scientific domains that rely on large-scale data analysis pipelines.
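
To illustrate the executor-based style of distribution that mpi4py.futures supports, here is a minimal sketch. It is not GCsnap2's actual code: the function names `chunked`, `analyze_batch`, and `run_distributed` are hypothetical, and the per-batch "analysis" is a placeholder. The sketch uses the standard library's `ThreadPoolExecutor` so it runs anywhere; `mpi4py.futures.MPIPoolExecutor` is an `Executor` subclass with the same `map` interface and could be passed in instead on a cluster.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Callable, Dict, List


def chunked(items: List[str], size: int) -> List[List[str]]:
    """Split the input list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def analyze_batch(batch: List[str]) -> Dict[str, int]:
    """Hypothetical placeholder for per-batch work: map each ID to its length."""
    return {seq_id: len(seq_id) for seq_id in batch}


def run_distributed(
    seq_ids: List[str],
    batch_size: int,
    make_executor: Callable[[], Executor],
) -> Dict[str, int]:
    """Fan batches out to the executor's workers and merge the partial results."""
    merged: Dict[str, int] = {}
    with make_executor() as pool:
        # Executor.map yields results in the order the batches were submitted.
        for partial in pool.map(analyze_batch, chunked(seq_ids, batch_size)):
            merged.update(partial)
    return merged


if __name__ == "__main__":
    ids = [f"SEQ{i}" for i in range(10)]
    result = run_distributed(ids, 3, lambda: ThreadPoolExecutor(max_workers=4))
    print(len(result))
```

Assuming mpi4py is installed, the same script could be run across MPI processes by passing `MPIPoolExecutor` as `make_executor` and launching it with, for example, `mpiexec -n 4 python -m mpi4py.futures script.py`. Keeping the executor as a parameter is one way to let the same workflow run on a desktop or a cluster.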