Paper

Scalable Genomic Context Analysis with GCsnap2 on HPC Clusters

Monday, June 16, 2025, 17:00 - 17:30 CEST
Climate, Weather and Earth Sciences
Chemistry and Materials
Computer Science and Applied Mathematics
Humanities and Social Sciences
Engineering
Life Sciences
Physics

Presenter

Reto Krummenacher, University of Basel

Since November 2024, I have been a research assistant in the High Performance Computing (HPC) group and a Ph.D. student in the PhD Program Data Science at the University of Basel. I am working on improving scheduling in HPC systems using machine learning. I am also responsible for the μ-Cluster.

In 2024, I received my M.Sc. degree in Computer Science from the University of Basel, with a major in Machine Intelligence. My master's thesis focused on improving the performance of the genomic context analysis tool GCsnap.

I received my Bachelor's degree from the University of Basel in 2022. My bachelor's thesis was on benchmarking DAPHNE, an integrated data analysis pipeline for large-scale data management, HPC, and machine learning.

Before diving into computer science, I earned an M.Sc. in Business and Economics and worked as a regional economic forecaster.

Description

GCsnap2 Cluster is a scalable, high-performance tool for genomic context analysis, developed to overcome the limitations of its predecessor, GCsnap1 Desktop. Leveraging distributed computing with mpi4py.futures, GCsnap2 Cluster achieved a 22× improvement in execution time and can now perform genomic context analysis for hundreds of thousands of input sequences on HPC clusters. Its modular architecture enables the creation of task-specific workflows and flexible deployment in various computational environments, making it well suited for bioinformatics studies of large-scale datasets.

This work highlights the potential for applying similar approaches to solve scalability challenges in other scientific domains that rely on large-scale data analysis pipelines.
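
As a rough illustration of the distribution pattern mentioned above, the following is a minimal sketch of how independent per-target work could be mapped over MPI workers with mpi4py.futures. It is not GCsnap2's actual code: `summarise_target`, `analyse_targets`, `run_with_mpi`, and `run_locally` are hypothetical names, and the per-target task is a trivial placeholder. The only library calls assumed are `MPIPoolExecutor` from mpi4py.futures and the standard `concurrent.futures` executor interface it follows.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Iterable, List, Tuple


def summarise_target(target_id: str) -> Tuple[str, int]:
    """Hypothetical stand-in for one independent per-target analysis step.

    It must be a top-level function so it can be pickled and sent to workers.
    """
    return target_id, len(target_id)


def analyse_targets(
    targets: Iterable[str], executor: Executor, chunksize: int = 1
) -> List[Tuple[str, int]]:
    """Map the per-target step over all targets, keeping input order."""
    return list(executor.map(summarise_target, targets, chunksize=chunksize))


def run_with_mpi(targets: Iterable[str]) -> List[Tuple[str, int]]:
    """Distribute the work over MPI workers (requires mpi4py and an MPI runtime)."""
    # Imported lazily so the rest of the module works without mpi4py installed.
    from mpi4py.futures import MPIPoolExecutor

    with MPIPoolExecutor() as executor:
        # Larger chunks reduce per-task communication for many small tasks;
        # the value 64 is an arbitrary assumption, not a tuned setting.
        return analyse_targets(targets, executor, chunksize=64)


def run_locally(targets: Iterable[str], workers: int = 4) -> List[Tuple[str, int]]:
    """Same workflow on a single machine, using threads instead of MPI."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return analyse_targets(targets, executor)
```

Because both executors share the same `map` interface, the workflow code stays identical on a desktop and on a cluster. On a cluster, a script that calls `run_with_mpi(...)` under an `if __name__ == "__main__":` guard would typically be launched with something like `mpiexec -n 4 python -m mpi4py.futures script.py`, where `script.py` is a placeholder name.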

Authors