
Minisymposium

MS2D - Challenges in Systems Design for Omics

Fully booked
Monday, June 16, 2025, 14:30 - 16:30 CEST
Room 6.0D13


Description

This minisymposium addresses the critical challenges faced in the design and implementation of systems for omics research. As the field of omics, encompassing disciplines such as genomics, proteomics, and metabolomics, continues to expand rapidly, there is an increasing demand for hardware-software co-design and robust computational systems that can handle large datasets, provide accurate analyses, and facilitate meaningful biological insights. The enormous growth of data continuously shifts the life sciences from model-driven towards data-driven science, driving the adoption of deep neural network models, massively parallel accelerators such as GPUs, and vendor-independent portability frameworks. This session brings together experts from both the computational and life sciences to discuss innovative approaches to systems design that meet the unique needs of omics workloads. Topics include advanced algorithms for data processing in genomics and proteomics, novel data representations that achieve superior memory efficiency, and hardware-software co-design to improve performance and energy efficiency. Speakers will also share mechanisms that enable real-time analysis of genomic data by analyzing electrical signals as raw sequencing data, lessons learned from GPU acceleration of widely used bioinformatics tools, and an outlook on future software and hardware trends that will likely impact computational biology.

Presentations

14:30 - 15:00 CEST
Computational Biology Patterns as a Co-Design Resource and Proposed Technology Roadmap for Modernizing Workhorse Biomedical Codes

Application proxies in high-performance computing play an important role in software/hardware co-design. To broaden the types of computation available for co-design, we are developing a suite of proxy apps based on MetaHipMer2 (mhm2), a DOE-developed, scalable, de novo metagenome assembler. MetaHipMer2 is implemented in C++ and offloads several routines (K-mer Analysis, Alignment, Local Assembly) to GPUs. It has been used to assemble large (>50 terabase), complex metagenomes on exascale-class machines (e.g., Summit). Our first proxy, "mhm2-kmer-analysis," focuses on the expensive K-mer Analysis step, which we have implemented in the Kokkos performance portability framework. In this talk, we give an overview of mhm2, its execution phases, and their correlation to common "big data" computational patterns. Further, we make the case for modernizing codes via vendor-independent portability frameworks, such as Kokkos, and discuss our porting experience, including vignettes on expressing common CUDA idioms in Kokkos. Finally, we give an outlook on future software and hardware trends that will likely impact computational biology. Sandia National Laboratories is managed and operated by NTESS under DOE NNSA contract DE-NA0003525.
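
As a flavor of the porting discussion, below is a minimal, hypothetical sketch of a k-mer counting kernel written with Kokkos, the portable analogue of a CUDA kernel launch that uses atomicAdd. All names and the naive per-bucket count table are illustrative and are not taken from mhm2.

```cpp
#include <Kokkos_Core.hpp>
#include <cstdint>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int K = 4;                     // toy k-mer length
    const int n = 1 << 20;               // number of 2-bit-encoded bases
    const int n_buckets = 1 << (2 * K);  // one counter per possible k-mer

    // Views are allocated in the default execution space's memory
    // (device memory under the CUDA backend) and zero-initialized.
    Kokkos::View<uint8_t*> bases("bases", n);
    Kokkos::View<int*> counts("counts", n_buckets);

    // Portable analogue of a CUDA kernel launch over all k-mer positions.
    Kokkos::parallel_for("count_kmers", n - K + 1, KOKKOS_LAMBDA(const int i) {
      int key = 0;
      for (int j = 0; j < K; ++j)        // pack K bases into one integer key
        key = (key << 2) | bases(i + j);
      // Portable equivalent of CUDA's atomicAdd(&counts[key], 1).
      Kokkos::atomic_increment(&counts(key));
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}
```

The same source compiles unchanged against CUDA, HIP, SYCL, or OpenMP backends, which is the vendor independence the talk advocates.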

Amy Powell (Sandia National Laboratories, University of New Mexico); Logan Williams (North Carolina State University); Jan Ciesko (Sandia National Laboratories); and Gavin C. Conan (North Carolina State University)
15:00 - 15:30 CEST
Building Ultra-Large Pangenomes

Pangenomics is an emerging field that allows us to accurately and comprehensively study within-species genetic diversity and its relationship to physical traits (phenotypes) by using a collection of genomes of a species instead of a single reference genome. Future pangenomics applications will require analyzing ultra-large and ever-growing collections of genomes. While existing pangenome data formats can represent the genetic variation in a collection of genomes, they do not store the genomes' shared evolutionary and mutational histories and are also unlikely to keep up with the speed and volume of genome sequencing data. In this talk, I will present ongoing work from my lab on a novel pangenomic data representation that achieves significant improvements in the memory efficiency and representative power of pangenomes. I will then discuss how we are leveraging GPUs and HPC systems to construct massive pangenomes consisting of millions of sequences. While the focus will be on microbial genomes, I will also discuss how these approaches can be extended to more complex genomes.
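
As context for what storing "shared evolutionary and mutational histories" can mean, here is a deliberately simplified C++ sketch of one such idea: annotating mutations on the edges of a phylogenetic tree, so each genome is recovered by replaying mutations along its root-to-leaf path. The structure and names are hypothetical and are not the representation presented in the talk.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// One mutation relative to the reference genome.
struct Mutation {
  uint32_t position;  // 0-based coordinate in the reference
  char ref;           // reference base
  char alt;           // mutated base
};

// A tree node whose incoming edge carries the mutations acquired there.
struct TreeNode {
  std::string name;                 // sample or internal-node label
  std::vector<Mutation> mutations;  // mutations on the edge into this node
  std::vector<TreeNode*> children;
};

// Reconstruct a genome by walking root -> leaf and applying mutations.
void apply_path(const std::vector<const TreeNode*>& path, std::string& genome) {
  for (const TreeNode* node : path)
    for (const Mutation& m : node->mutations) genome[m.position] = m.alt;
}

int main() {
  TreeNode root{"root", {}, {}};
  TreeNode leaf{"sample1", {{3, 'A', 'G'}}, {}};
  root.children.push_back(&leaf);

  std::string genome = "ACGTACGT";     // toy reference
  apply_path({&root, &leaf}, genome);  // genome[3] becomes 'G'
  return 0;
}
```

Storing only per-edge differences lets shared history be represented once, which is one route to the memory efficiency the abstract alludes to.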

Yatish Turakhia (University of California San Diego)
15:30 - 16:00 CEST
Accelerating AI-based Genome Analysis via Algorithm-Architecture Co-Design

Analyzing genomic data provides critical insights for understanding and treating diseases, outbreak tracing, evolutionary studies, agriculture, and many other areas of the life sciences and personalized medicine. Modern genome sequencing devices can rapidly generate large amounts of genomic data at a low cost. However, genome analysis is bottlenecked by the computational and data movement overheads of existing systems and algorithms, causing significant limitations in terms of speed, accuracy, application scope, and energy efficiency of the analysis. In this talk, we will focus on substantially improving the speed and energy efficiency of a computationally costly machine learning (ML) technique used in many important genomics applications. We will introduce ApHMM, which resolves significant inefficiencies that make an expectation-maximization technique costly for profile Hidden Markov Models (pHMMs) on general-purpose processors. ApHMM achieves this by effectively co-designing both hardware and algorithm. As a result, ApHMM provides substantial improvements in performance (up to two orders of magnitude) and energy efficiency (up to three orders of magnitude) compared to CPUs and GPUs.
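
For readers unfamiliar with why pHMM training is costly: the dominant kernel is a dynamic program whose inner loop is a multiply-accumulate over all predecessor states, evaluated at every observation. The sketch below shows a generic textbook forward pass to make that pattern concrete; it is not ApHMM's co-designed version.

```cpp
#include <vector>

// Forward algorithm for a generic HMM: the dynamic-programming pattern
// whose repeated evaluation dominates Baum-Welch (EM) training cost.
double forward_likelihood(
    const std::vector<std::vector<double>>& trans,  // trans[i][j]: P(i -> j)
    const std::vector<std::vector<double>>& emit,   // emit[j][o]: P(o | j)
    const std::vector<double>& init,                // init[j]: P(start in j)
    const std::vector<int>& obs) {                  // observation sequence
  const size_t S = trans.size();
  std::vector<double> alpha(S), next(S);

  for (size_t j = 0; j < S; ++j) alpha[j] = init[j] * emit[j][obs[0]];

  for (size_t t = 1; t < obs.size(); ++t) {
    for (size_t j = 0; j < S; ++j) {
      double sum = 0.0;                 // inner product over predecessors:
      for (size_t i = 0; i < S; ++i)    // this O(S^2) multiply-accumulate
        sum += alpha[i] * trans[i][j];  // per step is the hot spot
      next[j] = sum * emit[j][obs[t]];
    }
    alpha.swap(next);
  }

  double total = 0.0;
  for (double a : alpha) total += a;
  return total;  // P(obs | model)
}

int main() {
  // Two-state toy model over a binary alphabet.
  std::vector<std::vector<double>> trans{{0.9, 0.1}, {0.2, 0.8}};
  std::vector<std::vector<double>> emit{{0.7, 0.3}, {0.4, 0.6}};
  std::vector<double> init{0.5, 0.5};
  double p = forward_likelihood(trans, emit, init, {0, 1, 1, 0});
  return p > 0.0 ? 0 : 1;
}
```

The repeated multiply-accumulate and the data movement it induces are exactly the kind of inefficiency a hardware-software co-design such as ApHMM targets.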

Can Firtina (ETH Zurich)
16:00 - 16:30 CEST
Accelerating Protein Homology Search for AlphaFold on GPUs

The enormous growth of data continuously shifts the life sciences from model-driven towards data-driven science. The need for efficient processing has led to the adoption of massively parallel accelerators such as GPUs. As a consequence, method development in genomics and proteomics nowadays often depends heavily on the effective use of these powerful technologies. Furthermore, progress in both computational techniques and architectures continues to be highly dynamic, including novel deep neural network models and AI accelerators. For example, contemporary groundbreaking AI tools such as AlphaFold can generate highly accurate 3D protein structure predictions. In this talk, I present two novel tools that advance the state of the art in large-scale protein homology search on modern GPU systems: CUDASW++4.0 and MMseqs2-GPU. In particular, MMseqs2-GPU can significantly accelerate the computation of multiple sequence alignments in the ColabFold protein structure prediction server, one of the most frequently used bioinformatics tools worldwide.
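
As a point of reference for what these tools accelerate, the core recurrence of protein homology search is Smith-Waterman local alignment. The scalar sketch below uses a toy match/mismatch score and a linear gap penalty; CUDASW++4.0 and MMseqs2-GPU parallelize far more elaborate variants (substitution matrices, affine gaps, GPU tiling).

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Scalar reference for Smith-Waterman local alignment scoring. Each cell
// takes the best of: starting fresh (0), extending a diagonal match or
// mismatch, or opening a gap in either sequence.
int smith_waterman_score(const std::string& a, const std::string& b,
                         int match = 2, int mismatch = -1, int gap = -2) {
  std::vector<std::vector<int>> H(a.size() + 1,
                                  std::vector<int>(b.size() + 1, 0));
  int best = 0;
  for (size_t i = 1; i <= a.size(); ++i) {
    for (size_t j = 1; j <= b.size(); ++j) {
      int diag = H[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
      H[i][j] = std::max({0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap});
      best = std::max(best, H[i][j]);  // track best local alignment anywhere
    }
  }
  return best;
}

int main() {
  // Toy protein fragments; a real tool would score with BLOSUM/PAM matrices.
  return smith_waterman_score("HEAGAWGHEE", "PAWHEAE") > 0 ? 0 : 1;
}
```

The anti-diagonal independence of this recurrence is what makes it amenable to the fine-grained GPU parallelism these tools exploit.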

Bertil Schmidt (Johannes Gutenberg University Mainz)