AP2B - ACM Papers Session 2B
Achieving net-positive fusion energy and its commercialization requires not only engineering marvels but also state-of-the-art, massively parallel codes that can handle reactor-scale simulations. The GENE-X code is a global continuum gyrokinetic turbulence code designed to predict energy confinement and heat exhaust for future fusion reactors. GENE-X is capable of simulating plasma turbulence from the core region to the wall of a magnetic confinement fusion (MCF) device. Originally written in Fortran 2008, GENE-X leverages MPI+OpenMP for parallel computing. In this paper, we augment the Fortran-based compute operators in GENE-X with a C++17 layer, exposing them to a wide array of C++-compatible tools. Here we focus on offloading the augmented operators to GPUs via directive-based programming models such as OpenACC and OpenMP offload. The performance of GENE-X is comprehensively characterized, e.g., by roofline analysis on a single GPU and scaling analysis on multiple GPUs. The major compute operators achieve significant performance improvements, shifting the bottleneck to inter-GPU communication. We discuss additional opportunities to further enhance performance, such as reducing memory traffic and improving memory utilization efficiency.
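To illustrate the directive-based approach, below is a minimal sketch of a C++17 compute loop offloaded to the GPU with OpenMP target directives. The operator, function name, and three-point stencil are hypothetical stand-ins, not code taken from GENE-X; an OpenACC variant would use an analogous `#pragma acc parallel loop` with data clauses.

    #include <cstddef>
    #include <vector>

    // Hypothetical stand-in for a directive-offloaded compute operator:
    // a simple three-point stencil, not an actual GENE-X kernel.
    void apply_operator(const std::vector<double>& in, std::vector<double>& out,
                        double coeff) {
        const std::size_t n = in.size();
        if (n < 3) return;
        const double* in_p = in.data();
        double* out_p = out.data();
        // Map the arrays to the device and distribute iterations across
        // GPU teams and threads.
        #pragma omp target teams distribute parallel for \
            map(to: in_p[0:n]) map(from: out_p[0:n])
        for (std::size_t i = 1; i < n - 1; ++i) {
            out_p[i] = coeff * (in_p[i - 1] - 2.0 * in_p[i] + in_p[i + 1]);
        }
        out_p[0] = out_p[n - 1] = 0.0;  // hypothetical boundary handling on host
    }

With a compiler that supports OpenMP offload, the directive moves the loop onto the GPU; without offload support the same code falls back to the host, which is one attraction of the directive-based models named in the abstract.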
The Particle-In-Cell (PIC) algorithm coupled with binary collision modules is a widely applicable method for simulating plasmas over a broad range of regimes (from the collisionless kinetic regime to the collisional regime). While several popular PIC codes implement binary collision modules, their performance on GPUs can be constrained by the default parallelization strategy, which assigns one GPU thread per simulation cell. This approach can underutilize GPU resources for simulations with many macroparticles per cell and relatively few cells per GPU. To address this limitation, we propose an alternative parallelization strategy that instead distributes GPU threads across independent pairs of colliding particles. Our proposed strategy yields a speedup of up to $\sim 4 \times$ for cases with relatively few cells per GPU, and comparable performance otherwise.
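A minimal sketch of such a pair-based decomposition is given below, in C++ with OpenMP offload for consistency with the example above. All names (Particle, collide_pair, cell_start, pair_offset) are hypothetical illustrations rather than the API of any specific PIC code, and the collision update is replaced by a trivial velocity swap placeholder.

    #include <cstddef>
    #include <vector>

    struct Particle { double vx, vy, vz; };

    // Hypothetical placeholder for the binary-collision update; a velocity
    // swap trivially conserves momentum and energy.
    #pragma omp declare target
    inline void collide_pair(Particle& a, Particle& b) {
        Particle tmp = a; a = b; b = tmp;
    }
    #pragma omp end declare target

    // One GPU thread per colliding pair instead of one per cell. Particles
    // are assumed sorted by cell: cell_start[c] is the first particle of
    // cell c, and pair_offset is an exclusive scan of the pair counts per
    // cell, so pair_offset[n_cells] is the total number of pairs.
    void collide_all_pairs(std::vector<Particle>& parts,
                           const std::vector<std::size_t>& cell_start,
                           const std::vector<std::size_t>& pair_offset) {
        if (cell_start.size() < 2) return;
        const std::size_t n_cells = cell_start.size() - 1;
        const std::size_t n_pairs = pair_offset[n_cells];
        const std::size_t n_parts = parts.size();
        Particle* p = parts.data();
        const std::size_t* cs = cell_start.data();
        const std::size_t* po = pair_offset.data();
        #pragma omp target teams distribute parallel for \
            map(tofrom: p[0:n_parts]) \
            map(to: cs[0:n_cells + 1], po[0:n_cells + 1])
        for (std::size_t g = 0; g < n_pairs; ++g) {
            // Recover the owning cell of global pair g by binary search
            // over the pair offsets: po[lo] <= g < po[lo + 1].
            std::size_t lo = 0, hi = n_cells;
            while (lo + 1 < hi) {
                const std::size_t mid = lo + (hi - lo) / 2;
                if (po[mid] <= g) lo = mid; else hi = mid;
            }
            const std::size_t local = g - po[lo];   // pair index within cell lo
            Particle& a = p[cs[lo] + 2 * local];    // fixed even/odd pairing
            Particle& b = p[cs[lo] + 2 * local + 1];
            collide_pair(a, b);  // pairs are disjoint, so no synchronization needed
        }
    }

Because every pair touches two distinct particles, the flat loop over pairs exposes far more parallelism than a loop over cells when there are many macroparticles per cell, which is the regime where the abstract reports the $\sim 4 \times$ speedup.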