P42 - Towards a Sparse BLAS Standard for Triangular Solvers on ARM Architectures
Description
Sparse matrix computations are critical in scientific simulation and engineering, and the Sparse BLAS standard is playing a growing role as a benchmark for performance and portability across diverse hardware, including x86 CPUs, GPUs, and Arm architectures. Standardizing sparse matrix operations nevertheless remains challenging because of differences in storage formats, accuracy requirements, and hardware-specific optimizations, so it will require an iterative refinement process. Recent updates to the Arm Performance Libraries, such as the introduction of functions for sparse triangular solves and sparse vector operations, reflect significant industry efforts towards such standardization. This poster contributes to these ongoing efforts by highlighting the benefits of supernodal sparse matrix representations. Supernodes group columns with identical sparsity patterns into dense blocks, so that most of the work in a triangular solve can be carried out by dense BLAS/LAPACK operations, which delivers substantial performance gains. We are collaborating with Arm to integrate supernodal representations into the Arm Performance Libraries, and we showcase improved performance on Arm-based systems powered by state-of-the-art processors from the Ampere Altra Max, Azure Cobalt, and AWS Graviton series.
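To make the supernodal idea concrete, the following is a minimal NumPy sketch of a supernodal forward substitution for a lower-triangular system. It is an illustration only, not the Arm Performance Libraries interface or the implementation presented on the poster. The function names find_supernodes and supernodal_lower_solve are hypothetical, the matrix is held as a dense array for readability, and np.linalg.solve stands in for the dense triangular kernel (TRSV/TRSM) that a real implementation would call on each diagonal block.

```python
import numpy as np


def find_supernodes(L):
    """Return [start, end) column ranges of supernodes in lower-triangular L.

    Hypothetical helper: consecutive columns j-1 and j belong to the same
    supernode when the below-diagonal pattern of column j-1 equals {j}
    plus the below-diagonal pattern of column j.
    """
    n = L.shape[0]
    # Row indices of the nonzeros strictly below the diagonal, per column.
    below = [{int(i) + j + 1 for i in np.nonzero(L[j + 1:, j])[0]}
             for j in range(n)]
    ranges, start = [], 0
    for j in range(1, n):
        if below[j - 1] != below[j] | {j}:
            ranges.append((start, j))
            start = j
    ranges.append((start, n))
    return ranges


def supernodal_lower_solve(L, b):
    """Solve L x = b by forward substitution, one supernode at a time."""
    x = np.array(b, dtype=float)
    for s, e in find_supernodes(L):
        # Dense solve with the triangular diagonal block of the supernode.
        # np.linalg.solve is a stand-in for a dense triangular solve.
        x[s:e] = np.linalg.solve(L[s:e, s:e], x[s:e])
        # All columns of a supernode share the same rows below the block,
        # so the update is a single dense matrix-vector product.
        rows = np.nonzero(L[e:, s])[0] + e
        if rows.size:
            x[rows] -= L[np.ix_(rows, np.arange(s, e))] @ x[s:e]
    return x


# Small example: columns 0-1 and columns 2-4 each form one supernode.
L = np.array([[2.0, 0.0, 0.0, 0.0, 0.0],
              [1.0, 3.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 4.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 5.0, 0.0],
              [1.0, 2.0, 1.0, 1.0, 6.0]])
x_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x = supernodal_lower_solve(L, L @ x_true)
```

In this sketch the column-by-column scalar updates of a conventional sparse triangular solve are replaced by one dense block solve and one dense matrix-vector product per supernode, which is where dense BLAS kernels can be applied.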