Minisymposium
MS5F - Fast and Accurate Numerical Linear Algebra on Low-Precision Hardware: Algorithms and Error Analysis
Description
This minisymposium will address the state of the art in computer arithmetic techniques that make it possible to simulate accurate floating-point computations using low-precision floating-point or integer operations. Progress in this research area is important to hardware manufacturers because it allows high-performance computers to reduce the number of complex high-precision floating-point units on the chip and increase the number of low-precision floating-point units, which are especially useful for machine learning; since efficient algorithms are available to simulate high-precision computations, traditional applications that cannot tolerate the errors associated with low precision do not suffer. These techniques are the subject of growing international research, and this minisymposium brings together four speakers from the UK, Japan, and the US.
Presentations
High-performance computing hardware now supports many different floating-point formats, from 64 bits down to only 4 bits. While the effects of reducing precision in numerical linear algebra computations have been extensively studied, some of these low-precision formats also possess a very narrow range of representable values, meaning underflow and overflow are very likely. The goal of this work is to analyze the consequences of this narrow range on the accuracy of matrix multiplication. We describe a simple scaling that can prevent overflow while minimizing underflow. We carry out an error analysis to bound the underflow errors and show that they should remain dominated by the rounding errors in most practical scenarios. We also show that this conclusion remains true when multiword arithmetic is used. We perform extensive numerical experiments that confirm that the narrow range of low-precision arithmetic should not significantly affect the accuracy of matrix multiplication—provided a suitable scaling is used.
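The scaling idea can be sketched as follows. This is a minimal, illustrative version, not the authors' exact algorithm: it divides each matrix by a power of two (so the scaling itself is exact) chosen so that the fp16 product cannot overflow, then undoes the scaling in fp64. The function name `scaled_fp16_matmul` and the particular scaling bound are assumptions made for illustration.

```python
import numpy as np

def scaled_fp16_matmul(A, B):
    """Multiply two fp64 matrices via fp16, with power-of-two scaling
    to prevent overflow. Assumes A and B are nonzero; illustrative only."""
    n = A.shape[1]
    fp16_max = float(np.finfo(np.float16).max)   # 65504
    # Each entry of A @ B is a sum of n products; keep n * |a| * |b|
    # below fp16_max by bounding scaled entries by sqrt(fp16_max / n).
    limit = np.sqrt(fp16_max / n)
    sA = 2.0 ** np.floor(np.log2(limit / np.abs(A).max()))
    sB = 2.0 ** np.floor(np.log2(limit / np.abs(B).max()))
    C16 = (A * sA).astype(np.float16) @ (B * sB).astype(np.float16)
    # Undo the (exact) scaling in fp64.
    return C16.astype(np.float64) / (sA * sB)
```

With entries of magnitude around 10^4, a naive fp16 product overflows to infinity, while the scaled version stays finite and accurate to roughly fp16 working precision.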
We introduce a new algorithm for high-precision computation of matrix multiplication. While hardware-supported floating-point operations are fast, they suffer from rounding errors due to their finite precision. When the accuracy of computed results is not satisfactory, high-precision computation may be considered. One option is to use multi-precision arithmetic, such as MPFR. However, if extending the range of the exponent part is unnecessary, an alternative is to represent numbers as sums of floating-point numbers and perform operations on those sums. Examples include pair arithmetic by Lange and Rump and double-word arithmetic by Bailey. In this talk, we introduce an algorithm that leverages this structure for fused multiply-add operations and applies it to matrix multiplication. As a result, we have designed a computational method that is less costly than pair arithmetic or double-word arithmetic, at the cost of a slight degradation in accuracy. Finally, we demonstrate the performance of the proposed method through numerical experiments. Additionally, we compare the performance of the proposed method with the GEMM-based emulation method known as the Ozaki scheme.
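As a concrete illustration of the "sum of floating-point numbers" representation, here is a minimal sketch (not the speaker's proposed algorithm) of Knuth's TwoSum error-free transformation and a simple double-word addition built on it; the names `two_sum` and `dw_add` are chosen for illustration.

```python
def two_sum(a, b):
    """Knuth's TwoSum: returns (s, e) with s = fl(a + b) and
    a + b = s + e exactly (barring overflow)."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e

def dw_add(xh, xl, yh, yl):
    """Add two double-word numbers x = xh + xl and y = yh + yl.
    A simple ("sloppy") variant: the result (zh, zl) is not fully
    accurate in the last bits of zl, but is far more accurate than
    a single floating-point addition."""
    sh, sl = two_sum(xh, yh)
    v = xl + yl + sl                 # combine the low-order parts
    return two_sum(sh, v)            # renormalize into (zh, zl)
```

For example, `dw_add(1e16, 0.0, 1.0, 1e-20)` recovers the unit that a plain `1e16 + 1.0` would round away, returning it in the low word.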
Modern architectures are equipped with high-performance matrix engines optimized for the low-precision matrix multiplications used in machine learning models. Fully leveraging these architectures is the key to achieving superior performance in numerical algorithms. This study aims to design methods for emulating DGEMM using int8 matrix engines to achieve superior performance on modern architectures. The Ozaki scheme, a highly accurate matrix multiplication algorithm using error-free transformations, enables higher-precision matrix multiplication to be performed through multiple lower-precision matrix multiplications and higher-precision matrix additions. Ootomo et al. implemented the Ozaki scheme using int8 matrix engines with the aim of achieving both sufficient accuracy and high performance. We propose alternative approaches to improving performance by reducing the number of lower-precision matrix multiplications and higher-precision matrix additions. Numerical experiments demonstrate the accuracy of the results, and performance benchmarks of the proposed approaches are conducted. These approaches are expected to yield more efficient results on next-generation architectures. We also provide a rounding error analysis of the proposed methods.
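The core splitting idea behind Ozaki-style int8 emulation can be sketched as follows, here restricted to integer-valued matrices for clarity. Each matrix is split into base-64 "digit" matrices that fit in int8; the exact product is then recovered from several small matrix multiplications with wider accumulation. This is an illustrative simplification, not the authors' method: the names `split_digits` and `ozaki_int8_matmul` are invented, and the int8 matrix engine is simulated by casting to int32 (a real engine multiplies int8 inputs and accumulates in int32).

```python
import numpy as np

BASE = 64  # digit base; |digit| <= 32, so 32*32*n accumulations fit easily in int32

def split_digits(M, ndig=2):
    """Decompose an integer matrix into base-64 digits:
    M = sum_k BASE**k * D[k], with each D[k] stored in int8."""
    digits = []
    R = M.astype(np.int64)
    for _ in range(ndig):
        d = ((R + BASE // 2) % BASE) - BASE // 2   # balanced digit in [-32, 31]
        digits.append(d.astype(np.int8))
        R = (R - d) // BASE                        # exact: d == R (mod BASE)
    assert (R == 0).all(), "entries must fit in ndig base-64 digits"
    return digits

def ozaki_int8_matmul(A, B, ndig=2):
    """Exact product of small integer matrices via ndig**2 int8-sized
    matrix multiplications, recombined with exact integer additions."""
    Ad, Bd = split_digits(A, ndig), split_digits(B, ndig)
    C = np.zeros((A.shape[0], B.shape[1]), dtype=np.int64)
    for i, Ai in enumerate(Ad):
        for j, Bj in enumerate(Bd):
            # stand-in for an int8 matrix engine (int8 in, int32 accumulate)
            P = Ai.astype(np.int32) @ Bj.astype(np.int32)
            C += (BASE ** (i + j)) * P.astype(np.int64)
    return C
```

The real scheme applies the same decomposition to floating-point matrices via error-free splitting, and the performance question the talk addresses is precisely how many of these digit-by-digit products (here `ndig**2`) can be saved.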
Over the last decade GPU architectures have dramatically improved in both performance and energy efficiency. Due largely to the rising importance of artificial intelligence (AI), especially in the areas of large language models (LLMs) and generative AI, this growth has been most pronounced in reduced-precision matrix multiplication capacity, where the introduction of Tensor Cores and new datatypes has sparked a wave of innovative techniques at the juncture of AI and scientific computing. In addition to extending the reach and impact of mixed-precision algorithms, these hardware riches have sparked the development of new floating-point emulation algorithms across a wide range of precisions, including but not limited to single and double precision. In this talk, we will look at the accuracy, performance, and energy efficiency of these methods and provide insights into the challenges and opportunities involved in making them broadly available to the scientific computing community.