
This paper proposes a method for implementing dense matrix multiplication on FP64 (DGEMM) and FP32 (SGEMM) using Tensor Cores on NVIDIA's graphics processing units (GPUs). Tensor Cores are special processing units that perform \(4\times 4\) matrix multiplications on FP16 inputs with FP32 precision and return the result in FP32. The proposed method adopts the Ozaki scheme, an accurate matrix multiplication algorithm based on error-free transformation for matrix multiplication. The proposed method has three prominent advantages: first, it can be built upon the cublasGemmEx routine using Tensor Core operations; second, it can achieve higher accuracy than standard DGEMM, including the correctly rounded result; third, it ensures bit-level reproducibility even for different numbers of cores and threads. The achievable performance of the method depends on the absolute-value range of each element of the input matrices. For example, when the matrices were initialized with random numbers over a dynamic range of 1E+9, our DGEMM-equivalent implementation achieved up to approximately 980 GFlops of FP64 operation on the Titan RTX GPU (with 130 TFlops on Tensor Cores), whereas cublasDgemm can achieve only 539 GFlops on FP64 floating-point units. Our results reveal the possibility of utilizing hardware with limited FP32/FP64 resources and fast low-precision processing units (such as AI-oriented processors) for general-purpose workloads.


The increasing number of deep learning applications has triggered the development of special processing units such as Tensor Cores on NVIDIA's graphics processing units (GPUs) and Google's Tensor Processing Units (TPUs) in recent years. The kernel of such tasks is matrix multiplication, which does not require high precision such as IEEE 754-2008 binary32 (known as single-precision or FP32, with an 8-bit exponent and a 23-bit fraction) or binary64 (known as double-precision or FP64, with an 11-bit exponent and a 52-bit fraction). The hardware instead supports fast, low-precision operations such as binary16 (known as half-precision or FP16, with a 5-bit exponent and a 10-bit fraction) and 8/16-bit integer operations. One of the most widely used examples is the Tensor Core introduced in the Volta architecture, which computes a \(4\times 4\) matrix multiplication per clock with fused multiply-add operations. Although Tensor Cores support several data formats and computational precisions, the present paper focuses on FP16 computations with the FP32 precision mode, which compute \(d=a\times b+c\) with FP32 precision. Here, a and b are FP16 values, and c and d are FP32 values. Tensor Cores operate up to eight times faster than standard FP32 floating-point units (FPUs) on CUDA Cores.
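
To make this mode concrete, the following minimal sketch (our illustration, not code from the paper) calls the cublasGemmEx routine with FP16 input matrices and an FP32 output and compute type, i.e., the FP16-input/FP32-accumulation Tensor Core mode just described. It assumes cuBLAS 11 or later, where the compute type is CUBLAS_COMPUTE_32F and Tensor Cores are used automatically for eligible problems; older releases pass CUDA_R_32F instead and request Tensor Cores with CUBLAS_GEMM_DEFAULT_TENSOR_OP. The matrix sizes are placeholders, the device buffers are left unfilled, and error checking is omitted.

#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main() {
    const int m = 1024, n = 1024, k = 1024;   // placeholder sizes

    // Device buffers: FP16 inputs, FP32 output (column-major, as cuBLAS expects).
    __half *A, *B;
    float  *C;
    cudaMalloc(&A, sizeof(__half) * m * k);
    cudaMalloc(&B, sizeof(__half) * k * n);
    cudaMalloc(&C, sizeof(float)  * m * n);
    // (In real use, A, B, and C would be filled here.)

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C with FP16 inputs and FP32 compute/accumulation.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &alpha,
                 A, CUDA_R_16F, m,
                 B, CUDA_R_16F, k,
                 &beta,
                 C, CUDA_R_32F, m,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}

Such a call computes \(C=\alpha AB+\beta C\) with FP16 storage for A and B but FP32 products and accumulation; this routine is the building block the proposed method is built upon.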

Many studies have exploited this tremendous performance of Tensor Cores in general tasks. This paper presents a method for computing the general matrix multiply routine (GEMM) in level-3 basic linear algebra subprograms (BLAS) on FP64 (DGEMM) and FP32 (SGEMM) using Tensor Cores. GEMM is one of the kernel operations of many scientific workloads, as well as of the high-performance Linpack benchmark. The proposed method is based on an accurate matrix multiplication algorithm built on an error-free transformation of matrix multiplication, proposed by Ozaki et al.; a rough code sketch of the underlying splitting step is given after this paragraph. The advantages of this method are listed below.
- Accurate: The method achieves higher accuracy than standard SGEMM and DGEMM, even up to correct rounding.
- Productive: Being built upon the cublasGemmEx routine in cuBLAS provided by NVIDIA, the method incurs a low development cost.
- Adaptable: The concept is adaptable to other precisions.
- Reproducible: The method obtains the same (bitwise identical) result for the same input, even when the number of cores and threads differs between executions.
Whereas some studies simply accelerate computations that do not require high precision by utilizing low-precision hardware, the present study attempts more accurate computations by utilizing low-precision hardware.
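
To illustrate the error-free transformation that the method relies on, the sketch below (again our illustration, not the authors' code) splits one length-k row of an FP64 matrix into a few FP16 slice vectors whose sum reproduces the leading bits of the row, \(a \approx a^{(1)}+a^{(2)}+\cdots\). In the Ozaki scheme, the number of bits q kept per slice is chosen from the inner dimension k so that dot products between slices of A and slices of B accumulate exactly in FP32 (roughly \(2q+\lceil \log_2 k\rceil \le 24\), with q further capped by FP16's 11-bit significand); each pairwise slice product can then be computed with the Tensor Core GEMM shown earlier and the partial results summed in FP64. For clarity, this sketch omits the power-of-two scaling that the actual method applies so that the slices also fit within FP16's exponent range.

#include <algorithm>
#include <cmath>
#include <vector>
#include <cuda_fp16.h>

// Rough sketch of the row-wise error-free splitting behind the Ozaki scheme.
// An FP64 row `a` of length k is split into up to `num_slices` FP16 slices with
//   a[j] = (double)slices[0][j] + (double)slices[1][j] + ... + rest[j],
// where each extraction leaves the remainder in `rest` with no rounding error.
// `q` is the number of leading bits kept per slice; in the actual method it is
// derived from k so that slice-by-slice dot products are exact in FP32
// (roughly 2*q + ceil(log2(k)) <= 24, and q <= 11 so a slice fits in FP16).
void split_row(const double* a, int k, int q, int num_slices,
               std::vector<std::vector<__half>>& slices)
{
    std::vector<double> rest(a, a + k);        // remainder after extracted slices
    slices.assign(num_slices, std::vector<__half>(k, __float2half(0.0f)));
    for (int s = 0; s < num_slices; ++s) {
        double mu = 0.0;                       // largest magnitude still remaining
        for (int j = 0; j < k; ++j) mu = std::max(mu, std::fabs(rest[j]));
        if (mu == 0.0) break;                  // the row is already fully represented
        // Power-of-two shift: (x + sigma) - sigma keeps roughly the top q bits of x
        // relative to mu; subtracting the extracted part back is exact.
        const double sigma =
            std::ldexp(1.0, static_cast<int>(std::ceil(std::log2(mu))) + 53 - q);
        for (int j = 0; j < k; ++j) {
            const double hi = (rest[j] + sigma) - sigma;
            slices[s][j] = __float2half(static_cast<float>(hi));  // ~q bits: fits FP16
            rest[j] -= hi;                     // exact remainder, no information lost
        }
    }
}

Roughly speaking, because each slice-by-slice product is exact, the final result does not depend on how the underlying GEMM orders its operations, which is what enables the accuracy and bit-level reproducibility listed above.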
