Compiler-Only Code Generation for Performant and Modular Matrix-Multiplication Micro Kernels Using Matrix Engines

  • Author / Creator
    Kuzma, Braedy
  • General Matrix-Matrix Multiplication (GEMM) is used widely in many high-performance application domains. In many cases, these applications repeatedly execute their matrix-multiplication subroutine, as is the case in the implementation of a particle-physics simulator or the repeated convolutions of many deep-learning models. This reliance on repeated executions causes matrix-multiplication operations to be a computational bottleneck in these applications, creating a strong motivation to improve the performance of GEMM.

    The state of the art for the efficient computation of GEMM consists of manual, programmer-directed replacement of matrix multiplication with calls to highly optimised Basic Linear Algebra Subprograms (BLAS)-like libraries, which contain kernels painstakingly written in assembly. Beyond a clear expertise barrier for porting each kernel to each iteration of a specific platform -- and thus a maintenance issue -- such a replacement creates a dependency on external code over which a developer has no control. Moreover, calls to an opaque library function disable critical optimisations such as inlining and loop fusion that can enable further optimisations in the calling code.

    The solution to these issues is to provide an alternative for the computation of matrix multiplication, with competitive performance, directly within the compiler. An implementation in this style automatically generates a matrix-multiplication kernel that benefits from all applicable code transformations available in the compiler. This thesis addresses the lack of an efficient compiler-only path to generate code for GEMM by investigating and implementing a high-performance matrix-multiplication kernel directly within the LLVM™ compiler framework. Furthermore, the proposed solution integrates emerging technologies, namely the matrix engine, that provide hardware assistance for the computation of matrix multiplication. In particular, the recent POWER10 processor features one such extension named Matrix Math Assist (MMA). Its unique design choice to implement matrix multiplication through the computation of outer products presents new opportunities to improve performance.
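    The outer-product formulation mentioned above can be illustrated with a minimal sketch. It computes C = A × B as a sum of rank-1 updates, one per index k, instead of as a grid of independent dot products. The function name and the plain list-of-lists representation are assumptions made for illustration; the sketch does not model MMA's instructions, accumulator layout, or data types, and it is not the code generated by the thesis.

```python
def matmul_outer_product(a, b):
    """Compute C = A * B as a sum of rank-1 (outer-product) updates.

    Hypothetical helper for illustration only: `a` is an m x k list of
    rows and `b` is a k x n list of rows.
    """
    m, depth, n = len(a), len(b), len(b[0])
    # The accumulator holds the whole result tile and is updated in place.
    c = [[0.0] * n for _ in range(m)]
    for k in range(depth):
        col = [a[i][k] for i in range(m)]  # k-th column of A
        row = b[k]                         # k-th row of B
        # Rank-1 update: C += col (outer product) row
        for i in range(m):
            for j in range(n):
                c[i][j] += col[i] * row[j]
    return c
```

    Each step reads only m + n operand values yet performs m × n multiply-accumulate operations on the accumulator, which is the data-reuse property that makes an outer-product engine attractive for a micro kernel.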

    The generation of efficient code for matrix multiplication in the LLVM compiler framework is divided into two levels: the macro kernel and the micro kernel. The main goal of the macro-kernel code generation is to make the best use of the memory hierarchy when bringing the operands from the main memory to the highest level of cache memory. The focus of the micro-kernel code generation is to make efficient use of Single Instruction, Multiple Data (SIMD) functional units and to reduce the memory-register data-transfer requirements by increasing data reuse. This thesis focuses on the micro-kernel code generation, though a compiler-only macro-kernel code generation developed as part of a larger work is available.
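    The two-level split can be sketched as follows, under stated assumptions. The outer loops play the role of a macro kernel that walks cache-sized blocks, and the inner routine plays the role of a micro kernel that accumulates a small tile in local storage before writing it back. All names and the block sizes (`mc`, `nc`, `kc`, `mr`, `nr`) are hypothetical; the loop order follows the common layered GEMM scheme and is not taken from the thesis, and operand packing is omitted.

```python
def micro_kernel(a, b, c, i0, i1, j0, j1, k0, k1):
    """Accumulate the C tile [i0:i1) x [j0:j1) over the depth range [k0:k1)."""
    # The local accumulator stands in for registers: C is touched once per tile.
    acc = [[0.0] * (j1 - j0) for _ in range(i1 - i0)]
    for k in range(k0, k1):
        for i in range(i0, i1):
            a_ik = a[i][k]  # loaded once, reused across the tile row
            for j in range(j0, j1):
                acc[i - i0][j - j0] += a_ik * b[k][j]
    for i in range(i0, i1):
        for j in range(j0, j1):
            c[i][j] += acc[i - i0][j - j0]


def gemm_blocked(a, b, mc=4, nc=4, kc=4, mr=2, nr=2):
    """Blocked C = A * B; the block sizes are illustrative, not tuned."""
    m, depth, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    # Macro kernel: iterate over blocks sized to fit a level of cache.
    for jc in range(0, n, nc):
        j_end = min(jc + nc, n)
        for pc in range(0, depth, kc):
            k_end = min(pc + kc, depth)
            for ic in range(0, m, mc):
                i_end = min(ic + mc, m)
                # Micro tiles inside the current block, clipped to its edges.
                for jr in range(jc, j_end, nr):
                    for ir in range(ic, i_end, mr):
                        micro_kernel(a, b, c,
                                     ir, min(ir + mr, i_end),
                                     jr, min(jr + nr, j_end),
                                     pc, k_end)
    return c
```

    Clipping each micro tile to the enclosing block keeps tiles from overlapping when the block sizes are not multiples of the tile sizes, so every element of C is updated exactly once per depth block.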

    This thesis also contributes a detailed performance study that indicates that this new code-generation strategy results in speed improvements between 3.1 and 15.8 times when compared with the closest alternative compiler-only code-generation implementation for some data types. There is also strong indication that, given several improvements in the compiler assembly-code generation, the compiler-generated kernel can match the performance of an expert's handcrafted solution. This thesis also features a detailed analysis of the experimental results that reveals opportunities for changes in the compiler that have the potential to lead to improvements in the entire POWER compilation stack.

  • Subjects / Keywords
  • Graduation date
    Fall 2021
  • Type of Item
    Thesis
  • Degree
    Master of Science
  • DOI
    https://doi.org/10.7939/r3-3jb7-pp81
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.