Efficient Memory Hierarchy Utilization for Matrix Multiplication and Convolution on CPUs

  • Author / Creator
    Korostelev, Ivan
  • Matrix multiplication is a key operation both in high-performance computing and in deep-learning applications. Generic building blocks have been introduced to abstract a matrix-multiplication operation and to enable the generation of efficient code for multiple architectures. This thesis presents three solutions to create software stacks that connect user-level code with the use of these building blocks. The first one uses pattern recognition at the intermediate representation level to discover matrix routines and replace them with efficient hand-crafted implementations that exist in numerical libraries. The second solution is a compiler-only code-generation path for matrix multiplication that targets multiple platforms. The third one is a novel convolution algorithm that replaces the traditional image-to-column data transformation with a custom cache-utilization strategy. Common to all three solutions are the use of generic building blocks that are target agnostic and the focus on making efficient use of the memory hierarchy of each machine.
    The first solution, KernelFaRer, is a compiler optimization pass that searches the source code for linear algebra routines, such as matrix multiplica- tion and rank-2 matrix update, and replaces these occurrences with efficient library implementations. KernelFaRer operates on the LLVM’s intermediate representation language which makes the tool source-language agnostic and granular enough to recognize complex data dependency patterns. In comparison with other pattern matching approaches, KernelFaRer is more robust and carries less compilation time overhead.
    The second solution brings the ideas for efficient matrix multiplication from high-performance libraries into a production LLVM compiler. The evaluation study shows that this approach delivers performance matching that of the high- performance libraries. This approach allows efficient linear algebra algorithms before high-performance libraries officially release them, as was the case for the IBM POWER10™ architecture.
    The third solution, YaConv, is a novel convolution algorithm that re- purposes the building blocks for matrix multiplication from a popular high- performance library. The custom cache utilization strategy reduces the number of cache accesses by 5× and achieves mean speedup of 15% over the standard im2col-based convolution. Moreover, by integrating placement of elements into cache with their use by vector instructions, the new algorithm requires only a constant buffer of the size of cache, as compared to huge memory consumption of the traditional convolution algorithms.

  • Subjects / Keywords
  • Graduation date
    Fall 2022
  • Type of Item
  • Degree
    Master of Science
  • DOI
  • License
    This thesis is made available by the University of Alberta Library with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.