Efficient Memory Hierarchy Utilization for Matrix Multiplication and Convolution on CPUs

Korostelev, Ivan

doi:doi:10.7939/r3-qrhw-ge40

This decommissioned ERA site remains active temporarily to support our final migration steps to https://ualberta.scholaris.ca, ERA's new home. All new collections and items, including Spring 2025 theses, are at that site. For assistance, please contact erahelp@ualberta.ca.

View

Download

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Usage

160 views
608 downloads

Efficient Memory Hierarchy Utilization for Matrix Multiplication and Convolution on CPUs

Author / Creator

Korostelev, Ivan
Matrix multiplication is a key operation both in high-performance computing and in deep-learning applications. Generic building blocks have been introduced to abstract a matrix-multiplication operation and to enable the generation of efficient code for multiple architectures. This thesis presents three solutions to create software stacks that connect user-level code with the use of these building blocks. The first one uses pattern recognition at the intermediate representation level to discover matrix routines and replace them with efficient hand-crafted implementations that exist in numerical libraries. The second solution is a compiler-only code-generation path for matrix multiplication that targets multiple platforms. The third one is a novel convolution algorithm that replaces the traditional image-to-column data transformation with a custom cache-utilization strategy. Common to all three solutions are the use of generic building blocks that are target agnostic and the focus on making efficient use of the memory hierarchy of each machine.
The first solution, KernelFaRer, is a compiler optimization pass that searches the source code for linear algebra routines, such as matrix multiplica- tion and rank-2 matrix update, and replaces these occurrences with efficient library implementations. KernelFaRer operates on the LLVM’s intermediate representation language which makes the tool source-language agnostic and granular enough to recognize complex data dependency patterns. In comparison with other pattern matching approaches, KernelFaRer is more robust and carries less compilation time overhead.
The second solution brings the ideas for efficient matrix multiplication from high-performance libraries into a production LLVM compiler. The evaluation study shows that this approach delivers performance matching that of the high- performance libraries. This approach allows efficient linear algebra algorithms before high-performance libraries officially release them, as was the case for the IBM POWER10™ architecture.
The third solution, YaConv, is a novel convolution algorithm that re- purposes the building blocks for matrix multiplication from a popular high- performance library. The custom cache utilization strategy reduces the number of cache accesses by 5× and achieves mean speedup of 15% over the standard im2col-based convolution. Moreover, by integrating placement of elements into cache with their use by vector instructions, the new algorithm requires only a constant buffer of the size of cache, as compared to huge memory consumption of the traditional convolution algorithms.
Subjects / Keywords
Graduation date

Fall 2022
Type of Item

Thesis
Degree

Master of Science
DOI

https://doi.org/10.7939/r3-qrhw-ge40
License

This thesis is made available by the University of Alberta Library with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Master's
Department
- Department of Computing Science
Supervisor / co-supervisor and their department(s)
- Jose Nelson Amaral (Computing Science)