Mixed Low-bit Quantization for Model Compression with Layer Importance and Gradient Estimations

  • Author / Creator
    Liu, Hongyang
  • Deep neural networks (DNNs) have become ubiquitous in recent years. However, their substantial memory consumption and high computational cost make deploying them on devices with limited resources challenging. Model compression methods can provide a remedy here. Among those techniques, neural network quantization has achieved a high compression rate by representing weights and activations with a low bit-width while maintaining the accuracy of the high-precision original network. However, mixed-precision (per-layer bit-width) quantization requires careful tuning to maintain accuracy while achieving further compression and finer granularity than fixed-precision quantization. In this thesis, we propose an accuracy-aware criterion to quantify each layer's importance rank. Our method applies imprinting per layer, which acts as an efficient proxy module for accuracy estimation. We rank the layers by the accuracy gain over previous modules and iteratively quantize those that contribute the least accuracy. Previous mixed-precision methods either rely on expensive search techniques such as reinforcement learning (RL) or on end-to-end optimization that offers little interpretation
    of the resulting quantization configuration. Our method is a one-shot, efficient, accuracy-aware information estimation and thus lends better interpretability to the selected bit-width configuration. We also point out a problem with
    the Straight-Through Estimator (STE), which is commonly used for gradient estimation in the quantization field, and discuss ways to address it.

  • Subjects / Keywords
  • Graduation date
    Spring 2022
  • Type of Item
    Thesis
  • Degree
    Master of Science
  • DOI
    https://doi.org/10.7939/r3-n9my-f856
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
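For readers unfamiliar with the techniques the abstract mentions, the following minimal sketch illustrates two of them: symmetric uniform low-bit quantization of a weight tensor, and the Straight-Through Estimator, which treats the non-differentiable rounding step as the identity in the backward pass. The function names and the per-tensor symmetric scheme are illustrative assumptions, not the thesis's implementation.

```python
import numpy as np

def quantize_uniform(w, bits):
    """Symmetric per-tensor uniform quantization to the given bit-width.
    Illustrative helper, not the method proposed in the thesis."""
    levels = 2 ** (bits - 1) - 1          # e.g. 127 positive levels for 8-bit signed
    scale = np.max(np.abs(w)) / levels
    if scale == 0:
        return w.copy()
    q = np.round(w / scale)               # non-differentiable rounding step
    return q * scale                      # dequantize back to floating point

def ste_grad(upstream_grad):
    """Straight-Through Estimator: in the backward pass, round() is treated
    as the identity, so the upstream gradient flows through unchanged."""
    return upstream_grad

w = np.array([0.8, -0.31, 0.05, -1.0])
w4 = quantize_uniform(w, 4)   # 4-bit: only 7 positive levels, coarse grid
w8 = quantize_uniform(w, 8)   # 8-bit: much finer grid, smaller error
```

Lower bit-widths shrink the model more but introduce larger rounding error per layer, which is why per-layer (mixed-precision) bit-width selection, as studied in the thesis, matters.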