Neural Networks Model Compression: The Static, the Dynamic and the Shallow

  • Author / Creator
    Elkerdawy, Sara
  • Deep neural networks (DNNs) have emerged as the state-of-the-art method in several research areas, yet they have not fully permeated resource-constrained computing platforms such as mobile phones. Accurate DNN models, being deeper and wider, take considerable memory and time to execute on small devices, posing challenges to many significant real-time applications, e.g., robotics and augmented reality. Memory and power consumption are as important for low-end devices as they are for cloud-based solutions with multiple graphics processing units (GPUs), where factors such as performance-per-watt, performance-per-dollar, and throughput matter. Recently, different techniques have been proposed to tackle the computational and memory issues inherent in DNNs. We focus on neural network model pruning and distillation for inference and training acceleration, respectively.

    First, early work in model pruning often relied on performing sensitivity analysis before pruning to set the pruning ratio per layer. This process is computationally expensive and hinders scalability to deeper, larger models with more complex connectivity. We propose to train a binary mask for each convolutional filter that acts as a learnable pruning gate. During training, we encourage smaller models by inducing sparsity through minimizing the ℓ1-norm of the masks. The task loss and pruning loss are trained jointly to allow for end-to-end fine-tuning and pruning (an illustrative sketch of this gating idea follows the abstract).

    Second, we present a layer pruning framework for hardware-friendly pruned models optimized for latency reduction. The framework makes a twofold contribution. First, we present a one-shot accuracy approximation by imprinting for layer ranking: layers are ranked by the difference between their approximated accuracy and that of the previous layer (sketched after the abstract). Second, we adopt statistical criteria from the filter pruning literature for layer ranking and examine both iterative filter pruning and layer pruning training paradigms under similar importance criteria in terms of accuracy and latency reduction.

    Third, we propose a dynamic filter pruning inference method to tackle the diminishing accuracy gain from adding more neurons. Motivated by the popular saying in neuroscience, “neurons that fire together wire together”, we equip each convolutional layer with a binary mask predictor that, given the input feature maps, selects a handful of filters to process in the next layer. We pose filter selection as a supervised binary classification problem: each mask predictor module is trained to estimate the log-likelihood that each filter in the next layer belongs to the top-k activated filters (see the sketch after the abstract).

    Finally, we propose a distillation pipeline to accelerate the training of vision transformers. We adopt 1) a self-distillation loss and 2) a query-efficient teacher-student distillation loss. In self-distillation training, early layers mimic the output of the final layer within the same model. This achieves a 2.8x speedup over teacher-student distillation, with matched accuracy in many cases. We also propose a simple yet effective query-efficient distillation for the case where a trained teacher is available, to further boost accuracy: we query the teacher model (a CNN) only when the student (a transformer) fails to predict the correct output. This simple criterion not only saves computational resources but also achieves higher accuracy than full-query teacher-student distillation.
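
    For concreteness, the sketches below illustrate each contribution in PyTorch-style Python. They are minimal illustrations under stated assumptions, not the thesis's actual code; module names, hyperparameters (e.g., the ℓ1 weight), and the sigmoid relaxation of the binary masks are hypothetical. The first sketch shows the first contribution: a learnable per-filter gate trained jointly with the task loss under an ℓ1 sparsity penalty.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GatedConv(nn.Module):
        # Convolution whose output channels are scaled by a learnable gate,
        # standing in for the per-filter pruning mask. The sigmoid relaxation
        # of the binary mask is an illustrative assumption.
        def __init__(self, in_ch, out_ch, k=3):
            super().__init__()
            self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
            self.gate_logits = nn.Parameter(torch.zeros(out_ch))  # one gate per filter

        def mask(self):
            return torch.sigmoid(self.gate_logits)  # soft mask in [0, 1]

        def forward(self, x):
            return self.conv(x) * self.mask().view(1, -1, 1, 1)

    def joint_loss(logits, targets, model, l1_weight=1e-4):
        # Task loss plus an l1 sparsity penalty on all gates, optimized jointly
        # so fine-tuning and pruning happen end to end.
        task = F.cross_entropy(logits, targets)
        l1 = sum(m.mask().abs().sum() for m in model.modules() if isinstance(m, GatedConv))
        return task + l1_weight * l1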
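
    The second sketch illustrates the layer ranking step of the second contribution: approximating per-layer accuracy by imprinting class prototypes from pooled features and ranking layers by their accuracy gain over the previous layer. The cosine-similarity scoring and the assumption that every class appears in the held-out batch are simplifications.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def imprinted_accuracy(features, labels, num_classes):
        # One normalized prototype per class from the layer's features, then
        # nearest-prototype (cosine) classification approximates the accuracy
        # obtainable at that depth. Assumes every class is present in `labels`.
        feats = F.normalize(features.flatten(1), dim=1)               # (N, D)
        protos = torch.stack([feats[labels == c].mean(0) for c in range(num_classes)])
        protos = F.normalize(protos, dim=1)                           # (C, D)
        preds = (feats @ protos.t()).argmax(dim=1)
        return (preds == labels).float().mean().item()

    @torch.no_grad()
    def rank_layers(per_layer_features, labels, num_classes):
        # Rank layers by accuracy gain over the previous layer; layers with
        # the smallest gain are the first candidates for removal.
        accs = [imprinted_accuracy(f, labels, num_classes) for f in per_layer_features]
        gains = [accs[0]] + [accs[i] - accs[i - 1] for i in range(1, len(accs))]
        return sorted(range(len(gains)), key=lambda i: gains[i])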
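
    The third sketch outlines the dynamic filter pruning idea: a lightweight mask predictor attached to a convolutional layer that, from the current feature maps, predicts which filters of the next layer fall among the top-k most activated ones, trained as supervised binary classification. The pooling-plus-linear predictor head is an assumed design choice.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MaskPredictor(nn.Module):
        # Given the current layer's feature maps, predict a logit per filter of
        # the next layer; only the selected filters would be computed at inference.
        def __init__(self, in_ch, next_out_ch):
            super().__init__()
            self.fc = nn.Linear(in_ch, next_out_ch)

        def forward(self, x):
            summary = F.adaptive_avg_pool2d(x, 1).flatten(1)  # (N, in_ch)
            return self.fc(summary)                           # logits, (N, next_out_ch)

    def topk_targets(next_activations, k):
        # Binary targets: 1 for the k filters with the largest mean activation.
        scores = next_activations.abs().mean(dim=(2, 3))      # (N, out_ch)
        targets = torch.zeros_like(scores)
        return targets.scatter_(1, scores.topk(k, dim=1).indices, 1.0)

    def predictor_loss(logits, next_activations, k):
        # Supervised binary classification: does each filter belong to the top-k?
        return F.binary_cross_entropy_with_logits(logits, topk_targets(next_activations, k))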
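
    The last sketch covers the distillation pipeline: a self-distillation loss in which early (intermediate) logits mimic the model's own final logits, and a query-efficient variant that queries the CNN teacher only on samples the transformer student misclassifies. The temperature and KL formulation are common distillation choices assumed here.

    import torch
    import torch.nn.functional as F

    def self_distillation_loss(early_logits_list, final_logits, T=2.0):
        # Early heads mimic the model's own final prediction (soft targets).
        soft_target = F.softmax(final_logits.detach() / T, dim=1)
        return sum(F.kl_div(F.log_softmax(z / T, dim=1), soft_target,
                            reduction='batchmean') for z in early_logits_list) * (T * T)

    def query_efficient_loss(student_logits, targets, teacher_fn, images, T=2.0):
        # Query the CNN teacher only on samples the transformer student gets wrong.
        ce = F.cross_entropy(student_logits, targets)
        wrong = student_logits.argmax(dim=1) != targets
        if not wrong.any():
            return ce
        with torch.no_grad():
            teacher_logits = teacher_fn(images[wrong])
        kd = F.kl_div(F.log_softmax(student_logits[wrong] / T, dim=1),
                      F.softmax(teacher_logits / T, dim=1),
                      reduction='batchmean') * (T * T)
        return ce + kd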

  • Subjects / Keywords
  • Graduation date
    Fall 2022
  • Type of Item
    Thesis
  • Degree
    Doctor of Philosophy
  • DOI
    https://doi.org/10.7939/r3-tcf6-7s02
  • License
    This thesis is made available by the University of Alberta Library with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.