Automated Data Engineering for Deep Learning

Zhao, Mingjun

doi:10.7939/r3-b7nc-t224

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Author / Creator

Zhao, Mingjun
Abstract

Driven by advances in computational hardware and the abundance and variety of data, deep learning has in recent years achieved significant success in natural language processing, computer vision, recommendation systems, and other domains. As a result, recent large models can learn from massive amounts of data and handle tasks such as generating pictures from user instructions or producing code with clear comments. However, exploiting massive amounts of data effectively is not straightforward, because the quality and diversity of the data can greatly influence model performance.

Data engineering is the practice of enabling the collection and usage of data. It plays a critical role in making data accessible to deep learning by collecting, managing, and converting raw data into useful information. Commonly used data engineering techniques in deep learning include data collection, filtering, preprocessing, cleaning, and modification. However, manually crafted data engineering policies require expert knowledge and a significant amount of tedious work, and rarely reach the full potential of these techniques.

To avoid human intervention and unleash the potential of data, we automate the data engineering process to facilitate model learning and inference. In particular, we learn automated training curricula and data augmentation strategies that enable the model to learn better from diverse and high-quality data samples. We also improve the inference process by pruning redundant and unimportant parts of the input data to reduce computation cost. Specifically, we make the following contributions:

First, we focus on data selection by analyzing the problem of curriculum learning in neural machine translation (NMT) with the goal of improving a pre-trained NMT model. To achieve this, we propose a data selection framework based on reinforcement learning, which learns to re-select influential data samples from the original training set by identifying the most effective sample in a mini-batch. By simple fine-tuning, the selected subset of data can further improve the performance of the pre-trained model when original batch training reaches its ceiling, without utilizing additional new training data.

Second, we propose a label-aware auto-augmentation algorithm that automatically learns augmentation policies separately for samples of different labels, overcoming the limitation of sample-invariant augmentation. Our algorithm incorporates a predictor-based Bayesian optimizer to identify effective augmentations for each label, and constructs complementary augmentation policies based on the minimum-redundancy maximum-reward principle. It produces effective label-aware augmentation policies that achieve significant performance boosts on image recognition tasks at a low search cost.
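
The minimum-redundancy maximum-reward idea can be illustrated with a small greedy selection routine. This is only a sketch under stated assumptions, not the algorithm from the thesis: the function name `select_complementary`, the averaged-overlap scoring rule, and the reward and redundancy numbers are all hypothetical. In the thesis, rewards would come from the predictor-based Bayesian optimizer rather than from hard-coded values.

```python
from typing import Dict, List, Tuple


def select_complementary(
    reward: Dict[str, float],
    redundancy: Dict[Tuple[str, str], float],
    k: int,
) -> List[str]:
    """Greedily pick up to k candidates with high reward and low mutual overlap.

    `redundancy` maps an alphabetically sorted pair of names to an overlap
    score; missing pairs are treated as having zero overlap.
    """
    selected: List[str] = []
    remaining = set(reward)
    while remaining and len(selected) < k:

        def score(candidate: str) -> float:
            # First pick: reward only. Later picks: reward minus the
            # average overlap with everything already selected.
            if not selected:
                return reward[candidate]
            overlap = sum(
                redundancy.get(tuple(sorted((candidate, s))), 0.0)
                for s in selected
            )
            return reward[candidate] - overlap / len(selected)

        # Sort first so that ties are broken deterministically by name.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical per-label rewards and pairwise overlaps for three augmentations.
reward = {"rotate": 0.9, "shear": 0.85, "color": 0.6}
redundancy = {
    ("rotate", "shear"): 0.8,
    ("color", "rotate"): 0.1,
    ("color", "shear"): 0.1,
}
policy = select_complementary(reward, redundancy, k=2)
```

With these made-up numbers, "shear" has the second-highest reward but overlaps heavily with "rotate", so the routine prefers the less redundant "color" as the second pick.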

Third, we introduce a frame selection framework for the task of video action recognition to extract the most informative and representative frames to help a model better understand video content. We propose a Search-Map-Search learning paradigm which combines the advantages of heuristic search and supervised learning to select the best combination of frames from a video as one entity. By combining search with learning, the proposed method can better capture frame interactions while incurring a low inference overhead.

Finally, we study the embedding dimension pruning problem and propose a low-cost embedding dimension search approach for recommender systems. By assessing the information overlap between the dimensions within each feature field, and progressively pruning unimportant and redundant dimensions during model training via a two-level polarization regularizer, our method efficiently reduces the number of model parameters and achieves strong recommendation performance while introducing minimal overhead.
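
To give a rough sense of what a polarization-style penalty does, the sketch below uses a single-level form, t * ||g||_1 - ||g - mean(g)||_1, applied to per-dimension gate values. This form is an assumption chosen for illustration; the abstract does not specify the two-level regularizer used in the thesis, and the names `polarization_penalty`, `keep_mask`, and the threshold value are hypothetical.

```python
from typing import List


def polarization_penalty(gates: List[float], t: float = 1.0) -> float:
    """Single-level polarization-style penalty (assumed form, for illustration).

    The L1 term pushes gates toward zero, while the subtracted spread term
    rewards gates that sit far from their mean, so minimizing the penalty
    favors a split into near-zero gates and clearly non-zero gates.
    """
    mean = sum(gates) / len(gates)
    l1 = sum(abs(g) for g in gates)
    spread = sum(abs(g - mean) for g in gates)
    return t * l1 - spread


def keep_mask(gates: List[float], threshold: float = 0.5) -> List[bool]:
    """Mark dimensions whose gate magnitude reaches a (hypothetical) threshold."""
    return [abs(g) >= threshold for g in gates]


# Hypothetical gate values for the four embedding dimensions of one field.
polarized = [0.0, 0.0, 1.0, 1.0]
uniform = [0.5, 0.5, 0.5, 0.5]

penalty_polarized = polarization_penalty(polarized)
penalty_uniform = polarization_penalty(uniform)
mask = keep_mask(polarized)
```

Under this assumed form, the polarized gate vector receives a lower penalty than the uniform one, and the resulting mask keeps only the two dimensions whose gates stayed large.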
Subjects / Keywords
- Automated Data Engineering
- Data Engineering
Graduation date

Fall 2023
Type of Item

Thesis
Degree

Doctor of Philosophy
DOI

https://doi.org/10.7939/r3-b7nc-t224
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Doctoral
Department
- Department of Electrical and Computer Engineering
Specialization
- Software Engineering and Intelligent Systems
Supervisor / co-supervisor and their department(s)
- Niu, Di (Computer Engineering)