Exploring Preferential Label Smoothing for Neural Network based classifiers

  • Author / Creator
    Goyal, Paritosh
  • Overfitting is a phenomenon in which a machine learning system learns the patterns in the training data so well that its performance on unseen data is adversely affected. In practice, machine learning systems that overfit are not deployable; systems that generalize well and perform well on both train and test data are the ones deployed. One of the strategies used to prevent overfitting and help models generalize is regularization. For neural-network-based machine learning systems, regularization can be applied through the network architecture, the loss function, or the training algorithm.
    One of the losses used to train neural-network-based classifiers is the Cross Entropy (CE) loss. With this loss, the loss for a given data sample is computed solely from that sample’s ground truth label, i.e., full concentration is placed on the ground truth label and the effect of the other labels is neglected. This makes the classifier overconfident in the single ground truth label and degrades generalization. One method of regularization is to take some of this concentration (called the Smoothing Ratio (SR)) from the sample’s ground truth label and distribute it uniformly among all the other labels. This method is called label smoothing and has been found to be quite effective. For brevity, we call the approach of distributing the SR uniformly Uniform Label Smoothing (ULS).
    In this work, we explore what happens if we distribute the SR to the non-ground-truth labels based on how closely they are related to the ground truth label. The relation between labels may come from an external source: learnt from external data or provided by a subject matter expert. We call this approach of distributing the SR based on the relation between labels Preferential Label Smoothing (PLS). PLS represents a more unified approach to label smoothing, since ULS is a special case of PLS. Previous works on ULS suggest that ULS becomes redundant when the number of labels is high. Conversely, when there are only two labels (i.e., binary classification), there is no point in using PLS, since there is only one non-ground-truth label to distribute the SR to. We therefore investigate the effects of PLS when the number of labels in the dataset is high. Another gap we study is the effect of PLS and ULS on training dynamics, and how those dynamics differ from training without any label smoothing. We demonstrate our study on image classification and text classification. Experimenting on text classification fills one more gap in previous works: ULS had not been studied in the context of text classification.
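    The construction of soft targets described above can be sketched as follows. This is a minimal illustrative sketch, not code from the thesis: the function name and the relation-matrix format are assumptions, and the relation matrix stands in for whatever external source (learnt or expert-provided) supplies the label relations. ULS falls out as the special case where no relation matrix is given.

```python
import numpy as np

def smoothed_targets(labels, num_classes, sr, relation=None):
    """Build soft target distributions for label smoothing.

    labels:      array of ground-truth class indices, shape (N,)
    sr:          smoothing ratio taken away from the ground-truth label
    relation:    optional (num_classes, num_classes) non-negative matrix
                 where relation[g, j] scores how related label j is to
                 ground truth g (hypothetical format, for illustration);
                 None reproduces Uniform Label Smoothing (ULS).
    """
    targets = np.zeros((len(labels), num_classes))
    for i, g in enumerate(labels):
        # Ground-truth label keeps 1 - SR of the probability mass.
        targets[i, g] = 1.0 - sr
        if relation is None:
            # ULS: spread SR uniformly over the other labels.
            share = np.full(num_classes, sr / (num_classes - 1))
        else:
            # PLS: spread SR in proportion to relatedness to g.
            w = relation[g].astype(float).copy()
            w[g] = 0.0  # no mass flows back to the ground truth
            share = sr * w / w.sum()
        share[g] = 0.0
        targets[i] += share
    return targets
```

    Each row of the returned matrix sums to 1 and can be plugged into a cross-entropy loss against the model's predicted log-probabilities in place of a one-hot target.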

  • Subjects / Keywords
  • Graduation date
    Fall 2022
  • Type of Item
  • Degree
    Master of Science
  • DOI
  • License
    This thesis is made available by the University of Alberta Library with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.