The Interplay of Search and Gradient Descent in Semi-stationary Learning Problems

  • Author / Creator
    Shibhansh Dohare
  • Abstract
    We explore the interplay of generate-and-test and gradient-descent techniques for solving online supervised learning problems. The task in supervised learning is to learn a function from samples of input-output pairs. This function is called the target function. The standard way to learn non-linear target functions is to use artificial neural networks, where the weights of the network are learned via the backpropagation algorithm. The conventional backpropagation algorithm consists of two parts: initializing the weights with small random numbers, and performing gradient descent at every time step.
    We consider a case that differs from most supervised learning in two ways. First, it is conventionally assumed that the samples are independent and identically distributed, whereas we focus on the case where the samples are temporally coherent: samples at consecutive time steps are not independent; rather, each new sample depends on the previous ones. Second, it is often assumed that the learner has sufficient capacity to closely approximate the target function, whereas we assume that the target function is more complex than the learner. Our case is interesting because the real world is often temporally coherent and extremely complex. We call the class of problems with a stationary target function but temporally coherent inputs semi-stationary learning problems, and we focus on semi-stationary problems where the target function is more complex than the learner. We use a novel idealized problem to study various solution methods. In this problem, the inputs follow a Markov chain and the target function is represented by a multi-layered network, which allows us to control the relative complexity of the target function and the approximator (a sketch of such a setup follows the abstract).
    In our idealized problem, the best approximation continually changes because of temporal coherence and the high complexity of the target function. Because of this, there is a need for continual learning and adaptation. We use the conventional backpropagation algorithm to track the best approximation. However, this algorithm is temporally asymmetric in that it treats the beginning of time differently: the computation that generates the small random weights happens only at the first time step. Surprisingly, this makes conventional backpropagation unsuitable for continual learning. We show that backpropagation performs well initially on our idealized problem, but that its performance decays substantially over time and it loses the ability to adapt.
    Finally, we propose a solution to the decaying adaptiveness of backpropagation: continually injecting random features alongside gradient descent. We use a generate-and-test process to inject these random features; it replaces low-utility features with random features drawn from the initial distribution (a sketch of this process also follows the abstract). We find that this continual injection of randomness significantly improves the adaptiveness and performance of gradient descent.
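
    The sketch below illustrates an idealized semi-stationary problem of the kind described in the abstract. It assumes a simple Markov chain over binary inputs (one randomly chosen bit flips per step) and a fixed random single-hidden-layer target network that is much wider than the learner; the specific construction, sizes, and names here are illustrative and may differ from those used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: the target network is wider than anything the small
# learner (defined in the next sketch) can represent exactly.
n_inputs = 20
n_target_hidden = 100

# Fixed random target network: random threshold features with a linear readout.
W_t = rng.normal(size=(n_target_hidden, n_inputs))
b_t = rng.normal(size=n_target_hidden)
v_t = rng.normal(size=n_target_hidden)

def target_fn(x):
    """Stationary but complex target function."""
    h = (W_t @ x + b_t > 0).astype(float)
    return float(v_t @ h)

def next_input(x):
    """Markov-chain input process: flip one randomly chosen bit, so
    consecutive inputs are highly correlated (temporally coherent)."""
    x = x.copy()
    i = rng.integers(n_inputs)
    x[i] = 1.0 - x[i]
    return x

# A temporally coherent stream of (input, target) pairs.
x = rng.integers(0, 2, size=n_inputs).astype(float)
stream = []
for t in range(10_000):
    x = next_input(x)
    stream.append((x.copy(), target_fn(x)))
```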
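
    The second sketch illustrates the continual injection of random features alongside gradient descent. It assumes a single-hidden-layer learner trained online by stochastic gradient descent, with a feature's utility measured by the magnitude of its outgoing weight and low-utility features occasionally replaced by fresh features drawn from the initial weight distribution; the utility measure, replacement schedule, and outgoing-weight reset are assumptions for illustration and may differ from the generate-and-test process used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

# A small learner: one hidden layer of tanh features, trained online by SGD.
n_inputs, n_hidden, lr = 20, 10, 0.01
init_scale = 0.1
W = rng.normal(0.0, init_scale, size=(n_hidden, n_inputs))  # input -> hidden
b = np.zeros(n_hidden)
v = rng.normal(0.0, init_scale, size=n_hidden)              # hidden -> output

def forward(x):
    h = np.tanh(W @ x + b)
    return h, float(v @ h)

def sgd_step(x, y):
    """One step of gradient descent on the squared error."""
    global W, b, v
    h, y_hat = forward(x)
    err = y_hat - y
    dh = err * v * (1.0 - h ** 2)   # backprop through the tanh features
    v -= lr * err * h
    W -= lr * np.outer(dh, x)
    b -= lr * dh
    return 0.5 * err ** 2

def generate_and_test(replace_prob=0.001):
    """Tester: pick the lowest-utility feature (here, the one with the
    smallest outgoing-weight magnitude). Generator: replace it with a fresh
    random feature drawn from the initial distribution."""
    if rng.random() > replace_prob:
        return
    i = int(np.argmin(np.abs(v)))
    W[i] = rng.normal(0.0, init_scale, size=n_inputs)
    b[i] = 0.0
    v[i] = 0.0   # reset the outgoing weight (an assumption) so the new
                 # feature does not disturb the current output

# Online loop over a temporally coherent stream, e.g. the one above:
# for x, y in stream:
#     sgd_step(x, y)
#     generate_and_test()
```

    Resetting a replaced feature's outgoing weight to zero is one simple way to let the new random feature be recruited gradually by gradient descent rather than perturbing the current predictions; it is a design choice of this sketch, not something stated in the abstract.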

  • Subjects / Keywords
  • Graduation date
    Fall 2020
  • Type of Item
    Thesis
  • Degree
    Master of Science
  • DOI
    https://doi.org/10.7939/r3-9ac0-9823
  • License
    Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.