A Unified View of Multi-step Temporal Difference Learning

  • Author / Creator
    Kristopher De Asis
  • Abstract
    Temporal-difference (TD) learning is an important approach for predictive knowledge representation and sequential decision making. Within TD learning exist multi-step methods, which unify one-step TD learning and Monte Carlo methods in a way where intermediate algorithms can outperform either extreme. Multi-step TD methods allow a practitioner to address a bias-variance trade-off between reliance on current estimates, which could be poor, and incorporating longer sequences of sampled information, which could have large variance. In this dissertation, we investigate an extension of multi-step TD learning aimed at reducing the variance in the estimates, and provide a unified view of the space of multi-step TD algorithms.
    In Monte Carlo methods, information about the error of a known quantity is sometimes incorporated in an attempt to reduce the error in the estimation of an unknown quantity. This is known as the method of control variates, and it has not been extensively explored in TD learning. We show that control variates can be formulated in multi-step TD learning, and demonstrate improvements in learning speed and accuracy. We then show how the inclusion of control variates provides a deeper understanding of how n-step TD methods relate to TD(λ) algorithms.
    We then look at a previously proposed method to unify the space of n-step TD algorithms, the n-step Q(σ) algorithm. We provide empirical results and analyze properties of this algorithm, suggest an improvement based on insight from the control variates, and derive the TD(λ) version of the algorithm. This generalization can recover existing multi-step TD algorithms as special cases, providing an alternative, unified view of them.
    Lastly, we bring attention to the discount rate in TD learning. The discount rate is typically used to specify the horizon of interest in sequential decision-making problems, but we introduce an alternate view of the parameter with insight from digital signal processing. By allowing the discount rate to take on complex numbers within the complex unit circle, we can extend the types of knowledge learnable by a TD agent into the frequency domain. This allows for online and incremental estimation of the extent to which particular frequencies exist in a signal, with the standard discounting framework corresponding to the zero-frequency case.
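    The control-variate idea described in the abstract can be illustrated with a minimal sketch. The function names, toy values, and exact form below are illustrative assumptions, not taken from the thesis; the per-decision construction shown (mixing the importance-weighted sampled return with a (1 − ρ)·v(s) correction, whose expectation is zero under the behaviour policy) is one standard way a control variate enters an off-policy n-step return:

    ```python
    # Sketch: off-policy n-step returns with and without a per-decision
    # control variate. All names and toy data are illustrative.

    def n_step_return(rewards, states, rhos, v, gamma):
        """Plain off-policy n-step return: scale each sampled step by its
        importance-sampling ratio (high variance when the ratios vary)."""
        g = v[states[-1]]                       # bootstrap from the last state
        for r, rho in zip(reversed(rewards), reversed(rhos)):
            g = rho * (r + gamma * g)
        return g

    def n_step_return_cv(rewards, states, rhos, v, gamma):
        """Same return with a control variate: whenever rho != 1, mix in the
        current estimate v(s) via the expectation-zero term (1 - rho) * v(s)."""
        g = v[states[-1]]
        for r, s, rho in zip(reversed(rewards), reversed(states[:-1]), reversed(rhos)):
            g = rho * (r + gamma * g) + (1.0 - rho) * v[s]
        return g

    if __name__ == "__main__":
        v = {0: 0.5, 1: 0.4, 2: 0.3}            # current value estimates
        rewards = [1.0, 0.0]                    # R_{t+1}, R_{t+2}
        states = [0, 1, 2]                      # S_t, S_{t+1}, S_{t+2}
        on_policy = [1.0, 1.0]                  # ratios of 1: both returns coincide
        assert abs(n_step_return(rewards, states, on_policy, v, 0.9)
                   - n_step_return_cv(rewards, states, on_policy, v, 0.9)) < 1e-12
    ```

    In the on-policy case (all ratios equal to 1) the correction term vanishes and the two returns are identical; off-policy, the control-variate version keeps the same expectation while damping the variance introduced by the ratios.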
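    The frequency-domain view of discounting can also be sketched. The snippet below is an illustrative assumption, not the thesis's algorithm: it uses a bare exponentially-discounted trace (a ← x + γa) with a complex γ = g·e^{iω} inside the unit circle, so the trace accumulates an exponentially-weighted Fourier coefficient of the signal at angular frequency ω, and ω = 0 recovers ordinary discounting. The signal and parameter values are made up for the example:

    ```python
    import cmath
    import math

    def discounted_trace(signal, gamma):
        """Online discounted sum a <- x + gamma * a; with complex gamma it acts
        as an exponentially-weighted frequency detector at omega = arg(gamma)."""
        a = 0j
        for x in signal:
            a = x + gamma * a
        return a

    n = 2000
    signal = [math.sin(0.5 * t) for t in range(n)]      # energy at omega = 0.5

    g = 0.99                                            # magnitude inside the unit circle
    on_freq = discounted_trace(signal, g * cmath.exp(0.5j))   # matched frequency
    off_freq = discounted_trace(signal, g * cmath.exp(2.0j))  # unmatched frequency

    # The trace magnitude is large only where the signal has energy.
    assert abs(on_freq) > 10 * abs(off_freq)
    ```

    A real-valued γ (ω = 0) makes this the familiar discounted sum, matching the abstract's remark that standard discounting corresponds to the zero-frequency case.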

  • Graduation date
    Fall 2018
  • Degree
    Master of Science
  • License
    Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.