A Unified View of Multi-step Temporal Difference Learning

Kristopher De Asis

doi:10.7939/R3GH9BR75

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Author / Creator

Kristopher De Asis
Temporal-difference (TD) learning is an important approach for predictive knowledge representation and sequential decision making. Within TD learning exist multi-step methods, which unify one-step TD learning and Monte Carlo methods in such a way that intermediate algorithms can outperform either extreme. Multi-step TD methods allow a practitioner to address a bias-variance trade-off between reliance on current estimates, which could be poor, and incorporating longer sequences of sampled information, which could have large variance. In this dissertation, we investigate an extension of multi-step TD learning aimed at reducing the variance in the estimates, and provide a unified view of the space of multi-step TD algorithms.
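As a rough illustration of the trade-off described above, the sketch below computes an n-step return: n sampled rewards followed by a bootstrap from a current value estimate. It follows the standard textbook form rather than anything specific to this thesis; the function name `n_step_return` and the example numbers are assumptions made for illustration.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """Discounted sum of sampled rewards plus a discounted bootstrap estimate.

    rewards: the n sampled rewards R_{t+1}, ..., R_{t+n}
    bootstrap_value: current estimate V(S_{t+n}); use 0.0 if the episode ended
    gamma: discount rate in [0, 1]
    """
    g = bootstrap_value
    # Work backwards so each earlier step applies one more factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical data: three sampled rewards and a value estimate of 0.5.
rewards = [1.0, 0.0, 2.0]
one_step = n_step_return(rewards[:1], bootstrap_value=0.5, gamma=0.9)
three_step = n_step_return(rewards, bootstrap_value=0.5, gamma=0.9)
```

With n = 1 the target leans almost entirely on the current estimate (the one-step TD case); as n grows toward the episode length, the bootstrap term vanishes and the target approaches the Monte Carlo return.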
In Monte Carlo methods, information about the error of a known quantity is sometimes incorporated in an attempt to reduce the error in the estimation of an unknown quantity. This is known as the method of control variates, and has not been extensively explored in TD learning. We show that control variates can be formulated in multi-step TD learning, and demonstrate that they improve learning speed and accuracy. We then show how the inclusion of control variates provides a deeper understanding of how n-step TD methods relate to TD(λ) algorithms.
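To make the general idea concrete, here is a minimal sketch of control variates in plain Monte Carlo estimation, not the TD formulation developed in the thesis. It estimates E[exp(U)] for U uniform on [0, 1], using U itself (whose mean 0.5 is known) as the control variate. The helper names and the choice of problem are assumptions for illustration.

```python
import math
import random


def sample_variance(xs):
    """Unbiased sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def control_variate_estimate(f_samples, g_samples, g_mean):
    """Estimate E[f] by correcting each sample with the observed error in g.

    g_mean is the known expectation of g. The coefficient c is fitted from
    the same samples, which adds a small bias that shrinks with sample size.
    """
    n = len(f_samples)
    f_bar = sum(f_samples) / n
    g_bar = sum(g_samples) / n
    cov = sum((f - f_bar) * (g - g_bar)
              for f, g in zip(f_samples, g_samples)) / (n - 1)
    c = cov / sample_variance(g_samples)
    corrected = [f - c * (g - g_mean) for f, g in zip(f_samples, g_samples)]
    return sum(corrected) / n, corrected


random.seed(0)
u = [random.random() for _ in range(10_000)]
f = [math.exp(x) for x in u]

plain_mean = sum(f) / len(f)
corrected_mean, corrected = control_variate_estimate(f, u, g_mean=0.5)
```

Both estimators target the same quantity (e − 1), but the corrected samples have much lower variance because exp(U) and U are strongly correlated.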
We then look at a previously proposed method to unify the space of n-step TD algorithms, the n-step Q(σ) algorithm. We provide empirical results and analyze properties of this algorithm, suggest an improvement based on insight from the control variates, and derive the TD(λ) version of the algorithm. This generalization can recover existing multi-step TD algorithms as special cases, providing an alternative, unified view of them.
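The unifying role of σ can be seen in the one-step case. The sketch below uses the commonly stated one-step Q(σ) target, in which σ interpolates between sampling the next action (as in Sarsa) and taking an expectation over the policy (as in Expected Sarsa and Tree-backup). It is a simplified illustration under that assumption, not the n-step or TD(λ) algorithms of the thesis, and the function and argument names are hypothetical.

```python
def q_sigma_target(reward, gamma, sigma, q_next, pi_next, next_action):
    """One-step Q(sigma) target.

    q_next: action-value estimates Q(S', a) for each action a
    pi_next: target-policy probabilities pi(a | S') for each action a
    next_action: index of the action actually sampled in S'
    sigma: 1.0 uses the sampled action value, 0.0 uses the expectation
    """
    expected = sum(p * q for p, q in zip(pi_next, q_next))
    sampled = q_next[next_action]
    return reward + gamma * (sigma * sampled + (1.0 - sigma) * expected)


# Hypothetical two-action example.
q_next = [1.0, 3.0]
pi_next = [0.5, 0.5]

sarsa_like = q_sigma_target(1.0, 0.9, 1.0, q_next, pi_next, next_action=1)
expected_like = q_sigma_target(1.0, 0.9, 0.0, q_next, pi_next, next_action=1)
mixed = q_sigma_target(1.0, 0.9, 0.5, q_next, pi_next, next_action=1)
```

Setting σ to 1 or 0 recovers the two familiar targets, and intermediate values blend them.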
Lastly, we bring attention to the discount rate in TD learning. The discount rate is typically used to specify the horizon of interest in sequential decision making problems, but we introduce an alternate view of the parameter with insight from digital signal processing. By allowing the discount rate to take on complex values within the complex unit circle, we can extend the types of knowledge learnable by a TD agent into the frequency domain. This allows for online and incremental estimation of the extent to which particular frequencies exist in a signal, with the standard discounting framework corresponding to the zero-frequency case.
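The following sketch illustrates why a complex discount rate picks out frequencies. It computes discounted returns directly from a recorded signal rather than learning them with TD, and it assumes a discount of the form γ = r·exp(−iω) with r < 1; that parameterization, the function name, and the test signal are assumptions for illustration, not the thesis's algorithm.

```python
import cmath
import math


def complex_discounted_return(signal, magnitude, omega):
    """Return sum_k gamma^k * signal[k] with gamma = magnitude * exp(-i * omega)."""
    gamma = magnitude * cmath.exp(-1j * omega)
    g = 0j
    # Same backward recursion as an ordinary return: G = x + gamma * G'.
    for x in reversed(signal):
        g = x + gamma * g
    return g


# Hypothetical signal: a cosine with a period of 20 steps.
omega0 = 2 * math.pi / 20
signal = [math.cos(omega0 * t) for t in range(2000)]

matched = complex_discounted_return(signal, 0.99, omega0)
mismatched = complex_discounted_return(signal, 0.99, 2 * math.pi / 7)
zero_freq = complex_discounted_return(signal, 0.99, 0.0)
```

The magnitude of the return is large when ω matches a frequency present in the signal and small otherwise, while ω = 0 reduces to the ordinary real-valued discounted return.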
Graduation date

Fall 2018
Type of Item

Thesis
Degree

Master of Science
DOI

https://doi.org/10.7939/R3GH9BR75
License

Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.

Language

English
Institution

University of Alberta
Degree level

Master's
Department
- Department of Computing Science
Supervisor / co-supervisor and their department(s)
- Sutton, Rich (Computing Science)