Gradient Temporal-Difference Learning Algorithms

Maei, Hamid Reza

doi:doi:10.7939/R30Q50

This decommissioned ERA site remains active temporarily to support our final migration steps to https://ualberta.scholaris.ca, ERA's new home. All new collections and items, including Spring 2025 theses, are at that site. For assistance, please contact erahelp@ualberta.ca.

View

Download

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Usage

3046 views
3022 downloads

Gradient Temporal-Difference Learning Algorithms

Author / Creator

Maei, Hamid Reza
We present a new family of gradient temporal-difference (TD) learning methods with function approximation whose complexity, both in terms of
memory and per-time-step computation, scales linearly with the number of learning parameters. TD methods are powerful prediction techniques, and with function approximation form a core part of modern reinforcement learning (RL). However, the most popular TD methods, such as TD(lambda), Q-learning and Sarsa, may become unstable and diverge when combined with function approximation. In particular, convergence cannot be guaranteed for these methods when they are used with off-policy training. Off-policy training---training on data from one policy in order to learn the value of another---is useful in dealing with the exploration-exploitation tradeoff. As function approximation is needed for large-scale applications, this stability problem is a key impediment to extending TD methods to real-world large-scale problems. The new family of TD algorithms, also called gradient-TD methods, are based on stochastic gradient-descent in a Bellman error objective function. We provide convergence proofs for general settings, including off-policy learning with unrestricted features, and nonlinear function approximation. Gradient-TD algorithms are on-line, incremental, and extend conventional TD methods to off-policy learning while retaining a convergence guarantee and only doubling computational requirements.
Our empirical results suggest that many members of the gradient-TD algorithms may be slower than conventional TD on the subset of training cases in which conventional TD methods are sound. Our latest gradient-TD algorithms are ``hybrid" in that they become equivalent to conventional TD---in terms of asymptotic rate of convergence---in on-policy problems.
Subjects / Keywords
Graduation date

Fall 2011
Type of Item

Thesis
Degree

Doctor of Philosophy
DOI

https://doi.org/10.7939/R30Q50
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Doctoral
Department
- Department of Computing Science
Supervisor / co-supervisor and their department(s)
- Richard S. Sutton (Computing Science)
Examining committee members and their departments
- Csaba Szepesvari (Computing Science)
- Geoffrey J. Gordon (Machine Learning Department, Carnegie Mellon University)
- Marek Reformat (Electrical and Computer Engineering)
- Dale Schuurmans (Computing Science)