Download the full-sized PDF of Incremental Off-policy Reinforcement Learning AlgorithmsDownload the full-sized PDF



Permanent link (DOI):


Export to: EndNote  |  Zotero  |  Mendeley


This file is in the following communities:

Graduate Studies and Research, Faculty of


This file is in the following collections:

Theses and Dissertations

Incremental Off-policy Reinforcement Learning Algorithms Open Access


Other title
reinforcement learning
incremental learning
off-policy learning
Type of item
Degree grantor
University of Alberta
Author or creator
Mahmood, Ashique
Supervisor and department
Sutton, Richard S (Computing Science)
Examining committee member and department
Munos, Remi (Google DeepMind)
Szepesvari, Csaba (Computing Science)
Pilarski, Patrick (Medicine)
Bowling, Michael (Computing Science)
Department of Computing Science
Statistical Machine Learning
Date accepted
Graduation date
2017-11:Fall 2017
Doctor of Philosophy
Degree level
Model-free off-policy temporal-difference (TD) algorithms form a powerful component of scalable predictive knowledge representation due to their ability to learn numerous counter- factual predictions in a computationally scalable manner. In this dissertation, we address and overcome two shortcomings of off-policy TD algorithms that preclude a widespread use in knowledge representation: they often produce high variance estimates and may become unstable. The issue of high variance occurs when off-policy TD algorithms use importance sampling (IS), a standard approach to adjusting estimates when the sample and the target distributions are different. IS performs this adjustment by scaling the estimate by the likelihood ratio, also known as the importance sampling ratio. However, this ratio also often produces high variance in the estimates. We provide two different approaches to overcome this shortcoming. The first approach relies on a well-known alternative to the ordinary form of IS, called the Weighted Importance Sampling (WIS), which is known to produce much less variance but has been under-utilized in reinforcement learning tasks. The main obstacle to using WIS in large-scale reinforcement learning tasks is that it is not known how to systematically extend a simple lookup-table-based estimator such as WIS to parametric function approximation with real-time computationally scalable updates. We overcome this obstacle by developing new techniques for deriving computationally efficient equivalent TD updates. This also allows us to simplify the derivations of some existing computationally efficient algorithms as well as produce new ones. In the second approach, we directly address the main factor underlying the high variance issue: the use of the IS ratios. Ratios play a key role in producing high variance estimation as they may vary drastically from one sample to another. We show that the ratios can be eliminated from the updates by varying the amount of bootstrapping, a sophisticated technique for allowing a spectrum of multi-step TD algorithms. The core idea is to allow the bootstrapping parameter, which controls the amount of bootstrapping, vary and shrink in a specific way. The elimination of ratios from the updates using this approach results directly in the reduction of variance compared to IS-based algorithms while still retaining the multi-step advantage of TD algorithms. We empirically evaluate the algorithms based on both approaches using several idealized experiments. We show that the WIS-based algorithms outperform exist- ing off-policy learning algorithms often with an order of magnitude margin. On the other hand, the bootstrapping-based methods can retain more effectively the multi-step advantage compared to the IS-based methods despite occasionally shrinking the bootstrapping parameter. The issue of instability with off-policy TD algorithms may appear when both function approximation and bootstrapping are used together, a desirable combination in reinforcement learning tasks. We discuss the key factors responsible for instability by analyzing the deterministic counterpart of TD updates. We illustrate that the on-policy TD algorithm may also be unstable if the updates are scaled or the bootstrapping parameter is set differently for different states but remains stable with a state-dependent discount factor. The TD algorithm can be made stable in all these cases by ensuring that a certain matrix involved in the deterministic TD update is always positive definite. We provide a systematic approach to ensuring positive definiteness of this matrix, which amounts to selectively emphasizing and de-emphasizing the updates of the stochastic counterpart in a particular manner. The resulting algorithm, which we call the emphatic TD algorithm, is stable, do not introduce extra tunable parameters and its additional memory or computational complexity is negligible. Finally, we provide computationally efficient methods to perform a search over features for improving the representation of learning systems. These search methods enable continual discovery and maintenance of relevant and useful predictions that can be used as a building block of scalable predictive knowledge representations in long-lived systems.
This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for the purpose of private, scholarly or scientific research. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
Citation for previous publication
Mahmood, A. R., Sutton, R. S. (2013). Representation search through generate and test. In AAAI Workshop: Learning Rich Representations from Low-Level Sensors.Mahmood, A. R., van Hasselt, H., Sutton, R. S. (2014). Weighted importance sampling for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems 27, Montreal, Canada.Mahmood, A. R., Sutton, R. S. (2015). Off-policy learning based on weighted importance sampling with linear computational complexity. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence.Mahmood, A. R., Yu, H., White, M., Sutton, R. S. (2015). Emphatic temporal-difference learning. European Workshop on Reinforcement Learning 12,arXiv preprint ArXiv:1507.01569.Mahmood, A. R., Yu, H., Sutton, R. S. (2017). Multi-step off-policy learning without importance sampling ratios. arXiv preprint arXiv:1702.03006.Sutton, R. S., Mahmood, A. R., White, M. (2016). An emphatic approach to the problem of off-policy temporal-difference learning. Journal of Machine Learning Research 17, (73):1–29.

File Details

Date Uploaded
Date Modified
Audit Status
Audits have not yet been run on this file.
File format: pdf (Portable Document Format)
Mime type: application/pdf
File size: 27264746
Last modified: 2017:11:08 17:48:17-07:00
Filename: Mahmood_Ashique_201709_PhD.pdf
Original checksum: 435030e201ac7bd65e7b39013b6f7564
Well formed: false
Valid: false
Status message: Invalid page tree node offset=743269
Status message: Unexpected error in findFonts java.lang.ClassCastException: edu.harvard.hul.ois.jhove.module.pdf.PdfSimpleObject cannot be cast to edu.harvard.hul.ois.jhove.module.pdf.PdfDictionary offset=6866
Status message: Invalid name tree offset=27229895
Status message: Invalid name tree offset=27229895
Status message: Invalid name tree offset=27229895
File title: List of Figures
Activity of users you follow
User Activity Date