Incremental Off-policy Reinforcement Learning Algorithms

  • Author / Creator
    Mahmood, Ashique
  • Abstract
    Model-free off-policy temporal-difference (TD) algorithms form a powerful component of scalable predictive knowledge representation due to their ability to learn numerous counterfactual predictions in a computationally scalable manner. In this dissertation, we address and overcome two shortcomings of off-policy TD algorithms that preclude their widespread use in knowledge representation: they often produce high-variance estimates and may become unstable. The issue of high variance arises when off-policy TD algorithms use importance sampling (IS), the standard approach to adjusting estimates when the sample and target distributions differ. IS performs this adjustment by scaling the estimate by the likelihood ratio, also known as the importance sampling ratio. However, this ratio often introduces high variance into the estimates. We provide two different approaches to overcoming this shortcoming. The first relies on a well-known alternative to the ordinary form of IS, called weighted importance sampling (WIS), which is known to produce much less variance but has been under-utilized in reinforcement learning tasks. The main obstacle to using WIS in large-scale reinforcement learning tasks is that it has not been known how to systematically extend a simple lookup-table-based estimator such as WIS to parametric function approximation with real-time, computationally scalable updates. We overcome this obstacle by developing new techniques for deriving computationally efficient equivalent TD updates. These techniques also allow us to simplify the derivations of some existing computationally efficient algorithms as well as to produce new ones. In the second approach, we directly address the main factor underlying the high-variance issue: the use of the IS ratios. The ratios play a key role in producing high-variance estimates, as they may vary drastically from one sample to another.
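As a hedged illustration of the variance contrast described above (the estimators below are the textbook lookup-table forms, not the dissertation's function-approximation extensions), ordinary IS averages ratio-scaled returns, so a single large ratio can push the estimate far outside the range of the observed returns, whereas WIS normalizes by the sum of the ratios and always stays inside that range:

```python
def is_estimate(returns, rhos):
    # Ordinary importance sampling: average of ratio-scaled returns.
    return sum(rho * g for rho, g in zip(rhos, returns)) / len(returns)

def wis_estimate(returns, rhos):
    # Weighted importance sampling: normalize by the sum of the ratios,
    # making the estimate a convex combination of the observed returns.
    return sum(rho * g for rho, g in zip(rhos, returns)) / sum(rhos)

# With returns in [0, 1], one large ratio pushes the IS estimate outside
# that interval, while the WIS estimate remains inside it.
returns = [1.0, 0.0, 1.0]
rhos = [10.0, 0.1, 0.1]
```

Here the IS estimate evaluates to about 3.37, well outside [0, 1], while the WIS estimate stays near 0.99, illustrating why WIS is the lower-variance choice.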
    We show that the ratios can be eliminated from the updates by varying the amount of bootstrapping, a sophisticated technique underlying the spectrum of multi-step TD algorithms. The core idea is to allow the bootstrapping parameter, which controls the amount of bootstrapping, to vary and shrink in a specific way. Eliminating the ratios from the updates in this way directly reduces variance compared to IS-based algorithms while still retaining the multi-step advantage of TD algorithms. We empirically evaluate the algorithms based on both approaches in several idealized experiments. We show that the WIS-based algorithms outperform existing off-policy learning algorithms, often by an order of magnitude. The bootstrapping-based methods, on the other hand, retain the multi-step advantage more effectively than the IS-based methods, despite occasionally shrinking the bootstrapping parameter. The issue of instability with off-policy TD algorithms may appear when function approximation and bootstrapping are used together, a desirable combination in reinforcement learning tasks. We discuss the key factors responsible for instability by analyzing the deterministic counterpart of the TD updates. We illustrate that even the on-policy TD algorithm may be unstable if the updates are scaled or the bootstrapping parameter is set differently for different states, but that it remains stable with a state-dependent discount factor. The TD algorithm can be made stable in all these cases by ensuring that a certain matrix involved in the deterministic TD update is always positive definite. We provide a systematic approach to ensuring the positive definiteness of this matrix, which amounts to selectively emphasizing and de-emphasizing the updates of the stochastic counterpart in a particular manner.
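The flavour of the ratio-elimination idea can be sketched with a toy calculation (the function name and the specific shrinking rule `min(lam, 1/rho)` are illustrative assumptions, not the scheme derived in the dissertation): if the per-step bootstrapping parameter shrinks whenever the importance sampling ratio is large, the product of the two, which is the factor that would multiply an eligibility trace, stays bounded no matter how extreme the ratio:

```python
def effective_trace_factors(rhos, lam=1.0):
    """Shrink the per-step bootstrapping parameter so that the product
    lambda_t * rho_t stays bounded even for very large ratios (sketch)."""
    factors = []
    for rho in rhos:
        # Shrink bootstrapping exactly when the ratio would be dangerous.
        lam_t = min(lam, 1.0 / rho) if rho > 0 else lam
        factors.append(lam_t * rho)  # capped at 1 regardless of rho
    return factors

print(effective_trace_factors([0.5, 10.0, 2.0]))  # [0.5, 1.0, 1.0]
```

The trade-off is visible in the output: small ratios pass through untouched, while large ratios trigger less bootstrapping (more reliance on the current value estimate), which is the "occasionally shrinking the bootstrapping parameter" cost mentioned above.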
    The resulting algorithm, which we call the emphatic TD algorithm, is stable, introduces no extra tunable parameters, and adds negligible memory and computational cost. Finally, we provide computationally efficient methods for searching over features to improve the representation of learning systems. These search methods enable the continual discovery and maintenance of relevant and useful predictions that can serve as building blocks of scalable predictive knowledge representations in long-lived systems.
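A minimal sketch of the emphasis idea for the one-step (TD(0)) case with linear features (the function name, argument layout, and constants are illustrative assumptions, not the dissertation's notation): a follow-on trace accumulates discounted, ratio-weighted interest, and the TD update is scaled by it, which is the selective emphasis described above:

```python
def etd0_step(w, x, x_next, r, rho, discounted_followon,
              alpha=0.05, gamma=0.9, interest=1.0):
    """One emphatic TD(0) update with linear features (illustrative sketch).
    `discounted_followon` stands for gamma * rho_{t-1} * F_{t-1},
    maintained by the caller across steps."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    F = interest + discounted_followon               # follow-on (emphasis) trace
    delta = r + gamma * dot(w, x_next) - dot(w, x)   # TD error
    # Scaling the update by F is what restores positive definiteness of the
    # matrix in the deterministic counterpart of the update.
    w = [wj + alpha * F * rho * delta * xj for wj, xj in zip(w, x)]
    return w, F
```

Starting from zero weights, a single step with reward 1 moves the value of the visited feature toward the TD target, with the step size modulated by the emphasis F.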

  • Graduation date
    Fall 2017
  • Type of Item
    Thesis
  • Degree
    Doctor of Philosophy
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
  • Language
    English
  • Institution
    University of Alberta
  • Degree level
    Doctoral
  • Specialization
    • Statistical Machine Learning
  • Examining committee members and their departments
    • Pilarski, Patrick (Medicine)
    • Munos, Remi (Google DeepMind)
    • Szepesvari, Csaba (Computing Science)
    • Bowling, Michael (Computing Science)