This file is in the following communities:

Faculty of Graduate Studies and Research


This file is in the following collections:

Theses and Dissertations

Toward Practical Reinforcement Learning Algorithms: Classification Based Policy Iteration and Model-Based Learning
Open Access


Subject/Keyword
Model-Based Reinforcement Learning
Reinforcement Learning
Machine Learning
Factored Linear Models
Factored Semi-linear Models
Classification-Based Policy Iteration
Classification Calibration
Type of item
Thesis
Degree grantor
University of Alberta
Author or creator
Ávila Pires, Bernardo
Supervisor and department
Csaba Szepesvári (Computing Science)
Examining committee member and department
Dale Schuurmans (Computing Science)
András György (Electrical and Electronic Engineering, Imperial College London)
Ivan Mizera (Mathematical and Statistical Sciences)
Ambuj Tewari (Statistics, University of Michigan)
Department of Computing Science
Statistical Machine Learning
Graduation date
Spring 2017 (2017-06)
Degree level
Doctor of Philosophy
In this dissertation, we advance the theoretical understanding of two families of reinforcement learning (RL) methods: classification-based policy iteration (CBPI) and model-based reinforcement learning (MBRL) with factored semi-linear models. In contrast to generalized policy iteration, CBPI does not rely on value-function estimates (value estimates for all states and actions). Instead, CBPI uses a classifier to learn how to discriminate value-maximizing actions based on value estimates for a set of observed states and actions. This creates the potential for learning effective policies in settings where estimating value functions is challenging, but where good value estimates can be obtained and good actions can be distinguished from sub-optimal ones. Previous theoretical work on CBPI has required classifiers that are computed by solving a combinatorial problem, which we can expect to be computationally hard (with minimization of the infamous 0-1 loss as a special case). In contrast, classifiers that are computed by solving convex minimization problems (which can be done efficiently) enjoy limited or no performance guarantees, namely, bounds on the cost-sensitive generalization of the misclassification probability, the so-called "true risk". Therefore, we only have instances of CBPI that enjoy theoretical guarantees but cannot be used in practice, or vice versa. We present a theoretical analysis of CBPI that fills this gap, by providing theoretical guarantees for instances of CBPI that can be used in practice (by virtue of the classification methods used). Our analysis extends an existing theoretical analysis of CBPI, and incorporates performance guarantees for classification methods that minimize a cost-sensitive multiclass convex surrogate loss that generalizes the hinge loss. The hinge loss has been widely used in the design of classification methods, including the popular support vector machines (SVMs).
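The CBPI loop described in the abstract can be sketched in a few lines. The following is an illustrative toy, assuming a hypothetical two-state deterministic MDP and a tabular "classifier" (a lookup table over states); the dissertation's actual algorithms, rollout schemes, and classifier families are far more general.

```python
N_STATES, N_ACTIONS, GAMMA = 2, 2, 0.9

def step(s, a):
    """Toy deterministic MDP: action 0 stays (reward 0); action 1 moves to the
    next state and yields a reward equal to the current state index."""
    return (s, 0.0) if a == 0 else ((s + 1) % N_STATES, float(s))

def rollout_value(s, a, policy, horizon=20):
    """Truncated-rollout estimate of Q(s, a) under `policy` (a single rollout
    suffices here because the toy MDP is deterministic)."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        s, r = step(s, a)
        total += discount * r
        discount *= GAMMA
        a = policy[s]
    return total

def cbpi(n_iters=5):
    policy = [0] * N_STATES  # start with "always take action 0"
    for _ in range(n_iters):
        # Label each observed state with a value-maximizing (greedy) action,
        # using rollout value estimates only at those states and actions...
        labels = []
        for s in range(N_STATES):
            q = [rollout_value(s, a, policy) for a in range(N_ACTIONS)]
            labels.append(max(range(N_ACTIONS), key=q.__getitem__))
        # ...then "train the classifier" on the (state, label) pairs; with a
        # tabular state space the learned classifier is just this lookup table.
        policy = labels
    return policy
```

Here `cbpi()` converges to the policy `[1, 1]`, which cycles between the two states to collect reward; the point of the sketch is that the policy improves using value estimates only at the sampled state-action pairs, with a classifier standing in for a full value function.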
As part of our analysis, we also present results for cost-sensitive multiclass classification: novel expected surrogate-loss (surrogate-risk) bounds, as well as tools for converting surrogate risk bounds into true risk bounds. This is done with the help of novel calibration functions that are specific to cost-sensitive multiclass classification losses. Moreover, our analysis of CBPI can be easily adjusted to accommodate different classification methods, provided that the corresponding surrogate risk bounds and calibration functions are available. We also present policy error bounds for MBRL methods that use factored semi-linear models. The factored semi-linear model framework generalizes the factored linear model framework and many existing MBRL methods. Notably, factored semi-linear models generalize a recent trend of MBRL methods that depart from learning "traditional" MDP models in order to achieve flexibility, computational efficiency, data efficiency and good empirical performance. As such, factored semi-linear models are both flexible and geared toward efficient policy computation, with instances that have been shown to be promising in practice. The policy error bounds that we present improve the previously existing bounds by relaxing conditions, refining the bounds themselves, and increasing the scope of models that they apply to, namely, factored semi-linear models. These bounds allow us to understand the policy error in norms other than the overly harsh supremum norm. For example, our L_p(μ)-norm results allow us to see that policy error bounds for MBRL methods with factored semi-linear models are less sensitive to covariate shift than policy error bounds for competing methods, such as approximate linear programming or approximate dynamic programming methods. This robustness suggests that MBRL methods with factored semi-linear models have much potential to be a valid alternative to popular non-model-based RL methods.
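The surrogate-to-true-risk relationship discussed above can be made concrete with a small example. The function below is a Crammer–Singer-style cost-sensitive multiclass hinge loss, given only as an illustrative assumption (the dissertation's exact surrogate may differ): it is convex in the score vector, and it pointwise upper-bounds the cost incurred by the score-maximizing action. Calibration functions sharpen exactly this kind of pointwise relationship into bounds that convert surrogate risk into true risk.

```python
def true_cost(scores, costs):
    """Cost incurred by the classifier's decision: pick the argmax-score
    action, pay its cost."""
    pred = max(range(len(scores)), key=scores.__getitem__)
    return costs[pred]

def surrogate_loss(scores, costs):
    """Cost-sensitive multiclass hinge (Crammer-Singer style, used here for
    illustration): max_a (costs[a] + scores[a]) - scores[y*], where y* is a
    minimum-cost action. A max of affine functions of the scores, hence convex,
    and it upper-bounds true_cost: the max term at a = pred gives
    costs[pred] + scores[pred] - scores[y*] >= costs[pred]."""
    best = min(range(len(costs)), key=costs.__getitem__)
    return max(c + s for c, s in zip(costs, scores)) - scores[best]
```

For example, with `scores = [0.2, 1.0, -0.5]` and `costs = [0.0, 0.3, 1.0]`, the classifier picks action 1 and pays `0.3`, while the surrogate evaluates to `1.1`, so minimizing the surrogate drives down an upper bound on the incurred cost.
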
This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for the purpose of private, scholarly or scientific research. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
Citation for previous publication
Ávila Pires, B. and Szepesvári, C. (2016a). Multiclass classification calibration functions. CoRR, arXiv:0902.0885.
Ávila Pires, B. and Szepesvári, C. (2016b). Policy error bounds for model-based reinforcement learning with factored linear models. In Feldman, V. and Rakhlin, A., editors, Proceedings of the 29th Annual Conference on Learning Theory, volume 49, pages 1–31.

File Details

File format: pdf (Portable Document Format)
Mime type: application/pdf
File size: 1228172 bytes
Last modified: 2017-06-13 12:24:24-06:00
Filename: AvilaPires_Bernardo_201611_PhD.pdf
Original checksum: 1a08f75f95d11f175621742835ab1140
Copyright note: Copyright © 2016 Bernardo Ávila Pires
Well formed: true
Valid: true
File title: Toward Practical Reinforcement Learning Algorithms: Classification Based Policy Iteration and Model-Based Learning
File author: Bernardo Ávila Pires
Page count: 127