Toward Practical Reinforcement Learning Algorithms: Classification Based Policy Iteration and Model-Based Learning

Ávila Pires, Bernardo

doi:doi:10.7939/R38P5VN6W

This decommissioned ERA site remains active temporarily to support our final migration steps to https://ualberta.scholaris.ca, ERA's new home. All new collections and items, including Spring 2025 theses, are at that site. For assistance, please contact erahelp@ualberta.ca.

View

Download

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Usage

320 views
905 downloads

Toward Practical Reinforcement Learning Algorithms: Classification Based Policy Iteration and Model-Based Learning

Author / Creator

Ávila Pires, Bernardo
In this dissertation, we advance the theoretical understanding of two families of Reinforcement Learning (RL) methods: Classification-based policy iteration (CBPI) and model-based reinforcement learning (MBRL) with factored semi-linear models. In contrast to generalized policy iteration, CBPI does not rely on value-function estimates (value estimates for all states and actions). Instead, CBPI uses a classifier to learn how to discriminate value-maximizing actions based on value estimates for a set of observed states and actions. This creates the potential for learning effective policies in settings where estimating value functions is challenging, but where good value estimates can be obtained and good actions can be distinguished from sub-optimal ones. Previous theoretical work on CBPI has required classifiers that are computed by solving a combinatorial problem, which we can expect to be computationally hard (with minimization of the infamous 0-1 loss as a special case). In contrast, classifiers that are computed by solving convex minimization problems (which can be done efficiently) enjoy limited or no performance guarantees, namely, bounds on the cost-sensitive generalization of the misclassification probability, the so-called "true risk". Therefore, we only have instances of CBPI that enjoy theoretical guarantees but cannot be used in practice, or vice-versa. We present a theoretical analysis of CBPI that fills this gap, by providing theoretical guarantees for instances of CBPI that can be used in practice (by virtue of the classification methods used). Our analysis extends an existing theoretical analysis of CBPI, and incorporates performance guarantees for classification methods that minimize a cost-sensitive multiclass convex surrogate loss that generalizes the hinge loss. The hinge loss has been widely used in the design of classification methods, including the popular support vector machines (SVMs). As part of our analysis, we also present results for cost-sensitive multiclass classification: Novel expected surrogate loss (surrogate-risk) bounds, as well as tools for converting surrogate risk bounds into true risk bounds. This is done with the help of novel calibration functions that are specific to cost-sensitive multiclass classification losses. Moreover, our analysis of CBPI can be easily adjusted to accommodate for different classification methods, provided that the corresponding surrogate risk bounds and calibration functions are available. We also present policy error bounds for MBRL methods that use factored semi-linear models. The factored semi-linear model framework generalizes the factored linear model framework and many existing MBRL methods. Notably, factored semi-linear models generalize a recent trend of MBRL methods that depart from learning "traditional" MDP models in order to achieve flexibility, computational efficiency, data efficiency and good empirical performance. As such, factored semi-linear models are both flexible and geared toward efficient policy computation, with instances that have been shown to be promising in practice. The policy error bounds that we present improve the previously existing bounds by relaxing conditions, refining the bounds themselves, and increasing the scope of models that they apply to---namely, factored semi-linear models. These bounds allow us to understand the policy error in norms other than the overly-harsh supremum norm. For example, our Lp(mu) norm results allow us to see that policy error bounds for MBRL methods with factored semi-linear models are less sensitive to covariate-shift than policy error bounds for competing methods, such as approximate linear programming or approximate dynamic programming methods. This robustness suggests that MBRL methods with factored semi-linear models have much potential to be a valid alternative to popular non-model-based RL methods.
Subjects / Keywords
Graduation date

Spring 2017
Type of Item

Thesis
Degree

Doctor of Philosophy
DOI

https://doi.org/10.7939/R38P5VN6W
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Doctoral
Department
- Department of Computing Science
Specialization
- Statistical Machine Learning
Supervisor / co-supervisor and their department(s)
- Csaba Szepesvári (Computing Science)
Examining committee members and their departments
- Dale Schuurmans (Computing Science)
- Ivan Mizera (Mathematical and Statistical Sciences)
- András György (Electrical and Electronic Engineering, Imperial College London)
- Ambuj Tewari (Statistics, University of Michigan)