Model-based Reinforcement Learning with State and Action Abstractions

Yao,Hengshuai

doi:doi:10.7939/R3NS0M57R

This decommissioned ERA site remains active temporarily to support our final migration steps to https://ualberta.scholaris.ca, ERA's new home. All new collections and items, including Spring 2025 theses, are at that site. For assistance, please contact erahelp@ualberta.ca.

View

Download

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Usage

271 views
837 downloads

Model-based Reinforcement Learning with State and Action Abstractions

Author / Creator

Yao,Hengshuai
In model-based reinforcement learning a model is learned which is then used to find good actions. What model to learn? We investigate these questions in the context of two different approaches to model-based reinforcement learning. We also investigate how one should learn and plan when the reward function may change or may not be specified during learning. We propose an off-line API algorithm that uses linear action models to find an approximate policy. We show that the new algorithm performs comparably to LSPI, and often converges much quicker. We propose a so-called pseudo-MDPs framework. In this framework, we learn an optimal policy in the pseudo-MDP and then pull it back to the original MDP. We give a performance error bound for the approach. Surprisingly, the error bound shows that the quality of the policy derived from an optimal policy of the pseudo-MDP is governed only by the policy evaluation errors of an optimal policy in the original MDP and the ``pull-back'' policy of an optimal policy in the pseudo-MDP. The performance error bound of the recent kernel embedding AVI can be derived using our error bound. The pseudo-MDP framework is interesting because it not only includes the kernel embedding model but also opens the door to new models. We introduce a so-called universal option model. The problem we address is temporal abstract planning in an environment where there are multiple reward functions. A traditional approach for this setting requires a significant amount of computation of option return for each reward function. The new model we propose enables a very efficient and simple generation of option returns. We provide algorithms of learning this model as well as planning algorithms for generating returns and value functions. We also prove the convergence of these algorithms.
Subjects / Keywords
Graduation date

Spring 2016
Type of Item

Thesis
Degree

Doctor of Philosophy
DOI

https://doi.org/10.7939/R3NS0M57R
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Doctoral
Department
- Department of Computing Science
Supervisor / co-supervisor and their department(s)
Examining committee members and their departments
- Damien Eanst (Department of Electrical Engineering and Computer Science, University of Liège)
- Marek Reformat (Computing Science)