




This file is in the following communities:

Faculty of Graduate Studies and Research

This file is in the following collections:

Theses and Dissertations

Model-based Reinforcement Learning with State and Action Abstractions
Open Access


Other title
Approximate policy iteration
Reinforcement learning
Approximate value iteration
Type of item
Thesis
Degree grantor
University of Alberta
Author or creator
Hengshuai Yao
Supervisor and department
Csaba Szepesvari (Computing Science)
Dale Schuurmans (Computing Science)
Patrick Pilarski (Computing Science)
Examining committee member and department
Damien Ernst (Department of Electrical Engineering and Computer Science, University of Liège)
Marek Reformat (Computing Science)
Department of Computing Science

Date accepted
Graduation date
Degree level
Doctor of Philosophy
Abstract
In model-based reinforcement learning, a model of the environment is learned and then used to find good actions. What model should be learned, and how should it be used? We investigate these questions in the context of two different approaches to model-based reinforcement learning. We also investigate how one should learn and plan when the reward function may change or may not be specified during learning.

First, we propose an off-line approximate policy iteration (API) algorithm that uses linear action models to find an approximate policy. We show that the new algorithm performs comparably to LSPI, and often converges much more quickly.

Second, we propose a so-called pseudo-MDP framework. In this framework, we learn an optimal policy in the pseudo-MDP and then pull it back to the original MDP. We give a performance error bound for the approach. Surprisingly, the error bound shows that the quality of the policy derived from an optimal policy of the pseudo-MDP is governed only by the policy evaluation errors of an optimal policy in the original MDP and of the "pull-back" policy of an optimal policy in the pseudo-MDP. The performance error bound of the recent kernel-embedding approximate value iteration (AVI) can be derived using our error bound. The pseudo-MDP framework is interesting because it not only includes the kernel-embedding model but also opens the door to new models.

Third, we introduce a so-called universal option model. The problem we address is temporally abstract planning in an environment with multiple reward functions. A traditional approach to this setting requires a significant amount of computation of option returns for each reward function. The new model we propose enables very efficient and simple generation of option returns. We provide algorithms for learning this model as well as planning algorithms for generating returns and value functions, and we prove the convergence of these algorithms.
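The linear action model (LAM) idea underlying the first contribution can be illustrated with a minimal sketch: for each action a, a matrix F_a predicting the next feature vector and a vector f_a predicting the expected reward are fit by least squares from sampled transitions, after which planning runs entirely inside the learned model. The toy two-state chain, the variable names, and the use of a plain value-iteration loop in feature space are all illustrative assumptions for this sketch, not the thesis's exact algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
n_features, n_actions = 2, 2

# Toy deterministic chain with one-hot state features {0, 1}:
# action 0 stays put (reward 0); action 1 jumps to state 1,
# paying reward 1 only when taken from state 1.
def step(s, a):
    if a == 0:
        return s, 0.0
    return 1, (1.0 if s == 1 else 0.0)

def phi(s):
    v = np.zeros(n_features)
    v[s] = 1.0
    return v

# Fit the linear action model by least squares:
# F[a] @ phi(s) ~= E[phi(s')],  f[a] . phi(s) ~= E[r].
F = np.zeros((n_actions, n_features, n_features))
f = np.zeros((n_actions, n_features))
for a in range(n_actions):
    X, Xn, R = [], [], []
    for _ in range(200):
        s = int(rng.integers(2))
        s2, r = step(s, a)
        X.append(phi(s)); Xn.append(phi(s2)); R.append(r)
    X, Xn, R = np.array(X), np.array(Xn), np.array(R)
    F[a] = np.linalg.lstsq(X, Xn, rcond=None)[0].T
    f[a] = np.linalg.lstsq(X, R, rcond=None)[0]

# Plan purely in the learned model: value iteration on theta,
# where V(s) ~= phi(s) . theta and, for one-hot feature e_i,
# Q(e_i, a) = f[a][i] + gamma * (F[a].T @ theta)[i].
theta = np.zeros(n_features)
for _ in range(100):
    q = np.stack([f[a] + gamma * F[a].T @ theta for a in range(n_actions)])
    theta = q.max(axis=0)
```

With one-hot features the least-squares fit recovers the chain exactly, so the loop converges to the tabular optimal values (V(0) = 9, V(1) = 10 at gamma = 0.9), and the greedy policy read off from the model chooses the jump action from state 0.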
This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for the purpose of private, scholarly or scientific research. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
Citation for previous publication
Yao, H., Szepesvári, C., Sutton, R., Bhatnagar, S., and Modayil, J. (2014b). Universal option models. In Advances in Neural Information Processing Systems 27, pages 990–998.
Yao, H., Szepesvári, C., Pires, B. A., and Zhang, X. (2014a). Pseudo-MDPs and factored linear action models. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2014, pages 1–9. IEEE.
Yao, H. and Szepesvári, Cs. (2012). Approximate policy iteration with linear action models. In Hoffmann, J. and Selman, B., editors, Proceedings of the 26th AAAI Conference on Artificial Intelligence (AAAI-12), pages 1212–1217. AAAI Press.

File Details

Date Uploaded
Date Modified
File format: pdf (PDF/A)
Mime type: application/pdf
File size: 2869048 bytes (2.7 MB)
Last modified: 2016-06-16 17:08:47-06:00
Filename: yao_hengshuai_201601_PhD.pdf
Original checksum: f74cee49e7cfef62f2c130e82b83bcd6