Model-based Reinforcement Learning with State and Action Abstractions

  • Author / Creator
  • In model-based reinforcement learning, a model of the environment is learned and then used to find good actions. Which model should be learned? We investigate this question in the context of two different approaches to model-based reinforcement learning. We also investigate how one should learn and plan when the reward function may change or may not be specified during learning. First, we propose an off-line approximate policy iteration (API) algorithm that uses linear action models to find an approximate policy. We show that the new algorithm performs comparably to LSPI and often converges much more quickly. Second, we propose a framework of so-called pseudo-MDPs. In this framework, we learn an optimal policy in the pseudo-MDP and then pull it back to the original MDP. We give a performance error bound for this approach. Surprisingly, the bound shows that the quality of the policy derived from an optimal policy of the pseudo-MDP is governed only by the policy evaluation errors, in the original MDP, of an optimal policy and of the "pull-back" of an optimal policy of the pseudo-MDP. The performance error bound of the recent kernel-embedding AVI method can be derived from our bound. The pseudo-MDP framework is interesting because it not only subsumes the kernel-embedding model but also opens the door to new models. Third, we introduce a so-called universal option model, which addresses temporally abstract planning in environments with multiple reward functions. A traditional approach to this setting requires a significant amount of computation of option returns for each reward function. The new model enables very efficient and simple generation of option returns. We provide algorithms for learning this model, as well as planning algorithms for generating returns and value functions, and we prove the convergence of these algorithms.
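    The planning side of the linear-action-model approach can be sketched as approximate value iteration over a set of support states. This is an illustrative reconstruction, not the thesis's exact algorithm: it assumes per-action matrices `F[a]` (expected next-feature map) and vectors `f[a]` (expected-reward weights) have already been learned, and the feature matrix `phi`, the function name, and all parameters are hypothetical.

    ```python
    import numpy as np

    def lam_value_iteration(phi, F, f, gamma=0.9, iters=200):
        """Approximate value iteration with a linear action model (sketch).

        phi : (n_states, k) feature matrix for a set of support states
        F   : dict action -> (k, k) matrix with E[phi(s')] ~ F[a] @ phi(s)
        f   : dict action -> (k,) vector with E[reward]  ~ f[a] @ phi(s)
        Returns theta such that V(s) is approximated by phi(s) @ theta.
        """
        n, k = phi.shape
        theta = np.zeros(k)
        for _ in range(iters):
            # Q-values at the support states under the linear model:
            # Q(s, a) = f[a].phi(s) + gamma * theta.(F[a] phi(s))
            q = np.stack([phi @ f[a] + gamma * (phi @ F[a].T) @ theta
                          for a in F], axis=1)
            v = q.max(axis=1)  # greedy Bellman backup
            # Least-squares projection of the backed-up values onto the features.
            theta, *_ = np.linalg.lstsq(phi, v, rcond=None)
        return theta
    ```

    With tabular (one-hot) features the linear model is exact and this reduces to ordinary value iteration, which is a convenient way to sanity-check an implementation.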

  • Subjects / Keywords
  • Graduation date
  • Type of Item
  • Degree
    Doctor of Philosophy
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.