Dyna with Options: Incorporating Temporal Abstraction into Planning

  • Author / Creator
    Mihucz, Gábor
  • Abstract
    The motivation for incorporating planning, temporal abstraction, and value function approximation into reinforcement learning (RL) algorithms is to reduce the amount of interaction with the environment needed to learn a near-optimal policy. Although each of these concepts has been under intense scrutiny for decades, less is known about their interplay, and specifically: under what circumstances does planning with options provide significant benefits over planning with only primitive actions, or over model-free alternatives? In this thesis we examine this question by endowing the background planning algorithm, Dyna, with access to options with (near-)optimal option policies in two environments: a non-stationary tabular one, in which the changing reward function necessitates rapid value function updates, and a deterministic, stationary, continuous-state one that requires value function approximation, a setting in which planning with primitive actions is known to be suboptimal compared to model-free approaches. We find that in the non-stationary environment without a state visitation bonus, all planning algorithms perform significantly better than the model-free Q-learning algorithm; planning with only options (Dyno) performs better than planning with both actions and options (Dyna+options) or planning with actions only (Dyna), while the latter two have comparable performance. When a state visitation bonus is added, all algorithms perform similarly and near-optimally, and satisfactory performance can be achieved by restricting the bonus to goal states. In the value function approximation setting, we find that Dyno outperforms DDQN in speed and robustness early in learning, but its performance later degrades to that of DDQN in the instances examined. Dyna+options performs better than Dyna and comparably to DDQN during much of the learning process, but with higher variance and occasional dips. We conclude that access to options with (near-)optimal option policies alone is not sufficient to overcome the suboptimality arising from planning with inaccurate primitive models, and argue that more sophisticated planning architectures that bypass the reliance on primitive models are necessary.
    (An illustrative sketch of the Dyna-with-options planning loop discussed here is given after this record.)

  • Subjects / Keywords
  • Graduation date
    Fall 2022
  • Type of Item
    Thesis
  • Degree
    Master of Science
  • DOI
    https://doi.org/10.7939/r3-4wvz-rf10
  • License
    This thesis is made available by the University of Alberta Library with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
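
For readers unfamiliar with the architecture the abstract refers to, the following is a minimal illustrative sketch of tabular Dyna-style background planning in which options are treated as temporally extended actions with SMDP-style backups. It is not code from the thesis: the function names, constants, and the deterministic one-outcome model are all assumptions made for illustration.

    import random
    from collections import defaultdict

    # Illustrative constants; the thesis's actual settings may differ.
    GAMMA = 0.95           # discount factor
    ALPHA = 0.1            # step size
    N_PLANNING_STEPS = 10  # background planning updates per real step

    Q = defaultdict(float)  # Q[(state, option)] -> estimated value
    model = {}              # model[(state, option)] -> (reward, next_state, duration)

    def planning(options):
        # Background planning: replay simulated option transitions
        # drawn from the learned model (classic Dyna, but over options).
        for _ in range(N_PLANNING_STEPS):
            (s, o), (r, s2, k) = random.choice(list(model.items()))
            # SMDP backup: discount by gamma**k, where k is the number of
            # primitive steps the option took when it was executed.
            target = r + (GAMMA ** k) * max(Q[(s2, o2)] for o2 in options)
            Q[(s, o)] += ALPHA * (target - Q[(s, o)])

    def on_option_completion(s, o, r, s2, k, options):
        # Direct RL update after executing option o from state s for
        # k steps, observing cumulative discounted reward r and ending
        # in state s2. A state-visitation bonus, as studied in the
        # thesis, could be added to r here.
        target = r + (GAMMA ** k) * max(Q[(s2, o2)] for o2 in options)
        Q[(s, o)] += ALPHA * (target - Q[(s, o)])
        # Deterministic last-outcome model, as in classic tabular Dyna.
        model[(s, o)] = (r, s2, k)
        planning(options)

Restricting the option set recovers the variants the abstract compares: primitive actions only (Dyna), options only (Dyno), or both (Dyna+options).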