Dyna with Options: Incorporating Temporal Abstraction into Planning

  • Author / Creator
    Mihucz, Gábor
  • Abstract
    The motivation for incorporating planning, temporal abstraction, and value function approximation into reinforcement learning (RL) algorithms is to reduce the amount of interaction with the environment needed to learn a near-optimal policy. Although each of these concepts has been under intense scrutiny for decades, less is known about their interplay, and specifically: under what circumstances does planning with options provide significant benefits over planning with only primitive actions, or over model-free alternatives? In this thesis we examine this question by endowing the background planning algorithm, Dyna, with access to options with (near-)optimal option policies in two environments: a non-stationary tabular one, in which the changing reward function necessitates rapid value function updates, and a deterministic, stationary, continuous-state one that requires value function approximation, a setting in which planning with primitive actions is known to be suboptimal compared to model-free approaches. We find that in the non-stationary environment without a state visitation bonus, all planning algorithms perform significantly better than the model-free Q-learning algorithm; planning with only options (Dyno) performs better than planning with both actions and options (Dyna+options) or planning with actions only (Dyna), while the latter two have comparable performance. When a state visitation bonus is added, all algorithms perform similarly and near-optimally, and satisfactory performance can be achieved by restricting the bonus to goal states. In the value function approximation setting, we find that Dyno outperforms DDQN in speed and robustness early in learning, but its performance later degrades to that of DDQN in the instances examined. Dyna+options performs better than Dyna and comparably to DDQN during much of the learning process, but with higher variance and occasional dips. We conclude that access to options with (near-)optimal option policies alone is not sufficient to overcome the suboptimality arising from planning with inaccurate primitive models, and argue that more sophisticated planning architectures that bypass the reliance on primitive models are necessary.
    (An illustrative sketch of the Dyna-with-options planning loop discussed here is given after this record.)

  • Subjects / Keywords
  • Graduation date
    Fall 2022
  • Type of Item
    Thesis
  • Degree
    Master of Science
  • DOI
    https://doi.org/10.7939/r3-4wvz-rf10
  • License
    This thesis is made available by the University of Alberta Library with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
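
For readers unfamiliar with the architecture the abstract refers to, the following is a minimal illustrative sketch of tabular Dyna-style background planning in which options are treated as temporally extended actions with SMDP-style backups. It is not code from the thesis: the function names, constants, and the deterministic one-outcome model are all assumptions made for illustration.

    import random
    from collections import defaultdict

    # Illustrative constants; the thesis's actual settings may differ.
    GAMMA = 0.95           # discount factor
    ALPHA = 0.1            # step size
    N_PLANNING_STEPS = 10  # background planning updates per real step

    Q = defaultdict(float)  # Q[(state, option)] -> estimated value
    model = {}              # model[(state, option)] -> (reward, next_state, duration)

    def planning(options):
        # Background planning: replay simulated option transitions
        # drawn from the learned model (classic Dyna, but over options).
        for _ in range(N_PLANNING_STEPS):
            (s, o), (r, s2, k) = random.choice(list(model.items()))
            # SMDP backup: discount by gamma**k, where k is the number of
            # primitive steps the option took when it was executed.
            target = r + (GAMMA ** k) * max(Q[(s2, o2)] for o2 in options)
            Q[(s, o)] += ALPHA * (target - Q[(s, o)])

    def on_option_completion(s, o, r, s2, k, options):
        # Direct RL update after executing option o from state s for
        # k steps, observing cumulative discounted reward r and ending
        # in state s2. A state-visitation bonus, as studied in the
        # thesis, could be added to r here.
        target = r + (GAMMA ** k) * max(Q[(s2, o2)] for o2 in options)
        Q[(s, o)] += ALPHA * (target - Q[(s, o)])
        # Deterministic last-outcome model, as in classic tabular Dyna.
        model[(s, o)] = (r, s2, k)
        planning(options)

Restricting the option set recovers the variants the abstract compares: primitive actions only (Dyna), options only (Dyno), or both (Dyna+options).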