Chasing Hallucinated Value: A Pitfall of Dyna Style Algorithms with Imperfect Environment Models

Jafferjee, Taher

doi:10.7939/r3-ayzf-pv64

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Author / Creator

Jafferjee, Taher
In Dyna style algorithms, reinforcement learning (RL) agents use a model of the environment to generate simulated experience. By updating on this simulated experience, Dyna style algorithms allow agents to potentially learn control policies in fewer environment interactions than agents that use model-free RL algorithms. Dyna, therefore, is an attractive approach to developing sample efficient RL agents. In many RL problems, however, it is seldom possible to learn a perfectly accurate model of environment dynamics. This thesis explores what happens when Dyna is coupled with an imperfect environment model.
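
To make the Dyna loop described above concrete, here is a minimal tabular sketch in the spirit of Dyna-Q. It is not taken from the thesis: the five-state chain environment, the function name `dyna_q`, and all hyperparameters are assumptions chosen for illustration, and the learned model here is exact because the toy environment is deterministic.

```python
import random


def dyna_q(n_states=5, episodes=50, planning_steps=10,
           alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Dyna-Q-style learning on a hypothetical chain environment."""
    rng = random.Random(seed)
    actions = (0, 1)  # 0 = move left, 1 = move right
    q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    model = {}  # (state, action) -> (reward, next_state, done)

    def step(s, a):
        # Hypothetical environment: reward 1 on reaching the last state.
        s2 = max(0, s - 1) if a == 0 else s + 1
        done = s2 == n_states - 1
        return (1.0 if done else 0.0), s2, done

    def update(s, a, r, s2, done):
        # One-step Q-learning (temporal difference) update.
        target = r if done else r + gamma * max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

    def greedy(s):
        # Greedy action with random tie-breaking.
        best = max(q[(s, b)] for b in actions)
        return rng.choice([b for b in actions if q[(s, b)] == best])

    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            r, s2, done = step(s, a)
            update(s, a, r, s2, done)       # learn from real experience
            model[(s, a)] = (r, s2, done)   # record the observed transition
            for _ in range(planning_steps):
                # Planning: replay simulated experience drawn from the model.
                ps, pa = rng.choice(list(model))
                pr, ps2, pdone = model[(ps, pa)]
                update(ps, pa, pr, ps2, pdone)
            s = s2
    return q
```

Calling `dyna_q()` returns the Q-table; in this toy setting the value of moving right from the state next to the goal approaches 1.0. The planning loop is where an imperfect model would matter, since every simulated transition is trusted as if it were real.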

We present the Hallucinated Value Hypothesis. We hypothesise that Dyna style algorithms coupled with imperfect environment models may fail to learn control policies if they update Q-values of observed states towards values of simulated states. We argue this occurs because the imperfect model may erroneously generate fictitious states that do not correspond to real, reachable states of the environment. These fictitious states may have arbitrary Q-values, and temporal difference updates toward them may lead to the propagation of these misleading values through the value function. Consequently, agents may end up incorrectly chasing hallucinated value.
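
The mechanism in the hypothesis can be illustrated with a single temporal difference update. The sketch below is an assumption-laden toy, not the thesis's experimental setup: the state names `"real"` and `"ghost"`, the arbitrary value 10.0, and the constants `GAMMA` and `ALPHA` are all hypothetical.

```python
GAMMA, ALPHA = 0.9, 0.5  # assumed discount factor and step size


def td_update(q, s, a, r, s_next, actions):
    """Apply one Q-learning update to q[(s, a)] and return the new value."""
    target = r + GAMMA * max(q.get((s_next, b), 0.0) for b in actions)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + ALPHA * (target - old)
    return q[(s, a)]


actions = ("stay", "go")
q = {
    ("real", "stay"): 0.0,
    ("real", "go"): 0.0,
    # "ghost" is a fictitious state that the environment can never reach,
    # so nothing ever corrects its arbitrary initial Q-values.
    ("ghost", "stay"): 10.0,
    ("ghost", "go"): 10.0,
}

# True dynamics (assumed): "go" leaves the agent in "real" with reward 0.
real_value = td_update(dict(q), "real", "go", 0.0, "real", actions)

# Imperfect model (assumed): it wrongly predicts that "go" leads to "ghost".
hallucinated_value = td_update(dict(q), "real", "go", 0.0, "ghost", actions)

print(real_value, hallucinated_value)
```

With the true transition the value of the real state stays at 0.0, while the update toward the fictitious state raises it to 4.5 even though no reward exists anywhere. Further updates would then propagate that inflated value to the predecessors of the real state.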

We present three Dyna style algorithms that may update real state values toward simulated state values and one which is designed not to. We evaluate these algorithms on Bordered Gridworld, a simple setting designed to carefully test the hypothesis. Furthermore, we study whether the hypothesis holds in a range of standard RL benchmarks: Cartpole, Catcher, and Puddleworld.

Experimental evidence supports the Hallucinated Value Hypothesis. The algorithms which update real state values toward simulated state values struggle to improve their control performance. On the other hand, n-step predecessor Dyna, our algorithm which does not perform such updates, seems to be robust to model error on the tested domains. Furthermore, it enjoys speed-ups in learning over its competitors.
Subjects / Keywords
Graduation date

Spring 2020
Type of Item

Thesis
Degree

Master of Science
DOI

https://doi.org/10.7939/r3-ayzf-pv64
License

Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.

Language

English
Institution

University of Alberta
Degree level

Master's
Department
- Department of Computing Science
Specialization
- Statistical Machine Learning
Supervisor / co-supervisor and their department(s)
- Bowling, Michael (Computing Science)
- White, Martha (Computing Science)