Chasing Hallucinated Value: A Pitfall of Dyna Style Algorithms with Imperfect Environment Models
- Author / Creator
- Jafferjee, Taher
In Dyna style algorithms, reinforcement learning (RL) agents use a model of the environment to generate simulated experience. By updating on this simulated experience, Dyna style algorithms allow agents to potentially learn control policies in fewer environment interactions than agents that use model-free RL algorithms. Dyna, therefore, is an attractive approach to developing sample efficient RL agents. In many RL problems, however, it is seldom possible to learn a perfectly accurate model of environment dynamics. This thesis explores what happens when Dyna is coupled with an imperfect environment model.
We present the Hallucinated Value Hypothesis. We hypothesise that Dyna style algorithms coupled with imperfect environment models may fail to learn control policies if they update Q-values of observed states towards values of simulated states. We argue this occurs because the imperfect model may erroneously generate fictitious states that do not correspond to real, reachable states of the environment. These fictitious states may have arbitrary Q-values, and temporal difference updates toward them may propagate these misleading values through the value function. Consequently, agents may end up incorrectly chasing hallucinated value.
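The mechanism can be illustrated with a minimal tabular sketch. It is not taken from the thesis: the four-state layout, the fictitious state index, its Q-values, and the step-size and discount settings are all assumptions chosen only to show how a single planning update toward a simulated state can mislead the greedy policy.

```python
import numpy as np

def td_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """One Q-learning update of Q[s, a] toward r + gamma * max_a' Q[s_next, a']."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Hypothetical setup: states 0..2 are real, state 3 is never reachable.
n_states, n_actions = 4, 2
FICTITIOUS = 3
Q = np.zeros((n_states, n_actions))

# The fictitious state is never visited, so real experience never
# corrects its Q-values; here they are set to an arbitrary large number.
Q[FICTITIOUS] = [10.0, 10.0]

# Planning step: the imperfect model wrongly predicts that taking
# action 1 in real state 0 leads to the fictitious state (reward 0).
Q = td_update(Q, s=0, a=1, r=0.0, s_next=FICTITIOUS)

# Real step: action 0 in state 0 actually leads to real state 1 (reward 0).
Q = td_update(Q, s=0, a=0, r=0.0, s_next=1)

# Action 1 now looks better in state 0 even though no real reward was
# ever observed: the agent is chasing hallucinated value.
greedy_action = int(np.argmax(Q[0]))
print(Q[0], greedy_action)
```

In this sketch the error enters only because a real state's value is bootstrapped from a simulated successor; updates in the opposite direction (a simulated state's value toward a real state's value) would leave `Q[0]` untouched.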
We present three Dyna style algorithms that may update real state values toward simulated state values and one which is designed not to. We evaluate these algorithms on Bordered Gridworld, a simple setting designed to test the hypothesis carefully. Furthermore, we study whether the hypothesis holds in a range of standard RL benchmarks: Cartpole, Catcher, and Puddleworld.
Experimental evidence supports the Hallucinated Value Hypothesis. The algorithms which update real state values toward simulated state values struggle to improve their control performance. On the other hand, n-step predecessor Dyna, our algorithm which does not perform such updates, seems to be robust to model error on the tested domains. Furthermore, it enjoys speed-ups in learning over its competitors.
- Subjects / Keywords
- Graduation date
- Spring 2020
- Type of Item
- Master of Science
- Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.