Efficient Exploration in Reinforcement Learning through Time-Based Representations

Cholodovskis Machado, Marlos

doi:doi:10.7939/r3-zaq3-vs36

This decommissioned ERA site remains active temporarily to support our final migration steps to https://ualberta.scholaris.ca, ERA's new home. All new collections and items, including Spring 2025 theses, are at that site. For assistance, please contact erahelp@ualberta.ca.

View

Download

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Usage

1430 views
2174 downloads

Efficient Exploration in Reinforcement Learning through Time-Based Representations

Author / Creator

Cholodovskis Machado, Marlos
In the reinforcement learning (RL) problem an agent must learn how to act optimally through trial-and-error interactions with a complex, unknown, stochastic environment. The actions taken by the agent influence not just the immediate reward it observes but also the future states and rewards it will observe, implicitly requiring the agent to deal with the trade-off between short-term and long-term consequences. In this context, the problem of exploration is the problem of selecting appropriate actions to explore the state space to gather information while taking this trade-off into consideration.In this dissertation I advocate that agents' exploration strategy can be guided by the process of representation learning. I support this claim by introducing different exploration approaches for RL algorithms that are applicable to complex environments with sparse rewards. They all use learned time-based representations, state representations that capture the temporal aspect of RL problems, implicitly encoding the temporal proximity of states. The two instantiations of time-based representations I use are proto-value functions (PVFs) and the successor representation (SR).The first approaches I introduce are based on the idea of option-based exploration. Option-based exploration hinges on the assumption that an agent that exhibits purposeful behavior is more likely to visit states that are far from its current state than an agent that randomly selects actions at every time step. I model this purposefulness through options, which, in reinforcement learning, represent temporally extended courses of actions over different time scales. I then introduce algorithms capable of discovering options autonomously through PVFs and the SR.I also introduce count-based exploration approaches, which are based on the idea of keeping state visitation counts to ensure all states (or abstractions of a state) are visited a proper number of times. I show that the norm of the SR, while it is being learned, incorporates state visitation counts and I use this result to introduce RL algorithms that achieve state-of-the-art results in large domains that require function approximation.I evaluate my algorithms in both tabular domains and Atari 2600 games. I use tabular domains such as the 4-room domain, RiverSwim, and SixArms in order to develop a better intuition about the proposed algorithms and to compare the proposed approaches to classic baselines in the field. I use Atari 2600 games to evaluate the scalability and generality of the proposed approaches since the state space of Atari 2600 games is too large, requiring function approximation. I discuss approaches based on linear and non-linear function approximation.
Subjects / Keywords
Graduation date

Spring 2019
Type of Item

Thesis
Degree

Doctor of Philosophy
DOI

https://doi.org/10.7939/r3-zaq3-vs36
License

Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.

Language

English
Institution

University of Alberta
Degree level

Doctoral
Department
- Department of Computing Science
Supervisor / co-supervisor and their department(s)
- Bowling, Michael (Computing Science)
- Bellemare, Marc (Google Brain)