Learning What to Remember: Strategies for Selective External Memory in Online Reinforcement Learning Agents

  • Author / Creator
    Young, Kenneth
  • Abstract
    In realistic environments, intelligent agents must learn to integrate information from their past to inform present decisions. An agent's immediate observations are often limited, and some degree of memory is necessary to complete many everyday tasks. However, an agent cannot remember everything it observes: the history of observations may be arbitrarily long, making it impractical to store and process. In this thesis, we develop a novel method, called online policy gradient over a reservoir (OPGOR), for selecting what to remember from the stream of observations. We also explore a number of alternative methods for handling this selective memory problem.

    OPGOR operates within the framework of external memory mechanisms for selective memory, which provide an agent with read/write access to a memory consisting of a fixed number of slots. Such mechanisms give rise to three key questions: what to read from memory, what to write to memory, and what to drop from memory when something is written.

    We focus on the question of how to learn to prioritize which information is written to and retained in an external memory, and in particular on the online case, where a single agent acts and learns concurrently with a limited amount of memory and compute time. In doing so, we hope to produce agents that learn to perform well while storing much less information. Our primary approach, OPGOR, applies policy gradient to the process of selecting which state variables from the entire trajectory to store in memory. Naively applying policy gradient to draw a subset of the full history of state variables would require storing that full history and then drawing a sample, which is not feasible for an online method. However, a variety of algorithms exist that maintain a fixed-size sample with particular statistical properties from a stream observed one item at a time. Such algorithms are called reservoir sampling algorithms, named for the fact that they maintain a fixed-size sample, or reservoir, of items drawn from a stream (a sketch of a classic variant appears after this record).

    OPGOR uses a reservoir sampling algorithm to maintain an external memory in which the inclusion probability for each state variable in the history is given by a differentiable, closed-form expression. This allows us to efficiently train our memory to retain useful state variables.

    We test OPGOR, along with a number of alternative selective memory strategies, on a set of psychology-inspired problems, simplified to focus on the specific aspects of the problem we aim to investigate. In doing so, we explore the challenges of deciding what to retain in memory and the degree to which various methods handle them.

  • Subjects / Keywords
  • Graduation date
    Spring 2019
  • Type of Item
    Thesis
  • Degree
    Master of Science
  • DOI
    https://doi.org/10.7939/r3-rwvt-zj65
  • License
    Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.
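
Background for the abstract above: the "reservoir sampling algorithms" it refers to are a standard family. The minimal sketch below shows the classic uniform variant, Vitter's Algorithm R, which keeps a uniform sample of k items from a stream of unknown length using O(k) memory. This is general background only, not the thesis's OPGOR method; the function name and interface are illustrative.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Uniform reservoir sampling (Vitter's Algorithm R).

    Maintains a uniform random sample of k items from a stream
    seen one item at a time, using O(k) memory.
    """
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item i (0-indexed) enters with probability k / (i + 1);
            # after n items, every item is retained with probability k / n.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: draw 5 items uniformly from a stream of 10,000.
print(reservoir_sample(range(10_000), k=5))
```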
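The abstract also notes that OPGOR maintains the memory so that each state variable's inclusion probability has a differentiable, closed-form expression that policy gradient can train. The record does not spell out that update, but a weighted reservoir scheme conveys the flavour: the sketch below uses the standard Efraimidis-Spirakis (A-Res) algorithm, in which per-item weights (which, in a setting like OPGOR's, a learned network could supply) bias which items survive in the reservoir. This is an illustration of weighted reservoir sampling in general, not the thesis's code.

```python
import heapq
import random

def weighted_reservoir_sample(stream_with_weights, k, rng=random):
    """Weighted reservoir sampling (Efraimidis & Spirakis "A-Res").

    Keeps k items from a stream of (item, weight) pairs, seen one at a
    time; items with larger positive weights are more likely to survive.
    """
    heap = []  # min-heap of (key, idx, item); smallest key is evicted first
    for idx, (item, weight) in enumerate(stream_with_weights):
        # Each item gets a random key u ** (1 / w), with weight w > 0;
        # larger weights tend to yield larger keys, so those items are
        # more likely to remain among the k largest keys.
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, idx, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, idx, item))
    return [item for _, _, item in heap]

# Example: items 'a'..'e' with weights favouring 'd' and 'e'.
stream = [("a", 0.1), ("b", 0.1), ("c", 0.1), ("d", 5.0), ("e", 5.0)]
print(weighted_reservoir_sample(stream, k=2))
```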