- 33 views
- 64 downloads
Leveraging Generic Problem Structure for Efficient Reinforcement Learning
-
- Author / Creator
- Young, Kenneth J.
-
In this dissertation, I investigate how we can exploit generic problem structure to make reinforcement learning algorithms more efficient. Generic problem structure means basic structure that exists in a wide range of problems (e.g., an action taken in the present does not influence the past), as opposed to structure which is specific to a particular problem (e.g., heuristics or theorems about which actions are superior in a particular game). My investigation is broken down into three major contributions.
The first contribution is to demonstrate, empirical and theoretically, that given some prior knowledge of the structure of the world, reinforcement learning methods which learn a world model can do a better job of exploiting that knowledge, to learn from experience, compared to model-free methods which learn a value function directly from experience. This validates the belief that model-based reinforcement learning improves sample efficiency by synthesizing imagined experience which generalizes beyond the data. While this belief is widely held, model generalization is an insufficient explanation because learned value functions also generalize. I address this gap with theoretical and empirical results illustrating how world model generalization is, in a sense, inherently more powerful than value function generalization.
The second contribution is an algorithm that exploits knowledge of network structure to improve credit assignment in networks of discrete stochastic neurons, where each neuron is treated as a reinforcement learning agent. Training neural networks with discrete stochastic units is challenging as backpropagation is not directly applicable, nor are the reparameterization tricks often used in networks with continuous stochastic variables. I propose Hindsight Network Credit Assignment (HNCA) for gradient estimation in networks of discrete stochastic neurons. HNCA can be seen as a middle-ground between backpropagation (which is intractable for nontrivial stochastic networks) and REINFORCE (which tends to be high variance). HNCA produces unbiased gradient estimates with provably lower variance than REINFORCE. The computational cost of HNCA is on the same order as a forward pass through the network, hence learning is not a significant bottleneck. Empirical results demonstrate that HNCA significantly reduces variance in the gradient estimates compared to REINFORCE, which in turn significantly improves performance.
The third contribution is an approach to option discovery motivated by the idea that, as a consequence of the spatiotemporal locality structure of the world, optimal actions in temporally contiguous states will tend to be strongly interdependent. Motivated by this idea, I propose an approach called Option Iteration (OptIt) which distills a set of options from the results of a computationally expensive search procedure. Intuitively, OptIt aims to discover a set of options such that for every trajectory segment of some length, at least one option in the set is a good match to the improved policy which results from running a search procedure in each state in the segment. This leads to options that capture relationships among the best actions in temporally contiguous states while allowing for uncertainty in which option is best in a given situation. The resulting set of options guides the search procedure resulting in a process of iterative improvement where better options lead to better search, which in turn facilitates the discovery of better options.
There is good reason to believe that prior knowledge of problem structure is necessary to make meaningful learning possible. However, the last several decades of research have shown that encoding specific human expertise into our systems tends to lose out in the long run compared to methods that scale well with computation and data. Acknowledging both these points, it makes sense to focus our efforts on developing methods which exploit structure that is as generic as possible, allowing the agent to learn more specific world knowledge from experience and computation. By virtue of being broadly applicable, generic structure prunes the search space of solutions more and remains relevant as more computation and data are applied to a problem. On the other hand, although specific structure might improve performance early on, it can quickly become irrelevant as it may be deduced by a combination of learning and generic structure. While it's not always clear where the line should be drawn between generic and specific, acknowledging that a trade-off exists provides a useful guideline for which research directions are worth pursuing.
-
- Graduation date
- Spring 2024
-
- Type of Item
- Thesis
-
- Degree
- Doctor of Philosophy
-
- License
- This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.