Towards Sample Efficient Reinforcement Learning with Function Approximation

Ayoub, Alex

doi:doi:10.7939/r3-hzz0-xj30

This decommissioned ERA site remains active temporarily to support our final migration steps to https://ualberta.scholaris.ca, ERA's new home. All new collections and items, including Spring 2025 theses, are at that site. For assistance, please contact erahelp@ualberta.ca.

View

Download

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Usage

286 views
403 downloads

Towards Sample Efficient Reinforcement Learning with Function Approximation

Author / Creator

Ayoub, Alex
This thesis proposes novel algorithmic ideas in reinforcement learning for regret minimization. These algorithmic ideas enjoy nice theoretical guarantees and are more practical in large problems than their alternatives. We focus on finite-horizon episodic RL. We propose model-based and model-free RL algorithms that are based on the optimism principle which allows us to derive regret bounds for our algorithms. In each episode the model-based algorithm constructs the set of models that are ‘consistent’ with the data collected. The criterion of consistency is based on the total squared error that the model incurs on the task of predicting state values as determined by the last value estimate along the transitions. The next value function is then chosen by solving the optimistic planning problem with the constructed set of models. We also propose a model-free algorithm inspired by the randomized least squares value iteration algorithm. Unlike existing upper-confidence-bound based approaches this algorithm drives exploration by simply perturbing the training data with judiciously chosen independent and identically distributed scalar noises. To attain optimistic value function estimation without resorting to a UCB-style bonus, we introduce a reward sampling procedure that guarantees optimism in the value estimates. For the model based case, we provide regret bounds on our algorithm and highlight its attractive properties through numerical experiments. For the model-free case, we show that randomizing the history multiple times and adding a regularizer, or data, that ensures the under explored regions have sufficient coverage is enough to get sublinear regret. Thus, making significant progress toward more computationally efficient RL algorithms that also guarantee sublinear regret.
Subjects / Keywords
Graduation date

Fall 2021
Type of Item

Thesis
Degree

Master of Science
DOI

https://doi.org/10.7939/r3-hzz0-xj30
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Master's
Department
- Department of Computing Science
Supervisor / co-supervisor and their department(s)
- Szepesvari, Csaba (Computing Science