Towards Sample Efficient Reinforcement Learning with Function Approximation

  • Author / Creator
    Ayoub, Alex
  • This thesis proposes novel algorithmic ideas in reinforcement learning for regret minimization. These algorithmic ideas enjoy nice theoretical guarantees and are more practical in large problems than their alternatives. We focus on finite-horizon episodic RL. We propose model-based and model-free RL algorithms that are based on the optimism principle which allows us to derive regret bounds for our algorithms. In each episode the model-based algorithm constructs the set of models that are ‘consistent’ with the data collected. The criterion of consistency is based on the total squared error that the model incurs on the task of predicting state values as determined by the last value estimate along the transitions. The next value function is then chosen by solving the optimistic planning problem with the constructed set of models. We also propose a model-free algorithm inspired by the randomized least squares value iteration algorithm. Unlike existing upper-confidence-bound based approaches this algorithm drives exploration by simply perturbing the training data with judiciously chosen independent and identically distributed scalar noises. To attain optimistic value function estimation without resorting to a UCB-style bonus, we introduce a reward sampling procedure that guarantees optimism in the value estimates. For the model based case, we provide regret bounds on our algorithm and highlight its attractive properties through numerical experiments. For the model-free case, we show that randomizing the history multiple times and adding a regularizer, or data, that ensures the under explored regions have sufficient coverage is enough to get sublinear regret. Thus, making significant progress toward more computationally efficient RL algorithms that also guarantee sublinear regret.

  • Subjects / Keywords
  • Graduation date
    Fall 2021
  • Type of Item
  • Degree
    Master of Science
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.