Towards Practical Offline Reinforcement Learning: Sample Efficient Policy Selection and Evaluation

  • Author / Creator
    Liu, Vincent
  • Offline reinforcement learning (RL) involves learning policies from previously collected datasets rather than from online interaction. The dissertation first investigates a critical component of offline RL: offline policy selection (OPS). Because most offline RL algorithms require careful hyperparameter tuning, we need to select the best policy among a set of candidate policies before deployment. In the first part of the dissertation, we clarify when OPS is sample efficient by establishing a clear connection to off-policy policy evaluation (OPE) and Bellman error estimation. The dissertation then presents algorithms that leverage offline data. We begin by examining environments that contain exogenous variables, on which the agent has limited impact, and endogenous variables, which are under the agent's full control. We show that policy evaluation and selection become straightforward under such conditions. Additionally, we present an algorithm based on Fitted-Q Iteration with data augmentation and show that it can find nearly optimal policies with polynomial sample complexity. We then study OPE in non-stationary environments and introduce the regression-assisted doubly robust estimator, which effectively incorporates past data without introducing large bias and improves on existing OPE estimators through the use of auxiliary information and a regression approach. We evaluate our algorithms on a variety of problems, some built from real-world datasets, including optimal order execution, inventory management, hybrid car control, and recommendation systems. A sketch of the classical estimator underlying the OPE contribution follows the abstract.
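  The regression-assisted doubly robust estimator named in the abstract extends the classical per-decision doubly robust (DR) estimator for OPE, which combines a learned value model with importance-weighted corrections. As context only, below is a minimal sketch of that classical DR recursion; the function names, toy models, and policies are illustrative assumptions, not the thesis's implementation.

```python
def doubly_robust_value(trajectory, q_hat, v_hat, pi_e, pi_b, gamma=0.99):
    """Per-decision doubly robust (DR) estimate of the evaluation policy's
    value from a single trajectory collected by the behaviour policy.

    trajectory: list of (state, action, reward) tuples, in time order.
    q_hat(s, a), v_hat(s): approximate action-value / state-value models
        for the evaluation policy (e.g. fit by regression on offline data).
    pi_e(a, s), pi_b(a, s): action probabilities under the evaluation
        and behaviour policies.
    """
    estimate = 0.0
    # Backward recursion:
    #   DR_t = V(s_t) + rho_t * (r_t + gamma * DR_{t+1} - Q(s_t, a_t)),
    # where rho_t = pi_e(a_t|s_t) / pi_b(a_t|s_t) is the per-step
    # importance ratio. The model terms reduce variance; the importance-
    # weighted correction removes the model's bias in expectation.
    for state, action, reward in reversed(trajectory):
        rho = pi_e(action, state) / pi_b(action, state)
        estimate = v_hat(state) + rho * (
            reward + gamma * estimate - q_hat(state, action)
        )
    return estimate


# Toy usage: one state, two actions, uniform behaviour policy,
# evaluation policy that always picks action 0.
q = {(0, 0): 1.0, (0, 1): 0.5}
traj = [(0, 0, 1.0)]  # (state, action, reward)
v = doubly_robust_value(
    traj,
    q_hat=lambda s, a: q[(s, a)],
    v_hat=lambda s: max(q[(s, a)] for a in (0, 1)),
    pi_e=lambda a, s: 1.0 if a == 0 else 0.0,
    pi_b=lambda a, s: 0.5,
)
print(v)  # 1.0
```

  The thesis's regression-assisted variant additionally uses auxiliary information and a regression step to incorporate past data in non-stationary environments; the sketch above shows only the standard DR baseline it builds on.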

  • Subjects / Keywords
  • Graduation date
    Spring 2024
  • Type of Item
    Thesis
  • Degree
    Doctor of Philosophy
  • DOI
    https://doi.org/10.7939/r3-vre2-b756
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.