A Framework for Safe Evaluation of Offline Learning

  • Author / Creator
    Radi, Hager
  • Abstract
    The world offers unprecedented amounts of data in real-world domains, from which we can develop successful decision-making systems. Reinforcement learning (RL) can learn control policies offline from such data, but deploying an agent while it is still learning is challenging in safety-critical domains. Although RL normally depends on an environment, offline RL learns from historical data without access to one. It is therefore essential to estimate how a newly learned agent will perform in the real environment before actually deploying it. For instance, in medical domains, we cannot afford to deploy a bad policy. Moreover, if data is costly, we would like to know how much data is needed to learn a good-enough policy, so that we can stop paying for data and let the new policy take over.
    To achieve this, we introduce a framework for safe evaluation of offline learning using approximate high-confidence off-policy evaluation (HCOPE). We focus on safety, so that the probability that our agent performs below a baseline is approximately $\delta$, where $\delta$ specifies how much risk is reasonable. In our setting, we assume access to data, which we split into a train set used to learn an offline policy and a test set used to estimate a lower bound on the offline policy's performance using off-policy evaluation with bootstrap confidence intervals. A lower-bound estimate allows us to decide when to deploy our learned policy with minimal risk of overestimation. We verify our proposed framework on a range of tasks as well as real-world medical data. Since current offline RL methods rely on some environment for evaluation, this thesis fills a real gap: how offline agents can be evaluated while learning, given data only and with high confidence.
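
The evaluation step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes ordinary per-trajectory importance sampling as the off-policy estimator and a percentile bootstrap for the lower bound, and the function names (`importance_sampled_return`, `bootstrap_lower_bound`, `safe_to_deploy`) are hypothetical. The thesis may use a different estimator or bootstrap variant.

```python
import numpy as np


def importance_sampled_return(target_probs, behavior_probs, rewards, gamma=1.0):
    """Ordinary importance-sampled return of one test-set trajectory (assumed estimator).

    target_probs / behavior_probs: per-step probabilities of the logged actions
    under the learned (offline) policy and under the data-collecting policy.
    """
    weight = np.prod(np.asarray(target_probs, dtype=float)
                     / np.asarray(behavior_probs, dtype=float))
    discounts = gamma ** np.arange(len(rewards))
    return float(weight * np.sum(discounts * np.asarray(rewards, dtype=float)))


def bootstrap_lower_bound(is_returns, delta=0.05, n_boot=2000, seed=0):
    """Approximate (1 - delta) lower bound on the mean return (percentile bootstrap)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(is_returns, dtype=float)
    # Resample trajectories with replacement and record each resample's mean.
    idx = rng.integers(0, len(x), size=(n_boot, len(x)))
    boot_means = x[idx].mean(axis=1)
    # The delta-quantile of the bootstrap means serves as the lower bound.
    return float(np.quantile(boot_means, delta))


def safe_to_deploy(is_returns, baseline, delta=0.05):
    """Deploy only if the estimated lower bound reaches the baseline performance."""
    return bootstrap_lower_bound(is_returns, delta=delta) >= baseline
```

In this sketch, `delta` plays the role of $\delta$ in the abstract: a smaller value gives a more conservative lower bound, so the learned policy must look better on the test set before `safe_to_deploy` returns `True`.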

  • Subjects / Keywords
  • Graduation date
    Spring 2022
  • Type of Item
    Thesis
  • Degree
    Master of Science
  • DOI
    https://doi.org/10.7939/r3-xz84-9v08
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.