A Framework for Safe Evaluation of Offline Learning

Radi, Hager

doi:10.7939/r3-xz84-9v08

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Author / Creator

Radi, Hager
Abstract

Real-world domains offer unprecedented amounts of data from which we can develop successful decision-making systems. Reinforcement learning (RL) can learn control policies offline from such data, but deploying an agent while it is still learning is challenging in safety-critical domains. Although RL ordinarily relies on an environment, offline RL learns from historical data without access to one. It is therefore essential to have a technique for estimating how a newly learned agent will perform in the real environment before actually deploying it. For instance, in medical domains, we cannot afford to deploy a bad policy. Moreover, if data is costly, we would like to know how much data is needed to learn a good-enough policy, so that we can stop paying for data and let the new policy take over.

To achieve this, we introduce a framework for safe evaluation of offline learning using approximate high-confidence off-policy evaluation (HCOPE). We focus on safety, so that the probability that our agent performs below a baseline is approximately $\delta$, where $\delta$ specifies how much risk is reasonable. In our setting, we assume access to data, which we split into a training set used to learn an offline policy and a test set used to estimate a lower bound on that policy's performance via off-policy evaluation with bootstrap confidence intervals. A lower-bound estimate allows us to decide when to deploy our learned policy with minimal risk of overestimation. We verify our proposed framework on a range of tasks as well as real-world medical data. Since current offline RL methods rely on some environment for evaluation, this thesis fills a real gap: how offline agents can be evaluated with high confidence, while learning, given data only.
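The evaluation step described in the abstract (off-policy evaluation on a held-out test set, followed by a bootstrap lower bound that is compared against a baseline) can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes ordinary per-trajectory importance sampling as the off-policy estimator and a percentile bootstrap for the confidence bound, and the names `is_returns`, `bootstrap_lower_bound`, `safe_to_deploy`, `pi_e`, and `pi_b` are hypothetical. It also assumes the behavior policy assigns non-zero probability to every logged action.

```python
import numpy as np

def is_returns(trajectories, pi_e, pi_b, gamma=1.0):
    """Per-trajectory importance-sampling returns (assumed estimator).

    trajectories: list of lists of (state, action, reward) tuples.
    pi_e, pi_b:   callables (state, action) -> probability under the
                  evaluation (learned) and behavior (logging) policies.
    """
    values = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            weight *= pi_e(s, a) / pi_b(s, a)  # cumulative likelihood ratio
            ret += (gamma ** t) * r            # discounted return
        values.append(weight * ret)
    return np.asarray(values, dtype=float)

def bootstrap_lower_bound(values, delta=0.05, n_boot=2000, seed=0):
    """Percentile-bootstrap lower bound on the mean of `values`.

    Resamples the test-set estimates with replacement and returns the
    delta-quantile of the resampled means.
    """
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(values)
    idx = rng.integers(0, n, size=(n_boot, n))  # indices of each resample
    means = values[idx].mean(axis=1)
    return float(np.quantile(means, delta))

def safe_to_deploy(values, baseline, delta=0.05, **kwargs):
    """Deploy only if the approximate lower bound clears the baseline."""
    return bootstrap_lower_bound(values, delta, **kwargs) >= baseline
```

In this sketch, `delta` plays the role of $\delta$ in the abstract: a smaller value gives a more conservative lower bound, so the learned policy has to look better on the test set before it is judged safe to deploy. Because the bootstrap bound is approximate, the resulting guarantee is approximate as well.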
Subjects / Keywords
Graduation date

Spring 2022
Type of Item

Thesis
Degree

Master of Science
DOI

https://doi.org/10.7939/r3-xz84-9v08
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Master's
Department
- Department of Computing Science
Supervisor / co-supervisor and their department(s)
- Matthew E. Taylor, Department of Computing Science
- Josiah P. Hanna, Computer Sciences Department, University of Wisconsin-Madison