To Ask or Explore: A Systematic Approach to Advice
-
- Author / Creator
- Sattarifard, Amirmohsen
-
Reinforcement learning (RL) has shown great promise in sequential decision-making tasks. However, one of the significant challenges RL faces is poor sample efficiency, which restricts its applicability in many real-world scenarios. Addressing this challenge has the potential to expand the reach of RL techniques. One class of problems that offers insight into this challenge is the multi-armed bandit (MAB) setting, which can be viewed as a simplified version of RL. By investigating and addressing sample efficiency in MAB, we hope to gain insights that can later be generalized to broader RL contexts.
In this thesis, our primary focus is the Beta-Bernoulli Bayesian multi-armed bandit in an online finite-horizon setting. We adopt two strategies to tackle the sample efficiency challenge: (1) knowledge reuse through advice and (2) near-optimal exploration. While advice seeking has been touched upon in earlier research, prior treatments have largely been unstructured. Our work provides a more systematic approach to the problem of advice in MAB, addressing two essential questions: “when to ask for advice?” and “what arm to ask about?”. Finally, we investigate the problem of near-optimal exploration: we provide a myopic approximation to the Bayes-optimal policy and show that it achieves near-optimal performance.
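The thesis develops these policies formally; purely as an illustrative sketch of the setting (the function names, the uniform Beta(1, 1) priors, and the specific two-step index below are our own assumptions, not the thesis's algorithm), a Beta-Bernoulli bandit driven by a myopic truncation of the Bayes-optimal lookahead might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_means(alphas, betas):
    """Posterior mean of each arm's Bernoulli parameter under Beta(a, b)."""
    return alphas / (alphas + betas)

def myopic_index(alphas, betas, i):
    """Two-step truncation of the Bayes-optimal value of pulling arm i:
    immediate expected reward plus the expected best posterior mean after
    updating on that pull's outcome (an assumed, illustrative index)."""
    p = alphas[i] / (alphas[i] + betas[i])
    a_s, b_s = alphas.copy(), betas.copy()
    a_s[i] += 1  # posterior counts if the pull succeeds
    a_f, b_f = alphas.copy(), betas.copy()
    b_f[i] += 1  # posterior counts if the pull fails
    return (p
            + p * posterior_means(a_s, b_s).max()
            + (1 - p) * posterior_means(a_f, b_f).max())

def run(true_probs, horizon):
    k = len(true_probs)
    alphas, betas = np.ones(k), np.ones(k)  # Beta(1, 1) priors on each arm
    total = 0.0
    for _ in range(horizon):
        arm = max(range(k), key=lambda i: myopic_index(alphas, betas, i))
        reward = float(rng.random() < true_probs[arm])  # Bernoulli draw
        alphas[arm] += reward
        betas[arm] += 1.0 - reward
        total += reward
    return total

print(run(np.array([0.3, 0.6, 0.5]), horizon=50))
```

The point of the sketch is only the Beta posterior bookkeeping and the one-step-lookahead structure; the thesis's actual policies account for the remaining finite horizon.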
Our key contributions include:
- Deriving, under an advice budget of one, the optimal choice of which arm the agent should seek advice about, using the value of information (VOI) as a measure of each query's utility (a toy version of this computation is sketched after this list).
- Providing and investigating approximations of the VOI based on our myopic-horizon approximation to the Bayes-optimal policy.
- Drawing parallels between the Bayes-optimal policy and the value of information, revealing the intertwined nature of the two.
- Demonstrating that in some cases it is beneficial for an agent to postpone seeking advice, even when the advice is free, and deriving the optimal solution for such delays in a Bayesian bandit setting.
- Investigating a myopic-horizon approximation to the Bayes-optimal policy and showing its efficacy in achieving near-optimal results.
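The exact VOI derivation is one of the thesis's contributions; as a crude, hypothetical illustration only, if advice is modeled as one free Bernoulli observation of the queried arm (our simplifying assumption, not the thesis's advice model), a one-step VOI and the resulting budget-of-one query choice could be sketched as:

```python
import numpy as np

def greedy_value(alphas, betas):
    """Best current posterior mean across arms (exploit-only value)."""
    return (alphas / (alphas + betas)).max()

def voi_free_observation(alphas, betas, i):
    """Hypothetical one-step VOI of advice about arm i, modeling the advice
    as one free Bernoulli observation of that arm: the expected greedy value
    after updating on the observation, minus the current greedy value."""
    p = alphas[i] / (alphas[i] + betas[i])
    a_s, b_s = alphas.copy(), betas.copy()
    a_s[i] += 1  # posterior counts if the observation is a success
    a_f, b_f = alphas.copy(), betas.copy()
    b_f[i] += 1  # posterior counts if the observation is a failure
    return (p * greedy_value(a_s, b_s)
            + (1 - p) * greedy_value(a_f, b_f)
            - greedy_value(alphas, betas))

# With a budget of one, ask about the arm whose VOI is largest. Here the
# well-sampled greedy arm (index 2) has zero VOI, while the poorly sampled
# arm (index 1) is the most valuable one to ask about.
alphas = np.array([2.0, 1.0, 5.0])
betas = np.array([2.0, 1.0, 3.0])
best_query = max(range(len(alphas)),
                 key=lambda i: voi_free_observation(alphas, betas, i))
print(best_query, voi_free_observation(alphas, betas, best_query))
```
-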
- Graduation date
- Spring 2024
-
- Type of Item
- Thesis
-
- Degree
- Master of Science
-
- License
- This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.