To Ask or Explore: A Systematic Approach to Advice
-
- Author / Creator
- Sattarifard, Amirmohsen
-
Reinforcement learning (RL) has shown great promise in sequential decision-making tasks. However, one of the significant challenges RL faces is poor sample efficiency, which restricts its applicability in many real-world scenarios. Addressing this challenge has the potential to expand the reach of RL techniques. One class of problems that offers insight into this challenge is the multi-armed bandit (MAB) setting, which can be viewed as a simplified version of RL. By investigating and addressing sample efficiency in MAB, we hope to gain insights that can later be generalized to broader RL contexts.
In this thesis, our primary focus is the Beta-Bernoulli Bayesian multi-armed bandit in an online finite-horizon setting. We adopt two strategies to tackle the sample efficiency challenge: (1) knowledge reuse through advice and (2) near-optimal exploration. While advice seeking has been touched upon in earlier research, prior treatments have largely been unstructured. Our work provides a more systematic approach to the problem of advice in MAB, addressing two essential questions: “when to ask for advice?” and “what arm to ask about?”. Finally, we investigate the problem of near-optimal exploration: we provide a myopic approximation to the Bayes-optimal policy and show that it achieves near-optimal performance.
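The thesis develops these policies formally; purely as an illustrative sketch of the setting (the function names, the uniform Beta(1, 1) priors, and the specific two-step index below are our own assumptions, not the thesis's algorithm), a Beta-Bernoulli bandit driven by a myopic truncation of the Bayes-optimal lookahead might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_means(alphas, betas):
    """Posterior mean of each arm's Bernoulli parameter under Beta(a, b)."""
    return alphas / (alphas + betas)

def myopic_index(alphas, betas, i):
    """Two-step truncation of the Bayes-optimal value of pulling arm i:
    immediate expected reward plus the expected best posterior mean after
    updating on that pull's outcome (an assumed, illustrative index)."""
    p = alphas[i] / (alphas[i] + betas[i])
    a_s, b_s = alphas.copy(), betas.copy()
    a_s[i] += 1  # posterior counts if the pull succeeds
    a_f, b_f = alphas.copy(), betas.copy()
    b_f[i] += 1  # posterior counts if the pull fails
    return (p
            + p * posterior_means(a_s, b_s).max()
            + (1 - p) * posterior_means(a_f, b_f).max())

def run(true_probs, horizon):
    k = len(true_probs)
    alphas, betas = np.ones(k), np.ones(k)  # Beta(1, 1) priors on each arm
    total = 0.0
    for _ in range(horizon):
        arm = max(range(k), key=lambda i: myopic_index(alphas, betas, i))
        reward = float(rng.random() < true_probs[arm])  # Bernoulli draw
        alphas[arm] += reward
        betas[arm] += 1.0 - reward
        total += reward
    return total

print(run(np.array([0.3, 0.6, 0.5]), horizon=50))
```

The point of the sketch is only the Beta posterior bookkeeping and the one-step-lookahead structure; the thesis's actual policies account for the remaining finite horizon.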
Our key contributions include:
- Deriving, under an advice budget of one, the optimal choice of which arm the agent should seek advice about, using the value of information (VOI) as a measure of each query's utility (a toy version of this computation is sketched after this list).
- Providing and investigating approximations of the VOI based on our myopic-horizon approximation to the Bayes-optimal policy.
- Drawing parallels between the Bayes-optimal policy and the value of information, revealing the intertwined nature of the two.
- Demonstrating that in some cases it is beneficial for an agent to postpone seeking advice, even when the advice is free, and deriving the optimal solution for such delays in a Bayesian bandit setting.
- Investigating a myopic-horizon approximation to the Bayes-optimal policy and showing its efficacy in achieving near-optimal results.
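The exact VOI derivation is one of the thesis's contributions; as a crude, hypothetical illustration only, if advice is modeled as one free Bernoulli observation of the queried arm (our simplifying assumption, not the thesis's advice model), a one-step VOI and the resulting budget-of-one query choice could be sketched as:

```python
import numpy as np

def greedy_value(alphas, betas):
    """Best current posterior mean across arms (exploit-only value)."""
    return (alphas / (alphas + betas)).max()

def voi_free_observation(alphas, betas, i):
    """Hypothetical one-step VOI of advice about arm i, modeling the advice
    as one free Bernoulli observation of that arm: the expected greedy value
    after updating on the observation, minus the current greedy value."""
    p = alphas[i] / (alphas[i] + betas[i])
    a_s, b_s = alphas.copy(), betas.copy()
    a_s[i] += 1  # posterior counts if the observation is a success
    a_f, b_f = alphas.copy(), betas.copy()
    b_f[i] += 1  # posterior counts if the observation is a failure
    return (p * greedy_value(a_s, b_s)
            + (1 - p) * greedy_value(a_f, b_f)
            - greedy_value(alphas, betas))

# With a budget of one, ask about the arm whose VOI is largest. Here the
# well-sampled greedy arm (index 2) has zero VOI, while the poorly sampled
# arm (index 1) is the most valuable one to ask about.
alphas = np.array([2.0, 1.0, 5.0])
betas = np.array([2.0, 1.0, 3.0])
best_query = max(range(len(alphas)),
                 key=lambda i: voi_free_observation(alphas, betas, i))
print(best_query, voi_free_observation(alphas, betas, best_query))
```
-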
- Graduation date
- Spring 2024
-
- Type of Item
- Thesis
-
- Degree
- Master of Science
-
- License
- This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.