- 76 views
- 131 downloads
Reinforcement Learning for Continuing Problems Using Average Reward
-
- Author / Creator
- Naik, Abhishek
-
This dissertation develops simple and practical learning algorithms from first principles for long-lived agents. Formally, the algorithms are developed within the reinforcement learning framework for continuing (non-episodic) problems, in which the agent-environment interaction goes on ad infinitum, with the goal of maximizing the average reward obtained per step. The average-reward formulation is under-studied in reinforcement learning with several important open problems.
The first contribution of this dissertation involves the development of foundational one-step average-reward learning methods for prediction and control. The central idea involves using the TD error to estimate the average reward, which enables proofs for convergence in both the on- and off-policy tabular settings. Experimental results show that the algorithms' performance is robust to the values of their parameters.
Next, we extend the above one-step prediction algorithm to make multi-step updates using eligibility traces, because multi-step methods can be more sample-efficient. Based on the analysis of a related algorithm, we prove convergence in the on-policy setting with linear function approximation. We also show the first convergence proof in the off-policy setting for a multi-step tabular average-reward prediction algorithm.
Finally, we show that standard discounted algorithms can be significantly improved if their rewards are centered by subtracting out the rewards' empirical average, which could be changing with time in the control problem. We discuss two ways of estimating the average reward that can be used with any standard discounted algorithm and demonstrate the benefits of reward centering with tabular, linear, and non-linear function approximation.
-
- Subjects / Keywords
-
- Graduation date
- Spring 2024
-
- Type of Item
- Thesis
-
- Degree
- Doctor of Philosophy
-
- License
- This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.