Reinforcement Learning for Continuing Problems Using Average Reward

  • Author / Creator
    Naik, Abhishek
  • This dissertation develops simple and practical learning algorithms from first principles for long-lived agents. Formally, the algorithms are developed within the reinforcement learning framework for continuing (non-episodic) problems, in which the agent-environment interaction goes on ad infinitum and the goal is to maximize the average reward obtained per step. The average-reward formulation is under-studied in reinforcement learning, and several important problems in it remain open.

    The first contribution of this dissertation is the development of foundational one-step average-reward learning methods for prediction and control. The central idea is to use the TD error to estimate the average reward, which enables convergence proofs in both the on- and off-policy tabular settings. Experimental results show that the algorithms' performance is robust to the values of their parameters (a minimal sketch of this one-step update appears at the end of this record).

    Next, we extend the above one-step prediction algorithm to make multi-step updates using eligibility traces, since multi-step methods can be more sample-efficient. Building on the analysis of a related algorithm, we prove convergence in the on-policy setting with linear function approximation. We also give the first convergence proof for a multi-step tabular average-reward prediction algorithm in the off-policy setting (a sketch of the trace-based update appears at the end of this record).

    Finally, we show that standard discounted algorithms can be significantly improved if their rewards are centered by subtracting the rewards' empirical average, which may change over time in the control problem. We discuss two ways of estimating the average reward that can be used with any standard discounted algorithm and demonstrate the benefits of reward centering with tabular, linear, and non-linear function approximation (a sketch with tabular Q-learning appears at the end of this record).

  • Graduation date
    Spring 2024
  • Type of Item
    Thesis
  • Degree
    Doctor of Philosophy
  • DOI
    https://doi.org/10.7939/r3-na41-w708
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.
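
The abstract above describes three algorithmic ideas; the sketches below illustrate them under stated assumptions. First, a minimal sketch of the one-step tabular prediction update, in which a single TD error drives both the value estimates and the average-reward estimate, as the abstract describes. The environment interface (reset/step), the policy function, and the step-size ratio eta are illustrative assumptions, not the dissertation's code.

    import numpy as np

    def differential_td_prediction(env, policy, num_states,
                                   alpha=0.1, eta=0.1, steps=100_000):
        # Tabular one-step average-reward prediction (sketch).
        # One TD error updates both the differential value estimates
        # and the average-reward estimate r_bar.
        V = np.zeros(num_states)   # differential value estimates
        r_bar = 0.0                # average-reward estimate
        s = env.reset()            # assumed interface: returns a state index
        for _ in range(steps):
            a = policy(s)
            s_next, r = env.step(a)               # continuing: no terminal states
            delta = r - r_bar + V[s_next] - V[s]  # differential TD error
            V[s] += alpha * delta
            r_bar += eta * alpha * delta          # TD error, not a sample average
            s = s_next
        return V, r_bar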
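
Second, the multi-step extension with eligibility traces might look as follows under linear function approximation. Because the average-reward setting has no discount factor, the accumulating trace decays by lambda alone; the feature function and hyperparameter names are assumptions for illustration.

    import numpy as np

    def differential_td_lambda(env, policy, features, dim,
                               alpha=0.05, eta=0.1, lam=0.8, steps=100_000):
        # On-policy multi-step average-reward prediction with
        # linear function approximation (sketch).
        w = np.zeros(dim)          # weights of the linear value estimate
        z = np.zeros(dim)          # accumulating eligibility trace
        r_bar = 0.0
        s = env.reset()
        x = features(s)            # assumed: feature vector of length dim
        for _ in range(steps):
            a = policy(s)
            s_next, r = env.step(a)
            x_next = features(s_next)
            delta = r - r_bar + w @ x_next - w @ x  # differential TD error
            z = lam * z + x                         # no gamma: decay by lambda only
            w += alpha * delta * z                  # multi-step update via the trace
            r_bar += eta * alpha * delta
            s, x = s_next, x_next
        return w, r_bar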
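
Third, reward centering can wrap any standard discounted method. The sketch below wraps tabular Q-learning and updates the average-reward estimate from the TD error, one of the two estimators the abstract mentions; a plain running average of observed rewards is the simpler alternative. The epsilon-greedy exploration and the environment interface are again illustrative.

    import numpy as np

    def centered_q_learning(env, num_states, num_actions, alpha=0.1,
                            gamma=0.99, eta=0.1, epsilon=0.1, steps=100_000):
        # Tabular discounted Q-learning with reward centering (sketch).
        rng = np.random.default_rng(0)
        Q = np.zeros((num_states, num_actions))
        r_bar = 0.0                # average-reward estimate used for centering
        s = env.reset()
        for _ in range(steps):
            if rng.random() < epsilon:
                a = int(rng.integers(num_actions))   # explore
            else:
                a = int(np.argmax(Q[s]))             # exploit
            s_next, r = env.step(a)
            # Center the reward by subtracting the average-reward estimate.
            delta = (r - r_bar) + gamma * np.max(Q[s_next]) - Q[s, a]
            Q[s, a] += alpha * delta
            r_bar += eta * alpha * delta             # TD-error-based estimator
            s = s_next
        return Q, r_bar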