Toward Emphatic Reinforcement Learning

Ni, Jingjiao

doi:10.7939/r3-jgba-7h12

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Author / Creator

Ni, Jingjiao
Abstract

Emphatic Temporal-Difference (Emphatic-TD) learning algorithms were recently proposed, building on Temporal-Difference (TD) methods, the most central and widely used reinforcement learning algorithms. Emphatic-TD algorithms were originally designed to solve the divergence problem that conventional TD methods exhibit under off-policy training. However, recent studies have shown that Emphatic-TD methods can outperform conventional TD methods on prediction problems even in the on-policy case. In this thesis we are therefore interested in how Emphatic-TD methods can be extended to on-policy control. Emphatic-TD methods are also sensitive to the step-size parameter: an inappropriate step size leads to divergence, which we call “the sensitivity problem” in this thesis. We encountered this problem during our empirical studies of Emphatic-TD methods and therefore provide a solution to it.
This thesis makes contributions in two separate but related areas. First, we propose new heuristics for reliably adapting step sizes in emphatic methods, addressing the sensitivity problem. Second, we extend the idea of emphatic methods to on-policy control and propose the new n-step Emphatic-Sarsa and Emphatic-Sarsa(λ) methods. Our empirical results show that both the step-size heuristics and the new on-policy emphatic control methods work, and that the new methods outperform the corresponding non-emphatic methods in some particular cases. A limitation of our work is that our empirical studies of the on-policy control methods cover only the equal-interest case.
Subjects / Keywords
Graduation date

Spring 2021
Type of Item

Thesis
Degree

Master of Science
DOI

https://doi.org/10.7939/r3-jgba-7h12
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Master's
Department
- Department of Computing Science
Supervisor / co-supervisor and their department(s)
- Sutton, Richard S. (Computing Science)