Online Off-policy Prediction

Sina Ghiassian

doi:doi:10.7939/r3-mwcp-1873

This decommissioned ERA site remains active temporarily to support our final migration steps to https://ualberta.scholaris.ca, ERA's new home. All new collections and items, including Spring 2025 theses, are at that site. For assistance, please contact erahelp@ualberta.ca.

View

Download

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Usage

316 views
334 downloads

Online Off-policy Prediction

Author / Creator

Sina Ghiassian
In this dissertation, we study online off-policy temporal-difference learning algorithms, a class of reinforcement learning algorithms that can learn predictions in an efficient and scalable manner. The contributions of this dissertation are one of the two kinds: (1) empirically studying existing off-policy learning algorithms, or, (2) exploring new algorithmic ideas.

In reinforcement learning, it is not uncommon for an agent to learn about one policy, called the target policy, while behaving using a different policy, called the behavior policy. When the target and behavior policies are different, learning is `off' the policy in the sense that the data is from a different source than the target policy.

Our first contribution is a novel Gradient-TD algorithm called TDRC. Gradient-TD algorithms are one of the most important families of off-policy learning algorithms due to their favorable convergence guarantees. Previous research showed that Gradient-TD algorithms learn slower than some other unsound algorithms such as Off-policy TD (Maei, 2011). In addition, Gradient-TD algorithms are not as easy to use as Off-policy TD because they have two learned weight vectors and two step-size parameters. Our contribution is a new algorithm called temporal-difference learning with regularized corrections (TDRC), that behaves as well as Off-policy TD, when Off-policy TD performs well but is sound in cases where Off-policy TD diverges. TDRC provides a standard way of setting the second step-size parameter and is easier to use than other Gradient-TD algorithms. We empirically investigate the performance of TDRC across a range of problems, for both prediction and control, and for both linear and non-linear function approximation, and show that it consistently outperforms other Gradient-TD algorithms. We derive the control variant of TDRC and show that it could be a better alternative to existing algorithms for solving complex control problems with neural networks.

One of the important contributions of this thesis is a comprehensive empirical study of off-policy prediction learning algorithms. The study is one of the most thorough studies of off-policy prediction learning algorithms to date in terms of the number of algorithms it includes and is unique in that the performance of algorithms with respect to each of the algorithms parameters is assessed individually. We present empirical results with eleven prominent off-policy learning algorithms that use linear function approximation: five Gradient-TD methods, two Emphatic-TD methods, Off-policy TD, V-trace, Tree Backup, and ABQ. We assess the performance of the algorithms in terms of their learning speed, asymptotic error, and sensitivity to the step-size and bootstrapping parameters on a simple problem, called the Collision task, that has eight states and two actions. On the Collision task, we found that the two Emphatic-TD algorithms learned the fastest, reached the lowest errors, and were robust to parameter settings. V-trace, Tree Backup, and ABQ were no faster and had higher asymptotic error than other algorithms.

Another one of the main contributions of this thesis is the first empirical study of off-policy prediction learning algorithms with a focus on one of the most important challenges in off-policy learning: slow learning. When learning off-policy, importance sampling ratios are used to correct for the differences between the target and behavior policies. These ratios can be large and of high variance. The step-size parameter should be set small to control this high variance, which in turn slows down learning. In this study, we found that the algorithms' performance is highly affected by the variance induced by the importance sampling ratios. Data shows that algorithms that adapt the bootstrapping parameter during learning, such as V-trace are less affected by the high variance than the other algorithms. We observed that Emphatic TD tends to have lower asymptotic error than other algorithms, but exhibits high variance and does not learn well when the product of importance sampling ratios is large.

The final contribution of this thesis is a step-size adaptation algorithm, called Step-size Ratchet, that when combined with Emphatic TD, significantly improves its learning speed. The proposed algorithm keeps the step-size parameter as large as possible and only ratchets it down when necessary. We establish the effectiveness of the combination of Emphatic TD and Step-size Ratchet by comparing it with the combination of Emphatic TD and other step-size adaptation algorithms on tasks where learning fast is challenging. (Abridged abstract)
Subjects / Keywords
Graduation date

Spring 2022
Type of Item

Thesis
Degree

Doctor of Philosophy
DOI

https://doi.org/10.7939/r3-mwcp-1873
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Doctoral
Department
- Department of Computing Science
Supervisor / co-supervisor and their department(s)
- Sutton, Richard (Computing Science)
- White, Adam (Computing Science)