Improving Sample Efficiency of Online Temporal Difference Learning

Pan, Yangchen

doi:10.7939/r3-f7jr-6k05

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Author / Creator

Pan, Yangchen
A common scientific challenge in putting a reinforcement learning agent into practice is how to improve sample efficiency as much as possible with limited computational or memory resources. The physical resources available may vary across applications. My thesis introduces approaches that flexibly balance sample efficiency and physical resources for prediction and control problems in an online reinforcement learning setting. Our methods can significantly improve sample efficiency with reasonable computational power and storage demand.

We draw on two key optimization strategies that are known to improve convergence rates: second-order optimization and prioritized sampling of the data used for updates. In this thesis, we mainly focus on the policy evaluation problem, though we also introduce effective sampling distributions for control tasks. In particular, for policy evaluation problems, we develop an approximate second-order method to minimize the Mean Squared Projected Bellman Error (MSPBE). Our method scales sub-quadratically with the feature dimension in terms of computational and memory cost. We propose two techniques to efficiently and incrementally approximate the preconditioning matrix in the second-order update rule: truncated singular value decomposition and sketching via random projection. We further introduce a simple regularization method that theoretically guarantees the unbiased convergence of our algorithm, under certain assumptions.
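
For readers unfamiliar with the objective, the block below recalls the standard textbook form of the MSPBE for linear value estimates and the exact Newton step it implies. The symbols (x_t for the feature vector, gamma for the discount, A, b, C) and the invertibility of A and C are assumptions made for illustration; this is not the thesis's own update rule, which approximates the preconditioner rather than computing it exactly.

```latex
% Linear value estimate v_w(s) = x(s)^\top w, discount \gamma, TD(0) case.
A = \mathbb{E}\!\left[x_t (x_t - \gamma x_{t+1})^\top\right], \quad
b = \mathbb{E}\!\left[R_{t+1} x_t\right], \quad
C = \mathbb{E}\!\left[x_t x_t^\top\right]

\mathrm{MSPBE}(w) = (b - A w)^\top C^{-1} (b - A w)

\nabla \mathrm{MSPBE}(w) = -2 A^\top C^{-1} (b - A w), \qquad
\nabla^2 \mathrm{MSPBE}(w) = 2 A^\top C^{-1} A

% Newton step, assuming A and C are invertible:
w \leftarrow w + (A^\top C^{-1} A)^{-1} A^\top C^{-1} (b - A w)
  = w + A^{-1} (b - A w)
```

Because A is a d-by-d matrix for d features, storing it exactly already costs memory quadratic in d, which is the cost that a low-rank (truncated SVD) or sketched approximation of the preconditioner is meant to avoid.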

In control problems, we study effective sampling distributions for generating imagined experiences in model-based reinforcement learning (MBRL). Specifically, in a classic MBRL architecture called Dyna, we design novel search-control strategies, that is, mechanisms for generating the states from which we query an environment model to acquire imagined experiences that improve the policy during the planning phase. We provide both theoretical and empirical evidence that our methods improve sample efficiency.
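
To make the role of search-control concrete, here is a minimal tabular Dyna-Q sketch. Everything in it is an assumption for illustration: the chain environment, the function name `dyna_q`, the hyperparameters, and in particular the search-control rule, which samples uniformly from previously visited state-action pairs. That uniform rule is the classic baseline, not one of the strategies proposed in the thesis.

```python
import random

def dyna_q(n_states=6, episodes=30, planning_steps=10,
           alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Dyna-Q on a deterministic chain: states 0..n_states-1,
    actions 0 (left) and 1 (right); entering the last state gives reward 1."""
    rng = random.Random(seed)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]
    model = {}  # (state, action) -> (reward, next_state, done)

    def step(s, a):
        s2 = max(0, s - 1) if a == 0 else s + 1
        done = s2 == goal
        return (1.0 if done else 0.0), s2, done

    def greedy(s):
        # Break ties at random so the untrained agent tries both directions.
        best = max(q[s])
        return rng.choice([a for a in (0, 1) if q[s][a] == best])

    def update(s, a, r, s2, done):
        target = r if done else r + gamma * max(q[s2])
        q[s][a] += alpha * (target - q[s][a])

    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.randrange(2) if rng.random() < epsilon else greedy(s)
            r, s2, done = step(s, a)
            update(s, a, r, s2, done)       # learn from real experience
            model[(s, a)] = (r, s2, done)   # record the transition in the model
            # Search-control: decide where to query the model. Uniform
            # sampling over visited pairs is the baseline assumed here.
            for _ in range(planning_steps):
                ps, pa = rng.choice(list(model))
                pr, ps2, pdone = model[(ps, pa)]
                update(ps, pa, pr, ps2, pdone)  # learn from imagined experience
            s = s2
    return q

q = dyna_q()
# Greedy action in each non-terminal state (1 means "move right").
policy = [max((0, 1), key=lambda a: q[s][a]) for s in range(len(q) - 1)]
```

The only line a different search-control strategy would change is the one that picks `(ps, pa)`; swapping the uniform choice for another sampling distribution leaves the rest of the loop intact.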
Graduation date

Fall 2021
Type of Item

Thesis
Degree

Doctor of Philosophy
DOI

https://doi.org/10.7939/r3-f7jr-6k05
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Doctoral
Department
- Department of Computing Science
Supervisor / co-supervisor and their department(s)
- White, Martha (Computing Science)
- Farahmand, Amir-massoud (Computer Science, University of Toronto)