Improving Sample Efficiency of Online Temporal Difference Learning

  • Author / Creator
    Pan, Yangchen
  • A common scientific challenge in putting a reinforcement learning agent into practice is improving sample efficiency as much as possible with limited computational or memory resources, and the resources available vary across applications. My thesis introduces approaches that flexibly balance sample efficiency against physical resources for prediction and control problems in an online reinforcement learning setting. Our methods can significantly improve sample efficiency with reasonable computational and storage demands.

    We draw on two key optimization strategies known to improve convergence rates: second-order optimization and prioritized sampling of which data to update with. This thesis focuses mainly on the policy evaluation problem, though we also introduce effective sampling distributions for control tasks. In particular, for policy evaluation we develop an approximate second-order method to minimize the Mean Squared Projected Bellman Error (MSPBE). The method's computational and memory costs scale sub-quadratically with the feature dimension. We propose two techniques to efficiently and incrementally approximate the preconditioning matrix in the second-order update rule: truncated singular value decomposition and sketching via random projection. We further introduce a simple regularization method that, under certain assumptions, theoretically guarantees unbiased convergence of our algorithm.
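    As background for the paragraph above, consider linear policy evaluation with features x. There the MSPBE can be written as MSPBE(w) = (b − Aw)ᵀ C⁻¹ (b − Aw), where A = E[x(x − γx′)ᵀ], b = E[r x] and C = E[x xᵀ]. When A is invertible, its minimizer is the TD fixed point Aw = b.

    The Python sketch below uses NumPy. It preconditions a linear TD(0) update with the pseudo-inverse of a rank-k truncated SVD of the running sum of x(x − γx′)ᵀ, and adds a small η-scaled plain TD term. It is a minimal illustration in the spirit of the approach described, not the thesis's algorithm. Several choices are purely illustrative assumptions:
    - the five-state deterministic ring;
    - one-hot features;
    - k = 2 and η = 0.1;
    - the warm-up length;
    - the full SVD recomputed at every step, which costs O(d³), unlike the incremental, sub-quadratic updates the thesis proposes.

```python
import numpy as np


def truncated_pinv(A, k):
    """Pseudo-inverse of the best rank-k approximation of A, via a full SVD.

    A full SVD costs O(d^3); here it stands in for an incremental low-rank update.
    """
    U, s, Vt = np.linalg.svd(A)
    return Vt[:k].T @ np.diag(1.0 / s[:k]) @ U[:, :k].T


def preconditioned_td(rewards, gamma=0.9, k=2, eta=0.1, steps=10_000, warmup=10):
    """Linear TD(0) with a low-rank preconditioner on a deterministic ring (i -> i+1 mod n)."""
    n = len(rewards)
    features = np.eye(n)                  # one-hot features (hypothetical choice)
    w = np.zeros(n)
    A_sum = np.zeros((n, n))              # running sum of x (x - gamma x')^T
    s = 0
    for t in range(steps):
        s_next = (s + 1) % n
        x, x_next = features[s], features[s_next]
        delta = rewards[s] + gamma * (x_next @ w) - x @ w    # TD error
        A_sum += np.outer(x, x - gamma * x_next)
        step = eta * delta * x                                # plain TD term (eta * I)
        if t >= warmup:                                       # wait until A_sum is informative
            # summing rather than averaging folds a 1/t step size into the preconditioner
            step += truncated_pinv(A_sum, k) @ (delta * x)
        w += step
        s = s_next
    return w


rewards = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
w = preconditioned_td(rewards)
P = np.roll(np.eye(5), 1, axis=1)         # transition matrix of the ring
v_true = np.linalg.solve(np.eye(5) - 0.9 * P, rewards)
print(np.round(w, 3), np.round(v_true, 3))
```

    In this sketch the truncated pseudo-inverse only moves w within the top-k singular subspace of the A estimate. The η term is therefore what updates the discarded directions, which here include the slowly converging constant direction. Using the running sum instead of the average of x(x − γx′)ᵀ gives the full-rank, η = 0 version of the update the same form as recursive least-squares TD.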

    For control problems, we study effective sampling distributions for generating imagined experience in model-based reinforcement learning (MBRL). Specifically, within Dyna, a classic MBRL architecture, we design novel search-control strategies. These are the mechanisms that generate the states from which we query an environment model, acquiring imagined experience that improves the policy during the planning phase. We provide both theoretical and empirical evidence that our methods improve sample efficiency.
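    The planning loop described above can be sketched as tabular Dyna-Q on a hypothetical ten-state chain. A search-control function decides which previously visited states are used to query a learned deterministic model. Two strategies are shown: uniform sampling over visited states, and sampling in proportion to each state's most recent absolute TD error (a prioritized-sweeping-style heuristic). Both are stand-ins chosen for illustration, not the search-control strategies proposed in the thesis. The environment, step sizes, ε, and the 1e-3 floor on priorities are likewise assumptions.

```python
import random
from collections import defaultdict


def dyna_q(n_states=10, episodes=50, planning_steps=30, search_control="priority",
           alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q on a hypothetical chain: start at state 0, reward 1 on reaching the right end."""
    rng = random.Random(seed)
    actions = (-1, +1)                 # move left, move right
    goal = n_states - 1
    Q = defaultdict(float)             # Q[(state, action)], initialised to 0
    model = {}                         # learned deterministic model: (s, a) -> (r, s_next, done)
    priority = defaultdict(float)      # last |TD error| seen at each state (search-control signal)

    def env_step(s, a):
        s_next = min(max(s + a, 0), goal)    # clamp to the chain
        done = s_next == goal
        return (1.0 if done else 0.0), s_next, done

    def greedy(s):
        best = max(Q[(s, b)] for b in actions)
        return rng.choice([b for b in actions if Q[(s, b)] == best])   # random tie-breaking

    def q_update(s, a, r, s_next, done):
        target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
        td_error = target - Q[(s, a)]
        Q[(s, a)] += alpha * td_error
        return abs(td_error)

    def choose_planning_state(visited):
        if search_control == "priority":
            weights = [priority[v] + 1e-3 for v in visited]   # floor keeps every state reachable
            return rng.choices(visited, weights=weights)[0]
        return rng.choice(visited)                             # uniform over visited states

    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # act in the real environment (epsilon-greedy) and learn from the real transition
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            r, s_next, done = env_step(s, a)
            model[(s, a)] = (r, s_next, done)
            priority[s] = q_update(s, a, r, s_next, done)
            s = s_next
            # planning: search control picks states, the model supplies imagined transitions
            visited = sorted({key[0] for key in model})
            for _ in range(planning_steps):
                ps = choose_planning_state(visited)
                pa = rng.choice([b for b in actions if (ps, b) in model])
                pr, p_next, p_done = model[(ps, pa)]
                priority[ps] = q_update(ps, pa, pr, p_next, p_done)
    return Q
```

    Calling dyna_q with search_control="uniform" and with search_control="priority" under the same seed makes it easy to compare how quickly each strategy propagates the goal reward back along the chain. Only choose_planning_state differs between the two runs.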

  • Graduation date
    Fall 2021
  • Degree
    Doctor of Philosophy
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.