Unifying n-Step Temporal-Difference Action-Value Methods

Juan Fernando Hernandez Garcia

doi:doi:10.7939/r3-6k85-6p33

This decommissioned ERA site remains active temporarily to support our final migration steps to https://ualberta.scholaris.ca, ERA's new home. All new collections and items, including Spring 2025 theses, are at that site. For assistance, please contact erahelp@ualberta.ca.

View

Download

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Usage

455 views
299 downloads

Unifying n-Step Temporal-Difference Action-Value Methods

Author / Creator

Juan Fernando Hernandez Garcia
Unifying seemingly disparate algorithmic ideas to produce better performing algorithms has been a longstanding goal in reinforcement learning. As a primary example, the TD(λ) algorithm elegantly unifies temporal difference (TD) methods with Monte Carlo methods through the use of eligibility traces and the trace-decay parameter λ. The same type of unification is achievable with n-step algorithms, a simpler version of multi-step TD methods where updates consist of a single backup of length n instead of a geometric average of several backups of different lengths.In this work, we present a new n-step algorithm named Q(σ) that unifies two of the existing n-step algorithms for estimating action-value functions — Sarsa and Tree Backup. The fundamental difference between Sarsa and Tree Backup is that the former samples a single action at every step of the backup, whereas the latter takes an expectation over all the possible actions. We introduce a new parameter, σ ∈ [0, 1], that allows the degree of sampling performed by the algorithm at each step to be continuously varied. This creates a new family of algorithms that span a continuum between Sarsa (full sampling, σ = 1) and Tree Backup (pure expectation, σ = 0). Our results show that our algorithm can perform better when using intermediate values of σ instead of any of the extremes. Moreover, if we decay σ over time from one to zero, we obtain an algorithm that outperforms other variants of Q(σ) with a fixed σ over a variety of tasks.This work has three main contributions. First, we introduce our new algorithm, n-step Q(σ) and provide empirical evaluations of the algorithm in the tabular case. Second, we extend n-step Q(σ) to the linear function approximation case and demonstrate its performance in the environment mountain cliff. Third, we combined n-step Q(σ) with the DQN architecture and tested the performance of our new architecture — named the Q(σ) network — in the mountain car environment.Throughout our empirical evaluations, we found that the parameter σ often serves as a trade-off between initial and final performance. Moreover, we found that the decaying σ algorithm performed better than algorithms with fixed values of σ in terms of initial and final performance. We also found that in some domains n-step Q(σ) with an intermediate value of σ performed better than either of the extreme values corresponding to n-step Tree Backup and Sarsa. Our results represent a compelling argument for using n-step Q(σ) over n-step Sarsa or Tree Backup. n-step Q(σ) offers a flexible framework that can be adapted to the specifics of the learning task in order to improve performance.
Subjects / Keywords
Graduation date

Spring 2019
Type of Item

Thesis
Degree

Master of Science
DOI

https://doi.org/10.7939/r3-6k85-6p33
License

Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.

Language

English
Institution

University of Alberta
Degree level

Master's
Department
- Department of Computing Science
Specialization
- Statistical Machine Learning
Supervisor / co-supervisor and their department(s)
- Richard Sutton (Computing Science)