Download the full-sized PDF of Policy Gradient Reinforcement Learning Without RegretDownload the full-sized PDF



Permanent link (DOI):


Export to: EndNote  |  Zotero  |  Mendeley


This file is in the following communities:

Graduate Studies and Research, Faculty of


This file is in the following collections:

Theses and Dissertations

Policy Gradient Reinforcement Learning Without Regret Open Access


Other title
Policy Gradient
Reinforcement Learning
Type of item
Degree grantor
University of Alberta
Author or creator
Dick, Travis B
Supervisor and department
Sutton, Richard (Computing Science)
Gyorgy, Andras (Computing Science)
Examining committee member and department
Schmuland, Byron (Mathematical and Statistical Sciences)
Gyorgy, Andras (Computing Science)
Sutton, Richard (Computing Science)
Department of Computing Science

Date accepted
Graduation date
Master of Science
Degree level
This thesis consists of two independent projects, each contributing to a central goal of artificial intelligence research: to build computer systems that are capable of performing tasks and solving problems without problem-specific direction from us, their designers. I focus on two formal learning problems that have a strong mathematical grounding. Many real-world learning problems can be cast as instances of one of these two problems. Whenever our translation from the real to the formal accurately captures the character of the problem, then the mathematical arguments we make about algorithms in the formal setting will approximately hold in the real-world as well. The first project focuses on an open question in the theory of policy gradient reinforcement learning methods. These methods learn by trial and error and decide whether a trial was good or bad by comparing its outcome to a given baseline. The baseline has no impact on the formal asymptotic guarantees for policy gradient methods, but it does alter their finite-time behaviour. This immediately raises the question: which baseline should we use? I propose that the baseline should be chosen such that a certain estimate used internally by policy gradient methods has the smallest error. I prove that, under slightly idealistic assumptions, this baseline gives a good upper bound on the regret of policy gradient methods. I derive closed-form expressions for this baseline in terms of properties of the formal learning problem and the computer's behaviour. The quantities appearing in the closed form expressions are often unknown, so I also propose two algorithms for estimating this baseline from only known quantities. Finally, I present an empirical comparison of commonly used baselines that demonstrates improved performance when using my proposed baseline. The second project focuses on a recently proposed class of formal learning problems that is in the intersection of two fields of computing science research: reinforcement learning and online learning. The considered problems are sometimes called online Markov decision processes, or Markov decision processes with changing rewards. The unique property of this class is that it assumes the computer's environment is adversarial, as though it were playing a game against the computer. This is in contrast to the more common assumption that the environment's behaviour is determined entirely by stochastic models. I propose three new algorithms for learning in Markov decision processes with changing rewards under various conditions. I prove theoretical performance guarantees for each algorithm that either complement or improve the best existing results and that often hold even under weaker assumptions. This comes at the cost of increased (but still polynomial) computational complexity. Finally, in the development and analysis of these algorithms, it was necessary to analyze an approximate version of a well-known optimization algorithm called online mirror ascent. To the best of my knowledge, this is the first rigorous analysis of this algorithm and it is of independent interest.
Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.
Citation for previous publication
Dick, T., Gyorgy, A., & Szepesvari, Cs., "Online learning in Markov Decision Processes with Changing Cost Sequences" in Proceedings of The 31st International Conference on Machine Learning (2014): 512-520

File Details

Date Uploaded
Date Modified
Audit Status
Audits have not yet been run on this file.
File format: pdf (Portable Document Format)
Mime type: application/pdf
File size: 1145029
Last modified: 2015:10:21 00:54:49-06:00
Filename: Dick_Travis_B_201501_MSc.pdf
Original checksum: 56ca948394a2539fa1a7bc2e2d3863cf
Well formed: true
Valid: true
File title: Abstract
Page count: 108
Activity of users you follow
User Activity Date