Policy Gradient Reinforcement Learning Without Regret

  • Author / Creator
    Dick, Travis B
  • This thesis consists of two independent projects, each contributing to a central goal of artificial intelligence research: to build computer systems that are capable of performing tasks and solving problems without problem-specific direction from us, their designers. I focus on two formal learning problems that have a strong mathematical grounding. Many real-world learning problems can be cast as instances of one of these two problems. Whenever our translation from the real to the formal accurately captures the character of the problem, then the mathematical arguments we make about algorithms in the formal setting will approximately hold in the real-world as well. The first project focuses on an open question in the theory of policy gradient reinforcement learning methods. These methods learn by trial and error and decide whether a trial was good or bad by comparing its outcome to a given baseline. The baseline has no impact on the formal asymptotic guarantees for policy gradient methods, but it does alter their finite-time behaviour. This immediately raises the question: which baseline should we use? I propose that the baseline should be chosen such that a certain estimate used internally by policy gradient methods has the smallest error. I prove that, under slightly idealistic assumptions, this baseline gives a good upper bound on the regret of policy gradient methods. I derive closed-form expressions for this baseline in terms of properties of the formal learning problem and the computer's behaviour. The quantities appearing in the closed form expressions are often unknown, so I also propose two algorithms for estimating this baseline from only known quantities. Finally, I present an empirical comparison of commonly used baselines that demonstrates improved performance when using my proposed baseline. The second project focuses on a recently proposed class of formal learning problems that is in the intersection of two fields of computing science research: reinforcement learning and online learning. The considered problems are sometimes called online Markov decision processes, or Markov decision processes with changing rewards. The unique property of this class is that it assumes the computer's environment is adversarial, as though it were playing a game against the computer. This is in contrast to the more common assumption that the environment's behaviour is determined entirely by stochastic models. I propose three new algorithms for learning in Markov decision processes with changing rewards under various conditions. I prove theoretical performance guarantees for each algorithm that either complement or improve the best existing results and that often hold even under weaker assumptions. This comes at the cost of increased (but still polynomial) computational complexity. Finally, in the development and analysis of these algorithms, it was necessary to analyze an approximate version of a well-known optimization algorithm called online mirror ascent. To the best of my knowledge, this is the first rigorous analysis of this algorithm and it is of independent interest.

  • Subjects / Keywords
  • Graduation date
    Spring 2015
  • Type of Item
  • Degree
    Master of Science
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.