Actor-Expert: A Framework for using Q-learning in Continuous Action Spaces

  • Author / Creator
    Lim, Sungsu
  • Q-learning can be difficult to use in continuous action spaces, because a difficult optimization has to be solved to find the maximal action. Some common strategies have been to discretize the action space, solve the maximization with a powerful optimizer at each step, restrict the functional form of the action-values, or optimize a different entropy-regularized objective to learn a policy proportional to action-values. Such methods however, can prevent learning accurate action-values, be expensive to execute at each step, or find a potentially suboptimal policy. In this thesis, we propose a new policy search objective that facilitates using Q-learning and a new framework called Actor-Expert, that optimizes this objective. The Expert uses approximate Q-learning to update the action-values towards optimal action-values. The Actor iteratively learns the maximal actions over time for these changing action-values. We develop a Conditional Cross Entropy Method (CCEM) for the Actor, where such a global optimization approach facilitates use of generically parameterized action-values (Expert) with a separate policy (Actor). This method iteratively concentrates density around maximal actions, conditioned on state. We demonstrate in a toy environment that Actor-Expert with unrestricted action-value parameterization and efficient exploration mechanism succeeds while previous Q-learning methods fail. We also demonstrate that Actor-Expert performs as well as or better than previous Q-learning methods on benchmark continuous-action environments. We also show that it is comparable against Actor-Critic baselines, suggesting a new distinction among methods that learn both value function and policy: learning action-values of the current policy or (optimal) action-values decoupled from the policy.

  • Subjects / Keywords
  • Graduation date
    Fall 2019
  • Type of Item
  • Degree
    Master of Science
  • DOI
  • License
    Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.