Greedification Operators for Policy Optimization: Investigating Forward and Reverse KL Divergences

Chan, Alan

doi:doi:10.7939/r3-m4yx-n678

This decommissioned ERA site remains active temporarily to support our final migration steps to https://ualberta.scholaris.ca, ERA's new home. All new collections and items, including Spring 2025 theses, are at that site. For assistance, please contact erahelp@ualberta.ca.

View

Download

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Usage

315 views
495 downloads

Greedification Operators for Policy Optimization: Investigating Forward and Reverse KL Divergences

Author / Creator

Chan, Alan
Policy gradient methods typically estimate both explicit policy and value functions. The long-extant view of policy gradient methods as approximate policy iteration---alternating between policy evaluation and policy improvement by greedification---is a helpful framework to elucidate algorithmic choices. Effective policy evaluation under function approximation is being actively investigated; approximate greedification, however, has yet to be systematically explored. In this work, we highlight and investigate the difference between the forward and reverse KL divergences when used for policy improvement. We show that the reverse KL has stronger theoretical guarantees for policy improvement, but that the forward KL can also induce improvement under additional assumptions. Finally, on both small-scale and large-scale experiments, we empirically analyze the behaviour and practical performance of these variants. We observe few consistent differences between the reverse and forward KLs on discrete-action spaces, but relatively more substantial stability and convergence differences emerge on continuous-action spaces.
Subjects / Keywords
Graduation date

Fall 2020
Type of Item

Thesis
Degree

Master of Science
DOI

https://doi.org/10.7939/r3-m4yx-n678
License

Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.

Language

English
Institution

University of Alberta
Degree level

Master's
Department
- Department of Computing Science
Specialization
- Statistical Machine Learning
Supervisor / co-supervisor and their department(s)
- White, Martha (Computing Science)