Toward Emphatic Reinforcement Learning

  • Author / Creator
    Ni, Jingjiao
  • Emphatic-Temporal-Difference (Emphatic-TD) learning algorithms were recently proposed based on Temporal-Difference (TD) methods, the most central and widely used reinforcement learning algorithms. They were originally designed to solve the divergence problem of conventional TD methods under off-policy training. However, recent studies have shown that Emphatic-TD methods can outperform conventional TD methods on prediction problems even in the on-policy case. This thesis therefore investigates how Emphatic-TD methods can be extended to on-policy control. Emphatic-TD methods are also sensitive to the step-size parameter: an inappropriate step size leads to divergence, which this thesis calls “the sensitivity problem”. We encountered this problem in our empirical studies of Emphatic-TD methods, and we provide a solution to it.
    This thesis makes contributions in two separate but related areas. First, we propose new heuristics for reliably adapting step sizes in emphatic methods, addressing the sensitivity problem. Second, we extend the idea of emphatic methods to on-policy control and propose two new methods: n-step Emphatic-Sarsa and Emphatic-Sarsa(λ). Our empirical results show that both the step-size heuristics and the new on-policy emphatic control methods work, and that the new control methods outperform the corresponding non-emphatic methods in some particular cases. A limitation of our work is that our empirical studies of the on-policy control methods cover only the equal-interest case.

  • Subjects / Keywords
  • Graduation date
    Spring 2021
  • Type of Item
  • Degree
    Master of Science
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.