Investigating Two Policy Gradient Methods Under Different Time Discretizations

  • Author / Creator
    Farrahi, Homayoon
  • Abstract
    Continuous-time reinforcement learning tasks commonly use discrete time steps of fixed cycle times for actions. Choosing a small action-cycle time in such tasks allows reinforcement learning agents to react quickly and to perceive the environment in finer temporal detail. The learning performance of both policy gradient and action-value methods, however, may deteriorate as the cycle time is reduced, which necessitates tuning the cycle time as a hyper-parameter. Since tuning an additional hyper-parameter is time-consuming, especially for real-world robots, existing algorithms can benefit from having hyper-parameters that are robust to the choice of cycle time. In this thesis, we study how changing the action-cycle time affects the performance of two prominent policy gradient algorithms, PPO and SAC, and investigate the efficacy of their widely used hyper-parameter values across different cycle times. We explore how changing some of these hyper-parameters based on the cycle time can help or hinder the performance of these algorithms, and we examine the relationship between these hyper-parameters and the cycle time. These relationships are put forward as new hyper-parameters that can be adjusted based on the cycle time, and their effectiveness is examined and validated on simulated and real-world robotic tasks. We show that the new hyper-parameters, unlike the existing ones, can be more robust to different environments and cycle times and can enable hyper-parameter values tuned to a cycle time on a specific problem to be transferred to a different cycle time.

  • Graduation date
    Fall 2021
  • Type of Item
    Thesis
  • Degree
    Master of Science
  • DOI
    https://doi.org/10.7939/r3-sttb-hb65
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.