- Bootstrapping (2)
- Function approximation (2)
- Planning (2)
- Policy gradient methods (2)
- Temporal difference learning (2)
- Two-timescale stochastic approximation (2)
- Bhatnagar, Shalabh (1)
- Garg, Shivam (1)
- Ghavamzadeh, Mohammad (1)
- Jabbari Arfaee, Shahab (1)
- Lee, Mark (1)
- Sutton, Richard (1)
-
Spring 2022
Policy gradient (PG) estimators are ineffective in dealing with softmax policies that are sub-optimally saturated, which refers to the situation when the policy concentrates its probability mass on sub-optimal actions. Sub-optimal policy saturation may arise from a bad policy initialization or a...
-
Fall 2010
We investigate the use of machine learning to create effective heuristics for single-agent search. Our method aims to generate a sequence of heuristics from a given weak heuristic h_0 and a set of unlabeled training instances using a bootstrapping procedure. The training instances that can be...
-
2009
Bhatnagar, Shalabh; Sutton, Richard; Ghavamzadeh, Mohammad; Lee, Mark
Technical report TR09-10. We present four new reinforcement learning algorithms based on actor-critic, function approximation, and natural gradient ideas, and we provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which...