Targeted Search Control in AlphaZero for Effective Policy Improvement

  • Author / Creator
    Trudeau, Alexandre
  • Abstract
    AlphaZero is a self-play reinforcement learning algorithm that achieves superhuman play in the games of chess, shogi, and Go via policy iteration. To be an effective policy improvement operator, AlphaZero’s search needs accurate value estimates for the states that appear in its search tree. The accuracy of AlphaZero’s value function depends on the distribution of states encountered and trained upon. AlphaZero begins its self-play training matches from the initial state of a game and only samples actions over the first few moves, which limits its exploration of states deeper in the game tree. In this thesis, I introduce Go-Exploit, a novel search control strategy for AlphaZero. Go-Exploit samples the start state of its self-play trajectories from an archive of states of interest. Beginning self-play trajectories from states throughout the game tree enables Go-Exploit to explore the game tree more effectively and to learn a value function that generalizes better. Producing shorter self-play trajectories allows Go-Exploit to train on more independent value targets, further improving value training. Finally, the exploration inherent in Go-Exploit reduces its need for exploratory actions, enabling it to train under more exploitative policies. In the games of Connect Four and 9x9 Go, I show that Go-Exploit learns with greater sample efficiency than standard AlphaZero, resulting in stronger performance against reference opponents and in head-to-head play. I also compare Go-Exploit to KataGo, a more sample-efficient reimplementation of AlphaZero, and show that Go-Exploit’s search control strategy is more sample-efficient than KataGo’s. Furthermore, Go-Exploit’s sample efficiency improves when KataGo’s other innovations are incorporated.
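
The search control idea described in the abstract, sampling each self-play start state from an archive rather than always starting from the initial position, can be illustrated with a minimal sketch. This is not the thesis implementation. The class `StartStateArchive`, the mixing probability `p_initial`, the circular-overwrite eviction rule, and the toy counting game are all hypothetical choices made only for illustration; the abstract does not specify how the archive is populated or sampled.

```python
import random


class StartStateArchive:
    """Hypothetical archive of states from which self-play may start."""

    def __init__(self, initial_state, capacity=1000, p_initial=0.2, seed=None):
        self.initial_state = initial_state
        self.capacity = capacity
        self.p_initial = p_initial  # assumed chance of starting from the initial state
        self._states = []
        self._next = 0  # index of the oldest entry once the archive is full
        self._rng = random.Random(seed)

    def add(self, state):
        # Assumed eviction rule: overwrite the oldest entry when full.
        if len(self._states) < self.capacity:
            self._states.append(state)
        else:
            self._states[self._next] = state
            self._next = (self._next + 1) % self.capacity

    def sample_start(self):
        # Fall back to the initial state when the archive is empty.
        if not self._states or self._rng.random() < self.p_initial:
            return self.initial_state
        return self._states[self._rng.randrange(len(self._states))]


def self_play_episode(start_state, policy, step, is_terminal, max_moves=1000):
    """Play one trajectory from start_state and return the visited states."""
    visited = [start_state]
    state = start_state
    for _ in range(max_moves):
        if is_terminal(state):
            break
        state = step(state, policy(state))
        visited.append(state)
    return visited


def run_self_play(archive, n_episodes, policy, step, is_terminal):
    """Run episodes, archive visited non-terminal states, return episode lengths."""
    lengths = []
    for _ in range(n_episodes):
        visited = self_play_episode(archive.sample_start(), policy, step, is_terminal)
        for state in visited:
            if not is_terminal(state):
                archive.add(state)
        lengths.append(len(visited) - 1)
    return lengths


# Toy stand-in for a game: the state is a counter that ends at 10.
archive = StartStateArchive(initial_state=0, seed=0)
lengths = run_self_play(
    archive,
    n_episodes=50,
    policy=lambda s: 1,
    step=lambda s, a: s + a,
    is_terminal=lambda s: s >= 10,
)
```

In this toy setting, episodes that start from archived mid-game states are shorter than the full 10-move game, which mirrors the abstract's point that starting deeper in the game tree produces shorter trajectories. A real system would replace `policy` with a search-guided move selector and train a network on the resulting value targets.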

  • Subjects / Keywords
  • Graduation date
    Spring 2023
  • Type of Item
  • Degree
    Master of Science
  • DOI
  • License
    This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.