Targeted Search Control in AlphaZero for Effective Policy Improvement

Trudeau, Alexandre

doi:10.7939/r3-9emf-7z16

Communities and Collections

Graduate and Postdoctoral Studies (GPS), Faculty of / Theses and Dissertations

Author / Creator

Trudeau, Alexandre
Abstract

AlphaZero is a self-play reinforcement learning algorithm that achieves superhuman play in chess, shogi, and Go via policy iteration. To be an effective policy improvement operator, AlphaZero's search needs accurate value estimates for the states that appear in its search tree. The accuracy of AlphaZero's value function depends on the distribution of states encountered and trained upon. AlphaZero begins its self-play training matches from the initial state of a game and samples actions only over the first few moves, which limits its exploration of states deeper in the game tree.

In this thesis, I introduce Go-Exploit, a novel search control strategy for AlphaZero. Go-Exploit samples the start state of its self-play trajectories from an archive of states of interest. Beginning self-play trajectories from states throughout the game tree enables Go-Exploit to explore the game tree more effectively and to learn a value function that generalizes better. Producing shorter self-play trajectories allows Go-Exploit to train upon more independent value targets, further improving value training. Finally, the exploration inherent in Go-Exploit reduces its need for exploratory actions, enabling it to train under more exploitative policies.

In the games of Connect Four and 9x9 Go, I show that Go-Exploit learns with greater sample efficiency than standard AlphaZero, resulting in stronger performance against reference opponents and in head-to-head play. I also compare Go-Exploit to KataGo, a more sample-efficient reimplementation of AlphaZero, and show that Go-Exploit's search control strategy is more sample efficient than KataGo's. Furthermore, Go-Exploit's sample efficiency improves when KataGo's other innovations are incorporated.
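The idea of sampling self-play start states from an archive can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say how the archive is filled or sampled, so everything here is an assumption made for illustration. That includes the `StartStateArchive` helper, its bounded capacity, uniform sampling, the `p_initial` probability of starting from the initial state, and filling the archive with states visited during earlier games. A toy counting game with a random policy stands in for the real game and for AlphaZero's search.

```python
import random
from collections import deque


class StartStateArchive:
    """Bounded archive of candidate start states (hypothetical helper)."""

    def __init__(self, initial_state, capacity=1000, p_initial=0.2, seed=0):
        self.initial_state = initial_state
        self.p_initial = p_initial            # assumed mixing probability
        self.states = deque(maxlen=capacity)  # oldest entries are evicted first
        self.rng = random.Random(seed)

    def add(self, state):
        self.states.append(state)

    def sample(self):
        # Use the game's initial state when the archive is empty,
        # or with probability p_initial; otherwise sample uniformly.
        if not self.states or self.rng.random() < self.p_initial:
            return self.initial_state
        return self.rng.choice(self.states)


def self_play(start_state, rng, target=10):
    """Play a toy counting game from start_state with a random policy
    (a stand-in for AlphaZero's search). Returns the visited states."""
    state, visited = start_state, [start_state]
    while state < target:
        state = min(target, state + rng.choice((1, 2)))
        visited.append(state)
    return visited


def run(num_games=50, seed=0, target=10):
    rng = random.Random(seed)
    archive = StartStateArchive(initial_state=0, seed=seed)
    starts = []
    for _ in range(num_games):
        start = archive.sample()
        starts.append(start)
        for state in self_play(start, rng, target):
            if state < target:  # terminal states cannot start a new game
                archive.add(state)
    return starts, archive


if __name__ == "__main__":
    starts, archive = run()
    print("start states used:", starts)
    print("archive size:", len(archive.states))
```

In this sketch the first game necessarily starts from the initial state because the archive is empty, and later games start from states spread across the toy game. Games begun deeper in the game are also shorter, which mirrors the abstract's point about shorter self-play trajectories.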
Graduation date

Spring 2023
Type of Item

Thesis
Degree

Master of Science
DOI

https://doi.org/10.7939/r3-9emf-7z16
License

This thesis is made available by the University of Alberta Libraries with permission of the copyright owner solely for non-commercial purposes. This thesis, or any portion thereof, may not otherwise be copied or reproduced without the written consent of the copyright owner, except to the extent permitted by Canadian copyright law.

Language

English
Institution

University of Alberta
Degree level

Master's
Department
- Department of Computing Science
Supervisor / co-supervisor and their department(s)
- Bowling, Michael (Computing Science)