Download the full-sized PDF of Reinforcement Learning and Simulation-Based Search in Computer GoDownload the full-sized PDF



Permanent link (DOI):


Export to: EndNote  |  Zotero  |  Mendeley


This file is in the following communities:

Graduate Studies and Research, Faculty of


This file is in the following collections:

Theses and Dissertations

Reinforcement Learning and Simulation-Based Search in Computer Go Open Access


Other title
Reinforcement learning, simulation-based search, computer Go, temporal-difference learning
Type of item
Degree grantor
University of Alberta
Author or creator
Silver, David
Supervisor and department
Sutton, Richard (Computing Science)
Mueller, Martin (Computing Science)
Examining committee member and department
Ng, Andrew (Computer Science, Stanford University)
Sutton, Richard (Computing Science)
Schaeffer, Jonathan (Computing Science)
Musilek, Petr (Electrical and Computer Engineering)
Szepesvari, Csaba (Computing Science)
Mueller, Martin (Computing Science)
Department of Computing Science

Date accepted
Graduation date
Doctor of Philosophy
Degree level
Learning and planning are two fundamental problems in artificial intelligence. The learning problem can be tackled by reinforcement learning methods, such as temporal-difference learning, which update a value function from real experience, and use function approximation to generalise across states. The planning problem can be tackled by simulation-based search methods, such as Monte-Carlo tree search, which update a value function from simulated experience, but treat each state individually. We introduce a new method, temporal-difference search, that combines elements of both reinforcement learning and simulation-based search methods. In this new method the value function is updated from simulated experience, but it uses function approximation to efficiently generalise across states. We also introduce the Dyna-2 architecture, which combines temporal-difference learning with temporal-difference search. Whereas temporal-difference learning acquires general domain knowledge from its past experience, temporal-difference search acquires local knowledge that is specialised to the agent's current state, by simulating future experience. Dyna-2 combines both forms of knowledge together. We apply our algorithms to the game of 9x9 Go. Using temporal-difference learning, with a million binary features matching simple patterns of stones, and using no prior knowledge except the grid structure of the board, we learnt a fast and effective evaluation function. Using temporal-difference search with the same representation produced a dramatic improvement: without any explicit search tree, and with equivalent domain knowledge, it achieved better performance than a vanilla Monte-Carlo tree search. When combined together using the Dyna-2 architecture, our program outperformed all handcrafted, traditional search, and traditional machine learning programs on the 9x9 Computer Go Server. We also use our framework to extend the Monte-Carlo tree search algorithm. By forming a rapid generalisation over subtrees of the search space, and incorporating heuristic pattern knowledge that was learnt or handcrafted offline, we were able to significantly improve the performance of the Go program MoGo. Using these enhancements, MoGo became the first 9x9 Go program to achieve human master level.
Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author's prior written permission.
Citation for previous publication

File Details

Date Uploaded
Date Modified
Audit Status
Audits have not yet been run on this file.
File format: pdf (Portable Document Format)
Mime type: application/pdf
File size: 7633306
Last modified: 2015:10:12 14:57:01-06:00
Filename: thesis.pdf
Original checksum: f74a68d4932d5d0a08ebf4e63a09c748
Well formed: false
Valid: false
Status message: Lexical error offset=7264641
Page count: 170
Activity of users you follow
User Activity Date