Silver et al - Mastering the Game of Go Flashcards
(16 cards)
What was the major innovation of AlphaGo Zero when compared to previous AI systems?
AlphaGo Zero learned completely from self-play without using any human gameplay data or hand-crafted features, surpassing all previous Go programs including earlier versions of AlphaGo that relied on human knowledge.
What key algorithms or methods were combined in AlphaGo Zero’s architecture?
AlphaGo Zero combined deep neural networks with Monte Carlo Tree Search (MCTS) in a reinforcement learning framework.
What did the neural network in AlphaGo Zero output?
The neural network had two outputs: (1) a vector of move probabilities and (2) a scalar value estimating the probability of the current player winning from the current position.
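A minimal sketch of such a two-headed network, assuming a PyTorch-style model with a small convolutional trunk (the paper uses a much deeper residual tower; the class name, layer sizes, and 17 input feature planes here are illustrative, not the paper's exact architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyValueNet(nn.Module):
    """Shared trunk with a policy head and a value head (sizes illustrative)."""
    def __init__(self, board_size=19, channels=64, in_planes=17):
        super().__init__()
        # Shared convolutional trunk (the paper uses a deep residual tower).
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(),
        )
        flat = channels * board_size * board_size
        # Policy head: a probability for every board point plus the pass move.
        self.policy_head = nn.Linear(flat, board_size * board_size + 1)
        # Value head: a single scalar estimating the current player's chance of winning.
        self.value_head = nn.Linear(flat, 1)

    def forward(self, x):
        h = self.trunk(x).flatten(1)
        log_p = F.log_softmax(self.policy_head(h), dim=1)   # move log-probabilities
        v = torch.tanh(self.value_head(h)).squeeze(1)       # value in [-1, 1]
        return log_p, v

# Example: one dummy position through the network.
log_p, v = PolicyValueNet()(torch.zeros(1, 17, 19, 19))
```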
What was the basic reinforcement learning approach used?
AlphaGo Zero used self-play reinforcement learning, where it continually played against itself and updated its neural network parameters to predict game outcomes and optimal moves more accurately.
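Concretely, the parameters are updated to minimise a combined loss: squared error between the predicted value and the actual game outcome, cross-entropy between the network's move probabilities and the MCTS search probabilities, plus L2 regularisation. A minimal sketch of that loss (function name and tensor layout are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def alphazero_loss(log_p, v, pi, z, model, c=1e-4):
    """Combined loss: (z - v)^2 - pi^T log p + c * ||theta||^2.

    log_p: (batch, moves) log move probabilities from the policy head
    v:     (batch,)       scalar value predictions
    pi:    (batch, moves) MCTS search probabilities (training targets)
    z:     (batch,)       game outcomes from the current player's view (+1 / -1)
    """
    value_loss = F.mse_loss(v, z)                    # (z - v)^2
    policy_loss = -(pi * log_p).sum(dim=1).mean()    # cross-entropy with search probs
    l2 = sum((w ** 2).sum() for w in model.parameters())
    return value_loss + policy_loss + c * l2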
How was Monte Carlo Tree Search (MCTS) implemented in AlphaGo Zero?
MCTS was used to explore the game tree, guided by the neural network’s policy and value predictions. Each simulation traversed the tree by selecting actions that balanced a high estimated value, a high prior probability, and a low visit count, expanded a new leaf node, and then used the neural network to evaluate that position, backing the value up along the path taken.
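The selection step picks the action maximising Q(s, a) + U(s, a), where the exploration bonus U is proportional to the policy prior and shrinks as an action is visited more. A minimal sketch of that rule (the Child class, field names, and c_puct value are illustrative, not the paper's code):

```python
import math
from dataclasses import dataclass

@dataclass
class Child:
    prior: float        # P(s, a) from the policy head
    visits: int = 0     # N(s, a)
    value: float = 0.0  # Q(s, a), mean of backed-up values

def select_action(children, c_puct=1.0):
    """Pick the action maximising Q(s, a) + U(s, a), PUCT-style.

    children: dict mapping move -> Child statistics at the current node.
    """
    total_visits = sum(c.visits for c in children.values())
    best_action, best_score = None, -float("inf")
    for action, c in children.items():
        # Exploration bonus: high for high-prior, rarely visited moves.
        u = c_puct * c.prior * math.sqrt(total_visits) / (1 + c.visits)
        score = c.value + u
        if score > best_score:
            best_action, best_score = action, score
    return best_action

# Example: three candidate moves with different priors and visit counts.
children = {"a": Child(prior=0.5, visits=10, value=0.1),
            "b": Child(prior=0.3, visits=2),
            "c": Child(prior=0.2)}
print(select_action(children))   # the unvisited, reasonably likely move "c"
```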
How did AlphaGo Zero’s training process work?
The system followed an iterative process (sketched in code below):
(1) Self-play using the current neural network to generate training data,
(2) Retraining the neural network on a random sample of data from recent games,
(3) Evaluating the new network against the previous best network.
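A compressed sketch of that loop; self_play_game, train, and evaluate are hypothetical stand-ins for the three stages, the buffer and batch sizes are illustrative, and the 55% promotion margin is the threshold reported in the paper:

```python
import random
from collections import deque

def training_loop(best_net, self_play_game, train, evaluate,
                  iterations=10, games_per_iter=100, win_threshold=0.55):
    """Skeleton of the iterated self-play pipeline.

    self_play_game(net) -> list of (state, search_probs, outcome) examples
    train(net, batch)   -> a newly trained candidate network
    evaluate(a, b)      -> fraction of head-to-head games won by network `a`
    """
    replay_buffer = deque(maxlen=500_000)   # positions from recent games only
    for _ in range(iterations):
        # (1) Self-play with the current best network to generate data.
        for _ in range(games_per_iter):
            replay_buffer.extend(self_play_game(best_net))
        # (2) Retrain on a random sample of recent positions.
        batch = random.sample(list(replay_buffer), min(2048, len(replay_buffer)))
        candidate = train(best_net, batch)
        # (3) Promote the candidate only if it beats the current best
        #     by the required margin.
        if evaluate(candidate, best_net) >= win_threshold:
            best_net = candidate
    return best_net
```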
How long did it take AlphaGo Zero to surpass human-level play?
AlphaGo Zero surpassed human-level play after just 3 days of training.
How did AlphaGo Zero compare to AlphaGo Lee (the version that defeated Lee Sedol)?
After just 3 days of training, AlphaGo Zero defeated AlphaGo Lee by 100 games to 0. (The larger 40-day version went on to defeat the stronger AlphaGo Master by 89 games to 11.)
How did the AlphaGo Zero paper improve neural network efficiency compared to previous versions?
AlphaGo Zero used a single neural network for both policy and value outputs, whereas previous versions used separate specialized networks.
What did the learning curve of AlphaGo Zero look like?
The learning curve showed rapid improvement in the first few days followed by more gradual improvement, with performance surpassing AlphaGo Lee after around 36 hours and reaching peak performance after 40 days.
What Go concepts did AlphaGo Zero discover through self-play?
AlphaGo Zero independently discovered fundamental Go concepts like fuseki (opening patterns), joseki (corner sequences), life and death, ko fights, and endgame play without human guidance.
How did AlphaGo Zero’s style of play differ from traditional human approaches?
AlphaGo Zero played many moves that would be considered unusual by human standards, including early 3-3 point invasions and shoulder hits. It also showed greater flexibility in sacrificing stones for strategic advantage.
What limitations of AlphaGo Zero were noted in the paper?
The paper noted that AlphaGo Zero was still limited to the specific domain of Go with perfect information, and further advances would be needed for imperfect-information games or real-world applications.
How did AlphaGo Zero’s approach differ from traditional reinforcement learning methods?
Unlike many reinforcement learning methods that use model-free approaches or separate planning processes, AlphaGo Zero integrated learning and planning through its neural network-guided MCTS, creating what the authors called a “learning and planning system.”
What is the “learning and planning” system?
1) The neural network “learns” from self-play data
2) MCTS uses the neural network to “plan” by exploring the game tree
3) The improved plans from MCTS are used to generate better training data
4) This better data improves the neural network’s learning
Effectively, the neural network guides the search, and the search results improve the neural network.
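The step that closes this loop is converting the MCTS visit counts into the search probabilities used as policy training targets: each move's probability is proportional to its visit count raised to the power 1/temperature. A minimal sketch (function name illustrative):

```python
def search_probabilities(visit_counts, temperature=1.0):
    """Convert MCTS visit counts into the training targets pi.

    visit_counts: dict mapping move -> N(s, a) after the search.
    temperature:  1.0 early in the game (more exploration), close to 0 later
                  (play the most-visited move almost deterministically).
    """
    powered = {m: n ** (1.0 / temperature) for m, n in visit_counts.items()}
    total = sum(powered.values())
    return {m: x / total for m, x in powered.items()}

# Example: the search visited "d4" far more than the alternatives, so it
# dominates both the training target and the move actually played.
print(search_probabilities({"d4": 120, "q16": 30, "c3": 10}))
```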