W6 Two-Agent RL Flashcards

1
Q

What are the differences between AlphaGo, AlphaGo Zero, and AlphaZero?

A

AlphaGo: the program that beat the human Go champion. It combines supervised learning from grandmaster games with reinforcement learning from self-play games.

AlphaGo Zero: performs tabula rasa learning of Go, based solely on self-play. It plays stronger than AlphaGo and learns faster, because self-play provides curriculum learning.
* the biggest problem that was overcome by AlphaGo Zero: instability
* the 2 architectural elements of AlphaGo Zero: a neural network with two heads, and an MCTS player

AlphaZero is a generalization of this program that also plays chess and shogi.

2
Q

More details about AlphaGo Zero

A

AlphaGo Zero uses a model-based actor-critic approach with a planner that improves a single value/policy network. For policy improvement it uses MCTS; for learning, a single deep residual network with a policy head and a value head (Sect. B.2.6), see Fig. 6.9.

MCTS improves the quality of the training examples in each iteration (left panel), and the net is trained with these better examples, improving its quality (right panel).
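A minimal Python sketch of this iteration, not the actual AlphaGo Zero code: the callables play_selfplay_game and train_network are hypothetical placeholders, where play_selfplay_game(net) is assumed to return (state, MCTS-improved policy, final outcome) examples.

```python
def selfplay_iteration(net, play_selfplay_game, train_network, replay_buffer, num_games=100):
    """One iteration of the self-play training loop (sketch)."""
    # Left panel: MCTS, guided by the current net, generates improved training examples.
    for _ in range(num_games):
        replay_buffer.extend(play_selfplay_game(net))
    # Right panel: the net is trained on these better examples (small learning rate).
    train_network(net, replay_buffer)
    return net
```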

3
Q

How was stability achieved in AlphaGo Zero?

A

Coverage of the state space: achieved by playing a large number of games and by MCTS look-ahead.

Correlation between subsequent training examples is reduced through an experience replay buffer.

Convergence of training is improved by using on-policy MCTS and by taking small training steps (a small learning rate).
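A minimal replay-buffer sketch in Python (an illustration of the idea, not the AlphaGo Zero implementation): storing many self-play positions and sampling minibatches at random breaks the correlation between consecutive training examples.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay buffer (sketch)."""

    def __init__(self, capacity=500_000):
        self.buffer = deque(maxlen=capacity)   # oldest examples drop out automatically

    def extend(self, examples):
        # examples: iterable of (state, policy_target, value_target) tuples
        self.buffer.extend(examples)

    def sample(self, batch_size=32):
        # random minibatch: decorrelates consecutive self-play positions
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```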

4
Q

What is MCTS? Why is MCTS more useful than minimax?

A

Monte Carlo Tree Search
MCTS is a variable-depth, adaptive, planning-based search algorithm. It does not need a heuristic evaluation function; instead it uses random playouts to estimate board strength.

MCTS has two advantages over minimax:
1. First, MCTS is based on averaging single lines of play instead of recursively traversing subtrees. The computational complexity of a path from the root to a leaf is polynomial in the search depth.
2. Second, MCTS does not need a heuristic evaluation function: it plays out a line of play from the root to an end position of the game.

5
Q

Is MCTS on-policy or off-policy?

A

MCTS is on-policy: the values that are backed up are those of the nodes that were selected.

6
Q

What are the four steps of MCTS?

A

select: traverse the tree from the root to a leaf, using UCT at each internal node
expand: add a new child (the reached leaf) to the tree
playout/rollout/sampling: play random moves until the end of the game (self-play)
backpropagation: propagate the reward back upwards in the tree, updating visit and win counts

7
Q

What does UCT do?

A

It computes the selection value of a child a, to decide which child to follow (or expand) next:
UCT(a) = wins_a / visits_a + C_p * sqrt(ln(visits_parent) / visits_a)

The first term (win rate) is for exploitation; the second term is for exploration.
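A direct Python translation of this formula (a sketch; the default C_p = 1.4, roughly sqrt(2), is just a commonly used value, not prescribed by the formula):

```python
import math

def uct(wins_a, visits_a, visits_parent, c_p=1.4):
    """UCT value of child a: exploitation (win rate) plus an exploration bonus."""
    if visits_a == 0:
        return float("inf")   # unvisited children are selected first
    exploitation = wins_a / visits_a
    exploration = c_p * math.sqrt(math.log(visits_parent) / visits_a)
    return exploitation + exploration
```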

8
Q

Give the UCT formula. How is P-UCT different?

A

DRL book Page 168

UCT computes the selection value of a child a, to decide which child to follow (or expand) next:
UCT(a) = wins_a / visits_a + C_p * sqrt(ln(visits_parent) / visits_a)

It balances win rate against newness: the second term is for exploration.

P-UCT (predictor-UCT) additionally uses input from the policy head of the deep network: besides win rate and newness, the policy prior pi(a|s) scales the exploration term:
P-UCT(a) = wins_a / visits_a + C_p * pi(a|s) * sqrt(visits_parent) / (1 + visits_a)
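A sketch of P-UCT in the same style; prior_a stands for the network's policy output pi(a|s). The division by (1 + visits_a) means the prior's influence fades as a child accumulates visits.

```python
import math

def p_uct(wins_a, visits_a, visits_parent, prior_a, c_p=1.0):
    """P-UCT value of child a: the policy prior scales the exploration term."""
    q = wins_a / visits_a if visits_a > 0 else 0.0                  # win rate (exploitation)
    u = c_p * prior_a * math.sqrt(visits_parent) / (1 + visits_a)   # prior-weighted exploration
    return q + u
```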

9
Q

What is zero-sum? How to implement it?

A

In two-agent zero-sum RL, my win is your loss and vice versa: the rewards of the two agents sum to zero, and the agents work against each other.

A popular way to implement this is to implement the environment's actions with self-play: we replace the environment by a copy of ourselves. In this way we play against an opponent that has all the knowledge that we currently have (both agents know the transition function/the rules), and the agents learn from each other.
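A minimal Python sketch of one such self-play episode, assuming a hypothetical game interface (initial_state, legal_moves, play, is_over, result) and a policy that maps (state, player) to a legal action:

```python
def selfplay_episode(game, policy):
    """One zero-sum game where the 'environment' is a copy of ourselves:
    both players pick their moves with the same policy."""
    state = game.initial_state()
    player = +1                              # +1 = first player, -1 = opponent
    while not game.is_over(state):
        action = policy(state, player)       # both sides use the same knowledge
        state = game.play(state, action)
        player = -player
    r = game.result(state)                   # from the first player's perspective
    return r, -r                             # zero-sum: the two rewards sum to zero
```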

10
Q

3 levels of self-play?

A
  1. Move-level: in the MCTS playouts our opponent is a copy of ourselves; we use the agent as its own opponent. Hence, self-play at the level of game moves.
  2. Example-level: the input for training the approximator of the policy and value functions is generated by our own games. Hence, self-play at the level of the value/policy network.
  3. Tournament-level: the self-play loop creates a training curriculum that starts tabula rasa and ends at world-champion level; the system trains as a player against itself. Hence, self-play of the third kind.

11
Q

Describe the function of each of the four operations of MCTS.

A
  1. Select: in the selection step the tree is traversed from the root node down until a leaf of the MCTS search tree is reached, a node where a new child that is not yet part of the tree can be selected. At each internal state the selection rule (typically UCT, which works well in many applications) is followed to determine which action to take and thus which state to go to next. The selections at these states are part of the policy π(s) of the search.
  2. Expand: in the expansion step a child is added to the tree. In most cases only one child is added; in some MCTS versions all successors of a leaf are added to the tree.
  3. Playout: during the playout step random moves are played, in a form of self-play, until the end of the game is reached. (These playout nodes are not added to the MCTS tree, but their search result is, in the backpropagation step.) The reward r of this simulated game is +1 in case of a win for the first player, 0 in case of a draw, and -1 in case of a win for the opponent.
  4. Backpropagation: in the backpropagation step the reward r is propagated back upwards in the tree, through the nodes that were traversed down previously. Two counts are updated: the visit count, for all nodes on the path, and the win count, depending on the reward value. Note that in a two-agent game, nodes in the MCTS tree alternate color: if white has won, then only the white-to-play nodes are incremented; if black has won, then only the black-to-play nodes are incremented.
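A compact Python sketch of these four operations, assuming a hypothetical game interface with legal_moves(state), play(state, action), is_over(state), and result(state) in {+1, 0, -1} from the first player's perspective. One bookkeeping choice differs from the wording above: here a node's win count is kept from the perspective of the player who made the move into it, so a parent can simply select the child with the highest win rate.

```python
import math
import random

class Node:
    """MCTS tree node. player_just_moved (+1/-1) is the player whose move led here."""
    def __init__(self, player_just_moved, parent=None):
        self.player_just_moved = player_just_moved
        self.parent = parent
        self.children = {}          # action -> Node
        self.visits = 0
        self.wins = 0.0

def mcts(root_state, game, n_simulations=1000, c_p=1.4):
    root = Node(player_just_moved=-1)          # the first player (+1) moves at the root
    for _ in range(n_simulations):
        node, state = root, root_state

        # 1. Select: follow the UCT rule down while the current node is fully expanded.
        while not game.is_over(state) and len(node.children) == len(game.legal_moves(state)):
            action, best = max(
                node.children.items(),
                key=lambda kv: kv[1].wins / kv[1].visits
                + c_p * math.sqrt(math.log(node.visits) / kv[1].visits))
            state = game.play(state, action)
            node = best

        # 2. Expand: add one new child to the tree.
        if not game.is_over(state):
            action = random.choice([a for a in game.legal_moves(state) if a not in node.children])
            state = game.play(state, action)
            child = Node(player_just_moved=-node.player_just_moved, parent=node)
            node.children[action] = child
            node = child

        # 3. Playout: play random moves until the end of the game (not added to the tree).
        while not game.is_over(state):
            state = game.play(state, random.choice(game.legal_moves(state)))
        r = game.result(state)                 # +1 first-player win, 0 draw, -1 opponent win

        # 4. Backpropagation: update visit counts along the path; credit wins to the
        #    nodes of the winning player (draws count as half a win).
        while node is not None:
            node.visits += 1
            node.wins += 0.5 if r == 0 else (1.0 if r == node.player_just_moved else 0.0)
            node = node.parent

    # After the simulations, play the most-visited child of the root.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```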
12
Q

How does UCT achieve trading off exploration and exploitation? Which inputs does it use?

A

By tuning Cp. The formula uses the number of wins in child a, the number of times child a has been visited, and the number of times the parent node has been visited.

13
Q

When 𝐶𝑝 is small, does MCTS explore more or exploit more?

A

Smaller Cp means more exploitation; larger Cp means more exploration. So with a small Cp, MCTS exploits more.

14
Q

What's the state space complexity of chess and Go?

A

10^47 for chess, 10^170 for Go

15
Q

For small numbers of node expansions, would you prefer more exploration or more exploitation?

A

With a small number of node expansions, prefer exploitation (a small Cp): the search budget is too small for exploration to pay off, so it is better to spend the few simulations on the most promising moves.
16
Q

What is a double-headed network? How is it different from regular actor critic?

A

One network with two output heads; for example, the residual network used in AlphaGo Zero has a value head and a policy head on top of a shared body.

A regular actor-critic can use two separate networks, one for the value function and one for the policy.
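A minimal two-headed network sketch, assuming PyTorch. AlphaGo Zero itself uses a deep convolutional residual network; this small MLP only illustrates the shared-body/two-heads idea.

```python
import torch
import torch.nn as nn

class TwoHeadedNet(nn.Module):
    """Sketch of a double-headed network: shared body, policy head, value head."""

    def __init__(self, n_inputs, n_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(                        # shared body
            nn.Linear(n_inputs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)   # move probabilities (as logits)
        self.value_head = nn.Linear(hidden, 1)            # expected outcome in [-1, 1]

    def forward(self, x):
        h = self.body(x)
        policy = torch.log_softmax(self.policy_head(h), dim=-1)
        value = torch.tanh(self.value_head(h))
        return policy, value
```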

17
Q

Which three elements make up the self-play loop? (You may draw a picture.)

A

policy/value, transition rules, opponent
(DRL Book Page 158)

In comparison with the model-based world of Chap. 5 (Fig. 6.8), our learned model has been replaced by perfect knowledge of the transition rules, and the environment is now called the opponent: a negated version of the same agent, playing against the agent.

18
Q

What is tabula rasa learning?

A

Learning that starts with zero knowledge (no human game examples or heuristics): only self-play and a single neural network.

19
Q

How can tabula rasa learning be faster than reinforcement learning on top of supervised learning of grandmaster games?

A

Self-play is faster because it creates a sequence of learning tasks that are ordered from easy to hard (a curriculum). Training on such an ordered sequence of small tasks is quicker than training on one large, unordered task.

20
Q

What is curriculum learning?

A

Curriculum learning starts the training process with easy concepts before the hard concepts are learned. In curriculum learning the examples are ordered in batches from easy to hard.
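A tiny Python sketch of this idea; the difficulty function is a user-supplied, hypothetical measure of how hard an example is.

```python
def curriculum_batches(examples, difficulty, batch_size=32):
    """Order examples from easy to hard and yield them in batches,
    so training sees the easy concepts before the hard ones."""
    ordered = sorted(examples, key=difficulty)
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]
```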

21
Q

Chap 6 Core Concepts & algorithms

A

self-play
curriculum learning
minimax, MCTS (Monte Carlo Tree Search), AlphaZero, tabula rasa learning