Week 6 Flashcards

1
Q

What class of algorithm is Q learning part of?

A

Temporal Difference (TD) Learning or TD(0) Learning
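
As a minimal illustration (not from the notes; alpha, gamma and the dictionary Q-table are assumed names), the tabular Q-learning update looks roughly like:

    # Tabular Q-learning update (illustrative sketch).
    # Q is a dict mapping (state, action) -> estimated value.
    def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
        best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
        target = r + gamma * best_next        # bootstrapped TD target
        # (for a terminal s_next the target would just be r)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
        return Q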

2
Q

What is the goal of temporal difference learning?

A

Predict the expected (possibly discounted) future reward:
- Given a particular policy for choosing subsequent actions.
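
A rough sketch of the TD(0) update this describes, for estimating V(s) under a fixed policy (alpha, gamma and the dict-based value table are illustrative assumptions):

    # TD(0) prediction update (sketch): V estimates the expected
    # (discounted) future reward from each state under a fixed policy.
    def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
        target = r + gamma * V.get(s_next, 0.0)   # one-step bootstrapped target
        V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
        return V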

3
Q

What is a policy in temporal difference learning?

A

The choice of action in a given state.

a_t = π(s_t)

The policy is conventionally denoted π in reinforcement learning; in game theory it is called a strategy.
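
As a tiny illustration (a hypothetical greedy policy over a Q-table, not the course's definition):

    # A policy maps a state to an action: a_t = pi(s_t).
    def greedy_policy(Q, s, actions):
        # choose the action with the highest estimated Q-value in state s
        return max(actions, key=lambda a: Q.get((s, a), 0.0))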

4
Q

Does Q-Learning have to get to the terminal node to get information?

A

Yes. In these games the reward arrives only at terminal states, so value information has to propagate back from the end of the game over repeated episodes.

5
Q

What is Deep Q-Learning?

A

Deep Q-Learning uses a neural network to approximate the Q-function.

Often a convolutional neural network.

Convolutional neural networks are very effective at processing large images
- Treat the input as an image (e.g. the board) rather than a collection of features

Used by DeepMind to create agents that can play video games, and later applied to learning to play Go.
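
A rough sketch of the idea (q_network is a placeholder function approximator, not a real API; this is a simplified version of the squared TD-error objective used in DQN-style training):

    # Deep Q-Learning sketch: q_network(state) returns one Q-value per action
    # and is trained to reduce the squared TD error below.
    def dqn_loss(q_network, s, a, r, s_next, gamma=0.99, terminal=False):
        target = r if terminal else r + gamma * max(q_network(s_next))
        # (in practice the target is computed with a frozen "target network")
        prediction = q_network(s)[a]
        return (prediction - target) ** 2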

6
Q

Convolutional Neural Networks

A

  • Learn local feature detectors
  • Weights are constrained so that the same feature is detected at different parts of the image
  • Higher-level layers build features of features
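
A minimal numpy illustration of the weight-sharing idea: one small kernel (a local feature detector) is slid over the whole image, so the same weights are applied at every position (the function name and shapes are just for illustration):

    import numpy as np

    def conv2d_valid(image, kernel):
        # Slide one shared kernel over the image: the same weights detect
        # the same local feature at every position ("valid" convolution).
        H, W = image.shape
        kH, kW = kernel.shape
        out = np.zeros((H - kH + 1, W - kW + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
        return out
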
7
Q

What assumption in statistics, machine learning, data science, etc. makes learning in games less effective?

A

That the data are independent and stationary.

But in games, subsequent data are not independent.

And in video games, subsequent data points are consecutive frames, which are highly correlated.

8
Q

What is the difference between TD-Learning and Monte Carlo learning styles?

A

In TD-Learning you learn first, then you play.

In Monte Carlo learning you learn while you play: you learn between moves (like minimax search).

9
Q

Monte Carlo Board Evaluation

A

A simple way to generate a heuristic function; the random games are often called “playouts”.

For position v, do the following N times:
- Play the game to the end, with all players choosing random moves.
Then return the average payoff over the N playouts.

This estimates the strength of the board position by the average payoff of the terminal positions reachable from it.

The downside is the time cost of performing the N playouts per board position. Although your opponent will not actually play randomly, the estimate can still be very useful.
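
A sketch of the playout evaluation, assuming a hypothetical game interface (copy, is_terminal, legal_moves, play and payoff are illustrative names, not a real library):

    import random

    def monte_carlo_value(position, n_playouts=100):
        # Estimate the value of a position by averaging N random playouts.
        total = 0.0
        for _ in range(n_playouts):
            state = position.copy()
            while not state.is_terminal():
                state.play(random.choice(state.legal_moves()))  # all players move randomly
            total += state.payoff()       # payoff at the terminal position
        return total / n_playouts         # average payoff = heuristic value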

10
Q

What is MCTS?

A

Monte Carlo Tree Search

11
Q

Elements of MCTS

A

The game tree is built incrementally but asymmetrically.

For each iteration, a tree policy is used to find the most urgent node to expand, using a strategy which balances exploration and exploitation.

A simulation is run from the selected node using a default policy (often a random choice of action).

The search tree is updated according to the result. This updates the value of the selected node and all of its ancestors.

When the search ends, return the best child of the current root node.

12
Q

What are the four phases of MCTS?

A

Selection
- Starting at the root, a child selection policy is recursively applied to traverse the game tree, looking for the most urgent expandable node. A node is expandable if it is non-terminal and has unvisited children.

Expansion
- Increase the size of the current partial tree (typically by one node) by expanding nodes (e.g. visit an unvisited child).

Play-out (simulation)
- Run a simulation from the added node by random self-play.

Back-up
- The statistics of the added node(s) and its ancestors are updated.
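
A compact sketch of one iteration built from these four phases (the Node fields and the game interface are assumptions; the UCT selection rule is sketched on the cards below):

    import random

    def mcts_iteration(root):
        # 1. Selection: descend with the tree policy to an expandable node.
        node = root
        while not node.state.is_terminal() and node.is_fully_expanded():
            node = node.best_uct_child()
        # 2. Expansion: add one unvisited child (if the node is not terminal).
        if not node.state.is_terminal():
            node = node.expand()
        # 3. Play-out: simulate to the end of the game with random moves.
        state = node.state.copy()
        while not state.is_terminal():
            state.play(random.choice(state.legal_moves()))
        payoff = state.payoff()
        # 4. Back-up: update the statistics of the new node and its ancestors.
        while node is not None:
            node.N += 1
            node.Q += payoff
            node = node.parent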

13
Q

What is UCT?

A

Upper Confidence Trees (UCT) is a preferred selection method
- An exploration strategy, like epsilon-greedy
- Always chooses unvisited children first
- Between two children of the same parent with the same average value, it chooses the child selected less often.
- In the long run, children with lower true values will be chosen less often; the best child will be chosen exponentially more often.

14
Q

In UCT what does each node store?

A

Two quantities:
Q(v) – the sum of the payoffs received over all playouts which passed through this node

N(v) – the number of times node v has been visited

So Q(v) / N(v) is an estimate of the value of the node v.
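
One common form of the UCT score built from these two quantities (the constant C and the exact square-root term vary between presentations; this follows the usual UCB1-style formula):

    import math

    def uct_score(child, parent, C=1.4):
        if child.N == 0:
            return float("inf")           # unvisited children are chosen first
        exploit = child.Q / child.N       # average payoff through this child
        explore = C * math.sqrt(math.log(parent.N) / child.N)
        return exploit + explore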

15
Q

How does the explore term change in UCB?

A

The explore term gets bigger when the parent has been explored but the child has not;

it gets smaller as the child is explored more.

C determines the exploration-exploitation trade-off (tuned experimentally).

16
Q

Possibilities when ending the search and making a move in MCTS

A

Max child (default): Choose the child with the highest average reward (highest Q/N)

Robust child: Choose the child most often visited (highest N)

Max-Robust child: Choose the child which is maximal in both reward and visits. If none exists, run longer until one does exist.

Secure child: Choose the child which maximises a lower confidence bound, such as the UCB value but with a minus sign instead of a plus sign.
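
Sketches of the first two options, assuming each child node stores Q and N as on the previous cards:

    def max_child(root):
        # highest average reward Q/N
        return max(root.children,
                   key=lambda c: c.Q / c.N if c.N > 0 else float("-inf"))

    def robust_child(root):
        # most often visited
        return max(root.children, key=lambda c: c.N)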

17
Q

Main differences between TD-Learning and MCTS

A

TD-Learning
- Needs to learn in self-play before being used as a game-playing agent.
- Learns a board value function or Q-table

Monte Carlo Tree Search
- Learns while playing the game
- Uses a form of exploration-exploitation strategy (UCT) to select nodes to evaluate.
- Uses random play-outs to evaluate them

18
Q

MCTS summary

A

Actions from the current board position are evaluated by:
- building a partial subtree from the current position
Selection: to select the part of the tree to expand
Expansion: to expand from the selected node

Simulation: simulate the game from that point (e.g. random roll-outs) to reach terminal nodes and estimate the values from the current position

A method for choosing the best action

19
Q

How can you vary MCTS with a heuristic?

A

Could use heuristics for selection
- Select for expansion the node with the highest value of the heuristic

Could use heuristics for roll-outs
- During roll-outs, use the heuristic to select the action for each player
- But the heuristic must be quick to calculate
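
A sketch of a heuristic roll-out, reusing the hypothetical game interface from the earlier playout sketch (heuristic and after are assumed, illustrative names):

    def heuristic_rollout(state, heuristic):
        # Roll out the game, each player picking the move whose resulting
        # position looks best under a cheap heuristic (assumed to score the
        # position from the point of view of the player who just moved).
        state = state.copy()
        while not state.is_terminal():
            move = max(state.legal_moves(),
                       key=lambda m: heuristic(state.after(m)))
            state.play(move)
        return state.payoff()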

20
Q

Do MCTS algorithms rely on repeated random sampling?

A

Yes

They use randomness to find answers which might be deterministic in principle.