Week 5 (N/A) - Machine Learning in Games Flashcards

1
Q

What is “Book Learning”?

A
  • Learning a sequence of moves for important positions
    e.g. #1 - the opening moves in Chess (see the sketch below)
    e.g. #2 - learn from mistakes: identify moves that led to a loss and whether there was a better alternative
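
A minimal sketch of the idea in Python, assuming a simple lookup table; the position keys and moves below are placeholders rather than real analysed lines:

# Opening book: map a known position to a pre-compiled/learned "best" move,
# consulted before any search is run.
OPENING_BOOK = {
    "start":      "e2e4",
    "start e2e4": "e7e5",
    "start d2d4": "d7d5",
}

def choose_move(position_key, search_fn):
    # Play from the book when the position is known, otherwise fall back to search.
    if position_key in OPENING_BOOK:
        return OPENING_BOOK[position_key]
    return search_fn(position_key)
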
2
Q

How do you recognise which moves are important?

A

Page 4 Lecture 5

3
Q

What is “Search Control Learning”?

A
  • Learn how to make search more efficient
  • ADJUSTING SEARCH PARAMETERS

e.g. #1 - the order of move generation affects alpha-beta pruning
- Learning a preferred order for generating possible moves lets more subtrees be pruned (see the sketch below)

e.g. #2 - vary the cut-off depth
- Learn a classifier to predict what depth the algorithm should search to, depending on the current game state
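
A sketch of how a learned move ordering might be plugged into alpha-beta search (not taken from the lecture). The state interface (legal_moves, apply, is_terminal, evaluate) and the learned ordering_score heuristic are assumptions made purely for illustration:

def alpha_beta(state, depth, alpha, beta, maximising, ordering_score):
    # Plain alpha-beta search; ordering_score(state, move) is a learned heuristic
    # used only to decide which moves to try first.
    if depth == 0 or state.is_terminal():
        return state.evaluate()

    # Search control learning: trying the moves the learned model rates highest first
    # tightens alpha/beta sooner, so more of the remaining subtrees get pruned.
    moves = sorted(state.legal_moves(),
                   key=lambda m: ordering_score(state, m),
                   reverse=maximising)

    if maximising:
        value = float("-inf")
        for move in moves:
            value = max(value, alpha_beta(state.apply(move), depth - 1,
                                          alpha, beta, False, ordering_score))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # cut-off: the remaining moves need not be searched
        return value
    else:
        value = float("inf")
        for move in moves:
            value = min(value, alpha_beta(state.apply(move), depth - 1,
                                          alpha, beta, True, ordering_score))
            beta = min(beta, value)
            if alpha >= beta:
                break
        return value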

4
Q

How do you deal with the challenge of adjusting search parameters in a game-playing agent?

A

Search control learning

5
Q

How do you handle each stage of the game separately in game playing?

A

Compile a “book” of openings or endgames - Book Learning

6
Q

What is “Learning evaluation function weights”?

A
  • Adjust weights in evaluation function based on experience of their ability to predict the true final utility

Eval(s) = W1*F1(s) + W2*F2(s) + … + Wn*Fn(s)
= Σi Wi*Fi(s)
= W · F(s)
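
A direct translation of the formula above into Python, as a sketch; the two feature functions (material balance, mobility) are invented placeholders for whatever features the evaluation actually uses:

def features(state):
    # f(s): placeholder feature functions, assumed to exist on the state object
    return [state.material_balance(), state.mobility()]

def eval_state(state, weights):
    # Eval(s) = w · f(s), i.e. the weighted sum of the features
    f = features(state)
    return sum(w_i * f_i for w_i, f_i in zip(weights, f))
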

7
Q

What is Gradient Descent Learning?

[Follow-up]

A

A supervised learning method to train our agent to be able to predict the true minimax utility value of a state.

It achieves this by learning the weights of an evaluation function.
Training set = a set of feature vectors, one per state
- Each feature vector is labelled with the true minimax utility value of that state as the target

The iterative step makes use of the weight update rule:

Repeatedly update the weights on each training example until the weights converge (the data format is sketched below)
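
A sketch of what the training data looks like under this formulation; the numbers are made up purely to show the structure:

# Each training example pairs the feature vector f(s) of a state with the
# true minimax utility t of that state (the target label).
training_set = [
    ([2.0, 0.5, 1.0], +1.0),
    ([0.0, 1.5, 3.0], -1.0),
    ([1.0, 1.0, 0.0],  0.0),
]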

8
Q

How do you quantify how well the predicted output matches the target output in Gradient Descent Learning?

A

Define an error function E
- E is (half) the sum, over all training examples, of the squared difference between the target output (t) and the predicted output (z)

E = 1/2 * Σ (t - z)^2
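
Worked on a few made-up (t, z) pairs:

# E = 1/2 * sum of (t - z)^2 over the training examples
pairs = [(1.0, 0.6), (-1.0, -0.2), (0.0, 0.3)]  # (target t, prediction z)
E = 0.5 * sum((t - z) ** 2 for t, z in pairs)
print(E)  # 0.5 * (0.16 + 0.64 + 0.09) = 0.445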

9
Q

Why does the Gradient Descent Learning algorithm move in the steepest downhill direction?

A

The aim is to find a set of weight values for which E is very low - to minimise the error between the predicted output and the target output.

The smaller the error, the better our agent/classifier is.

Picturing the error E as a height, it defines an error landscape over the weight space.

E = 1/2 * Σ (t - z)^2

t = target output (true utility)
z = predicted output (Eval(s))

For a linear evaluation function this landscape is a convex bowl with the minimum error at the bottom. The gradient of E points in the direction of steepest increase, so stepping in the opposite (negative gradient) direction reduces E as quickly as possible - i.e. the algorithm moves in the steepest downhill direction towards the minimum.

Page 9 Lecture 5

10
Q

Explain the application of the weight update rule and derive it.

A

Used for Gradient Descent Learning.

Chain rule: if y = y(u) and u = u(x), then
dy/dx = (dy/du)(du/dx)

So if z = Eval(s; w) is the prediction and t is the true utility of state s, then for each weight wi:
dE/dwi = d/dz [ 1/2 (t - z)^2 ] * dz/dwi
= (z - t) * dz/dwi
= (z - t) * fi(s)      (since z = Σ wi * fi(s))

Stepping against the gradient gives the weight update rule:
wi ← wi + η (t - z) fi(s),   where η is the learning rate

Repeatedly update the weights based on each learning example until the weights converge
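
A sketch of the derived rule in code, assuming a linear Eval(s; w) = w · f(s); the learning rate value and the function names are placeholders:

def update_weights(weights, f, t, eta=0.01):
    # One gradient-descent step: z = w · f(s) is the prediction, t the true utility.
    z = sum(w_i * f_i for w_i, f_i in zip(weights, f))
    # dE/dwi = (z - t) * fi(s), so step downhill: wi <- wi + eta * (t - z) * fi(s)
    return [w_i + eta * (t - z) * f_i for w_i, f_i in zip(weights, f)]

def train(weights, training_set, epochs=100):
    # Repeatedly apply the rule to every example until the weights (roughly) converge.
    for _ in range(epochs):
        for f, t in training_set:
            weights = update_weights(weights, f, t)
    return weights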

11
Q

What are the problems with Gradient Descent Learning?

A
  • Delayed reinforcement
    • Reward resulting from an action may not be received until several time steps later, which also slows down the learning
  • Credit assignment
    • Need to know which action(s) was responsible for the outcome
12
Q

What is Temporal Difference Learning and how is it different from Supervised learning?

A

Temporal Difference (TD) learning is a form of REINFORCEMENT LEARNING

Supervised learning is for single step prediction
- predict Saturday’s weather based on Friday’s weather

TD learning is for multi-step prediction

  • predict Saturday’s weather based on Monday’s weather, then update the prediction based on Tuesday’s, Wednesday’s, …, etc.
  • predict the outcome of a game based on the first move, then update the prediction as more moves are made (see the sketch below)
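
A generic TD-style sketch of the weather example (illustrative only, not the lecture's exact formulation); the step size alpha and the daily estimates are made up. Each new day's estimate pulls the running prediction for Saturday toward it, instead of waiting until Saturday to learn anything:

def td_update(prediction, newer_prediction, alpha=0.5):
    # Move part of the way toward the more recent (presumably better) estimate.
    return prediction + alpha * (newer_prediction - prediction)

daily_estimates = [0.2, 0.4, 0.7, 0.6, 0.9]  # Mon..Fri estimates of rain on Saturday
prediction = daily_estimates[0]
for estimate in daily_estimates[1:]:
    prediction = td_update(prediction, estimate)
print(prediction)  # shifts toward the most recent estimates as the week goes on
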
13
Q

What are the problems with Temporal Difference Learning?

A
  • Correctness of prediction is not known until several steps later
14
Q

What is the TDLeaf algorithm?

A

Combines temporal difference learning with minimax search

Basic idea is to:

  • Update the weights in the evaluation function to reduce the differences between rewards predicted at different levels of the search tree (a rough sketch follows below)
  • A good evaluation function should be stable from one move to the next
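
A rough sketch of a TDLeaf(lambda) update over one finished game, assuming eval_fn(s, w) is the evaluation applied to the minimax leaf (principal variation) state reached from each position, eval_grad gives its gradient with respect to the weights, and d[m] is the temporal difference between successive leaf evaluations. This illustrates the idea only; it is not necessarily the exact form used in the lecture:

def tdleaf_update(weights, leaf_states, eval_fn, eval_grad, eta=0.01, lam=0.7):
    # leaf_states: the minimax leaf states reached from the positions s_1 ... s_N
    n = len(leaf_states)
    values = [eval_fn(s, weights) for s in leaf_states]
    # Temporal differences d_m between successive leaf evaluations
    d = [values[m + 1] - values[m] for m in range(n - 1)]

    new_weights = list(weights)
    for t in range(n - 1):
        grad = eval_grad(leaf_states[t], weights)  # gradient of eval at the t-th leaf
        # lambda-weighted sum of the later temporal differences
        td_sum = sum((lam ** (m - t)) * d[m] for m in range(t, n - 1))
        new_weights = [w_j + eta * g_j * td_sum
                       for w_j, g_j in zip(new_weights, grad)]
    return new_weights
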
15
Q

What are the different ways to train a classifier/agent?

A

Learning from labelled examples

Learning by playing against a skilled opponent

Learning by playing against random moves

Learning by playing against yourself

16
Q

Explain differences between supervised and temporal difference learning

A

supervised = single step prediction
- predict Saturday’s weather based on Friday’s weather

temporal difference = multi-step prediction

  • predict Saturday’s weather based on Monday’s weather, then update the prediction based on Tuesday’s, Wednesday’s, …, etc.
  • predict outcome of game based on first move, then update prediction as more moves are made
17
Q

Explain the terms in TDLeaf

A

Pages 14-15, Lecture 5

DON’T HAVE TO DERIVE OR MEMORISE TDLEAF

JUST BE ABLE TO EXPLAIN WHAT THE TERMS MEAN

18
Q

Why do we use the temporal difference d_i between successive states in this rule?

A

Because we want to minimise the change in eval for successive states

Because we want the eval function to stay relatively stable between states

Because it should be a good predictor of future states

The temporal difference accounts for the differences between successive states and allows a more stable set of weights.

19
Q

What is the role of the learning rate parameter η (eta) in TDLeaf?

A

Controls the size of each update to the weights wi, so that learning proceeds in a STABLE MANNER - a small learning rate gives small, stable steps; a large one learns faster but can overshoot and become unstable.

20
Q

Under what conditions should we use lambda = 0 and why?

A

Use lambda = 0 when the weights are adjusted to move towards the predicted reward at the next state (BETTER IF THE EVAL FUNCTION IS ALREADY A GOOD PREDICTOR)

TLDR - looks only to the adjacent (next) state

21
Q

Give an example of reinforcement learning

A

Temporal Difference Learning

22
Q

Under what conditions should we use lambda = 1 and why?

A

Use lambda = 1 when the weights are adjusted to move towards the final true reward (BETTER IF THE EVAL FUNCTION IS UNREALISTIC)

TLDR - looks to the final reward