Week 5 (N/A) - Machine Learning in Games Flashcards

1
Q

What is “Book Learning”?

A
  • Learning a sequence of moves for important positions
    e.g. #1 - the opening moves in Chess (see the sketch below)
    e.g. #2 - learn from mistakes: identify moves that led to a loss and whether there was a better alternative
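
A minimal sketch of the idea in Python, assuming a simple lookup table; the position keys and moves below are placeholders rather than real analysed lines:

# Opening book: map a known position to a pre-compiled/learned "best" move,
# consulted before any search is run.
OPENING_BOOK = {
    "start":      "e2e4",
    "start e2e4": "e7e5",
    "start d2d4": "d7d5",
}

def choose_move(position_key, search_fn):
    # Play from the book when the position is known, otherwise fall back to search.
    if position_key in OPENING_BOOK:
        return OPENING_BOOK[position_key]
    return search_fn(position_key)
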
2
Q

How do you recognise which moves are important?

A

Page 4 Lecture 5

3
Q

What is “Search Control Learning”?

A
  • Learn how to make search more efficient
  • ADJUSTING SEARCH PARAMETERS

e.g. #1 - the order of move generation affects alpha-beta pruning
- Learning a preferred order for generating possible moves lets more subtrees be pruned (see the sketch below)

e.g. #2 - vary the cut-off depth
- Learn a classifier to predict what depth the algorithm should search to, depending on the current game state
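
A sketch of how a learned move ordering might be plugged into alpha-beta search (not taken from the lecture). The state interface (legal_moves, apply, is_terminal, evaluate) and the learned ordering_score heuristic are assumptions made purely for illustration:

def alpha_beta(state, depth, alpha, beta, maximising, ordering_score):
    # Plain alpha-beta search; ordering_score(state, move) is a learned heuristic
    # used only to decide which moves to try first.
    if depth == 0 or state.is_terminal():
        return state.evaluate()

    # Search control learning: trying the moves the learned model rates highest first
    # tightens alpha/beta sooner, so more of the remaining subtrees get pruned.
    moves = sorted(state.legal_moves(),
                   key=lambda m: ordering_score(state, m),
                   reverse=maximising)

    if maximising:
        value = float("-inf")
        for move in moves:
            value = max(value, alpha_beta(state.apply(move), depth - 1,
                                          alpha, beta, False, ordering_score))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # cut-off: the remaining moves need not be searched
        return value
    else:
        value = float("inf")
        for move in moves:
            value = min(value, alpha_beta(state.apply(move), depth - 1,
                                          alpha, beta, True, ordering_score))
            beta = min(beta, value)
            if alpha >= beta:
                break
        return value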

4
Q

How do you deal with the challenge of adjusting search parameters in a game-playing agent?

A

Search control learning

5
Q

How do you handle each stage of the game separately in game playing?

A

Compile a “book” of openings or endgames - Book Learning

6
Q

What is “Learning evaluation function weights”?

A
  • Adjust weights in evaluation function based on experience of their ability to predict the true final utility

Eval(s) = W1*F1(s) + W2*F2(s) + … + Wn*Fn(s)
= Σi Wi*Fi(s)
= W · F(s)
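
A direct translation of the formula above into Python, as a sketch; the two feature functions (material balance, mobility) are invented placeholders for whatever features the evaluation actually uses:

def features(state):
    # f(s): placeholder feature functions, assumed to exist on the state object
    return [state.material_balance(), state.mobility()]

def eval_state(state, weights):
    # Eval(s) = w · f(s), i.e. the weighted sum of the features
    f = features(state)
    return sum(w_i * f_i for w_i, f_i in zip(weights, f))
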

7
Q

What is Gradient Descent Learning?

[Follow-up]

A

A supervised learning method to train our agent to be able to predict the true minimax utility value of a state.

It achieves this by learning the weights of an evaluation function.
Training set = a set of feature vectors, one per state
- Each feature vector is labelled with the true minimax utility value of that state as the target

The iterative step makes use of the weight update rule:

Repeatedly update the weights on each training example until the weights converge (the data format is sketched below)
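
A sketch of what the training data looks like under this formulation; the numbers are made up purely to show the structure:

# Each training example pairs the feature vector f(s) of a state with the
# true minimax utility t of that state (the target label).
training_set = [
    ([2.0, 0.5, 1.0], +1.0),
    ([0.0, 1.5, 3.0], -1.0),
    ([1.0, 1.0, 0.0],  0.0),
]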

8
Q

How do you quantify how well the predicted output matches the target output in Gradient Descent Learning?

A

Define an error function E
- E is (half) the sum, over all training examples, of the squared difference between the target output (t) and the predicted output (z)

E = 1/2 * Σ (t - z)^2
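
Worked on a few made-up (t, z) pairs:

# E = 1/2 * sum of (t - z)^2 over the training examples
pairs = [(1.0, 0.6), (-1.0, -0.2), (0.0, 0.3)]  # (target t, prediction z)
E = 0.5 * sum((t - z) ** 2 for t, z in pairs)
print(E)  # 0.5 * (0.16 + 0.64 + 0.09) = 0.445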

9
Q

Why does the Gradient Descent Learning algorithm move in the steepest downhill direction?

A

The aim is to find a set of weight values for which E is very low - to minimise the error between the predicted output and the target output.

The smaller the error, the better our agent/classifier is.

Picturing the error E as a height, it defines an error landscape over the weight space.

E = 1/2 * Σ (t - z)^2

t = target output (true utility)
z = predicted output (Eval(s))

For a linear evaluation function this landscape is a convex bowl with the minimum error at the bottom. The gradient of E points in the direction of steepest increase, so stepping in the opposite (negative gradient) direction reduces E as quickly as possible - i.e. the algorithm moves in the steepest downhill direction towards the minimum.

Page 9 Lecture 5

10
Q

Explain the application of the weight update rule and derive it.

A

Used for Gradient Descent Learning.

Chain rule: if y = y(u) and u = u(x), then
dy/dx = (dy/du)(du/dx)

So if z = Eval(s; w) is the prediction and t is the true utility of state s, then for each weight wi:
dE/dwi = d/dz [ 1/2 (t - z)^2 ] * dz/dwi
= (z - t) * dz/dwi
= (z - t) * fi(s)      (since z = Σ wi * fi(s))

Stepping against the gradient gives the weight update rule:
wi ← wi + η (t - z) fi(s),   where η is the learning rate

Repeatedly update the weights based on each learning example until the weights converge
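
A sketch of the derived rule in code, assuming a linear Eval(s; w) = w · f(s); the learning rate value and the function names are placeholders:

def update_weights(weights, f, t, eta=0.01):
    # One gradient-descent step: z = w · f(s) is the prediction, t the true utility.
    z = sum(w_i * f_i for w_i, f_i in zip(weights, f))
    # dE/dwi = (z - t) * fi(s), so step downhill: wi <- wi + eta * (t - z) * fi(s)
    return [w_i + eta * (t - z) * f_i for w_i, f_i in zip(weights, f)]

def train(weights, training_set, epochs=100):
    # Repeatedly apply the rule to every example until the weights (roughly) converge.
    for _ in range(epochs):
        for f, t in training_set:
            weights = update_weights(weights, f, t)
    return weights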

11
Q

What are the problems with Gradient Descent Learning?

A
  • Delayed reinforcement
    • Reward resulting from an action may not be received until several time steps later, which also slows down the learning
  • Credit assignment
    • Need to know which action(s) was responsible for the outcome
12
Q

What is Temporal Difference Learning and how is it different from Supervised learning?

A

Temporal Difference (TD) learning is a form of REINFORCEMENT LEARNING

Supervised learning is for single step prediction
- predict Saturday’s weather based on Friday’s weather

TD learning is for multi-step prediction

  • predict Saturday’s weather based on Monday’s weather, then update the prediction based on Tuesday’s, Wednesday’s, …, etc.
  • predict the outcome of a game based on the first move, then update the prediction as more moves are made (see the sketch below)
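
A generic TD-style sketch of the weather example (illustrative only, not the lecture's exact formulation); the step size alpha and the daily estimates are made up. Each new day's estimate pulls the running prediction for Saturday toward it, instead of waiting until Saturday to learn anything:

def td_update(prediction, newer_prediction, alpha=0.5):
    # Move part of the way toward the more recent (presumably better) estimate.
    return prediction + alpha * (newer_prediction - prediction)

daily_estimates = [0.2, 0.4, 0.7, 0.6, 0.9]  # Mon..Fri estimates of rain on Saturday
prediction = daily_estimates[0]
for estimate in daily_estimates[1:]:
    prediction = td_update(prediction, estimate)
print(prediction)  # shifts toward the most recent estimates as the week goes on
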
13
Q

What are the problems with Temporal Difference Learning?

A
  • Correctness of prediction is not known until several steps later
14
Q

What is the TDLeaf algorithm?

A

Combines temporal difference learning with minimax search

Basic idea is to:

  • Update the weights in the evaluation function to reduce the differences between rewards predicted at different levels of the search tree (a rough sketch follows below)
  • A good evaluation function should be stable from one move to the next
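
A rough sketch of a TDLeaf(lambda) update over one finished game, assuming eval_fn(s, w) is the evaluation applied to the minimax leaf (principal variation) state reached from each position, eval_grad gives its gradient with respect to the weights, and d[m] is the temporal difference between successive leaf evaluations. This illustrates the idea only; it is not necessarily the exact form used in the lecture:

def tdleaf_update(weights, leaf_states, eval_fn, eval_grad, eta=0.01, lam=0.7):
    # leaf_states: the minimax leaf states reached from the positions s_1 ... s_N
    n = len(leaf_states)
    values = [eval_fn(s, weights) for s in leaf_states]
    # Temporal differences d_m between successive leaf evaluations
    d = [values[m + 1] - values[m] for m in range(n - 1)]

    new_weights = list(weights)
    for t in range(n - 1):
        grad = eval_grad(leaf_states[t], weights)  # gradient of eval at the t-th leaf
        # lambda-weighted sum of the later temporal differences
        td_sum = sum((lam ** (m - t)) * d[m] for m in range(t, n - 1))
        new_weights = [w_j + eta * g_j * td_sum
                       for w_j, g_j in zip(new_weights, grad)]
    return new_weights
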
15
Q

What are the different ways to train a classifier/agent?

A

Learning from labelled examples

Learning by playing against a skilled opponent

Learning by playing against random moves

Learning by playing against yourself

16
Q

Explain differences between supervised and temporal difference learning

A

supervised = single step prediction
- predict Saturday’s weather based on Friday’s weather

temporal difference = multi-step prediction

  • predict Saturday’s weather based on Monday’s weather, then update the prediction based on Tuesday’s, Wednesday’s, …, etc.
  • predict outcome of game based on first move, then update prediction as more moves are made
17
Q

Explain the terms in TDLeaf

A

Pages 14-15, Lecture 5

DON’T HAVE TO DERIVE OR MEMORISE TDLEAF

JUST BE ABLE TO EXPLAIN WHAT THE TERMS MEAN

18
Q

Why do we use the temporal difference d_i between successive states in this rule?

A

Because we want to minimise the change in eval for successive states

Because we want the eval function to stay relatively stable between states

Because it should be a good predictor of future states

The temporal difference accounts for the differences between successive states and allows a more stable set of weights.

19
Q

What is the role of the learning rate parameter η (eta) in TDLeaf?

A

Controls the size of each update to the weights wi, so that learning proceeds in a STABLE MANNER - a small learning rate gives small, stable steps; a large one learns faster but can overshoot and become unstable.

20
Q

Under what conditions should we use lambda = 0 and why?

A

Use lambda = 0 when the weights are adjusted to move towards the predicted reward at the next state (BETTER IF THE EVAL FUNCTION IS ALREADY A GOOD PREDICTOR)

TLDR - looks only to the adjacent (next) state

21
Q

Give an example of reinforcement learning

A

Temporal Difference Learning

22
Q

Under what conditions should we use lambda = 1 and why?

A

Use lambda = 1 when the weights are adjusted to move towards the final true reward (BETTER IF THE EVAL FUNCTION IS UNREALISTIC)

TLDR - looks to the final reward