Final Review pt. 2 Flashcards

1
Q

In Sutton 88, why are all the curves in figure 4 convex?

A

The convexity indicates that, for a given lambda, the error is minimized at some intermediate alpha. With a low alpha the weights move too slowly to learn much from the limited data, so the estimates stay biased toward their initial values; with a high alpha the updates overshoot and track the noise in the individual sequences, which leads to high variance.
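
A minimal sketch of this U-shape, assuming the standard 5-state bounded random walk with training sets of 10 sequences; the state encoding, seed, and number of training sets below are my own choices rather than the paper's exact setup.

    import numpy as np

    rng = np.random.default_rng(0)
    TRUE = np.array([1/6, 2/6, 3/6, 4/6, 5/6])   # true predictions for the 5 non-terminal states

    def generate_sequence():
        """One bounded random walk from the center state; returns the visited
        non-terminal state indices and the terminal outcome (0 or 1)."""
        s, states = 2, [2]
        while True:
            s += rng.choice([-1, 1])
            if s < 0:
                return states, 0.0
            if s > 4:
                return states, 1.0
            states.append(s)

    def avg_rmse(alpha, lam, n_sets=100, n_seqs=10):
        """Average RMS error over n_sets training sets of n_seqs sequences each."""
        errors = []
        for _ in range(n_sets):
            w = np.full(5, 0.5)                  # initial predictions
            for _ in range(n_seqs):
                states, z = generate_sequence()
                e = np.zeros(5)                  # eligibility trace (gamma = 1)
                for t, s in enumerate(states):
                    e = lam * e
                    e[s] += 1.0
                    p_next = z if t == len(states) - 1 else w[states[t + 1]]
                    w += alpha * (p_next - w[s]) * e
            errors.append(np.sqrt(np.mean((w - TRUE) ** 2)))
        return np.mean(errors)

    for alpha in (0.05, 0.15, 0.3, 0.45, 0.6):   # error should be U-shaped in alpha
        print(f"lambda=0.3  alpha={alpha:.2f}  RMSE={avg_rmse(alpha, 0.3):.3f}")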

2
Q

In Sutton 88, experiment 2, what kind of figure 4 would you expect if you presented infinitely many sequences instead of 10? Why?

A

The best alpha and the best error value for each lambda would both decrease (except for alpha = 0, which always stays the same). I think the reason is that a smaller alpha is needed to keep training stable when learning from a much larger amount of data. The best lambda should still be roughly the same (not entirely sure; my intuition is that the best lambda depends on the size of the MDP).

3
Q

In Sutton 88, if there were far more data than 100 sequences for each training set in experiment 1, what kind of graph would you expect for figure 3? Why?

A

The shape of the curve should be similar, but the error values should all decrease, because the impact of random noise goes down as the number of sequences increases.

4
Q

In Sutton 88, why doesn’t TD(0) do well in figure 5?

A

I believe it is because TD(0) is slow to propagate information. After experiencing a single episode, it changes only the value of the state visited just before termination, whereas TD(lambda) with lambda > 0 updates states “further back” along the sequence, to a degree controlled by lambda (see the sketch after this answer).

Also, note that it is TD(1), not TD(0), that is equivalent to MC. I think of lambda as a parameter describing how much we value n-step look-aheads. Basically, TD methods are trying to balance variance (MC) with bias (the 1-step look-ahead, i.e., TD(0)). If lambda = 1, we take every n-step look-ahead into consideration, which amounts to MC (this also falls out of the definition of the lambda-return). I sometimes confuse this with gamma, which tells us how interested we are in future REWARDS. I think the best way to separate the meaning of these two parameters is:

gamma = how much to discount the rewards we’re going to get

lambda = how much to discount looking ahead
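
A small sketch of the propagation point above, assuming the same 5-state random-walk encoding with values initialized to 0.5 and a made-up episode: TD(0) changes only the last state visited before termination, while TD(lambda > 0) spreads the same final TD error back along the eligibility trace.

    import numpy as np

    def delta_after_one_episode(states, z, lam, alpha=0.1):
        """Change in the value estimates after a single TD(lambda) episode."""
        w = np.full(5, 0.5)                      # initial predictions
        e = np.zeros(5)                          # eligibility trace (gamma = 1)
        for t, s in enumerate(states):
            e = lam * e
            e[s] += 1.0
            p_next = z if t == len(states) - 1 else w[states[t + 1]]
            w += alpha * (p_next - w[s]) * e
        return w - 0.5

    episode = [2, 3, 2, 3, 4]                    # a walk that terminates on the right (z = 1)
    print("TD(0)   delta:", delta_after_one_episode(episode, 1.0, lam=0.0))
    print("TD(0.8) delta:", delta_after_one_episode(episode, 1.0, lam=0.8))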

5
Q

In Sutton 88, experiment 1, what is the convergence criterion? What effect does it have on the TD algorithm?

A

The weight vector was not updated after each sequence; instead, the increments were accumulated and applied only after the complete presentation of the training set, and each training set was presented repeatedly until the weight vector no longer changed significantly. This was likely done to account for the stochastic nature of the problem and to allow convergence with a small number of sequences and training sets.
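
A minimal sketch of this repeated-presentations scheme, assuming a training set given as a list of (visited-state-indices, outcome) pairs from the random walk; the step size, tolerance, and sweep cap are my own choices.

    import numpy as np

    def train_until_convergence(training_set, lam=0.0, alpha=0.1, tol=1e-9, max_sweeps=100000):
        """Accumulate TD(lambda) increments over the whole training set, apply them
        only after the full presentation, and repeat until the weights stop changing."""
        w = np.full(5, 0.5)
        for _ in range(max_sweeps):
            dw = np.zeros(5)
            for states, z in training_set:
                e = np.zeros(5)                  # eligibility trace (gamma = 1)
                for t, s in enumerate(states):
                    e = lam * e
                    e[s] += 1.0
                    p_next = z if t == len(states) - 1 else w[states[t + 1]]
                    dw += alpha * (p_next - w[s]) * e
            if np.max(np.abs(dw)) < tol:         # the convergence criterion
                break
            w += dw                              # update only after the full pass
        return w

    # Two made-up sequences: one exits on the right (outcome 1), one on the left (outcome 0).
    print(train_until_convergence([([2, 3, 4], 1.0), ([2, 1, 0], 0.0)]))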

6
Q

In Sutton 88, is the TD algorithm used in the paper the same as the TD we saw in class?

A

No. The Sutton 88 paper presents the TD update rule in the forward view, whereas the TD we saw in class used the backward-view update rule.
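
For comparison (to the best of my recollection of the two formulations), the per-time-step weight update used in the paper is

    Δw_t = α (P_{t+1} − P_t) Σ_{k=1}^{t} λ^{t−k} ∇_w P_k,

while the backward-view form from class keeps an explicit eligibility trace, e_t = γλ e_{t−1} + ∇_w V(s_t), and scales it by the one-step TD error δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t).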

7
Q

In Sutton 88, what is the value of ∇_w P_t in the paper if x_t = [0, 0, 1, 0, 0], in the context of the random-walk example? Why?

A

Since the paper uses a linear predictor, P_t = w^T x_t, the gradient ∇_w P_t is just the feature vector itself: x_t = [0, 0, 1, 0, 0]. (The vector [1/6, 1/3, 1/2, 2/3, 5/6] gives the true predictions for the five non-terminal random-walk states, not the gradient.)
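
A quick numeric check of this (my own sketch; the particular weight vector and finite-difference step are arbitrary): with a linear predictor, perturbing each component of w changes P_t at a rate equal to the corresponding component of x_t, so the gradient equals x_t.

    import numpy as np

    x_t = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
    w = np.array([0.2, 0.4, 0.5, 0.6, 0.8])      # any weight vector works
    eps = 1e-6

    # finite-difference approximation of grad_w (w . x_t)
    grad = np.array([(np.dot(w + eps * np.eye(5)[i], x_t) - np.dot(w, x_t)) / eps
                     for i in range(5)])
    print(np.round(grad, 6))                     # -> [0. 0. 1. 0. 0.]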

8
Q

In Sutton 88, why does TD(0) do best in figure 3 of Prof. Sutton’s paper (you can use the intuition from the above question)?

A

In the linear case with repeated presentations, TD(0) converges to the maximum-likelihood estimate of the underlying Markov process, which is the estimate that minimizes error on future experience; the Widrow-Hoff procedure (lambda = 1) instead minimizes error only on the observed training data.

9
Q

Intuitively, why would TD(0) converge to “the maximum likelihood estimate of the underlying MDP”?

A

With repeated presentations of a finite data set, TD(0) updates the value of each state toward the target (the immediate reward r plus the discounted value V(s') of the next state), and each possible next state contributes to these updates in proportion to how often it actually follows the originating state in the data, i.e., in proportion to the empirical transition probabilities. At convergence, the value of a state is therefore essentially its expected target under those empirical probabilities. The maximum-likelihood estimate of the MDP is built from the same empirical probabilities, so TD(0) is expected to converge to the values of that MLE model.
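
A minimal sketch of that claim, assuming the 5-state random walk and a batch-style TD(0) with repeated presentations (the data set size, step size, and tolerances are my own choices): fit the maximum-likelihood transition model by counting, solve for its exact values, then run batch TD(0) on the same data and compare.

    import numpy as np

    rng = np.random.default_rng(1)

    def generate_sequence():
        """One bounded random walk from the center state (indices 0..4 are non-terminal)."""
        s, states = 2, [2]
        while True:
            s += rng.choice([-1, 1])
            if s < 0:
                return states, 0.0               # absorbed on the left
            if s > 4:
                return states, 1.0               # absorbed on the right
            states.append(s)

    data = [generate_sequence() for _ in range(25)]

    # (a) Maximum-likelihood model: count transitions, then solve v = P v + p_win.
    counts = np.zeros((5, 7))                    # columns 0..4: states, 5: lose, 6: win
    for states, z in data:
        for t, s in enumerate(states):
            nxt = states[t + 1] if t + 1 < len(states) else (6 if z == 1.0 else 5)
            counts[s, nxt] += 1
    P_hat = counts / counts.sum(axis=1, keepdims=True)
    v_mle = np.linalg.solve(np.eye(5) - P_hat[:, :5], P_hat[:, 6])

    # (b) Batch TD(0): accumulate increments over the whole data set, apply, repeat.
    w = np.full(5, 0.5)
    for _ in range(20000):
        dw = np.zeros(5)
        for states, z in data:
            for t, s in enumerate(states):
                p_next = z if t == len(states) - 1 else w[states[t + 1]]
                dw[s] += 0.005 * (p_next - w[s])
        if np.max(np.abs(dw)) < 1e-10:
            break
        w += dw

    print("MLE model values:", np.round(v_mle, 4))
    print("batch TD(0):     ", np.round(w, 4))   # the two should agree closely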

10
Q

What is an eligibility trace?

A

A temporary, decaying record of the occurrence of an event, such as visiting a state: the trace is bumped up when the event occurs and then decays geometrically (by a factor of γλ per step), which is what produces the exponentially decaying weights (1−λ)λ^(n−1) on the n-step returns in the λ-return.
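
A tiny sketch of how such a record behaves for a one-hot state encoding (the gamma, lambda, and visit order below are my own choices): the trace spikes when a state is visited and then fades geometrically.

    import numpy as np

    gamma, lam = 1.0, 0.5
    e = np.zeros(5)                              # one trace entry per state
    for s in [2, 3, 4]:                          # states visited on successive steps
        e = gamma * lam * e                      # old events fade geometrically
        e[s] += 1.0                              # record the new event
        print(np.round(e, 3))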
