Review Session #2 Flashcards

Question 1

Q

True or False: A policy that is greedy – with respect to the optimal value function – is not necessarily an optimal policy.

Answer

A

False. The optimal value function already accounts for all (discounted) future rewards, so a policy that is greedy with respect to the optimal value function is itself an optimal policy.
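One way to see this is to write the greedy policy as a one-step lookahead on the optimal value function. The notation below is an assumption rather than something the card specifies: T is the transition model, R the reward function (written here with arguments s, a, s'), and γ the discount factor.

```latex
\pi^*(s) = \arg\max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]
```

Because V* already summarizes the best achievable discounted return from each successor state, picking the best action one step ahead is enough.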

Question 2

Q

True or False: In TD learning, the sum of the learning rates used must converge for the value function to converge.

Answer

A

False. The sum of the learning rates must diverge for the value function to converge.
In addition, the sum of the squared learning rates must converge.
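A quick numerical sketch of the two conditions, assuming the schedule alpha_t = 1/t (a common textbook choice, not one named on the card). The helper names `partial_sums` and `harmonic` are made up for illustration.

```python
import math


def partial_sums(alpha, n_steps):
    """Return (sum of alpha_t, sum of alpha_t squared) for t = 1..n_steps."""
    total = 0.0
    total_sq = 0.0
    for t in range(1, n_steps + 1):
        a = alpha(t)
        total += a
        total_sq += a * a
    return total, total_sq


def harmonic(t):
    # Assumed schedule: alpha_t = 1 / t
    return 1.0 / t


s_small, sq_small = partial_sums(harmonic, 1_000)
s_large, sq_large = partial_sums(harmonic, 100_000)

# The plain sum keeps growing (roughly like log n) ...
print("sum of alpha:", s_small, "->", s_large)
# ... while the sum of squares levels off below pi^2 / 6.
print("sum of alpha^2:", sq_small, "->", sq_large, "limit:", math.pi ** 2 / 6)
```

A constant learning rate, by contrast, satisfies the first condition but not the second, since its sum of squares also grows without bound.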

Question 3

Q

True or False: Monte Carlo is an unbiased estimator of the value function compared to TD methods. Therefore, it is the preferred algorithm when doing RL with episodic tasks.

Answer

A

True (Monte Carlo is an unbiased estimator of the value function compared to TD methods): TD methods bootstrap from an initial estimate of the Q-values and tend to be biased toward it.

False (Therefore, it is the preferred algorithm when doing RL with episodic tasks): Many episodic tasks have long episodes, and the computational advantages of incremental TD updates favor TD methods over MC methods.
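The difference between the two update targets can be sketched as follows. This is a minimal illustration; the function names, the three-step reward sequence, and the zero initial estimate are all assumptions made for the example.

```python
def mc_target(rewards, gamma):
    """Monte Carlo target: the discounted return actually observed
    from the current step to the end of the episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def td0_target(reward, next_value, gamma):
    """TD(0) target: one observed reward plus the current estimate
    of the next state's value (bootstrapping)."""
    return reward + gamma * next_value


gamma = 0.9
rewards = [0.0, 0.0, 1.0]   # hypothetical episode: reward only at the end
initial_estimate = 0.0       # hypothetical initial value of the next state

mc = mc_target(rewards, gamma)
td = td0_target(rewards[0], initial_estimate, gamma)
print("MC target:", mc)
print("TD(0) target:", td)
```

The MC target depends only on observed rewards but needs the whole episode; the TD(0) target is available after one step but inherits whatever error the initial estimate carries.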

Question 4

Q

True or False: Backward and forward TD(lambda) can be applied to the same problems.

Answer

A

True, but backward TD(lambda) is usually easier to compute.
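For reference, the two views can be written side by side. The notation is the usual textbook one and is assumed here, not taken from the card: G_t^{(n)} is the n-step return, δ_t the TD error, e_t(s) the eligibility trace, and α the learning rate.

```latex
% Forward view: update toward the lambda-return, which needs future rewards
G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)},
\qquad
\Delta V(s_t) = \alpha \left[ G_t^{\lambda} - V(s_t) \right]

% Backward view: update every state from the current TD error and its trace
\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t),
\qquad
e_t(s) = \gamma \lambda \, e_{t-1}(s) + \mathbf{1}[s = s_t],
\qquad
\Delta V(s) = \alpha \, \delta_t \, e_t(s)
```

The backward view only needs quantities available at the current step, which is why it is usually the easier one to compute incrementally.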

Question 5

Q

True or False: Offline algorithms are generally superior to online algorithms.

Answer

A

False. Online algorithms update values at each step, as soon as new information is available, so they make the most efficient use of experience and tend to learn faster.

Question 6

Q

True or False: Given a model (T, R) that we can also sample from, we should first try TD learning.

Answer

A

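False. You have a model, so use it: with known T and R, planning methods such as value iteration can compute the value function directly.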
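As a minimal sketch of "using the model", here is value iteration on a tiny hand-made MDP. The data layout (`T[s][a]` as a list of `(probability, next_state)` pairs, `R[s][a]` as an expected reward) and the three-state example are assumptions chosen for illustration.

```python
def value_iteration(T, R, gamma, tol=1e-8):
    """Plan directly from the model.
    T[s][a] is a list of (prob, next_state); R[s][a] is the expected reward.
    States with no actions are treated as terminal (value 0)."""
    V = {s: 0.0 for s in T}
    while True:
        delta = 0.0
        for s in T:
            if not T[s]:          # terminal state: nothing to back up
                continue
            best = max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in T[s][a])
                for a in T[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


# Hypothetical MDP: from A you can stay or go to B; from B you reach the end.
T = {
    "A": {"stay": [(1.0, "A")], "go": [(1.0, "B")]},
    "B": {"go": [(1.0, "end")]},
    "end": {},
}
R = {
    "A": {"stay": 0.0, "go": 0.0},
    "B": {"go": 1.0},
    "end": {},
}

V = value_iteration(T, R, gamma=0.9)
print(V)
```

No sampled experience is needed here: every backup uses the known transition probabilities and rewards.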

Question 7

Q

True or False: TD(1) slowly propagates information, so it does better in the repeated presentations regime rather than with single presentations.

Answer

A

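False. It is TD(0) that propagates information slowly, so TD(0) is the one that benefits from repeated presentations. TD(1) propagates information all the way back along the trajectory in each presentation.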
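The contrast shows up after a single presentation of one episode. The sketch below assumes a deterministic chain of five states with a reward of 1 on the final transition, accumulating eligibility traces, and all values initialized to 0; the function name and parameters are illustrative only.

```python
def td_lambda_episode(n_states, lam, alpha=0.5, gamma=1.0):
    """Run one left-to-right episode on a chain and return the learned values.
    Backward-view TD(lambda) with accumulating eligibility traces."""
    V = [0.0] * (n_states + 1)   # last index is the terminal state, value 0
    e = [0.0] * (n_states + 1)   # eligibility traces
    for s in range(n_states):
        s_next = s + 1
        reward = 1.0 if s_next == n_states else 0.0
        delta = reward + gamma * V[s_next] - V[s]   # TD error
        e[s] += 1.0                                  # mark current state
        for i in range(n_states):
            V[i] += alpha * delta * e[i]
            e[i] *= gamma * lam                      # decay the trace
    return V[:n_states]


v_td0 = td_lambda_episode(5, lam=0.0)
v_td1 = td_lambda_episode(5, lam=1.0)
print("TD(0):", v_td0)   # [0.0, 0.0, 0.0, 0.0, 0.5]
print("TD(1):", v_td1)   # [0.5, 0.5, 0.5, 0.5, 0.5]
```

With lambda = 0 only the state next to the reward changes after one presentation; with lambda = 1 every visited state is updated.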

Question 8

Q

True or False: In TD(lambda), we should see the same general curve for best learning rate (lowest error), regardless of lambda value.

Answer

A

True. Figure 4 in Project 1 demonstrates this. At the best learning rate, the best lambda value is usually between 0.3 and 0.7.

(8 cards)