FinalExamExamples1 Flashcards

1
Q

Markov means RL agents are amnesiacs and forget everything up until the current state.

A

True. A process is defined as Markov if the transition to the next state is fully defined by the current state. However, the current state can be expanded to include past states to force non-Markov processes into a Markovian framework.
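For illustration, one common way to do that expansion is to stack a window of recent observations into the state. A minimal sketch, assuming a hypothetical environment with reset() and step() methods (not tied to any particular library):

```python
from collections import deque

class HistoryWrapper:
    """Make the state (approximately) Markov by stacking the last k observations.

    `env` is assumed to expose reset() -> obs and step(action) -> (obs, reward, done);
    this interface is hypothetical, not tied to any particular library.
    """

    def __init__(self, env, k=4):
        self.env = env
        self.k = k
        self.history = deque(maxlen=k)

    def reset(self):
        obs = self.env.reset()
        self.history.clear()
        for _ in range(self.k):          # pad the window with the first observation
            self.history.append(obs)
        return tuple(self.history)       # the augmented state

    def step(self, action):
        obs, reward, done = self.env.step(action)
        self.history.append(obs)
        return tuple(self.history), reward, done
```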

2
Q

In RL, recent moves influence outcomes more than moves further in the past.

A

False. One of the main ideas of RL is that the value of a state or action may depend on states and actions far back in the history. The discount factor gamma moderates this: larger gamma values (but strictly less than one in infinite-horizon settings) make the agent care more about long-term rewards.
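As a concrete (made-up) illustration of that trade-off, here is the discounted return of a reward sequence whose only payoff arrives late, under three different gammas:

```python
# Discounted return G = sum_t gamma^t * r_t for a toy reward sequence.
rewards = [0] * 9 + [10]   # a reward of 10 arriving at the tenth step

def discounted_return(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

for gamma in (0.5, 0.9, 0.99):
    print(f"gamma={gamma}: G = {discounted_return(rewards, gamma):.3f}")
# gamma=0.5  -> ~0.020  (the distant reward is almost invisible)
# gamma=0.9  -> ~3.874
# gamma=0.99 -> ~9.135  (the agent still cares about the distant reward)
```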

3
Q

In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s rewards (terminal and non-terminal) the optimal policy will not change

A

True. Adding the same scalar to every reward shifts every state's value by the same constant (10 / (1 - gamma) in the discounted infinite-horizon view), so the relative ordering of actions, and hence the optimal policy, does not change.
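A quick way to convince yourself on a toy problem (a hypothetical two-state, two-action MDP with no terminal states, not the course gridworld): run value iteration with rewards R and again with R + 10 and compare; every value shifts by the same constant and the greedy policy is identical.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: T[s, a, s'] transition probs, R[s, a] rewards.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

def value_iteration(T, R, gamma, iters=1000):
    V = np.zeros(T.shape[0])
    for _ in range(iters):
        Q = R + gamma * (T @ V)      # Q[s, a] = R[s, a] + gamma * sum_s' T[s, a, s'] V[s']
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)       # state values and the greedy policy

V1, pi1 = value_iteration(T, R, gamma)
V2, pi2 = value_iteration(T, R + 10.0, gamma)
print(pi1, pi2)      # identical greedy policies
print(V2 - V1)       # every value shifted by 10 / (1 - gamma) = 100
```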

4
Q

An MDP given a fixed policy is a Markov Chain with rewards.

A

True. A fixed policy means the action taken in each state is determined by (or drawn from a fixed distribution over) the state alone. The next-state distribution then depends only on the current state, so the MDP reduces to a Markov chain with rewards (a Markov reward process).
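Concretely, the policy just marginalizes the action out of T and R; a minimal sketch on a hypothetical two-state, two-action MDP:

```python
import numpy as np

T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])   # T[s, a, s']
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])                 # R[s, a]
pi = np.array([[1.0, 0.0],
               [0.5, 0.5]])                # pi[s, a] = probability of taking a in s

# Collapse the MDP under the fixed policy into a Markov chain with rewards.
P_chain = np.einsum('sa,sax->sx', pi, T)   # P_chain[s, s'] = sum_a pi[s, a] T[s, a, s']
r_chain = np.einsum('sa,sa->s', pi, R)     # r_chain[s]     = sum_a pi[s, a] R[s, a]

print(P_chain)   # rows sum to 1: a valid Markov chain transition matrix
print(r_chain)   # expected one-step reward in each state under pi
```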

5
Q

It is not always possible to convert a finite horizon MDP to an infinite horizon MDP.

A

False. We can always convert a finite-horizon MDP to an infinite-horizon one by adding an absorbing terminal state with a zero-reward self-loop (folding the time step into the state if the dynamics or policy need to depend on it).
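One standard construction, sketched below with hypothetical array shapes: fold the time step into the state and route everything past the horizon into a single absorbing state, whose zero-reward self-loop makes the process infinite-horizon.

```python
import numpy as np

def finite_to_infinite(T, R, H):
    """Convert a finite-horizon MDP (T[s, a, s'], R[s, a], horizon H) into an
    infinite-horizon one whose states are (s, t) pairs plus one absorbing state."""
    S, A, _ = T.shape
    N = S * H + 1                      # (state, time) pairs plus the absorbing state
    absorb = N - 1
    T2 = np.zeros((N, A, N))
    R2 = np.zeros((N, A))
    for t in range(H):
        for s in range(S):
            i = t * S + s
            R2[i] = R[s]
            if t + 1 < H:              # same dynamics, one step later in time
                T2[i, :, (t + 1) * S:(t + 2) * S] = T[s]
            else:                      # horizon reached: everything funnels into the sink
                T2[i, :, absorb] = 1.0
    T2[absorb, :, absorb] = 1.0        # zero-reward self-loop: the infinite-horizon part
    return T2, R2
```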

6
Q

If we know optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix.

A

False. Knowing the optimal Q-values immediately gives the optimal V-values, no transition model required: V*(s) = max_a Q*(s, a).
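In code form, with a made-up tabular Q array of shape [states, actions], recovering V* (and the greedy policy) needs no transition model at all:

```python
import numpy as np

# Made-up optimal Q-values for a 3-state, 2-action problem, Q[s, a].
Q = np.array([[1.0, 2.5],
              [0.3, 0.1],
              [4.0, 4.2]])

V = Q.max(axis=1)       # V*(s) = max_a Q*(s, a), no transition model required
pi = Q.argmax(axis=1)   # the greedy (optimal) policy falls out for free

print(V)    # [2.5 0.3 4.2]
print(pi)   # [1 0 1]
```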

7
Q

The value of the returned policy is the only way to evaluate a learner.

A

False. Other criteria matter too, such as the learner's time and space complexity and how much experience it needs to reach a good policy.

8
Q

The optimal policy for any MDP can be found in polynomial time.

A

True. For any (finite) MDP, we can form the associated linear program (LP) and solve it in time polynomial in the number of states and actions (via interior point methods, for example, although for large state spaces an LP may not be practical). Then we take the policy that is greedy with respect to the resulting values, and voila!
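A sketch of that LP on a hypothetical toy MDP, using scipy.optimize.linprog as an off-the-shelf solver: minimize sum_s V(s) subject to V(s) >= R(s, a) + gamma * sum_s' T(s, a, s') V(s') for every (s, a), then act greedily with respect to the solution.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical 2-state, 2-action MDP.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])   # T[s, a, s']
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])                 # R[s, a]
gamma = 0.9
S, A, _ = T.shape

# Constraints rewritten for linprog's A_ub @ x <= b_ub form:
#   -(I[s, :] - gamma * T[s, a, :]) @ V <= -R[s, a]   for every (s, a)
c = np.ones(S)                             # minimize sum_s V(s)
A_ub = np.vstack([-(np.eye(S)[s] - gamma * T[s, a]) for s in range(S) for a in range(A)])
b_ub = np.array([-R[s, a] for s in range(S) for a in range(A)])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * S)
V = res.x
pi = (R + gamma * (T @ V)).argmax(axis=1)  # greedy policy w.r.t. the LP solution
print(V, pi)
```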

9
Q

A policy that is greedy - with respect to the optimal value function - is not necessarily an optimal policy.

A

False. Acting greedily with respect to the optimal value function yields an optimal policy by definition; this is exactly what the Bellman optimality equation expresses.

10
Q

In TD learning, the sum of the learning rates must converge for the value function to converge.

A

False. It is the sum of the SQUARES of the learning rates that must be finite; the sum of the learning rates themselves must actually diverge.
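Written out, these are the standard stochastic-approximation (Robbins-Monro) conditions on the learning rates alpha_t:

```latex
\sum_{t=1}^{\infty} \alpha_t = \infty,
\qquad
\sum_{t=1}^{\infty} \alpha_t^{2} < \infty
```

For example, alpha_t = 1/t satisfies both; a constant learning rate satisfies the first but not the second, and alpha_t = 1/t^2 fails the first.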

11
Q

Monte Carlo is an unbiased estimator of the value function compared to TD methods. Therefore it is the preferred algorithm when doing RL with episodic tasks.

A

False. TD(1) is actually equivalent to MC. It is true that MC is an unbiased estimator of the value function, but it has high VARIANCE; TD, on the other hand, is biased (it bootstraps) but has much lower variance, which often makes it learn faster from sequential data. So MC being unbiased does not make it the preferred choice for episodic tasks.
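The bias/variance difference is easiest to see from the two update rules side by side; a minimal tabular sketch, where `episode` is a hypothetical list of (state, reward, next_state) transitions with next_state set to None at the end:

```python
def td0_update(V, episode, alpha=0.1, gamma=0.9):
    """TD(0): bootstrap from V[next_state], biased (V may be wrong) but low variance."""
    for s, r, s_next in episode:
        target = r + gamma * (V.get(s_next, 0.0) if s_next is not None else 0.0)
        V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))

def mc_update(V, episode, alpha=0.1, gamma=0.9):
    """Monte Carlo: use the actual return G, unbiased but high variance."""
    G = 0.0
    for s, r, _ in reversed(episode):    # accumulate returns from the end backwards
        G = r + gamma * G
        V[s] = V.get(s, 0.0) + alpha * (G - V.get(s, 0.0))

V_td, V_mc = {}, {}
episode = [('A', 0.0, 'B'), ('B', 0.0, 'C'), ('C', 1.0, None)]
td0_update(V_td, episode)
mc_update(V_mc, episode)
print(V_td, V_mc)   # after one episode TD(0) has only moved V['C']; MC credits every visited state
```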

12
Q

Backward and forward TD(lambda) can be applied to the same problems.

A

True (Sutton & Barto, chapter 12, discusses the forward and backward views and shows their equivalence). In practice, however, the backward view, implemented with eligibility traces, is usually easier to compute online.
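A minimal sketch of the backward view with accumulating eligibility traces (same hypothetical (state, reward, next_state) episode format, with next_state None at the end):

```python
def td_lambda_backward(V, episode, alpha=0.1, gamma=0.9, lam=0.8):
    """Backward-view TD(lambda), tabular, with accumulating eligibility traces."""
    e = {}                                            # eligibility trace per state
    for s, r, s_next in episode:
        v_next = V.get(s_next, 0.0) if s_next is not None else 0.0
        delta = r + gamma * v_next - V.get(s, 0.0)    # one-step TD error
        e[s] = e.get(s, 0.0) + 1.0                    # bump the trace of the visited state
        for state, trace in e.items():                # credit all recently visited states
            V[state] = V.get(state, 0.0) + alpha * delta * trace
            e[state] = gamma * lam * trace            # decay traces toward zero
    return V

print(td_lambda_backward({}, [('A', 0.0, 'B'), ('B', 0.0, 'C'), ('C', 1.0, None)]))
```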

13
Q

Offline algorithms are generally superior to online algorithms.

A

False. It depends on the problem context. Online algorithms update values as soon as new information is available, which can make the most efficient use of experience, but neither family is universally superior.

14
Q

Given a model (T, R) that we can also sample from, we should first try TD learning.

A

False. You have a model, so use it: solve it directly with value or policy iteration (or use it for planning) rather than falling back on model-free TD learning.

15
Q

TD(1) slowly propagates information, so it does better in the repeated presentations regime rather than with single presentations.

A

False. It is the other way around: TD(0) propagates information slowly (one step per presentation), so it is TD(0) that benefits from repeated presentations, while TD(1) uses the full return and propagates information all the way back in a single presentation.

16
Q

In TD(lambda), we should see the same general curve for best learning rate (lowest error) regardless of the lambda value.

A

False. This was demonstrated in Sutton's 1988 paper: the error-versus-learning-rate curves differ noticeably across lambda values, and the best learning rate shifts with lambda.