Final Review Flashcards

1
Q

True or False: A policy that is greedy with respect to the optimal value function is not necessarily an optimal policy. Why?

A

False. A policy that is greedy with respect to the optimal value function is necessarily optimal: in every state it picks an action that maximizes the optimal action-value, and therefore the long-term cumulative reward.

A greedy policy derived from an arbitrary (non-optimal) value function, however, is not necessarily optimal.
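
In symbols, using the usual notation (T for the transition function, R for rewards, γ for the discount factor), the greedy policy with respect to the optimal values is

\pi^*(s) \in \arg\max_a Q^*(s,a) = \arg\max_a \Big[ R(s,a) + \gamma \sum_{s'} T(s,a,s')\, V^*(s') \Big]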

2
Q

True or False: The optimal policy for any MDP can be found in polynomial time. Why?

A

True for any finite MDP. Solving the MDP can be reduced to a linear program, which can be solved in time polynomial in the number of states and actions.
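
One standard reduction, as a sketch for a discounted MDP with states S, actions A, transitions T, rewards R, and discount γ < 1: find the smallest values satisfying the Bellman inequalities,

\min \sum_{s} V(s) \quad \text{s.t.} \quad V(s) \ge R(s,a) + \gamma \sum_{s'} T(s,a,s')\, V(s') \;\; \forall s \in S,\, a \in A

which is a linear program with |S| variables and |S||A| constraints.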

3
Q

True or False: The value of the returned policy is the only way to evaluate a learner. Why?

A

False. You might also care about computational efficiency, memory efficiency, or sample efficiency.

4
Q

True or False: If we know the optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix. Why?

A

False. The optimal state values follow directly from the optimal Q values by maximizing over actions, which requires no knowledge of the transition function. (It is the reverse direction, recovering the optimal Q values from the optimal state values, that needs the model.)
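
In symbols:

V^*(s) = \max_a Q^*(s,a) \qquad \text{(no model needed)}

Q^*(s,a) = R(s,a) + \gamma \sum_{s'} T(s,a,s')\, V^*(s') \qquad \text{(this direction does need T)}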

5
Q

True or False: An MDP given a fixed policy is a Markov chain with rewards. Why?

A

True. A fixed policy always selects the action the same way in a given state, so the choice of action is removed and the MDP reduces to a Markov chain with rewards.

An MDP describes (1) a set of states, (2) a set of actions (3) transition probabilities and (4) rewards for transitioning.

By applying a fixed policy (which defines what action to take in each state), we eliminate the choice of action, which means we’re left with:

[(1) a set of states (3) transition probabilities] (Markov Chain)

[(4) rewards for transitioning]. (w/ Rewards)
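
Concretely, a fixed deterministic policy π collapses the transition and reward functions to those of a Markov chain with rewards:

P^{\pi}(s' \mid s) = T(s, \pi(s), s'), \qquad R^{\pi}(s) = R(s, \pi(s))

and for a stochastic policy, average over π(a|s).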

6
Q

True or False: In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s reward (terminal and non-terminal) the optimal policy will not change. Why?

A

False. In an infinite-horizon discounted MDP, adding a constant c to every reward adds the same c/(1-γ) to every state's value and leaves the optimal policy unchanged. This gridworld is episodic, though: adding 10 makes every non-terminal step positively rewarded, so staying in the grid can become more valuable than reaching a terminal state, and the optimal policy can change.
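
A rough illustration, with numbers assumed for the sketch rather than taken from the lecture: with γ = 0.99 and an original step reward of -0.04, adding 10 makes each non-terminal step worth +9.96, so looping forever is worth about 9.96 / (1 - 0.99) ≈ 996, far more than quickly reaching the +1 terminal (now worth +11). The greedy agent would then avoid the terminal states, which is a different policy.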

8
Q

True or False: In RL, recent moves influence outcomes more than moves further in the past. Why?

A

False. It depends on the discount factor and on the problem; the RL formulation does not by itself weight recent moves more heavily than earlier ones.
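
How much a reward (and hence the move that produced it) contributes to the outcome depends on the discount factor in the return

G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

so with γ < 1 rewards further in the future are down-weighted, while with γ = 1 every step counts equally.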

9
Q

True or False: The Markov property means RL agents are amnesiacs and forget everything up until the current state. Why?

A

True, in the sense that the Markov property means the future depends only on the present state, so the agent need not remember the earlier history. The state itself, however, can be defined so that it captures whatever history is relevant.

10
Q

Explain the two types of tasks modeled by MDPs; unify the notation used to describe these two tasks, and describe the type of state that allows you to unify them. (Hint: you may find section 3.4 of S&B helpful.)

A

We are talking about episodic and continuing tasks. Their returns can be unified by giving the terminal state of an episodic task a self-loop with zero reward (an absorbing state), so that the episodic return can be written as the same infinite sum used for continuing tasks.
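
In the notation of S&B section 3.4, the unified return is

G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k

where either T = ∞ (continuing task) or γ = 1 (undiscounted episodic task) is allowed, but not both.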

11
Q

Does policy iteration or value iteration converge faster? If so, why?

A

Policy iteration typically converges in fewer iterations than value iteration: each iteration fully evaluates the current policy and then improves it greedily, and there are only finitely many policies to step through. Each policy-iteration step is more expensive, however, so fewer iterations does not always mean less total computation.

12
Q

Is it possible that value iteration just never converges on your formulation of an MDP, given a proper convergence criterion?

A

No. With a discount factor γ < 1, value iteration is guaranteed to converge, because the Bellman backup is a contraction mapping; with a proper convergence criterion (e.g., stopping when the largest value change falls below a threshold) it will terminate.

13
Q

What is the meaning of the Bellman equations? What are they saying? What is your way of visualizing them in your head?

A

The long-term return for an action is the immediate reward plus the discounted value of all possible futures, in expectation over the next states.

One way to visualize it is as a one-step lookahead: a state's value is built from the immediate reward and γ times the values of its successor states, weighted by their transition probabilities.

14
Q

What are the Bellman equations? (Try to write them down without looking at notes, since this is the single most important equation to know for this class.)

A
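
Using the same T, R, γ notation as the other cards, the Bellman optimality equations are

V^*(s) = \max_a \Big[ R(s,a) + \gamma \sum_{s'} T(s,a,s')\, V^*(s') \Big]

Q^*(s,a) = R(s,a) + \gamma \sum_{s'} T(s,a,s') \max_{a'} Q^*(s',a')

and for a fixed policy π the Bellman expectation equation is

V^{\pi}(s) = \sum_a \pi(a \mid s) \Big[ R(s,a) + \gamma \sum_{s'} T(s,a,s')\, V^{\pi}(s') \Big]
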
15
Q

What is a policy?

A

A policy is a mapping from states to actions (possibly a distribution over actions): it specifies which action to take in each state.

16
Q

What is Value Iteration?

A

Value iteration is an iterative dynamic-programming algorithm that computes the optimal value function by repeatedly applying the Bellman optimality backup to every state until the values stop changing (within a tolerance); the optimal policy is then obtained by acting greedily with respect to the resulting values.
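
The update applied to every state on each sweep is

V_{k+1}(s) = \max_a \Big[ R(s,a) + \gamma \sum_{s'} T(s,a,s')\, V_k(s') \Big]

A minimal sketch in Python, assuming T[s][a] is a list of (probability, next_state) pairs and R[s][a] is the expected immediate reward (these names are illustrative, not from the lectures):

    def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
        V = {s: 0.0 for s in states}          # start with all values at zero
        while True:
            delta = 0.0
            for s in states:
                # Bellman optimality backup: best one-step lookahead over actions
                best = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in T[s][a])
                           for a in actions)
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < eps:                   # values have (numerically) converged
                return V

The greedy policy can then be read off by taking the argmax of the same one-step lookahead at each state.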

18
Q

What is the Markov property?

A

The Markov property is a memorylessness property: the distribution over the next state (and reward) depends only on the current state and action, not on the earlier history. Relevant parts of the past can still be captured by folding them into the definition of the current state.
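
Formally, the Markov property says

\Pr(S_{t+1}=s', R_{t+1}=r \mid S_t, A_t) = \Pr(S_{t+1}=s', R_{t+1}=r \mid S_0, A_0, R_1, \ldots, S_t, A_t)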

19
Q

Is it possible to have multiple terminal states in an MDP?

A

Yes. There is no restriction preventing an MDP from having multiple terminal states.

20
Q

True or False: Contraction mappings and non-expansions are concepts used to prove the convergence of RL algorithms, but are otherwise unrelated concepts. Why?

A

False, they are closely related: a non-expansion is the general case and a contraction mapping is the special case. A contraction mapping satisfies |F(a) - F(b)| <= gamma * |a - b| with gamma strictly smaller than 1, whereas a non-expansion allows gamma = 1, so distances may shrink or stay the same but never grow.

21
Q

True or False: An update rule which is not a non-expansion will not converge without exception. Why?

A

False, at least as an absolute statement.

Being a non-expansion is a sufficient condition for convergence, not a necessary one: an update rule that is not a non-expansion is not guaranteed to converge, but the lack of the property does not guarantee a lack of convergence.

22
Q

True or False: In TD(λ), we should see the same general curve for best learning rate (lowest error), regardless of λ value. Why?

A

False, λ matters. In Sutton's TD(λ) update the eligibility term (a λ-weighted sum over past steps) multiplies the prediction difference (P_{t+1} - P_t) along with the learning rate α, so λ and α interact: at higher values of λ you need a lower α to stay below the threshold necessary for convergence, which shifts the error-versus-α curves.

23
Q

True or False: TD(1) slowly propagates information, so it does better in the repeated presentations regime rather than with single presentations. Why?

A

False.

TD(1) is equivalent to Monte Carlo learning and propagates information all the way down the “chain” within a single presentation. It is TD(0) that propagates information slowly (one step back per presentation) and therefore benefits most from repeated presentations.

24
Q

True or False: Given a model (T, R) that we can also sample from, we should first try TD learning. Why?

A

False.

While TD learning was revolutionary and is used in many algorithms today, when the model (T, R) is known, value iteration or policy iteration has better convergence properties and should be tried first.

25
Q

True or False: Offline algorithms are generally superior to online algorithms. Why?

A

False.

Online algorithms update values as soon as new information is presented, making efficient use of experience as it arrives, so offline methods are not generally superior.

26
Q

True or False: Backward and forward TD(λ) can be applied to the same problems. Why?

A

True.

Both the forward and backward views of TD(λ) converge to the same answer on the same problems; the backward view (eligibility traces) is typically easier to compute online.

27
Q

True or False: Monte Carlo is an unbiased estimator of the value function compared to TD methods. Therefore, it is the preferred algorithm when doing RL with episodic tasks. Why?

A

False. Although MC is unbiased, it has higher variance, so TD methods may still outperform MC in learning performance even on episodic tasks.

28
Q

True or False: In TD learning, the sum of the learning rates used must converge for the value function to converge. Why?

A

False. The sum of the learning rates must diverge, while the sum of the squares of the learning rates must converge.
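
These are the usual stochastic-approximation (Robbins-Monro) conditions on the learning rates:

\sum_t \alpha_t = \infty, \qquad \sum_t \alpha_t^2 < \infty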

29
Q

What are some advantages of TD methods compared to dynamic programming methods? to Monte Carlo methods?

A

TD combines the merits of DP and MC. Compared to DP, which requires the transition model for planning, TD can learn model-free from sampled experience. Compared to MC, which has high variance, TD makes more efficient use of the data from intermediate states, so it has lower variance and can learn online and incrementally.

Compared to DP: it’s model free

Compared to MC: lower variance, online, incremental method

30
Q

What can we say about the bias-variance trade-off between a 1-step estimator versus a Monte Carlo estimator (in other words, which estimator has a higher variance and which estimator has a higher bias?)

A

MC: high variance, zero bias. Good convergence properties, even with function approximation. Not sensitive to the initial value. Simple to understand and use. Effective in non-Markov environments.

TD: low variance, some bias (from bootstrapping). Usually more efficient than MC. Tabular TD(0) converges to the true value function, but convergence is not guaranteed with function approximation. More sensitive to the initial value. Exploits the Markov property.

31
Q

What’s the formula to view TD(λ) as a function of n-step estimators?

A
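
The λ-return averages the n-step returns G_t^{(n)} with geometrically decaying weights:

G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}

and for an episodic task ending at time T this becomes

G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t
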
32
Q

What is the value of the 5-step estimator of the terminal state? What about other n-step estimators of the terminal state?

A

The value of the terminal state in an episodic problem is always 0 (the episode ends and no further reward is received), so the 5-step estimator of the terminal state, like every other n-step estimator of it, is 0.

33
Q

What is a Monte-Carlo return?

A

The Monte Carlo return from a state is the total discounted reward actually observed from that state until the end of the episode. The value of the state is the expected return (Sutton and Barto, p. 92), and Monte Carlo methods estimate it by averaging sampled returns.

34
Q

What is the prediction problem?

A

‘Using past experience with an incompletely known system to predict its future behavior’

35
Q

What are model-free methods?

A

In model-free methods we do not learn the transition probabilities and rewards explicitly; we learn the value function and/or the optimal policy directly from experience instead.

36
Q

What is TD? In which algorithms is TD used?

A

TD is short for temporal difference: a value-based learning approach that updates value estimates from the difference between successive predictions. TD is used in TD(0) and TD(λ) for prediction, and it underlies control algorithms such as SARSA and Q-learning.
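
The core TD(0) update for state values, with step size α:

V(S_t) \leftarrow V(S_t) + \alpha \big[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \big]

The bracketed term is the temporal-difference error.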