Review Session #1 Flashcards Preview

CS 7642 - OMSCS (Final Exam) > Review Session #1 > Flashcards

Flashcards in Review Session #1 Deck (8)

True or False: Markov means RL agents are amnesiacs and forget everything up until the current state.

True. In an MDP, the present state is a sufficient statistic of the past.


True or False: In RL, recent moves influence outcomes more than moves further in the past.

False. This is the credit assignment problem: we simply don't know a priori which moves mattered.

False. You can lose a chess game on your first move -- and apparently someone did.


True or False: In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s reward (terminal and non-terminal) the optimal policy will not change.

True. Adding a constant to every reward shifts each state's value by the same amount, so the relative ordering of actions, and hence the optimal policy, does not change.
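This invariance is easy to check numerically. Below is a minimal sketch with a made-up 2-state, 2-action MDP (not the lecture's gridworld): value iteration is run on the original rewards and on rewards shifted by +10, and the greedy policies come out identical.

```python
import numpy as np

# Hypothetical MDP: P[a, s, s'] transition probabilities, R[a, s] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

def greedy_policy(R):
    """Run value iteration, then return the greedy policy per state."""
    V = np.zeros(2)
    for _ in range(500):
        V = (R + gamma * P @ V).max(axis=0)   # Bellman optimality backup
    Q = R + gamma * P @ V                     # Q[a, s]
    return Q.argmax(axis=0)                   # greedy action per state

print(greedy_policy(R))
print(greedy_policy(R + 10))   # same policy: the shift raises every
                               # Q-value by the same 10 / (1 - gamma)
```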


True or False: An MDP given a fixed policy is a Markov chain with rewards.

True. A fixed policy means a fixed action is taken in each state. The next-state distribution then depends only on the current state, so the MDP reduces to a Markov chain with rewards.
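The reduction is a one-liner in code: indexing the transition tensor and reward table by the policy's chosen action collapses them into a chain transition matrix and a per-state reward vector. A minimal sketch with hypothetical random dynamics:

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP with random dynamics.
nS, nA = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']
R = rng.normal(size=(nS, nA))                   # R[s, a]
pi = np.array([1, 0, 1])                        # fixed deterministic policy

# Markov chain with rewards induced by pi:
P_pi = P[np.arange(nS), pi]    # P_pi[s, s'] = P[s, pi(s), s']
R_pi = R[np.arange(nS), pi]    # R_pi[s]     = R[s, pi(s)]

# Each row of P_pi is still a probability distribution over next states.
assert np.allclose(P_pi.sum(axis=1), 1.0)
```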


True or False: It is not always possible to convert a finite horizon MDP to an infinite horizon MDP.

False. Adding a zero-reward self-loop to each terminal state always converts a finite-horizon MDP into an infinite-horizon MDP.
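A small sketch of the construction, on a hypothetical three-state chain where state 2 is terminal: the absorbing self-loop with reward 0 contributes nothing to the return, so the discounted infinite-horizon value of the start state equals the finite-horizon return.

```python
import numpy as np

# Hypothetical chain: 0 -> 1 -> 2 (terminal), one action, reward 1 per step.
nS, nA = 3, 1
P = np.zeros((nS, nA, nS))
P[0, 0, 1] = 1.0          # state 0 moves to state 1
P[1, 0, 2] = 1.0          # state 1 moves to the terminal state
P[2, 0, 2] = 1.0          # self-loop on the terminal state ...
R = np.array([[1.0], [1.0], [0.0]])   # ... with reward 0

gamma = 0.9
V = np.zeros(nS)
for _ in range(200):                  # value iteration on the converted MDP
    V = (R + gamma * (P @ V)).max(axis=1)

print(V[0])   # 1 + gamma * 1 = 1.9, the original finite-horizon return
```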


True or False: The value of the returned policy is the only way to evaluate a learner.

False. We can also evaluate a learner on other metrics, including its time and space complexity and how much data it requires to learn effectively.


True or False: The optimal policy for any MDP can be found in polynomial time.

True. For any finite MDP, we can form the associated linear program (LP) and solve it in polynomial time (via interior-point methods, for example), then take the greedy policy with respect to the resulting values. (For large state spaces, an LP may not be practical.)
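A minimal sketch of that LP, using `scipy.optimize.linprog` on a made-up 2-state, 2-action MDP: minimize the sum of the values subject to V(s) >= R(s,a) + gamma * sum_s' P(s'|s,a) V(s') for every state-action pair, then read off the greedy policy.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical 2-state, 2-action MDP.
nS, nA, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.6, 0.4], [0.3, 0.7]]])   # P[s, a, s']
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # R[s, a]

# Constraints rewritten into linprog's A_ub @ x <= b_ub form:
#   gamma * P(.|s,a) @ V - V(s) <= -R(s,a)
A_ub = np.zeros((nS * nA, nS))
b_ub = np.zeros(nS * nA)
for s in range(nS):
    for a in range(nA):
        row = gamma * P[s, a].copy()
        row[s] -= 1.0
        A_ub[s * nA + a] = row
        b_ub[s * nA + a] = -R[s, a]

# Minimize sum_s V(s); values are unbounded, so relax linprog's default (0, inf).
res = linprog(c=np.ones(nS), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * nS)
V = res.x
policy = (R + gamma * P @ V).argmax(axis=1)   # greedy policy from V*
```

The LP solution satisfies the Bellman optimality equation, so the greedy policy it induces is optimal.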


True or False: If we know the optimal Q-values, we can get the optimal V-values only if we know the environment's transition function/matrix.

False. The optimal value function satisfies V*(s) = max_a Q*(s, a), so no knowledge of the environment's transition matrix is needed.