Review Session #3 Flashcards

1
Q

True or False: In general, an update rule which is not a non-expansion will not converge.

A

False. Coco-Q, noted in Lecture 4, is an exception: it is not a non-expansion, yet it still converges. In general, though, update rules that are not non-expansions should not be expected to converge.
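
For reference, a sketch of the standard convergence argument this card alludes to (the operator B and the max norm below are standard notation, not quoted from the card):

    If B is a contraction,
        \|BQ_1 - BQ_2\|_\infty \le \gamma \|Q_1 - Q_2\|_\infty,  with 0 \le \gamma < 1,
    then by the Banach fixed-point theorem B has a unique fixed point Q^* and the
    iteration Q_{t+1} = B Q_t converges to it from any starting Q_0. Update rules
    without such a property carry no comparable guarantee in general.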

2
Q

True or False: MDPs are a type of Markov game.

A

True. An MDP is a single-player Markov game.
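
To make the relationship concrete (standard tuples, assumed rather than quoted from the lecture):

    A Markov game with n players is (S, A_1, ..., A_n, T, R_1, ..., R_n); setting
    n = 1 gives (S, A, T, R), an ordinary MDP, so an MDP is the single-player case.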

3
Q

True or False: Contraction mappings and non-expansions are concepts used to prove the convergence of RL algorithms, but are otherwise unrelated concepts.

A

False. The two are directly related: a contraction mapping is a non-expansion whose Lipschitz constant is strictly less than 1, and both properties are used in the convergence proofs.
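
The two definitions side by side (standard notation, for reference):

    Non-expansion:  \|Bx - By\| \le \|x - y\|
    Contraction:    \|Bx - By\| \le \gamma \|x - y\|,  with 0 \le \gamma < 1

    Every contraction is therefore a non-expansion, but not conversely.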

4
Q

True or False: Linear programming is the only way we are able to solve MDPs in linear time.

A

False. Linear programming solves MDPs in polynomial time, not linear time (it is the only known approach with a worst-case polynomial-time guarantee); in practice, dynamic-programming methods such as value iteration and policy iteration are more commonly used.
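
For reference, the primal LP usually written for MDP solving (notation assumed here, not quoted from the card):

    minimize    \sum_s V(s)
    subject to  V(s) \ge R(s, a) + \gamma \sum_{s'} T(s, a, s') V(s')   for all s, a

    The optimum is the smallest V satisfying every Bellman inequality, which is V^*.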

5
Q

True or False: The objective of the dual LP presented in lecture is minimization of “policy flow”. (The minimization is because we are aiming to find an upper bound on “policy flow”.)

A

False. The dual LP is a maximization problem: it maximizes the reward gathered by the “policy flow”, subject to conservation-of-flow constraints.
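
A sketch of that dual LP (variable names assumed, not quoted from the lecture):

    maximize    \sum_{s,a} q(s, a) R(s, a)
    subject to  \sum_a q(s', a) = c(s') + \gamma \sum_{s,a} q(s, a) T(s, a, s')   for all s'
                q(s, a) \ge 0

    Here q(s, a) is the “policy flow” through (s, a) and c(s') is the flow injected
    at s' (the coefficient of V(s') in the primal objective).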

6
Q

True or False: Any optimal policy found with reward shaping is the optimal policy for the original MDP.

A

False. Only potential-based reward shaping is guaranteed to preserve the original MDP’s optimal policy; arbitrary shaping rewards can change it.
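
The potential-based form (Ng, Harada and Russell), stated in standard notation for reference:

    F(s, a, s') = \gamma \Phi(s') - \Phi(s)

    Adding F to the reward preserves the optimal policy for any potential function
    \Phi defined over states.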

7
Q

True or False: Potential-based shaping will find an optimal policy faster than an unshaped MDP.

A

False. This depends on the chosen potential: a poorly chosen potential can slow learning, for example by temporarily steering the agent into a sub-optimal loop before it eventually finds the optimal policy.
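
A minimal sketch of how the shaping bonus enters a Q-learning update; the names Q, phi, the action set, and the step sizes are assumed for illustration, not taken from the lecture:

    # Sketch only: one Q-learning update with a potential-based shaping bonus.
    from collections import defaultdict

    def shaped_q_update(Q, s, a, r, s_next, actions, phi, alpha=0.1, gamma=0.99):
        f = gamma * phi(s_next) - phi(s)                  # shaping bonus F(s, s')
        best_next = max(Q[(s_next, b)] for b in actions)  # greedy backup value
        Q[(s, a)] += alpha * (r + f + gamma * best_next - Q[(s, a)])

    Q = defaultdict(float)
    shaped_q_update(Q, s=0, a=1, r=0.0, s_next=1, actions=(0, 1), phi=lambda s: float(s))

A potential close to the optimal value function tends to speed learning; a misleading one can slow it, even though the optimal policy is preserved.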

8
Q

True or False: Rmax will always find the optimal policy for a properly tuned learning function.

A

False. Rmax is not guaranteed to find the optimal policy; its guarantee is near-optimal performance with high probability after a polynomial number of steps.
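
A minimal sketch of the Rmax optimism rule, with the model-learning and planning steps omitted; the names counts, empirical_r, m, and rmax are assumed for illustration:

    # Sketch: a state-action tried fewer than m times is treated as if it paid the
    # maximum reward rmax, which drives the agent to explore it.
    def optimistic_reward(counts, empirical_r, s, a, m=5, rmax=1.0):
        if counts.get((s, a), 0) < m:
            return rmax                                   # "unknown": assume the best
        return empirical_r[(s, a)] / counts[(s, a)]       # "known": empirical mean

    counts = {(0, 0): 2, (0, 1): 7}
    empirical_r = {(0, 0): 1.3, (0, 1): 4.2}
    print(optimistic_reward(counts, empirical_r, 0, 0))   # 1.0 (still unknown)
    print(optimistic_reward(counts, empirical_r, 0, 1))   # 0.6 (empirical mean)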

9
Q

True or False: Q-learning converges only under certain exploration decay conditions.

A

False. Q-learning is off-policy, so it can converge even when actions are chosen at random, provided every state-action pair is visited infinitely often and the learning rates satisfy the usual stochastic-approximation conditions.
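
The usual conditions, stated for reference in common notation:

    Q(s, a) \leftarrow Q(s, a) + \alpha_t [ r + \gamma \max_{a'} Q(s', a') - Q(s, a) ]
    \sum_t \alpha_t = \infty,   \sum_t \alpha_t^2 < \infty,

    together with every (s, a) pair being visited infinitely often. Nothing here
    requires the exploration rate itself to decay.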

10
Q

True or False: The trade-off between exploration and exploitation is not applicable to finite bandit domains since we are able to sample all options.

A

False. The trade-off still applies: even in a finite bandit domain we must decide how long to keep exploring each arm before exploiting the one we believe is best, and we only stop exploring an arm once we are sufficiently confident about its value.
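
A small sketch of that confidence-driven stopping idea using a Hoeffding-style bound; the arm distributions, delta, and the uniform sampling loop are assumptions for illustration:

    import math, random

    # Sample every arm until a confidence interval separates the apparent best arm
    # from the rest, then exploit it (sketch only).
    def width(n, delta=0.05):
        return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

    def best_arm(arms, max_rounds=10000):
        means, counts = [0.0] * len(arms), [0] * len(arms)
        for _ in range(max_rounds):
            for i, arm in enumerate(arms):
                counts[i] += 1
                means[i] += (arm() - means[i]) / counts[i]
            best = max(range(len(arms)), key=lambda i: means[i])
            if all(means[best] - width(counts[best]) >= means[i] + width(counts[i])
                   for i in range(len(arms)) if i != best):
                break                                 # confident enough: stop exploring
        return best

    arms = [lambda: random.random() * 0.4, lambda: random.random()]   # arm 1 pays more
    print(best_arm(arms))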
