AAA Flashcards

1
Q

How long do we need to run VI to achieve the optimal policy?

A

The only way to guarantee the exact optimal policy is to run VI forever, BUT VI is also guaranteed to reach a value function that is "good enough" in time polynomial in the problem size, for a fixed gamma and a fixed number of bits of precision.
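
A minimal sketch of value iteration with an epsilon-based stopping rule; it assumes a small tabular MDP stored as NumPy arrays T (transitions, shape [S, A, S]) and R (rewards, shape [S, A]) plus an illustrative threshold, none of which come from the card itself.

import numpy as np

def value_iteration(T, R, gamma, epsilon=1e-6, max_iters=10_000):
    """Sweep Bellman backups until the max change between sweeps drops below epsilon."""
    S, A, _ = T.shape
    V = np.zeros(S)
    for _ in range(max_iters):
        # Q(s,a) = R(s,a) + gamma * sum_s' T(s,a,s') * V(s')
        Q = R + gamma * (T @ V)      # shape [S, A]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < epsilon:
            return V_new             # "good enough" value function
        V = V_new
    return V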

2
Q

If we are running VI and max_s |V_t(s) - V_{t-1}(s)| < epsilon for some epsilon, what is the bound on the difference in value between the current greedy policy and the optimal policy?

A

2*epsilon*gamma / (1 - gamma)
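
Stated precisely (a standard form of this bound; the notation here is assumed, with pi_t the greedy policy with respect to V_t):

\max_s |V_t(s) - V_{t-1}(s)| < \epsilon \;\Rightarrow\; \max_s |V^{\pi_t}(s) - V^{*}(s)| < \frac{2\epsilon\gamma}{1-\gamma}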

3
Q

What is the relation between gamma and how far in the future we are looking?

A

The higher gamma is, the farther in the future we are looking
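
One way to quantify this: a reward k steps ahead is weighted by gamma^k, and the total weight on the future is a geometric series, so the effective horizon grows like 1/(1 - gamma) as gamma approaches 1:

\sum_{k=0}^{\infty} \gamma^k = \frac{1}{1-\gamma}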

4
Q

T/F - Is k-step value iteration a contraction mapping? If so, what is the index of contraction? If not, why not?

A

True - The index of contraction is gamma^k for k steps of value iteration
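
In symbols, if B is the Bellman backup operator (a gamma-contraction in the max norm), then applying it k times composes into a gamma^k-contraction:

\| B^k V_1 - B^k V_2 \|_{\infty} \le \gamma^k \| V_1 - V_2 \|_{\infty}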

5
Q

Which method can be used to solve MDPs in polynomial time?

A

Linear programming

6
Q

How do we represent an MDP to solve with linear programming?

A
We replace the max over actions in the Bellman equation with a >= constraint for every action, and set the objective to minimize the sum of V over all states, i.e.
Minimize Sum_s V_s subject to
V_s >= R(s,a) + gamma * Sum_s' [T(s,a,s') * V_s'] for all s, a
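
A minimal sketch of this primal LP using scipy.optimize.linprog, under the same assumed T/R array layout as the value-iteration sketch above; the helper name and the tiny two-state MDP are illustrative only.

import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(T, R, gamma):
    """Solve: min sum_s V_s  s.t.  V_s >= R(s,a) + gamma * sum_s' T(s,a,s') * V_s'."""
    S, A, _ = T.shape
    c = np.ones(S)                        # objective: minimize sum of V_s
    # Rewrite each constraint in linprog's A_ub @ V <= b_ub form:
    #   (gamma * T[s,a,:] - e_s) @ V <= -R[s,a]
    A_ub = np.zeros((S * A, S))
    b_ub = np.zeros(S * A)
    for s in range(S):
        for a in range(A):
            row = gamma * T[s, a, :]
            row[s] -= 1.0
            A_ub[s * A + a] = row
            b_ub[s * A + a] = -R[s, a]
    # V may be negative, so lift linprog's default lower bound of 0.
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * S)
    return res.x                          # optimal value function V*

# Tiny example: action 1 in state 0 moves to the rewarding, absorbing state 1.
T = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])
print(solve_mdp_lp(T, R, gamma=0.9))      # approximately [9.0, 10.0]
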
7
Q

Why don’t people use Linear Programming (very often) to solve MDPs in practice? When is it useful?

A

There is a reasonable amount of overhead to configure and run LP for complex MDPs. VI and PI do well in practice and can often get good results quickly.
It is useful if you need to put additional linear constraints on the MDP.

8
Q

Intuitively, what does the dual representation of an MDP as a linear program represent?

A

For each state, the amount of policy flow leaving that state must equal the flow entering it (plus any flow injected there): a conservation-of-flow constraint. Policy flow is effectively how much the agent passes through a given state-action pair. We are essentially choosing how to direct flow through the state space so as to maximize reward.
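
One common way to write this dual (notation assumed here: rho(s,a) is the policy flow through state-action pair (s,a), and one unit of flow is injected at every state, matching a primal objective that weights all states equally):

\max_{\rho \ge 0} \sum_{s,a} \rho(s,a)\, R(s,a) \quad \text{s.t.} \quad \sum_{a} \rho(s',a) = 1 + \gamma \sum_{s,a} \rho(s,a)\, T(s,a,s') \;\; \forall s'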

This leads to Policy Iteration

9
Q

What fundamental characteristics does Policy Iteration have?

A
  1. Q_t -> Q* (the value estimates converge to the optimal values)
  2. Convergence is exact and complete in finite time (as with VI)
  3. Convergence is at least as fast as VI (in terms of iterations)
  4. PI will not get stuck in local optima
  5. PI iterations are significantly more expensive than VI iterations
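
A minimal sketch of policy iteration under the same assumed T/R array layout as the sketches above; the exact policy-evaluation step (a linear solve) is what makes each iteration more expensive than a single VI sweep.

import numpy as np

def policy_iteration(T, R, gamma):
    """Alternate exact policy evaluation with greedy policy improvement."""
    S, A, _ = T.shape
    pi = np.zeros(S, dtype=int)                  # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * T_pi) V = R_pi exactly.
        T_pi = T[np.arange(S), pi]               # shape [S, S]
        R_pi = R[np.arange(S), pi]               # shape [S]
        V = np.linalg.solve(np.eye(S) - gamma * T_pi, R_pi)
        # Policy improvement: act greedily with respect to V.
        Q = R + gamma * (T @ V)                  # shape [S, A]
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):           # no change => policy is optimal
            return pi, V
        pi = pi_new
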
10
Q

How do we define domination wrt MDP policies?

A

Policy 1 dominates Policy 2 if and only if for all states the value of the state with Policy 1 is greater than or equal to the value of the state with Policy 2.
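
In symbols, writing V^pi for the value function of policy pi:

\pi_1 \text{ dominates } \pi_2 \iff \forall s,\; V^{\pi_1}(s) \ge V^{\pi_2}(s)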

11
Q

How do we define strict domination wrt MDP policies?

A

Policy 1 strictly dominates Policy 2 if it dominates Policy 2 AND there exists some state for which the value of the state with Policy 1 is strictly greater than with Policy 2
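
In the same notation:

\pi_1 \text{ strictly dominates } \pi_2 \iff \big(\forall s,\; V^{\pi_1}(s) \ge V^{\pi_2}(s)\big) \;\text{and}\; \big(\exists s,\; V^{\pi_1}(s) > V^{\pi_2}(s)\big)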

12
Q

What is an epsilon optimal policy?

A

A policy is epsilon-optimal if its value function is within epsilon of the optimal value function for all states.
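
In symbols:

\pi \text{ is } \epsilon\text{-optimal} \iff \forall s,\; |V^{\pi}(s) - V^{*}(s)| \le \epsilon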
