AAA Flashcards

1
Q

How long do we need to run VI to achieve the optimal policy?

A

The only way to guarantee the exact optimal policy is to run VI forever, BUT VI is also guaranteed to reach a value function that is "good enough" in time polynomial in the problem size, for a fixed gamma and a fixed number of bits of precision.
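
A minimal sketch of value iteration with an epsilon-based stopping rule; it assumes a small tabular MDP stored as NumPy arrays T (transitions, shape [S, A, S]) and R (rewards, shape [S, A]) plus an illustrative threshold, none of which come from the card itself.

import numpy as np

def value_iteration(T, R, gamma, epsilon=1e-6, max_iters=10_000):
    """Sweep Bellman backups until the max change between sweeps drops below epsilon."""
    S, A, _ = T.shape
    V = np.zeros(S)
    for _ in range(max_iters):
        # Q(s,a) = R(s,a) + gamma * sum_s' T(s,a,s') * V(s')
        Q = R + gamma * (T @ V)      # shape [S, A]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < epsilon:
            return V_new             # "good enough" value function
        V = V_new
    return V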

2
Q

If we are running VI and max_s |V_t(s) - V_{t-1}(s)| < epsilon for some epsilon, what is the bound on the difference in value between the current greedy policy and the optimal policy?

A

2*epsilon*gamma / (1 - gamma)
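
Stated precisely (a standard form of this bound; the notation here is assumed, with pi_t the greedy policy with respect to V_t):

\max_s |V_t(s) - V_{t-1}(s)| < \epsilon \;\Rightarrow\; \max_s |V^{\pi_t}(s) - V^{*}(s)| < \frac{2\epsilon\gamma}{1-\gamma}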

3
Q

What is the relation between gamma and how far in the future we are looking?

A

The higher gamma is, the farther in the future we are looking
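
One way to quantify this: a reward k steps ahead is weighted by gamma^k, and the total weight on the future is a geometric series, so the effective horizon grows like 1/(1 - gamma) as gamma approaches 1:

\sum_{k=0}^{\infty} \gamma^k = \frac{1}{1-\gamma}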

4
Q

T/F - Is k-step value iteration a contraction mapping? If so, what is the index of contraction? If not, why not?

A

True - The index of contraction is gamma^k for k steps of value iteration
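
In symbols, if B is the Bellman backup operator (a gamma-contraction in the max norm), then applying it k times composes into a gamma^k-contraction:

\| B^k V_1 - B^k V_2 \|_{\infty} \le \gamma^k \| V_1 - V_2 \|_{\infty}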

5
Q

Which method can be used to solve MDPs in polynomial time?

A

Linear programming

6
Q

How do we represent an MDP to solve with linear programming?

A
We replace the max over actions in the Bellman equation with a >= constraint for every action, and set the objective to minimize the sum of V over all states, i.e.
Minimize Sum_s V_s subject to
V_s >= R(s,a) + gamma * Sum_s' [T(s,a,s') * V_s'] for all s, a
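
A minimal sketch of this primal LP using scipy.optimize.linprog, under the same assumed T/R array layout as the value-iteration sketch above; the helper name and the tiny two-state MDP are illustrative only.

import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(T, R, gamma):
    """Solve: min sum_s V_s  s.t.  V_s >= R(s,a) + gamma * sum_s' T(s,a,s') * V_s'."""
    S, A, _ = T.shape
    c = np.ones(S)                        # objective: minimize sum of V_s
    # Rewrite each constraint in linprog's A_ub @ V <= b_ub form:
    #   (gamma * T[s,a,:] - e_s) @ V <= -R[s,a]
    A_ub = np.zeros((S * A, S))
    b_ub = np.zeros(S * A)
    for s in range(S):
        for a in range(A):
            row = gamma * T[s, a, :]
            row[s] -= 1.0
            A_ub[s * A + a] = row
            b_ub[s * A + a] = -R[s, a]
    # V may be negative, so lift linprog's default lower bound of 0.
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * S)
    return res.x                          # optimal value function V*

# Tiny example: action 1 in state 0 moves to the rewarding, absorbing state 1.
T = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])
print(solve_mdp_lp(T, R, gamma=0.9))      # approximately [9.0, 10.0]
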
7
Q

Why don’t people use Linear Programming (very often) to solve MDPs in practice? When is it useful?

A

There is a reasonable amount of overhead to configure and run LP for complex MDPs. VI and PI do well in practice and can often get good results quickly.
It is useful if you need to put additional linear constraints on the MDP.

8
Q

Intuitively, what does the dual representation of an MDP as a linear program represent?

A

For each state, the amount of policy flow leaving that state must equal the flow entering it (plus any flow injected there): a conservation-of-flow constraint. Policy flow is effectively how much the agent passes through a given state-action pair. We are essentially choosing how to direct flow through the state space so as to maximize reward.
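
One common way to write this dual (notation assumed here: rho(s,a) is the policy flow through state-action pair (s,a), and one unit of flow is injected at every state, matching a primal objective that weights all states equally):

\max_{\rho \ge 0} \sum_{s,a} \rho(s,a)\, R(s,a) \quad \text{s.t.} \quad \sum_{a} \rho(s',a) = 1 + \gamma \sum_{s,a} \rho(s,a)\, T(s,a,s') \;\; \forall s'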

This leads to Policy Iteration

9
Q

What fundamental characteristics does Policy Iteration have?

A
  1. Q_t -> Q* (the value estimates converge to the optimal values)
  2. Convergence is exact and complete in finite time (as with VI)
  3. Convergence is at least as fast as VI (in terms of iterations)
  4. PI will not get stuck in local optima
  5. PI iterations are significantly more expensive than VI iterations
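
A minimal sketch of policy iteration under the same assumed T/R array layout as the sketches above; the exact policy-evaluation step (a linear solve) is what makes each iteration more expensive than a single VI sweep.

import numpy as np

def policy_iteration(T, R, gamma):
    """Alternate exact policy evaluation with greedy policy improvement."""
    S, A, _ = T.shape
    pi = np.zeros(S, dtype=int)                  # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * T_pi) V = R_pi exactly.
        T_pi = T[np.arange(S), pi]               # shape [S, S]
        R_pi = R[np.arange(S), pi]               # shape [S]
        V = np.linalg.solve(np.eye(S) - gamma * T_pi, R_pi)
        # Policy improvement: act greedily with respect to V.
        Q = R + gamma * (T @ V)                  # shape [S, A]
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):           # no change => policy is optimal
            return pi, V
        pi = pi_new
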
10
Q

How do we define domination wrt MDP policies?

A

Policy 1 dominates Policy 2 if and only if for all states the value of the state with Policy 1 is greater than or equal to the value of the state with Policy 2.
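
In symbols, writing V^pi for the value function of policy pi:

\pi_1 \text{ dominates } \pi_2 \iff \forall s,\; V^{\pi_1}(s) \ge V^{\pi_2}(s)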

11
Q

How do we define strict domination wrt MDP policies?

A

Policy 1 strictly dominates Policy 2 if it dominates Policy 2 AND there exists some state for which the value of the state with Policy 1 is strictly greater than with Policy 2
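
In the same notation:

\pi_1 \text{ strictly dominates } \pi_2 \iff \big(\forall s,\; V^{\pi_1}(s) \ge V^{\pi_2}(s)\big) \;\text{and}\; \big(\exists s,\; V^{\pi_1}(s) > V^{\pi_2}(s)\big)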

12
Q

What is an epsilon optimal policy?

A

A policy is epsilon-optimal if its value function is within epsilon of the optimal value function for all states.
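
In symbols:

\pi \text{ is } \epsilon\text{-optimal} \iff \forall s,\; |V^{\pi}(s) - V^{*}(s)| \le \epsilon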
