Final Review pt. 3 Flashcards

1
Q

True or False: The trade-off between exploration and exploitation is not applicable to finite bandit domains since we are able to sample all options. Why?

A

False. Because the problem is stochastic, we must visit each arm several times to build confidence in the likelihood of each outcome, so the exploration/exploitation trade-off still applies.

On the other hand, if the problem were deterministic, a single visit to each arm would be enough to know with certainty the reward for choosing it in the future.
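A minimal sketch (the Bernoulli arm and its payoff probability are made up for illustration) of why one sample is not enough under stochasticity:

import random

def pull(p=0.3):
    # Hypothetical stochastic arm: pays 1 with probability p, else 0.
    return 1 if random.random() < p else 0

# A single pull says almost nothing about the arm's mean...
one_sample = pull()
# ...but averaging many pulls concentrates around the true mean (~0.3).
estimate = sum(pull() for _ in range(10000)) / 10000
print(one_sample, round(estimate, 2))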

2
Q

What are the benefits of an off-policy algorithm?

A

Off-policy learning allows the agent to evaluate and improve a policy that is different from the policy used for action selection: target policy != behavior policy.

This allows for continuous exploration, learning from demonstration, and parallel learning.
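A minimal tabular sketch (names like `behavior_policy` and `q_update` are made up for illustration) of the target/behavior split in Q-learning:

import random
from collections import defaultdict

Q = defaultdict(float)                # Q[(state, action)], tabular
actions = [0, 1, 2]
epsilon, alpha, gamma = 0.1, 0.5, 0.99

def behavior_policy(state):
    # Behavior policy: epsilon-greedy, so the agent never stops exploring.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(s, a, r, s_next):
    # Target policy: greedy (the max below). It is the policy being evaluated
    # and improved, even though the data come from the behavior policy above.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])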

3
Q

Is it possible that Q-learning doesn’t converge? Is it possible that the point it converges to is not optimal?

A

Neither is possible (for tabular Q-learning). We can prove that Q-learning always converges, and that it converges to Q*,

PROVIDED:
1. The sum of the learning rates diverges, while the sum of their squares converges (see the example schedule below).
2. Every state-action pair is sampled infinitely often.
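Written out, with a step-size schedule (the 1/t choice below is just one standard example) that satisfies both conditions:

\sum_{t} \alpha_t = \infty \quad \text{and} \quad \sum_{t} \alpha_t^2 < \infty,
\qquad \text{e.g. } \alpha_t = \tfrac{1}{t}: \; \sum_t \tfrac{1}{t} \text{ diverges (harmonic series), while } \sum_t \tfrac{1}{t^2} = \tfrac{\pi^2}{6} < \infty.

A constant learning rate fails the second condition, which is why decaying schedules appear in the convergence proof.
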
4
Q

Why can you find an optimal policy despite finding sub-optimal Q-values (e.g. via Q-learning)?

A

Provided argmax_a Q(s,a) == argmax_a Q*(s,a) for all s, the greedy policy derived from Q is still the optimal policy, even though Q(s,a) != Q*(s,a).
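A tiny made-up numerical example: every learned value is wrong, yet the argmax, and hence the greedy policy, matches the optimal one.

import numpy as np

q_star    = np.array([1.0, 2.0, 0.5])   # true optimal values for one state's actions
q_learned = np.array([0.4, 1.1, 0.2])   # sub-optimal estimates: every value is off

# The greedy policies still agree, because the best action is ranked first in both.
assert np.argmax(q_learned) == np.argmax(q_star)   # both pick action 1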

5
Q

True or False: Policy shaping requires a completely correct oracle to give the RL agent advice. Why?

A

False. Policy shaping needs a certain degree of confidence in the feedback, but a completely correct oracle is not necessary (although it would help).

The policy shaping routine depends on the probability that the oracle correctly labels an action as good vs. bad. This probability is a function of a parameter C, ranging from 0 to 1, that indicates how precise/accurate the feedback is. If a perfect oracle were required, the coefficient C would not exist and would simply be fixed at 1. Some imperfection is acceptable, though ideally the oracle is more accurate than not.
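A rough sketch of how the confidence parameter C can enter, following the delta-based form usually cited for policy shaping, where delta is the count of "good" minus "bad" labels the oracle gave a state-action pair; treat the exact formula as an assumption here.

def prob_action_is_good(delta, C):
    # delta: (# "good" labels) - (# "bad" labels) for this (s, a).
    # C: probability the oracle's feedback is correct/consistent (0 < C <= 1).
    return C**delta / (C**delta + (1 - C)**delta)

print(prob_action_is_good(delta=3, C=0.8))   # ~0.98: imperfect oracle, still high confidence
print(prob_action_is_good(delta=3, C=0.5))   # 0.5: an uninformative oracle adds nothing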

6
Q

True or False: Since Sarsa is an “online” algorithm, it will never learn the optimal policy for an environment. Why?

A

False. Assuming an appropriate step size is used, i.e. sum_t alpha_t = infinity and sum_t alpha_t^2 < infinity, and the behavior policy explores all state-action pairs adequately while becoming greedy in the limit (GLIE), Sarsa converges to the optimal policy and action-value function.
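One illustrative (made-up) schedule that meets both requirements, decaying epsilon and alpha with the visit count k of a state-action pair:

def glie_schedules(k):
    # k: number of times this state-action pair has been visited so far (k >= 1).
    epsilon_k = 1.0 / k   # exploration decays toward 0, yet every action keeps a
                          # nonzero chance of being tried at every visit
    alpha_k = 1.0 / k     # sum(1/k) diverges, sum(1/k^2) converges
    return epsilon_k, alpha_k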

7
Q

True or False: Rmax will always find the optimal policy for a properly tuned learning function. Why?

A

False. R-max does not guarantee that the optimal policy is found; it guarantees (with high probability) near-optimal behavior after a polynomial number of steps.

8
Q

True or False: Potential-based shaping will always find an optimal policy faster than an unshaped MDP. Why?

A

False. Potential-based shaping is not magic: it only redefines the rewards (adding F(s,a,s') = gamma*Phi(s') - Phi(s)) and does not always converge faster. It is equivalent to initializing Q(s,a) with the potential Phi(s) instead of all zeros, so a well-chosen potential can give the learner a head start, but a poorly chosen one can just as easily slow learning down.
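A minimal sketch of what "redefining the rewards" looks like; the potential `phi` and `goal_state` below are made-up examples.

gamma = 0.99
goal_state = 10        # hypothetical goal position on a 1-D line

def phi(state):
    # Made-up potential: closer to the goal = higher potential.
    return -abs(state - goal_state)

def shaped_reward(r, s, s_next):
    # Potential-based shaping term F(s, a, s') = gamma * phi(s') - phi(s).
    # Adding it preserves the optimal policy; a good phi speeds learning up,
    # a poor phi can slow it down.
    return r + gamma * phi(s_next) - phi(s)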

9
Q

True or False: Any optimal policy found with reward shaping is the optimal policy for the original MDP. Why?

A

False. Reward shaping functions that are not potential-based are not guaranteed to preserve the optimal policy of the original MDP; only potential-based shaping (F(s,a,s') = gamma*Phi(s') - Phi(s)) guarantees that the optimal policy is unchanged.

10
Q

One perspective on Sarsa is that it looks a bit like policy iteration. Can you tell us which part of Sarsa is policy evaluation and which part is policy improvement?

A

Policy evaluation: the TD update of Q(S,A) toward R + gamma*Q(S',A'), which estimates the value of the policy currently being followed.

Policy improvement: choosing A' with an epsilon-greedy policy over the current Q, so the behavior becomes greedier as the estimates improve.
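A sketch of one Sarsa episode with the two roles marked; the `env` object with a reset/step interface returning (s', r, done) is a hypothetical stand-in.

import random
from collections import defaultdict

Q = defaultdict(float)
actions = [0, 1]
alpha, gamma, epsilon = 0.5, 0.99, 0.1

def eps_greedy(s):
    # Policy improvement: acting (epsilon-)greedily w.r.t. the current Q
    # makes the policy greedier as the estimates improve.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_episode(env):
    s = env.reset()
    a = eps_greedy(s)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = eps_greedy(s_next)
        # Policy evaluation: TD update of Q(S,A) toward R + gamma*Q(S',A'),
        # i.e. toward the value of the policy actually being followed.
        target = r + gamma * Q[(s_next, a_next)] * (not done)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s, a = s_next, a_next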

11
Q

In an assignment document, it says Sarsa uses TD to do control. What exactly does it mean? Which part of Sarsa is TD?

A

Sarsa still uses the TD error, delta = R + gamma*Q(S',A') - Q(S,A), to update the Q-function, just as Q-learning does. The TD part is this bootstrapped one-step update; the control part is selecting actions (epsilon-)greedily with respect to the learned Q.

12
Q

Why is Sarsa on-policy? How does this compare to an off-policy method like Q-learning, which can follow a random policy to generate data?

A

SARSA is on-policy because the action A' it bootstraps from is the action actually selected in S' by the (e.g. epsilon-greedy) behavior policy, so it evaluates the policy it is following. Q-learning is off-policy: it bootstraps from max_a Q(S',a) regardless of which action is actually taken next, so it can learn the greedy target policy even while a random behavior policy generates the data.

SARSA: Q(S,A) <- Q(S,A) + alpha[R + gamma*Q(S',A') - Q(S,A)]

Q-learning: Q(S,A) <- Q(S,A) + alpha[R + gamma*max_a Q(S',a) - Q(S,A)]
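The two updates as code, side by side, for a tabular Q stored as a dict (names here are illustrative, not from the assignment):

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy: bootstraps from the action a_next the agent will actually take.
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    # Off-policy: bootstraps from the greedy action, whatever is taken next.
    td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])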

13
Q

What is Sarsa? What’s the input and output of this function?

A

Sarsa is an on-policy TD control algorithm. Its inputs are the experience tuples (S, A, R, S', A') the agent actually generates while following its policy (along with hyperparameters such as the step size and epsilon), and its output is the estimated action-value function Q (a Q-table in the tabular case), from which the policy is derived.
