Final Flashcards
(33 cards)
It is not always possible to convert a finite horizon MDP to an infinite horizon MDP.
False. You can always convert a finite-horizon MDP to an infinite-horizon one by turning each terminal state into an absorbing state that transitions to itself with reward 0.
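A minimal sketch of that conversion (made-up state/action names, just to illustrate the idea):

```python
# Hypothetical tables: T[s][a] = [(prob, next_state)], R[s][a] = reward.
T = {"goal": {}}   # terminal state: no outgoing transitions in the finite-horizon MDP
R = {"goal": {}}

# Make the terminal state absorbing: one self-loop action, probability 1, reward 0.
T["goal"]["stay"] = [(1.0, "goal")]
R["goal"]["stay"] = 0.0   # reward 0, so values are unaffected
```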
In RL, recent moves influence outcomes more than moves further in the past.
False. You can lose a game (like the chess example Prof. Isbell mentioned in one of the earliest videos) right at the beginning, and no matter how perfectly you play afterwards, you might still lose it.
Depending on the game/environment, you might even lose on the first move.
An MDP given a fixed policy is a Markov chain with rewards.
True
Markov decision processes are an extension of Markov chains; the difference is the addition of actions (allowing choice) and rewards (giving motivation). Conversely, if only one action exists for each state (e.g. “wait”) and all rewards are the same (e.g. “zero”), a Markov decision process reduces to a Markov chain.
MDP: Markov Decision Process
Markov chain: an MDP stripped of actions and distinct rewards (see above)
Markovian: the probability of future states depends only on the current state ("where I can go depends only on where I am")
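A minimal sketch of that collapse, with hypothetical array shapes (not code from the course):

```python
import numpy as np

def mdp_with_fixed_policy_to_chain(T, R, pi):
    """Collapse an MDP plus a fixed deterministic policy into a Markov chain with rewards.

    Assumed shapes: T[s, a, s2] is the transition probability, R[s, a] is the
    expected reward, and pi[s] is the action the fixed policy takes in state s.
    """
    n_states = T.shape[0]
    P = np.stack([T[s, pi[s], :] for s in range(n_states)])   # chain transition matrix P^pi
    r = np.array([R[s, pi[s]] for s in range(n_states)])      # per-state reward R^pi
    return P, r
```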
If we know the optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix.
False, you can convert from Q-values to V values without knowing the transition matrix
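As a quick illustration (toy numbers, nothing from the lecture):

```python
import numpy as np

# Toy "optimal" Q-table: 3 states x 2 actions, purely illustrative.
Q = np.array([[1.0, 2.0],
              [0.5, 0.1],
              [3.0, 3.0]])

V = Q.max(axis=1)       # V*(s) = max_a Q*(s, a): no transition model needed
pi = Q.argmax(axis=1)   # the greedy (optimal) policy falls out for free as well
print(V, pi)
```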
In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s reward (terminal and non-terminal) the optimal policy will not change.
True. Adding any scalar to every reward is no different than adding zero: min(a + 100, b + 100) picks out the same element as min(a, b), and similarly with max(), so the choice of action (the policy) is unchanged.
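One way to see this (a sketch, assuming the usual infinite-horizon discounted setting): adding a constant c to every reward shifts every policy's value function by the same amount,

V'^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}(r_t + c)\right] = V^{\pi}(s) + \frac{c}{1-\gamma},

so the ordering over actions (and over policies) is unchanged.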
Markov means RL agents are amnesiacs and forget everything up until the current state.
True. Markov means future states depend only on the current state.
The answer depends on how you interpret "state": does the agent count as part of the state?
The Q-table (agent) does retain information from previous experiences, but you don't need to reference previous actions to decide the current action to take.
A policy that is greedy with respect to the optimal value function is not necessarily an optimal policy.
False. A policy that is greedy with respect to the optimal value function is, by definition, optimal.
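For reference, "greedy with respect to V*" means choosing

\pi^*(s) = \arg\max_a \left[ R(s,a) + \gamma \sum_{s'} T(s,a,s')\, V^*(s') \right],

which attains the optimal value in every state.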
In TD learning, the sum of the learning rates used must converge for the value function to converge.
False
The sum has to diverge, but the sum of squares has to converge.
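The standard conditions on the learning rates \alpha_t are

\sum_t \alpha_t = \infty \quad \text{and} \quad \sum_t \alpha_t^2 < \infty;

for example, \alpha_t = 1/t satisfies both.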
Monte Carlo is an unbiased estimator of the value function compared to TD methods. Therefore, it is the preferred algorithm when doing RL with episodic tasks.
False. Monte Carlo returns are unbiased but high-variance; TD's bootstrapped estimates are biased but typically lower-variance and more data-efficient, so neither method is universally preferred, even for episodic tasks.
The value of the returned policy is the only metric we care about when evaluating a learner.
False. This is somewhat subjective, but machine time, data efficiency, and the experience required from data scientists are all additional things to consider.
Backward and forward TD(lambda) can be applied to the same problems.
True. However, to do it forward you have to wait for a complete episode, so the forward view can't be run online; still, the same problems can be handled both ways.
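A minimal sketch of the backward (eligibility-trace) view, with hypothetical table shapes, just to show why it can run online as transitions arrive:

```python
import numpy as np

def td_lambda_backward(transitions, V, alpha=0.1, gamma=0.99, lam=0.9):
    """Backward-view TD(lambda) over a stream of (s, r, s_next) transitions.

    V is a 1-D array of state values (assumes V[terminal] == 0), updated in place.
    Eligibility traces let each TD error update recently visited states immediately,
    which is why the backward view can be run online.
    """
    e = np.zeros_like(V)                       # eligibility traces
    for s, r, s_next in transitions:
        delta = r + gamma * V[s_next] - V[s]   # one-step TD error
        e[s] += 1.0                            # accumulating trace for the visited state
        V += alpha * delta * e                 # credit all recently visited states
        e *= gamma * lam                       # decay every trace
    return V
```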
Offline algorithms are generally superior to online algorithms.
False. Both online/on-policy and offline/off-policy algorithms can do exploration, which is the key requirement for finding optimal policies. Ignoring the ill-defined superlative ("superior"), both algorithm styles can generate the same optimal policy, but there is no guarantee as such.
As for the superlative, one must first define it. If by "superior" you mean the policy that doesn't result in falling off a cliff, then online is superior; if step-wise superiority is desired, then offline is better, but you'll fall off the cliff more often.
Given a model (T,R) we can also sample in, we should first try TD learning.
If we're already given a model, then why default to model-free methods?
On the other hand, backgammon is a good counterexample: a complete model is provided, but the state space is enormous (sweeping it would take far too long), and the most successful methods there have been model-free.
TD(1) slowly propagates information, so it does better in the repeated presentations regime rather than with single presentations.
False. TD(0) is the one that propagates information slowly; at every presentation, TD(1) propagates an outcome's information backwards to the full extent.
In TD(lambda), we should see the same general curve for best learning rate (lowest error), regardless of lambda value.
False. The shapes are similar (convex) but the minima are completely different (Desperately Seeking Sutton!)
In general, an update rule which is not a non-expansion will not converge.
True. A non-expansion is either a constant or a contraction; an update rule that is not a non-expansion is therefore neither, which makes it an expansion, and expansions diverge.
Non-expansion property (e.g., the max operator): |max(A) - max(B)| <= max(|A - B|)
MDPs are a type of Markov game
True. A Markov game with only one player (a single action space and a single corresponding reward function) is an MDP.
Contraction mappings and non-expansions are concepts used to prove the convergence of RL algorithms, but are otherwise unrelated concepts.
False. They are closely related: a non-expansion followed by (composed with) a contraction is still a contraction, which is what provides the convergence guarantees.
A contraction mapping is an operator B that maps value functions to value functions such that, for some gamma in [0, 1), |BF - BG| <= gamma * |F - G| for all value functions F and G.
Property 1 - B has a solution (fixed point) and it is unique
Property 2 - value iteration (repeatedly applying B) converges to that fixed point
Example: y = 6 and z = 8, so |y - z| = 2
B(x) = x+1 AND B(x) = x-1 are NOT contraction mappings because By and Bz stay the same distance apart (2)
B(x) = x/2 IS a contraction mapping because |By - Bz| = |3 - 4| = 1, half the original distance of 2
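A tiny sketch showing Property 2 in action, using the y = 6, z = 8 example above:

```python
# B(x) = x / 2 is a contraction with gamma = 1/2; its unique fixed point is 0.
B = lambda x: x / 2

y, z = 6.0, 8.0
for _ in range(10):
    y, z = B(y), B(z)   # the distance |y - z| halves every iteration

print(y, z, abs(y - z))  # both values, and their distance, head toward 0
```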
Linear programming is the only way we are able to solve MDPs in linear time.
False. Linear programming is the only way we know to solve MDPs in polynomial time, not linear time; the "linear" refers to the LP's objective and constraints, not its running time.
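For reference, the standard primal LP for a discounted MDP (matching the "least upper bound" description in the next card) is

\min_V \sum_s V(s) \quad \text{s.t.} \quad V(s) \ge R(s,a) + \gamma \sum_{s'} T(s,a,s')\, V(s') \;\; \forall s, a.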
The objective of the dual LP presented in lecture is minimization of “policy flow”. (The minimization is because we are aiming to find an upper bound on “policy flow”.)
False. The objective of the dual LP is to maximize "policy flow", subject to conservation-of-flow constraints. Minimization appears in the primal LP, which finds the least upper bound of the value function over states and actions.
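One standard way to write the dual (the notation may differ slightly from lecture), with q(s,a) as the "policy flow" (discounted state-action occupancy) and \mu as the initial-state distribution:

\max_{q \ge 0} \sum_{s,a} q(s,a)\, R(s,a) \quad \text{s.t.} \quad \sum_a q(s',a) = \mu(s') + \gamma \sum_{s,a} T(s,a,s')\, q(s,a) \;\; \forall s'.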
Transition function T(S,A,S’)
The transition function T(S, A, S') is the probability of ending up in state S' after taking action A in state S.
Any optimal policy found with reward shaping is the optimal policy for the original MDP.
False in general. This is true if the reward is shaped with a potential-based shaping function.
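The potential-based form referred to here (Ng, Harada & Russell, 1999) augments the reward with

F(s, a, s') = \gamma\, \Phi(s') - \Phi(s)

for some potential function \Phi over states; shaping rewards of exactly this form provably preserve the optimal policies of the original MDP.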
Potential-based shaping will find an optimal policy faster than an unshaped MDP.
“This paper examines potential-based shaping functions, which introduce artificial rewards in a particular form that is guaranteed to leave the optimal behavior unchanged yet can influence the agent’s exploration behavior to decrease the time spent trying suboptimal actions”