final Flashcards
Markov means RL agents are amnesiacs and forget everything up until the current state.
True. A process is defined as Markov if the transition to the next state is fully defined by the current state (only the present matters).
However, the current state can be expanded to include past states to force non-Markov processes into a Markovian framework.
In RL, recent moves influence outcomes more than moves further in the past.
False. One of the main ideas of RL is that the value of a state or action may depend on decisions made far back in the history; e.g., you can lose a chess game on your first move. The discount factor gamma modulates this: values closer to one (but strictly less than one in infinite-horizon settings) make the agent care more about long-term rewards.
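A minimal sketch of how gamma trades off near-term against long-term reward; the reward sequence and gamma values below are made up purely for illustration:

```python
# Discounted return G = sum_k gamma^k * r_k for a sampled reward sequence.
rewards = [0, 0, 0, 0, 10]  # a single delayed reward, five steps away

def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

for gamma in (0.1, 0.5, 0.9, 0.99):
    # Larger gamma -> the delayed reward contributes more to the return.
    print(f"gamma={gamma}: G = {discounted_return(rewards, gamma):.4f}")
```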
In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s rewards (terminal and non-terminal), the optimal policy will not change.
True. Adding the same constant to every reward shifts the value of every policy by the same amount, so the relative ordering of policies, and hence the optimal policy, is unchanged.
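A quick sanity check of why the constant shift cannot reorder policies, assuming an infinite-horizon discounted setting (derivation mine, not from the lecture):

$$V^{\pi}_{+c}(s) = \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t (r_t + c)\Big] = V^{\pi}(s) + \frac{c}{1-\gamma}$$

Since the shift $c/(1-\gamma)$ is identical for every policy and every state, the argmax over policies is unchanged.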
An MDP given a fixed policy is a Markov Chain with rewards.
True. A fixed policy means the action taken in each state is determined by the policy, so the process depends only on the current state and reduces to a Markov chain with rewards (a Markov reward process).
It is not always possible to convert a finite horizon MDP to an infinite horizon MDP.
False. We can always convert to infinite horizon by adding a terminal state with a self-loop (with a reward of 0).
If we know optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix.
False. Knowing the optimal Q values is enough by itself, since the optimal V values follow directly without any transition model:
V*(s) = max_a Q*(s, a)
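A minimal sketch of that relationship, assuming the optimal Q values are stored as a states-by-actions array (the array contents are placeholders):

```python
import numpy as np

# Hypothetical optimal Q table: rows are states, columns are actions.
Q_star = np.array([[1.0, 2.0],
                   [0.5, 0.1]])

V_star = Q_star.max(axis=1)            # V*(s) = max_a Q*(s, a) -- no model needed
greedy_policy = Q_star.argmax(axis=1)  # the greedy (optimal) policy comes for free
print(V_star, greedy_policy)
```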
The value of the returned policy is the only way to evaluate a learner.
False. The value of the returned policy is one criterion, but the learner’s time complexity, space complexity, and sample (experience) complexity are also important indicators.
The optimal policy for any MDP can be found in polynomial time.
True. For any finite MDP, we can form the associated linear program (LP) and solve it in polynomial time (via interior-point methods, for example), although for large state spaces an LP may not be practical. Acting greedily with respect to the resulting value function then gives the optimal policy.
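For reference, the standard primal LP for a discounted MDP (textbook notation, which may differ slightly from the lecture’s):

$$\min_{V} \sum_{s} V(s) \quad \text{s.t.} \quad V(s) \ge R(s,a) + \gamma \sum_{s'} T(s,a,s')\,V(s') \;\;\; \forall s,a$$

Its solution is $V^*$, from which the greedy policy is read off.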
A policy that is greedy - with respect to the optimal value function - is not necessarily an optimal policy.
False. Acting greedily with respect to the optimal value function is optimal essentially by definition; this is exactly what the Bellman optimality equations express.
In TD learning, the sum of the learning rates must converge for the value function to converge.
False. It is the sum of the SQUARES of the learning rates that must converge (be finite); the sum of the learning rates themselves must diverge.
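These are the usual stochastic-approximation (Robbins-Monro) conditions on the learning rates:

$$\sum_{t} \alpha_t = \infty \qquad \text{and} \qquad \sum_{t} \alpha_t^2 < \infty$$

For example, $\alpha_t = 1/t$ satisfies both, while a constant learning rate satisfies the first but not the second.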
Monte Carlo is an unbiased estimator of the value function compared to TD methods. Therefore it is the preferred algorithm when doing RL with episodic tasks.
False. TD(1) is actually equivalent to MC. It is true that MC is an unbiased estimator compared to TD, but it has high VARIANCE. TD, on the other hand, has high bias but low variance, which often makes it better for learning from sequential data.
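The bias/variance trade-off is easiest to see from the two update targets (standard notation):
MC target: $G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots$ (the full sampled return: unbiased, high variance).
TD(0) target: $r_{t+1} + \gamma V(s_{t+1})$ (bootstraps off the current estimate: biased while V is inaccurate, but much lower variance).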
Backward and forward TD(lambda) can be applied to the same problems.
True (S&B chapter 12 discusses these views and shows their equivalence). However, in practice backward TD(lambda) is usually easier to compute.
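A minimal sketch of the backward view with accumulating eligibility traces for tabular prediction; the episode data and hyperparameters are placeholders, not from the lecture:

```python
import numpy as np

num_states = 5
V = np.zeros(num_states)           # state-value estimates
alpha, gamma, lam = 0.1, 0.9, 0.8  # placeholder hyperparameters

# One episode of (state, reward, next_state) transitions -- made-up data.
episode = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, 3)]

e = np.zeros(num_states)           # eligibility traces, reset at episode start
for s, r, s_next in episode:
    delta = r + gamma * V[s_next] - V[s]  # one-step TD error
    e[s] += 1.0                           # bump the trace of the visited state
    V += alpha * delta * e                # every state updated in proportion to its trace
    e *= gamma * lam                      # traces decay between steps
```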
Offline algorithms are generally superior to online algorithms.
False. It depends on the problem context. Online algorithms update values as soon as new information is available and can make efficient use of each experience as it arrives.
Given a model (T, R) that we can also sample from, we should first try TD learning.
False. You have a model - use it (e.g., with value iteration or policy iteration)! TD methods such as Q-learning are better suited to the case where the model (transition and reward functions) is not known.
TD(1) slowly propagates information, so it does better in the repeated presentations regime rather than with single presentations.
False. It is TD(0) that propagates information slowly, while TD(1) propagates information all the way back along the trajectory in each presentation.
TD(1) is equivalent to MC, so it updates from full trajectories. TD(0) is a one-step method and needs repeated presentations for information to propagate through the state space.
In TD(lambda), we should see the same general curve for best learning rate (lowest error), regardless of lambda value.
True. We see the same general shape of the error-vs-learning-rate curve, but not the same lowest error, regardless of the lambda value (Fig. 4 of Sutton 1988, the TD experiments).
In general, an update rule which is not a non-expansion will not converge.
False. Coco-Q is an exception, as noted in Lecture 4; in general, though, you should expect an update rule that is not a non-expansion to fail to converge. (Admittedly a confusing one.)
MDPs are a type of Markov game.
True, MDPs are a special case of Markov games, where there is only one agent.
Contraction mappings and non-expansions are concepts used to prove the convergence of RL algorithms, but are otherwise unrelated concepts.
False, the two are closely related: both bound how much an operator can expand distances. A contraction shrinks distances by a factor strictly less than one, while a non-expansion merely never increases them, so every contraction mapping is also a non-expansion.
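In symbols, for an operator B on value functions and the max norm (standard definitions, not necessarily the lecture’s wording):

$$\|BU - BV\|_{\infty} \le \gamma\,\|U - V\|_{\infty}$$

with $\gamma < 1$ for a contraction and $\gamma \le 1$ for a non-expansion. The Bellman backup is a $\gamma$-contraction, which is what gives value iteration and Q-learning their convergence guarantees.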
Linear programming is the only way we are able to solve MDPs in linear time.
False, linear programming can solve MDPs in polynomial time, not linear time. LPs also quickly become impractical for MDPs with large state spaces.
The objective of the dual LP presented in lecture is minimization of “policy flow”. (The minimization is because we are aiming to find an upper bound on “policy flow”.)
False, the objective of the dual LP is to maximize the “policy flow”, subject to flow-conservation constraints.
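A sketch of the dual in its usual occupancy-measure (“policy flow”) form; the notation here is the textbook one and may differ from the lecture’s:

$$\max_{\rho \ge 0} \sum_{s,a} \rho(s,a)\,R(s,a) \quad \text{s.t.} \quad \sum_{a} \rho(s',a) = \mu(s') + \gamma \sum_{s,a} T(s,a,s')\,\rho(s,a) \;\;\; \forall s'$$

Here $\rho(s,a)$ is the discounted visitation (“flow”) through each state-action pair, $\mu$ is the initial-state distribution, and the constraints are conservation-of-flow conditions.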
Any optimal policy found with reward shaping is the optimal policy for the original MDP.
False, only potential-based reward shaping is guaranteed to preserve the original MDP’s optimal policy.
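For reference, potential-based shaping (Ng, Harada & Russell, 1999) adds a shaping reward of the form

$$F(s, a, s') = \gamma\,\Phi(s') - \Phi(s)$$

for some potential function $\Phi$ over states; they also show a converse, that if F is not of this form there exist MDPs on which shaping changes the optimal policy.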
Potential-based shaping will find an optimal policy faster than an unshaped MDP.
False, this depends on the selected potential. A good potential (e.g., one close to the optimal value function) can speed learning up, but a poorly chosen one can temporarily steer the agent toward sub-optimal behavior before it eventually finds the optimal policy - so shaping is not necessarily faster.
Rmax will always find the optimal policy for a properly tuned learning function.
False, Rmax is not guaranteed to find the optimal policy, but it does achieve near-optimal performance with high probability in time polynomial in the problem parameters.