Final Flashcards
(33 cards)
It is not always possible to convert a finite horizon MDP to an infinite horizon MDP.
False. You can always convert a finite-horizon MDP to an infinite-horizon one by turning each terminal state into an absorbing state that transitions to itself with reward 0.
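A minimal sketch of that conversion (made-up state/action names, just to illustrate the idea):

```python
# Hypothetical tables: T[s][a] = [(prob, next_state)], R[s][a] = reward.
T = {"goal": {}}   # terminal state: no outgoing transitions in the finite-horizon MDP
R = {"goal": {}}

# Make the terminal state absorbing: one self-loop action, probability 1, reward 0.
T["goal"]["stay"] = [(1.0, "goal")]
R["goal"]["stay"] = 0.0   # reward 0, so values are unaffected
```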
In RL, recent moves influence outcomes more than moves further in the past.
False. You can lose a game (like the chess example Prof. Isbell mentioned in one of the earliest videos) right at the beginning, and no matter how perfectly you play afterwards, you might still lose it.
Depending on the game/environment, you might even lose on the first move.
An MDP given a fixed policy is a Markov chain with rewards.
True
Markov decision processes are an extension of Markov chains; the difference is the addition of actions (allowing choice) and rewards (giving motivation). Conversely, if only one action exists for each state (e.g. “wait”) and all rewards are the same (e.g. “zero”), a Markov decision process reduces to a Markov chain.
MDP: Markov Decision Process
Markov chain: an MDP stripped of actions and distinct rewards (see above)
Markovian: the probability of future states depends only on the current state ("where I can go depends only on where I am")
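A minimal sketch of that collapse, with hypothetical array shapes (not code from the course):

```python
import numpy as np

def mdp_with_fixed_policy_to_chain(T, R, pi):
    """Collapse an MDP plus a fixed deterministic policy into a Markov chain with rewards.

    Assumed shapes: T[s, a, s2] is the transition probability, R[s, a] is the
    expected reward, and pi[s] is the action the fixed policy takes in state s.
    """
    n_states = T.shape[0]
    P = np.stack([T[s, pi[s], :] for s in range(n_states)])   # chain transition matrix P^pi
    r = np.array([R[s, pi[s]] for s in range(n_states)])      # per-state reward R^pi
    return P, r
```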
If we know the optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix.
False, you can convert from Q-values to V values without knowing the transition matrix
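As a quick illustration (toy numbers, nothing from the lecture):

```python
import numpy as np

# Toy "optimal" Q-table: 3 states x 2 actions, purely illustrative.
Q = np.array([[1.0, 2.0],
              [0.5, 0.1],
              [3.0, 3.0]])

V = Q.max(axis=1)       # V*(s) = max_a Q*(s, a): no transition model needed
pi = Q.argmax(axis=1)   # the greedy (optimal) policy falls out for free as well
print(V, pi)
```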
In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s reward (terminal and non-terminal) the optimal policy will not change.
True. Adding any scalar to every reward is no different than adding zero: min(a + 100, b + 100) picks out the same element as min(a, b), and similarly with max(), so the choice of action (the policy) is unchanged.
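One way to see this (a sketch, assuming the usual infinite-horizon discounted setting): adding a constant c to every reward shifts every policy's value function by the same amount,

V'^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}(r_t + c)\right] = V^{\pi}(s) + \frac{c}{1-\gamma},

so the ordering over actions (and over policies) is unchanged.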
Markov means RL agents are amnesiacs and forget everything up until the current state.
True. Markov means future states depend only on the current state.
The answer depends on how you interpret "state": does the agent count as part of the state?
The Q-table (agent) does retain information from previous experiences, but you don't need to reference previous actions to decide the current action to take.
A policy that is greedy with respect to the optimal value function is not necessarily an optimal policy.
False. A policy that is greedy with respect to the optimal value function is, by definition, optimal.
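For reference, "greedy with respect to V*" means choosing

\pi^*(s) = \arg\max_a \left[ R(s,a) + \gamma \sum_{s'} T(s,a,s')\, V^*(s') \right],

which attains the optimal value in every state.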
In TD learning, the sum of the learning rates used must converge for the value function to converge.
False
The sum has to diverge, but the sum of squares has to converge.
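The standard conditions on the learning rates \alpha_t are

\sum_t \alpha_t = \infty \quad \text{and} \quad \sum_t \alpha_t^2 < \infty;

for example, \alpha_t = 1/t satisfies both.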
Monte Carlo is an unbiased estimator of the value function compared to TD methods. Therefore, it is the preferred algorithm when doing RL with episodic tasks.
False. Monte Carlo returns are unbiased but high-variance; TD's bootstrapped estimates are biased but typically lower-variance and more data-efficient, so neither method is universally preferred, even for episodic tasks.
The value of the returned policy is the only metric we care about when evaluating a learner.
False. This is somewhat subjective, but machine time, data efficiency, and the experience required from data scientists are all additional things to consider.
Backward and forward TD(lambda) can be applied to the same problems.
True. However, to do it forward you have to wait for a complete episode, so the forward view can't be run online; still, the same problems can be handled both ways.
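A minimal sketch of the backward (eligibility-trace) view, with hypothetical table shapes, just to show why it can run online as transitions arrive:

```python
import numpy as np

def td_lambda_backward(transitions, V, alpha=0.1, gamma=0.99, lam=0.9):
    """Backward-view TD(lambda) over a stream of (s, r, s_next) transitions.

    V is a 1-D array of state values (assumes V[terminal] == 0), updated in place.
    Eligibility traces let each TD error update recently visited states immediately,
    which is why the backward view can be run online.
    """
    e = np.zeros_like(V)                       # eligibility traces
    for s, r, s_next in transitions:
        delta = r + gamma * V[s_next] - V[s]   # one-step TD error
        e[s] += 1.0                            # accumulating trace for the visited state
        V += alpha * delta * e                 # credit all recently visited states
        e *= gamma * lam                       # decay every trace
    return V
```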
Offline algorithms are generally superior to online algorithms.
False. Both online/on-policy and offline/off-policy algorithms can do exploration, which is the key requirement for finding optimal policies. Ignoring the ill-defined superlative ("superior"), both algorithm styles can generate the same optimal policy, but there is no guarantee as such.
As for the superlative, one must first define it. If by "superior" you mean the policy that doesn't result in falling off a cliff, then online is superior; if step-wise superiority is desired, then offline is better, but you'll fall off the cliff more often.
Given a model (T,R) we can also sample in, we should first try TD learning.
If we're already given a model, then why default to model-free methods?
On the other hand, backgammon is a good counterexample: a complete model is provided, but the state space is enormous (sweeping it would take far too long), and the most successful methods there have been model-free.
TD(1) slowly propagates information, so it does better in the repeated presentations regime rather than with single presentations.
False. TD(0) is the one that propagates information slowly; at every presentation, TD(1) propagates an outcome's information backwards to the full extent.
In TD(lambda), we should see the same general curve for best learning rate (lowest error), regardless of lambda value.
False. The shapes are similar (convex) but the minima are completely different (Desperately Seeking Sutton!)
In general, an update rule which is not a non-expansion will not converge.
True. A non-expansion is either a constant or a contraction; an update rule that is not a non-expansion is therefore neither, which makes it an expansion, and expansions diverge.
Non-expansion property (e.g., the max operator): |max(A) - max(B)| <= max(|A - B|)
MDPs are a type of Markov game
True. A Markov game with only one player (a single action space and a single corresponding reward function) is an MDP.
Contraction mappings and non-expansions are concepts used to prove the convergence of RL algorithms, but are otherwise unrelated concepts.
False. They are closely related: a non-expansion followed by (composed with) a contraction is still a contraction, which is what provides the convergence guarantees.
A contraction mapping is an operator B that maps value functions to value functions such that, for some gamma in [0, 1), |BF - BG| <= gamma * |F - G| for all value functions F and G.
Property 1 - B has a solution (fixed point) and it is unique
Property 2 - value iteration (repeatedly applying B) converges to that fixed point
Example: y = 6 and z = 8, so |y - z| = 2
B(x) = x+1 AND B(x) = x-1 are NOT contraction mappings because By and Bz stay the same distance apart (2)
B(x) = x/2 IS a contraction mapping because |By - Bz| = |3 - 4| = 1, half the original distance of 2
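A tiny sketch showing Property 2 in action, using the y = 6, z = 8 example above:

```python
# B(x) = x / 2 is a contraction with gamma = 1/2; its unique fixed point is 0.
B = lambda x: x / 2

y, z = 6.0, 8.0
for _ in range(10):
    y, z = B(y), B(z)   # the distance |y - z| halves every iteration

print(y, z, abs(y - z))  # both values, and their distance, head toward 0
```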
Linear programming is the only way we are able to solve MDPs in linear time.
False. Linear programming is the only way we know to solve MDPs in polynomial time, not linear time; the "linear" refers to the LP's objective and constraints, not its running time.
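For reference, the standard primal LP for a discounted MDP (matching the "least upper bound" description in the next card) is

\min_V \sum_s V(s) \quad \text{s.t.} \quad V(s) \ge R(s,a) + \gamma \sum_{s'} T(s,a,s')\, V(s') \;\; \forall s, a.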
The objective of the dual LP presented in lecture is minimization of “policy flow”. (The minimization is because we are aiming to find an upper bound on “policy flow”.)
False. The objective of the dual LP is to maximize "policy flow", subject to conservation-of-flow constraints. Minimization appears in the primal LP, which finds the least upper bound of the value function over states and actions.
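One standard way to write the dual (the notation may differ slightly from lecture), with q(s,a) as the "policy flow" (discounted state-action occupancy) and \mu as the initial-state distribution:

\max_{q \ge 0} \sum_{s,a} q(s,a)\, R(s,a) \quad \text{s.t.} \quad \sum_a q(s',a) = \mu(s') + \gamma \sum_{s,a} T(s,a,s')\, q(s,a) \;\; \forall s'.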
Transition function T(S,A,S’)
The transition function T(S, A, S') is the probability of ending up in state S' after taking action A in state S.
Any optimal policy found with reward shaping is the optimal policy for the original MDP.
False in general. This is true if the reward is shaped with a potential-based shaping function.
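The potential-based form referred to here (Ng, Harada & Russell, 1999) augments the reward with

F(s, a, s') = \gamma\, \Phi(s') - \Phi(s)

for some potential function \Phi over states; shaping rewards of exactly this form provably preserve the optimal policies of the original MDP.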
Potential-based shaping will find an optimal policy faster than an unshaped MDP.
“This paper examines potential-based shaping functions, which introduce artificial rewards in a particular form that is guaranteed to leave the optimal behavior unchanged yet can influence the agent’s exploration behavior to decrease the time spent trying suboptimal actions”