Final Exam Flashcards
What is an MDP?
A Markov decision process: a mathematical framework for sequential decision making in which an agent interacts with an environment through states, actions, and rewards.
What are the components of an MDP?
States, actions, a transition (probability) function, a reward function, and (for discounted formulations) a discount factor gamma.
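A tiny illustrative layout of those components for a made-up two-state MDP (the structure and names here are assumptions for illustration, not course code):

```python
# Illustrative two-state, two-action MDP laid out as plain Python data (toy example).
states = [0, 1]
actions = [0, 1]
gamma = 0.9  # discount factor
# P[s][a] -> list of (probability, next_state, reward) outcomes
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
```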
Is it possible to have multiple terminal states in an MDP?
Yes
What is the Markov property?
The next state (and reward) depends only on the current state and action. Everything one needs to know about the environment is encoded in the current state; there is no need to access data from past steps.
What is Value Iteration?
Iteratively applying the Bellman optimality update to the value of each state until the values converge, then reading off the greedy policy from the converged values (equivalently, policy iteration truncated to a single evaluation sweep per improvement).
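A minimal value-iteration sketch in Python, assuming a tabular MDP stored in the P[s][a] layout sketched earlier; the function name and defaults are illustrative only:

```python
def value_iteration(P, num_states, num_actions, gamma=0.9, theta=1e-8):
    """P[s][a] is a list of (probability, next_state, reward) tuples."""
    V = [0.0] * num_states
    while True:
        delta = 0.0
        for s in range(num_states):
            # Bellman optimality backup: best expected one-step return from s
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(num_actions)]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:  # proper convergence criterion
            break
    # Extract the greedy policy from the converged values
    policy = [max(range(num_actions),
                  key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
              for s in range(num_states)]
    return V, policy
```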
What is a policy?
A policy is a mapping from states to actions for an MDP.
What is the meaning of the Bellman equations? What do they do?
“(the Bellman equation) expresses a relationship between the value of a state and the values of its successor states.” S&B 2020
OR
An update rule for the value of a state based on the expected immediate reward and the expected value of the next state.
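For reference, the state-value Bellman equation for a policy π in S&B's notation:

```latex
v_\pi(s) \;=\; \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_\pi(s') \,\bigr]
```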
Is it possible that value iteration never converges on your formulation of MDP with a proper convergence criterion?
No. “(Policy and Value Iteration)… All of these algorithms converge to an optimal policy for discounted finite MDPs.” S&B 2020.
Does policy iteration or value iteration converge faster? Give an explanation of your answer.
Both algorithms are guaranteed to converge to an optimal policy. Policy iteration typically converges in fewer iterations, since each iteration fully evaluates the current policy before improving it; the trade-off is that each policy-iteration step is more expensive than a single value-iteration sweep.
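For comparison, a rough policy-iteration sketch under the same assumed tabular P[s][a] layout as the value-iteration sketch above; each outer iteration fully evaluates the current policy before improving it, which is why fewer (though costlier) iterations tend to be needed:

```python
def policy_iteration(P, num_states, num_actions, gamma=0.9, theta=1e-8):
    policy = [0] * num_states          # start from an arbitrary policy
    V = [0.0] * num_states
    while True:
        # Policy evaluation: iteratively solve for V under the current policy
        while True:
            delta = 0.0
            for s in range(num_states):
                v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to V
        stable = True
        for s in range(num_states):
            best_a = max(range(num_actions),
                         key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:
            return V, policy
```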
Explain the two types of tasks modeled by MDPs; unify the notation used to describe these two tasks, and describe the type of state that allows you to unify them. (Hint: you may find section 3.4 of S&B helpful.)
Episodic - sequences of states that come to an end and then restart; for example, a card game reaches the end of a hand and then restarts with a shuffle/deal, so each episode is finite.
Continuing - sequences of states with no definable end, like predicting the weather: the days keep coming with new weather each time, and you never reach a last day and restart the episode.
You can unify them by making the last state of an episodic task an absorbing (sink) state that transitions to itself with probability 1.0 and reward 0.
We use the discount factor gamma to keep the return of a continuing task finite, but that does not make it episodic, since you never reset to the original start state.
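Section 3.4 of S&B unifies the two cases with a single return, using the absorbing terminal state plus the convention that either the horizon T is finite or gamma < 1 (but not both T = infinity and gamma = 1):

```latex
G_t \;=\; \sum_{k=t+1}^{T} \gamma^{\,k-t-1} R_k
```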
True or False: The Markov property means RL agents are amnesiacs and forget everything up until the current state. Why?
True. The Markov property states that an agent only needs the data of the current state. However, past experiences/observations can be folded into the definition of the current state.
True or False: In RL, recent moves influence outcomes more than moves further in the past. Why?
False. Under the Markov property the current state summarizes everything relevant, so how much a past move matters depends on the problem's dynamics and rewards, not simply on how recent the move was.
True or False: In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s reward (terminal and non-terminal) the optimal policy will not change. Why?
True. Adding the same constant to every reward does not change which policy is optimal here; a uniform shift is not considered reward shaping.
True or False: An MDP given a fixed policy is a Markov chain with rewards. Why?
True. Fixing the policy removes the action choice: the transition probabilities induced by the policy define a Markov chain over states, and the rewards ride along with those transitions (a Markov reward process).
True or False: It is not always possible to convert a finite horizon MDP to an infinite horizon MDP. Why?
False. We can always make the final state an absorbing state that transitions back to itself with probability 1.0 and reward 0; the task can then continue forever.
True or False: If we know the optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix. Why?
False. V*(s) = max_a Q*(s,a): if we already have the Q-values for every possible action in a state, V*(s) is simply the largest number in that set. No transition matrix needed.
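A concrete illustration, assuming (purely for this example) a tabular Q stored as a NumPy array indexed by [state, action]:

```python
import numpy as np

Q = np.array([[1.0, 2.5, 0.3],    # Q-values for state 0
              [0.0, -1.0, 4.2]])  # Q-values for state 1
V = Q.max(axis=1)                 # V*(s) = max_a Q*(s, a); no transition model needed
greedy_policy = Q.argmax(axis=1)  # the maximizing action in each state
```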
True or False: The optimal policy for any MDP can be found in polynomial time. Why?
True - “A DP (Dynamic Programming) method is guaranteed to find an optimal policy in polynomial time even though the total number of (deterministic) policies is k^n” S&B 2020
True or False: A policy that is greedy – with respect to the optimal value function – is not necessarily an optimal policy. Why?
False. A policy that is greedy with respect to the optimal value function is an optimal policy by definition: acting greedily on V* attains V*.
What is TD? In which algorithms is TD used?
Temporal Difference. Classical TD includes TD(0) (one-step), TD(1) (Monte-Carlo), and TD(lambda) (a weighted mix of look-aheads). The one-step TD update is the basis of bootstrapping RL algorithms such as Q-learning, SARSA, DQN, and PPO.
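Minimal sketches of the one-step updates, assuming simple tabular V and Q containers; variable names are illustrative:

```python
# One-step TD(0) update for the prediction problem:
# V[s] <- V[s] + alpha * (r + gamma * V[s_next] - V[s])
def td0_update(V, s, r, s_next, alpha, gamma):
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return V

# The same bootstrapped target drives Q-learning's control update:
# Q[s][a] <- Q[s][a] + alpha * (r + gamma * max_a' Q[s_next][a'] - Q[s][a])
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    td_error = r + gamma * max(Q[s_next]) - Q[s][a]
    Q[s][a] += alpha * td_error
    return Q
```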
What are model-free methods?
Methods that do not require a model of the environment, i.e., its transition probabilities and reward function; they learn directly from sampled experience.
“Methods for solving reinforcement learning problems that use models and planning are called model-based methods, as opposed to simpler model-free methods that are explicitly trial-and-error learners—viewed as almost the opposite of planning” - S&B 2020
What is the prediction problem?
Estimating/Predicting the value of each state.
As opposed to the Control Problem, in which we also improve the policy (decide which actions to take) rather than only evaluate it.
What is a n-step estimator?
An estimator of a state's value that uses the next n observed rewards and then bootstraps from the estimated value of the state reached n steps into the future.
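In S&B notation, the n-step return uses the next n rewards and then bootstraps from the value estimate of the state reached after n steps:

```latex
G_{t:t+n} \;=\; R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{\,n-1} R_{t+n} + \gamma^{\,n}\, V(S_{t+n})
```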
What is a Monte-Carlo return?
The return of a full episode: the (discounted) sum of all rewards from the current state until the episode ends, with no bootstrapping.
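A tiny sketch of computing a (discounted) Monte-Carlo return from the list of rewards seen in one episode; the function name is just for illustration:

```python
def monte_carlo_return(rewards, gamma=1.0):
    """Discounted sum of all rewards from the start of an episode to its end."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

# Example: rewards observed over a 4-step episode
print(monte_carlo_return([0, 0, 1, 5], gamma=0.9))
```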
What is the value of the 5-step estimator of the terminal state? What about other n-step estimators of the terminal state?
0, because there are no future rewards for being in the terminal state; the same holds for any other n-step estimator of the terminal state.