Final Flashcards
It is not always possible to convert a finite horizon MDP to an infinite horizon MDP.
False. You can always convert a terminal state into an absorbing state with a transition to itself and reward 0.
In RL, recent moves influence outcomes more than moves further in the past.
False. While that can be the case (and often works well), some algorithms, like TD(1), give equal weight to each move, and it is also possible to give more weight to earlier moves. It depends on the problem you’re trying to solve.
False. You can lose a chess game on your first move – and apparently someone did.
An MDP given a fixed policy is a Markov chain with rewards.
True, since a fixed policy means the agent doesn’t have the option to choose an action in each state. The agent transitions from state to state according to this fixed policy (without choosing any actions), which is a Markov chain.
True. A fixed policy means a fixed action is taken given a state. In this case, this MDP totally depends on the current state, and it is now reduced to a Markov chain with rewards.
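A small sketch of this collapse, assuming a made-up 2-state, 2-action MDP: folding a fixed policy into the transition tensor leaves a plain Markov chain matrix (plus the rewards).

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[s, a, s'] is the transition tensor.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0 under actions 0, 1
    [[0.5, 0.5], [0.0, 1.0]],   # transitions from state 1 under actions 0, 1
])
R = np.array([1.0, 0.0])        # reward per state (carried over unchanged)

# A fixed (stochastic) policy pi(a|s): rows are states, columns are actions.
pi = np.array([
    [1.0, 0.0],   # always take action 0 in state 0
    [0.3, 0.7],   # mix actions in state 1
])

# Folding the policy into the dynamics yields a Markov chain with rewards:
# P_pi[s, s'] = sum_a pi(a|s) * P[s, a, s']
P_pi = np.einsum('sa,sax->sx', pi, P)
print(P_pi)   # a plain 2x2 stochastic matrix: no actions left
```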
If we know the optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix.
False, you don’t need the transition function.
False. For an optimal policy, V*(s) = max_a Q*(s, a). No need to know the environment’s transition matrix.
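A quick illustration with an assumed toy Q-table: recovering V* is just a max over the action axis, with no model in sight.

```python
import numpy as np

# Hypothetical optimal Q-table: rows are states, columns are actions.
Q = np.array([
    [1.0, 3.0],
    [0.5, 0.2],
])

# V*(s) = max_a Q*(s, a) -- no transition model needed.
V = Q.max(axis=1)
print(V)   # [3.0, 0.5]
```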
In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s reward (terminal and non-terminal) the optimal policy will not change.
True, assuming an infinite horizon the optimal policy will be unchanged
Markov means RL agents are amnesiacs and forget everything up until the current state.
True here: the current state is all the agent should know.
Now if you want to discuss what a “current state” is… Well that can get more complicated.
True. In an MDP the present is a sufficient statistic of the past.
Only the current state matters and what past states were is irrelevant to the agent.
A policy that is greedy–with respect to the optimal value function–is not necessarily an optimal policy.
False, as noted: taking the greedy action with respect to the optimal value function is optimal by definition.
In TD learning, the sum of the learning rates used must converge for the value function to converge.
False, the sum of learning rates must diverge for the value function to converge.
Also, the sum of learning rates squared must converge for the value function to converge.
Monte Carlo is an unbiased estimator of the value function compared to TD methods. Therefore, it is the preferred algorithm when doing RL with episodic tasks.
False. (second sentence)
True –> Monte Carlo is an unbiased estimator of the value function compared to TD methods –> TD methods start with an initial estimate of the Q values and tend to be biased toward those initial estimates.
False –> Most episodic tasks are too long and the computational advantages of TD updates favor TD methods over MC methods.
The value of the returned policy is the only metric we care about when evaluating a learner.
False. We can evaluate a learner on other metrics including computational complexity and how much data is required to effectively learn.
T/F POMDPs allow us to strike a balance between actions to gain reward and actions to gain information
TRUE. This is all folded into one model (no special machinery is needed to do this).
Partially Observable Markov Decision Processes (POMDPs) have succeeded in planning domains that require balancing actions that increase an agent’s knowledge and actions that increase an agent’s reward.
T/F DEC-POMDPS allow us to wrap coordinating and communicating into choosing actions to maximize utility
True.
The decentralized partially observable Markov decision process (Dec-POMDP) is a model for coordination and decision-making among multiple agents.
It is a probabilistic model that can consider uncertainty in outcomes, sensors and communication (i.e., costly, delayed, noisy or nonexistent communication). It is a generalization of a Markov decision process (MDP) and a partially observable Markov decision process (POMDP) to multiple decentralized agents.
DEC-POMDP stands for
Decentralized Partially Observable Markov Decision Process
The primary difference between POMDPs and DEC-POMDPS
In DEC-POMDPS, actions are taken simultaneously by a finite set of agents (not just 1)
DEC-POMDPS vs POSG
DEC-POMDP all share reward function (they are working together)
—> Note: POSG = Partially Observable Stochastic Game, the variant in which each agent has its own reward function rather than a shared one.
T/F DEC-POMDPs are represented by Ri (diff reward for each agent)
FALSE, there is a shared Reward (all agents working together). If reward wasn’t shared the model would be POSG
Properties of DEC-POMDP
- Elements of game theory and POMDPS
- NEXP-complete (for finite horizon) –> NEXP-complete when the input is represented compactly (e.g., as a circuit)
Inverse RL
Input behavior and environment and OUTPUT reward
MLIRL (Maximum Likelihood Inverse RL)
- Guess R
- Compute policy
- Measure probability of data given policy
- Compute gradient R to find how to change reward to make it fit the data better
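The four steps above can be sketched on a toy 3-state chain. This is a hedged illustration, not the lecture's exact method: the softmax policy from soft value iteration, the numeric gradient, and all constants (`beta`, learning rate, chain dynamics) are assumptions made for the sketch.

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9

def step(s, a):                       # deterministic chain: 0 = left, 1 = right
    return max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)

def soft_policy(R, iters=50, beta=5.0):
    """2. Compute a policy: value iteration under reward guess R, then softmax."""
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = np.array([[R[s] + gamma * V[step(s, a)]
                       for a in range(n_actions)] for s in range(n_states)])
        V = Q.max(axis=1)
    expQ = np.exp(beta * (Q - Q.max(axis=1, keepdims=True)))
    return expQ / expQ.sum(axis=1, keepdims=True)

def log_likelihood(R, data):
    """3. Measure probability of the demonstration data given the policy."""
    pi = soft_policy(R)
    return sum(np.log(pi[s, a]) for s, a in data)

data = [(0, 1), (1, 1), (2, 1)]       # demonstrations: expert always moves right
R = np.zeros(n_states)                # 1. guess R
for _ in range(200):                  # 4. change R to fit the data better
    grad = np.zeros(n_states)
    for i in range(n_states):         # numeric gradient of the log-likelihood
        dR = np.zeros(n_states); dR[i] = 1e-4
        grad[i] = (log_likelihood(R + dR, data) - log_likelihood(R - dR, data)) / 2e-4
    R += 0.1 * grad
print(R)   # learned reward should favor the rightmost state
```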
In reward shaping, if the human believes X, Y, Z with 2/3, 1/6, 1/6 and the algorithm believes 1/10, 1/10, 8/10. What action should they choose?
Choose action Z because pairwise product is highest.
argmax_a p(a|policy1) * p(a|policy2)
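A quick check of the arithmetic for the example above (action names assumed):

```python
import numpy as np

# Combine the two belief distributions by pairwise product, then take the argmax.
human = np.array([2/3, 1/6, 1/6])     # beliefs over actions X, Y, Z
algo  = np.array([1/10, 1/10, 8/10])
actions = ['X', 'Y', 'Z']

product = human * algo                # [0.0667, 0.0167, 0.1333]
print(actions[np.argmax(product)])    # Z
```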
Drama Management
There is a 3rd agent, the “Author,” that wants to build an agent that interferes with the player. The Author created the game (e.g., Pac-Man), the agent is the game itself, and the player is the player.
TTD-MDPs vs MDPs
- rather than states: have trajectories (sequence so far)
- rather than rewards: have target distribution p(T).
TTD-MDP stands for Targeted Trajectory Distribution Markov Decision Process
Value Iteration Algorithm
Start w/ arbitrary utilities
Update utilities based on reward + neighbors (discounted future reward)
Repeat until convergence
Note that value iteration is obtained simply by turning the Bellman optimality equation into an update rule. Also note how the value iteration backup is identical to the policy evaluation backup (4.5) except that it requires the maximum to be taken over all actions.
Value iteration effectively combines, in each of its sweeps, one sweep of policy evaluation and one sweep of policy improvement.
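The three steps above can be sketched on a hypothetical 2-state, 2-action MDP (the transition probabilities and rewards are made up for illustration):

```python
import numpy as np

# P[s, a, s'] transition probabilities, R[s] rewards, gamma discount.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R = np.array([0.0, 1.0])
gamma = 0.9

U = np.zeros(2)                        # 1. start with arbitrary utilities
for _ in range(1000):                  # 3. repeat until convergence
    # 2. Bellman update: reward + discounted expected utility of neighbors
    U_new = R + gamma * np.max(np.einsum('sax,x->sa', P, U), axis=1)
    if np.max(np.abs(U_new - U)) < 1e-10:
        break
    U = U_new
print(U)   # state 1 (self-loop with reward 1) ends up near 1/(1-gamma) = 10
```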
Supervised Learning
y = f(x). Function approximation: find f that maps x to y.
Unsupervised Learning
f(x). Find a compact description of the data (e.g., clusters).
Reinforcement Learning
y = f(x), but we are given x and z. Still trying to find f to generate y. The reward r plays the role of z.
MDP stands for
Markov Decision Processes.
MDPs are made up of
States, actions, transitions (the model), and rewards; solving the MDP produces a policy.
Markovian Property
Only the present matters AND things are stationary (rules/the world doesn’t change over time)
Delayed Rewards
In the chess example, you may make a bad move early on that you can never recover from. That bad move needs to be reflected in the (delayed) reward.
Temporal Credit Assignment Problem
The problem of determining the actions that lead to a certain outcome in sequence.
For example, in football, at each second, each football player takes an action. In this context, an action can e.g. be “pass the ball”, “dribble”, “run” or “shoot the ball”. At the end of the football match, the outcome can either be a victory, a loss or a tie. After the match, the coach talks to the players and analyses the match and the performance of each player. He discusses the contribution of each player to the result of the match. The problem of determining the contribution of each player to the result of the match is the (temporal) credit assignment problem.
How to change policies to account for finite horizons
Policy(s, t): the policy becomes a function of state AND time.
Utility of Sequences (stationary preferences)
if you prefer one sequence of states over another today, you prefer the same sequence tomorrow
How to calculate infinite horizons without infinity?
Use discounted future rewards (use gamma)
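For a constant reward r, the discounted sum converges to r / (1 - gamma), so the infinite horizon has a finite value; a quick numeric check:

```python
# sum_{t=0}^inf gamma^t * r = r / (1 - gamma) when 0 <= gamma < 1.
gamma, r = 0.9, 1.0
partial = sum(gamma**t * r for t in range(1000))   # truncated sum
closed_form = r / (1 - gamma)
print(partial, closed_form)   # both approximately 10.0
```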
Reward vs Utility
Reward is immediate payoff of state. Utility is long term payoff of action, it takes into account delayed reward
Policy Iteration
start with initial policy (guess)
evaluate: given policy, calculate utility
improve: Policy at t+1 is the action that maximizes the utility
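The evaluate/improve loop can be sketched on a toy 2-state, 2-action MDP (numbers assumed for illustration); since the reward here depends only on the state, the greedy improvement can compare expected utilities directly.

```python
import numpy as np

P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R = np.array([0.0, 1.0])
gamma = 0.9
policy = np.zeros(2, dtype=int)        # initial guess: always action 0

while True:
    # Evaluate: solve U = R + gamma * P_pi U exactly as a linear system.
    P_pi = P[np.arange(2), policy]     # 2x2 chain matrix under current policy
    U = np.linalg.solve(np.eye(2) - gamma * P_pi, R)
    # Improve: greedy policy with respect to the evaluated utilities.
    new_policy = np.argmax(np.einsum('sax,x->sa', P, U), axis=1)
    if np.array_equal(new_policy, policy):
        break                          # converged: policy is its own improvement
    policy = new_policy
print(policy, U)
```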
T/F Policy Iteration won’t converge
False
On Policy vs Off-Policy
off policy estimates the q values (state-action value) directly from the Q function regardless of the policy being followed by the agent. (Q-learning)
on-policy updates its Q-values using the Q-value of the next state s′ and the current policy’s next action a′. It estimates the return for state-action pairs assuming the current policy continues to be followed. (SARSA)
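The two update rules can be written side by side; the toy Q-table and the values of alpha and gamma are assumed for illustration.

```python
# Q-learning (off-policy): bootstraps from the greedy next action, max_a Q(s', a).
def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.9):
    Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])

# SARSA (on-policy): bootstraps from the action a2 the current policy actually takes.
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a])

Q1 = {0: [0.0, 2.0], 1: [5.0, 1.0]}   # toy Q-table: Q[state][action]
Q2 = {0: [0.0, 2.0], 1: [5.0, 1.0]}

q_learning_update(Q1, 0, 0, r=1.0, s2=1)         # target uses max(Q[1]) = 5.0
sarsa_update(Q2, 0, 0, r=1.0, s2=1, a2=1)        # target uses Q[1][1] = 1.0
print(Q1[0][0], Q2[0][0])   # 0.55 vs 0.19: same experience, different targets
```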
T/F TD update rule always converges with any learning rate
False. It will converge if the sum of the learning rates over time diverges (goes to infinity) and the sum of the squared learning rates converges (stays finite).
T/F TD(1) is the same as outcome-based updates (if no repeated states)
True, and with even more learning, because updates don’t have to wait for the episode to be over.
Maximum Likelihood Estimate vs outcome-based estimate (TD(1))
Maximum likelihood uses all of the examples, but TD(1) uses just individual runs, so if a rare event happens in a run, the TD(1) estimate can be skewed by it (high variance). (This leads us to TD(lambda).)
T/F TD(0) is the same as maximum likelihood estimate
TRUE, if we run over the data over and over again (batch TD(0) converges to the maximum likelihood estimate).
T/F TD(lambda) is weighted combination of k step estimators
True
T/F TD(1) typically has less error than TD(0)
False. TD(1) typically has more error than TD(0)
T/F TD(0) has the least amount of error usually
False. An intermediate TD(lambda) performs best; lambda values of roughly 0.3 to 0.7 usually work best.
Temporal Difference is
Difference between reward (value estimates) as we go from one step to another
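A one-step illustration with made-up values: the TD error delta = r + gamma * V(s') - V(s) is exactly that step-to-step difference, and the TD(0) update moves the value estimate toward the target.

```python
# Two states A -> B with assumed values, reward, and step size.
V = {'A': 0.0, 'B': 1.0}
gamma, alpha = 0.9, 0.5
r = 0.5                               # reward observed going A -> B

delta = r + gamma * V['B'] - V['A']   # TD error: 0.5 + 0.9 - 0.0 = 1.4
V['A'] += alpha * delta               # TD(0) update toward the target
print(V['A'])   # 0.7
```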
T/F reward must be scalar
TRUE
Scalar value means single value.
T/F environment is visible to the agent
False, usually