Final Flashcards
(120 cards)
It is not always possible to convert a finite horizon MDP to an infinite horizon MDP.
False. You can always convert a terminal state into an absorbing state with a transition to itself and reward 0
In RL, recent moves influence outcomes more than moves further in the past.
False. While that can be the case (and often works well), some algorithms, like TD(1), give equal weight to every move, and it is also possible to weight earlier moves more heavily. It depends on the problem you're trying to solve.
False. You can lose a chess game on your first move – and apparently someone did.
An MDP given a fixed policy is a Markov chain with rewards.
True, since a fixed policy means the agent doesn't get to choose an action in each state. The agent transitions from state to state according to this fixed policy (without choosing any actions), which is exactly a Markov chain.
True. A fixed policy means a fixed action is taken given a state. The dynamics then depend only on the current state, so the MDP reduces to a Markov chain with rewards.
If we know the optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix.
False, you don’t need the transition function.
False. V*(s) = max over a of Q*(s, a), so the optimal V values follow directly from the optimal Q values; no knowledge of the environment's transition matrix is needed.
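As a quick illustration (a minimal sketch of the point above, not part of the original cards; the Q table is made up), the optimal V values and the greedy policy come from a max over actions, with no transition model involved:

```python
import numpy as np

# Hypothetical 3-state, 2-action table of optimal Q-values (illustrative numbers only).
Q = np.array([[1.0, 2.5],
              [0.3, 0.1],
              [4.0, 3.9]])

V = Q.max(axis=1)      # V*(s) = max_a Q*(s, a); no transition matrix needed
pi = Q.argmax(axis=1)  # the greedy (optimal) policy falls out the same way
print(V, pi)           # [2.5 0.3 4. ] [1 0 0]
```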
In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s reward (terminal and non-terminal) the optimal policy will not change.
True, assuming an infinite horizon: adding a constant c to every reward at every step shifts every state's value by c/(1-γ), so the relative ordering of actions, and hence the optimal policy, is unchanged.
Markov means RL agents are amnesiacs and forget everything up until the current state.
True: the current state is all the agent needs to know.
Now if you want to discuss what a “current state” is… Well that can get more complicated.
True. In an MDP the present state is a sufficient statistic of the past.
Only the current state matters; how the agent got there is irrelevant.
A policy that is greedy with respect to the optimal value function is not necessarily an optimal policy.
False. Taking the greedy action with respect to the optimal value function is optimal by definition.
In TD learning, the sum of the learning rates used must converge for the value function to converge.
False. The sum of the learning rates must diverge for the value function to converge, while the sum of the squared learning rates must converge (see the conditions written out below).
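Written out, these are the standard stochastic-approximation (Robbins-Monro) conditions on the learning rates; the example choice of α_t is mine, for illustration:

```latex
% Conditions on the TD learning rates \alpha_t for the value estimates to converge:
\[
  \sum_{t=1}^{\infty} \alpha_t = \infty,
  \qquad
  \sum_{t=1}^{\infty} \alpha_t^{2} < \infty
\]
% e.g. \alpha_t = 1/t satisfies both, while a constant \alpha_t = c > 0 violates the second.
```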
Monte Carlo is an unbiased estimator of the value function compared to TD methods. Therefore, it is the preferred algorithm when doing RL with episodic tasks.
False (the second sentence is the false part).
The first sentence is true: Monte Carlo is an unbiased estimator of the value function, whereas TD methods start from an initial estimate of the Q values and are biased toward it.
The second sentence is false: most episodic tasks are long enough that the computational advantages of TD updates favor TD methods over MC methods.
The value of the returned policy is the only metric we care about when evaluating a learner.
False. We can evaluate a learner on other metrics including computational complexity and how much data is required to effectively learn.
T/F: POMDPs allow us to strike a balance between actions taken to gain reward and actions taken to gain information.
TRUE. This is all folded into one model (no special machinery is needed to do this).
Partially Observable Markov Decision Processes (POMDPs) have succeeded in planning domains that require balancing actions that increase an agent’s knowledge and actions that increase an agent’s reward.
T/F: DEC-POMDPs allow us to wrap coordinating and communicating into choosing actions to maximize utility.
True.
The decentralized partially observable Markov decision process (Dec-POMDP) is a model for coordination and decision-making among multiple agents.
It is a probabilistic model that can consider uncertainty in outcomes, sensors and communication (i.e., costly, delayed, noisy or nonexistent communication). It is a generalization of a Markov decision process (MDP) and a partially observable Markov decision process (POMDP) to consider multiple decentralized agents.
DEC-POMDP stands for
Decentralized Partially Observable Markov Decision Process
The primary difference between POMDPs and DEC-POMDPs
In DEC-POMDPs, actions are taken simultaneously by a finite set of agents (not just one).
DEC-POMDPs vs. POSGs
In a DEC-POMDP, all agents share one reward function (they are working together).
-> Note: POSG = Partially Observable Stochastic Game, the same setup except that each agent has its own reward function.
T/F: DEC-POMDPs are represented by Ri (a different reward for each agent)
FALSE. There is a shared reward (all agents working together); if the reward weren't shared, the model would be a POSG.
Properties of DEC-POMDPs
- Elements of both game theory and POMDPs
- NEXP-complete (for the finite horizon case); NEXP-complete when the input is represented as a brief circuit
Inverse RL
Input: behavior and the environment. Output: a reward function.
MLIRL (Maximum Likelihood Inverse RL)
- Guess a reward function R
- Compute the policy for that R
- Measure the probability of the demonstration data given that policy
- Compute the gradient with respect to R to find how to change the reward so it fits the data better, and repeat (see the sketch below)
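A rough sketch of that loop (illustrative code, not the lecture's implementation; it assumes a small tabular MDP with known transitions P[s, a, s'], a state-based reward vector, a Boltzmann policy derived from soft Q-values, and a finite-difference gradient in place of the analytic one):

```python
import numpy as np

def soft_q_values(R, P, gamma=0.95, iters=200):
    """Soft (log-sum-exp) Bellman backups for a state-based reward vector R."""
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        V = np.log(np.exp(Q).sum(axis=1))   # soft max over actions
        Q = R[:, None] + gamma * (P @ V)    # Q(s,a) = R(s) + gamma * E[V(s')]
    return Q

def log_likelihood(R, P, demos, beta=2.0):
    """Log-probability of demonstrated (state, action) pairs under a Boltzmann policy."""
    logits = beta * soft_q_values(R, P)
    log_pi = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return sum(log_pi[s, a] for s, a in demos)

def mlirl(P, demos, steps=50, lr=0.1, eps=1e-3):
    """Guess R, derive a policy, score the data, and follow the gradient of R."""
    R = np.zeros(P.shape[0])                # 1. guess a reward
    for _ in range(steps):
        base = log_likelihood(R, P, demos)  # 2-3. policy + probability of the data
        grad = np.zeros_like(R)
        for s in range(len(R)):             # 4. finite-difference gradient w.r.t. R
            R_plus = R.copy()
            R_plus[s] += eps
            grad[s] = (log_likelihood(R_plus, P, demos) - base) / eps
        R += lr * grad                      # nudge R to better explain the data
    return R
```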
In policy shaping, if the human believes actions X, Y, Z with probabilities 2/3, 1/6, 1/6 and the algorithm believes 1/10, 1/10, 8/10, which action should they choose?
Choose action Z because its pairwise product is highest: X: 2/3 × 1/10 ≈ 0.067, Y: 1/6 × 1/10 ≈ 0.017, Z: 1/6 × 8/10 ≈ 0.133.
In general: argmax_a p(a|policy1) * p(a|policy2)
Drama Management
There is a third agent, the "author," who wants to build an agent that intervenes in the player's experience. Example: the author created Pac-Man, the agent is the game itself, and the player is the player.
TTD-MDPs vs MDPs
- rather than states: trajectories (the sequence so far)
- rather than rewards: a target distribution over trajectories, p(T)
TTD-MDP = Targeted Trajectory Distribution Markov Decision Process
Value Iteration Algorithm
Start w/ arbitrary utilities
Update utilities based on reward + neighbors (discounted future reward)
Repeat until convergence (see the sketch below)
Note that value iteration is obtained simply by turning the Bellman optimality equation into an update rule. Also note how the value iteration backup is identical to the policy evaluation backup (4.5) except that it requires the maximum to be taken over all actions.
Value iteration effectively combines, in each of its sweeps, one sweep of policy evaluation and one sweep of policy improvement.
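A minimal sketch of that loop (illustrative code, not from the cards; it assumes a tabular MDP with known transitions P[s, a, s'] and state rewards R[s]):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Start with arbitrary utilities and back them up until they stop changing."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)                   # arbitrary initial utilities
    while True:
        # Bellman optimality backup: reward + discounted expected future utility,
        # maximized over actions (the Bellman optimality equation as an update rule).
        Q = R[:, None] + gamma * (P @ V)     # shape (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:  # repeat until convergence
            return V_new, Q.argmax(axis=1)   # utilities and the greedy policy
        V = V_new
```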
Supervised Learning
y = f(x). Function approximation: find the f that maps x to y.