Exam Study Flashcards
A Markov Process is a process in which state transitions do not depend on actions
True. The Markov property means you don't have to condition on anything past the most recent state. A Markov Decision Process is a set of Markov-property-compliant states together with actions, rewards, and the values derived from them.
Decaying Reward encourages the agent to end the game quickly instead of running around and gathering more rewards
True. Because future reward is discounted, the same reward is worth less the later it arrives, so the agent maximizes total discounted reward by collecting reward (and ending the game) sooner rather than later.
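The standard discounted-return definition makes this concrete (γ is the discount factor; the notation is the usual textbook one, not quoted from these cards):

```latex
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k+1}, \qquad 0 \le \gamma < 1
```

With γ < 1, a reward collected k steps later is only worth γ^k as much, so finishing sooner yields a higher discounted return.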
R(s) and R(s,a) are equivalent
True. R(s), R(s,a), and R(s,a,s') are interchangeable formulations of the reward; it just happens that one is easier to think about than the others in certain situations.
Reinforcement Learning is harder to compute than a simple MDP
True. With a known MDP you can just solve the Bellman equations, but Reinforcement Learning requires that you make observations through interaction and then summarize those observations as value estimates.
An optimal policy is the best possible sequence of actions for an MDP
True, with a single caveat. The optimal policy maximizes reward over an entire episode by taking, in each state, the argmax over actions of the expected reward plus the resulting value. But MDPs are memoryless, so a policy is a mapping from states to actions; there is no concept of a "sequence" of actions.
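As an equation, the optimal policy picks the best action per state rather than memorizing a sequence (standard MDP notation, assumed rather than taken from the cards):

```latex
\pi^*(s) = \arg\max_{a} \Big[ R(s,a) + \gamma \sum_{s'} T(s,a,s')\, V^*(s') \Big]
```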
Temporal Difference Learning is the difference in reward you see on subsequent time steps.
False. Temporal Difference Learning is based on the difference in value estimates on successive time steps.
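The TD(0) prediction update makes the definition concrete (α is the learning rate; standard notation, assumed):

```latex
\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t), \qquad V(s_t) \leftarrow V(s_t) + \alpha\, \delta_t
```

Here δ_t is exactly the difference between the value implied by the newest observation and the previous estimate.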
RL falls generally into 3 different categories: Model-Based, Value-Based, and Policy-Based
True. Model-Based learning essentially uses the Bellman equation to solve the problem once a model is available; Value-Based learning is Temporal Difference Learning; Policy-Based learning is similar to Value-Based, but it solves in a finite amount of time with a certain amount of confidence (with a greedy policy this is guaranteed).
TD Learning is defined by Incremental Estimates that are Outcome Based
True. TD Learning thinks of learning in terms of "episodes" and updates its value estimates incrementally from observed outcomes, rather than relying on a predefined model of the transitions.
For learning rate to guarantee convergence, the sum of the learning rate must be infinite, and the sum of the learning rate squared must be finite.
True. These are the standard stochastic-approximation conditions on the learning rate; together with the contraction property of the update operator, they guarantee convergence.
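Written out (standard notation, assumed):

```latex
\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty
```

For example, α_t = 1/t satisfies both conditions, while a constant learning rate satisfies the first but not the second.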
All of the TD learning methods have setbacks: TD(1) is inefficient because it requires too much data and has high variance; TD(0) gives the maximum-likelihood estimate but is hard to compute for long episodes.
True. This is why we use TD(λ), which has many of the benefits of TD(0) but is much more performant. Empirically, λ values between 0.3 and 0.7 seem to perform best.
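A minimal sketch of the TD(λ) update with accumulating eligibility traces, for a tabular prediction setting. The environment interface (env.reset, env.step following a fixed policy) and the hyperparameter values are illustrative assumptions, not something specified in these cards:

```python
import numpy as np

def td_lambda(env, num_states, episodes=500, alpha=0.1, gamma=0.99, lam=0.5):
    """Tabular TD(lambda) value prediction with accumulating eligibility traces."""
    V = np.zeros(num_states)
    for _ in range(episodes):
        e = np.zeros(num_states)            # eligibility traces, reset each episode
        s = env.reset()                     # assumed: returns an integer state id
        done = False
        while not done:
            s_next, r, done = env.step()    # assumed: env acts under a fixed policy
            delta = r + gamma * V[s_next] * (not done) - V[s]   # TD error
            e[s] += 1.0                     # bump the trace for the visited state
            V += alpha * delta * e          # update every state in proportion to its trace
            e *= gamma * lam                # decay all traces
            s = s_next
    return V
```

Setting lam=0 recovers TD(0) (only the current state is updated), and lam=1 recovers the outcome-based TD(1) behavior.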
To control learning, you simply have the learner choose actions in addition to learning values.
True. States are experienced as observations during learning, so by choosing actions the learner influences which states it visits and learns about.
Q-Learning converges
True. The Q-learning update is based on a contraction mapping (the Bellman operator), so as long as the learning rates satisfy the conditions above (their sum is infinite and the sum of their squares is finite) and every state-action pair is visited infinitely often, it always converges to the optimal Q.
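A minimal sketch of tabular Q-learning with an epsilon-greedy behavior policy. The env interface and the hyperparameter values are illustrative assumptions:

```python
import numpy as np

def q_learning(env, num_states, num_actions, episodes=1000,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    Q = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s = env.reset()                              # assumed: integer state id
        done = False
        while not done:
            if np.random.rand() < epsilon:           # explore
                a = np.random.randint(num_actions)
            else:                                    # exploit the current estimate
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)            # assumed env interface
            # off-policy target: bootstrap from the best next action
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```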
As long as the update operators for Q-learning or Value-iterations are non-expansions, then they will converge.
True. There are expansions that will converge, but only non-expansions are guaranteed to converge independent of their starting values.
A convex combination will converge.
False. It must be a fixed convex combination to converge. If the weights can change, as in Boltzmann exploration, then it is not guaranteed to converge.
In greedy policies, if the difference between the true value and the current value estimate is less than some epsilon, then the value of the greedy policy is also close to optimal.
True. If the value estimate is within ε of V*, the policy that is greedy with respect to that estimate has value within 2εγ/(1-γ) of optimal; ε here is an accuracy bound, not an exploration parameter.
It serves as a good stopping criterion for value iteration: once successive value updates change by less than epsilon, we can be pretty confident that the greedy policy is optimal (or very close to it).
True
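A minimal value-iteration sketch with that stopping criterion, assuming the MDP is given as numpy arrays T[s, a, s'] (transition probabilities) and R[s, a] (rewards); these array names and shapes are illustrative assumptions:

```python
import numpy as np

def value_iteration(T, R, gamma=0.99, epsilon=1e-6):
    """Value iteration on a known MDP, stopping when the largest update falls below epsilon."""
    num_states, num_actions, _ = T.shape
    V = np.zeros(num_states)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_s' T[s, a, s'] * V[s']
        Q = R + gamma * (T @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < epsilon:   # small change => greedy policy is near-optimal
            return V_new, Q.argmax(axis=1)        # value estimate and greedy policy
        V = V_new
```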
For a set of linear equations, the solution can be found in polynomial time.
True
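This is why evaluating a fixed policy is cheap: V under a fixed policy solves a linear system. A sketch, assuming numpy arrays P[s, s'] (transition matrix under the policy) and r[s] (expected reward under the policy); the names and shapes are assumptions:

```python
import numpy as np

def evaluate_policy(P, r, gamma=0.99):
    """Solve V = r + gamma * P @ V exactly, i.e. (I - gamma * P) V = r."""
    num_states = P.shape[0]
    return np.linalg.solve(np.eye(num_states) - gamma * P, r)
```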
For Policy Iteration, Convergence is exact and complete in finite time.
True (for standard policy iteration, which uses greedy policy improvement). There are only finitely many policies, and each iteration either strictly improves the policy or leaves it unchanged once it is optimal, so convergence to the optimal policy is exact and complete in finite time.
The optimal policy will dominate all other policies in finite time.
True for the sequence of greedy policies produced by policy iteration. Policies dominate on a state-by-state basis, so we don't get stuck in local optima, and the process reaches the dominating (optimal) policy in a finite number of iterations.
We get strict value improvement anytime we don’t have a policy that is already optimal.
True. If the policy is not yet optimal, a step of policy iteration strictly improves its value somewhere; if the policy is already optimal, the step leaves the value unchanged. It might not get better at that point, but it won't get worse.
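A compact policy-iteration sketch tying the last few cards together: evaluate the current policy exactly (the linear solve above), then improve greedily, stopping when the policy stops changing. The array shapes T[s, a, s'] and R[s, a] are the same illustrative assumptions as before:

```python
import numpy as np

def policy_iteration(T, R, gamma=0.99):
    """Policy iteration: exact evaluation + greedy improvement until the policy is stable."""
    num_states, num_actions, _ = T.shape
    policy = np.zeros(num_states, dtype=int)
    while True:
        # Evaluate: solve the linear system V = R_pi + gamma * P_pi @ V
        P_pi = T[np.arange(num_states), policy]          # (S, S) transitions under the policy
        R_pi = R[np.arange(num_states), policy]          # (S,) rewards under the policy
        V = np.linalg.solve(np.eye(num_states) - gamma * P_pi, R_pi)
        # Improve: act greedily with respect to V
        Q = R + gamma * (T @ V)                          # (S, A)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):           # no change => optimal, in finite time
            return V, policy
        policy = new_policy
```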
You can change the reward function only by multiplying the reward.
False. You can also change the reward function by adding a constant or by using a potential function. Note that any scaling must be by a positive constant; scaling by a negative number would swap max and min.
You can change the reward function only by adding a constant.
False. You can also change the reward function by multiplying the reward by a positive constant, or using a potential function.
You can change the reward function only using a potential function.
False. You can also change the reward function by multiplying the reward by a positive constant, or by adding a constant.
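Written out, the first two transformations from these cards (standard notation, assumed):

```latex
R'(s,a) = c \cdot R(s,a) \quad (c > 0), \qquad\qquad R'(s,a) = R(s,a) + b
```

In the discounted setting, both leave the optimal policy unchanged: scaling multiplies every value by c, and shifting adds b/(1-γ) to every value, so the ordering of actions is preserved.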
You can change the reward function to learn faster only by using a potential function.
True. Multiplying by a positive constant or shifting by a constant will not help; learning takes roughly the same amount of time either way. Only a well-chosen potential function changes how quickly the agent learns.
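For reference, the potential-based shaping term has the standard form (Φ is any function of state; this is the textbook formula, not something stated in these cards):

```latex
R'(s,a,s') = R(s,a) + F(s,s'), \qquad F(s,s') = \gamma\,\Phi(s') - \Phi(s)
```

A well-chosen Φ (for example, an estimate of how promising a state is) gives the learner informative intermediate feedback, which is why shaping can speed up learning while still preserving the optimal policy.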