A Markov Process is a process in which all states do not depend on actions
True. Markov means that you don't have to condition on anything past the most recent state. A Markov Decision Process is a set of Markov Property Compliant states, with rewards and values
Decaying Reward encourages the agent to end the game quickly instead of running around and gathering more rewards
True. As reward decays the total reward for the episode decreases, so the agent is encouraged to maximize total reward by ending the game quickly.
R(s) and R(s,a) are equivalent
True. It just happens that it's easier to think about one vs. the other in certain situations
Reinforcement Learning is harder to compute than a simple MDP
True. You can just use the Bellman for MDP but Reinforcement Learning requires that you make observations and then summarize those observations as values
An optimal policy is the best possible sequence of actions for an MDP
True, with a single caveat. The optimal policy is a policy that maximizes reward over an entire episode by taking the argmax of resulting values of actions + rewards. But MDPs are memoryless, so there is no concept of "sequence" for policy.
Temporal Difference Learning is the difference in reward you see on subsequent time steps.
False. Temporal Difference Learning is the difference in value estimates on subsequence time steps.
RL falls generally into 3 different categories: Model-Based, Value-Based, and Policy-Based
True, Model-Based is essentially using the Bellman Equation to solve a problem Value-Based is Temporal Difference Learning Policy-Based is similar to Value-Based, but it solves in a finite amount of time with a certain amount of confidence (in Greedy it's guaranteed)
TD Learning is defined by Incremental Estimates that are Outcome Based
True. TD Learning thinks of learning in terms of "episodes", which it uses to estimate the transition functions rather than having a predefined model.
For learning rate to guarantee convergence, the sum of the learning rate must be infinite, and the sum of the learning rate squared must be finite.
True. This is called contraction mapping and it guarantees convergence.
All of the TD learning methods have set backs, TD(1) is inefficient because it requires too much data and has high variance, TD(0) has a maximum likelihood estimate but is hard to calculate for long episodes.
True. This is why we use TD(Lamba), which has many fo the benefits of TD(0) but is much more performant. Empirically, lambas between 0.3 and 0.7 seem to perform best.
To control learning, you simply have the operator choose actions in addition to learning.
True. States are experienced as observations during learning, so the operator can influence learning.
True. The Bellman Equation satisfies a Contraction Mapping where the sum of all is infinite, but the sum of all squared is less than infinite. It always converges to Q.
As long as the update operators for Q-learning or Value-iterations are non-expansions, then they will converge.
True. There are expansions that will converge, but only non-expansions are guaranteed to converge independent of their starting values.
A convex combination will converge.
False. It must be a fixed convex combination to converge. If the value can change, like the Boltzmann exploration, then it is no guaranteed to converge.
In Greedy Policies, the difference between the true value and the current value of the policy is less than some epsilon value for exploration.
It serves as a good check for how long we run value iteration until we're pretty confident that we have the optimal policy.
For a set of linear equations, the solution can be found in polynomial time.
For Policy Iteration, Convergence is exact and complete in finite time.
False. This is only true for a Greedy Policy. It is true that it will be optimal policy, however.
The optimal policy will dominate all other policies in finite time.
This is true for greedy policy. This is because policies dominate on a state-by-state bases, so we don't get stuck in local optima.
We get strict value improvement anytime we don't have a policy that is already optimal.
True. Any time a policy is already optimal then a step of policy iteration results in a value non-deprovement. It might not get better, but it won't get worse.
You can change the reward function only by multiplying the reward.
False. You can also change the reward function by shifting a constant, or using a potential functions. You can also scale positively, otherwise you may swap max and min by accident.
You can change the reward function only by adding a constant.
False. You can also change the reward function by multiplying the reward by a positive constant, or using a potential function.
You can change the reward function only using a potential function.
False, you can also change the reward function by multiplying the reward by a positive constant, or adding a constant.
You can change the reward function to learn faster only by using a potential function.
True. Multiplying a positive constant or shifting by a constant will not work, they work in roughly the same time.
You can change the reward function to learn faster by multiplying by a positive constant or shifting by a constant.
False. You can only speed up learning with a potential function.
Even when you use reward shaping, you'll still always find the optimal policy.
False. You can sometimes end up in sub-optimal feedback loops.
The Gitten's Index can be used for all Q-Learning algorithms.
False. This actually only works for K-Armed Bandit spaces, not for continuous spaces. This is because it depends on a ratio of observations to determine confidence, which cannot be achieved in spaces where the ratio changes.
The Gitten's Index is the only metric for K-Armed Bandit problems.
False, there are actually several, including: - Identify optimal arm in the limit - Maximize the expected reward over a finite horizon - Find best - Few mistakes - Do well (the last 3 are related and have equivalent algorithms)
If you have an algorithm that can find the best arm, you can find an algorithm that can find an arm with few mistakes
True. If you have an algorithm that finds e-best with a probability of 1-delta in tau pulls, you will find that same arm with that same probability with an upper bound tau for the first algorithm.
We can estimate what we know using Hoeffding bounds and Union bounds.
True. In stochastic decision problems, we can convince ourselves that we have a sufficiently accurate estimate using those bounds. This lets us get near optimal rewards.
In a deterministic MDP, Rmax will cause us to explore all states by assuming that unexplored or unknown states will have the maximum possible reward.
True. In a sequential decision process, we can convince ourselves that we have a sufficient amount of information by proving that we have visited all states.
Using Rmax, once all edges are known, the algorithm will no longer make mistakes.
True. Because in step 3 we solve the MDP using the Bellman equations, which will give us the optimal policy and therefore never make a mistake.
Using Rmax, you will always visit all unknown states.
False. Sometimes you will end up in a suboptimal reward loop if the discount factor is too low or the state is too far away.
Using Rmax, you will never make any mistakes.
False. In order to learn new states, you may make mistakes in order to discover those states. The maximum number of mistakes that Rmax will make over the course of it's lifetime is O(n^2 K). Where n is the number of states and k if number of actions.
Using a simulation lemma (theorem), if you have an estimate of the real MDP, then optimizing the rewards in the estimate will be near optimal for the real MDP.
True. If the difference between the transition functions is less than or equal to alpha, then the difference between the rewards is also less than or equal to alpha.
If all transitions are either accurately estimates or unknown, optimal policy is either near optimal or an unknown state is reached quickly.
False. Using the explore or exploit lemma (theorem), Rmax guarantees that we will either explore or be near optimal with ***high probability***, ***not guaranteed***. The maximum run time is either guarantee near optimal or learn something new is bounded by the number of states times the number of actions (n*k)
There is both a need and a method to do generalizations in RL.
True. The need comes from the fact that there can be zillions upon zillions of states. In a continuous space, it's even impossible to visit every state once. The method is to take what we learned in supervised learning, specifically linear function approximation.
Policy Gradients are the most popular for linear function approximation for generalized reinforcement learning.
False. Value Function Gradients are the most popular because there's been a bunch of success there. However, in the Robotics sector, policy gradients have been very successful.
A Linear Value Approximator always works
False. Baird's Counterexample.
POMPs Generalize MDPs
True. All you need to do is add observables and an observation function to the MDP tuple. However, you cannot go in reverse.
You can always infer a state from observations.
False. You can infer the probability of being in a state, but you can't always definitely say you are in a state. A POMP is just an MDP with probability 1.
A POMP has a finite number of states.
Somewhat false. The underlying MDP has a finite number of states, but because our observations (belief states) are probability distributions, the actual number of states is infinite.
A POMDP can be written in a finite way.
True. Convex piecewise linear belief space of a finite set of linear functions (which can represent an infinite number of inputs). Sondik
A vector space of 0 with size s is the only maximum of belief state vectors.
False. Because you take the dot product of the result, you need a matching size. One way to do this is to make those all zero, but it also works if you use the reward vector.
The maximum of all maximums in a bag of vectors is the piecewise-linear and convex value.
True. And you'll maintain a finite number of vectors in doing so. You'll eliminate the infinite set of vectors by taking their sum and product a new vector.
You can eliminate vectors that are not participating in a piecewise-linear and convex value function.
True. Any vector that is dominated by another vector (that is, smaller at all points in the belief space) can be removed in linear time. You can also remove non-dominated vectors in O(LP) time.
The Optimal Policy converges in finite time for POMPs.
False. POMDPs have infinite states so they cannot converge in finite time. It is however true that we get some arbitrarily "good" approximation of the optimal value function after some finite number of iterations.
Planning in POMDPs is undecidable
True. You cannot definitely say what the optimal action given a state, if you could solve that problem you could solve the halting problem.
A partially observed, uncontrolled process is called a POMP, and an observed, controlled process is called a MDP.
False. POMDPs are partially observed and controlled. Markov Cains are observed, uncontrollable processes.
You can solve a POMP with piecewise linear and convex functions.
True. This is essentially taking the "max" of a POMP's belief space. Since we don't affirmatively know which state of the real MDP we are in unless the MDP = POMDP, we take actions based on the maximums of piecewise linear functions, which end up being convex when they are pieced together.
Bayesian RL gets its name from Bayes Rule
True. It comes from the fact that as we learn about which state we're actually in (through process of elimination) which converts to a POMDP, we normalize our belief states transition probability using Bayes
Exploration is exploitation.
True. In Bayesian RL, you can simply maximize the reward and make no distinction on whether actions taken are exploring or exploiting. This is because the transition functions are unknown, and are instead given probability based on Bayes Rule, which converts to a POMDP.
RL is actually planning in a continuous POMDP.
True. The hidden state is the parameters of the MDP that we're trying to learn. So, essentially, through Bayesian RL, we recreate the MDP.
RL is hard because of the delayed reward.
True. There's also a difficulty of bootstrapping (exploration), and the scaling of the state space and action space.
You can more easily solve continuous space problems by adding actions.
True. This is called temporal abstraction and while counterintuitive, it actually makes the space simpler. This is because instead of taking several individual actions, your policy is one "big" action which is a collection of actions. State abstraction (value-iteration) is like treating a bunch of states equivalently, Temporal abstraction is like treating a bunch of time-steps equivalently.
In Temporal Abstraction, in order to get a regular MDP, all you need to do is set all termination probabilities of the options(s) to 1.
False. Options are a mapping of state-action pairs to some probability of taking the option as well. To get an MDP, you need to set up your initialization set and policy correctly as well. - The initialization set must be all valid actions for that particular state, and - The policy must have probability 1 for the action you wish to take, and 0s everywhere else.
Variable time steps and atomic time steps are interchangeable.
True. Since you are summing the discounted reward for each of the atomic time steps that make up the variable time step (option), they are interchangeable at a single time step. In fact, MDPs are simply a special case of SMDPs, which are a generalization of MDPs, where at each time step the terminating probability for all possible result states of all available actions is 1 for time-step 1.
SMDPs satisfy all the requirements of an MDP.
False. Because options depend on the number of steps k taken between steps, they are no longer stateless, bur rather ***sequential****, which violates the Markovian Property. For Semi-Markov decision problems (SMDPs), an additional parameter of interest is the time spent in each transition.
Options perform temporal abstraction to shrink time-steps.
Somewhat false. There's also a chance that options will shrink state-space through state abstraction, but it depends on how your options are defined. Additionally, if your options cover all possible states and time-steps, then nothing will shrink, since at that point it's just the MDP instead of an SMDP.
Modular RL is a field that studies Q-Learning through the concept of Arbitration.
True. 3 examples of this are: 1. Greatest Mass Q-Learning, which adds up all the Q-values for state-action pairs per goal/agent and takes the largest one (so essentially, the goals are voting on which action to take) 2. Top Q-Learning, in which the largest Q-value wins across all goals/agents 3. Negotiated W-Learning, in which the agent with the most to lose gets to choose the action.
Greatest Mass Q-Learning is effective because all goals are satisfied by the action taken.
False. In fact, the opposite is true. Since agents are summing their Q-values for the state-action pairs, there is a possibility that none of the agents are satisfied by the action selected simply due to the intense disagreement. While it is possible that each goal's maximum Q-value is the same and all agents would be satisfied, this is not always the case. Worse, it is possible when n goals < m actions that the least satisfactory action for all n agents ends ip being selected simply because that's the only action that all n agents agree on.
Top Q-Learning is effective because the goal that dominates will at least have it's needs satisfied.
False. It is possible that the largest Q-value for a given agent is actually only slightly larger than the agent's other actions, but it just happens to have the largest Q-value. So in this situation, the agent doesn't actually care about which action it selects, but it still gets to choose which action to take simply because it's Q-values are larger.
In Negotiating Q-Learning, you maximize the Q-value of all agents.
False. For 2 reasons: 1. The agent that selects the action is the agent with the most to lose, not the most to gain, so you don't necessarily find actions that maximize the Q-value. 2. You don't maximize the value so much as minimize the loss of all agents, since you are choosing the agent with the largest loss and guaranteeing that the loss is minimized.
Modular RL is impossible.
True. This is due to Arrow's Impossibility Result, which states that it is not possible to construct a fair voting result. Because all of the Modular RL/Arbitration actions involve some kind of voting: - Greatest Mass Q-Learning through aggregation - Top Q-Learning through domination of top results - Negotiated Q-Learning through selecting the agent with the most to lose, you cannot construct these systems due to the lack of Units in the Q-values. (Metric vs Imperial example)
In Monte Carlo Tree Searches, the best rollout policy is behave randomly.
False. You can get a better performance by applying failure constraints to your selection process.
Monte Carlo Tree Search is useful for large state spaces.
True. This is because planning time is independent of the state space. The downside is that you need lots of samples to get a good estimate. Monte Carlo does depend on how far you look into the future, and has a running time that is exponential in the horizon O( (|A|*x)^H )
Prisoner's Dilemma-like problems (zero-sum games) will have different results played over multiple games than a single instance of game.
False. Even if the agents play relative to their observed reward in previous instances, they will all act relative to the expected terminal state, which will be the single-game result. Because of this, the single-game result. Because of this, the single-game result back-propagates identical results back through the games.
All Zero-Sum 2 player games can be represented by a matrix.
True. This is because you can represent the reward of a pair of decisions by each player as a single value. The caveat is that you can infer from game to matrix, but not matrix to games.
In a 2-player, zero-sum game, minimax = maximin, and there always exists an optimal pure strategy for each player.
False. The game must have perfect information. If you have both of these things, then there will be an optimal row/column that both players will see as maximizing their reward. The game must also be deterministic, but non-deterministic games can be changed to deterministic games by weighting the rewards by their probabilities.
A deterministic game and a non-deterministic game are both conceptually equal.
True. Because the non-deterministic rewards can be simplified to their rewards weighted by their probabilities, you can still create the minimax-maximin matrix and find the optimal solution for both players (aka the value of the game).
A game of perfect information and a game of hidden information are both conceptually equal.
False. A game of hidden information does not satisfy the requirement of minimax = maximin, so you can't treat them equally. Hidden information games are solved through mixed strategies, or taking actions based on probability rather than based on perfect expected outcome, the probability being the solution to multiple linear equations for each player.
A zero-sum game and a non-zero sum game are conceptually equal.
False. Because the rewards are different for non-zero-sum games, the Nash Equilibrium is not necessarily the minimax for the entire game, but rather the equilibrium where neither player can be convinced to switch their action (a.k.a neither player will get a greater reward for switching)
A repeated 2-player bimatrix game with a certain end is equal to a repeated 2-player bimatrix game with an uncertain end.
False. Even though the game can be given an average number of plays, the strategies to maximize reward are not the same. For an unbounded series of games, you must define optimal strategies in terms of only the previous game, not future games.
In repeated games, the possibility of retaliation opens the door for cooperation.
True. Because there are now strategies that each player can take in order to both create a plausible threat, and to also converge on the maximum possible value. The best example of this is Pavolv, where you can cooperate if you agree, and defect if you disagree. This converges to a cooperation for all plausible threat strategies.
You can build a Pavolv-like machine for any game.
False. It must be a 2-player, bimatrix game. It will construct a perfect nash equilibrium in polynomial time. There are 3 possible results: Pavlov, Zero-Sum-Like, or only a single player improves.
False. Nash-Q can have different values for equilibria, so it does not converge to a single strategy.
DEC-POMPs are impossible to solve.
True. Just like POMDPs they are impossible to solve. They are NEXP-complete for a finite horizon.
Inverse Reinforcement Learning takes Behavior and infers the environment and rewards, in contrast to Reinforcement Learning which takes the environment and rewards and infers behavior.
False. In both IRL and RL the environment is observed by the agent(s).
Reward Shaping and Policy Shaping are the same thing.
False. When working off a teacher's +/- data for a given action space, Reward Shaping arbitrarily assigns +1 and -1. Not only is this not correct, but it won't necessarily teach the agent the correct policy because the rewards are not scaled. Policy shaping is the literal policy based on +/- input, independent of reward.
You need to include confidence of the human policy in order to calculate policy during policy shaping.
False. Confidence is baked into the policy already. This is because of a combination of Bayes Theorem and observations normalizing the policy. While we become less and less confident in the human's observations over time, and start favoring the agent's policy, these are already baked into their respective policies so we only need to combine them.
It is always possible to convert a finite horizon MDP to a infinite horizon MDP.
True. The reward state simply becomes zero for additional states.
In Reinforcement Learning, recent moves influence the outcome more than moves further in the past.
False. Recent moves influence the outcome less due to contraction. However, they end up influencing the same amount as all previous moves due to averaging.
An MDP with a fixed policy is a Markov Chain with rewards.
True. Since a fixed policy never changes, you will experience the same chain every single time.
Monte Carlo is an unbiased estimator of the value function compared to Temporal Difference methods. Therefore, it is the preferred algorithm when doing Reinforcement Learning with episodic tasks.
False. Monte Carlo is TD(0), but we prefer TD(Lamba) so that that states can have multi-step lookaheads and be influenced not just by immediate rewards, but also future rewards.
Backward and forward TD(lambda) can be applied to the same problems.
True. But backward is easier to implement so we prefer it.
The accuracy of the return policy is the only metric we care about when evaluating a learner.
False. We also care about the speed at which learning occurs, and the data that it requires.
Predictions based on minimizing error -- for example Widrow-Hoff -- will be optimized as error by definition is minimized.
False. Sutton 1988 notes that Widrow-Hoff minimizes error on the training set, but that isn't the same as an yielding an MLE estimate.
TD methods are unbiased and accurate as they are based directly on the outcome of each trial/experience.
False. The initial values influence actions taken for each individual episode without exploration.
An update that is an expansion will not converge.
False. Only non-expansions are guaranteed to converge, but there are fringe cases that will still converge.
MDPs are a type of Markov Game.
True. They can be thought of as a 1-player Markov Game.
Any optimal policy found with reward shaping is the optimal policy for the original MDP.
True. You can change the reward function without changing the optimal policy by either multiplying by a constant, adding a scalar, or a non-linear potential-based reward.
Potential-based shaping will find an optimal policy faster than an unshaped MDP.
Somewhat true. You can create rewards that explore the MDP better and avoid sub-optimal states, but you can also create rewards that get caught in loops and do not converge.
Thinking back to HW #1, state-based bonuses are an effective way to shape the rewards.
False. If you do this the MDP will converge to those bonus states rather than trying to explore. You need to have some kind of penalty or time decayed reward for those states for them to be effective.
Rmax will always find the optimal policy for a properly tuned learning function.
True. The optimal policy of the underlying MDP cannot be better than the one where unknown states are assigned an optimistic value of Rmax.
In a k-armed bandit problem, even if you have found the best arm to pull you still must explore and try another.
False. If you've found the best arm then you've found the best arm with probability 1-delta and there's no need to try other arms.
Potential-Based shaping is equivalent to modifying initial Q-values.
False. Because potential-based shaping modifies rewards, and those rewards are constant, whereas Q-values can change.
The credit assignment problem refers to the state that gives the most reward on an MDP.
False. It refers to temporal assignment of credit. You are propagating value backwards.
Model-Based Reinforcement Learning is the use of supervised learning models such as neural networks to solve large state space RL problems.
False. Model-based refers to having a perfect information MDP (ie. all transition functions, states, and actions are known and perfectly observable). You can either have the MDP given to you, or you can estimate it using sampling. Once you have the model, you can solve it using the Bellman Equations.
Value-Iteration is one of the most important model-free reinforcement learning methods.
False. Because Value-Iteration requires a model.
Off-policy agents learn the values of a policy different than the policy they are acting under.
True. Because they are taking different actions, they are seeing different states and rewards than the greedy policy.
Only non-linear function approximation has been proven to converge when used with the right hyper-parameters.
False. Only linear.
POMDP are partially-observable because they are missing the MDP. An example this is model-free reinforcement learning problems.
False. POMDPs know the MDP, but they don't know where they are in the MDP. Instead of affirmative states, they have belief states and observables.
Grim trigger strategy means a player will cooperate for the entire game regardless of the other player's action.
False. Grim Trigger strategy means a player will cooperate until the opponent does not, then the player uncooperates (take revenge) for the remainder of the game.
Bayesian RL is a model-based approach that relies heavily in statistical methods such as Bayes Rule.
False. Bayesian RL is a statistical approach to RL that uses Bayes Rule, but is not necessarily model-based. You do not need to know the model ahead of time in order to use it.
DEC-POMDPs is a modeling framework for cooperative problems under uncertainty.
True. Agents are acting based on local observable information to maximize a joint reward.
Model-based reinforcement learning agents can solve environments with continuous state variables because they are able to learn the transition and reward function.
False, they solve environments regardless of whether they're continuous or discrete.