Exam Study Flashcards

Question

You can change the reward function to learn faster by multiplying by a positive constant or shifting by a constant.

Answer 1

False. You can only speed up learning with a potential function.

Answer 2

False. You can sometimes end up in sub-optimal feedback loops.

Answer 3

False. This actually only works for K-Armed Bandit spaces, not for continuous spaces. This is because it depends on a ratio of observations to determine confidence, which cannot be achieved in spaces where the ratio changes.

Answer 4

False, there are actually several, including:
- Identify the optimal arm in the limit
- Maximize the expected reward over a finite horizon
- Find best
- Few mistakes
- Do well
(the last 3 are related and have equivalent algorithms)

Answer 5

True. If you have an algorithm that finds e-best with a probability of 1-delta in tau pulls, you will find that same arm with that same probability with an upper bound tau for the first algorithm.

Answer 6

True. In stochastic decision problems, we can convince ourselves that we have a sufficiently accurate estimate using those bounds. This lets us get near optimal rewards.

Answer 7

True. In a sequential decision process, we can convince ourselves that we have a sufficient amount of information by proving that we have visited all states.

Answer 8

True. Because in step 3 we solve the MDP using the Bellman equations, which will give us the optimal policy and therefore never make a mistake.

Answer 9

False. Sometimes you will end up in a suboptimal reward loop if the discount factor is too low or the state is too far away.

Answer 10

False. To discover new states, you may have to make mistakes. The maximum number of mistakes that Rmax will make over the course of its lifetime is O(n^2 k), where n is the number of states and k is the number of actions.

Answer 11

True. If the difference between the transition functions is less than or equal to alpha, then the difference between the rewards is also less than or equal to alpha.

Answer 12

False. By the explore-or-exploit lemma (theorem), Rmax will either explore or be near optimal with high probability; it is not guaranteed. The number of steps needed to either be near optimal or learn something new is bounded by the number of states times the number of actions (n*k).

Answer 13

True. The need comes from the fact that there can be zillions upon zillions of states. In a continuous space, it's even impossible to visit every state once. The method is to take what we learned in supervised learning, specifically linear function approximation.

Answer 14

False. Value Function Gradients are the most popular because there's been a bunch of success there. However, in the Robotics sector, policy gradients have been very successful.

Answer 15

False. Baird's Counterexample.

Answer 16

True. All you need to do is add observables and an observation function to the MDP tuple. However, you cannot go in reverse.

Answer 17

False. You can infer the probability of being in a state, but you can't always say definitively that you are in a state. An MDP is just a POMDP in which the current state is known with probability 1.

Answer 18

Somewhat false. The underlying MDP has a finite number of states, but because our observations (belief states) are probability distributions, the actual number of states is infinite.

Answer 19

True. The value function over belief space is piecewise linear and convex, so it can be represented by a finite set of linear functions (which can cover an infinite number of inputs). This result is due to Sondik.

Answer 20

False. Because you take the dot product of the result, you need a matching size. One way to do this is to make those all zero, but it also works if you use the reward vector.

Answer 21

True. And you'll maintain a finite number of vectors in doing so. You'll eliminate the infinite set of vectors by taking their sum to produce a new vector.

Answer 22

True. Any vector that is dominated by another vector (that is, smaller at all points in the belief space) can be removed in linear time. You can also remove non-dominated vectors in O(LP) time.
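
The pointwise-domination check described above can be sketched as follows. This is a minimal illustration, not a full pruning routine: the function name `prune_dominated` and the example vectors are hypothetical, and the simple pairwise comparison used here is quadratic in the number of vectors.

```python
def prune_dominated(vectors):
    """Drop every vector that another vector matches or beats in all components.

    Exact duplicates are treated as dominated by their first occurrence.
    """
    kept = []
    for i, v in enumerate(vectors):
        dominated = any(
            j != i
            and all(w_k >= v_k for w_k, v_k in zip(w, v))
            and (w != v or j < i)  # among equal vectors keep only the first
            for j, w in enumerate(vectors)
        )
        if not dominated:
            kept.append(v)
    return kept


# Hypothetical vectors over a 2-state belief space.
example = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.6), (0.5, 0.0), (0.0, 1.0)]
pruned = prune_dominated(example)
print(pruned)
```

Here `(0.5, 0.0)` is removed because `(1.0, 0.0)` is at least as large everywhere, and the repeated `(0.0, 1.0)` is removed as a duplicate.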

Answer 23

False. POMDPs have infinite states so they cannot converge in finite time. It is however true that we get some arbitrarily "good" approximation of the optimal value function after some finite number of iterations.

Answer 24

True. You cannot say definitively what the optimal action is given a state; if you could solve that problem, you could solve the halting problem.

Answer 25

False. POMDPs are partially observed and controlled. Markov chains are observed, uncontrolled processes.

Answer 26

True. This is essentially taking the "max" over a POMDP's belief space. Since we don't affirmatively know which state of the real MDP we are in unless the MDP = POMDP, we take actions based on the maximums of piecewise linear functions, which end up being convex when they are pieced together.
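
As a rough sketch of that "max of linear functions" idea, the value at a belief can be computed as the largest dot product between the belief and a set of linear pieces. The names `value` and `ALPHAS` and the numbers are assumptions made up for illustration.

```python
def value(belief, alpha_vectors):
    """Value at a belief point: the max over linear pieces of (piece . belief)."""
    return max(
        sum(a_i * b_i for a_i, b_i in zip(alpha, belief))
        for alpha in alpha_vectors
    )


# Hypothetical linear pieces over a 2-state belief space.
ALPHAS = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.6)]

left = value((1.0, 0.0), ALPHAS)
right = value((0.0, 1.0), ALPHAS)
middle = value((0.5, 0.5), ALPHAS)

# Convexity: the value at the midpoint is no larger than the average of the ends.
print(left, middle, right, middle <= 0.5 * left + 0.5 * right)
```

Each piece is linear in the belief, and a pointwise max of linear functions is convex, which is why the check on the last line holds.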

Answer 27

True. It comes from the fact that as we learn about which state we're actually in (through process of elimination), we normalize our belief state's transition probabilities using Bayes' rule.
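
A minimal sketch of that Bayes normalization, assuming the standard belief update b'(s') proportional to O(o | s', a) * sum_s T(s' | s, a) * b(s). The function name, the table layout, and the two-state "listen" example are hypothetical.

```python
def belief_update(belief, action, observation, T, O):
    """Return the normalized posterior belief after taking action and seeing observation.

    T[action][s][s2] is the transition probability, O[action][s2][obs] the
    observation probability (layout assumed for this sketch).
    """
    states = list(belief)
    unnormalized = {
        s2: O[action][s2][observation]
        * sum(T[action][s][s2] * belief[s] for s in states)
        for s2 in states
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under this belief")
    return {s2: p / total for s2, p in unnormalized.items()}


# Hypothetical example: listening does not move the hidden state and is 85% accurate.
T = {"listen": {"left": {"left": 1.0, "right": 0.0},
                "right": {"left": 0.0, "right": 1.0}}}
O = {"listen": {"left": {"hear-left": 0.85, "hear-right": 0.15},
                "right": {"hear-left": 0.15, "hear-right": 0.85}}}

b0 = {"left": 0.5, "right": 0.5}
b1 = belief_update(b0, "listen", "hear-left", T, O)
b2 = belief_update(b1, "listen", "hear-left", T, O)
print(b1, b2)
```

Repeating the same observation pushes the belief further toward one state, which is the process-of-elimination effect described above.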

Answer 28

True. In Bayesian RL, you can simply maximize the reward and make no distinction on whether actions taken are exploring or exploiting. This is because the transition functions are unknown, and are instead given probability based on Bayes Rule, which converts to a POMDP.

Answer 29

True. The hidden state is the parameters of the MDP that we're trying to learn. So, essentially, through Bayesian RL, we recreate the MDP.

Answer 30

True. There's also a difficulty of bootstrapping (exploration), and the scaling of the state space and action space.

Answer 31

True. This is called temporal abstraction and, while counterintuitive, it actually makes the space simpler. This is because instead of taking several individual actions, your policy is one "big" action which is a collection of actions. State abstraction (value-iteration) is like treating a bunch of states equivalently; temporal abstraction is like treating a bunch of time-steps equivalently.

Answer 32

False. Options are a mapping of state-action pairs to some probability of taking the option as well. To get an MDP, you need to set up your initialization set and policy correctly as well:
- The initialization set must be all valid actions for that particular state, and
- The policy must have probability 1 for the action you wish to take, and 0s everywhere else.

Answer 33

True. Since you are summing the discounted reward for each of the atomic time steps that make up the variable time step (option), they are interchangeable at a single time step. In fact, MDPs are simply a special case of SMDPs, which are a generalization of MDPs, where at each time step the terminating probability for all possible result states of all available actions is 1 for time-step 1.
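
The summing of discounted rewards over an option's atomic steps can be sketched like this. The function name `option_return` and the reward sequences are assumptions for illustration only.

```python
def option_return(rewards, gamma):
    """Summarize an option that ran for len(rewards) atomic steps.

    Returns the discounted sum of the atomic rewards and the discount
    gamma**k that applies to the value of the state where the option ends.
    """
    total = sum(gamma ** t * r for t, r in enumerate(rewards))
    return total, gamma ** len(rewards)


# Hypothetical option lasting three atomic steps.
multi_step = option_return([1.0, 0.0, 2.0], gamma=0.5)

# A one-step option reduces to an ordinary MDP transition: reward r, discount gamma.
single_step = option_return([3.0], gamma=0.5)
print(multi_step, single_step)
```

The one-step case is the sense in which an MDP is a special case of an SMDP.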

Answer 34

False. Because options depend on the number of steps k taken between steps, they are no longer stateless, but rather sequential, which violates the Markovian Property. For Semi-Markov decision problems (SMDPs), an additional parameter of interest is the time spent in each transition.

Answer 35

Somewhat false. There's also a chance that options will shrink state-space through state abstraction, but it depends on how your options are defined. Additionally, if your options cover all possible states and time-steps, then nothing will shrink, since at that point it's just the MDP instead of an SMDP.

Answer 36

True. 3 examples of this are:
1. Greatest Mass Q-Learning, which adds up all the Q-values for state-action pairs per goal/agent and takes the largest one (so essentially, the goals are voting on which action to take)
2. Top Q-Learning, in which the largest Q-value wins across all goals/agents
3. Negotiated W-Learning, in which the agent with the most to lose gets to choose the action.
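
The first two arbitration rules can be sketched in a few lines. This is an illustration under assumptions: each module is represented as a hypothetical dict of Q-values for the current state, and Negotiated W-Learning is left out because it needs more machinery than fits here.

```python
def greatest_mass(q_tables):
    """Pick the action whose Q-values, summed over all modules, are largest."""
    actions = q_tables[0]
    return max(actions, key=lambda a: sum(q[a] for q in q_tables))


def top_q(q_tables):
    """Pick the action holding the single largest Q-value of any module."""
    actions = q_tables[0]
    return max(actions, key=lambda a: max(q[a] for q in q_tables))


# Hypothetical Q-values of two modules in the same state.
module_1 = {"left": 5.0, "stay": 4.0, "right": 0.0}
module_2 = {"left": 0.0, "stay": 4.0, "right": 6.0}

gm_choice = greatest_mass([module_1, module_2])
top_choice = top_q([module_1, module_2])
print(gm_choice, top_choice)
```

In this example the summed vote picks "stay", which is neither module's favorite action, while the top-value rule lets the module with the single largest number decide.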

Answer 37

False. In fact, the opposite is true. Since agents are summing their Q-values for the state-action pairs, there is a possibility that none of the agents are satisfied by the action selected simply due to the intense disagreement. While it is possible that each goal's maximum Q-value is the same and all agents would be satisfied, this is not always the case. Worse, it is possible when n goals < m actions that the least satisfactory action for all n agents ends up being selected simply because that's the only action that all n agents agree on.

Answer 38

False. It is possible that the largest Q-value for a given agent is actually only slightly larger than the agent's other actions, but it just happens to have the largest Q-value. So in this situation, the agent doesn't actually care about which action it selects, but it still gets to choose which action to take simply because its Q-values are larger.

Answer 39

False. For 2 reasons:
1. The agent that selects the action is the agent with the most to lose, not the most to gain, so you don't necessarily find actions that maximize the Q-value.
2. You don't maximize the value so much as minimize the loss of all agents, since you are choosing the agent with the largest loss and guaranteeing that the loss is minimized.

Answer 40

True. This is due to Arrow's Impossibility Result, which states that it is not possible to construct a fair voting system. All of the Modular RL/Arbitration methods involve some kind of voting:
- Greatest Mass Q-Learning through aggregation
- Top Q-Learning through domination of top results
- Negotiated W-Learning through selecting the agent with the most to lose
You cannot construct these systems fairly because the Q-values lack common units. (Metric vs Imperial example)

Answer 41

False. You can get a better performance by applying failure constraints to your selection process.

Answer 42

True. This is because planning time is independent of the state space. The downside is that you need lots of samples to get a good estimate. Monte Carlo does depend on how far you look into the future, and has a running time that is exponential in the horizon: O((|A|*x)^H).

Answer 43

False. Even if the agents play relative to their observed reward in previous instances, they will all act relative to the expected terminal state, which will be the single-game result. Because of this, the single-game result back-propagates identical results through the games.

Answer 44

True. This is because you can represent the reward of a pair of decisions by each player as a single value. The caveat is that you can infer from game to matrix, but not matrix to games.

Answer 45

False. The game must have perfect information. If you have both of these things, then there will be an optimal row/column that both players will see as maximizing their reward. The game must also be deterministic, but non-deterministic games can be changed to deterministic games by weighting the rewards by their probabilities.

Answer 46

True. Because the non-deterministic rewards can be simplified to their rewards weighted by their probabilities, you can still create the minimax-maximin matrix and find the optimal solution for both players (aka the value of the game).
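
A small sketch of the minimax-maximin check on a payoff matrix, restricted to pure strategies. The matrix is hypothetical and is assumed to already hold expected rewards for the row player (the column player receives the negative).

```python
def maximin(matrix):
    """Best worst-case payoff the row player can secure."""
    return max(min(row) for row in matrix)


def minimax(matrix):
    """Smallest best-case payoff the column player can hold the row player to."""
    return min(max(column) for column in zip(*matrix))


# Hypothetical zero-sum game; entries are expected payoffs to the row player.
game = [
    [3, 2, 4],
    [1, 0, 5],
    [2, 1, 6],
]

print(maximin(game), minimax(game))
```

When the two numbers agree, as they do here, that shared number is the value of the game.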

Answer 47

False. A game of hidden information does not satisfy the requirement of minimax = maximin, so you can't treat them equally. Hidden information games are solved through mixed strategies, or taking actions based on probability rather than based on perfect expected outcome, the probability being the solution to multiple linear equations for each player.
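
To illustrate the "solution to linear equations" part, here is a sketch for the 2x2 zero-sum case only, where each player mixes so that the opponent is indifferent between their two actions. The function name and the matching-pennies matrix are assumptions, and the formula presumes the game has no pure-strategy saddle point.

```python
from fractions import Fraction


def solve_2x2_zero_sum(matrix):
    """Mixed strategy for a 2x2 zero-sum game given row-player payoffs.

    Returns (p, q, value): p is the probability of the first row, q the
    probability of the first column. Assumes no pure-strategy saddle point.
    """
    (a, b), (c, d) = matrix
    denom = Fraction(a - b - c + d)
    if denom == 0:
        raise ValueError("indifference equations are degenerate for this matrix")
    p = Fraction(d - c) / denom  # makes the column player indifferent
    q = Fraction(d - b) / denom  # makes the row player indifferent
    value = p * a + (1 - p) * c
    return p, q, value


# Hypothetical example: matching pennies.
p, q, game_value = solve_2x2_zero_sum([[1, -1], [-1, 1]])
print(p, q, game_value)
```

For matching pennies both players mix evenly and the value of the game is zero.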

Answer 48

False. Because the rewards are different for non-zero-sum games, the Nash Equilibrium is not necessarily the minimax for the entire game, but rather the equilibrium where neither player can be convinced to switch their action (a.k.a. neither player will get a greater reward for switching).
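
The "neither player gains by switching" condition can be checked directly for pure strategies. This is a sketch with hypothetical names and a hypothetical prisoner's-dilemma payoff table (action 0 is cooperate, action 1 is defect); it does not search for mixed equilibria.

```python
def pure_nash(payoff_a, payoff_b):
    """List the (row, column) pairs where no player gains by deviating alone."""
    rows = range(len(payoff_a))
    cols = range(len(payoff_a[0]))
    equilibria = []
    for r in rows:
        for c in cols:
            a_is_best = all(payoff_a[r][c] >= payoff_a[r2][c] for r2 in rows)
            b_is_best = all(payoff_b[r][c] >= payoff_b[r][c2] for c2 in cols)
            if a_is_best and b_is_best:
                equilibria.append((r, c))
    return equilibria


# Hypothetical prisoner's dilemma: payoffs are negative years in prison.
PAYOFF_A = [[-1, -9], [0, -6]]
PAYOFF_B = [[-1, 0], [-9, -6]]

print(pure_nash(PAYOFF_A, PAYOFF_B))
```

The only pair that survives is mutual defection, even though mutual cooperation would give both players more.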

Answer 49

False. Even though the game can be given an average number of plays, the strategies to maximize reward are not the same. For an unbounded series of games, you must define optimal strategies in terms of only the previous game, not future games.

Answer 50

True. Because there are now strategies that each player can take in order both to create a plausible threat and to converge on the maximum possible value. The best example of this is Pavlov, where you cooperate if you agree and defect if you disagree. This converges to cooperation for all plausible threat strategies.
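
A short simulation sketch of Pavlov playing against itself, using the rule stated above (cooperate after agreement, defect after disagreement). The helper names and the choice of starting moves are assumptions for illustration.

```python
C, D = "C", "D"


def pavlov(my_last, their_last):
    """Cooperate if both players made the same move last round, else defect."""
    return C if my_last == their_last else D


def play(first_moves, rounds):
    """Run Pavlov against Pavlov and return the list of joint moves."""
    a, b = first_moves
    history = [(a, b)]
    for _ in range(rounds):
        a, b = pavlov(a, b), pavlov(b, a)
        history.append((a, b))
    return history


trace = play((C, D), rounds=3)
print(trace)
```

From any of the four starting pairs, the two players reach mutual cooperation within two rounds and stay there.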

Answer 51

False. It must be a 2-player, bimatrix game. It will construct a perfect Nash equilibrium in polynomial time. There are 3 possible results: Pavlov, Zero-Sum-Like, or only a single player improves.

Answer 52

False. Nash-Q can have different values for equilibria, so it does not converge to a single strategy.

Answer 53

True. Just like POMDPs they are impossible to solve. They are NEXP-complete for a finite horizon.

Answer 54

False. In both IRL and RL the environment is observed by the agent(s).

Answer 55

False. When working off a teacher's +/- data for a given action space, Reward Shaping arbitrarily assigns +1 and -1. Not only is this not correct, but it won't necessarily teach the agent the correct policy because the rewards are not scaled. Policy shaping is the literal policy based on +/- input, independent of reward.

Answer 56

False. Confidence is baked into the policy already. This is because of a combination of Bayes Theorem and observations normalizing the policy. While we become less and less confident in the human's observations over time, and start favoring the agent's policy, these are already baked into their respective policies so we only need to combine them.

Answer 57

True. The reward state simply becomes zero for additional states.

Answer 58

False. Recent moves influence the outcome less due to contraction. However, they end up influencing the same amount as all previous moves due to averaging.

Answer 59

True. Since a fixed policy never changes, you will experience the same chain every single time.

Answer 60

False. TD(0) is the one-step method (Monte Carlo corresponds to TD(1)), but we prefer TD(lambda) so that states can have multi-step lookaheads and be influenced not just by immediate rewards, but also by future rewards.

Answer 61

True. But backward is easier to implement so we prefer it.
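
For reference, a minimal sketch of a backward-view TD(lambda) update with accumulating eligibility traces. The function name, the episode format, and the two-state example are assumptions; this is one common way to write the update, not necessarily the exact form used in the course.

```python
def td_lambda_episode(values, episode, alpha, gamma, lam):
    """Apply backward-view TD(lambda) over one episode and return new values.

    episode is a list of (state, reward, next_state); next_state is None
    when the episode terminates.
    """
    values = dict(values)  # work on a copy
    traces = {s: 0.0 for s in values}
    for state, reward, next_state in episode:
        next_value = 0.0 if next_state is None else values[next_state]
        delta = reward + gamma * next_value - values[state]
        traces[state] += 1.0  # accumulating trace for the visited state
        for s in values:
            values[s] += alpha * delta * traces[s]
            traces[s] *= gamma * lam  # decay every trace after the update
    return values


# Hypothetical two-step episode: A -> B -> terminal, reward 1 at the end.
start = {"A": 0.0, "B": 0.0}
episode = [("A", 0.0, "B"), ("B", 1.0, None)]

one_step = td_lambda_episode(start, episode, alpha=0.5, gamma=1.0, lam=0.0)
full_return = td_lambda_episode(start, episode, alpha=0.5, gamma=1.0, lam=1.0)
print(one_step, full_return)
```

With lambda = 0 only the state just before the reward is updated, while with lambda = 1 the final reward also reaches the earlier state through its trace.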

Answer 62

False. We also care about the speed at which learning occurs, and the data that it requires.

Answer 63

False. Sutton 1988 notes that Widrow-Hoff minimizes error on the training set, but that isn't the same as yielding an MLE estimate.

Answer 64

False. The initial values influence actions taken for each individual episode without exploration.

Answer 65

False. Only non-expansions are guaranteed to converge, but there are fringe cases that will still converge.

Answer 66

True. They can be thought of as a 1-player Markov Game.

Answer 67

True. You can change the reward function without changing the optimal policy by either multiplying by a positive constant, adding a scalar, or adding a non-linear potential-based reward.
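
A sketch that checks this on a tiny example: value iteration is run on a made-up 3-state, 2-action discounted MDP (no terminal states) under the original rewards, a positively scaled copy, a shifted copy, and a potential-shaped copy, and the greedy policies are compared. All names, transition tables, and numbers here are hypothetical.

```python
GAMMA = 0.9

# P[s][a] = list of (next_state, probability); hypothetical MDP.
P = {
    0: {0: [(0, 0.5), (1, 0.5)], 1: [(2, 1.0)]},
    1: {0: [(0, 1.0)], 1: [(2, 0.7), (1, 0.3)]},
    2: {0: [(2, 1.0)], 1: [(0, 0.4), (1, 0.6)]},
}
# R[s][a] = immediate reward; PHI[s] = arbitrary potential for shaping.
R = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.5, 1: 2.0}, 2: {0: 0.2, 1: 1.0}}
PHI = {0: 10.0, 1: -3.0, 2: 4.0}


def greedy_policy(reward, sweeps=2000):
    """Run value iteration under reward(s, a, s2) and return the greedy policy."""
    v = {s: 0.0 for s in P}

    def q(s, a):
        return sum(p * (reward(s, a, s2) + GAMMA * v[s2]) for s2, p in P[s][a])

    for _ in range(sweeps):
        v = {s: max(q(s, a) for a in P[s]) for s in P}
    return {s: max(P[s], key=lambda a: q(s, a)) for s in P}


base = greedy_policy(lambda s, a, s2: R[s][a])
scaled = greedy_policy(lambda s, a, s2: 3.0 * R[s][a])
shifted = greedy_policy(lambda s, a, s2: R[s][a] + 5.0)
shaped = greedy_policy(lambda s, a, s2: R[s][a] + GAMMA * PHI[s2] - PHI[s])
print(base, scaled, shifted, shaped)
```

All four runs select the same action in every state, even though the values themselves differ.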

Answer 68

Somewhat true. You can create rewards that explore the MDP better and avoid sub-optimal states, but you can also create rewards that get caught in loops and do not converge.

Answer 69

False. If you do this the MDP will converge to those bonus states rather than trying to explore. You need to have some kind of penalty or time decayed reward for those states for them to be effective.

Answer 70

True. The optimal policy of the underlying MDP cannot be better than the one where unknown states are assigned an optimistic value of Rmax.

Answer 71

False. If you've found the best arm then you've found the best arm with probability 1-delta and there's no need to try other arms.

Answer 72

False. Because potential-based shaping modifies rewards, and those rewards are constant, whereas Q-values can change.

Answer 73

False. It refers to temporal assignment of credit. You are propagating value backwards.

Answer 74

False. Model-based refers to having a perfect information MDP (i.e., all transition functions, states, and actions are known and perfectly observable). You can either have the MDP given to you, or you can estimate it using sampling. Once you have the model, you can solve it using the Bellman Equations.

Answer 75

False. Because Value-Iteration requires a model.

Answer 76

True. Because they are taking different actions, they are seeing different states and rewards than the greedy policy.

Answer 77

False. Only linear.

Answer 78

False. POMDPs know the MDP, but they don't know where they are in the MDP. Instead of affirmative states, they have belief states and observables.

Answer 79

False. Grim Trigger strategy means a player will cooperate until the opponent does not; then the player defects (takes revenge) for the remainder of the game.
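
The strategy described above fits in one line of code. The function name and the opponent's move sequence are hypothetical, with "C" for cooperate and "D" for defect.

```python
def grim_trigger(opponent_history):
    """Cooperate until the opponent has defected once, then defect forever."""
    return "D" if "D" in opponent_history else "C"


# Hypothetical opponent that defects once in round 3 and then cooperates again.
opponent_moves = ["C", "C", "D", "C", "C"]
my_moves = [grim_trigger(opponent_moves[:t]) for t in range(len(opponent_moves))]
print(my_moves)
```

Once the defection is seen, the later cooperation by the opponent makes no difference.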

Answer 80

False. Bayesian RL is a statistical approach to RL that uses Bayes Rule, but is not necessarily model-based. You do not need to know the model ahead of time in order to use it.

Answer 81

True. Agents are acting based on local observable information to maximize a joint reward.

Answer 82

False, they solve environments regardless of whether they're continuous or discrete.
