final Flashcards

(226 cards)

1
Q

Markov means RL agents are amnesiacs and forget everything up until the current state.

A

True. A process is defined as Markov if the transition to the next state is fully defined by the current state (only the present matters).

However, the current state can be expanded to include past states to force non-Markov processes into a Markovian framework.

2
Q

In RL, recent moves influence outcomes more than moves further in the past.

A

False. One of the main ideas of RL is that the value of a state or action may depend on decisions made far back in the history. The discount factor gamma modulates this: larger gamma values (but strictly less than one for infinite-horizon problems) make the agent care more about long-term rewards.

e.g., you can lose a chess game on your first move.

3
Q

In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s rewards (terminal and non-terminal) the optimal policy will not change

A

True. Adding the same scalar value to all states leaves the underlying MDP unchanged.

4
Q

An MDP given a fixed policy is a Markov Chain with rewards.

A

True. A fixed policy means a fixed action is taken in each state. The process then depends only on the current state and is reduced to a Markov chain with rewards.

5
Q

It is not always possible to convert a finite horizon MDP to an infinite horizon MDP.

A

False. We can always convert to infinite horizon by adding a terminal state with a self-loop (with a reward of 0).

6
Q

If we know optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix.

A

False. Knowing the optimal Q values is essentially the same as knowing the optimal V values.

i.e., V*(s) = max_a Q*(s, a), which requires no knowledge of the transition function.

7
Q

The value of the returned policy is the only way to evaluate a learner.

A

False. Time and space complexities of the learner are also among the indicators.

8
Q

The optimal policy for any MDP can be found in polynomial time.

A

True. For any (finite) MDP, we can form the associated linear program (LP) and solve it in polynomial time (via interior point methods, for example, although for large state spaces an LP may not be practical). Then we take the greedy policy and voila!

9
Q

A policy that is greedy - with respect to the optimal value function - is not necessarily an optimal policy.

A

False. Optimal action on an optimal value function is optimal by definition.

10
Q

In TD learning, the sum of the learning rates must converge for the value function to converge.

A

False. The sum of the SQUARES of the learning rates must converge for the value function to converge.

The sum of learning rates must diverge for the value function to converge.
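In symbols, these are the standard stochastic-approximation conditions on the learning rates alpha_t:

    \sum_{t=1}^{\infty} \alpha_t = \infty \qquad \text{and} \qquad \sum_{t=1}^{\infty} \alpha_t^{2} < \infty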

11
Q

Monte Carlo is an unbiased estimator of the value function compared to TD methods. Therefore it is the preferred algorithm when doing RL with episodic tasks.

A

False. TD(1) is actually equivalent to MC. It is true that MC is an unbiased estimator compared to TD, but it has high VARIANCE. TD, on the other hand, has high bias but low variance, which often makes it better for learning from sequential data.

12
Q

Backward and forward TD(lambda) can be applied to the same problems.

A

True (S&B chapter 12 discusses these views and shows their equivalence). However, in practice backward TD(lambda) is usually easier to compute.

13
Q

Offline algorithms are generally superior to online algorithms.

A

False. It depends on the problem context. Online algorithms update values as soon as new information is available and make efficient use of each experience.

14
Q

Given a model (T,R) we can also sample in, we should first try TD learning.

A

False. You have a model - use it!

Model-free methods such as TD learning and Q-learning are better suited when the model (transitions and rewards) is not known.

15
Q

TD(1) slowly propagates information, so it does better in the repeated presentations regime rather than with single presentations.

A

False. TD(0) propagates slowly while TD(1) propagates information all the way in each presentation.

TD(1) is equivalent to MC, so it samples from full trajectories. TD(0) is a one-step method and suffers from slow propagation if samples aren’t presented repeatedly until convergence.

16
Q

In TD(lambda), we should see the same general curve for best learning rate (lowest error), regardless of lambda value.

A

True. We see the same general curve shape regardless of the lambda value, although the lowest error achieved differs across lambdas (Fig. 4 in Sutton 1988).

Sutton TD experiments.

17
Q

In general, an update rule which is not a non-expansion will not converge.

A

False. Coco-Q is an exception as noted in Lecture 4. In general though, you should expect this to hold up.

..confusing..

18
Q

MDPs are a type of Markov game.

A

True, MDPs are a special case of Markov games, where there is only one agent.

19
Q

Contraction mappings and non-expansions are concepts used to prove the convergence of RL algorithms, but are otherwise unrelated concepts.

A

False. The two are closely related: a contraction mapping is a special case of a non-expansion (one whose modulus is strictly less than one).

20
Q

Linear programming is the only way we are able to solve MDPs in linear time.

A

False. Linear programming can solve MDPs in polynomial time, not linear time, and LP formulations quickly become impractical for MDPs with large state spaces.

21
Q

The objective of the dual LP presented in lecture is minimization of “policy flow”. (The minimization is because we are aiming to find an upper bound on “policy flow”.)

A

False, the objective of dual LP is to maximize the “policy flow”.

22
Q

Any optimal policy found with reward shaping is the optimal policy for the original MDP.

A

False. Only potential-based reward shaping is guaranteed to preserve the original MDP's optimal policy.

23
Q

Potential-based shaping will find an optimal policy faster than an unshaped MDP.

A

False, this depends on the selected potential. It is possible to get stuck in a sub-optimal loop before eventually finding the optimal policy.

not necessarily..

24
Q

Rmax will always find the optimal policy for a properly tuned learning function.

A

False, Rmax is not guaranteed to find the optimal policy. But Rmax can help obtain near optimal results.

25
Q-learning converges only under certain exploration decay conditions.
False, Q-learning can converge even when random actions are performed since this is an off-policy algorithm.
26
The trade-off between exploration and exploitation is not applicable to finite bandit domains since we are able to sample all options.
False. The trade-off still applies: depending on the confidence level we feel comfortable with, we eventually stop exploring arms we "believe" are not optimal.
27
Since SARSA is an “online” algorithm, it will never learn the optimal policy for an environment.
False. "online" does not matter, SARSA converges with infinite exploration at the limit and a greedy policy.
28
online vs offline learning
Online means learning from each sample as it comes in whereas offline means learning from the corpus of data as a whole after the fact.
29
on policy vs off policy
On-policy methods use the actual learned policy to pick the next actions. Off-policy methods estimate the Q values (state-action values) for a target policy regardless of the policy the agent is following; for example, Q-learning updates greedily despite the fact that it is not following a greedy policy.
30
RL with linear function approximation only works on toy problems.
False. With suitable feature engineering, linear function approximation can be applied broadly.
31
KWIK algorithms are an improvement of MB (mistake bound) algorithms and as such will have same or better time complexity for all classes of problems.
False. KWIK may have worse time complexity than MB for some classes of problems, such as when the hypothesis space is very large.
32
With a classic update using linear function approximation, we will always converge to some values, but they may not be optimal.
False. With a classic update using linear function approximation, you are not guaranteed to converge; Baird's star problem is a counter-example.
33
Residual algorithms are always guaranteed to converge, but often do so too slowly to be practical.
True. Convergence is guaranteed. It can be slow if you choose psi poorly, but it is fast for some values of psi.
34
Applying generalization with an "averager" on an MDP results in another MDP
True. An averager over an MDP can itself be viewed as another MDP (over the anchor/basis states), which is why convergence results carry over.
35
Problems that can be represented as POMDPs cannot be represented as MDPs.
False. You can go from a POMDP over some state space to a (typically much larger) MDP over the belief space. POMDPs are a generalization of MDPs, and any MDP can be represented as a POMDP with fully observable states.
36
The only algorithms that work in POMDPs are planning algorithms
False. Many learning algorithms also work in POMDPs, not just planning algorithms: Q-learning, SARSA, etc.
37
We can represent any POMDP as a PSR (predictive state representation), however some classes of POMDPs will be too large to deal with as PSRs.
False. Any system that can be represented as a POMDP can be represented as a PSR; the size of the POMDP does not affect whether it can be represented as a PSR.
38
Options over an MDP form another MDP.
False. Options over an MDP form a semi-MDP, in which an atomic one-step action of the MDP is abstracted into a sub-policy executed over k steps.
39
Nash equilibria can only be pure, not mixed.
False. Nash equilibria can be either pure or mixed. The notion of a Nash equilibrium is that no agent would be willing to unilaterally change its strategy, and that holds for mixed strategies as well.
40
An optimal pure strategy does not necessarily exist for a two-player, zero-sum finite deterministic game with perfect information.
False. A two-player, zero-sum, finite, deterministic game of perfect information always has an optimal pure strategy, and maximin = minimax in this game type; von Neumann's theorem guarantees that a pure strategy exists for each player in this situation.
41
The "folk theorem" states that the notion of threats can stabilize payoff profiles in one-shot games.
False. The folk theorem states that the threat of retaliation opens the door to cooperation, which requires games that are not one-shot: the notion of threats can stabilize payoff profiles in repeated games.
42
If following the repeated game strategy "Pavlov", we will cooperate until our opponent defects. Once an opponent defects we defect forever.
False. That is Grim Trigger. In Pavlov, we cooperate when the previous round's moves agreed and defect when they disagreed.
43
Correlated equilibria rely on coordination, like side payments.
False. Coco uses side payments. Correlated equilibria can be reached without side payments. CE requires a third-party to provide some level of information about what the other party will do. The third party is like a consultant, but not a middleman for a payment.
44
"Subgame perfect" means that every stage of a multistage game has a Nash equilibrium.
False. The wording is misused here: subgame perfect refers to a strategy profile that remains a Nash equilibrium in every subgame, i.e., the strategy is always a best response independent of history, including the starting state.
45
Inverse RL means that we invert the reward function before putting an agent in an environment.
False, Inverse RL utilizes behavior observations and an environment to create a reward structure. IRL means that we're learning the reward function based on observed behaviors, rather than classical RL where we know the reward function in advance and want to learn behaviors to maximize the reward.
46
DEC-POMDPs include communication, this communication allows agents to plan.
True. The agents must plan to balance communication and action, since each has a cost; the agents take independent actions but share a single reward function.
47
Policy shaping requires a completely correct oracle to give the RL agent advice.
False, policy shaping requires a certain amount of confidence, but that does not mean it needs a completely correct oracle.
48
supervised learning
y=f(x) - Function approximation - Given x’s and y’s, find f
49
unsupervised learning
utilizes clustering and description to find structure in unlabeled data: given only x's, find a function f that organizes or labels the observations based on their attributes.
50
Reinforcement learning
finds an underlying function f that maps an observation x to a prediction/action y, guided by a reward z, such that y = f(x) given z. Given x's and z's, find f and y.
51
Markov Decision Process (MDP)
A framework of discrete states and actions to model decision making. States S; actions A; transition function T(s, a, s') between states; reward model R(s, a, s') describing the reward for moving from state s to s' via action a. A policy Pi(s) is the function that returns an action a for any state s. The optimal policy, referred to as Pi*, maximizes long-term expected reward.
52
Markovian Property
Only the present matters. That is, we only need to evaluate the current state; the sequence of events taken to arrive at the current state does not matter. Historical states/actions can be rolled into the current state representation to still satisfy the Markovian property.
53
Temporal Credit Assignment problem (TCA)
For a reward that arrives at the end of a sequence of events, how do we determine which events were responsible for it? You may not know your immediate rewards based on the state and action you just took; only after you complete a sequence of actions may you receive a delayed reward. The issue is compounded by the fact that a reward now may be more impactful than a reward in the future, so minor changes to the reward model matter.
54
Infinite Horizons (aka stationary)
refers to the scenario where the game can be played indefinitely (until some known terminal state is reached), whereas a finite horizon refers to a scenario with a specific number of time steps. The optimal policy may depend on whether we have an infinite, large, or small number of time steps remaining.
55
stationarity of preferences (utility of sequences)
the assumption that preferences over sequences of states do not depend on the time step: if U(S0, S1, S2, ...) > U(S0, S1', S2', ...) then U(S1, S2, ...) > U(S1', S2', ...). When considering a game with an infinite horizon, if the utility of one sequence is greater than the utility of another, the inequality holds from any starting time step.
56
The sum of all rewards over an infinite horizon would be infinite
True, if rewards are undiscounted; this is what motivates discounting.
57
The sum of all discounted rewards over an infinite horizon results in a geometric series bounded by a maximum reward
true
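As a quick check, with rewards bounded by R_max and discount gamma in [0, 1), the geometric series gives:

    \sum_{t=0}^{\infty} \gamma^{t} R(s_t) \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max} \;=\; \frac{R_{\max}}{1-\gamma}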
58
Utility vs Reward
Reward is the immediate benefit from taking an action or transitioning to a state, while the utility is the long term benefits of being in a certain state.
59
Optimal Policy
the policy, Pi*, that maximizes the expected value of our utility function were it to be followed. The optimal action for any given state can then be written as the action that maximizes the expected utility.
60
utility for any given state S
is given as the expected value of the discounted sum of rewards when following a certain policy starting from that state
61
the utility of being in any given state can be represented by the Bellman Equation
true
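For reference, the Bellman equation in question (in the R(s), T(s, a, s') notation used in these cards) is:

    U(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s')\, U(s')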
62
value iteration
Start with arbitrary utilities for each state, update each state's utility according to the Bellman equation, and repeat until convergence. The reason this approach works is that the true R(s) is injected for every state on each sweep, so the estimates absorb more and more of the MDP's real dynamics.
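A minimal Python sketch of this procedure on a small made-up MDP; the two-state MDP, the dictionaries T and R, and all names are illustrative assumptions, not taken from the lectures:

    # Toy 2-state MDP (illustrative, not from the lectures): rewards on states,
    # stochastic transitions. T[s][a] = list of (probability, next_state).
    GAMMA = 0.9
    STATES = ["s0", "s1"]
    ACTIONS = ["left", "right"]
    R = {"s0": 0.0, "s1": 1.0}
    T = {
        "s0": {"left": [(1.0, "s0")], "right": [(0.8, "s1"), (0.2, "s0")]},
        "s1": {"left": [(1.0, "s0")], "right": [(1.0, "s1")]},
    }

    def value_iteration(tol=1e-6):
        U = {s: 0.0 for s in STATES}              # start with arbitrary utilities
        while True:
            delta = 0.0
            for s in STATES:
                # Bellman update: reward + best discounted expected future utility
                best = max(sum(p * U[s2] for p, s2 in T[s][a]) for a in ACTIONS)
                new_u = R[s] + GAMMA * best
                delta = max(delta, abs(new_u - U[s]))
                U[s] = new_u
            if delta < tol:                       # stop when no state's value moves much
                return U

    print(value_iteration())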
63
policy iteration
Start with an initial policy, evaluate the utility of each state under that policy, and then improve the policy by choosing, in each state, the action that maximizes that utility; repeat until the policy stops changing.
64
behavior structure: plan
a fixed sequence of actions. Plans are poor during learning because we don't know enough about the environment, and they fail when transitions are stochastic. Issue: during learning it's hard to make a plan, though after the agent has learned the game it can. Issue: doesn't work with stochasticity (i.e., when there is a probability that an intended action doesn't actually work as intended).
65
behavior structure: conditional plan
a plan that includes if-statements depending on environmental variables. This is not the same as "dynamic replanning" (making a new plan when something doesn't work as intended); instead, the if-statements are built into the original plan. Since state is the only variable, we must have a state representing every relevant variable in the environment, so conditional plans are huge.
66
behavior structure: universal/stationary plan
a mapping from states to actions (similar to an MDP policy); essentially you have the same if-statements available at every state. The space of such plans is very large, which makes the learning problem difficult, but there is always an optimal stationary policy of this form.
67
evaluating a policy
First sum the discounted rewards and take the expected value of this return. In addition to the return, we should also evaluate computational complexity (time and space) and experience/sample complexity (how much data the learner needs).
68
planning
You are given an MDP and you find an optimal policy.
69
off policy
estimates the Q values (state-action values) for a target policy regardless of the policy being followed by the agent. Off-policy learning has two parts: the behavior policy (how data is collected) and the estimated/target policy (what is learned from the data). Q-learning is off-policy: it estimates the return (total discounted future reward) for state-action pairs assuming a greedy policy is followed, despite the fact that it is not following a greedy policy. Its update rule uses the action with the highest Q value in the next state, while the actual behavior policy may choose another action.
70
on policy
updates its Q values using the Q value of the next state s' and the action a' chosen by the current policy. It estimates the return for state-action pairs assuming the current policy continues to be followed. SARSA is an example of an on-policy TD control algorithm.
71
Decaying Reward encourages the agent to end the game quickly instead of running around and gathering more reward
True, as reward decays the total reward for the episode decreases, so the agent is encouraged to maximize total reward by ending the game quickly.
72
R(s) and R(s,a) are equivalent.
True, it just happens that it's easier to think about one vs the other in certain situations.
73
Reinforcement Learning is harder to compute than a simple MDP.
True, you can just use the Bellman Equations for an MDP, but Reinforcement Learning requires that you make observations and then summarize those observations as values.
74
An optimal policy is the best possible sequence of actions for an MDP.
True, with a single caveat. The optimal policy is a policy that maximizes reward over an entire episode by taking the argmax of resulting values of actions + rewards. But MDPs are memoryless, so there is no concept of "sequence" for a policy.
75
Temporal Difference Learning is the difference in reward you see on subsequent time steps.
False. Temporal difference learning is based on the difference in value estimates on subsequent time steps.
76
RL falls generally into 3 different categories: Model-Based, Value-Based, and Policy-Based.
True. Model-based methods essentially learn a model and solve it with the Bellman equations; value-based methods (temporal difference learning falls here) learn value functions directly; policy-based methods search directly over policies rather than values.
77
TD Learning is defined by Incremental Estimates that are Outcome Based.
True. TD learning thinks of learning in terms of experienced outcomes (episodes), incrementally updating its estimates from them rather than relying on a predefined model.
78
Model/transition model
a function of a state, an action, and another state: T(s, a, s') gives the probability of transitioning to the next state s' given that you are in state s and perform action a. The model doesn't change over time, i.e., it is stationary.
79
Policy is the solution to the Markov Decision Process
True. The policy tells you, given a state s, which action a to take. The optimal policy pi* is the policy that maximizes the long-term expected reward. The policy is defined for every state, so given a state it outputs an action (it doesn't tell you a sequence of actions).
80
The Bellman Equation is non-linear
true because of the max function
81
steps to solve bellman equation
1) Start with arbitrary utilities 2) Update utilities based on neighbor 3) Repeat step 2 until convergence
82
policy relationship with utility
If you have the utilities, you can find pi (the policy) by taking the argmax; however, many different utility functions can give rise to the same policy, so the mapping is not one-to-one.
83
policy iteration steps
1) Start with an initial guess of the policy pi_0. 2) Evaluate: given pi_t, calculate U_t = U^{pi_t}. 3) Improve: pi_{t+1}(s) = argmax_a sum_{s'} T(s, a, s') U_t(s'). 4) Repeat steps 2 and 3 until convergence.
84
model-based RL
Takes state/action/reward and feeds it into a modeller; the modeller learns the transitions and rewards, which feed into an MDP solver; the solver outputs an optimal value function Q*; taking the argmax over actions gives the policy. The learning part is where you take new experience and feed it back into the modeller to learn a better model.
85
model-free (value function based) RL
Take the state/action/reward to form a value estimate, output an estimate of the optimal value function Q, keep iterating on that value estimate, then take the argmax over actions to get the policy.
86
policy based RL / policy search
Take the state/action/reward and keep iterating on the policy until it converges
87
Temporal Difference Learning: TD(lambda)
Essentially learning how to predict over time. Looking at a sequence of states from the initial to the final state (and the rewards obtained between states), we try to predict the expected discounted reward. You can think of TD(0) as considering only immediate information and TD(1) as considering all information. The value of a state (no actions considered) is the immediate reward plus the discounted value of what follows. In TD(0), you update the value of the state you just left based on the error between the next reward plus the predicted next value and your current estimate, and keep the same values as before for all other states.
88
properties of learning rates
The limit of V_t(s) approaches the actual value V(s) when t is big enough, provided: the sum of all learning rates is infinite (i.e., unbounded) - they must be big enough that the estimate can be moved all the way to the actual value of the state; and the sum of the squares of all learning rates is finite - they must shrink enough that the noise doesn't overwhelm the estimate.
89
k step estimators vs td(lambda)
Essentially, TD(lambda) is just a weighted linear combination of all the k-step estimators. Each estimator tells you its preference for what V(s) should be moved toward, and each update moves V(s) toward the weighted combination of those estimates. If lambda = 1, all the weight is on E_infinity, i.e., TD(1); if lambda = 0, all the weight is on E_1, i.e., TD(0); if lambda is in (0, 1), we have a weighted combination of all the estimators in between.
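In symbols, with G_t^{(k)} denoting the k-step estimator (E_k above), the TD(lambda) target is the lambda-return:

    G_t^{\lambda} = (1-\lambda) \sum_{k=1}^{\infty} \lambda^{k-1} G_t^{(k)}

so lambda = 0 puts all weight on the one-step return and lambda = 1 recovers the full (Monte Carlo) return.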
90
Bellman operator
Let B be an operator, i.e., a mapping from value functions to value functions. E.g., you give it a Q function and it gives you back a possibly different Q function, looking up the value of the next state using the Q function provided. Q* = BQ* is another way to write the Bellman equation; Q_t = BQ_{t-1} is another way to write value iteration.
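Written out for Q functions (using the T and R notation from the MDP cards above), the operator is:

    [BQ](s, a) = R(s, a) + \gamma \sum_{s'} T(s, a, s') \max_{a'} Q(s', a')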
91
contraction mappings
B is a contraction mapping if, for all value functions F and G and some 0 <= gamma < 1, ||BF - BG||_inf <= gamma * ||F - G||_inf. I.e., the distance between B applied to F and B applied to G is at most gamma times the original distance between F and G: it makes things get closer together.
92
contraction properties
If B is a contraction mapping, then: 1) F* = BF* has a solution and it is unique; 2) F_t = BF_{t-1} implies F_t -> F*, i.e., as you use value iteration to update the function, it converges to the fixed point F*. Every time you apply the operator, your values are guaranteed to be closer to F* than they were before.
93
Bellman operator contracts
Expanding the contraction property for two Q value functions: it reduces to the Bellman equation evaluated at the state-action pair where the two Q functions are farthest apart. Both Q functions contribute the same immediate reward, so the difference comes from the discounted max over next actions, which may pick two different a' (the best next action can differ between the two Q functions).
94
Max is a non-expansion
true
95
If you get a good enough value function, then you can put a bound on how far away you are from optimal value function
True. You don't even need to know how far away you are; you just need to stop when the change in value of every state between iterations is less than some epsilon.
96
Value iteration doesn’t allow you to solve the MDP in polynomial time
True. Linear programming, by contrast, is an optimization framework in which you specify linear constraints and a linear objective function, and it can solve the problem in polynomial time.
97
linear programming / bellman equation
A max can be encoded with linear constraints: to express x = max(-3, 7, 2, 5), require x to be at least each element, then minimize x so it settles on the largest one. Using this trick, the Bellman max becomes a set of linear constraints, and the objective is to minimize the value function summed over all states. This is called the PRIMAL.
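As a sketch of that primal formulation (notation as in the earlier MDP cards, with rewards written R(s, a) here):

    \min_{V} \sum_{s} V(s) \quad \text{s.t.} \quad V(s) \ge R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V(s') \;\; \forall\, s, a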
98
duality (linear programming)
a property where you can change the variables into constraints, and constraints into variables
99
policy flow
q_sa: how much "agent-ness" is running through each state-action pair under the policy. This is subject to: the policy flow into each next state should equal the number of times that state is visited (the sum over all state-action pairs leading to it of the policy flow through that pair times the transition probability, discounted); and the policy flow at each state-action pair has to be non-negative.
100
policy iteration 2
1) Initialize the Q values of all states arbitrarily (e.g., all zeros). 2) Compute the policy pi_t from the current Q function. 3) Evaluate that policy to get a new Q function, and assign it as the next time step's Q function. 4) Repeat steps 2 and 3 until convergence.
101
properties of policy iteration
Q_t -> Q*; convergence is exact and complete in finite time; converges at least as fast as value iteration.
102
trade offs with value iteration
Faster convergence, at the expense of taking more computational resources
103
Why might we want to change the reward function for an MDP?
To make the MDP easier to solve/represent while remaining similar to what would have been learned anyway; because we don't have a reward function yet; and for other reasons such as semantics and efficiency (speed, space, solvability).
104
MDPs are usually defined as
true
105
ways to change reward functions
- Multiply by positive constant - Shifting by constant - Non-linear transformations (potential-based rewards)
106
reward shaping
Although you are encouraging the learner to move toward the goal by setting intermediate rewards, you may be introducing spurious optimal policies, so we need a mechanism to prevent these local optima. Potential functions provide one: instead of putting intermediate rewards on actions occurring, put intermediate rewards on states of the world.
107
Potential based shaping
Here Psi(s) represents the bonus reward you get for being in state s. The shaped reward for moving from state s to s' via action a subtracts Psi(s) for the state you just left and adds gamma*Psi(s') for the state you're entering. This means that Q'(s,a) = Q(s,a) - Psi(s), which doesn't affect the optimal policy because subtracting a state-only term doesn't change the argmax over actions.
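In equation form, the shaped reward described here is:

    R'(s, a, s') = R(s, a, s') - \Psi(s) + \gamma\, \Psi(s')

so Q'(s, a) = Q(s, a) - \Psi(s) and the argmax over actions (hence the optimal policy) is unchanged.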
108
Q learning with potentials
It's worth noting that Q-learning with a potential function ends up being the same as regular Q-learning with the Q function initialized to the potential function. The same actions occur because the argmax over a is the same, even though the Q values are different. So how you initialize your Q values puts emphasis on states where the initial values are bigger; this is an argument against random initialization, because initial values matter. Injecting a potential function can also affect learning time: if you initialize the Q values optimistically, you can bound the learning time.
109
K-armed bandits problem
- K bandits (i.e., slot machine), each with 1 arm - Each slot machine has a probability of winning a payoff - We don’t know which slot machine arm to pull because we don’t know what each of the payoffs are … if we knew, then we can just pull the argmax slot machine - Assume that the payoff is either 0 or 1 - 1 arm pull is not enough information to know the probabilities/payoffs of each slot machine observed - Denominator is number of pulls - Numerator is observed number of payoffs
110
Confidence based exploration (bandits)
Maximum likelihood - after each pull, recalculate the expected values and choose the arm with the highest expected value; in the example, if 'a' never returned any payoff but 'b' returned at least one, this will always choose 'b', since 'a''s expected value stays at 0. Maximum confidence - whichever arm had the most observations initially will always be chosen. Minimum confidence (if the confidence is the same, choose randomly) - you will just uniformly cycle through all the slot machines.
111
Exploration-exploitation dilemma
Want to somehow balance maximum likelihood and minimum confidence
112
metrics for bandits
1) identify optimal arm in the limit 2) maximize (discounted) expected reward Want to do some exploration, but also exploitation so that you can take advantage of the observations you made during exploration However, very hard computationally to do this Gittins Index: s/d (s = # of successes, d = # of observations) can be mapped to some index such that if you did this for all slot machines and maximized on the index, you maximize the discounted expected reward - The idea is that what expected payoff is enough for you to always choose this specific slot machine, and never explore again - Effectively combines maximum likelihood and low confidence - Issue: only really works for the bandit problem, and doesn’t really generalize 3) Maximize expected reward over finite horizon - Similar to (2) but still hard to compute 4) identify near-optimal arm (epsilon) with high probability (1-delta) in time Tau(k,epsilon, delta) … polynomial in k, 1/delta, 1/epsilon) - I.e., don’t need to get the best arm, but just the one that’s epsilon away from the best with some probability 1-delta - Similar to PAC learning 5) nearly maximize reward (epsilon) with high probability (1-delta) in time Tau’(k, epsilon, delta) … polynomial in k, 1/delta, 1/epsilon 6) pull a non-near optimal arm (epsilon) no more than Tau’’(k, epsilon, delta) times with high probability (1-delta) - I.e., the number of times you pull a non-optimal arm is small … minimize mistakes
113
Hoeffding
X1, ..., Xn are random variables. We don't know the true mean u_i, but we can compute u_hat by maximum likelihood (i.e., the sample average). The more confidence we demand (smaller delta), the wider the confidence bound; the more data we have (larger n), the narrower the bound. This tells us that, given a way of knowing how close a sample estimate is to the hidden parameter, we can figure out how many times we have to pull an arm before we know fairly accurately what its payoff is, and therefore how many pulls of each arm are needed to be confident we have found something near-optimal.
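The bound in question is Hoeffding's inequality: for n i.i.d. samples bounded in [0, 1], with probability at least 1 - \delta,

    |\hat{\mu} - \mu| \;\le\; \sqrt{\frac{\ln(2/\delta)}{2n}}

so demanding more confidence (smaller delta) widens the interval, while more samples (larger n) narrow it.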
114
general stochastic MDP
Stochastic (Hoeffding bound until estimates are accurate) | Sequential (unknown state-action pairs assumed optimistic)
115
General RMAX algorithm
1) Keep track of the MDP learned so far. 2) Treat any unknown state-action pair as having value RMAX. Use a parameter c: a state-action pair is unknown unless we have tried it c times; after c tries, use the maximum-likelihood estimate. 3) Solve the MDP. 4) Take the action given by Pi*.
116
Explore or exploit lemma
If all transitions are either accurately estimated or unknown (and treated optimistically), the resulting policy is either near-optimal or reaches an unknown state "quickly." Therefore we can't run for very long without either learning something new or behaving near-optimally.
117
Generalization idea
We can assign features to states to generalize/categorize them. We can learn the policy, which maps states to actions; we can learn the value function, which maps state/action pairs to estimated returns; or we can learn the model (transitions and rewards) - the latter is similar to supervised learning, since you observe state/action pairs and the next state.
118
Basic update rule
We have a Q function that is a function of weights/parameters (w^a) and features (f(s)), fed into a function approximator (F). Given an experienced transition, we perform an update that moves the estimate a little closer toward the target. The TD error is the difference between the Bellman target and the current Q value, and we update each weight in proportion to how much that weight influenced the Q value. In a way we are bootstrapping: we are using our current predictions as a target for future predictions.
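For a linear approximator Q(s, a) = w_a \cdot f(s), the update sketched above can be written as:

    \delta = r + \gamma \max_{a'} Q(s', a') - Q(s, a), \qquad w_a \leftarrow w_a + \alpha\, \delta\, f(s)

(the gradient of Q with respect to w_a is just the feature vector f(s)).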
119
Partial Q(s,a)
Successes: 3-layer backprop (Tesauro, Wang/Dietterich - however, difficult to replicate), CMAC (Sutton), linear nets (Singh, Parr - very important that you choose the right features: 1) computationally efficient, 2) value-informative), deep nets (Mnih et al., Atari). Tesauro (TD-Gammon) -> Boyan/Moore (examples where things should have worked well but performed poorly, showing it need not work) -> Sutton (replicated Boyan/Moore's experiments but changed the way training happened and it worked, so it can work if you do it right). Pollack showed that genetic algorithms did well on backgammon too, so TD-Gammon itself is probably a forgiving example, likely due to it being consistently random so you don't fall into traps.
120
Baird’s Counter Example
if you update the sequence badly, then it can spiral out of control
121
Averagers
The averager is simply another MDP, so it converges. It's as if we move from one state to a second state, and from the second state make mini-transitions to all the basis (anchor) states, solving the MDP over the basis states in order to define the value function everywhere. Interpolate between anchor points: define some points x1, x2, x3 and a function approximator that fills in the rest between them. It doesn't matter what we do, as long as any point we are trying to evaluate is expressed as a convex combination of the anchor points - this means any point between two anchor points takes a value between theirs, and even beyond the outermost anchors the value is still defined with respect to the anchor points. Supervised learners that can be expressed as averagers: k-NN and distance-weighted k-NN (find the things closest to you and let them vote), and kernel functions.
122
POMDP
A way of talking about "non-Markov" environments. The underlying state isn't directly observable to the agent, so instead it makes decisions based on observations Z. O is the observation function that, given a state and an observation, gives the probability of seeing Z in state s. POMDPs generalize MDPs: if the POMDP really is an MDP, then the observables are simply the states and O(s, Z) = 1 iff s = Z.
123
State estimation
Belief state b(s): gives the probability that we are in state s - a vector, i.e., a probability distribution over states. The update is b, a, z -> b': we are taking the POMDP and turning it into a belief MDP. Rewards are not directly observed, but people may write the rewards as a function of the observables. We need to derive b'(s') = the probability of being in state s' after b, a, z. This turns the POMDP into an MDP with infinitely many states; linear programming, policy iteration, and value iteration are polynomial in the number of states, so they would take forever.
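The derivation referred to here gives (up to a normalizing constant):

    b'(s') \;\propto\; O(s', z) \sum_{s} T(s, a, s')\, b(s)

where the normalizer is the probability of observing z after taking action a from belief b.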
124
Value iteration in POMDPs
At t = 0, define the value to be 0 everywhere. The reward isn't directly a function of b and a, so it needs to be re-expressed over beliefs. Claim: the value of a belief state is piecewise linear and convex - it is the max over a set of linear functions (alpha vectors). Base case: there exists a set Gamma_0 of alpha vectors representing the value function at t = 0.
125
RL for POMDPs
Planning: you know everything (the model) - but planning in POMDPs is hard/undecidable. RL: you don't know everything, so you need interaction. Model-based RL: learn the model (the POMDP) and use it. Model-free RL: don't learn the model - map observations to actions directly.
126
Bayesian RL
The idea is to keep a Bayesian posterior over the possible MDPs we might be in and use it to optimally balance exploration and exploitation. Optimal policy: explore to figure out which MDP you're in, then exploit to get high reward. RL is actually planning in a kind of continuous POMDP whose hidden states are the parameters of the MDP.
127
What are the four (4) components (the "Problem" portion) of an MDP?
State, model, actions, reward
128
What two requirements are necessary to meet for a valid MDP?
1. Must fulfill Markovian property, i.e. only the present matters 2. It must be stationary, i.e. the physics of the world (the "model"), actions available, etc. must not change as a function of time.
129
What is a policy, and what is an optimal policy?
A policy is a mapping from states to actions. An optimal policy is one that maps states to actions in such a way that it maximizes the expected reward the agent will receive over time.
130
What is the difference between a plan and a policy?
A policy is a mapping from states to actions. A plan is simply a sequence of actions. It's an important distinction because the stochastic nature of the world means that just because you plan to go somewhere doesn't guarantee you'll get there. A policy is more general in that it simply says "if I'm in a state, what action do I take?"
131
What is the credit assignment problem?
It's the problem of figuring out which states and actions from far in the past were responsible for the outcome we see now.
132
What is the "Stationarity of Preferences"?
The idea that if I prefer a set of states today over another set today, then that same inequality will hold the next day, and the next day, and so on. This is important because it leads to the natural decision to simply add rewards together to calculate the utility (i.e. value) of a sequence of states.
133
What does "discounting" rewards achieve? What must be true of the discount factor gamma for this condition to be met?
It allows us to calculate an infinite series in finite time. Gamma must be 0<=gamma<1
134
What is the difference between a reward and utility (i.e. value)?
R != V, i.e. reward only encompasses immediate gratification. Value tells us how good it is to be in some state based on the immediate reward + the discounted future value (i.e. it accounts for DELAYED rewards).
135
Why does value iteration work?
Because at each step, we get another glimpse of reality, and as we take more and more steps, we encompass more and more of the real dynamics of the problem.
136
Does policy iteration always converge?
Yes
137
What does the 'Q' in Q values stand for?
Quality, more specifically, how good it is to take some action.
138
Why do we bother having V functions AND Q functions? Isn't V enough?
Q functions are very useful for learning from samples of data, because we don't need access to the transition function (which we do need when working with the value function, since value iteration requires calculating the value of the next state for each update).
139
What are the 3 "classes" of solution methods for solving RL problems? (bonus: what category do TD methods fall into?)
1. Model-based 2. Value-based (TD methods fall into this) 3. Policy-based
140
What properties must the learning rate have for RL?
1. SUM of all learning rate values must be infinite | 2. SUM OF SQUARES of learning rate values must be finite
141
Names some of the differences between TD(0) and TD(1)
TD(0): - Slow to propagate information - High bias, low variance - Maximum likelihood estimate (MLE) TD(1): - Equivalent to MC, samples full trajectories - Requires full trajectory in order to update - Low bias, high variance
142
What values lambda tend to work well (empirically speaking) when used in TD(lambda)?
0.3-0.7
143
Does Q-learning always converge? If so, what does it converge to?
Yes, it converges to the optimal values Q*, provided all state-action pairs are visited infinitely often and the learning rates satisfy the usual conditions.
144
What is non-expansion/contractions?
A contraction mapping brings any two value functions closer together by a factor gamma < 1 (||BF - BG|| <= gamma ||F - G||); a non-expansion never increases the distance (factor at most 1).
145
What things are contraction mappings / non-expansions?
Order statistics, FIXED convex combinations
146
RL with control vs RL without control
the difference is whether or not there are actions chosen by the learner...
147
what is a bellman operator
A transformation from value functions to value functions: feed in Q, get back a new value function BQ. Q* = BQ* is the Bellman equation; Q_{t+1} = BQ_t is value iteration.
148
what is a contraction mapping
Causes 'things' to get closer together: if for all F, G and some 0 <= gamma < 1, ||BF - BG||_inf <= gamma * ||F - G||_inf, then B is a contraction mapping. In other words, there is some gamma in [0, 1) such that applying B shrinks distances: |BX - BY| <= gamma * |X - Y|.
149
contraction mapping properties
If B is a contraction mapping: | F* = BF* has a solution and it is unique | F_t = BF_{t-1} => F_t -> F*, i.e., value iteration converges.
150
DEC-POMDPS allow us to wrap coordinating and communicating into choosing actions to maximize utility
True. Decentralized Partially Observable Markov Decision Processes: actions are taken simultaneously by a finite set of agents (not just one), and all agents share the reward function (they are working together).
151
DEC-POMDPs are represented by Ri (diff reward for each agent)
FALSE. There is a single shared reward (all agents are working together); if the reward were not shared, the model would be a POSG (partially observable stochastic game).
152
Properties of DEC-POMDP
1. Elements of game theory and POMDPS | 2. NEXP-complete (for finite horizon)
153
Inverse RL
Input: behavior and an environment; output: a reward function. No reward function is given; instead, the reward function is inferred from behavior observed from an expert. The idea is to mimic observed behavior, which is often optimal or close to optimal.
154
In reward shaping, if the human believes X, Y, Z with 2/3, 1/6, 1/6 and the algorithm believes 1/10, 1/10, 8/10. What action should they choose?
Choose action Z because its pairwise product is highest: argmax_a p(a|policy1) * p(a|policy2).
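Checking the arithmetic on the pairwise products:

    p(X) = \tfrac{2}{3}\cdot\tfrac{1}{10} = \tfrac{1}{15} \approx 0.067, \quad p(Y) = \tfrac{1}{6}\cdot\tfrac{1}{10} = \tfrac{1}{60} \approx 0.017, \quad p(Z) = \tfrac{1}{6}\cdot\tfrac{8}{10} = \tfrac{2}{15} \approx 0.133

so Z has the largest product (before renormalizing).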
155
Drama Management
There is a third agent, the "author," who wants to build an agent that influences the player's experience. E.g., the author created Pac-Man, the agent is the game itself, and the player is the player.
156
Value Iteration Algorithm
Start with arbitrary utilities; update utilities based on reward plus neighbors (discounted future reward); repeat until convergence.
157
markovian property details
Only the present matters AND things are stationary (rules/the world doesn't change over time)
158
Policy Iteration won't converge
False
159
TD update rule always converges with any learning rate
False. It converges only if the sum of the learning rates over time is infinite and the sum of the squared learning rates is finite.
160
TD(1) is the same as outcome-based updates (if no repeated states)
True. and with even more learning because updates don't have to wait for the episode to be over
161
Maximum Likelihood Estimate vs outcome-based estimate (TD(1))
Maximum likelihood uses all of the examples, but TD(1) uses just individual runs, so if a rare event happens TD(1)'s estimate can be skewed by it (high variance). (This leads us to TD(lambda).)
162
TD(0) is the same as maximum likelihood estimate
TRUE if we run over the data over and over again
163
TD(lambda) is weighted combination of k step estimators
True
164
TD(1) typically has less error than TD(0)
False. TD(1) typically has more error than TD(0)
165
TD(0) has the least amount of error usually
False. TD(lambda) performs best. Usually 0.3 - 0.7 is best
166
Temporal Difference is
The difference between value estimates as we go from one step to the next.
167
Model is the environment
False. Model is the agent's idea of the environment
168
RL vs Planning
RL: the model is unknown and the agent learns by performing actions. Planning: the model is known and the agent performs computations with it (no interaction needed).
169
Backward and forward TD(lambda) can be applied to the same problems.
True. However, the forward view requires waiting for a complete episode (i.e., it does not work "online"), whereas the backward view can be applied incrementally.
170
gamma of 0 means
you care only about the immediate reward (short-sighted)
171
gamma of 1 means
all future reward is equally weighted (far-sighted)
172
why use gamma and discount future rewards
avoid infinite returns and also account for a model not being perfect (future reward is not guaranteed)
173
You can always flatten an MDP to a Markov Chain
True: fixing a policy reduces the MDP to a Markov chain (with rewards).
174
Difference between MDP and Markov Reward Process
An MDP is an extension of a Markov chain: a Markov reward process adds rewards (giving motivation), and an MDP additionally adds actions (allowing choice).
175
it has been proven that both forward and backward view of TD(lambda) are equivalent
True
176
What is forward TD(lambda)
The forward view looks at all n-steps ahead and uses λ to essentially decay those future estimates.
177
What is backward TD(lambda)
The backward view of TD(λ) updates values at every step: after each step in the episode you update all previously visited states in proportion to their eligibility traces.
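A minimal Python sketch of the backward view with accumulating eligibility traces; the episode format (lists of (state, reward, next_state) tuples) and all names are illustrative assumptions:

    # Backward TD(lambda) for state-value prediction with accumulating traces.
    # Assumes terminal states appear only as next_state and keep value 0.
    def td_lambda_backward(episodes, states, alpha=0.1, gamma=0.9, lam=0.8):
        V = {s: 0.0 for s in states}
        for episode in episodes:                      # episode: list of (s, r, s_next)
            e = {s: 0.0 for s in states}              # eligibility traces
            for s, r, s_next in episode:
                delta = r + gamma * V[s_next] - V[s]  # one-step TD error
                e[s] += 1.0                           # bump trace for the state just left
                for x in states:                      # every state is updated in proportion
                    V[x] += alpha * delta * e[x]      # to how recently/often it was visited
                    e[x] *= gamma * lam               # traces decay by gamma * lambda
        return V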
178
Online learning vs offline learning
Online learning means that you are doing it as the data comes in. Offline means that you have a static dataset.
179
SARSA vs Q- Learning
In Q-learning, you update the estimate using the maximum over possible next actions, regardless of which action you actually take next. In SARSA, you update using the action you actually take next - the update and the behavior use the same action.
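A side-by-side sketch of the two update rules in Python, assuming Q is a dictionary keyed by (state, action) and initialized for all pairs; the function and variable names are illustrative:

    # Q-learning (off-policy): bootstrap from the best next action, whatever we do next.
    def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    # SARSA (on-policy): bootstrap from the action a_next the current policy actually takes.
    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
        target = r + gamma * Q[(s_next, a_next)]
        Q[(s, a)] += alpha * (target - Q[(s, a)])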
180
In general, an update rule which is not a non-expansion will not converge.. duplicate?
True, in general. A non-expansion is {constant, contraction}; an update rule that is not a non-expansion is an expansion, and expansions can diverge (though exceptions such as Coco-Q exist).
181
Contraction mappings and non-expansions are concepts used to prove the convergence of RL algorithms, but are otherwise unrelated concepts. duplicate?
False | They are related: a non-expansion composed with a contraction is still a contraction, which is how convergence guarantees are obtained (e.g., the max non-expansion inside the Bellman operator).
182
examples of non-expansion
order statistics (max, min), fixed convex combinations
183
Policy Iteration converges at least as fast as VI
True
184
Domination
policy 1 dominates policy 2 if for all states the value of that state for policy 1 is greater than or equal to the value of that state for policy 2
185
policy iteration won't get stuck in a local optimal
True
186
T/F initializing q values is the same as using potential based functions
True: initializing the Q values to a potential function is equivalent to potential-based shaping. Initializing to 0 is neutral; random initialization means you are biasing the agent toward arbitrary states.
187
Potential-based shaping will find an optimal policy faster than an unshaped MDP.
False... if the selected potential is wrong, it will slow things down.
188
potential functions can speed things up
True
189
Rmax will always find the optimal policy for a properly tuned learning function.
False (although true in spirit). The problem is the word 'always.' As Jacob alludes to, in these PAC-style bounds, there is always a small probability that the desired goal isn't reached, and if reached it's not quite optimal.
190
one shot games
This is a game that is played only once. The payoff may be such that the game is impossible to play twice (e.g., mutually assured nuclear destruction; slightly different with tactical/conventional warfare). In one-shot interactions, people often have an incentive to behave opportunistically/selfishly; consider a one-shot Prisoner's Dilemma game where an individual wants to minimize their own sentence.
191
repeated games
A game played more than once. A crucial point is the reaction to a defection by another player. Tit-for-Tat strategy: if you defect, I defect in the next round. Grim strategy: if you defect, I will defect in all future rounds.
192
Q-learning converges only under certain exploration decay conditions.
False. Q-learning's convergence does not hinge on exploration decay; it is guaranteed by the contraction property of the Bellman operator (given suitable learning rates and that all state-action pairs keep being visited), which makes no assumption about the exploration rate.
193
The tradeoff between exploration and exploitation is not applicable to finite bandit domains since we are able to sample all options. Dup?
False. You may still need to explore "non-optimal" arms of the bandit since there may be noise in the estimate of its value.
194
Since SARSA is an “online” algorithm, it will never learn the optimal policy for an environment.
False. SARSA converges to an optimal policy as long as all state-action are visited an infinite number of times and the policy converges to a greedy policy
195
KWIK algorithms are an improvement of MB (mistake bound) algorithms and as such will have same or better time complexity for all classes of problems.
False. Some hypothesis classes are exponentially harder to learn in the KWIK setting than in the MB setting.
196
With a classic update using linear function approximation, we will always converge to some values, but they may not be optimal.
False. Baird's counter-example shows how classic update using linear function approximation can lead to divergence.
197
Applying generalization with an "averager" on an MDP results in another MDP.
True. As noted in Gordon (1995), you will have fewer states...
198
Hoeffding and union bounds
a way to quantify how much we know (certainty), i.e., to know when we are certain enough
199
Rmax
optimism in the face of uncertainty
200
Can combine Rmax bandits
stochastic + sequential
201
When does generalization work
essentially anything that generalizes a linear model (ANN, CMAC, linear nets, deep nets)
202
Generalization duplicate
some state may look like other states so take into account features of states and weight the importance of the feature when making a decision. Replace update rule with function approximation
203
averagers can converge to the optimal solution (but do not always)
True. Needs well-placed anchor points to converge.
204
value iteration converges for POMDPs
False, but you can get near-optimal
205
Model based POMDP
use expectation maximization to learn the model
206
Properties of Monte Carlo Tree Search
useful for large state spaces; needs lots of samples to get good estimates; planning time is independent of the size of the state space; running time is exponential in the horizon
207
Gittens index only works for
bandits
208
How does monte carlo tree search get q value estimates
random simulations
209
Goal abstraction
Modular RL and arbitration
210
Temporal abstractions (hierarchical RL)
options and semiMDPs
211
Pavlov strategy
cooperate if agree, defect if disagree
212
Subgame perfect
always best response independent of history
213
Properties of minimax-Q for 2-player, zero-sum games
value iteration works, minimax-Q converges, efficient
214
Properties of Nash-Q for multiplayer (general-sum) games
value iteration does not work, Nash-Q does not converge
215
Does coco generalize to n>2 players
no
216
what is coco
cooperative competitive values
217
POMDPs can be planned
False in general: exact planning in POMDPs is intractable (undecidable in the infinite-horizon case), though near-optimal approximations exist.
218
Semi-MDP
Instead of working only with a small set of primitive actions, we can create new temporally extended actions (options) whose transitions span multiple steps, which can reduce the number of decisions needed to learn and move toward a goal.
219
because SMDPs abstract an MDP, they inherit the hierarchical optimality, convergence, and stability of an MDP
FALSE; this is true only with the caveat that we choose the right options. The abstraction also means some state details no longer matter, which helps with exploration.
220
residual algorithms
A new class of algorithms, residual gradient algorithms, is proposed, which perform gradient descent on the mean squared Bellman residual, guaranteeing convergence. It is shown, however, that they may learn very slowly in some cases. A larger class of algorithms, residual algorithms, is proposed that has the guaranteed convergence of the residual gradient algorithms, yet can retain the fast learning speed of direct algorithms. In fact, both direct and residual gradient algorithms are shown to be special cases of residual algorithms, and it is shown that residual algorithms can combine the advantages of each approach. At any given time during learning, the Q function that has been learned so far may not exactly satisfy the Bellman equation. The Bellman residual is the difference between the two sides of the equation. This is slightly different from the TD error. The TD error is the difference between the two sides of the equation without the expected value, evaluated after just a single transition for (x,u). So the Bellman residual is the expected value of the TD error. The TD error is the Bellman residual plus zero-mean noise.
221
dual LP
The dual of a given linear program (LP) is another LP that is derived from the original (the primal) LP in the following schematic way: Each variable in the primal LP becomes a constraint in the dual LP; Each constraint in the primal LP becomes a variable in the dual LP; The objective direction is inversed – maximum in the primal becomes minimum in the dual and vice-versa. The solution to the dual problem provides a lower bound to the solution of the primal (minimization) problem.[1]
222
primal LP
In mathematical optimization theory, duality or the duality principle is the principle that optimization problems may be viewed from either of two perspectives, the primal problem or the dual problem. The solution to the dual problem provides a lower bound to the solution of the primal (minimization) problem.[1] However in general the optimal values of the primal and dual problems need not be equal.
223
pure nash
In plain terms, a pure Nash equilibrium is a strategy profile in which no player would benefit by deviating, given that all other players don't deviate. Some games have multiple pure Nash equilibria and some games do not have any pure Nash equilibria.
224
mixed nash
Mixed strategy Nash equilibria are equilibria where at least one player is playing a mixed (random) strategy.
225
folk theorem
The Folk Theorem suggests that if the players are patient enough and far-sighted, then repeated interaction can result in virtually any average payoff in an SPE equilibrium.
226
Probably approximately correct learning (PAC learning)
In computational learning theory, probably approximately correct learning is a framework for mathematical analysis of machine learning. It was proposed in 1984 by Leslie Valiant. In this framework, the learner receives samples and must select a generalization function from a certain class of possible functions. helps analyze whether and under what conditions a learner L will probably output an approximately correct classifier.