final Flashcards

(226 cards)

1
Q

Markov means RL agents are amnesiacs and forget everything up until the current state.

A

True. A process is defined as Markov if the transition to the next state is fully defined by the current state (only the present matters).

However, the current state can be expanded to include past states to force non-Markov processes into a Markovian framework.

2
Q

In RL, recent moves influence outcomes more than moves further in the past.

A

False. One of the main ideas of RL is that the value of a state or action may depend on decisions made far back in the history. The discount factor gamma modulates this: larger gamma values (but strictly less than one for infinite-horizon problems) make the agent care more about long-term rewards.

e.g., you can lose a chess game on your first move.

3
Q

In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s rewards (terminal and non-terminal) the optimal policy will not change

A

True. Adding the same scalar value to all states leaves the underlying MDP unchanged.

4
Q

An MDP given a fixed policy is a Markov Chain with rewards.

A

True. A fixed policy means a fixed action is taken in each state. The process then depends only on the current state and is reduced to a Markov chain with rewards.

5
Q

It is not always possible to convert a finite horizon MDP to an infinite horizon MDP.

A

False. We can always convert to infinite horizon by adding a terminal state with a self-loop (with a reward of 0).

6
Q

If we know optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix.

A

False. Knowing the optimal Q values is essentially the same as knowing the optimal V values.

i.e., V*(s) = max_a Q*(s, a), which requires no knowledge of the transition function.

7
Q

The value of the returned policy is the only way to evaluate a learner.

A

False. Time and space complexities of the learner are also among the indicators.

8
Q

The optimal policy for any MDP can be found in polynomial time.

A

True. For any (finite) MDP, we can form the associated linear program (LP) and solve it in polynomial time (via interior point methods, for example, although for large state spaces an LP may not be practical). Then we take the greedy policy and voila!

9
Q

A policy that is greedy - with respect to the optimal value function - is not necessarily an optimal policy.

A

False. Optimal action on an optimal value function is optimal by definition.

10
Q

In TD learning, the sum of the learning rates must converge for the value function to converge.

A

False. The sum of the SQUARES of the learning rates must converge for the value function to converge.

The sum of learning rates must diverge for the value function to converge.
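In symbols, these are the standard stochastic-approximation conditions on the learning rates alpha_t:

    \sum_{t=1}^{\infty} \alpha_t = \infty \qquad \text{and} \qquad \sum_{t=1}^{\infty} \alpha_t^{2} < \infty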

11
Q

Monte Carlo is an unbiased estimator of the value function compared to TD methods. Therefore it is the preferred algorithm when doing RL with episodic tasks.

A

False. TD(1) is actually equivalent to MC. It is true that MC is an unbiased estimator compared to TD, but it has high VARIANCE. TD, on the other hand, has high bias but low variance, which often makes it better for learning from sequential data.

12
Q

Backward and forward TD(lambda) can be applied to the same problems.

A

True (S&B chapter 12 discusses these views and shows their equivalence). However, in practice backward TD(lambda) is usually easier to compute.

13
Q

Offline algorithms are generally superior to online algorithms.

A

False. It depends on the problem context. Online algorithms update values as soon as new information is available and make efficient use of each experience.

14
Q

Given a model (T,R) we can also sample in, we should first try TD learning.

A

False. You have a model - use it!

Model-free methods such as TD learning and Q-learning are better suited when the model (transitions and rewards) is not known.

15
Q

TD(1) slowly propagates information, so it does better in the repeated presentations regime rather than with single presentations.

A

False. TD(0) propagates slowly while TD(1) propagates information all the way in each presentation.

TD(1) is equivalent to MC, so it samples from full trajectories. TD(0) is a one-step method and suffers from slow propagation if samples aren’t presented repeatedly until convergence.

16
Q

In TD(lambda), we should see the same general curve for best learning rate (lowest error), regardless of lambda value.

A

True. We see the same general curve shape regardless of the lambda value, although the lowest error achieved differs across lambdas (Fig. 4 in Sutton 1988).

Sutton TD experiments.

17
Q

In general, an update rule which is not a non-expansion will not converge.

A

False. Coco-Q is an exception as noted in Lecture 4. In general though, you should expect this to hold up.

..confusing..

18
Q

MDPs are a type of Markov game.

A

True, MDPs are a special case of Markov games, where there is only one agent.

19
Q

Contraction mappings and non-expansions are concepts used to prove the convergence of RL algorithms, but are otherwise unrelated concepts.

A

False. The two are closely related: a contraction mapping is a special case of a non-expansion (one whose modulus is strictly less than one).

20
Q

Linear programming is the only way we are able to solve MDPs in linear time.

A

False. Linear programming can solve MDPs in polynomial time, not linear time, and LP formulations quickly become impractical for MDPs with large state spaces.

21
Q

The objective of the dual LP presented in lecture is minimization of “policy flow”. (The minimization is because we are aiming to find an upper bound on “policy flow”.)

A

False, the objective of dual LP is to maximize the “policy flow”.

22
Q

Any optimal policy found with reward shaping is the optimal policy for the original MDP.

A

False. Only potential-based reward shaping is guaranteed to preserve the original MDP's optimal policy.

23
Q

Potential-based shaping will find an optimal policy faster than an unshaped MDP.

A

False, this depends on the selected potential. It is possible to get stuck in a sub-optimal loop before eventually finding the optimal policy.

not necessarily..

24
Q

Rmax will always find the optimal policy for a properly tuned learning function.

A

False, Rmax is not guaranteed to find the optimal policy. But Rmax can help obtain near optimal results.

25
Q-learning converges only under certain exploration decay conditions.
False, Q-learning can converge even when random actions are performed since this is an off-policy algorithm.
26
The trade-off between exploration and exploitation is not applicable to finite bandit domains since we are able to sample all options.
False. The trade-off still applies: depending on the confidence level we feel comfortable with, we eventually stop exploring arms we "believe" are not optimal.
27
Since SARSA is an “online” algorithm, it will never learn the optimal policy for an environment.
False. "online" does not matter, SARSA converges with infinite exploration at the limit and a greedy policy.
28
online vs offline learning
Online means learning from each sample as it comes in whereas offline means learning from the corpus of data as a whole after the fact.
29
on policy vs off policy
On-policy methods use the actual learned policy to pick the next actions. Off-policy methods estimate the Q values (state-action values) for a target policy regardless of the policy the agent is following; for example, Q-learning updates greedily despite the fact that it is not following a greedy policy.
30
RL with linear function approximation only works on toy problems.
False. With suitable feature engineering, linear function approximation can be applied broadly.
31
KWIK algorithms are an improvement of MB (mistake bound) algorithms and as such will have same or better time complexity for all classes of problems.
False. KWIK may have worse time complexity than MB for some classes of problems, such as when the hypothesis space is very large.
32
With a classic update using linear function approximation, we will always converge to some values, but they may not be optimal.
False. With a classic update using linear function approximation, you are not guaranteed to converge; Baird's star problem is a counter-example.
33
Residual algorithms are always guaranteed to converge, but often do so too slowly to be practical.
True. Convergence is guaranteed. It can be slow if you choose psi poorly, but it is fast for some values of psi.
34
Applying generalization with an "averager" on an MDP results in another MDP
True. An averager over an MDP can itself be viewed as another MDP (over the anchor/basis states), which is why convergence results carry over.
35
Problems that can be represented as POMDPs cannot be represented as MDPs.
False. You can go from a POMDP over some state space to a (typically much larger) MDP over the belief space. POMDPs are a generalization of MDPs, and any MDP can be represented as a POMDP with fully observable states.
36
The only algorithms that work in POMDPs are planning algorithms
False. Many learning algorithms also work in POMDPs, not just planning algorithms: Q-learning, SARSA, etc.
37
We can represent any POMDP as a PSR (predictive state representation), however some classes of POMDPs will be too large to deal with as PSRs.
False. Any system that can be represented as a POMDP can be represented as a PSR; the size of the POMDP does not affect whether it can be represented as a PSR.
38
Options over an MDP form another MDP.
False. Options over an MDP form a semi-MDP, in which an atomic one-step action of the MDP is abstracted into a sub-policy executed over k steps.
39
Nash equilibria can only be pure, not mixed.
False. Nash equilibria can be either pure or mixed. The notion of a Nash equilibrium is that no agent would be willing to unilaterally change its strategy, and that holds for mixed strategies as well.
40
An optimal pure strategy does not necessarily exist for a two-player, zero-sum finite deterministic game with perfect information.
False. A two-player, zero-sum, finite, deterministic game of perfect information always has an optimal pure strategy, and maximin = minimax in this game type; von Neumann's theorem guarantees that a pure strategy exists for each player in this situation.
41
The "folk theorem" states that the notion of threats can stabilize payoff profiles in one-shot games.
False. The folk theorem states that the threat of retaliation opens the door to cooperation, which requires games that are not one-shot: the notion of threats can stabilize payoff profiles in repeated games.
42
If following the repeated game strategy "Pavlov", we will cooperate until our opponent defects. Once an opponent defects we defect forever.
False. That is Grim Trigger. In Pavlov, we cooperate when the previous round's moves agreed and defect when they disagreed.
43
Correlated equilibria rely on coordination, like side payments.
False. Coco uses side payments. Correlated equilibria can be reached without side payments. CE requires a third-party to provide some level of information about what the other party will do. The third party is like a consultant, but not a middleman for a payment.
44
"Subgame perfect" means that every stage of a multistage game has a Nash equilibrium.
False. The wording is misused here: subgame perfect refers to a strategy profile that remains a Nash equilibrium in every subgame, i.e., the strategy is always a best response independent of history, including the starting state.
45
Inverse RL means that we invert the reward function before putting an agent in an environment.
False, Inverse RL utilizes behavior observations and an environment to create a reward structure. IRL means that we're learning the reward function based on observed behaviors, rather than classical RL where we know the reward function in advance and want to learn behaviors to maximize the reward.
46
DEC-POMDPs include communication, this communication allows agents to plan.
True. The agents must plan to balance communication and action, since each has a cost; the agents take independent actions but share a single reward function.
47
Policy shaping requires a completely correct oracle to give the RL agent advice.
False, policy shaping requires a certain amount of confidence, but that does not mean it needs a completely correct oracle.
48
supervised learning
y=f(x) - Function approximation - Given x’s and y’s, find f
49
unsupervised learning
utilizes clustering and description to find structure in unlabeled data: given only x's, find a function f that organizes or labels the observations based on their attributes.
50
Reinforcement learning
finds an underlying function f that maps an observation x to a prediction/action y, guided by a reward z, such that y = f(x) given z. Given x's and z's, find f and y.
51
Markov Decision Process (MDP)
A framework of discrete states and actions to model decision making. States S; actions A; transition function T(s, a, s') between states; reward model R(s, a, s') describing the reward for moving from state s to s' via action a. A policy Pi(s) is the function that returns an action a for any state s. The optimal policy, referred to as Pi*, maximizes long-term expected reward.
52
Markovian Property
Only the present matters. That is, we only need to evaluate the current state; the sequence of events taken to arrive at the current state does not matter. Historical states/actions can be rolled into the current state representation to still satisfy the Markovian property.
53
Temporal Credit Assignment problem (TCA)
For a reward that arrives at the end of a sequence of events, how do we determine which events were responsible for it? You may not know your immediate rewards based on the state and action you just took; only after you complete a sequence of actions may you receive a delayed reward. The issue is compounded by the fact that a reward now may be more impactful than a reward in the future, so minor changes to the reward model matter.
54
Infinite Horizons (aka stationary)
refers to the scenario where the game can be played indefinitely (until some known terminal state is reached), whereas a finite horizon refers to a scenario with a specific number of time steps. The optimal policy may depend on whether we have an infinite, large, or small number of time steps remaining.
55
stationarity of preferences (utility of sequences)
the assumption that preferences over sequences of states do not depend on the time step: if U(S0, S1, S2, ...) > U(S0, S1', S2', ...) then U(S1, S2, ...) > U(S1', S2', ...). When considering a game with an infinite horizon, if the utility of one sequence is greater than the utility of another, the inequality holds from any starting time step.
56
The sum of all rewards over an infinite horizon would be infinite
True, if rewards are undiscounted; this is what motivates discounting.
57
The sum of all discounted rewards over an infinite horizon results in a geometric series bounded by a maximum reward
true
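As a quick check, with rewards bounded by R_max and discount gamma in [0, 1), the geometric series gives:

    \sum_{t=0}^{\infty} \gamma^{t} R(s_t) \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max} \;=\; \frac{R_{\max}}{1-\gamma}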
58
Utility vs Reward
Reward is the immediate benefit from taking an action or transitioning to a state, while the utility is the long term benefits of being in a certain state.
59
Optimal Policy
the policy, Pi*, that maximizes the expected value of our utility function were it to be followed. The optimal action for any given state can then be written as the action that maximizes the expected utility.
60
utility for any given state S
is given as the expected value of the discounted sum of rewards when following a certain policy starting from that state
61
the utility of being in any given state can be represented by the Bellman Equation
true
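For reference, the Bellman equation in question (in the R(s), T(s, a, s') notation used in these cards) is:

    U(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s')\, U(s')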
62
value iteration
Start with arbitrary utilities for each state, update each state's utility according to the Bellman equation, and repeat until convergence. The reason this approach works is that the true R(s) is injected for every state on each sweep, so the estimates absorb more and more of the MDP's real dynamics.
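A minimal Python sketch of this procedure on a small made-up MDP; the two-state MDP, the dictionaries T and R, and all names are illustrative assumptions, not taken from the lectures:

    # Toy 2-state MDP (illustrative, not from the lectures): rewards on states,
    # stochastic transitions. T[s][a] = list of (probability, next_state).
    GAMMA = 0.9
    STATES = ["s0", "s1"]
    ACTIONS = ["left", "right"]
    R = {"s0": 0.0, "s1": 1.0}
    T = {
        "s0": {"left": [(1.0, "s0")], "right": [(0.8, "s1"), (0.2, "s0")]},
        "s1": {"left": [(1.0, "s0")], "right": [(1.0, "s1")]},
    }

    def value_iteration(tol=1e-6):
        U = {s: 0.0 for s in STATES}              # start with arbitrary utilities
        while True:
            delta = 0.0
            for s in STATES:
                # Bellman update: reward + best discounted expected future utility
                best = max(sum(p * U[s2] for p, s2 in T[s][a]) for a in ACTIONS)
                new_u = R[s] + GAMMA * best
                delta = max(delta, abs(new_u - U[s]))
                U[s] = new_u
            if delta < tol:                       # stop when no state's value moves much
                return U

    print(value_iteration())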
63
policy iteration
Start with an initial policy, evaluate the utility of each state under that policy, and then improve the policy by choosing, in each state, the action that maximizes that utility; repeat until the policy stops changing.
64
behavior structure: plan
a fixed sequence of actions. Plans are poor during learning because we don't know enough about the environment, and they fail when transitions are stochastic. Issue: during learning it's hard to make a plan, though after the agent has learned the game it can. Issue: doesn't work with stochasticity (i.e., when there is a probability that an intended action doesn't actually work as intended).
65
behavior structure: conditional plan
a plan that includes if-statements depending on environmental variables. This is not the same as "dynamic replanning" (making a new plan when something doesn't work as intended); instead, the if-statements are built into the original plan. Since state is the only variable, we must have a state representing every relevant variable in the environment, so conditional plans are huge.
66
behavior structure: universal/stationary plan
a mapping from states to actions (similar to an MDP policy); essentially you have the same if-statements available at every state. The space of such plans is very large, which makes the learning problem difficult, but there is always an optimal stationary policy of this form.
67
evaluating a policy
First sum the discounted rewards and take the expected value of this return. In addition to the return, we should also evaluate computational complexity (time and space) and experience/sample complexity (how much data the learner needs).
68
planning
You are given an MDP and you find an optimal policy.
69
off policy
estimates the Q values (state-action values) for a target policy regardless of the policy being followed by the agent. Off-policy learning has two parts: the behavior policy (how data is collected) and the estimated/target policy (what is learned from the data). Q-learning is off-policy: it estimates the return (total discounted future reward) for state-action pairs assuming a greedy policy is followed, despite the fact that it is not following a greedy policy. Its update rule uses the action with the highest Q value in the next state, while the actual behavior policy may choose another action.
70
on policy
updates its Q values using the Q value of the next state s' and the action a' chosen by the current policy. It estimates the return for state-action pairs assuming the current policy continues to be followed. SARSA is an example of an on-policy TD control algorithm.
71
Decaying Reward encourages the agent to end the game quickly instead of running around and gathering more reward
True, as reward decays the total reward for the episode decreases, so the agent is encouraged to maximize total reward by ending the game quickly.
72
R(s) and R(s,a) are equivalent.
True, it just happens that it's easier to think about one vs the other in certain situations.
73
Reinforcement Learning is harder to compute than a simple MDP.
True, you can just use the Bellman Equations for an MDP, but Reinforcement Learning requires that you make observations and then summarize those observations as values.
74
An optimal policy is the best possible sequence of actions for an MDP.
True, with a single caveat. The optimal policy is a policy that maximizes reward over an entire episode by taking the argmax of resulting values of actions + rewards. But MDPs are memoryless, so there is no concept of "sequence" for a policy.
75
Temporal Difference Learning is the difference in reward you see on subsequent time steps.
False. Temporal difference learning is based on the difference in value estimates on subsequent time steps.
76
RL falls generally into 3 different categories: Model-Based, Value-Based, and Policy-Based.
True. Model-based methods essentially learn a model and solve it with the Bellman equations; value-based methods (temporal difference learning falls here) learn value functions directly; policy-based methods search directly over policies rather than values.
77
TD Learning is defined by Incremental Estimates that are Outcome Based.
True. TD learning thinks of learning in terms of experienced outcomes (episodes), incrementally updating its estimates from them rather than relying on a predefined model.
78
Model/transition model
a function of a state, an action, and another state: T(s, a, s') gives the probability of transitioning to the next state s' given that you are in state s and perform action a. The model doesn't change over time, i.e., it is stationary.
79
Policy is the solution to the Markov Decision Process
True. The policy tells you, given a state s, which action a to take. The optimal policy pi* is the policy that maximizes the long-term expected reward. The policy is defined for every state, so given a state it outputs an action (it doesn't tell you a sequence of actions).
80
The Bellman Equation is non-linear
true because of the max function
81
steps to solve bellman equation
1) Start with arbitrary utilities 2) Update utilities based on neighbor 3) Repeat step 2 until convergence
82
policy relationship with utility
If you have the utilities, you can find pi (the policy) by taking the argmax; however, many different utility functions can give rise to the same policy, so the mapping is not one-to-one.
83
policy iteration steps
1) Start with an initial guess of the policy pi_0. 2) Evaluate: given pi_t, calculate U_t = U^{pi_t}. 3) Improve: pi_{t+1}(s) = argmax_a sum_{s'} T(s, a, s') U_t(s'). 4) Repeat steps 2 and 3 until convergence.
84
model-based RL
Takes state/action/reward and feeds it into a modeller; the modeller learns the transitions and rewards, which feed into an MDP solver; the solver outputs an optimal value function Q*; taking the argmax over actions gives the policy. The learning part is where you take new experience and feed it back into the modeller to learn a better model.
85
model-free (value function based) RL
Take the state/action/reward to form a value estimate, output an estimate of the optimal value function Q, keep iterating on that value estimate, then take the argmax over actions to get the policy.
86
policy based RL / policy search
Take the state/action/reward and keep iterating on the policy until it converges
87
Temporal Difference Learning: TD(lambda)
Essentially learning how to predict over time. Looking at a sequence of states from the initial to the final state (and the rewards obtained between states), we try to predict the expected discounted reward. You can think of TD(0) as considering only immediate information and TD(1) as considering all information. The value of a state (no actions considered) is the immediate reward plus the discounted value of what follows. In TD(0), you update the value of the state you just left based on the error between the next reward plus the predicted next value and your current estimate, and keep the same values as before for all other states.
88
properties of learning rates
The limit of V_t(s) approaches the actual value V(s) when t is big enough, provided: the sum of all learning rates is infinite (i.e., unbounded) - they must be big enough that the estimate can be moved all the way to the actual value of the state; and the sum of the squares of all learning rates is finite - they must shrink enough that the noise doesn't overwhelm the estimate.
89
k step estimators vs td(lambda)
Essentially, TD(lambda) is just a weighted linear combination of all the k-step estimators. Each estimator tells you its preference for what V(s) should be moved toward, and each update moves V(s) toward the weighted combination of those estimates. If lambda = 1, all the weight is on E_infinity, i.e., TD(1); if lambda = 0, all the weight is on E_1, i.e., TD(0); if lambda is in (0, 1), we have a weighted combination of all the estimators in between.
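In symbols, with G_t^{(k)} denoting the k-step estimator (E_k above), the TD(lambda) target is the lambda-return:

    G_t^{\lambda} = (1-\lambda) \sum_{k=1}^{\infty} \lambda^{k-1} G_t^{(k)}

so lambda = 0 puts all weight on the one-step return and lambda = 1 recovers the full (Monte Carlo) return.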
90
Bellman operator
Let B be an operator, i.e., a mapping from value functions to value functions. E.g., you give it a Q function and it gives you back a possibly different Q function, looking up the value of the next state using the Q function provided. Q* = BQ* is another way to write the Bellman equation; Q_t = BQ_{t-1} is another way to write value iteration.
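Written out for Q functions (using the T and R notation from the MDP cards above), the operator is:

    [BQ](s, a) = R(s, a) + \gamma \sum_{s'} T(s, a, s') \max_{a'} Q(s', a')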
91
contraction mappings
B is a contraction mapping if, for all value functions F and G and some 0 <= gamma < 1, ||BF - BG||_inf <= gamma * ||F - G||_inf. I.e., the distance between B applied to F and B applied to G is at most gamma times the original distance between F and G: it makes things get closer together.
92
contraction properties
If B is a contraction mapping, then: 1) F* = BF* has a solution and it is unique; 2) F_t = BF_{t-1} implies F_t -> F*, i.e., as you use value iteration to update the function, it converges to the fixed point F*. Every time you apply the operator, your values are guaranteed to be closer to F* than they were before.
93
Bellman operator contracts
Expanding the contraction property for two Q value functions: it reduces to the Bellman equation evaluated at the state-action pair where the two Q functions are farthest apart. Both Q functions contribute the same immediate reward, so the difference comes from the discounted max over next actions, which may pick two different a' (the best next action can differ between the two Q functions).
94
Max is a non-expansion
true
95
If you get a good enough value function, then you can put a bound on how far away you are from optimal value function
True. You don't even need to know how far away you are; you just need to stop when the change in value of every state between iterations is less than some epsilon.
96
Value iteration doesn’t allow you to solve the MDP in polynomial time
True. Linear programming, by contrast, is an optimization framework in which you specify linear constraints and a linear objective function, and it can solve the problem in polynomial time.
97
linear programming / bellman equation
A max can be encoded with linear constraints: to express x = max(-3, 7, 2, 5), require x to be at least each element, then minimize x so it settles on the largest one. Using this trick, the Bellman max becomes a set of linear constraints, and the objective is to minimize the value function summed over all states. This is called the PRIMAL.
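As a sketch of that primal formulation (notation as in the earlier MDP cards, with rewards written R(s, a) here):

    \min_{V} \sum_{s} V(s) \quad \text{s.t.} \quad V(s) \ge R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V(s') \;\; \forall\, s, a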
98
duality (linear programming)
a property where you can change the variables into constraints, and constraints into variables
99
policy flow
q_sa: how much "agent-ness" is running through each state-action pair under the policy. This is subject to: the policy flow into each next state should equal the number of times that state is visited (the sum over all state-action pairs leading to it of the policy flow through that pair times the transition probability, discounted); and the policy flow at each state-action pair has to be non-negative.
100
policy iteration 2
1) Initialize the Q values of all states arbitrarily (e.g., all zeros). 2) Compute the policy pi_t from the current Q function. 3) Evaluate that policy to get a new Q function, and assign it as the next time step's Q function. 4) Repeat steps 2 and 3 until convergence.
101
properties of policy iteration
Q_t -> Q*; convergence is exact and complete in finite time; converges at least as fast as value iteration.
102
trade offs with value iteration
Faster convergence, at the expense of taking more computational resources
103
Why might we want to change the reward function for an MDP?
To make the MDP easier to solve/represent while remaining similar to what would have been learned anyway; because we don't have a reward function yet; and for other reasons such as semantics and efficiency (speed, space, solvability).
104
MDPs are usually defined as
true
105
ways to change reward functions
- Multiply by positive constant - Shifting by constant - Non-linear transformations (potential-based rewards)
106
reward shaping
Although you are encouraging the learner to move toward the goal by setting intermediate rewards, you may be introducing spurious optimal policies, so we need a mechanism to prevent these local optima. Potential functions provide one: instead of putting intermediate rewards on actions occurring, put intermediate rewards on states of the world.
107
Potential based shaping
Here Psi(s) represents the bonus reward you get for being in state s. The shaped reward for moving from state s to s' via action a subtracts Psi(s) for the state you just left and adds gamma*Psi(s') for the state you're entering. This means that Q'(s,a) = Q(s,a) - Psi(s), which doesn't affect the optimal policy because subtracting a state-only term doesn't change the argmax over actions.
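In equation form, the shaped reward described here is:

    R'(s, a, s') = R(s, a, s') - \Psi(s) + \gamma\, \Psi(s')

so Q'(s, a) = Q(s, a) - \Psi(s) and the argmax over actions (hence the optimal policy) is unchanged.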
108
Q learning with potentials
It's worth noting that Q-learning with a potential function ends up being the same as regular Q-learning with the Q function initialized to the potential function. The same actions occur because the argmax over a is the same, even though the Q values are different. So how you initialize your Q values puts emphasis on states where the initial values are bigger; this is an argument against random initialization, because initial values matter. Injecting a potential function can also affect learning time: if you initialize the Q values optimistically, you can bound the learning time.
109
K-armed bandits problem
- K bandits (i.e., slot machine), each with 1 arm - Each slot machine has a probability of winning a payoff - We don’t know which slot machine arm to pull because we don’t know what each of the payoffs are … if we knew, then we can just pull the argmax slot machine - Assume that the payoff is either 0 or 1 - 1 arm pull is not enough information to know the probabilities/payoffs of each slot machine observed - Denominator is number of pulls - Numerator is observed number of payoffs
110
Confidence based exploration (bandits)
Maximum likelihood - after each pull, recalculate the expected values and choose the arm with the highest expected value; in the example, if 'a' never returned any payoff but 'b' returned at least one, this will always choose 'b', since 'a''s expected value stays at 0. Maximum confidence - whichever arm had the most observations initially will always be chosen. Minimum confidence (if the confidence is the same, choose randomly) - you will just uniformly cycle through all the slot machines.
111
Exploration-exploitation dilemma
Want to somehow balance maximum likelihood and minimum confidence
112
metrics for bandits
1) identify optimal arm in the limit 2) maximize (discounted) expected reward Want to do some exploration, but also exploitation so that you can take advantage of the observations you made during exploration However, very hard computationally to do this Gittins Index: s/d (s = # of successes, d = # of observations) can be mapped to some index such that if you did this for all slot machines and maximized on the index, you maximize the discounted expected reward - The idea is that what expected payoff is enough for you to always choose this specific slot machine, and never explore again - Effectively combines maximum likelihood and low confidence - Issue: only really works for the bandit problem, and doesn’t really generalize 3) Maximize expected reward over finite horizon - Similar to (2) but still hard to compute 4) identify near-optimal arm (epsilon) with high probability (1-delta) in time Tau(k,epsilon, delta) … polynomial in k, 1/delta, 1/epsilon) - I.e., don’t need to get the best arm, but just the one that’s epsilon away from the best with some probability 1-delta - Similar to PAC learning 5) nearly maximize reward (epsilon) with high probability (1-delta) in time Tau’(k, epsilon, delta) … polynomial in k, 1/delta, 1/epsilon 6) pull a non-near optimal arm (epsilon) no more than Tau’’(k, epsilon, delta) times with high probability (1-delta) - I.e., the number of times you pull a non-optimal arm is small … minimize mistakes
113
Hoeffding
X1, ..., Xn are random variables. We don't know the true mean u_i, but we can compute u_hat by maximum likelihood (i.e., the sample average). The more confidence we demand (smaller delta), the wider the confidence bound; the more data we have (larger n), the narrower the bound. This tells us that, given a way of knowing how close a sample estimate is to the hidden parameter, we can figure out how many times we have to pull an arm before we know fairly accurately what its payoff is, and therefore how many pulls of each arm are needed to be confident we have found something near-optimal.
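The bound in question is Hoeffding's inequality: for n i.i.d. samples bounded in [0, 1], with probability at least 1 - \delta,

    |\hat{\mu} - \mu| \;\le\; \sqrt{\frac{\ln(2/\delta)}{2n}}

so demanding more confidence (smaller delta) widens the interval, while more samples (larger n) narrow it.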
114
general stochastic MDP
Stochastic (Hoeffding bound until estimates are accurate) | Sequential (unknown state-action pairs assumed optimistic)
115
General RMAX algorithm
1) Keep track of the MDP learned so far. 2) Treat any unknown state-action pair as having value RMAX. Use a parameter c: a state-action pair is unknown unless we have tried it c times; after c tries, use the maximum-likelihood estimate. 3) Solve the MDP. 4) Take the action given by Pi*.
116
Explore or exploit lemma
If all transitions are either accurately estimated or unknown (and treated optimistically), the resulting policy is either near-optimal or reaches an unknown state "quickly." Therefore we can't run for very long without either learning something new or behaving near-optimally.
117
Generalization idea
We can assign features to states to generalize/categorize them. We can learn the policy, which maps states to actions; we can learn the value function, which maps state/action pairs to estimated returns; or we can learn the model (transitions and rewards) - the latter is similar to supervised learning, since you observe state/action pairs and the next state.
118
Basic update rule
We have a Q function that is a function of weights/parameters (w^a) and features (f(s)), fed into a function approximator (F). Given an experienced transition, we perform an update that moves the estimate a little closer toward the target. The TD error is the difference between the Bellman target and the current Q value, and we update each weight in proportion to how much that weight influenced the Q value. In a way we are bootstrapping: we are using our current predictions as a target for future predictions.
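For a linear approximator Q(s, a) = w_a \cdot f(s), the update sketched above can be written as:

    \delta = r + \gamma \max_{a'} Q(s', a') - Q(s, a), \qquad w_a \leftarrow w_a + \alpha\, \delta\, f(s)

(the gradient of Q with respect to w_a is just the feature vector f(s)).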
119
Partial Q(s,a)
Successes: 3-layer backprop (Tesauro, Wang/Dietterich - however, difficult to replicate), CMAC (Sutton), linear nets (Singh, Parr - very important that you choose the right features: 1) computationally efficient, 2) value-informative), deep nets (Mnih et al., Atari). Tesauro (TD-Gammon) -> Boyan/Moore (examples where things should have worked well but performed poorly, showing it need not work) -> Sutton (replicated Boyan/Moore's experiments but changed the way training happened and it worked, so it can work if you do it right). Pollack showed that genetic algorithms did well on backgammon too, so TD-Gammon itself is probably a forgiving example, likely due to it being consistently random so you don't fall into traps.
120
Baird’s Counter Example
if you update the sequence badly, then it can spiral out of control
121
Averagers
The averager is simply another MDP, so it converges. It's as if we move from one state to a second state, and from the second state make mini-transitions to all the basis (anchor) states, solving the MDP over the basis states in order to define the value function everywhere. Interpolate between anchor points: define some points x1, x2, x3 and a function approximator that fills in the rest between them. It doesn't matter what we do, as long as any point we are trying to evaluate is expressed as a convex combination of the anchor points - this means any point between two anchor points takes a value between theirs, and even beyond the outermost anchors the value is still defined with respect to the anchor points. Supervised learners that can be expressed as averagers: k-NN and distance-weighted k-NN (find the things closest to you and let them vote), and kernel functions.
122
POMDP
A way of talking about "non-Markov" environments. The underlying state isn't directly observable to the agent, so instead it makes decisions based on observations Z. O is the observation function that, given a state and an observation, gives the probability of seeing Z in state s. POMDPs generalize MDPs: if the POMDP really is an MDP, then the observables are simply the states and O(s, Z) = 1 iff s = Z.
123
State estimation
Belief state b(s): gives the probability that we are in state s - a vector, i.e., a probability distribution over states. The update is b, a, z -> b': we are taking the POMDP and turning it into a belief MDP. Rewards are not directly observed, but people may write the rewards as a function of the observables. We need to derive b'(s') = the probability of being in state s' after b, a, z. This turns the POMDP into an MDP with infinitely many states; linear programming, policy iteration, and value iteration are polynomial in the number of states, so they would take forever.
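The derivation referred to here gives (up to a normalizing constant):

    b'(s') \;\propto\; O(s', z) \sum_{s} T(s, a, s')\, b(s)

where the normalizer is the probability of observing z after taking action a from belief b.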
124
Value iteration in POMDPs
At t = 0, define the value to be 0 everywhere. The reward isn't directly a function of b and a, so it needs to be re-expressed over beliefs. Claim: the value of a belief state is piecewise linear and convex - it is the max over a set of linear functions (alpha vectors). Base case: there exists a set Gamma_0 of alpha vectors representing the value function at t = 0.
125
RL for POMDPs
Planning: you know everything (the model) - but planning in POMDPs is hard/undecidable. RL: you don't know everything, so you need interaction. Model-based RL: learn the model (the POMDP) and use it. Model-free RL: don't learn the model - map observations to actions directly.
126
Bayesian RL
The idea is to keep a Bayesian posterior over the possible MDPs we might be in and use it to optimally balance exploration and exploitation. Optimal policy: explore to figure out which MDP you're in, then exploit to get high reward. RL is actually planning in a kind of continuous POMDP whose hidden states are the parameters of the MDP.
127
What are the four (4) components (the "Problem" portion) of an MDP?
State, model, actions, reward
128
What two requirements are necessary to meet for a valid MDP?
1. Must fulfill Markovian property, i.e. only the present matters 2. It must be stationary, i.e. the physics of the world (the "model"), actions available, etc. must not change as a function of time.
129
What is a policy, and what is an optimal policy?
A policy is a mapping from states to actions. An optimal policy is one that maps states to actions in such a way that it maximizes the expected reward the agent will receive over time.
130
What is the difference between a plan and a policy?
A policy is a mapping from states to actions. A plan is simply a sequence of actions. It's an important distinction because the stochastic nature of the world means that just because you plan to go somewhere doesn't guarantee you'll get there. A policy is more general in that it simply says "if I'm in a state, what action do I take?"
131
What is the credit assignment problem?
It's the problem of figuring out which states and actions from far in the past were responsible for the outcome we see now.
132
What is the "Stationarity of Preferences"?
The idea that if I prefer a set of states today over another set today, then that same inequality will hold the next day, and the next day, and so on. This is important because it leads to the natural decision to simply add rewards together to calculate the utility (i.e. value) of a sequence of states.
133
What does "discounting" rewards achieve? What must be true of the discount factor gamma for this condition to be met?
It allows us to calculate an infinite series in finite time. Gamma must be 0<=gamma<1
134
What is the difference between a reward and utility (i.e. value)?
R != V, i.e. reward only encompasses immediate gratification. Value tells us how good it is to be in some state based on the immediate reward + the discounted future value (i.e. it accounts for DELAYED rewards).
135
Why does value iteration work?
Because at each step, we get another glimpse of reality, and as we take more and more steps, we encompass more and more of the real dynamics of the problem.
136
Does policy iteration always converge?
Yes
137
What does the 'Q' in Q values stand for?
Quality, more specifically, how good it is to take some action.
138
Why do we bother having V functions AND Q functions? Isn't V enough?
Q functions are very useful for learning from samples of data, because we don't need access to the transition function (which we do need when working with the value function, since value iteration requires calculating the value of the next state for each update).
139
What are the 3 "classes" of solution methods for solving RL problems? (bonus: what category do TD methods fall into?)
1. Model-based 2. Value-based (TD methods fall into this) 3. Policy-based
140
What properties must the learning rate have for RL?
1. SUM of all learning rate values must be infinite | 2. SUM OF SQUARES of learning rate values must be finite
141
Names some of the differences between TD(0) and TD(1)
TD(0): - Slow to propagate information - High bias, low variance - Maximum likelihood estimate (MLE) TD(1): - Equivalent to MC, samples full trajectories - Requires full trajectory in order to update - Low bias, high variance
142
What values lambda tend to work well (empirically speaking) when used in TD(lambda)?
0.3-0.7
143
Does Q-learning always converge? If so, what does it converge to?
Yes, it converges to the optimal values Q*, provided all state-action pairs are visited infinitely often and the learning rates satisfy the usual conditions.
144
What is non-expansion/contractions?
A contraction mapping brings any two value functions closer together by a factor gamma < 1 (||BF - BG|| <= gamma ||F - G||); a non-expansion never increases the distance (factor at most 1).
145
What things are contraction mappings / non-expansions?
Order statistics, FIXED convex combinations
146
RL with control vs RL without control
the difference is whether or not there are actions chosen by the learner...
147
what is a bellman operator
A transformation from value functions to value functions: feed in Q, get back a new value function BQ. Q* = BQ* is the Bellman equation; Q_{t+1} = BQ_t is value iteration.
148
what is a contraction mapping
Causes 'things' to get closer together: if for all F, G and some 0 <= gamma < 1, ||BF - BG||_inf <= gamma * ||F - G||_inf, then B is a contraction mapping. In other words, there is some gamma in [0, 1) such that applying B shrinks distances: |BX - BY| <= gamma * |X - Y|.
149
contraction mapping properties
If B is a contraction mapping: | F* = BF* has a solution and it is unique | F_t = BF_{t-1} => F_t -> F*, i.e., value iteration converges.
150
DEC-POMDPS allow us to wrap coordinating and communicating into choosing actions to maximize utility
True. Decentralized Partially Observable Markov Decision Processes: actions are taken simultaneously by a finite set of agents (not just one), and all agents share the reward function (they are working together).
151
DEC-POMDPs are represented by Ri (diff reward for each agent)
FALSE. There is a single shared reward (all agents are working together); if the reward were not shared, the model would be a POSG (partially observable stochastic game).
152
Properties of DEC-POMDP
1. Elements of game theory and POMDPS | 2. NEXP-complete (for finite horizon)
153
Inverse RL
Input: behavior and an environment; output: a reward function. No reward function is given; instead, the reward function is inferred from behavior observed from an expert. The idea is to mimic observed behavior, which is often optimal or close to optimal.
154
In reward shaping, if the human believes X, Y, Z with 2/3, 1/6, 1/6 and the algorithm believes 1/10, 1/10, 8/10. What action should they choose?
Choose action Z because its pairwise product is highest: argmax_a p(a|policy1) * p(a|policy2).
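Checking the arithmetic on the pairwise products:

    p(X) = \tfrac{2}{3}\cdot\tfrac{1}{10} = \tfrac{1}{15} \approx 0.067, \quad p(Y) = \tfrac{1}{6}\cdot\tfrac{1}{10} = \tfrac{1}{60} \approx 0.017, \quad p(Z) = \tfrac{1}{6}\cdot\tfrac{8}{10} = \tfrac{2}{15} \approx 0.133

so Z has the largest product (before renormalizing).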
155
Drama Management
There is a third agent, the "author," who wants to build an agent that influences the player's experience. E.g., the author created Pac-Man, the agent is the game itself, and the player is the player.
156
Value Iteration Algorithm
Start with arbitrary utilities; update utilities based on reward plus neighbors (discounted future reward); repeat until convergence.
157
markovian property details
Only the present matters AND things are stationary (rules/the world doesn't change over time)
158
Policy Iteration won't converge
False
159
TD update rule always converges with any learning rate
False. It converges only if the sum of the learning rates over time is infinite and the sum of the squared learning rates is finite.
160
TD(1) is the same as outcome-based updates (if no repeated states)
True. and with even more learning because updates don't have to wait for the episode to be over
161
Maximum Likelihood Estimate vs outcome-based estimate (TD(1))
Maximum likelihood uses all of the examples, but TD(1) uses just individual runs, so if a rare event happens TD(1)'s estimate can be skewed by it (high variance). (This leads us to TD(lambda).)
162
TD(0) is the same as maximum likelihood estimate
TRUE if we run over the data over and over again
163
TD(lambda) is weighted combination of k step estimators
True
164
TD(1) typically has less error than TD(0)
False. TD(1) typically has more error than TD(0)
165
TD(0) has the least amount of error usually
False. TD(lambda) performs best. Usually 0.3 - 0.7 is best
166
Temporal Difference is
The difference between value estimates as we go from one step to the next.
167
Model is the environment
False. Model is the agent's idea of the environment
168
RL vs Planning
RL: the model is unknown and the agent learns by performing actions. Planning: the model is known and the agent performs computations with it (no interaction needed).
169
Backward and forward TD(lambda) can be applied to the same problems.
True. However, the forward view requires waiting for a complete episode (i.e., it does not work "online"), whereas the backward view can be applied incrementally.
170
gamma of 0 means
you care only about the immediate reward (short-sighted)
171
gamma of 1 means
all future reward is equally weighted (far-sighted)
172
why use gamma and discount future rewards
avoid infinite returns and also account for a model not being perfect (future reward is not guaranteed)
173
You can always flatten an MDP to a Markov Chain
True: fixing a policy reduces the MDP to a Markov chain (with rewards).
174
Difference between MDP and Markov Reward Process
An MDP is an extension of a Markov chain: a Markov reward process adds rewards (giving motivation), and an MDP additionally adds actions (allowing choice).
175
it has been proven that both forward and backward view of TD(lambda) are equivalent
True
176
What is forward TD(lambda)
The forward view looks at all n-steps ahead and uses λ to essentially decay those future estimates.
177
What is backward TD(lambda)
The backward view of TD(λ) updates values at every step: after each step in the episode you update all previously visited states in proportion to their eligibility traces.
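A minimal Python sketch of the backward view with accumulating eligibility traces; the episode format (lists of (state, reward, next_state) tuples) and all names are illustrative assumptions:

    # Backward TD(lambda) for state-value prediction with accumulating traces.
    # Assumes terminal states appear only as next_state and keep value 0.
    def td_lambda_backward(episodes, states, alpha=0.1, gamma=0.9, lam=0.8):
        V = {s: 0.0 for s in states}
        for episode in episodes:                      # episode: list of (s, r, s_next)
            e = {s: 0.0 for s in states}              # eligibility traces
            for s, r, s_next in episode:
                delta = r + gamma * V[s_next] - V[s]  # one-step TD error
                e[s] += 1.0                           # bump trace for the state just left
                for x in states:                      # every state is updated in proportion
                    V[x] += alpha * delta * e[x]      # to how recently/often it was visited
                    e[x] *= gamma * lam               # traces decay by gamma * lambda
        return V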
178
Online learning vs offline learning
Online learning means that you are doing it as the data comes in. Offline means that you have a static dataset.
179
SARSA vs Q- Learning
In Q-learning, you update the estimate using the maximum over possible next actions, regardless of which action you actually take next. In SARSA, you update using the action you actually take next - the update and the behavior use the same action.
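A side-by-side sketch of the two update rules in Python, assuming Q is a dictionary keyed by (state, action) and initialized for all pairs; the function and variable names are illustrative:

    # Q-learning (off-policy): bootstrap from the best next action, whatever we do next.
    def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    # SARSA (on-policy): bootstrap from the action a_next the current policy actually takes.
    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
        target = r + gamma * Q[(s_next, a_next)]
        Q[(s, a)] += alpha * (target - Q[(s, a)])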
180
In general, an update rule which is not a non-expansion will not converge.. duplicate?
True, in general. A non-expansion is {constant, contraction}; an update rule that is not a non-expansion is an expansion, and expansions can diverge (though exceptions such as Coco-Q exist).
181
Contraction mappings and non-expansions are concepts used to prove the convergence of RL algorithms, but are otherwise unrelated concepts. duplicate?
False | They are related: a non-expansion composed with a contraction is still a contraction, which is how convergence guarantees are obtained (e.g., the max non-expansion inside the Bellman operator).
182
examples of non-expansion
order statistics (max, min), fixed convex combinations
183
Policy Iteration converges at least as fast as VI
True
184
Domination
policy 1 dominates policy 2 if for all states the value of that state for policy 1 is greater than or equal to the value of that state for policy 2
185
policy iteration won't get stuck in a local optimal
True
186
T/F initializing q values is the same as using potential based functions
True: initializing the Q values to a potential function is equivalent to potential-based shaping. Initializing to 0 is neutral; random initialization means you are biasing the agent toward arbitrary states.
187
Potential-based shaping will find an optimal policy faster than an unshaped MDP.
False... if the selected potential is wrong, it will slow things down.
188
potential functions can speed things up
True
189
Rmax will always find the optimal policy for a properly tuned learning function.
False (although true in spirit). The problem is the word 'always.' As Jacob alludes to, in these PAC-style bounds, there is always a small probability that the desired goal isn't reached, and if reached it's not quite optimal.
190
one shot games
This is a game that is played only once. The payoff may be such that the game is impossible to play twice (e.g., mutually assured nuclear destruction; slightly different with tactical/conventional warfare). In one-shot interactions, people often have an incentive to behave opportunistically/selfishly; consider a one-shot Prisoner's Dilemma game where an individual wants to minimize their own sentence.
191
repeated games
A game played more than once. A crucial point is the reaction to a defection by another player. Tit-for-Tat strategy: if you defect, I defect in the next round. Grim strategy: if you defect, I will defect in all future rounds.
192
Q-learning converges only under certain exploration decay conditions.
False. Q-learning's convergence does not hinge on exploration decay; it is guaranteed by the contraction property of the Bellman operator (given suitable learning rates and that all state-action pairs keep being visited), which makes no assumption about the exploration rate.
193
The tradeoff between exploration and exploitation is not applicable to finite bandit domains since we are able to sample all options. Dup?
False. You may still need to explore "non-optimal" arms of the bandit since there may be noise in the estimate of its value.
194
Since SARSA is an “online” algorithm, it will never learn the optimal policy for an environment.
False. SARSA converges to an optimal policy as long as all state-action are visited an infinite number of times and the policy converges to a greedy policy
195
KWIK algorithms are an improvement of MB (mistake bound) algorithms and as such will have same or better time complexity for all classes of problems.
False. Some hypothesis classes are exponentially harder to learn in the KWIK setting than in the MB setting.
196
With a classic update using linear function approximation, we will always converge to some values, but they may not be optimal.
False. Baird's counter-example shows how classic update using linear function approximation can lead to divergence.
197
Applying generalization with an "averager" on an MDP results in another MDP.
True. As noted in Gordon (1995), you will have fewer states...
198
Hoeffding and union bounds
a way to quantify how much we know (certainty), i.e., to know when we are certain enough
199
Rmax
optimism in the face of uncertainty
200
Can combine Rmax bandits
stochastic + sequential
201
When does generalization work
essentially anything that generalizes a linear model (ANN, CMAC, linear nets, deep nets)
202
Generalization duplicate
some state may look like other states so take into account features of states and weight the importance of the feature when making a decision. Replace update rule with function approximation
203
averagers can converge to the optimal solution (but do not always)
True. Needs well-placed anchor points to converge.
204
value iteration converges for POMDPs
False, but you can get near-optimal
205
Model based POMDP
use expectation maximization to learn the model
206
Properties of Monte Carlo Tree Search
useful for large state spaces; needs lots of samples to get good estimates; planning time is independent of the size of the state space; running time is exponential in the horizon
207
Gittens index only works for
bandits
208
How does monte carlo tree search get q value estimates
random simulations
209
Goal abstraction
Modular RL and arbitration
210
Temporal abstractions (hierarchical RL)
options and semiMDPs
211
Pavlov strategy
cooperate if agree, defect if disagree
212
Subgame perfect
always best response independent of history
213
Properties of minimax-Q for 2-player, zero-sum games
value iteration works, minimax-Q converges, efficient
214
Properties of Nash-Q for multiplayer (general-sum) games
value iteration does not work, Nash-Q does not converge
215
Does coco generalize to n>2 players
no
216
what is coco
cooperative competitive values
217
POMDPs can be planned
False in general: exact planning in POMDPs is intractable (undecidable in the infinite-horizon case), though near-optimal approximations exist.
218
Semi-MDP
Instead of working only with a small set of primitive actions, we can create new temporally extended actions (options) whose transitions span multiple steps, which can reduce the number of decisions needed to learn and move toward a goal.
219
because SMDPs abstract an MDP, they inherit the hierarchical optimality, convergence, and stability of an MDP
FALSE; this is true only with the caveat that we choose the right options. The abstraction also means some state details no longer matter, which helps with exploration.
220
residual algorithms
A new class of algorithms, residual gradient algorithms, is proposed, which perform gradient descent on the mean squared Bellman residual, guaranteeing convergence. It is shown, however, that they may learn very slowly in some cases. A larger class of algorithms, residual algorithms, is proposed that has the guaranteed convergence of the residual gradient algorithms, yet can retain the fast learning speed of direct algorithms. In fact, both direct and residual gradient algorithms are shown to be special cases of residual algorithms, and it is shown that residual algorithms can combine the advantages of each approach. At any given time during learning, the Q function that has been learned so far may not exactly satisfy the Bellman equation. The Bellman residual is the difference between the two sides of the equation. This is slightly different from the TD error. The TD error is the difference between the two sides of the equation without the expected value, evaluated after just a single transition for (x,u). So the Bellman residual is the expected value of the TD error. The TD error is the Bellman residual plus zero-mean noise.
221
dual LP
The dual of a given linear program (LP) is another LP that is derived from the original (the primal) LP in the following schematic way: Each variable in the primal LP becomes a constraint in the dual LP; Each constraint in the primal LP becomes a variable in the dual LP; The objective direction is inversed – maximum in the primal becomes minimum in the dual and vice-versa. The solution to the dual problem provides a lower bound to the solution of the primal (minimization) problem.[1]
222
primal LP
In mathematical optimization theory, duality or the duality principle is the principle that optimization problems may be viewed from either of two perspectives, the primal problem or the dual problem. The solution to the dual problem provides a lower bound to the solution of the primal (minimization) problem.[1] However in general the optimal values of the primal and dual problems need not be equal.
223
pure nash
In plain terms, a pure Nash equilibrium is a strategy profile in which no player would benefit by deviating, given that all other players don't deviate. Some games have multiple pure Nash equilibria and some games do not have any pure Nash equilibria.
224
mixed nash
Mixed strategy Nash equilibria are equilibria where at least one player is playing a mixed (random) strategy.
225
folk theorem
The Folk Theorem suggests that if the players are patient enough and far-sighted, then repeated interaction can result in virtually any average payoff in an SPE equilibrium.
226
Probably approximately correct learning (PAC learning)
In computational learning theory, probably approximately correct learning is a framework for mathematical analysis of machine learning. It was proposed in 1984 by Leslie Valiant. In this framework, the learner receives samples and must select a generalization function from a certain class of possible functions. helps analyze whether and under what conditions a learner L will probably output an approximately correct classifier.