Final Flashcards

1
Q

It is not always possible to convert a finite horizon MDP to an infinite horizon MDP.

A

False. You can always convert a finite-horizon MDP to an infinite-horizon one by turning each terminal state into an absorbing state that transitions to itself with reward 0.

2
Q

In RL, recent moves influence outcomes more than moves further in the past.

A

False. While that can be the case (and often works well), some algorithms, like TD(1), give equal weight to each move, and it is also possible to give more weight to earlier moves. It depends on the problem you're trying to solve.

False. You can lose a chess game on your first move – and apparently someone did.

3
Q

An MDP given a fixed policy is a Markov chain with rewards.

A

True. A fixed policy means the agent doesn't choose among actions in each state; it transitions from state to state according to that fixed policy, which yields a Markov chain.

True. A fixed policy means a fixed action is taken in each state. The process then depends only on the current state, and the MDP reduces to a Markov chain with rewards.

4
Q

If we know the optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix.

A

False, you don’t need the transition function.

False. For an optimal policy, the maximum of Q(s, a) over actions equals V(s), so there is no need to know the environment's transition matrix.

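In equation form: $V^*(s) = \max_a Q^*(s, a)$, so the optimal V values follow directly from the optimal Q values.
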
5
Q

In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s reward (terminal and non-terminal) the optimal policy will not change.

A

True. Assuming an infinite horizon, adding a constant to every reward leaves the optimal policy unchanged.

6
Q

Markov means RL agents are amnesiacs and forget everything up until the current state.

A

True here: the current state is all the agent should know.

Now if you want to discuss what a “current state” is… Well that can get more complicated.

True. In an MDP the present state is a sufficient statistic of the past.

Only the current state matters and what past states were is irrelevant to the agent.

7
Q

A policy that is greedy with respect to the optimal value function is not necessarily an optimal policy.

A

False, as noted: acting greedily with respect to the optimal value function is optimal by definition.

8
Q

In TD learning, the sum of the learning rates used must converge for the value function to converge.

A

False, the sum of learning rates must diverge for the value function to converge.

Also, the sum of learning rates squared must converge for the value function to converge.

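The standard (Robbins-Monro) conditions on the learning rates $\alpha_t$: $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$.
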
9
Q

Monte Carlo is an unbiased estimator of the value function compared to TD methods. Therefore, it is the preferred algorithm when doing RL with episodic tasks.

A

False (the second sentence).

True: Monte Carlo is an unbiased estimator of the value function, whereas TD methods start with an initial estimate of the Q-values and tend to be biased toward it.

False: most episodic tasks have long episodes, and the computational advantages of TD updates favor TD methods over MC methods.

10
Q

The value of the returned policy is the only metric we care about when evaluating a learner.

A

False. We can evaluate a learner on other metrics including computational complexity and how much data is required to effectively learn.

11
Q

T/F POMDPs allow us to strike a balance between actions to gain reward and actions to gain information

A

True. This is all folded into one model (no special machinery is needed to do it).

Partially Observable Markov Decision Processes (POMDPs) have succeeded in planning domains that require balancing actions that increase an agent’s knowledge and actions that increase an agent’s reward.

12
Q

T/F DEC-POMDPS allow us to wrap coordinating and communicating into choosing actions to maximize utility

A

True.

The decentralized partially observable Markov decision process (Dec-POMDP) is a model for coordination and decision-making among multiple agents.

It is a probabilistic model that can consider uncertainty in outcomes, sensors and communication (i.e., costly, delayed, noisy or nonexistent communication). It is a generalization of a Markov decision process (MDP) and a partially observable Markov decision process (POMDP) to multiple decentralized agents.

13
Q

DEC-POMDP stands for

A

Decentralized Partially Observable Markov Decision Process

14
Q

The primary difference between POMDPs and DEC-POMDPS

A

In DEC-POMDPs, actions are taken simultaneously by a finite set of agents (not just one).

15
Q

DEC-POMDPS vs POSG

A

In a DEC-POMDP, all agents share the same reward function (they are working together).

(POSG = partially observable stochastic game: like a DEC-POMDP but each agent has its own reward function.)

16
Q

T/F DEC-POMDPs are represented by Ri (diff reward for each agent)

A

FALSE, there is a shared Reward (all agents working together). If reward wasn’t shared the model would be POSG

17
Q

Properties of DEC-POMDP

A
  1. Elements of game theory and POMDPs
  2. NEXP-complete (for finite horizon); the NEXP-completeness result is stated for succinctly (circuit-) represented inputs

18
Q

Inverse RL

A

Input: behavior and the environment. Output: the reward function.

19
Q

MLRL

A
  • Guess R
  • Compute the policy for that R
  • Measure the probability of the data given that policy
  • Compute the gradient with respect to R to find how to change the reward so it fits the data better
20
Q

In reward shaping, if the human believes X, Y, Z with 2/3, 1/6, 1/6 and the algorithm believes 1/10, 1/10, 8/10. What action should they choose?

A

Choose action Z, because its pairwise product of probabilities is highest.

$\arg\max_a \; p(a \mid \text{policy}_1) \cdot p(a \mid \text{policy}_2)$

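Worked out: X: $(2/3)(1/10) \approx 0.067$; Y: $(1/6)(1/10) \approx 0.017$; Z: $(1/6)(8/10) \approx 0.133$. Z has the largest pairwise product, so choose Z.
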
21
Q

Drama Management

A

There is a third agent, the "Author", who wants to build an agent that interferes with the player. In the Pac-Man analogy: the author created Pac-Man, the agent is the game itself, and the player is, of course, the player.

22
Q

TTD-MDPs vs MDPs

A
  • rather than states: trajectories (the sequence so far)
  • rather than rewards: a target distribution over trajectories, p(T)

TTD-MDP: Targeted Trajectory Distribution Markov Decision Process

23
Q

Value Iteration Algorithm

A

Start w/ arbitrary utilities
Update utilities based on reward + neighbors (discounted future reward)
Repeat until convergence

Note that value iteration is obtained simply by turning the Bellman optimality equation into an update rule. Also note how the value iteration backup is identical to the policy evaluation backup (4.5) except that it requires the maximum to be taken over all actions.

Value iteration effectively combines, in each of its sweeps, one sweep of policy evaluation and one sweep of policy improvement.

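A minimal Python sketch of this loop (not the lecture's code; the array layout T[s, a, s'] for transition probabilities, R[s, a] for expected immediate rewards, and the tolerance are illustrative assumptions):

  import numpy as np

  def value_iteration(T, R, gamma=0.95, tol=1e-6):
      # T: (S, A, S) transition probabilities; R: (S, A) expected immediate rewards
      V = np.zeros(T.shape[0])                 # start with arbitrary utilities (zeros)
      while True:
          # Bellman optimality backup: reward + discounted expected value of neighbors
          Q = R + gamma * (T @ V)              # Q[s, a] = R[s, a] + gamma * sum_s' T[s, a, s'] * V[s']
          V_new = Q.max(axis=1)
          if np.max(np.abs(V_new - V)) < tol:  # repeat until convergence
              return V_new, Q.argmax(axis=1)   # utilities and a greedy (optimal) policy
          V = V_new
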
24
Q

Supervised Learning

A

y = f(x). Function approximation: given examples of x and y, find the f that maps x to y.

25
Q

Unsupervised Learning

A

f(x). Find a concise description of the data, e.g., clusters.

26
Q

Reinforcement Learning

A

y = f(x), but we are given x and z. We are still trying to find f to generate y; in RL we see the states (the x's), and the reward r plays the role of z.

27
Q

MDP stands for

A

Markov Decision Processes.

28
Q

MDPs are made up of

A

States, a transition model, actions, and rewards; solving the MDP produces a policy.

29
Q

Markovian Property

A

Only the present matters AND things are stationary (rules/the world doesn’t change over time)

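In symbols: $P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0)$, and (stationarity) this distribution does not change with $t$.
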
30
Q

Delayed Rewards

A

In the chess example, you might make a bad move early on that you can never recover from; the consequence of that bad move only shows up in the reward much later.

31
Q

Temporal Credit Assignment Problem

A

The problem of determining which actions in a sequence led to a certain outcome.

For example, in football, each player takes an action every second; an action can be "pass the ball", "dribble", "run" or "shoot the ball". At the end of the match, the outcome is a victory, a loss or a tie. After the match, the coach analyses the match and the performance of each player and discusses each player's contribution to the result. The problem of determining the contribution of each player to the result of the match is the (temporal) credit assignment problem.

32
Q

How to change policies to account for finite horizons

A

π(s, t): the policy becomes a function of both state and time.

33
Q

Utility of Sequences (stationary preferences)

A

if you prefer one sequence of states over another today, you prefer the same sequence tomorrow

34
Q

How to calculate infinite horizons without infinity?

A

Use discounted future rewards (use gamma)

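With discounting, an infinite sum of bounded rewards stays finite: $U = \sum_{t=0}^{\infty} \gamma^t r_t \le R_{\max}/(1-\gamma)$ for $0 \le \gamma < 1$.
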
35
Q

Reward vs Utility

A

Reward is the immediate payoff of a state. Utility is the long-term payoff from that point on; it takes delayed rewards into account.

36
Q

Policy Iteration

A

start with an initial policy (a guess)

evaluate: given the policy, calculate its utility
improve: the policy at t+1 picks, in each state, the action that maximizes that utility
repeat evaluate/improve until the policy stops changing

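Written out with the standard Bellman equations:

Evaluate: $V^{\pi_t}(s) = R(s, \pi_t(s)) + \gamma \sum_{s'} T(s, \pi_t(s), s')\, V^{\pi_t}(s')$
Improve: $\pi_{t+1}(s) = \arg\max_a \big[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V^{\pi_t}(s') \big]$
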
37
Q

T/F Policy Iteration won’t converge

A

False

38
Q

On Policy vs Off-Policy

A

Off-policy methods estimate the Q-values (state-action values) of the target/greedy policy directly, regardless of the policy being followed by the agent (Q-learning).

On-policy methods update their Q-values using the Q-value of the next state s' and the current policy's action a'. They estimate the return for state-action pairs assuming the current policy continues to be followed (SARSA).

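The two update rules, side by side (standard forms):

Q-learning (off-policy): $Q(s,a) \leftarrow Q(s,a) + \alpha\,[\,r + \gamma \max_{a'} Q(s',a') - Q(s,a)\,]$
SARSA (on-policy): $Q(s,a) \leftarrow Q(s,a) + \alpha\,[\,r + \gamma\, Q(s',a') - Q(s,a)\,]$, where $a'$ is the action the current policy actually takes in $s'$.
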
39
Q

T/F TD update rule always converges with any learning rate

A

False. It converges if the learning rates satisfy the usual conditions: their sum over time diverges (goes to infinity) while the sum of their squares is finite.

40
Q

T/F TD(1) is the same as outcome-based updates (if no repeated states)

A

True, and TD(1) can even learn a bit more, because its updates don't have to wait for the episode to be over.

41
Q

Maximum Likelihood Estimate vs outcome-based estimate (TD(1))

A

Maximum likelihood uses all of the examples, but TD(1) uses only individual runs, so if something rare happens in a run the TD(1) estimate can be far off (high variance). This leads us to TD(lambda).

42
Q

T/F TD(0) is the same as maximum likelihood estimate

A

True, if we run over the data repeatedly (the repeated-presentations regime).

43
Q

T/F TD(lambda) is weighted combination of k step estimators

A

True

44
Q

T/F TD(1) typically has less error than TD(0)

A

False. TD(1) typically has more error than TD(0)

45
Q

T/F TD(0) has the least amount of error usually

A

False. An intermediate TD(lambda) usually performs best; lambda in roughly 0.3-0.7 typically gives the lowest error.

46
Q

Temporal Difference is

A

The difference between value estimates (reward plus discounted next-state value versus the current estimate) as we go from one step to the next.

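In symbols, the one-step TD error is $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$, and TD(0) updates $V(s_t) \leftarrow V(s_t) + \alpha\,\delta_t$.
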
47
Q

T/F reward must be scalar

A

TRUE

Scalar value means single value.

48
Q

T/F environment is visible to the agent

A

False, usually

49
Q

T/F all representations of state have same results

A

False. See the rat/cheese example in David Silver's lectures: a state defined by the past 3 observations gives different results than one defined by the past 4 or 5.

50
Q

Partially Observable

A

The agent views the state indirectly (agent state != environment state), e.g., a robot with a camera. This is a POMDP.

51
Q

T/F Model is the environment

A

False. Model is the agent’s idea of the environment

52
Q

RL vs Planning

A

RL: the model is unknown; the agent learns by performing actions.
Planning: the model is known; the agent performs computations with the model (no interaction with the environment is needed).

53
Q

B is a Contraction Mapping when

A

$\|BF - BG\|_\infty \le \gamma\,\|F - G\|_\infty$ for some $0 \le \gamma < 1$: applying B to F and to G brings them closer together (in max norm) than F and G were.

54
Q

Backward and forward TD(lambda) can be applied to the same problems.

A

True. As noted, though, you would have to wait for a complete episode to compute the forward view (i.e., the forward view does not work "online").

55
Q

gamma of 0 means

A

you care only about the current (short-sighted)

56
Q

gamma of 1 means

A

you care about the future a lot (far-sighted)

57
Q

why use gamma and discount future rewards

A

avoid infinite returns and also account for a model not being perfect (future reward is not guaranteed)

58
Q

T/F You can always flatten an MDP to a Markov Chain

A

True

59
Q

Difference between Markov Decision Process and Markov Reward Process

A

A Markov Decision Process is a Markov Reward Process with decisions: it adds actions for the agent to choose, on top of the stochastic transitions.

60
Q

T/F it has been proven that both forward and backward view of TD(lambda) are equivalent

A

True

61
Q

What is forward TD(lambda)

A

The forward view looks at all n-steps ahead and uses λ to essentially decay those future estimates.

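The forward-view target is the $\lambda$-return, a weighted combination of the n-step returns $G_t^{(n)}$: $G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$.
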
62
Q

What is backward TD(lambda)

A

The backward view of TD(λ) updates values online at each step: after each step in the episode you update all previously visited states in proportion to their eligibility traces.

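A standard form of the backward-view update with accumulating eligibility traces: compute $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$, set $e_t(s) = \gamma\lambda\, e_{t-1}(s) + \mathbf{1}[s = s_t]$, and then update every state with $V(s) \leftarrow V(s) + \alpha\,\delta_t\, e_t(s)$.
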
63
Q

Online learning vs offline learning

A

Online learning means that you are doing it as the data comes in. Offline means that you have a static dataset.

Let’s say you want to build a classifier to recognize spam. You can acquire a large corpus of e-mail, label it, and train a classifier on it. This would be offline learning. Or, you can take all the e-mail coming into your system, and continuously update your classifier (labels may be a bit tricky). This would be online learning

64
Q

SARSA vs Q- Learning

A

Q-learning is an off-policy TD control method. It is exactly like SARSA except that its update target does not use the action the policy actually takes next; instead it bootstraps from the greedy (max-value) action in the next state.

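A minimal Python sketch of the two update rules, assuming a tabular Q stored as a NumPy array indexed [state, action] (the array layout and the epsilon-greedy helper are illustrative assumptions, not from the lectures):

  import numpy as np

  def epsilon_greedy(Q, s, eps, rng):
      # behavior policy used by both learners: random action with prob. eps, else greedy
      return int(rng.integers(Q.shape[1])) if rng.random() < eps else int(np.argmax(Q[s]))

  def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma):
      # on-policy: bootstrap from the action a2 the policy actually takes in s2
      Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])

  def q_learning_update(Q, s, a, r, s2, alpha, gamma):
      # off-policy: bootstrap from the greedy action in s2, regardless of what we do next
      Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
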
65
Q

T/F Offline algorithms are generally superior to online algorithms.

A

False, generally; it depends on the problem and a number of other factors.

66
Q

T/F Given a model (T, R) that we can also sample from, we should first try TD learning.

A

False. If you have the model, one would expect it to be more efficient to use it directly rather than to sample with TD learning.

67
Q

T/F TD(1) slowly propagates information, so it does better in the repeated presentations regime rather than with single presentations.

A

False. It is TD(0) that propagates information slowly and therefore does better with repeated presentations; TD(1) propagates information through the whole trajectory in a single presentation.

68
Q

T/F In TD(lambda), we should see the same general curve for best learning rate (lowest error), regardless of lambda value.

A

Generally True, same shape but different minima

69
Q

T/F In general, an update rule which is not a non-expansion will not converge.

A

True.

A non-expansion is {constant, contraction}, and a not non-expansion is not a constant and not a contraction, therefore it is an expansion and expansions diverge.

70
Q

T/F MDPs are a type of Markov game.

A

True. Markov games with only one player action space and a corresponding single reward function are MDPs.

71
Q

T/F Contraction mappings and non-expansions are concepts used to prove the convergence of RL algorithms, but are otherwise unrelated concepts.

A

False. They are closely related: composing non-expansions with a contraction still yields a contraction, which is what provides the convergence guarantees.

72
Q

T/F Linear programming is the only way we are able to solve MDPs in linear time.

A

False.

First of all, LP takes super-linear polynomial time. Secondly, LP is only one part of a potential RL algorithm. Third of all, there are many methods that don’t use LP that solve MDPs (though none in linear time).

73
Q

T/F The objective of the dual linear program (LP) presented in lecture is minimization of "policy flow". (The minimization is because we are aiming to find an upper bound on "policy flow".)

A

False. The objective of the dual linear program (LP) is to maximize "policy flow", subject to conservation-of-flow constraints. The minimization appears in the primal LP, in order to find the least upper bound of the value function over states and actions.

74
Q

T/F max is non-expansion

A

True

The non-expansion is a vital and widely-used sufficient condition to guarantee the convergence of value iteration.

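Concretely, for any functions $f, g$ over actions: $|\max_a f(a) - \max_a g(a)| \le \max_a |f(a) - g(a)|$, which is exactly the non-expansion property.
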
75
Q

T/F Any optimal policy found with reward shaping is the optimal policy for the original MDP.

A

False. Only potential-based shaping is guaranteed to preserve the optimal policy; other shaping can introduce suboptimal policy loops.

76
Q

Generalized MDPs

A

Their basic concept is that in the Bellman equations, the operation $\sum_y P(x,a,y)\,(\cdot)$ (i.e., taking the expected value with respect to the transition probabilities) describes the effect of the environment, while the operation $\max_a(\cdot)$ describes the effect of an optimal agent (i.e., selecting an action with maximum expected value). Changing these operators recovers other well-known models.

As long as the operators doing the updates are non-expansions, we get convergence of Q-learning, value iteration, and related algorithms.

https://sites.ualberta.ca/~szepesva/papers/gmdp.ps.pdf

77
Q

examples of non-expansion

A
  • order statistics (max, min)
  • fixed convex combinations

78
Q

T/F If you get a good enough approximation of the value function, then you have a bound on how far off the resulting policy is.

A

True

79
Q

T/F Policy Iteration converges at least as fast as Value Iteration (VI)

A

True

80
Q

T/F Policy Iteration is more computationally expensive than Value Iteration (VI)

A

True

Policy iteration is also guaranteed to converge to the optimal policy, and it often takes fewer iterations to converge than the value-iteration algorithm.

Compared with value iteration, policy iteration typically needs considerably fewer iterations to converge, although each iteration is more computationally expensive.

81
Q

Domination

A

policy 1 dominates policy 2 if for all states the value of that state for policy 1 is greater than or equal to the value of that state for policy 2

82
Q

T/F policy iteration won’t get stuck in a local optimal

A

True

If the current policy π# is not optimal, the policy improvement step will not stop at it: there is at least one state where the optimal policy π* takes a different action and yields a strictly higher value, so improvement continues.

83
Q

T/F Policy iteration has the value non-deprovement property (values never get worse from one step to the next)

A

True. Each improvement step produces a policy whose value is at least as high in every state.

84
Q

How can we change the MDP reward function w/o changing optimal policy (3 ways)

A

1. Multiply by a positive constant
2. Shift by a constant
3. Non-linear potential-based rewards

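Written out, with $c > 0$ a constant and $\Phi$ a potential function over states:

1. Scale: $R'(s,a,s') = c\,R(s,a,s')$
2. Shift: $R'(s,a,s') = R(s,a,s') + c$
3. Potential-based shaping: $R'(s,a,s') = R(s,a,s') + \gamma\,\Phi(s') - \Phi(s)$
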
85
Q

T/F initializing q values is the same as using potential based functions

A

True. Initializing the Q-values plays the same role as a potential-based shaping function; initializing to 0 is the neutral choice, while a random initialization means you are (arbitrarily) biasing the learner.

86
Q

Potential-based shaping will find an optimal policy faster than an unshaped MDP.

A

False. If the selected potential function is poorly chosen, shaping can actually slow things down.

87
Q

T/F potential functions can speed things up

A

True

88
Q

What is the problem with reward shaping

A

We can screw things up and create suboptimal policies that optimize the "helper" rewards instead of the real objective.

89
Q

T/F Rmax will always find the optimal policy for a properly tuned learning function.

A

False (although true in spirit). The problem is the word ‘always.’ As Jacob alludes to, in these PAC-style bounds, there is always a small probability that the desired goal isn’t reached, and if reached it’s not quite optimal.

90
Q

Q-learning converges only under certain exploration decay conditions.

A

False. Convergence is guaranteed by the contraction property of the Bellman operator (together with the usual learning-rate and visitation conditions), which does not include any assumptions about how exploration decays.

91
Q

The tradeoff between exploration and exploitation is not applicable to finite bandit domains since we are able to sample all options.

A

False. You may still need to explore “non-optimal” arms of the bandit since there may be noise in the estimate of its value.

92
Q

T/F Since SARSA is an “online” algorithm, it will never learn the optimal policy for an environment.

A

False. SARSA converges to an optimal policy as long as all state-action pairs are visited an infinite number of times and the policy converges to the greedy policy.

93
Q

T/F RL with linear function approximation only works on toy problems.

A

False. Linear function approximation with good features works beyond toy problems; nonlinear functions can also be approximated by piecewise-linear constructions.

94
Q

T/F KWIK “know what it knows” algorithms are an improvement of MB (mistake bound) algorithms and as such will have same or better time complexity for all classes of problems.

A

False. Some hypothesis classes are exponentially harder to learn in the KWIK setting than in the MB setting.

A KWIK algorithm is an online learning algorithm that is considered to be self-aware: it asks the expert for the answer if and only if it believes it has insufficient experience to predict on a new sample.

95
Q

T/F With a classic update using linear function approximation, we will always converge to some values, but they may not be optimal.

A

False. Baird’s counter-example shows how classic update using linear function approximation can lead to divergence.

96
Q

T/F Applying generalization with an “averager” on an MDP results in another MDP.

A

True. As noted in Gordon (1995), the result is another (smaller) MDP with fewer states.

97
Q

Hoeffding and union bounds

A

A way to quantify how much we know (how certain we are), i.e., to know when we are certain enough about an estimate.

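One common form of the Hoeffding bound (for $n$ i.i.d. samples in $[0,1]$ with mean $\mu$): $P(|\bar{X}_n - \mu| \ge \epsilon) \le 2e^{-2n\epsilon^2}$. The union bound then lets this hold simultaneously over many such estimates.
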
98
Q

Rmax

A

Optimism in the face of uncertainty

R-max is a very simple model-based reinforcement learning algorithm which can attain near-optimal average reward in polynomial time. In R-max, the agent always maintains a complete, but possibly inaccurate model of its environment and acts based on the optimal policy derived from this model.

99
Q

Can combine Rmax bandits

A

stochastic + sequential

100
Q

When does generalization work

A

essentially anything that generalizes a linear model (ANN, CMAC, linear nets, deep nets)

101
Q

Generalization

A

Machine learning is a discipline in which given some training data/environment, we would like to find a model that optimizes some objective, but with the intent of performing well on data that has never been seen by the model during training.

This is usually referred to as Generalization, or the ability to learn something that is useful beyond the specifics of the training environment.

Some states look like other states, so we describe states by features and learn how much each feature matters when making a decision, replacing the tabular update rule with function approximation.

102
Q

T/F Averagers can converge to the optimal solution (but do not always)

A

True. Anchor points are needed to converge.

103
Q

T/F value iteration converges for POMDPs

A

False, but you can get near-optimal

104
Q

T/F POMDPs are easy to solve

A

False. Exact solutions use piecewise-linear and convex value functions, throwing away vectors that are strictly dominated, but this is expensive.

POMDPs: we're unsure which state we're in; the current state emits observations.

  • we have control over the state transitions
  • the states are not completely observable
105
Q

Model based POMDP

A

Use expectation maximization to learn the model

The model-based approach — learning a POMDP model of the world, and computing an optimal policy for the learned model — may generate superior results in the presence of sensor noise, but learning and solving a model of the environment is a difficult problem.

When sensor noise increases, model-free methods provide less accurate policies.

106
Q

memoryless POMDP

A

It helps to be random: a stochastic memoryless policy can outperform any deterministic memoryless policy.

107
Q

Properties of Monte Carlo Tree Search

A
  • useful for large states
  • needs lot of samples to get good estimates
  • planning time independent of size of state space
  • running time is exponential in the horizon
108
Q

Gittens index only works for

A

bandits

109
Q

How does monte carlo tree search get q value estimates

A

random simulation

110
Q

Goal abstraction

A

Modular RL and arbitration.

Abstraction is a technique to reduce the complexity of a problem by filtering out irrelevant properties while preserving all the important ones necessary to still be able solve a given problem

111
Q

Temporal abstractions (hierarchical RL)

A

Options and semiMDPs

Applying reinforcement learning (RL) to real-world problems will require reasoning about action-reward correlation over long time horizons. Hierarchical reinforcement learning (HRL) methods handle this by dividing the task into hierarchies, often with hand-tuned network structure or pre-defined subgoals.

The learned abstraction allows us to learn new tasks on higher level more efficiently. We convey a significant speedup in convergence over benchmark learning problems. These results demonstrate that learning temporal abstractions is an effective technique in increasing the convergence rate and sample efficiency of RL algorithms.

112
Q

Pavlov strategy

A

Cooperate if agree, defect if disagree

113
Q

Subgame perfect

A

always best response independent of history

A Nash equilibrium is said to be subgame perfect if and only if it is a Nash equilibrium in every subgame of the game.

114
Q

Properties of minimax 1 for 2 player 0 sum games

A
  • value iteration works
  • minimax-Q converges
  • efficient
115
Q

Properties of nash q for multiplayer

A
  • value iteration does not work
  • Nash-Q does not converge

116
Q

Does coco generalize to n>2 players

A

No. Coco-Q, like coco values in normal-form games, is not defined for games with three or more players.

Coco ("cooperative/competitive") values are a solution concept for two-player normal-form games with transferable utility, when binding agreements and side payments between players are possible.

117
Q

what is coco

A

cooperative competitive values

118
Q

T/F POMDPs can be planned

A

False (exact planning in POMDPs is intractable in general; only approximate planning is practical).

119
Q

Semi-MDP

A

Instead of working only with a small set of primitive actions, we can create new temporally extended actions whose transitions span multiple steps, which can reduce the number of decisions needed to learn and move toward a goal.

120
Q

T/F Because SMDPs abstract an underlying MDP, they inherit the hierarchical optimality, convergence and stability of an MDP

A

FALSE; it is only true with the caveat that we choose the right options. Also, the abstracted-away states don't matter, which helps with exploration.