Final Flashcards
It is not always possible to convert a finite horizon MDP to an infinite horizon MDP.
False. You can always convert a terminal state into an absorbing state with a transition to itself and reward 0.
In RL, recent moves influence outcomes more than moves further in the past.
False. While that can be the case (and often works well), some algorithms, like TD(1), give equal weight to each move, and it is also possible to give more weight to earlier moves. It depends on the problem you’re trying to solve.
False. You can lose a chess game on your first move – and apparently someone did.
An MDP given a fixed policy is a Markov chain with rewards.
True, since a fixed policy means the agent doesn’t have the option to choose an action in each state. The agent transitions from state to state according to this fixed policy (without choosing any actions), which is a Markov chain.
True. A fixed policy means a fixed action is taken given a state. In this case, this MDP totally depends on the current state, and it is now reduced to a Markov chain with rewards.
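A small sketch of this collapse, assuming a made-up 2-state, 2-action MDP: folding a fixed policy into the transition tensor leaves a plain Markov chain matrix (plus the rewards).

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[s, a, s'] is the transition tensor.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0 under actions 0, 1
    [[0.5, 0.5], [0.0, 1.0]],   # transitions from state 1 under actions 0, 1
])
R = np.array([1.0, 0.0])        # reward per state (carried over unchanged)

# A fixed (stochastic) policy pi(a|s): rows are states, columns are actions.
pi = np.array([
    [1.0, 0.0],   # always take action 0 in state 0
    [0.3, 0.7],   # mix actions in state 1
])

# Folding the policy into the dynamics yields a Markov chain with rewards:
# P_pi[s, s'] = sum_a pi(a|s) * P[s, a, s']
P_pi = np.einsum('sa,sax->sx', pi, P)
print(P_pi)   # a plain 2x2 stochastic matrix: no actions left
```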
If we know the optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix.
False, you don’t need the transition function.
False. For an optimal policy, V*(s) = max_a Q*(s, a). No need to know the environment’s transition matrix.
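A quick illustration with an assumed toy Q-table: recovering V* is just a max over the action axis, with no model in sight.

```python
import numpy as np

# Hypothetical optimal Q-table: rows are states, columns are actions.
Q = np.array([
    [1.0, 3.0],
    [0.5, 0.2],
])

# V*(s) = max_a Q*(s, a) -- no transition model needed.
V = Q.max(axis=1)
print(V)   # [3.0, 0.5]
```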
In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s reward (terminal and non-terminal) the optimal policy will not change.
True, assuming an infinite horizon the optimal policy will be unchanged
Markov means RL agents are amnesiacs and forget everything up until the current state.
True here: the current state is all the agent should know.
Now if you want to discuss what a “current state” is… Well that can get more complicated.
True. In an MDP the present is a sufficient statistic of the past.
Only the current state matters and what past states were is irrelevant to the agent.
A policy that is greedy–with respect to the optimal value function–is not necessarily an optimal policy.
False, as noted: taking the greedy action with respect to the optimal value function is optimal by definition.
In TD learning, the sum of the learning rates used must converge for the value function to converge.
False, the sum of learning rates must diverge for the value function to converge.
Also, the sum of learning rates squared must converge for the value function to converge.
Monte Carlo is an unbiased estimator of the value function compared to TD methods. Therefore, it is the preferred algorithm when doing RL with episodic tasks.
False. (second sentence)
True –> Monte Carlo is an unbiased estimator of the value function compared to TD methods –> TD methods start with an initial estimate of the Q values and tend to be biased toward those initial estimates.
False –> Most episodic tasks are too long and the computational advantages of TD updates favor TD methods over MC methods.
The value of the returned policy is the only metric we care about when evaluating a learner.
False. We can evaluate a learner on other metrics including computational complexity and how much data is required to effectively learn.
T/F POMDPs allow us to strike a balance between actions to gain reward and actions to gain information
TRUE. This is all folded into one model (no special machinery is needed to do this).
Partially Observable Markov Decision Processes (POMDPs) have succeeded in planning domains that require balancing actions that increase an agent’s knowledge and actions that increase an agent’s reward.
T/F DEC-POMDPS allow us to wrap coordinating and communicating into choosing actions to maximize utility
True.
The decentralized partially observable Markov decision process (Dec-POMDP) is a model for coordination and decision-making among multiple agents.
It is a probabilistic model that can consider uncertainty in outcomes, sensors and communication (i.e., costly, delayed, noisy or nonexistent communication). It is a generalization of a Markov decision process (MDP) and a partially observable Markov decision process (POMDP) to multiple decentralized agents.
DEC-POMDP stands for
Decentralized Partially Observable Markov Decision Process
The primary difference between POMDPs and DEC-POMDPS
In DEC-POMDPS, actions are taken simultaneously by a finite set of agents (not just 1)
DEC-POMDPS vs POSG
DEC-POMDP all share reward function (they are working together)
—> Note: POSG = Partially Observable Stochastic Game, the variant in which each agent has its own reward function rather than a shared one.
T/F DEC-POMDPs are represented by Ri (diff reward for each agent)
FALSE, there is a shared Reward (all agents working together). If reward wasn’t shared the model would be POSG
Properties of DEC-POMDP
- Elements of game theory and POMDPS
- NEXP-complete (for finite horizon) –> NEXP-complete when the input is represented compactly (e.g., as a circuit)
Inverse RL
Input behavior and environment and OUTPUT reward
MLIRL (Maximum Likelihood Inverse RL)
- Guess R
- Compute policy
- Measure probability of data given policy
- Compute gradient R to find how to change reward to make it fit the data better
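The four steps above can be sketched on a toy 3-state chain. This is a hedged illustration, not the lecture's exact method: the softmax policy from soft value iteration, the numeric gradient, and all constants (`beta`, learning rate, chain dynamics) are assumptions made for the sketch.

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9

def step(s, a):                       # deterministic chain: 0 = left, 1 = right
    return max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)

def soft_policy(R, iters=50, beta=5.0):
    """2. Compute a policy: value iteration under reward guess R, then softmax."""
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = np.array([[R[s] + gamma * V[step(s, a)]
                       for a in range(n_actions)] for s in range(n_states)])
        V = Q.max(axis=1)
    expQ = np.exp(beta * (Q - Q.max(axis=1, keepdims=True)))
    return expQ / expQ.sum(axis=1, keepdims=True)

def log_likelihood(R, data):
    """3. Measure probability of the demonstration data given the policy."""
    pi = soft_policy(R)
    return sum(np.log(pi[s, a]) for s, a in data)

data = [(0, 1), (1, 1), (2, 1)]       # demonstrations: expert always moves right
R = np.zeros(n_states)                # 1. guess R
for _ in range(200):                  # 4. change R to fit the data better
    grad = np.zeros(n_states)
    for i in range(n_states):         # numeric gradient of the log-likelihood
        dR = np.zeros(n_states); dR[i] = 1e-4
        grad[i] = (log_likelihood(R + dR, data) - log_likelihood(R - dR, data)) / 2e-4
    R += 0.1 * grad
print(R)   # learned reward should favor the rightmost state
```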
In reward shaping, if the human believes X, Y, Z with 2/3, 1/6, 1/6 and the algorithm believes 1/10, 1/10, 8/10. What action should they choose?
Choose action Z because pairwise product is highest.
argmax_a p(a|policy1) * p(a|policy2)
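A quick check of the arithmetic for the example above (action names assumed):

```python
import numpy as np

# Combine the two belief distributions by pairwise product, then take the argmax.
human = np.array([2/3, 1/6, 1/6])     # beliefs over actions X, Y, Z
algo  = np.array([1/10, 1/10, 8/10])
actions = ['X', 'Y', 'Z']

product = human * algo                # [0.0667, 0.0167, 0.1333]
print(actions[np.argmax(product)])    # Z
```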
Drama Management
There is a 3rd agent, the “Author,” that wants to build an agent that interferes with the player. The Author created the game (e.g., Pac-Man), the agent is the game itself, and the player is the player.
TTD-MDPs vs MDPs
- rather than states: have trajectories (sequence so far)
- rather than rewards: have target distribution p(T).
TTD-MDP stands for Targeted Trajectory Distribution Markov Decision Process
Value Iteration Algorithm
Start w/ arbitrary utilities
Update utilities based on reward + neighbors (discounted future reward)
Repeat until convergence
Note that value iteration is obtained simply by turning the Bellman optimality equation into an update rule. Also note how the value iteration backup is identical to the policy evaluation backup (4.5) except that it requires the maximum to be taken over all actions.
Value iteration effectively combines, in each of its sweeps, one sweep of policy evaluation and one sweep of policy improvement.
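The three steps above can be sketched on a hypothetical 2-state, 2-action MDP (the transition probabilities and rewards are made up for illustration):

```python
import numpy as np

# P[s, a, s'] transition probabilities, R[s] rewards, gamma discount.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R = np.array([0.0, 1.0])
gamma = 0.9

U = np.zeros(2)                        # 1. start with arbitrary utilities
for _ in range(1000):                  # 3. repeat until convergence
    # 2. Bellman update: reward + discounted expected utility of neighbors
    U_new = R + gamma * np.max(np.einsum('sax,x->sa', P, U), axis=1)
    if np.max(np.abs(U_new - U)) < 1e-10:
        break
    U = U_new
print(U)   # state 1 (self-loop with reward 1) ends up near 1/(1-gamma) = 10
```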
Supervised Learning
y = f(x). Function approximation: find f that maps x to y.
Unsupervised Learning
f(x). Find a compact description of the data (e.g., clusters).
Reinforcement Learning
y = f(x), but we are given x and z. Still trying to find f to generate y. The reward r plays the role of z.
MDP stands for
Markov Decision Processes.
MDPs are made up of
States, actions, transitions (the model), and rewards; solving the MDP produces a policy.
Markovian Property
Only the present matters AND things are stationary (rules/the world doesn’t change over time)
Delayed Rewards
In the chess example, you may make a bad move early on that you can never recover from. That bad move needs to be reflected in the (delayed) reward.
Temporal Credit Assignment Problem
The problem of determining the actions that lead to a certain outcome in sequence.
For example, in football, at each second, each football player takes an action. In this context, an action can e.g. be “pass the ball”, “dribble”, “run” or “shoot the ball”. At the end of the football match, the outcome can either be a victory, a loss or a tie. After the match, the coach talks to the players and analyses the match and the performance of each player. He discusses the contribution of each player to the result of the match. The problem of determining the contribution of each player to the result of the match is the (temporal) credit assignment problem.
How to change policies to account for finite horizons
Policy(s, t): the policy becomes a function of state AND time.
Utility of Sequences (stationary preferences)
if you prefer one sequence of states over another today, you prefer the same sequence tomorrow
How to calculate infinite horizons without infinity?
Use discounted future rewards (use gamma)
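For a constant reward r, the discounted sum converges to r / (1 - gamma), so the infinite horizon has a finite value; a quick numeric check:

```python
# sum_{t=0}^inf gamma^t * r = r / (1 - gamma) when 0 <= gamma < 1.
gamma, r = 0.9, 1.0
partial = sum(gamma**t * r for t in range(1000))   # truncated sum
closed_form = r / (1 - gamma)
print(partial, closed_form)   # both approximately 10.0
```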
Reward vs Utility
Reward is immediate payoff of state. Utility is long term payoff of action, it takes into account delayed reward
Policy Iteration
start with initial policy (guess)
evaluate: given policy, calculate utility
improve: Policy at t+1 is the action that maximizes the utility
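The evaluate/improve loop can be sketched on a toy 2-state, 2-action MDP (numbers assumed for illustration); since the reward here depends only on the state, the greedy improvement can compare expected utilities directly.

```python
import numpy as np

P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R = np.array([0.0, 1.0])
gamma = 0.9
policy = np.zeros(2, dtype=int)        # initial guess: always action 0

while True:
    # Evaluate: solve U = R + gamma * P_pi U exactly as a linear system.
    P_pi = P[np.arange(2), policy]     # 2x2 chain matrix under current policy
    U = np.linalg.solve(np.eye(2) - gamma * P_pi, R)
    # Improve: greedy policy with respect to the evaluated utilities.
    new_policy = np.argmax(np.einsum('sax,x->sa', P, U), axis=1)
    if np.array_equal(new_policy, policy):
        break                          # converged: policy is its own improvement
    policy = new_policy
print(policy, U)
```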
T/F Policy Iteration won’t converge
False
On Policy vs Off-Policy
off policy estimates the q values (state-action value) directly from the Q function regardless of the policy being followed by the agent. (Q-learning)
on-policy updates its Q-values using the Q-value of the next state s′ and the current policy’s next action a′. It estimates the return for state-action pairs assuming the current policy continues to be followed. (SARSA)
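The two update rules can be written side by side; the toy Q-table and the values of alpha and gamma are assumed for illustration.

```python
# Q-learning (off-policy): bootstraps from the greedy next action, max_a Q(s', a).
def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.9):
    Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])

# SARSA (on-policy): bootstraps from the action a2 the current policy actually takes.
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a])

Q1 = {0: [0.0, 2.0], 1: [5.0, 1.0]}   # toy Q-table: Q[state][action]
Q2 = {0: [0.0, 2.0], 1: [5.0, 1.0]}

q_learning_update(Q1, 0, 0, r=1.0, s2=1)         # target uses max(Q[1]) = 5.0
sarsa_update(Q2, 0, 0, r=1.0, s2=1, a2=1)        # target uses Q[1][1] = 1.0
print(Q1[0][0], Q2[0][0])   # 0.55 vs 0.19: same experience, different targets
```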
T/F TD update rule always converges with any learning rate
False. It will converge if the sum of the learning rates over time diverges (goes to infinity) and the sum of the squared learning rates converges (stays finite).
T/F TD(1) is the same as outcome-based updates (if no repeated states)
True, and with even more learning, because updates don’t have to wait for the episode to be over.
Maximum Likelihood Estimate vs outcome-based estimate (TD(1))
Maximum likelihood uses all of the examples, but TD(1) uses just individual runs, so if a rare event happens in a run, the TD(1) estimate can be skewed by it (high variance). (This leads us to TD(lambda).)
T/F TD(0) is the same as maximum likelihood estimate
TRUE, if we run over the data over and over again (batch TD(0) converges to the maximum likelihood estimate).
T/F TD(lambda) is weighted combination of k step estimators
True
T/F TD(1) typically has less error than TD(0)
False. TD(1) typically has more error than TD(0)
T/F TD(0) has the least amount of error usually
False. An intermediate TD(lambda) performs best; lambda values of roughly 0.3 to 0.7 usually work best.
Temporal Difference is
Difference between reward (value estimates) as we go from one step to another
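A one-step illustration with made-up values: the TD error delta = r + gamma * V(s') - V(s) is exactly that step-to-step difference, and the TD(0) update moves the value estimate toward the target.

```python
# Two states A -> B with assumed values, reward, and step size.
V = {'A': 0.0, 'B': 1.0}
gamma, alpha = 0.9, 0.5
r = 0.5                               # reward observed going A -> B

delta = r + gamma * V['B'] - V['A']   # TD error: 0.5 + 0.9 - 0.0 = 1.4
V['A'] += alpha * delta               # TD(0) update toward the target
print(V['A'])   # 0.7
```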
T/F reward must be scalar
TRUE
Scalar value means single value.
T/F environment is visible to the agent
False, usually