Final Flashcards
(120 cards)
It is not always possible to convert a finite horizon MDP to an infinite horizon MDP.
False. You can always convert a terminal state into an absorbing state with a transition to itself and reward 0
In RL, recent moves influence outcomes more than moves further in the past.
False. While that can be the case (and often works well), some algorithms, like TD(1), give equal weight to every move, and it is also possible to weight earlier moves more heavily. It depends on the problem you're trying to solve.
False. You can lose a chess game on your first move – and apparently someone did.
An MDP given a fixed policy is a Markov chain with rewards.
True, since a fixed policy means the agent doesn't get to choose an action in each state. The agent transitions from state to state according to this fixed policy (without choosing any actions), which is exactly a Markov chain.
True. A fixed policy means a fixed action is taken given a state. The dynamics then depend only on the current state, so the MDP reduces to a Markov chain with rewards.
If we know the optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix.
False, you don’t need the transition function.
False. V*(s) = max over a of Q*(s, a), so the optimal V values follow directly from the optimal Q values; no knowledge of the environment's transition matrix is needed.
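As a quick illustration (a minimal sketch of the point above, not part of the original cards; the Q table is made up), the optimal V values and the greedy policy come from a max over actions, with no transition model involved:

```python
import numpy as np

# Hypothetical 3-state, 2-action table of optimal Q-values (illustrative numbers only).
Q = np.array([[1.0, 2.5],
              [0.3, 0.1],
              [4.0, 3.9]])

V = Q.max(axis=1)      # V*(s) = max_a Q*(s, a); no transition matrix needed
pi = Q.argmax(axis=1)  # the greedy (optimal) policy falls out the same way
print(V, pi)           # [2.5 0.3 4. ] [1 0 0]
```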
In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s reward (terminal and non-terminal) the optimal policy will not change.
True, assuming an infinite horizon: adding a constant c to every reward at every step shifts every state's value by c/(1-γ), so the relative ordering of actions, and hence the optimal policy, is unchanged.
Markov means RL agents are amnesiacs and forget everything up until the current state.
True: the current state is all the agent needs to know.
Now if you want to discuss what a “current state” is… Well that can get more complicated.
True. In an MDP the present state is a sufficient statistic of the past.
Only the current state matters; how the agent got there is irrelevant.
A policy that is greedy with respect to the optimal value function is not necessarily an optimal policy.
False. Taking the greedy action with respect to the optimal value function is optimal by definition.
In TD learning, the sum of the learning rates used must converge for the value function to converge.
False. The sum of the learning rates must diverge for the value function to converge, while the sum of the squared learning rates must converge (see the conditions written out below).
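Written out, these are the standard stochastic-approximation (Robbins-Monro) conditions on the learning rates; the example choice of α_t is mine, for illustration:

```latex
% Conditions on the TD learning rates \alpha_t for the value estimates to converge:
\[
  \sum_{t=1}^{\infty} \alpha_t = \infty,
  \qquad
  \sum_{t=1}^{\infty} \alpha_t^{2} < \infty
\]
% e.g. \alpha_t = 1/t satisfies both, while a constant \alpha_t = c > 0 violates the second.
```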
Monte Carlo is an unbiased estimator of the value function compared to TD methods. Therefore, it is the preferred algorithm when doing RL with episodic tasks.
False (the second sentence is the false part).
The first sentence is true: Monte Carlo is an unbiased estimator of the value function, whereas TD methods start from an initial estimate of the Q values and are biased toward it.
The second sentence is false: most episodic tasks are long enough that the computational advantages of TD updates favor TD methods over MC methods.
The value of the returned policy is the only metric we care about when evaluating a learner.
False. We can evaluate a learner on other metrics including computational complexity and how much data is required to effectively learn.
T/F: POMDPs allow us to strike a balance between actions taken to gain reward and actions taken to gain information.
TRUE. This is all folded into one model (no special machinery is needed to do this).
Partially Observable Markov Decision Processes (POMDPs) have succeeded in planning domains that require balancing actions that increase an agent’s knowledge and actions that increase an agent’s reward.
T/F: DEC-POMDPs allow us to wrap coordinating and communicating into choosing actions to maximize utility.
True.
The decentralized partially observable Markov decision process (Dec-POMDP) is a model for coordination and decision-making among multiple agents.
It is a probabilistic model that can consider uncertainty in outcomes, sensors and communication (i.e., costly, delayed, noisy or nonexistent communication). It is a generalization of a Markov decision process (MDP) and a partially observable Markov decision process (POMDP) to consider multiple decentralized agents.
DEC-POMDP stands for
Decentralized Partially Observable Markov Decision Process
The primary difference between POMDPs and DEC-POMDPs
In DEC-POMDPs, actions are taken simultaneously by a finite set of agents (not just one).
DEC-POMDPs vs. POSGs
In a DEC-POMDP, all agents share one reward function (they are working together).
-> Note: POSG = Partially Observable Stochastic Game, the same setup except that each agent has its own reward function.
T/F: DEC-POMDPs are represented by Ri (a different reward for each agent)
FALSE. There is a shared reward (all agents working together); if the reward weren't shared, the model would be a POSG.
Properties of DEC-POMDPs
- Elements of both game theory and POMDPs
- NEXP-complete (for the finite horizon case); NEXP-complete when the input is represented as a brief circuit
Inverse RL
Input: behavior and the environment. Output: a reward function.
MLIRL (Maximum Likelihood Inverse RL)
- Guess a reward function R
- Compute the policy for that R
- Measure the probability of the demonstration data given that policy
- Compute the gradient with respect to R to find how to change the reward so it fits the data better, and repeat (see the sketch below)
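A rough sketch of that loop (illustrative code, not the lecture's implementation; it assumes a small tabular MDP with known transitions P[s, a, s'], a state-based reward vector, a Boltzmann policy derived from soft Q-values, and a finite-difference gradient in place of the analytic one):

```python
import numpy as np

def soft_q_values(R, P, gamma=0.95, iters=200):
    """Soft (log-sum-exp) Bellman backups for a state-based reward vector R."""
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        V = np.log(np.exp(Q).sum(axis=1))   # soft max over actions
        Q = R[:, None] + gamma * (P @ V)    # Q(s,a) = R(s) + gamma * E[V(s')]
    return Q

def log_likelihood(R, P, demos, beta=2.0):
    """Log-probability of demonstrated (state, action) pairs under a Boltzmann policy."""
    logits = beta * soft_q_values(R, P)
    log_pi = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return sum(log_pi[s, a] for s, a in demos)

def mlirl(P, demos, steps=50, lr=0.1, eps=1e-3):
    """Guess R, derive a policy, score the data, and follow the gradient of R."""
    R = np.zeros(P.shape[0])                # 1. guess a reward
    for _ in range(steps):
        base = log_likelihood(R, P, demos)  # 2-3. policy + probability of the data
        grad = np.zeros_like(R)
        for s in range(len(R)):             # 4. finite-difference gradient w.r.t. R
            R_plus = R.copy()
            R_plus[s] += eps
            grad[s] = (log_likelihood(R_plus, P, demos) - base) / eps
        R += lr * grad                      # nudge R to better explain the data
    return R
```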
In policy shaping, if the human believes actions X, Y, Z with probabilities 2/3, 1/6, 1/6 and the algorithm believes 1/10, 1/10, 8/10, which action should they choose?
Choose action Z because its pairwise product is highest: X: 2/3 × 1/10 ≈ 0.067, Y: 1/6 × 1/10 ≈ 0.017, Z: 1/6 × 8/10 ≈ 0.133.
In general: argmax_a p(a|policy1) * p(a|policy2)
Drama Management
There is a third agent, the "author," who wants to build an agent that intervenes in the player's experience. Example: the author created Pac-Man, the agent is the game itself, and the player is the player.
TTD-MDPs vs MDPs
- rather than states: trajectories (the sequence so far)
- rather than rewards: a target distribution over trajectories, p(T)
TTD-MDP = Targeted Trajectory Distribution Markov Decision Process
Value Iteration Algorithm
Start w/ arbitrary utilities
Update utilities based on reward + neighbors (discounted future reward)
Repeat until convergence (see the sketch below)
Note that value iteration is obtained simply by turning the Bellman optimality equation into an update rule. Also note how the value iteration backup is identical to the policy evaluation backup (4.5) except that it requires the maximum to be taken over all actions.
Value iteration effectively combines, in each of its sweeps, one sweep of policy evaluation and one sweep of policy improvement.
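A minimal sketch of that loop (illustrative code, not from the cards; it assumes a tabular MDP with known transitions P[s, a, s'] and state rewards R[s]):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Start with arbitrary utilities and back them up until they stop changing."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)                   # arbitrary initial utilities
    while True:
        # Bellman optimality backup: reward + discounted expected future utility,
        # maximized over actions (the Bellman optimality equation as an update rule).
        Q = R[:, None] + gamma * (P @ V)     # shape (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:  # repeat until convergence
            return V_new, Q.argmax(axis=1)   # utilities and the greedy policy
        V = V_new
```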
Supervised Learning
y = f(x). Function approximation: find the f that maps x to y.