Reinforcement Learning Flashcards

1
Q

Q: In a game, if all players were aware of their opponents’ strategies but could not increase their own reward by changing their strategy, this game is in a state known as the “Nash equilibrium”

A

A: True. If no one can gain more reward by changing their individual strategy, that is a Nash equilibrium (Video Lectures – Lesson 12)

2
Q

Q: The “Pavlov” strategy is sub-game perfect with respect to the Prisoner’s Dilemma

A

A: True. Start two opposing players off with any combination of actions: if both employ the Pavlov strategy, they will eventually settle into mutual cooperation, so the equilibrium holds in every subgame. (Video Lectures – Lesson 13)

3
Q

Q: The Credit Assignment Problem refers to the state that gives the most reward on an MDP.

A

A: False. The credit assignment problem is the retrospective question, given the reward received at the end of a trajectory, of which states and actions along the way were most responsible for the ultimate result.

4
Q

Q: Model-based reinforcement learning is the use of supervised learning models such as neural networks to solve large state space RL problems.

A

A: False. Model-based reinforcement learning iteratively builds up a model of the environment inside the learner and chooses actions by planning with the current model's transition function, reward function, and state space.

5
Q

Q: Value iteration is one of the most important model-free reinforcement learning methods.

A

A: False. Value iteration requires the transition and reward model, so it is a model-based (planning) method.

6
Q

Q: Off-policy agents learn the value of a policy different than the policy they are acting under.

A

A: True. Off-policy examples include Q-learning and DQN, where actions may be random (exploratory) or determined by previously stored values while the value of the greedy policy is being learned; the target policy differs from the behavior policy.
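As a minimal sketch (my own illustration, not from the lectures) of why Q-learning is off-policy: the behavior policy below is epsilon-greedy, but the update target bootstraps from max_a Q(s', a), i.e. the greedy target policy.

from collections import defaultdict
import random

Q = defaultdict(float)  # Q[(state, action)] -> estimated value

def epsilon_greedy_action(state, n_actions, epsilon=0.1):
    # Behavior policy: mostly greedy, occasionally random.
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])

def q_learning_update(s, a, r, s_next, n_actions, alpha=0.1, gamma=0.99):
    # Target policy: purely greedy (the max), regardless of which action
    # the behavior policy actually takes next -- hence "off-policy".
    best_next = max(Q[(s_next, a2)] for a2 in range(n_actions))
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])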

7
Q

Q: Only non-linear function approximation has been proven to converge when used with the right hyper-parameters.

A

A: False. Linear function approximation has been proven to converge (under appropriate conditions); non-linear function approximation has not.

8
Q

Q: POMDPs are partially observable because they are missing the MDP. An example of this is model-free reinforcement learning problems.

A

A: False. POMDPs are partially observable because states are not mapped one-to-one to observations; the observed environment does not uniquely determine the underlying state.

9
Q

Q: The Grim Trigger strategy means a player will cooperate for the entire game regardless of the other player’s actions.

A

A: False. Grim Trigger cooperates until the other player defects, then defects forever.

10
Q

Q: Bayesian RL is a model-based approach that relies heavily on statistical methods such as Bayes’ rule.

A

A: False, Bayesian RL is not necessarily model-based.

11
Q

Q: DEC-POMDPs are a modeling framework for cooperative problems under uncertainty.

A

A: True. In a DEC-POMDP all agents share a single joint reward function, so the framework models cooperative decision making under uncertainty (partial observability).

12
Q

Q: Model-based reinforcement learning agents can solve environments with continuous state variables because they are able to learn the transition and reward function.

A

A: False. Handling continuous state variables requires something like function approximation; learning the transition and reward functions does not by itself imply the ability to model continuous state variables.

13
Q

Q: TD(1) is equivalent to a K-Step estimator with K = 1.

A

A: False. TD(0) is equivalent to a K-step estimator with K = 1; TD(1) is equivalent to a K-step estimator with K = ∞ (the full Monte Carlo return).
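A minimal sketch (my own illustration, with made-up variable names) of a K-step return computed from a recorded episode, to make the correspondence concrete:

def k_step_return(rewards, values, t, k, gamma=0.9):
    # G_t^(k) = r_{t+1} + gamma*r_{t+2} + ... + gamma^(k-1)*r_{t+k} + gamma^k * V(s_{t+k})
    # rewards[i] is the reward received on step i; values[i] is the estimate V(s_i).
    T = len(rewards)
    G, discount = 0.0, 1.0
    for i in range(t, min(t + k, T)):
        G += discount * rewards[i]
        discount *= gamma
    if t + k < T:  # bootstrap from the value estimate only if the episode continues
        G += discount * values[t + k]
    return G

# k = 1 gives the TD(0) target; k >= the remaining episode length gives the
# Monte Carlo / TD(1) target (no bootstrapping).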

14
Q

Q: Potential based reward shaping is used to indirectly shape the optimal policy by modifying rewards.

A

A: False. Potential-based reward shaping (by design) can improve learning efficiency but does not change the optimal policy. In Q-learning it is equivalent to starting from a good initialization of the Q-function.
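For reference, the potential-based shaping form from the Ng, Harada & Russell policy-invariance result, written as a LaTeX sketch with Φ an arbitrary potential function over states:

F(s, s') = \gamma \, \Phi(s') - \Phi(s), \qquad R'(s, a, s') = R(s, a, s') + F(s, s')

Any shaping reward of this form leaves the set of optimal policies unchanged.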

15
Q

Q: An MDP reward function can be scaled, shifted by a constant, or augmented with non-linear potential-based rewards without changing the optimal policy.

A

A: False. The scale factor must be positive for it not to change the optimal policy.

16
Q

Q: When exploring deterministic MDPs using the mistake bounded optimal algorithm, we assume any unknown state-action pair has a reward self loop of Rmax (equal to the largest reward seen so far) to ensure that every state-action pair is eventually explored.

A

A: False. While exploring the MDP we could become stuck in a strongly connected component of the MDP graph that does not include all of the states (i.e., some parts of the MDP may not be reachable given prior actions), so not every state-action pair is guaranteed to be explored.

17
Q

Q: Policy Search continuously updates the policy directly via a value update. This update is based on the value which you receive.

A

A: False. Policy search updates the policy directly via policy updates (e.g., policy-gradient steps), not via a value update.

18
Q

Q: Following a plan and constantly checking if the action was successful (and changing the plan if it was not) is called conditional planning.

A

A: False. This is dynamic re-planning. A conditional plan instead builds ‘if/else’ branches into the plan itself.

19
Q

Q: V(s) can be expressed from Q(s,a) and vice versa

A

A: True.
V(s) = max_a Q(s,a)
Q(s,a) = R(s,a) + γ Σ_{s'} T(s,a,s') V(s')
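A minimal tabular sketch of the two conversions, assuming illustrative dictionary structures Q[s][a], T[s][a][s'] (transition probabilities), and R[s][a] (none of these names come from the course):

def v_from_q(Q):
    # V(s) = max_a Q(s, a)
    return {s: max(q_sa.values()) for s, q_sa in Q.items()}

def q_from_v(V, T, R, gamma=0.9):
    # Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') * V(s')
    return {s: {a: R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
                for a in T[s]}
            for s in T}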

20
Q

Q: With potential-based learning, the agent receives higher rewards when it’s closer to the positive terminal state.

A

A: False. In potential-based shaping, the environment designer adds extra rewards to states in order to guide the agent toward the desired terminal state, but those bonuses are deducted again when the agent leaves those states; being closer to the positive terminal state does not by itself mean receiving higher reward.

21
Q

Q: Current eligibility traces of past events are used with current TD errors to compute updates for TD(λ) backward view.

A

A: True. (By contrast, the forward view of TD(λ) uses future rewards and states to compute its updates.)

22
Q

Q: TD(1) gives a maximum likelihood estimate.

A

A: False. TD(0) gives the maximum likelihood estimate; TD(1) gives the Monte Carlo estimate, which is the minimum mean-squared-error estimate of the observed returns.

23
Q

Q: Temporal difference learning falls into the model-based learning category.

A

A: False. TD is a class of model-free, value-based RL techniques; it builds up value estimates incrementally from sampled transitions.

24
Q

Q: Experience / sample complexity relates to how much data is needed to converge on the answer.

A

A: True. This is one of the criteria for evaluating a learner (along with time and space complexity).

25
Q

Q: Q-Learning is on-policy because it might not use the selected action a_t to update the Q-values.

A

A: False. That is exactly why it is off-policy: the update target uses the greedy action rather than the action the agent actually selects next.

26
Q

Q: Potential-based shaping is equivalent to modifying initial Q-values. That is, the Q-values are the same.

A

A: False. The learned policy will be the same, but the actual Q-values can differ.

27
Q

Q: Markov games are a type of MDP.

A

A: False. It is the other way around: MDPs are a subset of Markov games (a Markov game with a single agent is an MDP).

28
Q

Q: Coco is defined as cooperative-competitive values, and this solution concept is aimed at general-sum games.

A

A: True. It adds the concept of sharing via side payments, and the resulting solution is Pareto optimal.

29
Q

“Subgame perfect” means that every stage of a multistage game has a Nash equilibrium.

A

A Nash equilibrium is said to be subgame perfect if and only if it is a Nash equilibrium in every subgame of the game.

30
Q

The “folk theorem” states that the notion of threats can stabilize payoff profiles in one-shot games.

A

False. The “folk theorem” states that the notion of threats can stabilize payoff profiles in repeated games, not one-shot games.

31
Q

If following the repeated game strategy “Pavlov”, we will cooperate until our opponent defects. Once an opponent defects we defect forever.

A

False. This is actually describing Grim Trigger. Pavlov cooperates when both players made the same choice on the previous round and defects when they disagreed (win-stay, lose-shift).

32
Q

DEC-POMDPs include communication; this communication allows agents to plan.

A

In a DEC-POMDP, a group of agents controls the environment jointly; each agent receives its own partial observation, and the agents plan to optimize a single shared reward function.

33
Q

Policy shaping requires a completely correct oracle to give the RL agent advice.

A

False. The paper from the readings shows that human advice can lead to better results than a perfect oracle, because humans are willing to try out multiple winning strategies. (Lesson 15)

34
Q

It is not always possible to convert a finite horizon MDP to an infinite horizon MDP.

A

False. You can always convert a finite horizon MDP to an infinite horizon one by adding a zero-reward self-transition (an absorbing loop) at the terminal state(s).

35
Q

In RL, recent moves influence outcomes more than moves further in the past.

A

False. In a Markov process only the current state matters, not the history of moves, so recent moves do not necessarily influence the outcome more than earlier ones. (Some RL techniques, such as eligibility traces, do assign more credit to recent moves given an outcome.) The Markov property says the future depends only on the current state, and the current state already summarizes all relevant history.

36
Q

An MDP given a fixed policy is a Markov chain with rewards.

Now how about the opposite?

A

True. With the policy fixed there are no decisions left to make, so the MDP reduces to a Markov chain with rewards (a Markov reward process). Conversely, a Markov chain with rewards can be viewed as an MDP with a single action available in each state.

Related claim: “If we know the optimal V values, we can get the optimal Q values only if we know the environment’s transition function/matrix.” That direction is true, since recovering Q from V requires the model.
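Concretely, fixing a (possibly stochastic) policy π collapses the MDP's transition and reward functions into those of a Markov reward process; as a LaTeX sketch in the notation used elsewhere in these cards:

P^{\pi}(s' \mid s) = \sum_{a} \pi(a \mid s)\, T(s, a, s'), \qquad R^{\pi}(s) = \sum_{a} \pi(a \mid s)\, R(s, a)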

37
Q

If we know the optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix.

A

False. V*(s) = max_a Q*(s,a), so no transition model is needed to recover the optimal V values from the optimal Q values.

38
Q

In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s reward (terminal and non-terminal) the optimal policy will not change.

A

False. Because the gridworld is episodic (it has terminal states), adding a constant to every reward is not just a harmless shift: with +10 added, the per-step reward becomes large and positive, so the agent prefers to keep collecting step rewards rather than reach a terminal state, and the optimal policy changes. A constant shift is only guaranteed to leave the optimal policy unchanged in settings without terminal states.

39
Q

Markov means RL agents are amnesiacs and forget everything up until the current state.

If a state encodes the information of previous states, is that not part of the state? Is there a disadvantage to this? How would your agent interact with this sort of deep state?

A

False. The Markov property just means that the next state depends only on the current state, not on the full history. The agent can still use information from prior states if that information is folded into the current state: the current state can implicitly accumulate everything from the past that was needed to reach it.

40
Q

A policy that is greedy with respect to the optimal value function is not necessarily an optimal policy.

A

False. The optimal value function already takes long-term rewards into account, so a policy that is greedy with respect to it is an optimal policy.

41
Q

In TD learning, the sum of the learning rates used must converge for the value function to converge.

A

False. The sum of the learning rates must diverge (be infinite), while the sum of their squares must be finite, to guarantee convergence of the value function.
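These are the standard stochastic-approximation (Robbins-Monro) conditions on the learning rates α_t, shown here as a LaTeX sketch; α_t = 1/t is a classic choice satisfying both:

\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty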

42
Q

Monte Carlo is an unbiased estimator of the value function compared to TD methods. Therefore, it is the preferred algorithm when doing RL with episodic tasks.

A

False. Although MC methods are unbiased estimators of the value function when compared to TD methods, TD methods have a significant advantage in that they are naturally implemented in an online, fully incremental fashion. With MC methods one must wait until the end of an episode, because only then is the return known, whereas with TD methods one need wait only one time step. Surprisingly often this turns out to be a critical consideration. Therefore, TD methods are often preferred over MC methods.

MC has high variance and zero bias; it has good convergence properties, even with function approximation; it is not very sensitive to the initial value; and it is very simple to understand and use.

TD has low variance and some bias; it is usually more efficient than MC; TD(0) converges to V^π(s), though not always with function approximation; and it is more sensitive to the initial value.

43
Q

Backward and forward TD(lambda) can be applied to the same problems.

A

False (arguably). In the forward view we need the rest of the trajectory to do the computation, so it may not be feasible online, whereas in the backward view, using eligibility traces, we can make the updates incrementally.

Assuming complete data, the two views are equivalent.
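A minimal backward-view TD(λ) sketch for tabular state-value prediction (my own illustration; the variable names are made up). The eligibility traces let each new TD error update all recently visited states online:

from collections import defaultdict

def td_lambda_episode(transitions, V, alpha=0.1, gamma=0.99, lam=0.8):
    # transitions: list of (s, r, s_next, done) tuples from one episode
    # V: dict-like mapping state -> value estimate (e.g., defaultdict(float))
    e = defaultdict(float)                       # eligibility trace per state
    for s, r, s_next, done in transitions:
        target = r if done else r + gamma * V[s_next]
        delta = target - V[s]                    # current TD error
        e[s] += 1.0                              # accumulating trace for the visited state
        for state in list(e):
            V[state] += alpha * delta * e[state]
            e[state] *= gamma * lam              # decay every trace
    return V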

44
Q

Offline algorithms are generally superior to online algorithms.

A

False. For very short episodic problems with deterministic policies the claim might be justifiable, but for long episodes with complex state spaces and stochastic policies an online algorithm is likely to arrive at a good policy in far fewer episodes.

45
Q

Given a model (T,R) we can also sample in, we should first try TD learning.

A

False. Since we already have the model, it is preferable to first try a planning method such as value iteration or policy iteration, which is faster and does not require sampling.

46
Q

TD(1) slowly propagates information, so it does better in the repeated presentations regime rather than with single presentations.

A

False. TD(1), for starters, propagates information very quickly; TD(0) propagates it the slowest. Empirically (per Sutton's results), TD(0) outperforms TD(1) in the repeated presentations regime, because repeated presentations accommodate slow but accurate propagation of reward information; TD(0) would perform worse in the single presentation regime.

With a single presentation of the data set, the error depends heavily on the learning rate and λ, and climbs for the highest values of λ (i.e., toward 1.0).

47
Q

In TD(lambda), we should see the same general curve for best learning rate (lowest error), regardless of lambda value.

So, what does the error rate look like at TD(0) and TD(1)? What about in between them?

A

Each λ shows the same general trend (starting at the same initial error, dipping a bit, reaching a minimum, and going back up), but, as Ameet noted, the curves differ significantly in how extreme these features are, so it depends on what “the same general curve” means.

For λ around 0.3 the error curve is concave (boat-shaped), whereas for λ of 0.8 or 1.0 it looks like a hockey stick or check mark: the error climbs much more sharply at higher values of the learning rate α.

48
Q

In general, an update rule which is not a non-expansion will not converge.

A

False. Update rules that are non-expansions are guaranteed to converge, but convergence is still possible for update rules that are not non-expansions (Coco-Q is an example).

Additionally, a non-expansion is not necessarily a contraction.

49
Q

MDPs are a type of Markov game.

A

True. Markov games are a generalization of MDPs; an MDP is a single-agent Markov game.

50
Q

Linear programming is the only way we are able to solve MDPs in linear time.

A

False. LP is not solvable in linear time (and neither are PI, VI, or Q-learning).

See Littman, Dean & Kaelbling (1995), “On the Complexity of Solving Markov Decision Problems.”

51
Q

The objective of the dual LP presented in lecture is minimization of “policy flow”. (The minimization is because we are aiming to find an upper bound on “policy flow”.)

A

False. We are trying to maximize the “policy flow.” By LP duality, maximizing the dual and minimizing the primal give the same optimal value.

52
Q

Any optimal policy found with reward shaping is the optimal policy for the original MDP.

A

True. If implemented correctly (i.e., potential-based), reward shaping does not change the optimal policy.

53
Q

Potential-based shaping will find an optimal policy faster than an unshaped MDP.

A

False. Well-designed potential-based shaping may help an algorithm converge faster, but there is no guarantee of that.

54
Q

Rmax will always find the optimal policy for a properly tuned learning function.

A

True. Assuming the reward is bounded (and the bound is known), Rmax encourages the learner to explore wherever a better reward may still be discovered; once the learner has already explored the best rewards, the greedy policy is already optimal.

55
Q

Q-learning converges only under certain exploration decay conditions.

A

True. Q-learning is guaranteed to converge only under the assumption that every state-action pair is explored enough (visited infinitely often) and that the learning rates are decayed appropriately.

56
Q

The tradeoff between exploration and exploitation is not applicable to finite bandit domains since we are able to sample all options.

A

False. You cannot simply explore forever, because accurately gauging the reward of a bandit arm takes unboundedly many samples; with a finite horizon you must explore some, then exploit as much as possible before the game ends.
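A minimal ε-greedy sketch on a finite-horizon bandit (my own illustration; arm_means and the Gaussian reward noise are assumptions, not from the course) showing the explore/exploit trade-off under a limited budget of pulls:

import random

def run_bandit(arm_means, n_pulls=1000, epsilon=0.1):
    n_arms = len(arm_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    total = 0.0
    for _ in range(n_pulls):
        if random.random() < epsilon:
            a = random.randrange(n_arms)                        # explore
        else:
            a = max(range(n_arms), key=lambda i: estimates[i])  # exploit
        r = random.gauss(arm_means[a], 1.0)                     # noisy reward
        counts[a] += 1
        estimates[a] += (r - estimates[a]) / counts[a]          # incremental mean
        total += r
    return total

# e.g. run_bandit([0.1, 0.5, 0.9])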

57
Q

Since SARSA is an “online” algorithm, it will never learn the optimal policy for an environment.

A

False. If the entire state and action space is visited repeatedly (which can be arranged with an appropriate exploration/exploitation schedule), SARSA can converge to the optimal policy.

It depends on the data the agent is presented with, but learning the optimal policy for an environment is certainly possible; never say never.
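For contrast with the Q-learning sketch earlier in these cards, a minimal on-policy SARSA update (illustrative names only): the target uses the next action a_next actually chosen by the behavior policy, rather than a max over actions.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # Q: dict-like mapping (state, action) -> value, e.g. defaultdict(float)
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
    return Q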

58
Q

RL with linear function approximation only works on toy problems.

A

False. Linear function approximation may fail on some problems (you can construct examples that make it blow up), but that is not always the case, and saying it works “only” on toy problems is too strong.

Non-linear approximators (deep neural networks) are also used heavily in theory and practice to approximate complex value functions.

59
Q

KWIK algorithms are an improvement of MB (mistake bound) algorithms and as such will have same or better time complexity for all classes of problems.

A

False. There are certain problem classes for which KWIK learning is harder than mistake-bound learning, such as learning boolean conjunctions.

60
Q

With a classic update using linear function approximation, we will always converge to some values, but they may not be optimal.

A

False. Classic updates with linear function approximation are not guaranteed to converge; for example, off-policy TD with linear features can diverge (Baird's counterexample).

61
Q

Applying generalization with an “averager” on an MDP results in another MDP.

A

True. An averager interpolates values from a fixed set of anchor points with fixed non-negative weights, so the generalized problem can itself be written as another MDP over those anchor states.

62
Q

Problems that can be represented as POMDPs cannot be represented as MDPs.

A

False. Any MDP can be represented as a POMDP (with fully informative observations), and such problems can obviously also be represented as MDPs.

The question is a bit tricky because there are POMDPs that cannot be represented as ordinary, fully observable MDPs, but that is not always the case as the statement implies.

63
Q

The only algorithms that work in POMDPs are planning algorithms.

A

False. The definition of “work” is a bit loose here; learning algorithms may still work, if inefficiently.

The lectures also demonstrate value iteration for POMDPs.

64
Q

Options over an MDP form another MDP.

A

True. Options can be treated as temporally abstract actions, so an MDP with options forms a semi-MDP, which can in turn be treated as an MDP over its decision points.

65
Q

Nash equilibria can only be pure, not mixed.

A

False. Nash equilibria can be mixed; in a zero-sum game the Nash equilibrium coincides with the minimax strategy, which may be a mixed strategy.

Rock-Paper-Scissors is a counterexample: its only Nash equilibrium is the mixed strategy that plays each action with probability 1/3.

66
Q

An optimal pure strategy does not necessarily exist for a two-player, zero-sum finite deterministic game with perfect information.

A

False. An optimal pure strategy always exists: in a two-player, zero-sum, finite, deterministic game with perfect information, minimax = maximin, so there is always an optimal pure strategy.

Rock-Paper-Scissors is not a counterexample: it is an imperfect-information (simultaneous-move) game, and it is not deterministic if your opponent uses a randomized policy.

67
Q

Does an MDP have a gamma?

A

.

68
Q

The value of the returned policy is the only metric we care about when evaluating a learner.

A

False. There are more things to consider: time, space, and sample efficiency, as noted elsewhere in these cards, are all important considerations.

69
Q

Contraction mappings and non-expansions are concepts used to prove the convergence of RL algorithms, but are otherwise unrelated concepts.

A

False. They are strongly related: many RL convergence proofs rely on the Bellman operator being a contraction mapping.

Every contraction mapping is a non-expansion, though the converse does not hold.