Review Questions Flashcards

1
Q

True or False: A policy that is greedy – with respect to the optimal value function – is not necessarily an optimal policy. Why?

A

False. Behaving greedily with respect to the optimal value function is an optimal policy, because that value function already encodes the long-term cumulative reward; one-step greedy lookahead on it therefore selects optimal actions.

2
Q

True or False: The optimal policy for any MDP can be found in polynomial time. Why?

A

True for any finite MDP: we can reduce the MDP to a linear program and solve it in polynomial time (in the number of states and actions).

False for MDPs with infinite state or action spaces.

3
Q

True or False: The value of the returned policy is the only way to evaluate a learner. Why?

A

False. You might also care about computational efficiency, memory efficiency, or sample efficiency.

4
Q

True or False: If we know the optimal Q values, we can get the optimal V values only if we know the environment’s transition function/matrix. Why?

A

False. V*(s) = max_a Q*(s,a); the maximization is over actions only, so no knowledge of the transition function is needed.

5
Q

True or False: It is not always possible to convert a finite horizon MDP to an infinite horizon MDP. Why?

A

False. We can always convert a finite horizon MDP to an infinite horizon MDP by adding a zero-reward self-loop at the terminal (absorbing) state.

6
Q

True or False: An MDP given a fixed policy is a Markov chain with rewards. Why?

A

True. A fixed policy means the same action selection rule is applied in every state, so the transitions depend only on the current state; the MDP collapses to a Markov chain with rewards (a Markov reward process).

7
Q

True or False: In the gridworld MDP in “Smoov and Curly’s Bogus Journey”, if we add 10 to each state’s reward (terminal and non-terminal) the optimal policy will not change. Why?

A

True. Adding 10 to every reward does not change the relative (differential) rewards across states, so the ranking of actions, and therefore the optimal policy, is unchanged.

8
Q

True or False: In RL, recent moves influence outcomes more than moves further in the past. Why?

A

False. It depends on the discount factor: with gamma close to 1, moves far in the past can matter nearly as much as recent ones.

9
Q

True or False: The Markov property means RL agents are amnesiacs and forget everything up until the current state. Why?

A

True. The Markov property means the future is conditionally independent of the past given the present state, so the current state is all the agent needs to retain.

10
Q

Does policy iteration or value iteration converge faster? If so, why?

A

Policy iteration is guaranteed to converge in no more iterations than value iteration, although each of its iterations is more expensive because it includes a full policy evaluation.

11
Q

Is it possible that value iteration just never converges on your formulation of MDP with a proper convergence criterion?

A

No. With a proper convergence criterion (and gamma < 1), value iteration is guaranteed to converge, because the Bellman backup is a contraction mapping.

12
Q

What is the meaning of the Bellman equations? What are they saying? What is your way of visualizing this equation in your head?

A

The long-term return for an action is the sum of the immediate reward and the discounted expected value of what follows; each value is defined recursively in terms of the values one step ahead.

13
Q

What are the Bellman equations? (Try to write them down without looking at notes, since this is the single most important equation to know for this class.)

A

U(s) = R(s) + gamma * max_a [ Sum_s' T(s,a,s') * U(s') ]
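For concreteness, here is a minimal Python sketch of the single backup this equation describes. The containers T, R, actions, and U are hypothetical stand-ins (T[(s, a)] maps next states to probabilities, R[s] is the state reward), not any particular library's API.

def bellman_backup(s, U, T, R, actions, gamma=0.95):
    # U(s) = R(s) + gamma * max_a sum_s' T(s, a, s') * U(s')
    best = max(
        sum(p * U[s2] for s2, p in T[(s, a)].items())
        for a in actions
    )
    return R[s] + gamma * best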

14
Q

What is a policy?

A

A policy is a mapping from states to actions (or to a distribution over actions): it specifies which action to choose in each state.

15
Q

What is Value Iteration?

A

Value iteration is an iterative algorithm for computing the optimal value function: it repeatedly applies the Bellman optimality backup to the current value estimates, and the policy is extracted from the converged values. This contrasts with policy iteration, which alternates full policy evaluation with direct policy improvement.
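A minimal value-iteration sketch under the same assumed containers as the Bellman-backup example above (states, actions, T[(s, a)] -> {next state: probability}, R[s]); it is meant to illustrate the loop, not to be a definitive implementation.

def value_iteration(states, actions, T, R, gamma=0.95, tol=1e-6):
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            backup = R[s] + gamma * max(
                sum(p * U[s2] for s2, p in T[(s, a)].items())
                for a in actions
            )
            delta = max(delta, abs(backup - U[s]))
            U[s] = backup
        if delta < tol:              # stop when values change very little
            break
    # Extract the greedy policy from the converged values.
    pi = {
        s: max(actions, key=lambda a: sum(p * U[s2] for s2, p in T[(s, a)].items()))
        for s in states
    }
    return U, pi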

16
Q

What is the Markov property?

A

The Markov property is a memorylessness property: the future depends only on the present state, not on the history of how we got there.

17
Q

Is it possible to have multiple terminal states in an MDP?

A

There is no restriction for an MDP to have multiple terminal states, although they can be abstracted into a single terminating state.

18
Q

What is an MDP? What are the components of an MDP?

A

A Markov decision process is a framework for sequential decision making under uncertainty. Its components are states, actions, and a model (transition function and reward function); the policy is the solution we compute for it.

19
Q

True or False: Contraction mappings and non-expansions are concepts used to prove the convergence of RL algorithms, but are otherwise unrelated concepts. Why?

A

False. A non-expansion is the general case of a contraction mapping: a contraction satisfies |F(a) - F(b)| <= gamma * |a - b| with gamma strictly less than 1, while a non-expansion allows gamma = 1, so distances may shrink or stay the same. The two concepts are therefore directly related.

20
Q

True or False: An update rule which is not a non-expansion will not converge without exception. Why?

A

Strictly speaking, False (although it often holds in practice).

Non-expansion is a sufficient condition for convergence, not a necessary one, so an update rule that is not a non-expansion can still converge.
Coco-Q is a counterexample.

21
Q

True or False: In TD(λ), we should see the same general curve for best learning rate (lowest error), regardless of λ value. Why?

A

True. See Figure 4 in Sutton's 1988 TD(λ) paper: the curves have the same general shape for every λ, each with a best (lowest-error) learning rate at an intermediate alpha.

22
Q

True or False: TD(1) slowly propagates information, so it does better in the repeated presentations regime rather than with single presentations. Why?

A

False.
TD(1) is equivalent to Monte Carlo learning and propagates information all the way down the "chain" on every presentation, so it is not slow to propagate.
Its weakness is high variance; it is TD(0), which propagates information only one step per presentation, that benefits most from repeated presentations.

23
Q

True or False: Given a model (T,R) we can also sample in, we should first try TD learning. Why?

A

False.

While TD learning was revolutionary and is used in many algorithms today, value and policy iteration have better convergence properties and should be tried first when the model (T, R) is available.

24
Q

True or False: Offline algorithms are generally superior to online algorithms. Why?

A

False.

Online algorithms update values as soon as new information is presented and can make very efficient use of experience as it arrives; neither family is superior in general, it depends on the setting.

25
Q

True or False: Backward and forward TD(λ) can be applied to the same problems. Why?

A

True.

Both backward and forward TD(λ) can be applied to the same problems and converge to the same values; the backward view is typically easier to compute because it can be implemented incrementally with eligibility traces.

26
Q

True or False: Monte Carlo is an unbiased estimator of the value function compared to TD methods. Therefore, it is the preferred algorithm when doing RL with episodic tasks. Why?

A

False. Even though MC is unbiased, it has higher variance, so TD methods may outperform it in learning performance; MC can also require more samples to converge to the correct values.

27
Q

True or False: In TD learning, the sum of the learning rates used must converge for the value function to converge. Why?

A

False. The sum of the learning rates must diverge, while the sum of the squares of the learning rates must converge (sum_t alpha_t = infinity, sum_t alpha_t^2 < infinity).

28
Q

What are some advantages of TD methods compared to dynamic programming methods? to Monte Carlo methods?

A

Compared to DP: TD is model-free, learning directly from sampled experience without a transition or reward model.

Compared to MC: lower variance, and it is an online, incremental method that can learn from incomplete episodes.
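As an illustration of those properties, here is a small TD(0) prediction sketch: model-free, online, and incremental. The env object (with reset() and step() returning next state, reward, done) and the policy function are hypothetical interfaces, not a specific library.

from collections import defaultdict

def td0_prediction(env, policy, episodes=1000, alpha=0.1, gamma=0.99):
    V = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])   # move V(s) toward the one-step target
            s = s_next
    return V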

29
Q

What can we say about the bias-variance trade-off between a 1-step estimator versus a Monte Carlo estimator (in other words, which estimator has a higher variance and which estimator has a higher bias?)

A

MC: high variance, zero bias. Good convergence properties. Not sensitive to the initial value. Simple to understand and use. Effective in non-Markov environments.

TD (1-step): low variance, but biased (it bootstraps from current estimates). Usually more efficient than MC. TD(0) converges to the true V(s) with a table-lookup representation, but convergence is not guaranteed with function approximation. More sensitive to the initial value. Exploits the Markov property.

30
Q

What is the value of the 5-step estimator of the terminal state? What about other n-step estimators of the terminal state?

A

The value of the terminal state is always 0, for the 5-step estimator and for every other n-step estimator, since no further reward can be accumulated after termination.

31
Q

What is a Monte-Carlo return?

A

A Monte-Carlo return is the total (discounted) reward observed from a state to the end of the episode; the value of a state is the expected return (Sutton and Barto, p. 92).

32
Q

What is the prediction problem?

A

‘Using past experience with an incompletely known system to predict its future behavior’

33
Q

What are model-free methods?

A

In model-free methods, we do not learn the transition probabilities and rewards explicitly; we learn the value function and/or the optimal policy directly from experience instead.

34
Q

What is TD? In which algorithms is TD used?

A

TD is short for temporal difference: a value-based method that updates estimates from the difference between successive predictions, bootstrapping from sampled experience. It is used in TD(λ), Sarsa, and Q-learning, among other algorithms.

35
Q

In Sutton 88, why are all the curves in figure 4 convex?

A

It indicates that, for a given lambda, the error is minimized at some intermediate alpha. If alpha is too low, the weights update too slowly and learning is incomplete within the available training; if alpha is too high, the updates overshoot and the estimates become noisy (high variance). Both extremes increase error, producing the U-shaped (convex) curves.

36
Q

In Sutton 88, if there are way more data than 100 sequences for each training set in experiment one, what kind of graph of figure 3 would you expect? Why?

A

The shape of the curve should be similar, but the error values should all decrease because the impact of the random noise goes down as the number of sequences increases

37
Q

In Sutton 88, why doesn’t TD(0) do well in figure 5?

A

TD(0) propagates information only one step backward per presentation, so it needs repeated presentations to fully train. In Figure 5, each training set is presented only once, which does not allow TD(0) to converge to good predictions.

38
Q

In Sutton 88, experiment 1, what is the convergence criterion? What effect does it have on the TD algorithm?

A

The weight vector was not updated after each sequence; weight changes were accumulated and applied only after a complete presentation of the training set, and training sets were presented repeatedly until the weights no longer changed significantly. This was likely done to account for the stochastic nature of the problem and to allow convergence with a small number of sequences and training sets.

39
Q

In Sutton 88, is the TD algorithm used in the paper the same as the TD we saw in class?

A

I think the version from class was applied to the Bellman equation, so it has reward terms as well as value terms in the update. The TD rule in the Sutton paper is simpler: it updates predictions toward the next prediction, with the outcome appearing only at the end of each sequence.

40
Q

In Sutton 88, why does TD(0) do best in figure 3 of Prof. Sutton’s paper (you can use the intuition from the above question)?

A

In the linear case, TD(0) minimizes the error for future experience and converges to the ML estimate of the underlying Markov process.

41
Q

Intuitively, why would TD(0) converge to “the maximum likelihood estimate of the underlying MDP”?

A

With repeated presentations of a finite data set, TD(0) updates each state's value toward the one-step target (the immediate reward r plus the discounted value of the next state, V(s')). Each possible next state contributes to the updates in proportion to how often that transition appears in the data, i.e., in proportion to its empirical transition probability. At convergence, a state's value is therefore the expected one-step target under those empirical frequencies. The maximum-likelihood model of the Markov process is built from the same empirical frequencies, so TD(0) converges to the value function of the maximum-likelihood estimate of the underlying MDP.

42
Q

What is an eligibility trace?

A

A temporary record of the occurrence of an event (e.g., visiting a state) that decays over time. In the backward view of TD(λ), the trace for a state is incremented when the state is visited and decays by gamma*lambda each step (compare the exponentially decaying weights (1-λ)λ^(n-1) on the n-step returns in the forward view).
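A backward-view TD(λ) sketch showing how such a trace is maintained: the trace for a state is bumped when the state is visited, and every trace decays by gamma*lambda each step. The env and policy interfaces are the same hypothetical ones as in the TD(0) sketch above.

from collections import defaultdict

def td_lambda(env, policy, episodes=1000, alpha=0.1, gamma=0.99, lam=0.8):
    V = defaultdict(float)
    for _ in range(episodes):
        e = defaultdict(float)              # eligibility traces, reset each episode
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
            e[s] += 1.0                     # accumulating trace for the visited state
            for x in list(e):
                V[x] += alpha * delta * e[x]
                e[x] *= gamma * lam         # decay every trace
            s = s_next
    return V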

43
Q

True or False: The trade-off between exploration and exploitation is not applicable to finite bandit domains since we are able to sample all options. Why?

A

False. Exploration is still needed because the estimated values are noisy: sampling every arm once (or a few times) does not reliably identify the best arm.

44
Q

What are the benefits of an off-policy algorithm?

A

Off-policy learning allows the agent to evaluate and improve a policy different from the one used to select actions (target policy != behavior policy).

This allows continuous exploration, learning from demonstration, and parallel learning.

45
Q

Is it possible that Q-learning doesn’t converge? Is it possible that the point it converges to is not optimal?

A

Neither is possible under the standard conditions (every state-action pair is updated infinitely often and the learning rates decay appropriately): Q-learning provably converges, and it converges to Q*. It may, however, take an intractably long time.

46
Q

Why can you find an optimal policy despite finding sub-optimal Q-values (e.g. via Q-learning)?

A

If the learned Q is close enough to Q*, then the greedy policy derived from it is already optimal. This is because the policy contains less information than the Q-function: only the ranking of actions in each state matters, so small errors in the values often leave the argmax, and hence the policy, unchanged.

47
Q

True or False: Policy shaping requires a completely correct oracle to give the RL agent advice. Why?

A

False. Policy shaping needs a certain degree of confidence, but a completely correct oracle is not necessary (it would help though).

48
Q

True or False: Since Sarsa is an “online” algorithm, it will never learn the optimal policy for an environment. Why?

A

False. Sarsa uses new experience to update its Q-table and, provided exploration decays appropriately (e.g., GLIE conditions), it can learn the optimal policy.

49
Q

True or False: Rmax will always find the optimal policy for a properly tuned learning function. Why?

A

False. Rmax does not guarantee an optimal policy, but it can reach near-optimal behavior (with high probability, in a polynomial number of steps).

50
Q

True or False: Potential-based shaping will always find an optimal policy faster than an unshaped MDP. Why?

A

False. Potential-based shaping is not magic: it only redefines the rewards and does not always converge faster. But if you initialize Q(s,a) to all zeros, it can give the learner a head start.
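A tiny sketch of the potential-based shaping term discussed here: the shaped reward adds F(s, s') = gamma * phi(s') - phi(s) to the environment reward, where phi is a designer-chosen potential function (hypothetical here).

def shaped_reward(r, s, s_next, phi, gamma=0.99, done=False):
    # Conventionally the potential of a terminal successor is taken to be 0.
    phi_next = 0.0 if done else phi(s_next)
    return r + gamma * phi_next - phi(s)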

51
Q

True or False: Any optimal policy found with reward shaping is the optimal policy for the original MDP. Why?

A

From the lecture, reward shaping via scaling, shifting, or potential-based functions preserves the optimal policy. Since other reward-shaping techniques exist, the statement is False in general: when not careful, an agent can settle into a sub-optimal feedback loop with the environment, e.g., endlessly dribbling a soccer ball without shooting because reward was given for time spent dribbling.

52
Q

One perspective of Sarsa is that it looks a bit like policy iteration? Can you tell us which part of Sarsa is policy evaluation and which part is policy improvement?

A

Policy evaluation: updating the Q(S,A) value toward the observed target.

Policy improvement: choosing A' with the epsilon-greedy policy derived from the current Q.

53
Q

In an assignment document, it says Sarsa uses TD to do control. What exactly does it mean? Which part of Sarsa is TD?

A

Sarsa uses the TD error to update the Q-function, just as Q-learning does; that update is the TD part. 'Control' means it learns action values in order to improve the policy, not just to predict state values.

54
Q

Why is Sarsa on policy? How does this compare to an off-policy method like Q-learning by following a random policy to generate data?

A

SARSA is on-policy because it actually takes the next action A' in S' chosen by its epsilon-greedy behavior policy and uses that same action in the update target. Q-learning takes action A, observes R and S', and updates toward the maximum Q-value over all actions in S', regardless of which action it actually takes next.

SARSA: Q(S,A) <- Q(S,A) + alpha * [R_{t+1} + gamma * Q(S',A') - Q(S,A)]

Q-learning: Q(S,A) <- Q(S,A) + alpha * [R_{t+1} + gamma * max_a Q(S',a) - Q(S,A)]
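A side-by-side sketch of the two update rules above; Q is assumed to be a dict-like table keyed by (state, action), e.g., a defaultdict(float).

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: the target uses the action A' the behavior policy actually chose.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # Off-policy: the target uses the greedy value in S', whatever is taken next.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])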

55
Q

In HW3, we saw Sarsa is a ‘control’ algorithm? What is a ‘control’ task? Are any TD algorithms a ‘control’ algorithm?

A

A 'control' task is one where we learn how to act: the algorithm estimates the action-value function Q(s,a) and uses it to improve the policy. TD can be used for either prediction or control, so not all TD methods are control algorithms (TD(0) on state values is prediction; Sarsa and Q-learning are control).

56
Q

True or False: The only algorithms that work in POMDPs are planning algorithms. Why?

A

False. RL algorithms also work on POMDPs, e.g., by using memory or learning over belief states.

57
Q

True or False: Problems that can be represented as POMDPs cannot be represented as MDPs. Why?

A

False. An MDP is a special kind of POMDP (one whose state is fully observable), and a POMDP can in turn be represented as an MDP over belief states.

58
Q

True or False: Applying generalization with an “averager” on an MDP results in another MDP. Why?

A

True. Applying an averager to an MDP yields another (smaller) MDP, so planning and convergence results carry over.

59
Q

True or False: With a classic update using linear function approximation, we will always converge to some values, but they may not be optimal. Why?

A

False. It may not even converge; Baird's counterexample shows divergence of the classic (off-policy) update with linear function approximation.

60
Q

True or False: RL with linear function approximation will not work on environments having a continuous state space. Why?

A

False. Linear function approximation can work with continuous state spaces: the approximation is linear in the weights, not in the raw state, so suitable features (e.g., tile coding or radial basis functions) let it represent useful value functions over continuous states.

61
Q

Let’s say you want to use Q-learning with some function approximator. Recall that we learned a convergence theorem and we used that to conclude that Q-learning converges. Can we apply that theorem to prove that your Q-learning with some function approximator converges? Why or why not?

A

No. The convergence theorem from class assumes a tabular (non-expansion) update, so it cannot be applied once a function approximator is added; the combination can even diverge, as we saw with DQN in Project 2.

62
Q

Let’s say you want to use a function approximator like we learned in class. What function(s) are you approximating? What’s the input of that function and what’s the output of that function?

A

We approximate the action-value function Q. Its input is a state-action pair (or a feature vector describing it), and its output is a scalar estimate of the value of taking that action in that state.
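A minimal sketch of a linear action-value approximator, assuming a hypothetical featurize(s, a) function that returns a NumPy feature vector: the weights w parameterize Q, and a semi-gradient step moves the prediction toward a bootstrapped target.

import numpy as np

def q_value(w, s, a, featurize):
    return float(np.dot(w, featurize(s, a)))      # Q_hat(s, a) = w . phi(s, a)

def semi_gradient_update(w, s, a, target, featurize, alpha=0.01):
    phi = featurize(s, a)
    w += alpha * (target - np.dot(w, phi)) * phi  # move prediction toward the target
    return w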

63
Q

We learned about reward shaping in class. Could it be useful for solving Lunar Lander? If so, why and how?

A

Reward shaping could be useful: the state space is continuous and the agent is being trained to reach a particular landing zone, so I think a shaping signal that rewards progress toward the pad (e.g., a potential based on distance) would accelerate learning.

64
Q

Observe that the biggest difference between P2’s Lunar Lander problem and HW4’s Taxi problem is that there are infinitely many states in Lunar Lander. What are some good methods to handle this case? What are their pros and cons?

A

DQN is very useful for handling a continuous, high-dimensional state space such as Lunar Lander's. Other options include model-based methods such as DreamerV2, imitation learning, and policy-gradient algorithms such as REINFORCE, PPO, A2C, and SAC. These methods can achieve strong results, but they can be difficult to train because the underlying optimization is non-convex (and often hyperparameter-sensitive).

65
Q

Using the coco value, we calculate the side payments as: P = coco(u, u_bar). But why does u getting P amount of side payment make sense?

A

It makes sense because, with the side payment, neither party loses more by cooperating than it could guarantee by not cooperating. The coco value combines the best joint (cooperative) payoff with the maxmin of the competitive part (the most one agent can guarantee against the other), and the side payment transfers exactly that competitive correction so that cooperating is worthwhile for both. The side payments are zero-sum: if one agent pays one (-1), the other receives one (+1).
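For reference, a sketch of the coco decomposition as I recall it from the Coco-Q material, with U and \bar{U} the two players' payoff matrices (treat the exact notation as my reconstruction):

\operatorname{coco}(U, \bar{U}) = \max\max\left(\frac{U + \bar{U}}{2}\right) + \operatorname{minimax}\left(\frac{U - \bar{U}}{2}\right)

The first term is the shared cooperative value; the second is the competitive advantage, which the side payment transfers between the players.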

66
Q

True or False: “Sub-game perfect” means that every stage of a multistage game has a Nash equilibrium. Why?

A

False. Sub-game perfection is a property of a strategy profile: the profile must remain a best response (a Nash equilibrium) in every subgame, i.e., no matter what history has been reached. It is not the claim that every stage of the multistage game has its own Nash equilibrium.

67
Q

True or False: If following the repeated game strategy “Pavlov”, we will cooperate until our opponent defects. Once an opponent defects we defect forever. Why?

A

False. That is the grim trigger strategy.

In Pavlov (win-stay, lose-shift), we cooperate whenever both players chose the same action in the previous round and defect when they differed: after the opponent defects we defect, and once the opponent defects as well (so both defected), we go back to cooperating.

68
Q

True or False: The “folk theorem” states that the notion of threats can stabilize payoff profiles in one-shot games. Why?

A

False. The folk theorem describes the set of payoff profiles that can result from Nash equilibrium strategies in a repeated game, where the threat of future punishment sustains them. It does not apply to one-shot games, which have no future rounds in which to carry out a threat.

69
Q

True or False: An optimal pure strategy does not necessarily exist for a two-player, zero-sum finite deterministic game with perfect information. Why?

A

False. In a two-player, zero-sum, finite deterministic game with perfect information, an optimal pure strategy always exists (it can be found by minimax / backward induction).

70
Q

True or False: The objective of the dual LP presented in lecture is minimization of “policy flow” (the minimization is because we are aiming to find an upper bound on “policy flow”). Why?

A

False. The objective of the dual LP is to maximize the reward carried by the "policy flow", subject to the flow-conservation constraints.

71
Q

True or False: LP is the only way to efficiently solve an MDP for its optimal policy. Why?

A

True, as presented in lecture: LP is the only approach known to guarantee finding the optimal policy in worst-case polynomial time; value iteration and policy iteration work well in practice but lack that worst-case guarantee.

72
Q

True or False: MDPs are a type of Markov game. Why?

A

True. An MDP can be viewed as a single-agent Markov game, or as a multi-agent Markov game in which only one agent's actions affect the rewards and transitions.

73
Q

True or False: KWIK algorithms are an improvement of MB (mistake bound) algorithms and as such will have same or better time complexity for all classes of problems. Why?

A

False. The class of singletons H can be learned in the mistake-bound model with a mistake bound of only 1, while in the worst case a KWIK learner can take 2^n - 1 examples, since it does not know the answer until it has seen its first positive example (https://www.cs.cmu.edu/~avrim/Papers/kwikmb.pdf).

74
Q

What if instead of being given “fight_occurred” you are given “did peacemaker come?” and “did instigator come?” Can you create an KWIK algorithm that outputs fewer “don’t knows” than HW5?

A

If we are given two answers “did peacemaker come” and “did instigator come” I think we can apply Algorithm 6 from the KWIK paper. Since the size of each hypothesis class is n, the combined KWIK learner has a worst-case bound of (1-k) + sum(Hi) = 2n-1

75
Q

What’s the worst case number of “don’t know” answers in your algorithm for HW5?

A

n*(n-1) - 1

76
Q

KWIK is a learning framework like PAC (probably approximately correct) learning. Why do we even need a learning framework like PAC or KWIK?

A

To improve the efficiency of exploration: frameworks like PAC and KWIK let us quantify and bound how much experience (or how many "don't know" answers) a learner needs.

77
Q

What is the KWIK framework? What are the strengths of this framework and its algorithms?

A

KWIK stands for "Knows What It Knows". The framework is designed particularly for learning settings where active exploration affects which training examples the learner sees. It is also useful for anomaly detection, where the learner must reason about whether a newly presented input is predictable from previous examples.

78
Q

True or False: Options over an MDP form another MDP. Why?

A

True. Options over MDPs form SMDPs which can be converted to MDPs.

79
Q

Can we do Q learning on SMDP? If so, how do we do that?

A

Yes. We use options in place of primitive actions, and in place of the one-step reward we use the discounted reward accumulated while the option runs, discounting the value of the resulting state by gamma^k, where k is the number of steps the option took.
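A sketch of the SMDP Q-learning update this describes, assuming a dict-like Q keyed by (state, option): after executing option o from s for k steps with accumulated discounted reward r_cum = r_1 + gamma*r_2 + ... + gamma^(k-1)*r_k,

def smdp_q_update(Q, s, o, r_cum, s_next, k, options, alpha=0.1, gamma=0.99):
    # Discount the bootstrap value by gamma^k because k steps elapsed under the option.
    target = r_cum + (gamma ** k) * max(Q[(s_next, o2)] for o2 in options)
    Q[(s, o)] += alpha * (target - Q[(s, o)])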

80
Q

Is value iteration guaranteed to converge to the optimal policy for an SMDP?

A

With a particular choice of options there is no guarantee of ending up with a policy that is optimal for the underlying MDP; value iteration converges to the best policy expressible with those options.

81
Q

Is value iteration guaranteed to converge for an SMDP?

A

Yes. Value iteration on an SMDP is guaranteed to converge, but (as above) only to the best policy over the given options, which need not be optimal for the underlying MDP.

82
Q

How is SMDP defined? Why is it called semi-Markov?

A

SMDPs (semi-Markov decision processes) are decision processes whose actions are temporally extended options rather than one-step "atomic" actions, so a decision can take a variable amount of time. They are "semi"-Markov because, at the level of the underlying time steps, what happens next depends on how long the current option has been running, not just on the current state; the process is Markov only at the decision points.

83
Q

How is an option defined?

A

(I, Pi, Beta): I is the initiation set (the states in which the option can be started), Pi is the policy followed while the option executes, and Beta gives the termination condition (the probability of the option terminating in each state).