W2 MDP & Tabular Value-based Flashcards

1
Q

What is the 5-tuple of a Markov Decision Process?

A

(S, A, T_a, R_a, gamma)
S: state space
A: action space
T_a: transition function of the environment
R_a: reward function
gamma: discount factor weighting future rewards relative to immediate rewards.
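
As a concrete memory aid (not part of the original card), a tiny, hypothetical two-state MDP could be written down in Python roughly like this; the state and action names are made up for illustration:

# Hypothetical two-state MDP written as plain Python dictionaries.
S = ["s0", "s1"]                                    # state space
A = ["stay", "go"]                                  # action space
T = {                                               # T_a: P(s' | s, a)
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
}
R = {                                               # R_a: expected reward for (s, a)
    "s0": {"stay": 0.0, "go": 1.0},
    "s1": {"stay": 0.0, "go": 0.0},
}
gamma = 0.9                                         # discount factor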

2
Q

Do value-based methods and policy-based methods work well for discrete or continuous action spaces?

A

Value-based: Discrete
Policy-based: Discrete & Continuous

3
Q

In a tree diagram, in which directions do actions and rewards flow?

A

Actions: downward to the leaves
Rewards: upward, backpropagated to the root

4
Q

What is a sequential decision problem?

A

the agent has to make a sequence of decisions in order to solve a problem

5
Q

What is the Markov property?

A

the next state depends only on the current state and the actions available in it (no influence of historical memory of previous states)

6
Q

What is a policy pi(a|s)?

A

a conditional probability distribution that, for each possible state, specifies the probability of each possible action.
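
For intuition, a minimal sketch of such a tabular policy in Python, using the hypothetical states and actions from the MDP sketch above:

# pi[s][a] = probability of choosing action a in state s;
# the probabilities for each state sum to 1.
pi = {
    "s0": {"stay": 0.2, "go": 0.8},
    "s1": {"stay": 0.5, "go": 0.5},
}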

7
Q

What's on-policy?

A

Learning takes place by backing up the value of the action actually selected by the behavior policy into the value function of that same policy: the behavior policy and the target policy are the same.
SARSA is on-policy

8
Q

What's off-policy?

A

Learning takes place by backing up the value of another action, not the one selected by the behavior policy.
Q-learning is off-policy and greedy: it backs up the value of the best action.
Convergence of off-policy learning can be slower, since older, non-current values are used.

The behavior policy and the target policy are different in Q-learning, but they are the same in SARSA.

9
Q

SARSA update formula? (𝛼=learning rate, 𝛾=discount factor)

A

𝑄(𝑠𝑑, π‘Žπ‘‘) ← 𝑄(𝑠𝑑, π‘Žπ‘‘) + 𝛼[π‘Ÿπ‘‘+1 + 𝛾𝑄(𝑠𝑑+1, π‘Žπ‘‘+1) βˆ’ 𝑄(𝑠𝑑, π‘Žπ‘‘)]

10
Q

Q-learning update formula?

A

𝑄(𝑠𝑑, π‘Žπ‘‘) ← 𝑄(𝑠𝑑, π‘Žπ‘‘) + 𝛼[π‘Ÿπ‘‘+1 + 𝛾 maxπ‘Žπ‘„(𝑠𝑑+1, π‘Ž) βˆ’ 𝑄(𝑠𝑑, π‘Žπ‘‘)]

11
Q

In reinforcement learning the agent can choose which training examples are generated. Why is this beneficial? What is a potential problem?

A

We can generate an endless dataset ourselves through simulation. On the other hand, we don't have 'gold standard' actions for each state; nothing is labeled, so we have to derive the correct policy ourselves.

12
Q

What is Grid world?

A

Grid worlds are the simplest environments: a rectangular grid of squares with a start square and a goal square.

13
Q

In a tree diagram, is successor selection of behavior up or down?
In a tree diagram, is learning values through backpropagation up or down?

A

Successor selection of behavior: down

Learning values through backpropagation: up

14
Q

What is τ?
What is π(s)?
What is V(s)?
What is Q(s, a)?

A

τ: trace, a full rollout of a simulation.
π(s): the policy function; it answers how the different actions a should be chosen in state s.
V(s): the expected cumulative discounted future reward of a state.
Q(s, a): Q-value estimate, the estimated value of taking action a in state s.

15
Q

What is dynamic programming?

A

In the context of RL, dynamic programming recursively traverses the state space. An example algorithm is value iteration.

16
Q

What is recursion?

A

Calling the same code segment within that code segment.

17
Q

Do you know a dynamic programming method to determine the value of a state?

A

Value iteration: it repeatedly updates Q(s, a) and V(s) values, looping over the states and their actions, until convergence occurs.
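
A minimal value-iteration sketch, assuming the MDP is given as the dictionaries S, A, T, R from the hypothetical example near the top of this deck; theta is an assumed convergence threshold:

def value_iteration(S, A, T, R, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            # Compute Q(s, a) for every action, then back up the best one into V(s).
            q_values = [
                R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
                for a in A
            ]
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V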

18
Q

Is an action in an environment reversible for the agent?

A

No, just like in the real world, there is no undo.

19
Q

Is the action space of games typically discrete or continuous?
Is the action space of robots typically discrete or continuous?
Is the environment of games typically deterministic or stochastic?
Is the environment of robots typically deterministic or stochastic?

A

Discrete
Continuous
Deterministic
Stochastic

20
Q

What is the goal of reinforcement learning?

A

to find the sequence of actions that gives the highest reward, or, more formally, to find the optimal policy that gives the best action to take in each state

generally, the objective is to achieve the highest possible average return from the start state

find the optimal policy pi* = argmax_pi V^pi(s_0), i.e., the policy that maximizes the value of the start state s_0
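
As a small illustration of "highest return" (a sketch, not from the original card; the helper name is hypothetical), the discounted return of one trace, i.e. a list of rewards r_1, r_2, ..., which the optimal policy maximizes in expectation from the start state s_0:

def discounted_return(rewards, gamma=0.99):
    # G = r_1 + gamma * r_2 + gamma^2 * r_3 + ...
    return sum(gamma**t * r for t, r in enumerate(rewards))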

21
Q

Which of the five MDP elements is not used in episodic problems?

A

Gamma (the discount factor); it is not needed in episodic problems such as the supermarket problem.

22
Q

Which model or function is meant when we say "model-free" or "model-based"?

A

When we say "model-free", we refer to the absence of a model of the environment, such as the transition function.

23
Q

What type of action space and what type of environment are suited for value-based methods?

A

A discrete action space; the state space (environment) can be discrete or continuous.

24
Q

Why are value-based methods used for games and not for robotics?

A

Because robotics operates in the real world, where the action space is continuous.

25
Q

Name two basic Gym environments.

A

The CartPole problem and the LunarLander environment.
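
A minimal usage sketch for one of these environments, using the classic Gym reset/step API (newer gymnasium releases return slightly different tuples):

import gym

# Run one CartPole episode with a random policy.
env = gym.make("CartPole-v1")
state = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()            # random action
    state, reward, done, info = env.step(action)
    total_reward += reward
env.close()
print(total_reward)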