W3 Deep Value-based Flashcards

1
Q

What is Gym?

A

An open-source collection of reinforcement learning environments (originally from OpenAI) that all expose a common interface.
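A rough usage sketch (assuming the classic CartPole-v1 environment is installed; the exact return signatures of reset/step differ between older gym releases and the newer gymnasium package):

import gym

env = gym.make("CartPole-v1")
obs = env.reset()                       # initial observation
done = False
while not done:
    action = env.action_space.sample()  # random policy, just to illustrate the interface
    obs, reward, done, info = env.step(action)
env.close()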

2
Q

What are the Stable Baselines?

A

A collection of reference implementations of reinforcement learning algorithms; the analogue of Gym, but for algorithms (agents) instead of environments.
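For example, training a DQN agent with the Stable Baselines3 package might look like this (a minimal sketch, assuming stable_baselines3 and a Gym CartPole environment are installed):

from stable_baselines3 import DQN

# Train a DQN agent on CartPole with default hyperparameters
model = DQN("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=10_000)
model.save("dqn_cartpole")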

3
Q

The loss function of DQN uses the Q-function as target. What is a consequence?

A

The Q-function itself keeps being updated during training, so using it as its own target makes convergence hard: the loss function of deep Q-learning minimizes a moving target, a target that depends on the very network being optimized.
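In symbols (a sketch of the basic deep Q-learning loss, without a target network; θ denotes the network weights):

L(θ) = E[ ( R_{t+1} + γ max_a′ Q(s′,a′; θ) − Q(s,a; θ) )² ]

Both the prediction Q(s,a;θ) and the target R_{t+1} + γ max_a′ Q(s′,a′;θ) depend on the same weights θ, so every gradient step also shifts the target.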

4
Q

Why is the exploration/exploitation trade-off central in reinforcement learning?

A

We want the agent to exploit the knowledge it already possesses, but also to keep exploring the environment so that it does not get stuck in local optima.

5
Q

Name one simple exploration/exploitation method

A

ε-greedy, softmax (Boltzmann) exploration, or ε-greedy with linear annealing of ε.
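A minimal ε-greedy sketch (hypothetical helper; assumes q_values is a list of estimated action values for the current state):

import random

def epsilon_greedy(q_values, epsilon):
    # With probability epsilon: explore by picking a random action.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    # Otherwise: exploit by picking the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

Linear annealing then simply decays epsilon, e.g. from 1.0 to 0.1, over the first part of training.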

6
Q

What is bootstrapping?

A

Bootstrapping in RL can be read as “using one or more estimated values in the update step for the same kind of estimated value”.

In most TD update rules, you will see something like this SARSA(0) update:

Q(s,a) ← Q(s,a) + α (R_{t+1} + γ Q(s′,a′) − Q(s,a))

The value R_{t+1} + γ Q(s′,a′) is an estimate for the true value of Q(s,a), also called the TD target. It is a bootstrap method because we are in part using a Q value to update another Q value. There is a small amount of real observed data in the form of R_{t+1}, the immediate reward for the step, and also in the state transition s → s′.
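A tabular sketch of that bootstrapped update (hypothetical helper; Q is a dictionary mapping (state, action) pairs to value estimates):

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # The TD target bootstraps on the current estimate Q[(s_next, a_next)].
    td_target = r + gamma * Q[(s_next, a_next)]
    td_error = td_target - Q[(s, a)]
    Q[(s, a)] += alpha * td_error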

7
Q

Describe the architecture of the neural network in DQN with TN

A

The DQN architecture has two neural networks, the Q network and the Target network, plus a component called Experience Replay. The Q network is the network that is trained to produce the optimal state-action values. The Target network is a periodically updated copy of the Q network; it is kept frozen between updates and is used to compute the TD targets, so the target does not move at every step.
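A rough PyTorch-style sketch of the target-network bookkeeping (hypothetical layer sizes for a CartPole-like task; not a full training loop):

import copy
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)       # frozen copy used to compute the TD targets

# ...train q_net on mini-batches from the replay buffer, then every C steps
# copy the online weights into the target network:
target_net.load_state_dict(q_net.state_dict())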

8
Q

Why is deep reinforcement learning more susceptible to unstable learning than deep supervised learning?

A

In supervised learning the targets (the labels) are fixed, but in deep reinforcement learning the target value keeps changing because it depends on the network being trained. This makes the learning process high-variance, and converging to a moving target is hard.

9
Q

What is the deadly triad?

A

Function approximation, bootstrapping, and off-policy learning. Together, they are called the deadly triad.

10
Q

How does function approximation reduce stability of Q-learning?

A

Function approximation may attribute values to states inaccurately. It can therefore cause states to be mis-identified, and rewards and Q-values to be assigned to the wrong states, which destabilizes learning.

11
Q

What is the role of the replay buffer?

A

The replay buffer serves as the memory of the agent: it stores past transitions, and sampling training batches from it decreases the correlation between consecutive samples.
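A minimal sketch of such a buffer (hypothetical helper built only on Python's standard library):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks up the temporal correlation
        # between consecutive transitions.
        return random.sample(self.buffer, batch_size)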

12
Q

How can correlation between states lead to local minima?

A

Correlation between successive states can bias training towards the part of the state space the agent currently visits. This bias can result in the so-called specialization trap (too much exploitation, too little exploration), i.e. getting stuck in a local minimum.

13
Q

Why should the coverage of the state space be sufficient?

A

Because otherwise parts of the state space that contain the optimal solution may never be visited, so the agent cannot find it.

14
Q

What happens when deep reinforcement learning algorithms do not converge?

A

Training can oscillate or diverge: the algorithm keeps chasing a moving target, since the target is itself based on the parameters that are being optimized.

15
Q

How large is the state space of chess, Go, StarCraft estimated to be? 10^47, 10^170 or 10^1685?

A

Chess ≈ 10^47, Go ≈ 10^170, and StarCraft ≈ 10^1685, respectively.

16
Q

What does the rainbow in the Rainbow paper stand for, and what is the main message?

A

It stands for the combination of the original DQN with six independent improvements (double Q-learning, prioritized replay, dueling networks, multi-step learning, distributional RL, and noisy nets), one colour per component. The main message is that these improvements are largely complementary: by combining their strengths, the Rainbow agent outperforms each individual variant by a large margin.

17
Q

Which statement about the benefit of DQN compared to tabular Q-learning is True? (pick the most convincing reason)
A DQN is faster.
B DQN outperforms tabular Q-learning.
C DQN can better deal with high-dimensional input.
D DQN is more data-efficient.

A

C

18
Q

Zhao is implementing a replay buffer for DQN and was wondering whether you
had some tips regarding sampling methods. Your recommendation is:
A Prioritized Experience Replay.
B Uniform sampling.
C Compare both to find out which one works best on his problem, as their performance varies per application.

A

C

19
Q

Why is diversity important in learning?
A Through de-correlation it improves stability in reinforcement learning.
B Through correlation it prevents over-generalization in supervised learning.
C Through correlation it prevents over-generalization in reinforcement learning.
D Through de-correlation it improves stability in supervised learning.

A

A

20
Q

Which of the following DQN Extensions addresses overestimated action values?
 Double DQN
 Dueling DQN
 Prioritized Experience Replay
 Distributional DQN

A

Double DQN