W4 Policy-based Flashcards

1
Q

Why are value-based methods difficult to use in continuous action spaces?

A

We can no longer use the argmax operator in a continuous action space to choose the best action.

Alternatively, you could discretize the action space, but this leads to suboptimal solutions compared to acting on the continuous actions directly. As you increase the number of bins you approach the continuous action space, but the number of discrete actions grows exponentially with the action dimension, and learning gets more difficult.
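A quick back-of-the-envelope illustration of this blow-up (the 7-joint arm and the bin count are made-up numbers, purely for illustration):

    # Hypothetical example: discretizing a continuous action space.
    dims = 7       # action dimensions, e.g. joint torques of a robot arm (assumed)
    bins = 10      # discretization bins per dimension (assumed)
    num_actions = bins ** dims
    print(num_actions)  # 10000000 discrete actions -- far too many to argmax over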

2
Q

What is MuJoCo? Can you name a few example tasks?

A

MuJoCo (Multi-Joint dynamics with Contact) is a physics simulator. It is often used for learning visuo-motor skills (eye-hand coordination, grasping) and for learning different locomotion gaits of multi-legged “animals”; typical benchmark tasks are HalfCheetah, Ant, Walker2d, and Humanoid.

3
Q

What is an advantage of policy-based methods?

A

Parameterization is at the core of policy-based methods, making them a good match for deep function approximators such as neural networks, without the stability problems of value-based methods.

They can easily represent stochastic policies (value-based methods find deterministic policies), and the stochastic policy gives natural exploration, so there is no need for epsilon-greedy.

They are effective in large or continuous action spaces: small changes in the parameters θ lead to small changes in the policy π and in the trajectory distribution τ, so they do not suffer from the convergence and stability issues of argmax-based value updates.
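A minimal sketch of such a parameterized stochastic policy for a continuous action space, assuming PyTorch is available; the class name, layer sizes, and dimensions are made up for illustration:

    import torch
    import torch.nn as nn

    class GaussianPolicy(nn.Module):
        """Parameterized stochastic policy pi_theta(a|s) for continuous actions."""
        def __init__(self, state_dim, action_dim, hidden=64):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                     nn.Linear(hidden, hidden), nn.Tanh())
            self.mean = nn.Linear(hidden, action_dim)
            self.log_std = nn.Parameter(torch.zeros(action_dim))

        def forward(self, state):
            h = self.net(state)
            return torch.distributions.Normal(self.mean(h), self.log_std.exp())

    policy = GaussianPolicy(state_dim=8, action_dim=2)
    dist = policy(torch.randn(1, 8))
    action = dist.sample()                     # continuous action, no argmax needed
    log_prob = dist.log_prob(action).sum(-1)   # enters the policy-gradient update

Sampling from the learned distribution is what gives the natural exploration mentioned above.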

4
Q

What is a disadvantage of full-trajectory policy-based methods?

A

They are high-variance, because a full trajectory is generated essentially at random, with no guidance at each step (see the gradient estimate after the list below).
Consequences:

  1. policy improvement happens infrequently, leading to slow convergence compared to value-based methods
  2. often a local optimum is found, since convergence to the global optimum takes too long
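For reference, the Monte Carlo REINFORCE gradient estimate in the usual notation; the full-trajectory return R(\tau) multiplies every term, which is where the high variance comes from:

    \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right],
    \qquad R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_t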
5
Q

From what two sources does the high variance of policy methods originate? And what is the fix that Actor Critic uses for these two problems?

A

(1) high variance in the cumulative reward estimate -> bootstrapping for better reward estimates
(2) high variance in the gradient estimate -> baseline subtraction
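Both fixes in one formula, with V_\phi the critic's state-value estimate: the bootstrapped one-step estimate replaces the noisy Monte Carlo return, and the baseline V_\phi(s_t) is subtracted from it:

    \nabla_\theta J(\theta) \approx \mathbb{E}\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left( \underbrace{r_t + \gamma V_\phi(s_{t+1})}_{\text{bootstrapped estimate}} - \underbrace{V_\phi(s_t)}_{\text{baseline}} \right) \right]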

6
Q

What is the difference between actor critic and vanilla policy-based methods?

A

Actor critic addresses the high-variance problem of vanilla policy-based methods by adding value-based elements. The actor stands for the action selection, i.e. the policy-based part; the critic stands for the value-based part, which evaluates the actions chosen by the actor.

7
Q

How many parameter sets are used by actor critic? How can they be represented in a neural network?

A

You have two parameter sets, one for the actor and one for the critic. The actor outputs the policy and the critic predicts the state value. In a neural network they can be represented either as two separate networks or as a single network with a shared trunk and two output heads (one for the policy, one for the value), as in the sketch below.
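A minimal sketch of the shared-trunk variant, assuming PyTorch; the class name and layer sizes are made up for illustration:

    import torch
    import torch.nn as nn

    class ActorCritic(nn.Module):
        """One network, two heads: the actor head outputs the policy,
        the critic head outputs a scalar state-value estimate V(s)."""
        def __init__(self, state_dim, n_actions, hidden=128):
            super().__init__()
            self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
            self.actor = nn.Linear(hidden, n_actions)   # policy parameters (theta)
            self.critic = nn.Linear(hidden, 1)          # value parameters (phi)

        def forward(self, state):
            h = self.trunk(state)
            policy = torch.distributions.Categorical(logits=self.actor(h))
            value = self.critic(h)
            return policy, value

    model = ActorCritic(state_dim=4, n_actions=2)
    policy, value = model(torch.randn(1, 4))
    action = policy.sample()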

8
Q

Describe the relation between Monte Carlo REINFORCE, 𝑛-step methods, and temporal difference bootstrapping.

A

They form a spectrum of bootstrapping: Monte Carlo REINFORCE uses the full-episode return (no bootstrapping), 𝑛-step methods use 𝑛 real rewards and then bootstrap on a value estimate, and temporal difference methods bootstrap after every single step (𝑛 = 1).
https://discord.com/channels/1110894851249143892/1110894851802800192/1116023667864911962
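The 𝑛-step target makes the spectrum explicit: 𝑛 = 1 gives the one-step TD target, and letting 𝑛 run to the end of the episode recovers the Monte Carlo return used by REINFORCE:

    \hat{Q}_n(s_t, a_t) = r_t + \gamma r_{t+1} + \dots + \gamma^{\,n-1} r_{t+n-1} + \gamma^{\,n} V(s_{t+n})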

9
Q

What is the advantage function?

A

The advantage function is the state-action value estimate 𝑄 minus the state value V (which is the baseline).
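In symbols, with the one-step estimate that actor-critic implementations typically use on the right:

    A(s_t, a_t) = Q(s_t, a_t) - V(s_t), \qquad \hat{A}_t = r_t + \gamma V(s_{t+1}) - V(s_t)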

10
Q

Give two actor-critic approaches to further improve upon bootstrapping and advantage functions, that are used in high-performing algorithms such as PPO and SAC.

A

(1) Trust region optimization: use a special (surrogate) loss function with an additional constraint on the optimization problem, so that each update keeps the new policy close to the old one (the idea behind TRPO and PPO).

(2) Entropy regularization: add an entropy term to the objective so that the policy keeps exploring (maximum-entropy reinforcement learning, the idea behind SAC).
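For the second point, one common way to write the entropy-regularized ("soft") objective used by maximum-entropy methods such as SAC, with temperature \alpha trading off reward against exploration:

    J(\pi) = \mathbb{E}_\pi\left[ \sum_t \gamma^t \left( r_t + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t)) \right) \right],
    \qquad \mathcal{H}(\pi(\cdot \mid s)) = -\mathbb{E}_{a \sim \pi}[\log \pi(a \mid s)]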

11
Q

What is A3C?

A

A3C (Asynchronous Advantage Actor Critic) is an efficient, distributed implementation of actor critic.
A3C is an asynchronous algorithm: it calculates multiple results in parallel.
A3C can be used on the Arcade Learning Environment (Atari games).

Multiple actor-learners are dispatched to separate instantiations of the environment. They all interact with their own copy of the environment, collect experience, and asynchronously push their gradient updates to a central (global) network. This has a stabilizing effect on training.
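A toy sketch of the asynchronous mechanism only, using plain Python threads and a made-up one-parameter objective in place of real environments and networks (so not an actual A3C implementation):

    import threading
    import random

    shared = {"w": 0.0}          # the "central network": here a single shared parameter
    lock = threading.Lock()

    def worker(steps=1000, lr=0.01):
        for _ in range(steps):
            with lock:
                w = shared["w"]                   # pull the current global parameters
            target = 3.0 + random.gauss(0, 0.1)   # this worker's own "experience"
            grad = 2 * (w - target)               # local gradient of (w - target)^2
            with lock:
                shared["w"] -= lr * grad          # asynchronously push the update

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(shared["w"])   # all workers together drive w towards roughly 3.0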

12
Q

what is TRPO?

A

Trust Region Policy Optimization: it aims to further reduce the high variability in the policy parameters.

We want to take the largest possible improvement step on the policy parameters without causing a performance collapse, so we use an adaptive step size (a trust region) whose size depends on how the optimization is progressing.

If the quality of the local approximation is still good, we can expand the trust region; if the divergence between the new and the current policy gets too large, we shrink it.
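The TRPO step in formula form: maximize the surrogate objective while keeping the new policy inside a KL-divergence trust region of size \delta around the old policy:

    \max_\theta \; \mathbb{E}_t\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} \hat{A}_t \right]
    \quad \text{subject to} \quad
    \mathbb{E}_t\left[ D_{\mathrm{KL}}\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\big) \right] \le \delta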

13
Q

what is PPO?

A

Proximal Policy Optimization: a simpler version of TRPO that is easier to implement and has better complexity. Instead of TRPO's hard KL constraint, it uses a clipped surrogate loss (shown below) to keep the new policy close to the old one.
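The clipped surrogate objective, with probability ratio r_t(\theta) and clip parameter \epsilon (typically around 0.1-0.2):

    r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
    \qquad
    L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[ \min\big( r_t(\theta)\, \hat{A}_t,\; \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_t \big) \right]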
