C4 Flashcards

1
Q

policy-based methods

A

do not use a separate value function but find the policy directly. They start with a policy function, which they improve episode by episode with policy gradient methods

2
Q

why do we need policy-based methods?

A

for environments with discrete actions, selecting the action with the best value in a state works well, because that action is clearly separated from the next-best action. For continuous action spaces this becomes unstable: with value-based methods, small perturbations in the Q-values may lead to large changes in the policy

3
Q

why do stochastic policies not need separate exploration methods?

A

they explore by their very nature: they return a distribution over actions from which actions are sampled, rather than a single greedy action

4
Q

what is a potential disadvantage of purely episodic policy-based methods?

A

they have high variance, they may find local instead of global optima, and they converge more slowly than value-based methods

5
Q

how do policy-based methods learn?

A

they learn a parameterized policy that selects actions without consulting a value function; the policy function is represented directly, which also allows the policy to select actions in continuous action spaces

so, the policy is represented by a set of parameters θ, which map the states S to action probabilities A. We randomly sample a new policy and, if it is better, adjust the parameters in the direction of this new policy
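
As a minimal illustration (not from the card), a parameterized policy can be a linear softmax over action preferences; the shapes and values below are made up for the example.

import numpy as np

def softmax_policy(theta, state):
    # pi_theta(.|s): map a state vector to a probability distribution over actions
    logits = state @ theta                    # theta has shape (state_dim, n_actions)
    logits = logits - logits.max()            # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

theta = 0.01 * np.random.randn(4, 2)          # 4 state features, 2 actions (illustrative)
state = np.array([0.1, -0.2, 0.05, 0.0])
action = np.random.choice(2, p=softmax_policy(theta, state))   # sampling gives natural exploration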

6
Q

how do we measure the quality of a policy?

A

the value of the start state:
J(θ) = V^π(s_0)

7
Q

how do we maximize the objective function J?

A

we apply gradient ascent, so in each time step we do this update:
πœƒ_{𝑑+1} = πœƒ_𝑑 + 𝛼 Β· βˆ‡_πœƒ 𝐽 (πœƒ)

8
Q

how is the following update rule derived?
θ_{t+1} = θ_t + α · Q̂(s, a) · ∇_θ log π_θ(a | s)

A

from πœƒ_{𝑑+1} = πœƒ_𝑑 + 𝛼 Β· βˆ‡πœƒ 𝐽 (πœƒ), we can fill in this:
πœƒ
{𝑑+1} = πœƒ_𝑑 + 𝛼 Β· βˆ‡πœ‹_πœƒ_𝑑 (a* | s), because we want to push the parameters in the direction of the optimal action

we don’t know which action is best, but we can take a sample trajectory and use estimates (Q_hat) of the action values of the sample. Then we get πœƒ_{𝑑+1} = πœƒ_𝑑 + 𝛼 * Q_hat(s, a)βˆ‡πœ‹_πœƒ_𝑑 (π‘Ž|𝑠)

Problem: we are going to push harder AND more often on actions with a high value, so they are doubly improved. We fix this by dividing by the general probability πœ‹_πœƒ (π‘Ž|𝑠). The fraction can be written as βˆ‡_πœƒ log πœ‹_πœƒ (π‘Ž|𝑠)

This formula is the core of REINFORCE
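
A compact sketch of a REINFORCE-style update, assuming a linear-softmax policy pi_theta(a|s) = softmax(state @ theta); the trajectory format and names are illustrative, not from the card.

import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_update(theta, trajectory, alpha=0.01):
    # trajectory holds (state, action, q_hat) tuples, where q_hat estimates Q(s, a)
    for state, action, q_hat in trajectory:
        probs = softmax(state @ theta)                     # pi_theta(.|s)
        one_hot = np.zeros_like(probs)
        one_hot[action] = 1.0
        grad_log_pi = np.outer(state, one_hot - probs)     # grad_theta log pi_theta(a|s)
        theta = theta + alpha * q_hat * grad_log_pi        # push toward high-value actions
    return theta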

9
Q

what is meant with the online approach?

A

the updates are performed as the timesteps of the trajectory are traversed, so information is used as soon as it is known. This is the opposite of the batch approach, where gradients are summed over the states and actions and updates are performed at the end of the trajectory
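
A rough sketch of the difference, assuming a hypothetical grad_step(theta, step) that returns the gradient contribution of one timestep:

def online_updates(theta, trajectory, grad_step, alpha=0.01):
    # online: apply each gradient as soon as the timestep is traversed
    for step in trajectory:
        theta = theta + alpha * grad_step(theta, step)
    return theta

def batch_updates(theta, trajectory, grad_step, alpha=0.01):
    # batch: sum the gradients over the whole trajectory, update once at the end
    total = sum(grad_step(theta, step) for step in trajectory)
    return theta + alpha * total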

10
Q

name 3 advantages of policy-based methods

A
  1. parameterization is at the core of policy-based methods, making them a good match for deep learning (no stability problems)
  2. they can easily find stochastic policies (value-based methods find deterministic policies), and there is natural exploration, so no need for ε-greedy
  3. they are effective in large or continuous action spaces; small changes in θ lead to small changes in π (no convergence and stability issues)
11
Q

what are the disadvantages of policy-based methods?

A

they are high-variance, because a full trajectory is generated randomly (no guidance at each step). Consequences:

  1. policy improvement happens infrequently, leading to slow convergence compared to value-based methods
  2. often a local optimum is found, since convergence to the global optimum takes too long
12
Q

why do we need Actor-Critic bootstrapping?

A

we want to combine the advantage of the value-based approach (low variance) with the advantage of the policy-based approach (low bias)

bootstrapping gives us a better reward estimate, so it reduces the variance that comes from the cumulative reward estimate. It uses the value function to compute intermediate n-step values per episode. The n-step values sit in between the full-episode Monte Carlo return and the single-step temporal difference target. We compute the n-step target (instead of the full trace return):
Q̂_n(s_t, a_t) = Σ_{k=0}^{n−1} r_{t+k} + V_φ(s_{t+n})
and use this improved estimate to update the policy
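
A small sketch of this target, assuming arrays of per-step rewards and state-value estimates; the card's formula is undiscounted, so gamma=1.0 reproduces it (the discount parameter is an added generality).

def n_step_target(rewards, values, t, n, gamma=1.0):
    # Q_hat_n(s_t, a_t): the next n rewards plus the bootstrapped value
    # of the state reached after n steps (values[t + n] plays the role of V_phi(s_{t+n}))
    target = sum(gamma ** k * rewards[t + k] for k in range(n))
    return target + gamma ** n * values[t + n]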

13
Q

why baseline subtraction?

A

it reduces the variance but leaves the expectation unaffected: we only push up on actions that are better than average and push down on actions that are below average, instead of pushing up everything that is positive. As the baseline we choose the value function. We obtain the advantage function:
A(s_t, a_t) = Q(s_t, a_t) - V(s_t)
it estimates how much better a particular action is, compared to the expected value of that state

we can now fill in the estimated cumulative reward and use the estimated advantage to update the policy
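
A sketch combining the two previous ideas: n-step targets as the Q estimate and the state value as the baseline (array names are illustrative).

def n_step_advantages(rewards, values, n, gamma=1.0):
    # A(s_t, a_t) = Q_hat_n(s_t, a_t) - V(s_t), with V as the baseline
    advantages = []
    for t in range(len(rewards) - n):
        q_hat_n = sum(gamma ** k * rewards[t + k] for k in range(n))
        q_hat_n += gamma ** n * values[t + n]
        advantages.append(q_hat_n - values[t])
    return advantages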

14
Q

what is A3C?

A

multiple actor-learners are dispatched to separate instantiations of the environment. They all interact with their own copy of the environment, collect experience, and asynchronously push their gradient updates to a central network. This has a stabilizing effect on training
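
A highly simplified sketch of the asynchronous pattern, with threads standing in for actor-learners and a shared parameter vector standing in for the central network; compute_gradient, make_env, and grad_fn are assumed placeholders.

import threading
import numpy as np

central_theta = np.zeros(8)          # shared central parameters (size is illustrative)
lock = threading.Lock()

def actor_learner(env, compute_gradient, alpha=0.01, steps=100):
    # one actor-learner: pull the current parameters, compute a local gradient
    # on its own environment instance, and asynchronously push the update
    global central_theta
    for _ in range(steps):
        with lock:
            local_theta = central_theta.copy()
        grad = compute_gradient(local_theta, env)        # assumed helper
        with lock:
            central_theta = central_theta + alpha * grad

# workers = [threading.Thread(target=actor_learner, args=(make_env(), grad_fn))
#            for _ in range(4)]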

15
Q

what is TRPO?

A

Trust Region Policy Optimization: it aims to further reduce the high variability in the policy parameters.

We want to take the largest possible improvement step on the policy parameters without causing performance collapse: we use an adaptive step size that depends on how well the optimization is progressing.

If the quality of the approximation is still good, we can expand the trust region; if the divergence between the new and the current policy gets too large, we shrink it
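
A toy sketch of the expand/shrink idea described above; the thresholds and factors are illustrative, and this is not TRPO's exact update rule.

def adapt_trust_region(delta, kl_new_vs_current, approximation_good,
                       grow=1.5, shrink=0.5):
    # expand the region while the local approximation holds and the new policy
    # stays close to the current one; shrink it when the divergence gets large
    if approximation_good and kl_new_vs_current <= delta:
        return delta * grow
    return delta * shrink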

16
Q

what is soft actor critic (SAC)?

A

it uses entropy regularization in the update rule, to ensure that we move towards the optimal policy while also keeping the policy as wide (high-entropy) as possible

it also uses a replay buffer
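
A small sketch of what the entropy term adds to the objective; the temperature value is an illustrative assumption, not taken from the card.

import numpy as np

def entropy(probs, eps=1e-8):
    # Shannon entropy of the action distribution; higher means a "wider" policy
    return -np.sum(probs * np.log(probs + eps))

def soft_objective(expected_return, probs, temperature=0.2):
    # entropy-regularized objective: maximize return plus an entropy bonus
    return expected_return + temperature * entropy(probs)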

17
Q

what is PPO?

A

proximal policy optimization: a simpler variant of TRPO that keeps the new policy close to the current one with a clipped surrogate objective; it is easier to implement and has better (empirical) sample complexity
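
A sketch of the clipped surrogate objective for a single sample; epsilon = 0.2 is a commonly used clip range, assumed here rather than taken from the card.

import numpy as np

def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    # ratio = pi_new(a|s) / pi_old(a|s); clipping removes the incentive to
    # move the new policy far away from the current one
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return np.minimum(unclipped, clipped)    # take the pessimistic (lower) value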