W5 Model-based RL Flashcards

1
Q

What is model-based RL? What is the advantage of model-based over model-free methods?

A

In model-based methods the agent learns a transition (and reward) model of the environment, which is then used for planning to improve the policy or value function. The main advantage over model-free methods is reduced sample complexity: many updates can be done with the learned model instead of with new environment samples.
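A minimal sketch of that loop on a toy problem, assuming a small tabular MDP and maximum-likelihood model estimates (the chain environment, sizes, and value-iteration planner below are illustrative, not from the card):

    import numpy as np

    n_states, n_actions, gamma = 5, 2, 0.9

    def env_step(s, a):
        # True environment (hidden from the agent): action 1 moves right, action 0 moves left;
        # reaching the rightmost state gives reward 1.
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        return s_next, 1.0 if s_next == n_states - 1 else 0.0

    rng = np.random.default_rng(0)
    counts = np.zeros((n_states, n_actions, n_states))   # transition counts N(s, a, s')
    reward_sum = np.zeros((n_states, n_actions))         # summed rewards per (s, a)

    # 1) Learn the model from (random) interaction with the environment.
    for _ in range(2000):
        s, a = rng.integers(n_states), rng.integers(n_actions)
        s_next, r = env_step(s, a)
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r

    visits = np.maximum(counts.sum(axis=2), 1)
    T = counts / visits[:, :, None]        # estimated transition model T(s' | s, a)
    R = reward_sum / visits                # estimated reward model R(s, a)

    # 2) Plan with the learned model (value iteration): no further environment samples needed.
    Q = np.zeros((n_states, n_actions))
    for _ in range(100):
        Q = R + gamma * T @ Q.max(axis=1)

    print("greedy policy:", Q.argmax(axis=1))   # should point right, toward the rewarding state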

2
Q

Why may the sample complexity of model-based methods suffer in high-dimensional problems?

A

In high-dimensional problems the state space is very large, so the agent has to sample many more transitions to cover it; building a high-accuracy transition model T(s'|s, a) over that many states takes a lot of samples and time, which can undo the sample-efficiency advantage of learning the model in the first place.

3
Q

Which functions are part of the dynamics model?

A

The transition function T(s'|s, a) and the reward function R(s, a).
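As a sketch, a learned dynamics model for a continuous state could expose exactly those two functions; the linear least-squares fit and the synthetic data below are illustrative assumptions:

    import numpy as np

    class LinearDynamicsModel:
        """Dynamics model = transition function f(s, a) -> s' plus reward function r(s, a)."""

        def fit(self, S, A, S_next, Rews):
            X = np.hstack([S, A, np.ones((len(S), 1))])                  # features: state, action, bias
            self.W_trans, *_ = np.linalg.lstsq(X, S_next, rcond=None)    # least-squares transition fit
            self.w_rew, *_ = np.linalg.lstsq(X, Rews, rcond=None)        # least-squares reward fit

        def transition(self, s, a):
            return np.hstack([s, a, 1.0]) @ self.W_trans                 # predicted next state s'

        def reward(self, s, a):
            return (np.hstack([s, a, 1.0]) @ self.w_rew).item()          # predicted reward r

    # Synthetic data where the true dynamics are s' = s + a and r = -|s| (purely illustrative).
    rng = np.random.default_rng(0)
    S = rng.normal(size=(200, 1))
    A = rng.normal(size=(200, 1))
    model = LinearDynamicsModel()
    model.fit(S, A, S_next=S + A, Rews=-np.abs(S))
    s, a = np.array([0.5]), np.array([0.2])
    print(model.transition(s, a), model.reward(s, a))    # roughly [0.7] and -0.5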

4
Q

Mention four deep model-based approaches.

A

Two for learning the model and two for planning:

learning the model:
1) uncertainty modeling [ensembling]
2) latent models [Value Prediction Network (VPN)]

planning:
3) Model-Predictive Control (MPC)
4) Value Iteration Networks (VIN)

5
Q

Do model-based methods achieve better sample complexity than model-free?
Do model-based methods achieve better performance than model-free?

A

1) Generally yes: once the agent has learned an accurate transition model it can cheaply sample transitions from that model to update its policy, and it no longer depends on the environment for every update the way model-free agents do.

2) Not really. The downside is that the learned transition model may be inaccurate, and the resulting policy may be of low quality. No matter how many samples can be taken for free from the model, if the agent's local transition model does not reflect the environment's real transition model, then the locally learned policy function will not work in the environment.

6
Q

In Dyna-Q the policy is updated by two mechanisms: learning by sampling the environment and what other mechanism?

A

Planning with the learned model: simulated experience generated from the model (with randomly selected, previously visited states and actions) is used to update the Q-function.

Dyna-Q uses the Q-function as behavior policy 𝜋(𝑠) to perform 𝜖-greedy sampling of the environment. It then updates the Q-function with the observed reward, and also updates an explicit model 𝑀. When the model 𝑀 has been updated, it is used 𝑁 times for planning with randomly selected states and actions, each time updating the Q-function (see the sketch below).
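A minimal tabular Dyna-Q sketch of those two mechanisms; the toy chain environment and the hyperparameter values are illustrative assumptions:

    import numpy as np

    n_states, n_actions = 6, 2
    alpha, gamma, epsilon, N = 0.1, 0.95, 0.1, 20
    rng = np.random.default_rng(0)

    def env_step(s, a):
        # Toy deterministic chain: action 1 moves right, 0 moves left; reward at the right end.
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        return s_next, 1.0 if s_next == n_states - 1 else 0.0

    Q = np.zeros((n_states, n_actions))
    model = {}                                # explicit model M: (s, a) -> (reward, next state)
    s = 0
    for _ in range(2000):
        # Mechanism 1: learning by epsilon-greedy sampling of the environment.
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
        s_next, r = env_step(s, a)
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        model[(s, a)] = (r, s_next)           # update the model with the observed transition

        # Mechanism 2: planning, N times, on randomly selected previously seen (s, a) pairs.
        seen = list(model.keys())
        for _ in range(N):
            ps, pa = seen[rng.integers(len(seen))]
            pr, ps_next = model[(ps, pa)]
            Q[ps, pa] += alpha * (pr + gamma * Q[ps_next].max() - Q[ps, pa])

        s = 0 if s_next == n_states - 1 else s_next   # restart the episode at the goal

    print("greedy policy:", Q.argmax(axis=1))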

7
Q

Why is the variance of ensemble methods lower than that of the individual machine learning approaches used in the ensemble?

A

(Model uncertainty.) Ensemble methods combine multiple learners and average (or otherwise aggregate) their predictions. Because the errors of the individual models are partly independent, they partly cancel out in the average, so the variance of the ensemble prediction is lower than that of a single model; the spread across the ensemble members can additionally be used as an estimate of the model's uncertainty.
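A tiny numerical illustration of that variance argument; the Gaussian-noise stand-ins for "separately trained models" and the sizes below are assumptions made only for this demo:

    import numpy as np

    rng = np.random.default_rng(0)
    true_value, noise_std, n_members, n_trials = 2.0, 1.0, 10, 5000

    single_preds, ensemble_preds = [], []
    for _ in range(n_trials):
        # Each member's prediction = truth + independent error (stand-in for a separately trained model).
        members = true_value + noise_std * rng.normal(size=n_members)
        single_preds.append(members[0])          # prediction of one individual model
        ensemble_preds.append(members.mean())    # ensemble prediction = average of all members

    print("variance of a single model :", np.var(single_preds))     # close to noise_std**2
    print("variance of the ensemble   :", np.var(ensemble_preds))   # close to noise_std**2 / n_members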

8
Q

What does model-predictive control do and why is this approach suited for models with lower accuracy?

A

In MPC the learned model is used to optimize the behavior only over a limited horizon into the future; only the first planned action is executed, and after each environment step the model is updated (re-learned) with the new data and the plan is recomputed from the new state.

In this way small model errors do not get a chance to accumulate over a long horizon and greatly influence the outcome, which is why MPC still works with models of lower accuracy.
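A sketch of one common MPC variant, random-shooting planning over a learned model; the stand-in dynamics/reward functions, horizon, and candidate count are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    horizon, n_candidates, gamma = 10, 200, 0.99

    def model_step(s, a):
        # Stand-in for the learned dynamics model: a 1-D point that moves by the action.
        return s + a

    def model_reward(s, a):
        # Stand-in for the learned reward model: stay close to the origin, small action cost.
        return -(s ** 2) - 0.01 * a ** 2

    def mpc_action(s0):
        # Random shooting: sample action sequences, roll each out in the model over a short
        # horizon, and return only the FIRST action of the best sequence (re-plan every step).
        best_return, best_first_action = -np.inf, 0.0
        for _ in range(n_candidates):
            actions = rng.uniform(-1.0, 1.0, size=horizon)
            s, ret = s0, 0.0
            for t, a in enumerate(actions):
                ret += (gamma ** t) * model_reward(s, a)
                s = model_step(s, a)
            if ret > best_return:
                best_return, best_first_action = ret, actions[0]
        return best_first_action

    # Control loop: execute one action per real environment step, then re-plan from the new state
    # (in full MPC the model would also be updated here with the newly observed transition).
    s = 3.0
    for _ in range(20):
        a = mpc_action(s)
        s = s + a                        # "real" environment step (also just a stand-in here)
    print("final state (should be near 0):", s)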

9
Q

What is the advantage of planning with latent models over planning with actual models?

A

The latent-model approach reduces the dimensionality of the observation space. By using features extracted from the world model as inputs to the agent, a compact and simple policy can be trained to solve a task, and planning occurs in the compressed world.

10
Q

What are latent models?

A

Latent models focus on dimensionality reduction of high-dimensional problems.
The idea behind latent models is that in most high-dimensional environments some elements of the observation matter much less than others. Latent models reduce dimensionality by learning a compact representation of the input and of the reward. Since planning and learning are now possible in this lower-dimensional latent space, the sample complexity of learning with latent models improves.

11
Q

How are latent models trained?

A

An example is the Value Prediction Network (VPN).

The networks are trained with 𝑛-step Q-learning and TD search. Trajectories are generated with an 𝜖-greedy policy using the planning algorithm from Alg. 5.5. VPN achieved good results on Atari games such as Pacman and Seaquest, outperforming model-free DQN, and outperforming observation-based planning in stochastic domains.
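A small sketch of the n-step Q-learning target such a network is regressed onto; the function name, shapes, and the made-up numbers in the usage line are assumptions, not VPN code:

    import numpy as np

    def n_step_q_targets(rewards, next_state_values, gamma=0.99, n=3):
        # rewards[t]           : reward observed after step t
        # next_state_values[t] : bootstrap value max_a Q(s_{t+1}, a) after step t
        # target[t] = r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1} + gamma^n * max_a Q(s_{t+n}, a),
        # truncated to a shorter lookahead near the end of the trajectory.
        T = len(rewards)
        targets = np.zeros(T)
        for t in range(T):
            steps = min(n, T - t)
            target = sum(gamma ** k * rewards[t + k] for k in range(steps))
            target += gamma ** steps * next_state_values[t + steps - 1]
            targets[t] = target
        return targets

    # Usage with one short (made-up) epsilon-greedy trajectory:
    print(n_step_q_targets(rewards=[0.0, 0.0, 1.0, 0.0],
                           next_state_values=[0.5, 0.9, 0.2, 0.1]))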

12
Q

Mention four typical modules that constitute the latent model.

A

VPN has four functions (sketched in code below):
1. Encoding function: maps the actual state s to a latent (abstract) state s_latent (dimensionality reduction).
2. Reward function: maps the latent state s_latent and option o to a reward and a discount factor.
3. Value function: maps the abstract (latent) state to its value, using a separate neural network.
4. Transition function: maps s_latent to the next latent state s'_latent under option o.
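A structural sketch of those four modules; the plain-numpy linear "networks", the layer sizes, and the single sigmoid-squashed discount output are illustrative assumptions, not the original VPN architecture:

    import numpy as np

    rng = np.random.default_rng(0)
    obs_dim, latent_dim, n_options = 64, 8, 4

    # Random weight matrices stand in for the learned neural modules.
    W_enc = rng.normal(size=(obs_dim, latent_dim)) / np.sqrt(obs_dim)
    W_rew = rng.normal(size=(latent_dim + n_options, 2)) / np.sqrt(latent_dim)
    W_val = rng.normal(size=latent_dim) / np.sqrt(latent_dim)
    W_trans = rng.normal(size=(latent_dim + n_options, latent_dim)) / np.sqrt(latent_dim)

    def one_hot(o):
        v = np.zeros(n_options); v[o] = 1.0; return v

    def encode(s_actual):                       # 1) encoding: actual state -> latent state
        return np.tanh(s_actual @ W_enc)

    def reward(s_latent, o):                    # 2) reward: latent state + option -> reward, discount
        r, g = np.hstack([s_latent, one_hot(o)]) @ W_rew
        return r, 1.0 / (1.0 + np.exp(-g))      # squash the discount into (0, 1)

    def value(s_latent):                        # 3) value: latent state -> value (separate "network")
        return float(s_latent @ W_val)

    def transition(s_latent, o):                # 4) transition: latent state + option -> next latent state
        return np.tanh(np.hstack([s_latent, one_hot(o)]) @ W_trans)

    # One planning-style rollout carried out entirely in latent space.
    z = encode(rng.normal(size=obs_dim))
    r, g = reward(z, o=1)
    z_next = transition(z, o=1)
    print(r, g, value(z_next))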

13
Q

What is the advantage of end-to-end planning and learning?

A

Hand-crafted planning algorithms are replaced by differentiable planning modules, so the whole system can be trained end-to-end and learns to plan and make decisions directly from raw input data.
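As a sketch of the recurrence that such a differentiable planner (e.g. a Value Iteration Network) unrolls, here is K-step value iteration written purely as dense array operations; the grid example and sizes are illustrative, and this plain numpy version is not learned or differentiable end to end:

    import numpy as np

    def value_iteration_module(R, P, K=30, gamma=0.95):
        # K unrolled value-iteration steps expressed as array operations.
        # R: (S, A) reward map, P: (A, S, S) transition tensor with P[a, s, s'].
        # In a Value Iteration Network this recurrence is built from convolution + max
        # layers and trained by backpropagation together with the rest of the network.
        V = np.zeros(R.shape[0])
        for _ in range(K):
            Q = R + gamma * np.einsum("aij,j->ia", P, V)   # Q(s, a) = R(s, a) + gamma * sum_s' P * V(s')
            V = Q.max(axis=1)
        return V, Q

    # Tiny 1-D grid example (5 states, actions left/right, reward for stepping onto the right end).
    S, A = 5, 2
    P = np.zeros((A, S, S))
    for s in range(S):
        P[0, s, max(s - 1, 0)] = 1.0          # action 0: move left
        P[1, s, min(s + 1, S - 1)] = 1.0      # action 1: move right
    R = np.zeros((S, A))
    R[S - 2, 1] = 1.0
    V, Q = value_iteration_module(R, P)
    print("greedy actions:", Q.argmax(axis=1))   # action 1 (right) on the way to the goal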
