TD Learning Flashcards

1
Q

What are the three main categories of RL algorithms?

A
Model-based (requires the most information, but the learning problem is generally the easiest)
Value-function-based / model-free
Policy search (the most direct/general approach, but the learning problem is generally the most difficult)
2
Q

Describe model-based learning

A

Model-based learning attempts to learn the model of the MDP (transition function T and reward function R) from experience and then compute Q* and the policy pi from that model using an MDP solver (such as value iteration or policy iteration).
(s,a,r)* -> model learner <-> T/R -> MDP solver -> Q* -> argmax -> policy
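
A minimal sketch, assuming a small tabular MDP, of what this pipeline can look like in Python; the names (model_based_q, transitions, n_states, n_actions) are illustrative assumptions, not from the card.

import numpy as np

def model_based_q(transitions, n_states, n_actions, gamma=0.95, n_iters=500):
    # transitions: list of (s, a, r, s_next) tuples gathered from experience
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))

    # Model learner: count-based estimates of T(s, a, s') and R(s, a)
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
    visits = np.maximum(counts.sum(axis=2), 1)
    T = counts / visits[:, :, None]          # estimated transition probabilities
    R = reward_sum / visits                  # estimated expected rewards

    # MDP solver: value iteration on the estimated model gives Q*
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        Q = R + gamma * T @ Q.max(axis=1)

    return Q, Q.argmax(axis=1)               # Q* and the greedy policy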

3
Q

Describe value-function-based (model-free) learning

A

Attempts to learn the value function (e.g. Q) directly from observed states, actions, and rewards, without building a model; the policy is then read off the learned values.
(s,a,r)* -> value update <-> Q -> argmax -> policy
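
A minimal sketch (an illustrative assumption, not from the card) of one model-free value update: one-step Q-learning applied to a single (s, a, r, s') transition.

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    # Move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s_next, a')
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Usage: Q = np.zeros((n_states, n_actions)); the greedy policy is Q.argmax(axis=1).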

4
Q

Describe policy search learning

A

Attempts to derive the optimal policy by updating the policy itself directly from experience, without learning a value function or a model as an intermediate step.
(s,a,r)* -> policy update <-> policy
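
A minimal sketch (an illustrative assumption, not from the card) of one policy-search method: REINFORCE-style gradient ascent on a tabular softmax policy, updating the policy parameters directly from episode returns.

import numpy as np

def softmax_policy(theta, s):
    prefs = theta[s] - theta[s].max()
    probs = np.exp(prefs)
    return probs / probs.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    # episode: list of (s, a, r) steps; theta: (n_states, n_actions) parameters
    G, returns = 0.0, []
    for _, _, r in reversed(episode):        # compute returns-to-go
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for (s, a, _), G in zip(episode, returns):
        grad = -softmax_policy(theta, s)     # gradient of log pi(a|s) w.r.t. theta[s]
        grad[a] += 1.0
        theta[s] += alpha * G * grad         # push probability toward high-return actions
    return theta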

5
Q

Under what criteria does the value estimate V_t(s) converge to the true value V(s) as t -> infinity? In general, what learning rates satisfy these criteria?

A
  1. sum_t alpha_t = infinity
  2. sum_t alpha_t^2 < infinity

For example, alpha_t = 1/t^n satisfies both for n in (1/2, 1].
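
A tiny illustrative check (an assumption, not from the card): alpha_t = 1/t satisfies both conditions (the sum of alpha_t diverges, the sum of alpha_t^2 is finite), and with that schedule the update V <- V + alpha_t * (x_t - V) is exactly the running average of the samples, so it converges to the true mean.

import numpy as np

rng = np.random.default_rng(0)
true_value, V = 2.0, 0.0
for t in range(1, 100_001):
    x = true_value + rng.normal()   # noisy sample of the true value
    V += (1.0 / t) * (x - V)        # stochastic-approximation update with alpha_t = 1/t
print(V)                            # approaches 2.0 as t grows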

6
Q

What is the TD(lambda) update rule?

A

Episode T:
At the start of the episode, for all s: e(s) = 0 and V_t(s) = V_{t-1}(s)
After each step s_{t-1} -> r_t -> s_t (step t):
  e(s_{t-1}) = e(s_{t-1}) + 1
  For all s:
    V_t(s) = V_t(s) + alpha_t * (r_t + gamma * V_{t-1}(s_t) - V_{t-1}(s_{t-1})) * e(s)
    e(s) = lambda * gamma * e(s)
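
A minimal sketch (illustrative, not from the card) of TD(lambda) with accumulating eligibility traces, applied online over one episode; unlike the pseudocode above it updates V in place rather than keeping separate V_t and V_{t-1} tables.

import numpy as np

def td_lambda_episode(V, episode, alpha=0.1, gamma=0.95, lam=0.7):
    # V: array of state values; episode: list of (s_prev, r, s_next) steps
    e = np.zeros_like(V)                           # eligibility traces, reset each episode
    for s_prev, r, s_next in episode:
        e[s_prev] += 1.0                           # accumulate trace for the state just left
        delta = r + gamma * V[s_next] - V[s_prev]  # one-step TD error
        V = V + alpha * delta * e                  # credit all recently visited states
        e *= lam * gamma                           # decay traces
    return V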

7
Q

What are some issues with TD(1)?

A

TD(1) does not make efficient use of all the data that is available (it relies only on outcomes of whole trajectories), so it can hold on to bad estimates for a long time if certain paths are rarely seen. It is inefficient and has high variance.

8
Q

Which TD(lambda) version is equivalent to maximum likelihood? Under which criteria?

A

TD(0): when a finite data set is presented repeatedly (infinitely often), TD(0) converges to the maximum-likelihood estimate.

9
Q

Empirically, how does TD(lambda) tend to behave when varying lambda?

A

Error typically decreases as lambda increases from 0, reaches a minimum at an intermediate value of lambda, and then increases again as lambda approaches 1.

10
Q

What are some issues with TD(0)?

A

We get a biased estimate because the update bootstraps off one-step estimates (the current, possibly wrong, value of the next state) rather than full returns.
