8.1 Temporal Difference Learning Flashcards

1
Q

Credit assignment problem

A

Example: Rat in a maze with a single branch point
One decision, one step to the outcome, one decision–outcome pair. Learning depends on knowing the outcome.

The Rescorla-Wagner (RW) model can handle this case, but in that model learning only happens once the outcome of the decision is known.

However, most mazes are far more complex and involve multiple decisions. The rat only finds out whether it was correct at the end. How do we learn from the early choices?

Other examples:

  • A game of chess: many moves, but the winner is only decided at the end
  • Predicting next week's weather: we continually update the prediction as the sky changes
  • Almost anything we do in real life…

Summary: How do we evaluate the value of intermediate decisions when we only find out at the end whether they were beneficial?

2
Q

Temporal difference learning

A

Equation:
ΔV_t = α * (R_(t+1) + γ * V_(t+1) − V_t)

Similar to the RW model: there is a change in value, a learning rate, and an expectation, but indexed by time t rather than trial n.

DIFFERENCE:
RW - surprise is the difference between what we observe and what we expected
TDL - surprise combines the reward obtained at t+1 with the value expected at t+1. We learn from the changes in our expectations

Example:
On Thursday, we have an estimate of Friday's weather (V_t).
On Friday, we observe the actual weather (R_(t+1)) and form an updated expectation (V_(t+1)).
TDL retrospectively changes our Thursday estimate via ΔV_t, improving our model of how the weather works.

ΔV_t = change in estimated value at time t
α = learning rate
R_(t+1) = reward obtained at time t+1
γ = discount factor (how strongly future value is weighted)
V_(t+1) = estimated value at time t+1
V_t = estimated value at time t
Summary:
RW vs TDL (TDL is closer to real life)
- One step vs Multiple steps
- Discrete vs continuous
- Per trial vs temporal
- Outcome vs changes in expectation
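
A minimal sketch of this update rule in Python (the function name, learning rate, discount factor, and the weather numbers are illustrative, not from the lecture):

```python
def td_update(V_t, V_next, reward_next, alpha=0.1, gamma=0.9):
    """One temporal-difference update: ΔV_t = α * (R_(t+1) + γ * V_(t+1) - V_t)."""
    prediction_error = reward_next + gamma * V_next - V_t  # surprise = change in expectation
    return V_t + alpha * prediction_error                  # updated estimate of value at time t

# Weather example from this card (made-up numbers): on Thursday we rated Friday at 0.2;
# on Friday the sky already looks promising (V_(t+1) = 0.8) although no "reward" has
# arrived yet (R_(t+1) = 0), so we retrospectively raise Thursday's estimate.
print(td_update(V_t=0.2, V_next=0.8, reward_next=0.0))  # ≈ 0.252
```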
3
Q

Dopamine codes prediction errors

A

Dopamine neurons mimic the error function from TDL

Experiment: micro-electrode recordings from dopamine neurons in monkeys

Tone -2s-> Juice
Response depends on the phase of learning:
  • Before learning
    Firing of dopamine neurons increases just after the reward (juice)
    _____/\_
  • After learning
    The signal migrates to just after the tone, and there is no response after the juice. It is as if the "satisfaction" of the juice is received at the cue, not at the actual reward
    __/\_____
  • After learning, with no juice
    Early response to the tone, then a negative response (dip below baseline) due to the absence of the expected juice
    __/\___\/_
    The cue becomes more important than the reward; expectation is what matters here.

Circling from value to learning
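
A rough simulation of this pattern, assuming a simple chain of time steps with a tone and a later juice delivery (the step times, learning rate, and the assumption that nothing before the tone carries value are all illustrative, not taken from the experiment). The TD error plays the role of the dopamine signal: before learning it peaks at the juice, after learning it peaks at the tone, and omitting the juice produces a dip at the expected delivery time.

```python
import numpy as np

T, cue, reward_t = 10, 3, 7        # illustrative: tone at step 3, juice at step 7
alpha, gamma = 0.2, 1.0
V = np.zeros(T + 1)                # value estimate per time step

def run_trial(V, juice=True, learn=True):
    """One trial; returns the TD error at each step (the 'dopamine-like' signal)."""
    r = np.zeros(T + 1)
    if juice:
        r[reward_t] = 1.0
    delta = np.zeros(T)
    for t in range(T):
        delta[t] = r[t + 1] + gamma * V[t + 1] - V[t]   # prediction error
        if learn and t >= cue:     # assumption: only states from the tone onward carry value
            V[t] += alpha * delta[t]
    return delta

print("before learning:", np.round(run_trial(V, learn=False), 2))   # burst at the juice
for _ in range(300):                                                 # train on tone -> juice trials
    run_trial(V)
print("after learning: ", np.round(run_trial(V, learn=False), 2))    # burst has migrated to the tone
print("juice omitted:  ", np.round(run_trial(V, juice=False, learn=False), 2))  # dip at juice time
```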

4
Q

Reward and probability coding

A

Further studies support this link between dopamine neurons and prediction errors

Experiment: micro-electrode recordings from dopamine neurons in monkeys

Image -2s-> Juice, but the probability and the amount of juice depend on which image is shown: both reward magnitude and likelihood are manipulated

Conclusion: The response to the cue is proportional to reward magnitude times probability (the expected value)
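
A tiny illustration of that conclusion (the cue conditions, magnitudes, and probabilities below are made up, not the experiment's values): the anticipatory response is modelled as scaling with magnitude × probability, i.e. the expected value of the cue.

```python
# Hypothetical cue conditions: (juice magnitude, probability of juice)
conditions = {
    "small, unlikely": (0.1, 0.25),
    "small, certain":  (0.1, 1.00),
    "large, unlikely": (0.4, 0.25),
    "large, certain":  (0.4, 1.00),
}

for name, (magnitude, prob) in conditions.items():
    expected_value = magnitude * prob  # cue response scales with this product
    print(f"{name}: expected value = {expected_value:.2f}")
```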

5
Q

Delay coding

A

Dopamine neurons also code for delay

Experiment with rats:
Odours -> delayed rewards; each odour is only somewhat predictive of the reward and of the delay

Multiple odours -variable delay-> Reward (multiple possible delays)

Conclusion: The response to the anticipated reward (at the odour) is reduced by temporal discounting as the delay grows
The response at reward delivery is independent of the delay
Review graph

Why is there still a response on delivery even though there is a delay?
Because the odour is only somewhat predictive of the reward
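
A small sketch of both conclusions, with made-up numbers: the cue (odour) response is reduced by a discount factor raised to the delay, while the response at delivery is modelled here as the part of the reward the odour did not predict, so it carries no delay term (one simple way to capture "independent of the delay").

```python
gamma = 0.9       # per-second discount factor (illustrative)
p_reward = 0.75   # the odour is only somewhat predictive of the juice (assumption)
magnitude = 1.0

for delay in (0.5, 1, 2, 4, 8):  # seconds between odour and reward
    cue_response = p_reward * (gamma ** delay) * magnitude  # temporally discounted expectation
    delivery_response = (1 - p_reward) * magnitude          # leftover surprise; no delay term
    print(f"delay {delay:>4}s: cue {cue_response:.2f}, delivery {delivery_response:.2f}")
```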

6
Q

Intrinsic reward for learning

A

Learning is satisfying

Computer simulation of learning and evolution:

  • Rewards linked to fitness vs
  • Rewards for learning

The simulation showed that agents evolved reward functions that were not directly related to fitness: individually, agents adapted to their environment by learning according to their reward function.
At the level of the population, the reward functions evolved to value learning itself.

Most likely explanation: rewarding learning maximises long-term evolutionary fitness in changing environments

There is a parallel here with the distinction between RW and TDL. What did we say we learn from in each case?

RW: outcomes -> survival
TDL: changes in expectation -> learning

7
Q

Simulation of task choice

A

What does an intrinsic reward for learning mean?
The amount of improvement in the prediction error, i.e. the decrease in surprise = ΔV

4 types of activity (described by their prediction-error curves over time):
1 Too easy: error stays low and flat
2 Too difficult: error stays high and flat
3 Initial task: error drops along a steep S-curve
4 Next task: error drops along a shallower S-curve, later on

Review graph on errors in prediction

Review % of time spent in each activity graph
1 Too easy: Low time
2 Too difficult:  Low time
3 Initial task: High time, then low
4 Next task: Low time, then high

We spend time on an activity while learning (the drop in prediction error) is fast, but as it slows down we move on to something else (sketched below)

This can be challenging when we want to master something
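
A toy sketch of this choice rule, under the assumption (mine, not from the lecture) that each activity's prediction-error curve is given and the intrinsic reward is the recent drop in that error (the learning progress); the agent spends time on whichever activity currently offers the most progress, reproducing the early/late pattern above.

```python
import numpy as np

steps = np.arange(200)

def s_curve(t, onset, scale):
    """Prediction error that falls along an S-curve once learning takes off."""
    return 1.0 / (1.0 + np.exp((t - onset) / scale))

errors = {
    "too easy":      np.full_like(steps, 0.05, dtype=float),  # always low: nothing left to learn
    "too difficult": np.full_like(steps, 0.95, dtype=float),  # always high: no progress possible
    "initial task":  s_curve(steps, onset=50, scale=10),      # steep drop early on
    "next task":     s_curve(steps, onset=140, scale=15),     # shallower drop, later
}

window = 10
for t in (30, 60, 100, 160):
    # intrinsic reward = recent decrease in prediction error (learning progress)
    progress = {name: e[t - window] - e[t] for name, e in errors.items()}
    chosen = max(progress, key=progress.get)
    print(f"t={t:3d}: spend time on '{chosen}'")
```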
