Lecture 22 Flashcards

1
Q

How do reinforcement learning problems mainly differ from sequential decision problems?

A

There is no given Markovian transition model (or reward function): the agent has to deduce the optimal policy from observed rewards, via trial and error.

2
Q

What two competing needs must a reinforcement learning agent balance when interacting with its environment? How is this done?

A

The agent needs to exploit the best-known actions (as dictated by its policy) but also needs to explore to find other possible strategies (i.e., not always take the best-known action).
The optimal choice has the largest chance of being picked, but other actions can still occur, with more randomness early in training (see the sketch below).
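A minimal sketch in Python; the card doesn't name a selection scheme, so this assumes ε-greedy with a decaying ε, one common way to get more randomness early in training (all values here are hypothetical):

import random

def epsilon_greedy(q_values, epsilon):
    # Explore with probability epsilon, otherwise exploit the best-known action.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Randomness is highest early in training and decays toward a floor.
epsilon, decay, min_epsilon = 1.0, 0.995, 0.05  # hypothetical schedule
for episode in range(1000):
    # ... select actions with epsilon_greedy(Q[state], epsilon) during the episode ...
    epsilon = max(min_epsilon, epsilon * decay)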

3
Q

What do we compute instead of state utility for reinforcement learning?

A

The action-utility representation, Q(s, a): for each state there is an array of values, one per action, and the policy picks the action with the best value.
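A minimal sketch of that representation as a Python Q-table (the action count is hypothetical):

from collections import defaultdict

n_actions = 4  # hypothetical number of actions

# For each state, an array of values, one per action: Q[s][a].
Q = defaultdict(lambda: [0.0] * n_actions)

def greedy_action(state):
    # The policy picks the action whose value is best in this state.
    return max(range(n_actions), key=lambda a: Q[state][a])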

4
Q

What is temporal difference learning?

A

The Q-value for a state and action is updated toward the reward received in the current state plus the (discounted) maximum Q-value of the state that results from the action taken.
The update must be applied gradually, via a small learning rate, so that the stochastic properties of the environment average out.
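A sketch of the corresponding Q-learning update, assuming the Q-table layout from card 3; the alpha (learning rate) and gamma (discount) values are hypothetical:

def td_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    # Q(s,a) <- Q(s,a) + alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))
    # A small alpha applies the change gradually, so stochastic outcomes
    # in the environment average out over many updates.
    target = reward + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])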

5
Q

What can we do if there are too many states in a reinforcement learning problem?

A

Use a function approximator, such as a neural network, that maps each state to its action values, so we no longer need to store a table entry for every state.
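A minimal sketch of such an approximator, assuming PyTorch and hypothetical layer sizes:

import torch.nn as nn

n_state_features, n_actions = 8, 4  # hypothetical sizes

# Instead of one table row per state, a small network maps a state
# vector to an estimated Q-value for each action.
q_network = nn.Sequential(
    nn.Linear(n_state_features, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),
)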

6
Q

What are on-policy and off-policy learning?

A

Off-policy (e.g., Q-learning): updates use the value of the best action from the next state, regardless of which action the agent actually takes next.
On-policy (e.g., SARSA): updates use the value of the action actually taken, rather than the best possible one.
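A sketch contrasting the two update targets, assuming the Q-table from card 3 and the hypothetical alpha/gamma values from card 4:

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Off-policy: bootstrap from the best action available in s_next.
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy: bootstrap from the action the agent actually took in s_next.
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])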
