Reinforcement Learning all 6 exercise videos Flashcards

1
Q

What is an interaction loop?

A

An interaction loop is the repeated cycle in which a learner acts on its environment and observes the consequences.
Humans and animals learn from interaction with their environment, without labeled examples; learning is goal-directed.

2
Q

Two types of associative learning in psychology

A
  • classical conditioning
  • operant conditioning

3
Q

classical conditioning

A
  • The subject learns the relationship between an initially neutral conditioned stimulus (CS) and an unconditioned stimulus (US) that reflexively produces a response, so that the CS comes to elicit a conditioned response (CR).
CS = stimulus that was once neutral but now leads to a response
US = stimulus that automatically (reflexively) produces a response
CR = learned response
4
Q

operant conditioning

A
  • The subject learns the relationship between a stimulus and its own behavior.
  • The stimulus is only presented in response to an action and serves as a reinforcer that increases or decreases the probability of that action.
5
Q

reinforcement learning cycle

A
  1. The environment is in state S_t.
  2. The agent takes action A_t.
  3. The environment is influenced and transitions to state S_{t+1}.
  4. The agent receives reward R_{t+1}.
  5. Repeat (see the sketch below).
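A minimal sketch of this loop in Python, using a made-up toy environment and a random agent (all names here are illustrative, not from the videos):

```python
import random

class ToyEnv:
    """Toy corridor: states 0..4; the episode ends with reward +1 in state 4."""
    def reset(self):
        self.s = 0
        return self.s                            # S_0

    def step(self, action):                      # action: -1 (left) or +1 (right)
        self.s = max(0, min(4, self.s + action))
        done = self.s == 4
        return self.s, (1.0 if done else 0.0), done   # S_{t+1}, R_{t+1}, terminal?

class RandomAgent:
    def act(self, state):
        return random.choice([-1, 1])            # placeholder policy

env, agent = ToyEnv(), RandomAgent()
state, done = env.reset(), False
while not done:                                  # the interaction loop
    action = agent.act(state)                    # 2. agent takes action A_t
    state, reward, done = env.step(action)       # 3./4. environment returns S_{t+1}, R_{t+1}
```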
6
Q

reward hypothesis

A

Goals and purposes can be thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called the return).
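In symbols (standard notation; the discount factor \gamma \in [0, 1] is an assumption not stated on this card), the return from time t is

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

and the agent's objective is to maximize \mathbb{E}[G_t].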

7
Q

Markov process / Markov Decision Process (MDP)

A
  • A sequence of states is a Markov process if the probability of the next state depends only on the current state, not on the rest of the history.
  • An MDP adds actions, which steer the state transitions in a desired direction.
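Stated formally (standard notation, assumed rather than quoted from the videos): for a Markov process,

P(S_{t+1} = s' \mid S_t, S_{t-1}, \dots, S_0) = P(S_{t+1} = s' \mid S_t)

and in an MDP the transition probability additionally conditions on the chosen action, P(S_{t+1} = s' \mid S_t = s, A_t = a).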
8
Q

state-value function

A

the expected return when starting from a given state and following a specific policy
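In standard notation (assumed):

v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]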

9
Q

action-value function

A

the expected return after choosing a given action in a particular state and following a specific policy thereafter
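In standard notation (assumed):

q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]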

10
Q

Generalized Policy Iteration (GPI)

A
  • The value function depends on the policy, and the policy depends on the value function.
  • We therefore iteratively alternate policy evaluation and policy improvement.
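Schematically (the standard GPI picture, assumed):

\pi_0 \xrightarrow{\text{evaluate}} v_{\pi_0} \xrightarrow{\text{improve}} \pi_1 \xrightarrow{\text{evaluate}} v_{\pi_1} \xrightarrow{\text{improve}} \cdots \rightarrow \pi_*, v_*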
11
Q

What is policy evaluation called in classical conditioning?

A

prediction

12
Q

What is policy improvement called in operant conditioning?

A

control

13
Q

DP (dynamic programming) prediction

A
  • bootstrapping: propagating value between consecutive states by iteratively exploiting the recursive relationship that is formulated by the Bellman equation.
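A minimal sketch of DP prediction (iterative policy evaluation), assuming a tabular MDP stored as `P[s][a] = [(prob, next_state, reward), ...]` with every reachable state present as a key, and a stochastic policy `policy[s][a]`; this data layout is my assumption, not from the videos:

```python
def policy_evaluation(P, policy, gamma=0.9, theta=1e-6):
    """Repeat the Bellman expectation backup until values stop changing."""
    V = {s: 0.0 for s in P}                     # initial value guess
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(
                policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )                                   # bootstrap from successor values
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:                       # converged to tolerance theta
            return V
```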
14
Q

value iteration

A
  • a variant of policy iteration that does not use exhaustive policy evaluation but only a single evaluation sweep per iteration
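A minimal sketch under the same assumed `P[s][a]` transition layout as in the DP prediction card; value iteration replaces the expectation over the policy with a max over actions and needs only one sweep per iteration:

```python
def value_iteration(P, gamma=0.9, theta=1e-6):
    """Single-sweep Bellman optimality backups instead of exhaustive evaluation."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )                                   # greedy one-step lookahead
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```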
15
Q

Monte Carlo Prediction (MC)

A
  • does not require knowledge of the MDP, as it learns from sampled state trajectories
  • MC methods are an approach to learning without prior knowledge of the environment's dynamics
  • the return is calculated for all states in each sampled trajectory; experienced returns are averaged
  • goal: estimate state-action values
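A minimal first-visit MC prediction sketch; the episode format (a list of (state, reward) pairs, with the reward being the one received after leaving that state) is an assumption for illustration:

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=0.9):
    """Average the experienced first-visit returns per state; no MDP model needed."""
    returns = defaultdict(list)                 # observed returns per state
    for episode in episodes:
        G = 0.0
        first_visit_return = {}
        for state, reward in reversed(episode): # accumulate return backwards
            G = reward + gamma * G
            first_visit_return[state] = G       # earliest visit overwrites later ones
        for state, G_first in first_visit_return.items():
            returns[state].append(G_first)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```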
16
Q

Temporal Difference (TD)

A
  • a mixture of DP and MC that samples (like MC) and bootstraps (like DP)
  • the Bellman equation is employed by iteratively updating the value after every time step
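A minimal TD(0) update sketch (the step size `alpha` and discount `gamma` are illustrative hyperparameters):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference step: sample like MC, bootstrap like DP."""
    td_target = r + gamma * V[s_next]           # bootstrapped one-step target
    V[s] += alpha * (td_target - V[s])          # move V(s) toward the target
    return V
```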

17
Q

The exploration dilemma

A

Learning action values is conditional on subsequent optimal behavior, but behaving non-optimally is necessary in order to explore all actions and find the optimal ones.

18
Q

How can the agent learn about the optimal policy while behaving according to the exploratory policy?

A

Through off-policy learning

19
Q

target policy:

A

the policy whose value function the agent estimates

20
Q

behavior policy:

A

the policy the agent actually follows: it samples actions and interacts with the environment

21
Q

on-policy

A

behavior policy = target policy

22
Q

off-policy

A

behavior policy != target policy

23
Q

What does off-policy learning help with?

A
  • dealing with the exploration problem
24
Q

When is off-policy learning valid?

A
  • for valid off-policy learning, the chosen behavior policy must cover the target policy
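Formally (standard notation, assumed): with target policy \pi and behavior policy b, coverage means

\pi(a \mid s) > 0 \implies b(a \mid s) > 0 \quad \text{for all } s, a

i.e., every action the target policy might take must occasionally be tried by the behavior policy.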
25
Q

What unifies TD and MC?

A

n-step bootstrapping: with n = 1 it reduces to TD; in the limit n → ∞ it becomes MC
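The n-step return behind this unification (standard form, assumed):

G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})

With n = 1 this is the TD(0) target; when n reaches the end of the episode, it is the full MC return.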
26
Q

Key ideas the methods have in common

A
  1. estimating value functions
  2. backing up value functions
  3. generalized policy iteration
27
Q

Which methods have sample updates?

A

MC and TD
28
Q

Which methods use bootstrapping?

A

DP and TD
29
Q

Where is the depth of update highest?

A

MC
30
Q

Where is the width of update largest?

A

DP (dynamic programming)
31
Q

Which method has low depth and low width of update?

A

TD (temporal difference)