Reinforcement Learning all 6 exercise videos Flashcards

1
Q

What is an interaction loop?

A

Humans and animals learn from interaction with their environment, without being given explicit examples.
This learning is goal-directed.

2
Q

Two types of learning in psychology (associative learning)

A
  • classical conditioning
  • operant conditioning

3
Q

classical conditioning

A
  • the subject learns the relationship between an initially neutral conditioned stimulus (CS) and an unconditioned stimulus (US) that reflexively produces a response; after learning, the CS alone elicits the conditioned response (CR)
CS = stimulus that was once neutral but now leads to the response
US = stimulus that automatically produces a response
CR = the learned response
4
Q

operant conditioning

A
  • the subject learns the relationship between a stimulus and its own behavior
  • the stimulus is only presented in response to an action and serves as a reinforcer that increases or decreases the probability of that action
5
Q

reinforcement learning cycle

A
  1. The environment is in state S_t.
  2. The agent takes action A_t.
  3. The environment is influenced and transitions to state S_{t+1}.
  4. The agent receives reward R_{t+1}.
  Repeat.
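
A minimal Python sketch of this cycle (my own toy illustration, not from the videos; ToyEnv and the random policy are invented for the example):

```python
import random

class ToyEnv:
    """Toy chain environment: states 0..4, reward 1 for reaching state 4."""
    def reset(self):
        self.s = 2
        return self.s

    def step(self, a):                         # a is -1 (left) or +1 (right)
        self.s = max(0, min(4, self.s + a))
        done = (self.s == 4)
        return self.s, (1.0 if done else 0.0), done

env = ToyEnv()
state = env.reset()                            # 1. environment is in state S_t
done = False
while not done:
    action = random.choice([-1, 1])            # 2. agent takes action A_t (random policy)
    state, reward, done = env.step(action)     # 3. environment transitions to S_{t+1}
                                               # 4. agent receives reward R_{t+1}; repeat
```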
6
Q

reward hypothesis

A

Goals and purposes can be thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (the reward); this cumulative sum is called the return.

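Written out (standard discounted form; the discount factor γ ∈ [0, 1] is my addition, the card does not mention it):

```latex
G_t \;=\; R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \;=\; \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
```
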
7
Q

Markov process / Markov decision process (MDP)

A
  • a sequence of states is a Markov process if the probability of the next state depends only on the current state, not on earlier ones
  • MDP: a Markov process extended with actions that steer the states in a desired direction
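
The Markov property in formula form (standard notation):

```latex
P(S_{t+1} \mid S_t, A_t) \;=\; P(S_{t+1} \mid S_1, A_1, \dots, S_t, A_t)
```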
8
Q

state-value function

A

the expected return when, starting from a given state, a specific policy is followed
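
In standard notation, conditioning on the start state:

```latex
v_\pi(s) \;=\; \mathbb{E}_\pi\left[\, G_t \mid S_t = s \,\right]
```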

9
Q

action-value function

A

expected return when a specific policy is followed after choosing an action in a particular state
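
In standard notation, now conditioning on state and action:

```latex
q_\pi(s, a) \;=\; \mathbb{E}_\pi\left[\, G_t \mid S_t = s,\; A_t = a \,\right]
```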

10
Q

Generalized Policy Iteration (GPI)

A
  • the value function depends on the policy, and the policy depends on the value function
  • we therefore iteratively alternate between policy evaluation and policy improvement
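
The usual way to picture this alternation:

```latex
\pi_0 \xrightarrow{\text{evaluate}} v_{\pi_0} \xrightarrow{\text{improve}} \pi_1 \xrightarrow{\text{evaluate}} v_{\pi_1} \xrightarrow{\text{improve}} \cdots \longrightarrow \pi_*
```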
11
Q

What is policy evaluation called in classical conditioning?

A

prediction

12
Q

What is the corresponding learning task called in operant conditioning?

A

control

13
Q

DP (dynamic programming) prediction

A
  • bootstrapping: propagating value between consecutive states by iteratively exploiting the recursive relationship that is formulated by the Bellman equation.
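
The recursive relationship referred to is the Bellman expectation equation (standard form, assuming known dynamics p):

```latex
v_\pi(s) \;=\; \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[\, r + \gamma\, v_\pi(s') \,\right]
```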
14
Q

value iteration

A
  • a variant of policy iteration that does not evaluate each policy exhaustively but uses only a single evaluation sweep per improvement step
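
The single sweep folds evaluation and improvement into one update (standard form):

```latex
v_{k+1}(s) \;=\; \max_a \sum_{s', r} p(s', r \mid s, a) \left[\, r + \gamma\, v_k(s') \,\right]
```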
15
Q

Monte Carlo Prediction (MC)

A
  • does not require knowledge of the MDP, as it learns from sampled state trajectories
  • MC methods are an approach to learning without prior knowledge of the environment dynamics
  • the return is calculated for all states in each sampled trajectory; the experienced returns are averaged
  • goal: estimate state-action values
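
A first-visit MC prediction sketch in Python (my own illustration; the episode format, (state, reward) pairs with γ = 1 by default, is an assumption, not from the videos):

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=1.0):
    """First-visit Monte Carlo: average each state's first-visit returns."""
    returns = defaultdict(list)
    for episode in episodes:                  # episode = [(state, reward), ...]
        g = 0.0
        first_visit = {}
        for state, reward in reversed(episode):
            g = reward + gamma * g            # return from this step onward
            first_visit[state] = g            # walking backwards, the last write
                                              # corresponds to the first visit
        for state, ret in first_visit.items():
            returns[state].append(ret)
    return {s: sum(rs) / len(rs) for s, rs in returns.items()}

# Two toy episodes over states 'a' and 'b':
print(mc_prediction([[('a', 0.0), ('b', 1.0)],
                     [('a', 0.0), ('b', 0.0), ('b', 1.0)]]))
```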
16
Q

Temporal Difference (TD)

A
  • a mixture of DP and MC that both samples and bootstraps
  • the Bellman equation is exploited by iteratively updating values after every time step
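
The resulting TD(0) update after each step (standard form, with step size α):

```latex
V(S_t) \;\leftarrow\; V(S_t) + \alpha \left[\, R_{t+1} + \gamma\, V(S_{t+1}) - V(S_t) \,\right]
```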

17
Q

Dilemma

A

Learning action values is conditional on subsequent optimal behavior, but behaving non-optimally is necessary in order to explore all actions and find the optimal ones.

18
Q

How can the agent learn about the optimal policy while behaving according to the exploratory policy?

A

Through off-policy learning

19
Q

target policy:

A

the policy whose value function the agent estimates

20
Q

behavior policy:

A

the policy according to which the agent samples actions and interacts with the environment

21
Q

on-policy

A

behavior policy = target policy

22
Q

off-policy

A

behavior policy != target policy

23
Q

What does off-policy learning help with?

A
  • dealing with the exploration problem
24
Q

When is off-policy learning valid?

A
  • for valid off-policy learning, the chosen behavior policy must cover the target policy
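
Coverage stated formally (standard condition; π is the target policy, b the behavior policy):

```latex
\pi(a \mid s) > 0 \;\Rightarrow\; b(a \mid s) > 0 \qquad \text{for all } s, a
```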
25
Q

What unifies TD and MC?

A

n-step bootstrapping

n = 1 recovers TD(0); as n → ∞ it approaches MC
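
The n-step return that spans the two extremes (standard definition):

```latex
G_{t:t+n} \;=\; R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})
```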

26
Q

key ideas the methods have in common

A
  1. estimating value functions
  2. backing up value functions
  3. generalized policy iteration (GPI)
27
Q

Which methods have sample updates

A

MC, TD

28
Q

Which methods use bootstrapping?

A

DP, TD

29
Q

Where is the depth of update greatest?

A

MC

30
Q

Where is the width of update largest?

A

DP (dynamic programming)

31
Q

Which method has both low depth and low width of update?

A

temporal difference