Reinforcement Learning all 6 exercise videos Flashcards

1
Q

What is an interaction loop?

A

An interaction loop is the repeated cycle in which a learner acts on its environment and observes the consequences.
Humans and animals learn from interaction with their environment, without labeled examples; learning is goal-directed.

2
Q

Two types of associative learning in psychology

A
  • classical conditioning
  • operant conditioning

3
Q

classical conditioning

A
  • The subject learns the relationship between an initially neutral conditioned stimulus (CS) and an unconditioned stimulus (US) that reflexively produces a response, so that the CS comes to elicit a conditioned response (CR).
CS = stimulus that was once neutral but now leads to a response
US = stimulus that automatically (reflexively) produces a response
CR = learned response
4
Q

operant conditioning

A
  • The subject learns the relationship between a stimulus and its own behavior.
  • The stimulus is only presented in response to an action and serves as a reinforcer that increases or decreases the probability of that action.
5
Q

reinforcement learning cycle

A
  1. The environment is in state S_t.
  2. The agent takes action A_t.
  3. The environment is influenced and transitions to state S_{t+1}.
  4. The agent receives reward R_{t+1}.
  5. Repeat (see the sketch below).
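A minimal sketch of this loop in Python, using a made-up toy environment and a random agent (all names here are illustrative, not from the videos):

```python
import random

class ToyEnv:
    """Toy corridor: states 0..4; the episode ends with reward +1 in state 4."""
    def reset(self):
        self.s = 0
        return self.s                            # S_0

    def step(self, action):                      # action: -1 (left) or +1 (right)
        self.s = max(0, min(4, self.s + action))
        done = self.s == 4
        return self.s, (1.0 if done else 0.0), done   # S_{t+1}, R_{t+1}, terminal?

class RandomAgent:
    def act(self, state):
        return random.choice([-1, 1])            # placeholder policy

env, agent = ToyEnv(), RandomAgent()
state, done = env.reset(), False
while not done:                                  # the interaction loop
    action = agent.act(state)                    # 2. agent takes action A_t
    state, reward, done = env.step(action)       # 3./4. environment returns S_{t+1}, R_{t+1}
```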
6
Q

reward hypothesis

A

Goals and purposes can be thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called the return).
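In symbols (standard notation; the discount factor \gamma \in [0, 1] is an assumption not stated on this card), the return from time t is

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

and the agent's objective is to maximize \mathbb{E}[G_t].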

7
Q

Markov process / Markov Decision Process (MDP)

A
  • A sequence of states is a Markov process if the probability of the next state depends only on the current state, not on the rest of the history.
  • An MDP adds actions, which steer the state transitions in a desired direction.
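Stated formally (standard notation, assumed rather than quoted from the videos): for a Markov process,

P(S_{t+1} = s' \mid S_t, S_{t-1}, \dots, S_0) = P(S_{t+1} = s' \mid S_t)

and in an MDP the transition probability additionally conditions on the chosen action, P(S_{t+1} = s' \mid S_t = s, A_t = a).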
8
Q

state-value function

A

the expected return when starting from a given state and following a specific policy
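In standard notation (assumed):

v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]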

9
Q

action-value function

A

the expected return after choosing a given action in a particular state and following a specific policy thereafter
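In standard notation (assumed):

q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]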

10
Q

Generalized Policy Iteration (GPI)

A
  • The value function depends on the policy, and the policy depends on the value function.
  • We therefore iteratively alternate policy evaluation and policy improvement.
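Schematically (the standard GPI picture, assumed):

\pi_0 \xrightarrow{\text{evaluate}} v_{\pi_0} \xrightarrow{\text{improve}} \pi_1 \xrightarrow{\text{evaluate}} v_{\pi_1} \xrightarrow{\text{improve}} \cdots \rightarrow \pi_*, v_*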
11
Q

What is policy evaluation called in classical conditioning?

A

prediction

12
Q

What is policy improvement called in operant conditioning?

A

control

13
Q

DP (dynamic programming) prediction

A
  • bootstrapping: propagating value between consecutive states by iteratively exploiting the recursive relationship that is formulated by the Bellman equation.
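A minimal sketch of DP prediction (iterative policy evaluation), assuming a tabular MDP stored as `P[s][a] = [(prob, next_state, reward), ...]` with every reachable state present as a key, and a stochastic policy `policy[s][a]`; this data layout is my assumption, not from the videos:

```python
def policy_evaluation(P, policy, gamma=0.9, theta=1e-6):
    """Repeat the Bellman expectation backup until values stop changing."""
    V = {s: 0.0 for s in P}                     # initial value guess
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(
                policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )                                   # bootstrap from successor values
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:                       # converged to tolerance theta
            return V
```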
14
Q

value iteration

A
  • a variant of policy iteration that does not use exhaustive policy evaluation but only a single evaluation sweep per iteration
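A minimal sketch under the same assumed `P[s][a]` transition layout as in the DP prediction card; value iteration replaces the expectation over the policy with a max over actions and needs only one sweep per iteration:

```python
def value_iteration(P, gamma=0.9, theta=1e-6):
    """Single-sweep Bellman optimality backups instead of exhaustive evaluation."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )                                   # greedy one-step lookahead
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```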
15
Q

Monte Carlo Prediction (MC)

A
  • does not require knowledge of the MDP, as it learns from sampled state trajectories
  • MC methods are an approach to learning without prior knowledge of the environment's dynamics
  • the return is calculated for all states in each sampled trajectory; experienced returns are averaged
  • goal: estimate state-action values
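A minimal first-visit MC prediction sketch; the episode format (a list of (state, reward) pairs, with the reward being the one received after leaving that state) is an assumption for illustration:

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=0.9):
    """Average the experienced first-visit returns per state; no MDP model needed."""
    returns = defaultdict(list)                 # observed returns per state
    for episode in episodes:
        G = 0.0
        first_visit_return = {}
        for state, reward in reversed(episode): # accumulate return backwards
            G = reward + gamma * G
            first_visit_return[state] = G       # earliest visit overwrites later ones
        for state, G_first in first_visit_return.items():
            returns[state].append(G_first)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```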
16
Q

Temporal Difference (TD)

A
  • a mixture of DP and MC that samples (like MC) and bootstraps (like DP)
  • the Bellman equation is employed by iteratively updating the value after every time step
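A minimal TD(0) update sketch (the step size `alpha` and discount `gamma` are illustrative hyperparameters):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference step: sample like MC, bootstrap like DP."""
    td_target = r + gamma * V[s_next]           # bootstrapped one-step target
    V[s] += alpha * (td_target - V[s])          # move V(s) toward the target
    return V
```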

17
Q

The exploration dilemma

A

Learning action values is conditional on subsequent optimal behavior, but behaving non-optimally is necessary in order to explore all actions and find the optimal ones.

18
Q

How can the agent learn about the optimal policy while behaving according to the exploratory policy?

A

Through off-policy learning

19
Q

target policy:

A

the policy whose value function the agent estimates

20
Q

behavior policy:

A

the policy the agent actually follows: it samples actions and interacts with the environment

21
Q

on-policy

A

behavior policy = target policy

22
Q

off-policy

A

behavior policy != target policy

23
Q

What does off-policy learning help with?

A
  • dealing with the exploration problem
24
Q

When is off-policy learning valid?

A
  • for valid off-policy learning, the chosen behavior policy must cover the target policy
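Formally (standard notation, assumed): with target policy \pi and behavior policy b, coverage means

\pi(a \mid s) > 0 \implies b(a \mid s) > 0 \quad \text{for all } s, a

i.e., every action the target policy might take must occasionally be tried by the behavior policy.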
25
Q

What unifies TD and MC?

A

n-step bootstrapping: with n = 1 it reduces to TD; in the limit n → ∞ it becomes MC
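The n-step return behind this unification (standard form, assumed):

G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})

With n = 1 this is the TD(0) target; when n reaches the end of the episode, it is the full MC return.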
26
Q

Key ideas the methods have in common

A
  1. estimating value functions
  2. backing up value functions
  3. generalized policy iteration
27
Q

Which methods have sample updates?

A

MC and TD
28
Q

Which methods use bootstrapping?

A

DP and TD
29
Q

Where is the depth of update highest?

A

MC
30
Q

Where is the width of update largest?

A

DP (dynamic programming)
31
Q

Which method has low depth and low width of update?

A

TD (temporal difference)