Reinforcement Learning Flashcards

1
Q

What are the two forms of conditioning in psychology?

A
  • classical conditioning: the subject learns the relationship between an initially neutral conditioned stimulus (CS) and an unconditioned stimulus (US) that reflexively produces a behavioral response. After the CS and US have been presented together repeatedly, the CS alone eventually elicits that response (the conditioned response, CR), even in the absence of the US.
  • operant conditioning: the subject learns the relationship between its own behavior and a stimulus. The stimulus is presented only in response to a certain action of the subject and serves as a reinforcer that increases or decreases the probability of that action.
2
Q

What is a Markov process?

A

A sequence of states is a Markov process if it satisfies the Markov property (memorylessness): the probability of the next state depends solely on the current state, not on the earlier history:

P[Sₜ₊₁ | Sₜ] = P[Sₜ₊₁ | S₁, …, Sₜ]

3
Q

What is a Markov reward process?

A

A Markov reward process extends a Markov process by adding a reward Rₜ to each state transition. An additional quantity, the return Gₜ, records the cumulative (discounted) reward collected from time t onwards.
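
Written out explicitly (assuming the usual discounted formulation with a discount factor γ ∈ [0, 1], consistent with the value-function cards below):

Gₜ := Rₜ₊₁ + γ Rₜ₊₂ + γ² Rₜ₊₃ + …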

4
Q

What are Markov decision processes (MDP)?

A

In MDPs, actions steer the states (and rewards) in a desired direction. An MDP consists of:

  • a set of environment and agent states, S;
  • a set of actions, A;
  • the probability that a transition at time t from state s under action a leads to s’;
  • the immediate reward received after the transition from s to s’ under action a.

(→ rational decisions over time)

In reinforcement learning, the environment is typically modeled as a Markov decision process.
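
A minimal Python sketch of how such an MDP tuple could be represented; the two states, actions, probabilities and rewards are purely illustrative:

    # Illustrative two-state MDP stored as nested dicts.
    # P[s][a] is a list of (next_state, probability, reward) triples.
    states  = ["s0", "s1"]
    actions = ["stay", "move"]
    P = {
        "s0": {"stay": [("s0", 1.0, 0.0)],
               "move": [("s1", 0.9, 1.0), ("s0", 0.1, 0.0)]},
        "s1": {"stay": [("s1", 1.0, 0.0)],
               "move": [("s0", 1.0, 0.5)]},
    }

The policy-iteration and value-iteration sketches in the later cards assume this nested-dict model.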

5
Q

What is reinforcement learning?

A

Reinforcement learning (RL) is concerned with how agents ought to take actions in an environment in order to maximize some notion of cumulative reward.

6
Q

What is the state-value function?

A

The state-value function is the expected return Gₜ when a specific policy π is followed after visiting a particular state s:

v_π(s) := E_π[Gₜ | Sₜ = s]

7
Q

What is the action value function q?

A

The action-value function q_π(s, a) is the expected return Gₜ when a specific policy π is followed after choosing action a in a particular state s:

q_π(s, a) = E_π[Gₜ | Sₜ = s, Aₜ = a]
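
The two value functions are related: the state value is the action value averaged over the policy's action choices (assuming the definitions above):

v_π(s) = Σₐ π(a | s) q_π(s, a)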

8
Q

What is Bellman’s principle of optimality?

A

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

“Any part of an optimal path is itself optimal”

Bellman equation: a recursive relationship between the values of consecutive states
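
For the state-value function from card 6, this recursive relationship reads (assuming the discounted return defined in card 3):

v_π(s) = E_π[Rₜ₊₁ + γ v_π(Sₜ₊₁) | Sₜ = s]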

9
Q

What is Generalized Policy Iteration (GPI)?

A

Generalized Policy Iteration is any method that alternates between policy evaluation and policy improvement (see the sketch after the list below).

  • Policy evaluation alone is called prediction (analogous to classical conditioning)
  • Policy evaluation combined with policy improvement is called control (analogous to operant conditioning)
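
A minimal Python skeleton of the GPI idea; evaluate() and improve() are hypothetical helpers standing in for any concrete evaluation/improvement method:

    # Generic GPI loop (sketch). `evaluate` and `improve` are hypothetical callables:
    # evaluate(policy, V) returns an updated value estimate for the current policy,
    # improve(policy, V) returns a policy that is greedy with respect to V.
    def generalized_policy_iteration(policy, V, evaluate, improve, n_rounds=100):
        for _ in range(n_rounds):
            V = evaluate(policy, V)       # prediction
            policy = improve(policy, V)   # improvement -> together: control
        return policy, V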
10
Q

What is bootstrapping?

A

Bootstrapping is the propagation of values between consecutive states: a state’s value estimate is updated from the current estimates of its successor states, iteratively exploiting the recursive relationship formulated by the Bellman equation.
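
A single bootstrapped backup therefore uses the current estimate vₖ of the successor state instead of its true value (iterative policy evaluation step):

vₖ₊₁(s) ← E_π[Rₜ₊₁ + γ vₖ(Sₜ₊₁) | Sₜ = s]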

11
Q

What is Policy Iteration?

A

Policy iteration alternates between an exhaustive policy evaluation step (run until the values converge) and a subsequent greedy policy improvement step.

π₀ —eval—> v_π₀ —impr—> π₁ —eval—> v_π₁ —impr—> …
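
A minimal Python sketch, assuming the nested-dict transition model P[s][a] = [(next_state, probability, reward), …] from the MDP card:

    def policy_iteration(states, actions, P, gamma=0.9, theta=1e-8):
        V = {s: 0.0 for s in states}
        pi = {s: actions[0] for s in states}          # arbitrary initial policy
        while True:
            # 1) exhaustive policy evaluation: sweep until the values stop changing
            while True:
                delta = 0.0
                for s in states:
                    v_new = sum(p * (r + gamma * V[s2]) for s2, p, r in P[s][pi[s]])
                    delta, V[s] = max(delta, abs(v_new - V[s])), v_new
                if delta < theta:
                    break
            # 2) greedy policy improvement
            stable = True
            for s in states:
                best = max(actions,
                           key=lambda a: sum(p * (r + gamma * V[s2]) for s2, p, r in P[s][a]))
                if best != pi[s]:
                    pi[s], stable = best, False
            if stable:
                return pi, V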

12
Q

How does Value Iteration work?

A

In value iteration, the values are not computed by exhaustive policy evaluation; instead, each sweep performs a single greedy backup per state, combining evaluation and improvement in one step.
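
A minimal Python sketch, again assuming the nested-dict model P[s][a] from the MDP card:

    def value_iteration(states, actions, P, gamma=0.9, theta=1e-8):
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                # single greedy backup: maximize over actions instead of evaluating a fixed policy
                v_new = max(sum(p * (r + gamma * V[s2]) for s2, p, r in P[s][a]) for a in actions)
                delta, V[s] = max(delta, abs(v_new - V[s])), v_new
            if delta < theta:
                break
        # extract the greedy policy once the values have converged
        return {s: max(actions,
                       key=lambda a: sum(p * (r + gamma * V[s2]) for s2, p, r in P[s][a]))
                for s in states}, V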

13
Q

What are Monte-Carlo-based approaches?

A

Monte-Carlo-based approaches don’t require knowledge of the MDP, as they learn from complete sampled episodes (state trajectories).

(In practical cases, the MDP is usually unknown)
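
A minimal first-visit Monte-Carlo prediction sketch; `episodes` is assumed to be a list of complete trajectories [(state, reward_received_after_leaving_state), …] generated by following the policy π:

    from collections import defaultdict

    def mc_prediction(episodes, gamma=0.9):
        returns = defaultdict(list)
        for episode in episodes:
            G = 0.0
            # walk the episode backwards, accumulating the discounted return G
            for t in range(len(episode) - 1, -1, -1):
                s, r = episode[t]
                G = r + gamma * G
                if s not in [s2 for s2, _ in episode[:t]]:    # first visit of s in this episode
                    returns[s].append(G)
        # estimate v_pi(s) as the average of the sampled returns
        return {s: sum(gs) / len(gs) for s, gs in returns.items()}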

14
Q

What is Temporal Difference (TD)?

A

Temporal-Difference (TD) learning is a mixture of dynamic programming (DP) and Monte Carlo (MC): like MC it learns from samples, and like DP it bootstraps.

In TD prediction, the Bellman equation is exploited by updating the value estimate after every single time step, using the current estimate of the successor state (see the sketch after the list below).

  • SARSA
  • Q-learning
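
A minimal TD(0) prediction update sketch (the function name and the dictionary-based value table are illustrative):

    def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
        # sample the reward r, bootstrap from the current estimate of the successor state
        td_target = r + gamma * V.get(s_next, 0.0)
        td_error = td_target - V.get(s, 0.0)
        V[s] = V.get(s, 0.0) + alpha * td_error
        return V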
15
Q

On-policy vs off-policy

A

On-policy methods evaluate and improve the same policy that is used to select the actions.

Off-policy methods evaluate a target policy that differs from the behavior policy actually used to select the actions.

16
Q

What is SARSA?

A

State–action–reward–state–action (SARSA) is an on-policy algorithm for learning an MDP policy.

(TD control)
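
A minimal sketch of the SARSA update (dictionary-based Q-table and names are illustrative); the TD target uses the next action a_next actually selected by the behavior policy, which is what makes it on-policy:

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
        td_target = r + gamma * Q.get((s_next, a_next), 0.0)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))
        return Q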

17
Q

What is Q-learning?

A

Q-learning is a model-free, off-policy TD control algorithm: it learns the action values of the greedy (optimal) policy while the agent may follow a different behavior policy.
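
A minimal sketch of the Q-learning update (dictionary-based Q-table and names are illustrative); the TD target maximizes over the next actions regardless of which action the behavior policy will actually take, hence off-policy:

    def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
        best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
        return Q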