Reinforcement Learning Flashcards

(25 cards)

1
Q

What is the main goal of reinforcement learning (RL)?

A

To learn a policy that maximises cumulative reward through interaction.

2
Q

What are the core components of an RL system?

A

Agent, environment, state, action, reward.

3
Q

Is reinforcement learning supervised or unsupervised?

A

Neither; it’s a separate paradigm focused on learning from interaction.

4
Q

What is a policy in RL?

A

A strategy mapping states to actions to maximise expected return.

5
Q

What does the reward signal in RL do?

A

Provides feedback to guide the agent’s learning.

6
Q

What does the discount factor γ represent?

A

How much future rewards are valued compared to immediate ones.

7
Q

What is the agent-environment loop in RL?

A

Agent takes action → receives reward and new state → repeats.

8
Q

What is the formula for cumulative discounted return?

A

Gₜ = Rₜ₊₁ + γRₜ₊₂ + γ²Rₜ₊₃ + …
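As a quick illustration of this formula (not part of the original deck), here is a minimal Python sketch that computes the discounted return for a hypothetical reward sequence; the function name and example values are assumptions for illustration only:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...

    Iterating backwards lets each step fold in one more discount factor.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: rewards of 1 at each of three steps with gamma = 0.5
# gives 1 + 0.5 + 0.25 = 1.75.
print(discounted_return([1, 1, 1], gamma=0.5))  # 1.75
```

Note how a smaller γ makes the agent more short-sighted: with γ = 0, only the immediate reward counts.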

9
Q

What does Q-learning aim to learn?

A

The optimal action-value function Q(s, a).

10
Q

What is the Q-learning update rule?

A

Q(s, a) ← (1−α) Q(s, a) + α [r + γ maxₐ′ Q(s′, a′)].
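A minimal sketch of one tabular Q-learning step (not from the deck; state/action names and the dict-of-dicts Q-table layout are illustrative assumptions):

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning step:
    Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma * max_a' Q(s',a')).
    """
    best_next = max(Q[s_next].values()) if Q[s_next] else 0.0
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * best_next)
    return Q[s][a]

# Toy Q-table: two states, two actions, all estimates start at zero.
Q = {s: {a: 0.0 for a in ("left", "right")} for s in ("s0", "s1")}

# After one reward of 1.0: 0.9*0 + 0.1*(1 + 0.9*0) = 0.1.
q_update(Q, "s0", "right", r=1.0, s_next="s1")
```

The `max` over next-state actions is what makes Q-learning off-policy: it bootstraps from the greedy action even if the agent actually explored.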

11
Q

What is the purpose of the Q-table?

A

To store estimates of action values for each state-action pair.

12
Q

What is exploration in RL?

A

Trying new actions to discover better long-term strategies.

13
Q

What is exploitation in RL?

A

Choosing the best known action based on current knowledge.
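The two cards above are usually balanced with an ε-greedy rule; a minimal sketch (the function and example values are illustrative, not from the deck):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """With probability epsilon, explore (pick a random action);
    otherwise exploit the best-known action."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))      # explore
    return max(q_values, key=q_values.get)     # exploit

q = {"left": 0.2, "right": 0.7}
action = epsilon_greedy(q, epsilon=0.0)  # epsilon = 0 always exploits: "right"
```

ε is often decayed over training: explore heavily early on, then exploit as estimates improve.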

14
Q

What does the state-value function vπ(s) represent?

A

Expected return starting from state s following policy π.

15
Q

What does the action-value function qπ(s, a) represent?

A

Expected return from state s taking action a under policy π.

16
Q

Why do we approximate value functions in deep RL?

A

To handle large or continuous state/action spaces using neural networks.

17
Q

What does a Deep Q-Network (DQN) do?

A

Approximates Q-values using deep neural networks from raw inputs.

18
Q

What game domain did early DQNs succeed in?

A

Atari games using pixel input.

19
Q

What major game did AlphaGo Zero master using RL?

A

The game of Go, without human data.

20
Q

What does RLHF stand for?

A

Reinforcement Learning from Human Feedback.

21
Q

Why is RLHF used in training LLMs like ChatGPT?

A

To align model behaviour with human preferences.

22
Q

What makes RL different from supervised learning?

A

It learns from rewards, not explicit labels or correct answers.

23
Q

What kind of reward structure makes RL hard?

A

Delayed rewards where consequences appear later.

24
Q

What are trajectories in RL?

A

Sequences of states, actions, and rewards experienced over time.

25
Q

What is the purpose of the learning rate α in Q-learning?

A

Controls how much new experiences update value estimates.
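To see the effect of α concretely, here is a small illustrative sketch (values chosen for illustration, not from the deck) of the interpolation at the heart of the Q-learning update:

```python
def blend(old_q, target, alpha):
    """alpha interpolates between the old estimate and the new target,
    exactly as in Q(s,a) <- (1-alpha)*Q(s,a) + alpha*target."""
    return (1 - alpha) * old_q + alpha * target

# Same experience (target 10.0) starting from an estimate of 0.0:
small = blend(0.0, 10.0, alpha=0.1)  # 1.0 - small alpha: slow, stable updates
large = blend(0.0, 10.0, alpha=0.9)  # 9.0 - large alpha: fast but noisy updates
```

With α = 1 the old estimate is discarded entirely; with α = 0 the agent never learns.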