Reinforcement Learning Flashcards
What is the main goal of reinforcement learning (RL)?
To learn a policy that maximises cumulative reward through interaction.
What are the core components of an RL system?
Agent, environment, state, action, reward.
Is reinforcement learning supervised or unsupervised?
Neither; it’s a separate paradigm focused on learning from interaction.
What is a policy in RL?
A strategy mapping states to actions to maximise expected return.
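A minimal sketch of two common policy representations, with the states and actions invented for illustration: a lookup table for small state spaces, or a function for the general case.

```python
# Two ways to represent a policy (states and actions are made up):
table_policy = {0: "right", 1: "right", 2: "left"}   # state → action lookup

def policy(state):
    """Deterministic policy: return the action chosen in this state."""
    return table_policy[state]
```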
What does the reward signal in RL do?
A scalar feedback signal indicating how good the agent’s action was; the agent learns to maximise its cumulative sum.
What does the discount factor γ represent?
How much future rewards are valued compared to immediate ones.
What is the agent-environment loop in RL?
Agent observes the state → takes an action → receives a reward and the next state → repeats.
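A minimal sketch of this loop in Python; the `ToyWalkEnv` class and its `reset`/`step` interface are invented here for illustration, not a standard API.

```python
import random

class ToyWalkEnv:
    """Toy environment: agent walks on integers; reaching +3 pays reward 1."""
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):                  # action: -1 (left) or +1 (right)
        self.state += action
        done = (self.state == 3)
        return self.state, (1.0 if done else 0.0), done

env = ToyWalkEnv()
state = env.reset()
for t in range(100):                         # cap the episode length
    action = random.choice([-1, 1])          # random policy, for illustration
    state, reward, done = env.step(action)   # reward and next state come back
    if done:
        break
```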
What is the formula for cumulative discounted return?
Gₜ = Rₜ₊₁ + γRₜ₊₂ + γ²Rₜ₊₃ + …
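A short worked example of this formula; the function name and the value of γ are arbitrary choices.

```python
def discounted_return(rewards, gamma=0.9):
    """Gₜ = Rₜ₊₁ + γRₜ₊₂ + γ²Rₜ₊₃ + …, folded from the last reward back."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g                  # G = r + γ·G_next
    return g

print(discounted_return([1.0, 1.0, 1.0]))  # 1 + 0.9 + 0.81 = 2.71
```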
What does Q-learning aim to learn?
The optimal action-value function Q(s, a).
What is the Q-learning update rule?
Q(s,a) ← (1−α)Q(s,a) + α[r + γ maxₐ′ Q(s′,a′)].
What is the purpose of the Q-table?
To store estimates of action values for each state-action pair.
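A minimal sketch of one tabular Q-learning update, with the Q-table stored as a dictionary keyed by (state, action) pairs; the states, actions, and hyperparameter values are invented for illustration.

```python
from collections import defaultdict

Q = defaultdict(float)        # Q-table: unseen (s, a) pairs default to 0.0
alpha, gamma = 0.1, 0.9       # learning rate and discount factor (arbitrary)
actions = ["left", "right"]

def q_update(s, a, r, s_next):
    """Q(s,a) ← (1−α)Q(s,a) + α[r + γ maxₐ′ Q(s′,a′)]."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

q_update(s=0, a="right", r=1.0, s_next=1)
print(Q[(0, "right")])        # 0.1: the estimate moves α of the way toward r
```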
What is exploration in RL?
Trying new actions to discover better long-term strategies.
What is exploitation in RL?
Choosing the best known action based on current knowledge.
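ε-greedy selection is the simplest way to balance the two; a sketch, reusing the Q-table shape from the previous example (ε = 0.1 is an arbitrary choice).

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability ε explore at random; otherwise exploit the best action."""
    if random.random() < epsilon:
        return random.choice(actions)                    # explore
    return max(actions, key=lambda a: Q[(state, a)])     # exploit
```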
What does the state-value function vπ(s) represent?
Expected return starting from state s following policy π.
What does the action-value function qπ(s, a) represent?
Expected return from state s taking action a under policy π.
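A sketch of estimating vπ(s) by simple Monte Carlo: average the discounted return over many episodes started in s under π. The `run_episode` helper is hypothetical; it should yield the rewards observed from s while following π.

```python
def mc_state_value(run_episode, s, gamma=0.9, n_episodes=1000):
    """Estimate vπ(s) as the sample mean of discounted returns from s."""
    total = 0.0
    for _ in range(n_episodes):
        g, discount = 0.0, 1.0
        for r in run_episode(s):       # rewards Rₜ₊₁, Rₜ₊₂, … under π
            g += discount * r
            discount *= gamma
        total += g
    return total / n_episodes          # sample mean ≈ expected return
```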
Why do we approximate value functions in deep RL?
To handle large or continuous state/action spaces using neural networks.
What does a Deep Q-Network (DQN) do?
Approximates Q-values using deep neural networks from raw inputs.
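A minimal sketch of a DQN-style value network, assuming PyTorch; the layer sizes and state/action dimensions are arbitrary, and a real DQN also needs a replay buffer and target network, omitted here.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action."""
    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),    # one output per action
        )
    def forward(self, state):
        return self.net(state)               # shape: (batch, n_actions)

q_net = QNetwork()
q_values = q_net(torch.zeros(1, 4))          # Q-values for a dummy state
action = q_values.argmax(dim=1)              # greedy action index
```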
What game domain did early DQNs succeed in?
Atari 2600 games, learning directly from raw pixel input.
What major game did AlphaGo Zero master using RL?
The game of Go, learning entirely through self-play without human game data.
What does RLHF stand for?
Reinforcement Learning from Human Feedback.
Why is RLHF used in training LLMs like ChatGPT?
To align model behaviour with human preferences.
What makes RL different from supervised learning?
It learns from rewards, not explicit labels or correct answers.
What kind of reward structure makes RL hard?
Delayed rewards, where the consequences of an action only appear many steps later (the credit-assignment problem).
What are trajectories in RL?
Sequences of states, actions, and rewards experienced over time.
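A trajectory is just recorded data; a sketch as a plain list of (state, action, reward) tuples with made-up values.

```python
# One episode's trajectory: (sₜ, aₜ, rₜ₊₁) triples, values invented.
trajectory = [
    (0, "right", 0.0),
    (1, "right", 0.0),
    (2, "right", 1.0),   # goal reached on the last step
]
G0 = sum(r * 0.9**t for t, (_, _, r) in enumerate(trajectory))
print(G0)                # discounted return of this trajectory: 0.81
```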