Reinforcement Learning Flashcards
What is the main goal of reinforcement learning (RL)?
To learn a policy that maximises cumulative reward through interaction.
What are the core components of an RL system?
Agent, environment, state, action, reward.
Is reinforcement learning supervised or unsupervised?
Neither; it’s a separate paradigm focused on learning from interaction.
What is a policy in RL?
A strategy mapping states to actions to maximise expected return.
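A minimal sketch of two common policy representations, with the states and actions invented for illustration: a lookup table for small state spaces, or a function for the general case.

```python
# Two ways to represent a policy (states and actions are made up):
table_policy = {0: "right", 1: "right", 2: "left"}   # state → action lookup

def policy(state):
    """Deterministic policy: return the action chosen in this state."""
    return table_policy[state]
```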
What does the reward signal in RL do?
A scalar feedback signal indicating how good the agent’s action was; the agent learns to maximise its cumulative sum.
What does the discount factor γ represent?
How much future rewards are valued compared to immediate ones.
What is the agent-environment loop in RL?
Agent observes the state → takes an action → receives a reward and the next state → repeats.
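A minimal sketch of this loop in Python; the `ToyWalkEnv` class and its `reset`/`step` interface are invented here for illustration, not a standard API.

```python
import random

class ToyWalkEnv:
    """Toy environment: agent walks on integers; reaching +3 pays reward 1."""
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):                  # action: -1 (left) or +1 (right)
        self.state += action
        done = (self.state == 3)
        return self.state, (1.0 if done else 0.0), done

env = ToyWalkEnv()
state = env.reset()
for t in range(100):                         # cap the episode length
    action = random.choice([-1, 1])          # random policy, for illustration
    state, reward, done = env.step(action)   # reward and next state come back
    if done:
        break
```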
What is the formula for cumulative discounted return?
Gₜ = Rₜ₊₁ + γRₜ₊₂ + γ²Rₜ₊₃ + …
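A short worked example of this formula; the function name and the value of γ are arbitrary choices.

```python
def discounted_return(rewards, gamma=0.9):
    """Gₜ = Rₜ₊₁ + γRₜ₊₂ + γ²Rₜ₊₃ + …, folded from the last reward back."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g                  # G = r + γ·G_next
    return g

print(discounted_return([1.0, 1.0, 1.0]))  # 1 + 0.9 + 0.81 = 2.71
```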
What does Q-learning aim to learn?
The optimal action-value function Q(s, a).
What is the Q-learning update rule?
Q(s,a) ← (1−α)Q(s,a) + α[r + γ maxₐ′ Q(s′,a′)].
What is the purpose of the Q-table?
To store estimates of action values for each state-action pair.
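A minimal sketch of one tabular Q-learning update, with the Q-table stored as a dictionary keyed by (state, action) pairs; the states, actions, and hyperparameter values are invented for illustration.

```python
from collections import defaultdict

Q = defaultdict(float)        # Q-table: unseen (s, a) pairs default to 0.0
alpha, gamma = 0.1, 0.9       # learning rate and discount factor (arbitrary)
actions = ["left", "right"]

def q_update(s, a, r, s_next):
    """Q(s,a) ← (1−α)Q(s,a) + α[r + γ maxₐ′ Q(s′,a′)]."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

q_update(s=0, a="right", r=1.0, s_next=1)
print(Q[(0, "right")])        # 0.1: the estimate moves α of the way toward r
```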
What is exploration in RL?
Trying new actions to discover better long-term strategies.
What is exploitation in RL?
Choosing the best known action based on current knowledge.
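ε-greedy selection is the simplest way to balance the two; a sketch, reusing the Q-table shape from the previous example (ε = 0.1 is an arbitrary choice).

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability ε explore at random; otherwise exploit the best action."""
    if random.random() < epsilon:
        return random.choice(actions)                    # explore
    return max(actions, key=lambda a: Q[(state, a)])     # exploit
```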
What does the state-value function vπ(s) represent?
Expected return starting from state s following policy π.
What does the action-value function qπ(s, a) represent?
Expected return from state s taking action a under policy π.
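A sketch of estimating vπ(s) by simple Monte Carlo: average the discounted return over many episodes started in s under π. The `run_episode` helper is hypothetical; it should yield the rewards observed from s while following π.

```python
def mc_state_value(run_episode, s, gamma=0.9, n_episodes=1000):
    """Estimate vπ(s) as the sample mean of discounted returns from s."""
    total = 0.0
    for _ in range(n_episodes):
        g, discount = 0.0, 1.0
        for r in run_episode(s):       # rewards Rₜ₊₁, Rₜ₊₂, … under π
            g += discount * r
            discount *= gamma
        total += g
    return total / n_episodes          # sample mean ≈ expected return
```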
Why do we approximate value functions in deep RL?
To handle large or continuous state/action spaces using neural networks.
What does a Deep Q-Network (DQN) do?
Approximates Q-values using deep neural networks from raw inputs.
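A minimal sketch of a DQN-style value network, assuming PyTorch; the layer sizes and state/action dimensions are arbitrary, and a real DQN also needs a replay buffer and target network, omitted here.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action."""
    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),    # one output per action
        )
    def forward(self, state):
        return self.net(state)               # shape: (batch, n_actions)

q_net = QNetwork()
q_values = q_net(torch.zeros(1, 4))          # Q-values for a dummy state
action = q_values.argmax(dim=1)              # greedy action index
```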
What game domain did early DQNs succeed in?
Atari 2600 games, learning directly from raw pixel input.
What major game did AlphaGo Zero master using RL?
The game of Go, learning entirely through self-play without human game data.
What does RLHF stand for?
Reinforcement Learning from Human Feedback.
Why is RLHF used in training LLMs like ChatGPT?
To align model behaviour with human preferences.
What makes RL different from supervised learning?
It learns from rewards, not explicit labels or correct answers.
What kind of reward structure makes RL hard?
Delayed rewards, where the consequences of an action only appear many steps later (the credit-assignment problem).
What are trajectories in RL?
Sequences of states, actions, and rewards experienced over time.
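A trajectory is just recorded data; a sketch as a plain list of (state, action, reward) tuples with made-up values.

```python
# One episode's trajectory: (sₜ, aₜ, rₜ₊₁) triples, values invented.
trajectory = [
    (0, "right", 0.0),
    (1, "right", 0.0),
    (2, "right", 1.0),   # goal reached on the last step
]
G0 = sum(r * 0.9**t for t, (_, _, r) in enumerate(trajectory))
print(G0)                # discounted return of this trajectory: 0.81
```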