Reinforcement Learning Flashcards

Question 1

Q

What is Reinforcement Learning?

Answer

A

A learning paradigm in which an agent interacts with an environment and learns, by trial and error, to choose actions that maximise cumulative numeric reward. It rests on the reward hypothesis: that goals can be expressed as reward maximisation.

Question 2

Q

What are the core elements formalised in an RL problem?

Answer

A

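Agent (the decision maker), Environment (the world the agent acts in), Actions (what the agent can do), States (the situation, which the agent sees through observations), and Rewards (scalar feedback signals).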
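
The interaction between these elements can be sketched as a simple loop. The example below is a minimal illustration, not a standard API: `CorridorEnv`, `random_policy`, `run_episode`, and the reward values are all hypothetical names and numbers chosen for this sketch.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: cells 0..length-1; reaching the last cell ends the episode."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        # Start every episode in the leftmost cell and return the initial state.
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clipped to the corridor.
        self.position = max(0, min(self.length - 1, self.position + action))
        done = self.position == self.length - 1
        # Assumed reward scheme: +1.0 on reaching the goal, -0.1 for every other step.
        reward = 1.0 if done else -0.1
        return self.position, reward, done


def random_policy(state, rng):
    # The agent here ignores the state and picks a direction at random.
    return rng.choice([-1, 1])


def run_episode(env, policy, rng, max_steps=1000):
    """Run one episode; return (total_reward, steps_taken, reached_goal)."""
    state = env.reset()
    total_reward = 0.0
    for step_count in range(1, max_steps + 1):
        action = policy(state, rng)             # agent chooses an action
        state, reward, done = env.step(action)  # environment returns state and reward
        total_reward += reward
        if done:
            return total_reward, step_count, True
    return total_reward, max_steps, False


if __name__ == "__main__":
    print(run_episode(CorridorEnv(), random_policy, random.Random(0)))
```

Swapping `random_policy` for any other function of the state changes the agent without touching the environment, which is the separation the five elements describe.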

Question 3

Q

What is the difference between environment state and agent state?

Answer

A

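Environment state is the true underlying state of the world, which the agent usually cannot see in full; agent state is the agent's own internal representation, built from what it has observed (e.g., partial observations or its history), and is what it uses to choose actions.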

Question 4

Q

What distinguishes fully observable from partially observable environments?

Answer

A

Fully observable: the agent's observation reveals the complete environment state. Partially observable: the agent's observations reveal only part of the state, so it must rely on its history to infer the rest.

Question 5

Q

What are the three main components of an RL agent?

Answer

A

Policy (mapping states to actions), Value function (estimating expected returns), and Model (predicting environment dynamics).
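
A minimal sketch of how the three components fit together, assuming a hypothetical three-state chain. The names `MODEL`, `POLICY`, and `evaluate_policy`, the rewards, and the discount factor `gamma` are assumptions made for illustration and do not come from the lecture.

```python
# Hypothetical chain with states 0, 1, 2, where state 2 is terminal.

# Model: predicts (next_state, reward) for each (state, action) pair.
MODEL = {
    (0, "stay"): (0, 0.0),
    (0, "advance"): (1, 0.0),
    (1, "stay"): (1, 0.0),
    (1, "advance"): (2, 1.0),
}

# Policy: maps each non-terminal state to an action.
POLICY = {0: "advance", 1: "advance"}


def evaluate_policy(policy, model, gamma=0.9, sweeps=50):
    """Value function: estimate the return from each state under `policy`.

    Uses repeated one-step backups through the model; gamma is an assumed
    discount factor applied to future reward.
    """
    values = {0: 0.0, 1: 0.0, 2: 0.0}  # the terminal state keeps value 0
    for _ in range(sweeps):
        for state, action in policy.items():
            next_state, reward = model[(state, action)]
            values[state] = reward + gamma * values[next_state]
    return values


if __name__ == "__main__":
    print(evaluate_policy(POLICY, MODEL))
```

Here the value function is computed from the policy and the model, but an agent may hold any subset of the three: a model-free agent, for instance, would estimate values from sampled experience instead.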

Question 6

Q

Why does RL violate the i.i.d. assumption of supervised learning?

Answer

A

Because samples are sequentially correlated and are collected by the agent's own policy, so the data distribution shifts as the policy changes; they are not drawn independently from a fixed (stationary) distribution.
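
The sequential correlation can be made concrete with a small experiment. The sketch below assumes a random walk as a stand-in for the states visited along a trajectory and compares it with independently drawn samples; the function names are hypothetical.

```python
import random


def lag1_correlation(xs):
    """Pearson correlation between each value and the one that follows it."""
    a, b = xs[:-1], xs[1:]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def random_walk_states(n, rng):
    # Each state depends on the previous one, as states along a trajectory do.
    state, states = 0, []
    for _ in range(n):
        state += rng.choice([-1, 1])
        states.append(state)
    return states


def iid_samples(n, rng):
    # Each sample is drawn independently from the same fixed distribution.
    return [rng.choice([-1, 1]) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    print("trajectory:", lag1_correlation(random_walk_states(2000, rng)))
    print("i.i.d.:    ", lag1_correlation(iid_samples(2000, rng)))
```

Consecutive states of the walk are strongly correlated, while the independent samples are not. This sketch covers only the correlation half of the answer; the distribution shift caused by a changing policy is not modelled here.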

Question 7

Q

What is Deep Reinforcement Learning?

Answer

A

RL that uses deep neural networks as function approximators for policies and value functions, so that agents can handle high-dimensional inputs such as images.

Question 8

Q

What is imitation learning?

Answer

A

Training agents by mimicking expert demonstrations instead of learning from reward signals.

Question 9

Q

Name two high-profile RL applications mentioned in the lecture.

Answer

A

AlphaGo (game playing) and ChatGPT (a language model fine-tuned with RLHF, i.e., using human preferences).

Question 10

Q

What is RLHF (Reinforcement Learning from Human Feedback)?

Answer

A

A training paradigm where human feedback provides reward signals to fine-tune models, as used in ChatGPT.

Question 11

Q

What is the reward hypothesis?

Answer

A

Any goal can be formalised as the maximisation of the expected cumulative reward signal from the environment.
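
As a concrete reading of "cumulative reward", the sketch below sums a sequence of rewards into a single return. The discount factor `gamma` is an assumption added here (the card itself does not mention discounting); with `gamma=1.0` the function reduces to a plain sum.

```python
def discounted_return(rewards, gamma=1.0):
    """Return r_0 + gamma * r_1 + gamma**2 * r_2 + ... for a list of rewards."""
    total = 0.0
    # Work backwards so each step needs only one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    print(discounted_return([1.0, 1.0, 1.0]))             # undiscounted sum
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # later rewards count less
```

Under the reward hypothesis, the agent's goal is to choose actions that make the expected value of this quantity as large as possible.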

Question 12

Q

What examples of sequential data and tasks motivate RL?

Answer

A

Games (Go), robotics control, dialogue systems, and autonomous vehicles, where sequence and decision-making are key.

Question 13

Q

What is the Markov property in the RL context?

Answer

A

The assumption that the future is independent of the past given the present state, enabling state-based decision-making.
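
Written as an equation, using notation that is an assumption here ($S_t$ for the state and $A_t$ for the action at time $t$), the property says that conditioning on the full history adds nothing beyond the current state and action:

```latex
\Pr\left(S_{t+1} = s' \mid S_t, A_t\right)
  = \Pr\left(S_{t+1} = s' \mid S_1, A_1, \dots, S_t, A_t\right)
```

When this holds, a policy needs only the current state as input, because earlier states and actions carry no further information about what happens next.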

(13 cards)