Core Tenets' Impact on RL Algorithm Selection Flashcards
(110 cards)
Number of environment interactions needed to learn a good policy. Critical when real-world data is costly.
Sample Efficiency
Trade-off with wall clock time: cheap samples favor model-free; expensive samples favor model-based.
In RL what is Stability & Convergence related to?
Ease and speed of convergence; sensitivity to seeds/hyperparameters.
Improvements often add hyperparameters, hurting generalization. Local optima common.
Performance on unseen tasks/situations.
Generalization
Trade-off with model fidelity: explicit models are efficient but brittle; direct policies are adaptable but less efficient.
Underlying constraints (e.g., continuity, observability, action space type).
Assumptions & Approximations
Violating assumptions leads to failure. Problem structure dictates suitable algorithms.
Ease of computation and ability to scale horizontally.
Computational Simplicity & Parallelism
High parallelism can reduce wall clock time even with low sample efficiency.
Effectiveness in discovering states/actions.
Exploration
Exploration often relies on heuristics such as ε-greedy, which add hyperparameters and can be sub-optimal for the target policy.
Error from assumptions vs. sensitivity to data fluctuations.
Bias vs. Variance
Reducing variance often hurts sample efficiency; conflicting objectives.
Bias vs. Variance
Identify the matching model for each of these descriptions -
a. High variance, zero bias example
b. Low variance, biased example
c. Lower variance, but hyperparameter sensitive.
d. Model example that is vulnerable to variance
(1. one-step Temporal Difference (TD),
2. Monte-Carlo,
3. Policy Gradient,
4. off-policy Temporal Difference (TD))
Bias vs. Variance
a. High variance, zero bias example - Monte-Carlo
b. Low variance, biased example - one-step Temporal Difference (TD)
c. Lower variance, but hyperparameter sensitive - off-policy Temporal Difference (TD)
d. Model example that is vulnerable to variance - Policy Gradient
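A minimal sketch of the bias/variance contrast behind answers (a) and (b), assuming a tabular value estimate V and one recorded episode (the names and numbers are illustrative, not from the deck): the Monte-Carlo target is the full observed return (zero bias, high variance), while the one-step TD target bootstraps from the current estimate (lower variance, but biased whenever V is wrong).

import numpy as np

gamma = 0.99
V = np.zeros(5)                                      # current value estimates for 5 states
episode = [(0, 1.0), (2, 0.0), (3, 1.0), (4, 0.0)]   # recorded (state, reward) per step

# Monte-Carlo target for the first state: the full discounted return.
# Zero bias, but it accumulates noise from every step of the episode.
G = sum(gamma**t * r for t, (_, r) in enumerate(episode))

# One-step TD target: immediate reward plus the bootstrapped estimate of the
# next state. Only one step of noise (lower variance), but biased whenever V
# is inaccurate.
s0, r0 = episode[0]
s1, _ = episode[1]
td_target = r0 + gamma * V[s1]

alpha = 0.1                                          # learning rate
V[s0] += alpha * (td_target - V[s0])                 # TD(0) update toward the bootstrapped target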
Computational Simplicity & Parallelism
Rate each model family's computational cost & parallelizability
Model-based
Computational cost - (Low, Medium, High)
Value-based/Model-based
Parallelizability - (Easy or Hard)
Evolutionary
Computational cost - (Low, Medium, High)
Parallelizability - (Low or High)
Model-free:
Computational cost?
Computational Simplicity & Parallelism
Model-based
Computational cost - High (iterative methods)
Value-based/Model-based
Parallelizability - Hard
Evolutionary
Computational cost - Low
Parallelizability - High
Model-free:
Computational cost - Low per update; appealing when samples are cheap.
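A minimal sketch of why the evolutionary row parallelizes so easily (illustrative code; evaluate() is a stand-in fitness function, not anything from the deck): every candidate in the population is scored independently, so the evaluations can be farmed out to separate worker processes or machines.

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def evaluate(params):
    # Stand-in fitness: in a real setting this would run one or more episodes
    # with a policy parameterized by `params` and return the episode return.
    return -float(np.sum(params ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = np.zeros(10)
    # Population of perturbed parameter vectors (evolution-strategies style).
    population = [base + 0.1 * rng.standard_normal(10) for _ in range(16)]
    # Each evaluation is independent of the others, which is what makes these
    # methods embarrassingly parallel despite their many rollouts.
    with ProcessPoolExecutor() as pool:
        scores = list(pool.map(evaluate, population))
    best = population[int(np.argmax(scores))]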
How do model-based methods do with sample efficiency?
Model-based methods generally stand out as the most sample-efficient because they leverage an internal understanding of the system dynamics, which significantly reduces the volume of samples needed for learning.
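A minimal sketch of where that efficiency comes from (an illustrative linear-dynamics example with made-up names, not from the deck): after fitting a dynamics model to a small batch of real transitions, candidate action sequences are evaluated entirely inside the model, consuming no additional environment samples.

import numpy as np

# A small batch of "real" transitions (states S, actions A_act, next states S_next).
rng = np.random.default_rng(0)
S = rng.standard_normal((200, 4))
A_act = rng.standard_normal((200, 1))
S_next = S + 0.1 * A_act                             # stand-in for observed next states

# Fit a simple linear dynamics model: s' ~ [s, a] @ W.
X = np.hstack([S, A_act])
W, *_ = np.linalg.lstsq(X, S_next, rcond=None)

def rollout_cost(s0, actions):
    # Evaluate an action sequence against the learned model only:
    # no real environment interaction happens here.
    s, cost = s0.copy(), 0.0
    for a in actions:
        s = np.hstack([s, a]) @ W
        cost += float(np.sum(s ** 2))                # illustrative quadratic cost
    return cost

# Random-shooting planning: keep the cheapest of several candidate plans.
s0 = rng.standard_normal(4)
candidates = [rng.uniform(-1, 1, size=(10, 1)) for _ in range(64)]
best_plan = min(candidates, key=lambda acts: rollout_cost(s0, acts))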
On-policy vs. Off-policy Learning
The fundamental distinction between on-policy and off-policy learning lies in the relationship between the policy used to collect data and the policy being optimized.
On-policy methods
operate with a ___ that serves both as the behavior policy for sampling data and the target policy for optimization
single policy
operate with a single policy that serves both as the behavior policy for sampling data and the target policy for optimization
Policy Gradient:
On-policy or Off-policy Learning
On-policy
Policy Gradient methods exemplify this single-policy usage: actions are sampled from a policy (π), and the observed rewards are then used to optimize that same policy.
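A minimal REINFORCE-style sketch of this single-policy loop (illustrative tabular softmax policy and toy environment; all names are assumptions, not from the deck): the same parameters theta both generate the actions and receive the gradient update.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))              # policy parameters (logits)

def policy(s):
    # Softmax over logits: pi_theta(. | s)
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def env_step(s, a):
    # Toy environment stand-in: reward 1 for action 0, random next state.
    return int(rng.integers(n_states)), float(a == 0)

for _ in range(500):
    s, traj = 0, []
    for _ in range(10):
        a = int(rng.choice(n_actions, p=policy(s)))  # behavior policy = pi_theta
        s_next, r = env_step(s, a)
        traj.append((s, a, r))
        s = s_next
    G = sum(r for _, _, r in traj)                   # episode return
    for s_t, a_t, _ in traj:
        grad_log = -policy(s_t)                      # d log pi / d logits ...
        grad_log[a_t] += 1.0                         # ... = onehot(a_t) - pi(.|s_t)
        theta[s_t] += 0.01 * G * grad_log            # target policy = the same pi_theta

Because each update changes theta, trajectories collected before the update no longer come from the current policy, which is exactly the sample-efficiency issue raised a few cards below.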
Are Value-Learning methods
On-policy or Off-policy Learning?
It depends.
In some value-learning methods, the observed rewards are used to fit a Q-value function that then derives the exact same policy used for data collection (on-policy); others, such as Q-learning, can reuse data gathered by a different behavior policy (off-policy).
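A minimal tabular sketch of that distinction (illustrative code, not from the deck): a SARSA-style target bootstraps from the action the data-collecting policy actually takes next (on-policy), while a Q-learning target maximizes over actions, valuing a greedy policy that differs from the ε-greedy policy gathering the data (off-policy).

import numpy as np

rng = np.random.default_rng(0)
Q = np.zeros((5, 2))                                 # tabular Q-values: 5 states, 2 actions
gamma, alpha, eps = 0.99, 0.1, 0.1

def eps_greedy(s):
    # The behavior policy used to collect data.
    if rng.random() < eps:
        return int(rng.integers(2))
    return int(np.argmax(Q[s]))

# One observed transition (s, a, r, s_next), plus the action the behavior
# policy actually takes in s_next.
s, a, r, s_next = 0, eps_greedy(0), 1.0, 3
a_next = eps_greedy(s_next)

# On-policy (SARSA-style): bootstrap from the action the same policy takes next.
sarsa_target = r + gamma * Q[s_next, a_next]

# Off-policy (Q-learning): bootstrap from the greedy action, regardless of
# what the behavior policy does next.
q_learning_target = r + gamma * np.max(Q[s_next])

Q[s, a] += alpha * (sarsa_target - Q[s, a])          # apply whichever target is being used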
How do on-policy methods
do with sample efficiency?
A critical implication of this single-policy approach is that any change in the policy during optimization typically necessitates the collection of new samples, as old samples become irrelevant for the updated policy
This often leads to poor sample efficiency, as data collected for one gradient update cannot simply be reused for subsequent updates.
How do on-policy methods do with
convergence? Easy or hard?
On-policy methods are sometimes claimed to converge faster than off-policy approaches.
How can exploration performance be improved in on-policy methods?
To enhance exploration, on-policy methods may introduce “hacks” like ε-greedy policies, which are inherently sub-optimal for the ultimate target policy
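A minimal sketch of the ε-greedy hack (illustrative code): with probability ε the agent ignores its learned preferences entirely and acts uniformly at random, so the deployed behavior is deliberately sub-optimal relative to the greedy target policy.

import numpy as np

rng = np.random.default_rng(0)

def eps_greedy_action(q_values, eps=0.1):
    # Undirected exploration: random action with probability eps,
    # greedy action otherwise.
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

action = eps_greedy_action(np.array([0.2, 1.5, -0.3]))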
What is the distinction between model-free and model-based Reinforcement Learning?
The distinction between model-free and model-based Reinforcement Learning centers on whether the algorithm explicitly learns or utilizes a model of the environment’s dynamics.
Describe Model-free RL
Model-free RL approaches focus on directly learning a policy or value functions from observed samples, without constructing or relying on an explicit model of the environment's dynamics.
The choice between model-free and model-based approaches can often be framed by a
a "policy-centric vs. model-centric" perspective: which is conceptually simpler to model for a given task?
Is it easier to define what the agent should do (policy) or how the environment behaves (model)?
a "policy-centric vs. model-centric" perspective: which is conceptually simpler to model for a given task?
(example)
For instance, in the cart-pole problem, balancing the pole might be more intuitively modeled by a policy (e.g., "move left if the pole falls left") without needing to understand the underlying physics. In contrast, for a game like Go, where the rules (the model) are clearly defined, a model-based search for promising moves might be more straightforward for a beginner than directly learning a policy.
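A minimal sketch of the policy-centric view for cart-pole (assumes the Gymnasium package and its CartPole-v1 environment; the threshold rule is illustrative): a hand-written rule that pushes the cart toward the side the pole is falling requires no model of the physics at all.

import gymnasium as gym

def falling_direction_policy(obs):
    # CartPole observation: [cart position, cart velocity, pole angle, pole angular velocity].
    # Push right (action 1) if the pole is tipping right, otherwise push left (action 0).
    _, _, angle, angular_velocity = obs
    return 1 if angle + 0.5 * angular_velocity > 0 else 0

env = gym.make("CartPole-v1")
obs, _ = env.reset(seed=0)
total_reward, done = 0.0, False
while not done:
    obs, reward, terminated, truncated, _ = env.step(falling_direction_policy(obs))
    total_reward += reward
    done = terminated or truncated
env.close()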
Value-Learning vs. Policy Gradient Methods
Value-learning and Policy Gradient methods represent two fundamental approaches to optimizing an agent’s behavior in Reinforcement Learning, differing primarily in what they directly optimize.
Value-learning methods, such as Q-learning and DQN, aim
Value-learning methods, such as Q-learning and DQN, aim to estimate the optimal value function (e.g., the Q-value function), which then implicitly defines the optimal policy. The policy is derived by selecting the action that maximizes the estimated Q-value for a given state
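A minimal sketch of that last step (illustrative tabular numbers): the policy is never stored explicitly; it is read off the estimated Q-values with an argmax per state.

import numpy as np

Q = np.array([[0.1, 0.9],                            # estimated Q-values: rows are states,
              [0.7, 0.2],                            # columns are actions
              [0.4, 0.5]])

# Implicit policy: in each state, take the action with the highest Q-value.
greedy_policy = np.argmax(Q, axis=1)                 # -> array([1, 0, 1])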
Policy Gradient (PG) methods (by contrast to Value-learning)
Policy Gradient (PG) methods, by contrast, directly optimize the policy function itself. They achieve this by performing gradient ascent on an objective function that quantifies the expected return, effectively making high-reward actions more probable
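For reference, the standard score-function (REINFORCE) form of that gradient-ascent step, written in standard notation rather than anything specific to these cards, where G_t is the return following time step t and α is a step size:

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right],
\qquad
\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)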