Core Tenets' Impact on RL Algorithm Selection Flashcards
(110 cards)
Number of environment interactions needed to learn a good policy. Critical when real-world data is costly.
Sample Efficiency
Trade-off with wall clock time: cheap samples favor model-free; expensive samples favor model-based.
In RL what is Stability & Convergence related to?
Ease and speed of convergence; sensitivity to seeds/hyperparameters.
Improvements often add hyperparameters, hurting generalization. Local optima common.
Performance on unseen tasks/situations.
Generalization
Trade-off with model fidelity: explicit models are efficient but brittle; direct policies are adaptable but less efficient.
Underlying constraints (e.g., continuity, observability, action space type).
Assumptions & Approximations
Violating assumptions leads to failure. Problem structure dictates suitable algorithms.
Ease of computation and ability to scale horizontally.
Computational Simplicity & Parallelism
High parallelism can reduce wall clock time even with low sample efficiency.
Effectiveness in discovering states/actions.
Exploration
Exploration often relies on heuristics such as ε-greedy, which add hyperparameters and can be sub-optimal for the target policy.
Error from assumptions vs. sensitivity to data fluctuations.
Bias vs. Variance
Reducing variance often hurts sample efficiency; conflicting objectives.
Bias vs. Variance
Identify the matching model for each of these descriptions -
a. High variance, zero bias example
b. Low variance, biased example
c. Lower variance, but hyperparameter sensitive.
d. Model example that is vulnerable to variance
(1. one-step Temporal Difference (TD),
2. Monte-Carlo,
3. Policy Gradient,
4. off-policy Temporal Difference (TD))
Bias vs. Variance
a. High variance, zero bias example - Monte-Carlo
b. Low variance, biased example - one-step Temporal Difference (TD)
c. Lower variance, but hyperparameter sensitive - off-policy Temporal Difference (TD)
d. Model example that is vulnerable to variance - Policy Gradient
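A minimal sketch of the bias/variance contrast behind answers (a) and (b), assuming a tabular value estimate V and one recorded episode (the names and numbers are illustrative, not from the deck): the Monte-Carlo target is the full observed return (zero bias, high variance), while the one-step TD target bootstraps from the current estimate (lower variance, but biased whenever V is wrong).

import numpy as np

gamma = 0.99
V = np.zeros(5)                                      # current value estimates for 5 states
episode = [(0, 1.0), (2, 0.0), (3, 1.0), (4, 0.0)]   # recorded (state, reward) per step

# Monte-Carlo target for the first state: the full discounted return.
# Zero bias, but it accumulates noise from every step of the episode.
G = sum(gamma**t * r for t, (_, r) in enumerate(episode))

# One-step TD target: immediate reward plus the bootstrapped estimate of the
# next state. Only one step of noise (lower variance), but biased whenever V
# is inaccurate.
s0, r0 = episode[0]
s1, _ = episode[1]
td_target = r0 + gamma * V[s1]

alpha = 0.1                                          # learning rate
V[s0] += alpha * (td_target - V[s0])                 # TD(0) update toward the bootstrapped target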
Computational Simplicity & Parallelism
Rate each model family's computational cost & parallelizability
Model-based
Computational cost - (Low, Medium, High)
Value-based/Model-based
Parallelizability - (Easy or Hard)
Evolutionary
Computational cost - (Low, Medium, High)
Parallelizability - (Low or High)
Model-free:
Computational cost?
Computational Simplicity & Parallelism
Model-based
Computational cost - High (iterative methods)
Value-based/Model-based
Parallelizability - Hard
Evolutionary
Computational cost - Low
Parallelizability - High
Model-free:
Computational cost - Low per update; appealing when samples are cheap.
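A minimal sketch of why the evolutionary row parallelizes so easily (illustrative code; evaluate() is a stand-in fitness function, not anything from the deck): every candidate in the population is scored independently, so the evaluations can be farmed out to separate worker processes or machines.

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def evaluate(params):
    # Stand-in fitness: in a real setting this would run one or more episodes
    # with a policy parameterized by `params` and return the episode return.
    return -float(np.sum(params ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = np.zeros(10)
    # Population of perturbed parameter vectors (evolution-strategies style).
    population = [base + 0.1 * rng.standard_normal(10) for _ in range(16)]
    # Each evaluation is independent of the others, which is what makes these
    # methods embarrassingly parallel despite their many rollouts.
    with ProcessPoolExecutor() as pool:
        scores = list(pool.map(evaluate, population))
    best = population[int(np.argmax(scores))]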
How do model-based methods do with sample efficiency?
Model-based methods generally stand out as the most sample-efficient because they leverage an internal understanding of the system dynamics, which significantly reduces the volume of samples needed for learning.
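A minimal sketch of where that efficiency comes from (an illustrative linear-dynamics example with made-up names, not from the deck): after fitting a dynamics model to a small batch of real transitions, candidate action sequences are evaluated entirely inside the model, consuming no additional environment samples.

import numpy as np

# A small batch of "real" transitions (states S, actions A_act, next states S_next).
rng = np.random.default_rng(0)
S = rng.standard_normal((200, 4))
A_act = rng.standard_normal((200, 1))
S_next = S + 0.1 * A_act                             # stand-in for observed next states

# Fit a simple linear dynamics model: s' ~ [s, a] @ W.
X = np.hstack([S, A_act])
W, *_ = np.linalg.lstsq(X, S_next, rcond=None)

def rollout_cost(s0, actions):
    # Evaluate an action sequence against the learned model only:
    # no real environment interaction happens here.
    s, cost = s0.copy(), 0.0
    for a in actions:
        s = np.hstack([s, a]) @ W
        cost += float(np.sum(s ** 2))                # illustrative quadratic cost
    return cost

# Random-shooting planning: keep the cheapest of several candidate plans.
s0 = rng.standard_normal(4)
candidates = [rng.uniform(-1, 1, size=(10, 1)) for _ in range(64)]
best_plan = min(candidates, key=lambda acts: rollout_cost(s0, acts))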
On-policy vs. Off-policy Learning
The fundamental distinction between on-policy and off-policy learning lies in the relationship between the policy used to collect data and the policy being optimized.
On-policy methods
operate with a ___ that serves both as the behavior policy for sampling data and the target policy for optimization
single policy
operate with a single policy that serves both as the behavior policy for sampling data and the target policy for optimization
Policy Gradient:
On-policy or Off-policy Learning
On-policy
Policy Gradient methods exemplify this single-policy usage: actions are sampled from a policy (π), and the observed rewards are then used to optimize that same policy.
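A minimal REINFORCE-style sketch of this single-policy loop (illustrative tabular softmax policy and toy environment; all names are assumptions, not from the deck): the same parameters theta both generate the actions and receive the gradient update.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))              # policy parameters (logits)

def policy(s):
    # Softmax over logits: pi_theta(. | s)
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def env_step(s, a):
    # Toy environment stand-in: reward 1 for action 0, random next state.
    return int(rng.integers(n_states)), float(a == 0)

for _ in range(500):
    s, traj = 0, []
    for _ in range(10):
        a = int(rng.choice(n_actions, p=policy(s)))  # behavior policy = pi_theta
        s_next, r = env_step(s, a)
        traj.append((s, a, r))
        s = s_next
    G = sum(r for _, _, r in traj)                   # episode return
    for s_t, a_t, _ in traj:
        grad_log = -policy(s_t)                      # d log pi / d logits ...
        grad_log[a_t] += 1.0                         # ... = onehot(a_t) - pi(.|s_t)
        theta[s_t] += 0.01 * G * grad_log            # target policy = the same pi_theta

Because each update changes theta, trajectories collected before the update no longer come from the current policy, which is exactly the sample-efficiency issue raised a few cards below.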
Are Value-Learning methods
On-policy or Off-policy Learning?
It depends.
In some value-learning methods, the observed rewards are used to fit a Q-value function that then derives the exact same policy used for data collection (on-policy); others, such as Q-learning, can reuse data gathered by a different behavior policy (off-policy).
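A minimal tabular sketch of that distinction (illustrative code, not from the deck): a SARSA-style target bootstraps from the action the data-collecting policy actually takes next (on-policy), while a Q-learning target maximizes over actions, valuing a greedy policy that differs from the ε-greedy policy gathering the data (off-policy).

import numpy as np

rng = np.random.default_rng(0)
Q = np.zeros((5, 2))                                 # tabular Q-values: 5 states, 2 actions
gamma, alpha, eps = 0.99, 0.1, 0.1

def eps_greedy(s):
    # The behavior policy used to collect data.
    if rng.random() < eps:
        return int(rng.integers(2))
    return int(np.argmax(Q[s]))

# One observed transition (s, a, r, s_next), plus the action the behavior
# policy actually takes in s_next.
s, a, r, s_next = 0, eps_greedy(0), 1.0, 3
a_next = eps_greedy(s_next)

# On-policy (SARSA-style): bootstrap from the action the same policy takes next.
sarsa_target = r + gamma * Q[s_next, a_next]

# Off-policy (Q-learning): bootstrap from the greedy action, regardless of
# what the behavior policy does next.
q_learning_target = r + gamma * np.max(Q[s_next])

Q[s, a] += alpha * (sarsa_target - Q[s, a])          # apply whichever target is being used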
How do on-policy methods
do with sample efficiency?
A critical implication of this single-policy approach is that any change in the policy during optimization typically necessitates the collection of new samples, as old samples become irrelevant for the updated policy
This often leads to poor sample efficiency, as data collected for one gradient update cannot simply be reused for subsequent updates.
How do on-policy methods do with
convergence? Easy or hard?
On-policy methods are sometimes claimed to converge faster than off-policy approaches.
How can exploration performance be improved in on-policy methods?
To enhance exploration, on-policy methods may introduce “hacks” like ε-greedy policies, which are inherently sub-optimal for the ultimate target policy
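A minimal sketch of the ε-greedy hack (illustrative code): with probability ε the agent ignores its learned preferences entirely and acts uniformly at random, so the deployed behavior is deliberately sub-optimal relative to the greedy target policy.

import numpy as np

rng = np.random.default_rng(0)

def eps_greedy_action(q_values, eps=0.1):
    # Undirected exploration: random action with probability eps,
    # greedy action otherwise.
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

action = eps_greedy_action(np.array([0.2, 1.5, -0.3]))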
What is the distinction between model-free and model-based Reinforcement Learning?
The distinction between model-free and model-based Reinforcement Learning centers on whether the algorithm explicitly learns or utilizes a model of the environment’s dynamics.
Describe Model-free RL
Model-free RL approaches focus on directly learning a policy or value functions from observed samples, without constructing or relying on an explicit model of the environment's dynamics.
The choice between model-free and model-based approaches can often be framed by a
a "policy-centric vs. model-centric" perspective: which is conceptually simpler to model for a given task?
Is it easier to define what the agent should do (policy) or how the environment behaves (model)?
a "policy-centric vs. model-centric" perspective: which is conceptually simpler to model for a given task?
(example)
For instance, in the cart-pole problem, balancing the pole might be more intuitively modeled by a policy (e.g., "move left if the pole falls left") without needing to understand the underlying physics. In contrast, for a game like Go, where the rules (the model) are clearly defined, a model-based search for promising moves might be more straightforward for a beginner than directly learning a policy.
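A minimal sketch of the policy-centric view for cart-pole (assumes the Gymnasium package and its CartPole-v1 environment; the threshold rule is illustrative): a hand-written rule that pushes the cart toward the side the pole is falling requires no model of the physics at all.

import gymnasium as gym

def falling_direction_policy(obs):
    # CartPole observation: [cart position, cart velocity, pole angle, pole angular velocity].
    # Push right (action 1) if the pole is tipping right, otherwise push left (action 0).
    _, _, angle, angular_velocity = obs
    return 1 if angle + 0.5 * angular_velocity > 0 else 0

env = gym.make("CartPole-v1")
obs, _ = env.reset(seed=0)
total_reward, done = 0.0, False
while not done:
    obs, reward, terminated, truncated, _ = env.step(falling_direction_policy(obs))
    total_reward += reward
    done = terminated or truncated
env.close()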
Value-Learning vs. Policy Gradient Methods
Value-learning and Policy Gradient methods represent two fundamental approaches to optimizing an agent’s behavior in Reinforcement Learning, differing primarily in what they directly optimize.
Value-learning methods, such as Q-learning and DQN, aim
Value-learning methods, such as Q-learning and DQN, aim to estimate the optimal value function (e.g., the Q-value function), which then implicitly defines the optimal policy. The policy is derived by selecting the action that maximizes the estimated Q-value for a given state
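A minimal sketch of that last step (illustrative tabular numbers): the policy is never stored explicitly; it is read off the estimated Q-values with an argmax per state.

import numpy as np

Q = np.array([[0.1, 0.9],                            # estimated Q-values: rows are states,
              [0.7, 0.2],                            # columns are actions
              [0.4, 0.5]])

# Implicit policy: in each state, take the action with the highest Q-value.
greedy_policy = np.argmax(Q, axis=1)                 # -> array([1, 0, 1])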
Policy Gradient (PG) methods (by contrast to Value-learning)
Policy Gradient (PG) methods, by contrast, directly optimize the policy function itself. They achieve this by performing gradient ascent on an objective function that quantifies the expected return, effectively making high-reward actions more probable
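For reference, the standard score-function (REINFORCE) form of that gradient-ascent step, written in standard notation rather than anything specific to these cards, where G_t is the return following time step t and α is a step size:

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right],
\qquad
\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)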