Core Tenets' Impact on RL Algorithm Selection Flashcards

(110 cards)

1
Q

Number of environment interactions needed for a good policy. Critical for costly real-world data.

A

Sample Efficiency

Trade-off with wall clock time: cheap samples favor model-free; expensive samples favor model-based.

2
Q

In RL, what does Stability & Convergence refer to?

A

Ease and speed of convergence; sensitivity to seeds/hyperparameters.

Improvements often add hyperparameters, hurting generalization. Local optima common.

3
Q

Performance on unseen tasks/situations.

A

Generalization

Trade-off with model fidelity: explicit models are efficient but brittle; direct policies are adaptable but less efficient.

4
Q

Underlying constraints (e.g., continuity, observability, action space type).

A

Assumptions & Approximations

Violating assumptions leads to failure. Problem structure dictates suitable algorithms.

5
Q

Ease of computation and ability to scale horizontally.

A

Computational Simplicity & Parallelism

High parallelism can reduce wall clock time even with low sample efficiency.

6
Q

Effectiveness in discovering states/actions.

A

Exploration

Exploration quality trades off against policy optimality: on-policy exploration "hacks" (e.g., ε-greedy) bias the target policy, while off-policy methods decouple the behavior policy, allowing broader exploration.

7
Q

Error from assumptions vs. sensitivity to data fluctuations.

A

Bias vs. Variance

Reducing variance often hurts sample efficiency; conflicting objectives.

8
Q

Bias vs. Variance
Identify the matching model for each of these descriptions:

a. High variance, zero bias example
b. Low variance, biased example
c. Lower variance, but hyperparameter sensitive.
d. Model example that is vulnerable to variance

(1. one step Temporal Difference,
2. Monte-Carlo,
3. Policy Gradient,
4. off-policy Temporal Difference(TD) )

A

Bias vs. Variance

  1. one step Temporal Difference(TD)
    b. Low variance, biased example
  2. Monte-Carlo
    a. High variance, zero bias example
  3. Policy Gradient
    d. Model example that is vulnerable to variance
  4. Off-policy Temporal Difference(TD)
    c. Lower variance, but hyperparameter sensitive.
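
To make the matching concrete, here is the standard comparison of the two targets (notation assumed: G_t for the return, V for the value estimate, gamma for the discount factor):

```latex
% Monte-Carlo target: the full sampled return (zero bias, but high variance
% because it sums many random rewards along a single trajectory)
G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{T-t} r_T

% 1-step TD target: one sampled reward plus a bootstrapped estimate
% (low variance, but biased while V is still inaccurate)
G_t^{\mathrm{TD}(0)} = r_t + \gamma V(s_{t+1})
```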
9
Q

Computational Simplicity & Parallelism
Rate each model's computational cost & parallelizability:
Model-based
Computational cost - (Low, Medium, High)

Value-based/Model-based
Parallelizability - (Easy or Hard)

Evolutionary
Computational cost - (Low, Medium, High)
Parallelizability - (Low or High)

Model-free:
Computational cost?

A

Computational Simplicity & Parallelism

Model-based
Computational cost - High (iterative methods)

Value-based/Model-based
Parallelizability - Hard

Evolutionary
Computational cost - Low
Parallelizability - High

Model-free:
Computational cost - Appealing when samples are cheap.

10
Q

How do model-based methods do with sample efficiency?

A

Model-based methods generally stand out as the most sample-efficient because they leverage an internal understanding of the system dynamics, which significantly reduces the volume of samples needed for learning

11
Q

On-policy vs. Off-policy Learning

A

The fundamental distinction between on-policy and off-policy learning lies in the relationship between the policy used to collect data and the policy being optimized.

12
Q

On-policy methods

operate with a ___ that serves both as the behavior policy for sampling data and the target policy for optimization

A

A single policy

operate with a single policy that serves both as the behavior policy for sampling data and the target policy for optimization

13
Q

Policy Gradient:
On-policy or Off-policy Learning

A

On-policy

Policy Gradient methods exemplify this single-policy usage: actions are sampled from a policy (π), and the observed rewards are then used to optimize that same policy.
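
A minimal sketch of this single-policy loop, assuming a toy two-action bandit environment and a softmax policy with parameters theta (the environment, reward means, and learning rate are illustrative, not from the source); note that the same policy both samples actions and is updated from those samples:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                      # one logit per action (hypothetical policy parameters)
true_means = np.array([0.0, 1.0])        # hypothetical reward means for the toy environment
alpha = 0.1                              # learning rate (illustrative)

def policy(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()                   # softmax action probabilities

for step in range(500):
    probs = policy(theta)                        # behavior policy == target policy
    a = rng.choice(2, p=probs)                   # sample an action from pi_theta
    r = true_means[a] + rng.normal(scale=0.1)    # observe a reward
    grad_log_pi = -probs                         # gradient of log pi(a|theta) for a softmax policy
    grad_log_pi[a] += 1.0
    theta += alpha * r * grad_log_pi             # REINFORCE-style update of that same policy

print("learned action probabilities:", policy(theta))
```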

14
Q

Are Value-Learning methods
on-policy or off-policy learning?

A

It depends.

In some value-learning methods, observed rewards are used to fit a Q-value function, which subsequently derives the exact same policy used for data collection (on-policy); others, such as Q-learning and DQN, learn from data gathered by a separate behavior policy (off-policy).

15
Q

How do on-policy methods
do with sample efficiency?

A

A critical implication of this single-policy approach is that any change in the policy during optimization typically necessitates the collection of new samples, as old samples become irrelevant for the updated policy

This often leads to poor sample efficiency, as data collected for one gradient update may no longer be valid for subsequent updates.

16
Q

How do on-policy methods do with
convergence? Faster or slower than off-policy?

A

On-policy methods are sometimes claimed to converge faster than off-policy approaches.

17
Q

How can on-policy methods
improve exploration performance?

A

To enhance exploration, on-policy methods may introduce “hacks” like ε-greedy policies, which are inherently sub-optimal for the ultimate target policy
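
For reference, a minimal sketch of the ε-greedy "hack" (the function name and example values are illustrative assumptions):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=np.random.default_rng()):
    """With probability epsilon take a random action (explore),
    otherwise act greedily w.r.t. the current Q estimates (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # random exploratory action
    return int(np.argmax(q_values))               # greedy action

# Usage sketch: q_values would come from the learner; epsilon is often annealed over training.
action = epsilon_greedy(np.array([0.1, 0.5, 0.2]), epsilon=0.1)
```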

18
Q

What is the distinction between model-free and model-based Reinforcement Learning?

A

The distinction between model-free and model-based Reinforcement Learning centers on whether the algorithm explicitly learns or utilizes a model of the environment’s dynamics.

19
Q

Describe Model-free RL

A

Model-free RL approaches focus on directly learning a policy or value functions from observed samples, without constructing or relying on an explicit model of the environment's dynamics.

20
Q

The choice between model-free and model-based approaches can often be framed by what perspective?

A

A "policy-centric vs. model-centric" perspective: which is conceptually simpler to model for a given task?

Is it easier to define what the agent should do (policy) or how the environment behaves (model)?

21
Q

The "policy-centric vs. model-centric" perspective: which is conceptually simpler to model for a given task?
(Give an example.)

A

For instance, in the cart-pole problem, balancing the pole might be more intuitively modeled by a policy (e.g., “move left if the pole falls left”) without needing to understand the underlying physics. In contrast, for a game like Go, where the rules (the model) are clearly defined, a model-based search for promising moves might be more straightforward than directly learning a policy for a beginner

22
Q

Value-Learning vs. Policy Gradient Methods

A

Value-learning and Policy Gradient methods represent two fundamental approaches to optimizing an agent’s behavior in Reinforcement Learning, differing primarily in what they directly optimize.

23
Q

Value-learning methods, such as Q-learning and DQN, aim

A

Value-learning methods, such as Q-learning and DQN, aim to estimate the optimal value function (e.g., the Q-value function), which then implicitly defines the optimal policy. The policy is derived by selecting the action that maximizes the estimated Q-value for a given state
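
A minimal tabular sketch of this idea, assuming hypothetical state/action counts and hyperparameters; the policy is never stored explicitly and is simply read off the Q-table:

```python
import numpy as np

n_states, n_actions = 10, 4              # illustrative sizes
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99                 # illustrative hyperparameters

def q_update(s, a, r, s_next, done):
    # Bootstrap with the max over next actions (the off-policy Q-learning target).
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def greedy_action(s):
    # The policy is implicit: pick the action with the highest estimated Q-value.
    return int(np.argmax(Q[s]))
```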

24
Q

Policy Gradient (PG) methods (by contrast to Value-Learning)

A

Policy Gradient (PG) methods, by contrast, directly optimize the policy function itself. They achieve this by performing gradient ascent on an objective function that quantifies the expected return, effectively making high-reward actions more probable
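
In standard notation (assumed here: τ a sampled trajectory, R(τ) its return, π_θ the parameterized policy, α the step size), the objective and its gradient-ascent update look like:

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau) \big],
\qquad
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \textstyle\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \Big],
\qquad
\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)
```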

25
The core tension between Value-Learning and Policy Gradient methods
lies in their approach to optimization, representing an efficiency versus directness dilemma. Value-Learning prioritizes sample efficiency and stability (via off-policy and experience replay) at the cost of direct policy control and interpretability, particularly in continuous action spaces
26
On-policy ---------------- Core Characteristic/Mechanism
Single policy for sampling and optimization.
27
On-policy ------------------- Strengths
Simpler, less variance, potentially faster convergence (per step).
28
On-policy ------------------- Weaknesses
Poor sample efficiency (needs new samples for policy changes), exploration often relies on "hacks."
29
On-policy ------------------- Representative Algorithms
SARSA, Vanilla Policy Gradient, A3C, TRPO, PPO
30
Understand why Actor-Critic methods were developed as a hybrid approach, combining the strengths of (1) ___ and (2) ___ reinforcement learning.
1. policy-based (Actor) 2. value-based (Critic)
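
A minimal one-step advantage actor-critic loss sketch (PyTorch-style; the tensor shapes, network outputs, and default gamma are assumptions, and this is not a full training loop):

```python
import torch
from torch.distributions import Categorical

def actor_critic_losses(logits, value, reward, next_value, action, gamma=0.99):
    td_target = reward + gamma * next_value.detach()         # Critic's bootstrapped target
    advantage = td_target - value                            # how much better than expected
    critic_loss = advantage.pow(2).mean()                    # value-based part (Critic)
    log_prob = Categorical(logits=logits).log_prob(action)   # policy-based part (Actor)
    actor_loss = -(log_prob * advantage.detach()).mean()     # push up probability of good actions
    return actor_loss, critic_loss
```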
31
Off-policy ---------------- Core Characteristic/Mechanism
Separate behavior and target policies.
32
Off-policy ------------------- Strengths
Improved sample efficiency (reuses samples via importance sampling/replay buffer), better exploration control, enhanced stability (e.g., DQN).
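
A minimal replay-buffer sketch illustrating the sample reuse mentioned above (capacity and transition layout are illustrative assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions so an off-policy learner (e.g., DQN) can reuse them many times."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return tuple(zip(*batch))   # (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.buffer)
```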
33
Off-policy ------------------- Weaknesses
Can be more complex, hyperparameter sensitive, may have slower convergence in some cases.
34
Off-policy ------------------- Representative Algorithms
Q-learning, DQN, DDPG
35
Model-free RL ---------------- Core Characteristic/Mechanism
Learns policy/value functions directly from samples, ignores dynamics.
36
Model-free RL ------------------- Strengths
Fewer assumptions, wider applicability, good at complex policies, better generalization in some tasks.
37
Model-free RL ------------------- Weaknesses
Less sample efficient, vulnerable to overfitting with complex models.
38
Model-free RL ------------------- Representative Algorithms
Q-learning, DQN, Policy Gradient methods (e.g., PPO, A3C)
39
In reinforcement learning, what is a policy?
In reinforcement learning, a policy defines an agent's strategy for interacting with an environment. It essentially maps the agent's current state to an action, guiding its behavior and determining how it should behave in various situations. The goal is to learn an optimal policy that maximizes the cumulative reward over time.
40
Model-based RL ---------------- Core Characteristic/Mechanism
Learns/uses environment model to derive optimal policy.
41
Model-based RL ------------------- Strengths
Highly sample efficient, scalable (self-trained).
42
Model-based RL ------------------- Weaknesses
More assumptions, less generalized solutions, model can be complex to train, doesn't optimize policy directly.
43
Model-based RL ------------------- Representative Algorithms
PILCO, Guided Policy Search (GPS)
44
Value-learning ---------------- Core Characteristic/Mechanism
Estimates optimal value function (e.g., Q-value) to derive implicit policy.
45
Value-learning ------------------- Strengths
More sample efficient (especially with off-policy/replay), can improve stability.
46
Value-learning ------------------- Weaknesses
Difficult for continuous control, less interpretable, no convergence guarantee with deep nets, hyperparameter sensitive, prone to instabilities.
47
Value-learning ------------------- Representative Algorithms
Q-learning, DQN
48
Policy Gradient ---------------- Core Characteristic/Mechanism
Directly optimizes policy function via gradient ascent.
49
Policy Gradient ------------------- Strengths
Easy to set up, interpretable policy, supports discrete and continuous control, converges for convex functions.
50
Policy Gradient ------------------- Weaknesses
High variance, poor sample efficiency, challenging learning rate tuning, can fall into local optima.
51
Policy Gradient ------------------- Representative Algorithms
REINFORCE, TRPO, PPO, A2C/A3C (Actor-Critic)
52
Which methods exhibit high variance but possess zero bias?
Monte-Carlo (MC) methods. This characteristic arises because a stochastic policy generates different trajectories across different runs, and even minute changes in a trajectory can lead to significantly different cumulative rewards, resulting in high variance over multiple runs.
53
Which method employs a single-step lookup in computing the value function, resulting in low variance because only one action is involved and the change is minimal?
1-step Temporal Difference (TD) learning. Its outcome is biased, particularly during the initial phases of training.
54
While Monte-Carlo methods exhibit high variance with zero bias and 1-step Temporal Difference learning has low variance, which other method is highly susceptible to variance that can severely impede model convergence?
Policy Gradient (PG) methods are highly susceptible to variance, which can severely impede model convergence.
55
What strategies are used to mitigate variance in PG methods?
Utilizing a more effective advantage function or restricting gradient changes through trust regions. Increasing the number of samples (for instance, by running MC simulations many times or employing a large batch size) can also reduce variance, but this approach simultaneously hurts sample efficiency.
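
One of the simplest of these strategies, baseline subtraction plus advantage normalization, can be sketched as follows (the function name and the use of the mean return as a fallback baseline are illustrative choices):

```python
import numpy as np

def normalized_advantages(returns, values=None, eps=1e-8):
    """Subtract a baseline (a value estimate if available, else the mean return)
    and normalize, so PG gradient magnitudes do not swing wildly between batches."""
    returns = np.asarray(returns, dtype=np.float64)
    baseline = np.asarray(values, dtype=np.float64) if values is not None else returns.mean()
    adv = returns - baseline
    return (adv - adv.mean()) / (adv.std() + eps)
```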
56
Which methods adapt TD concepts, thereby achieving lower variance?
Off-policy actor-critic, Q-learning, and other value-fitting methods. Nevertheless, these methods frequently demand extensive hyperparameter searches to function correctly, and this hypersensitivity to hyperparameter tuning can detrimentally affect both stability and generalization.
57
The explicit statement that increasing sampling to reduce variance hurts what?
Sample efficiency, revealing a crucial interconnectedness between fundamental trade-offs.
58
It is not merely about balancing bias and variance in isolation, but also understanding how this balance impacts the cost of learning, which is frequently measured by
sample efficiency.
59
Researchers and practitioners must consciously decide which form of "cost" (bias, variance, or data) they are most willing to incur. This relates to what fundamental principle in ML development?
The "no free lunch" principle, applying not just to algorithm selection but also to the internal design choices within an algorithm.
60
How are Sample Efficiency and Wall Clock Time related?
The relationship between sample efficiency and wall clock time is a critical trade-off that influences practical RL deployment.
61
If an algorithm operates with fewer samples, it inherently requires
more effective optimization per sample. Sample efficiency, while often a primary concern, is not necessarily the ultimate goal
62
Which methods are the most sample-efficient, yet involve the most complex optimization calculations per sample?
Model-based methods
63
Which methods may achieve shorter wall clock times when data sampling is inexpensive or can be performed efficiently through simulation?
Model-free methods
64
For tasks such as robotic control, where collecting physical samples is prohibitively expensive, (model-based or model-free) methods are often favored.
model-based methods are often favored. In these scenarios, the time spent on trajectory planning, which is part of the model-based approach, constitutes a relatively small fraction of the overall wall clock time
65
Which methods, despite their low sample efficiency, can be practical solutions due to their inherently low computational complexity and high potential for parallelism, which significantly reduces overall wall clock time?
Evolutionary methods.
66
The viability of _______________ increases with the amount of samples available, which in turn influences which methods prove more efficient in terms of wall clock time.
data parallelism
67
The trade-off between sample efficiency and wall clock time introduces an economic dimension to algorithm selection. The "cost" of _____________________ (whether from real-world interactions or simulations) and the "cost" of _______________ (whether sequential or parallel) become primary drivers.
first blank -- "cost" of data acquisition second blank -- "cost" of computation
68
The observation that _______________ is not always the ultimate goal and that better optimization per sample is required when fewer samples are used, alongside the specific examples of model-based methods being computationally complex despite high sample efficiency, and model-free methods being faster with cheap samples, illustrates this economic dimension. The _______________ is the ultimate practical metric, representing a product of sample efficiency, computational efficiency per sample, and parallelizability.
first blank -- sample efficiency second blank -- "wall clock time"
69
what is the ultimate practical metric, representing a product of sample efficiency, computational efficiency per sample, and parallelizability?
"wall clock time"
70
If samples are expensive, minimizing ___________ becomes paramount. If samples are cheap, minimizing __________________becomes the priority.
first blank -- their number second blank -- the computational time to process them, through simpler calculations or parallelism,
71
A company with access to vast simulation infrastructure might prefer a ______________ algorithm, while a robotics lab might prioritize ____________ due to the prohibitive cost of physical interactions. This underscores that RL algorithm selection is a strategic decision involving resource allocation
first blank -- model-free, parallelizable second blank -- a model-based algorithm
72
Practical Guidelines for Algorithm Selection: Synthesizing the detailed analysis of decision factors and trade-offs, this section provides actionable advice for selecting an RL algorithm based on specific problem characteristics. Which RL algorithm is universally superior?
It is crucial to reiterate that no single RL algorithm is universally superior; continuous advancements and the integration of different methods are common practices in the field.
73
For real-world problems, especially those involving physical simulation, such as robotics, ___________ often emerges as the most critical factor due to the high cost and inherent slowness of data collection
sample efficiency
74
Learning in the Real World (________ Samples): If real-world learning is required and samples are extremely limited, ___________ RL approaches, such as Guided Policy Search (GPS), are frequently preferred due to their superior sample efficiency.
first blank -- Expensive second blank -- Model-based
75
Simulation Cost is a Factor ( __________ Cost): When simulation cost is a consideration but not prohibitive, _________ methods, including Q-learning and DDPG, are strong candidates. These methods often leverage experience replay to significantly improve sample efficiency
first blank -- Moderate Sample second blank -- Value-based
76
Simulation Cost is a Factor (Moderate Sample Cost): When simulation cost is a consideration but not prohibitive, Value-based methods, including __________ and __________, are strong candidates. These methods often leverage experience replay to significantly improve __________________.
first blank -- Q-learning second blank -- DDPG third blank -- sample efficiency
77
Learning in the Real World (Expensive Samples): If real-world learning is required and samples are extremely limited, Model-based RL approaches, such as _______________, are frequently preferred due to their superior _______________ .
first blank -- Guided Policy Search (GPS) second blank -- sample efficiency
78
Low Simulation Cost (________ Samples): If sampling data is inexpensive, for example, in fast computer simulations, Policy Gradient or Actor-Critic methods like TRPO, PPO, and A3C become highly appealing. _________ are generally easier to set up and tend to be less sensitive to hyperparameters
first blank - Cheap second blank - Policy Gradient methods
79
Low Simulation Cost (Cheap Samples): If sampling data is inexpensive, for example, in fast computer simulations, ________ or ____________ methods like TRPO, PPO, and A3C become highly appealing. Policy Gradient methods are generally easier to set up and tend to be less sensitive to hyperparameters
first blank -- Policy Gradient second blank -- Actor-Critic
80
Low Simulation Cost (Cheap Samples): If sampling data is inexpensive, for example, in fast computer simulations, Policy Gradient or Actor-Critic methods like ____, ___, and ___ become highly appealing. Policy Gradient methods are generally easier to set up and tend to be less sensitive to hyperparameters
first blank -- TRPO second blank -- PPO third blank -- A3C
81
RL Algorithm selection can be structured based on the ____________ or ______________ .
first blank -- cost of simulation second blank -- data acquisition
82
Beyond this primary selection based on the cost of simulation or data acquisition, how can the chosen approach be supplemented with additional methods?
For instance, one might add policy learning to a model-based RL framework (like GPS) or integrate value-learning components into a Policy Gradient algorithm. It is also important to consider that the choice of algorithms can be sensitive to the specific task, particularly due to varying curvature characteristics in their reward functions, which often necessitates some empirical experimentation.
83
Sample efficiency ___________________ Definition/Core Concern
Number of environment interactions needed for a good policy. Critical for costly real-world data.
84
Sample efficiency ___________________ Impact on Algorithm Choice (General Trends) Model-based:? Off-policy: ? On-policy: ? Evolutionary: ? (Lowest, Low, High, Highest)
Model-based: Highest (leverages dynamics). Off-policy: High (reuses samples). On-policy: Low (new samples per update). Evolutionary: Lowest.
85
Sample efficiency ___________________ Key Considerations/Trade-offs
Trade-off with wall clock time: cheap samples favor model-free; expensive samples favor model-based.
86
Stability & Convergence ___________________ Definition/Core Concern
Ease and speed of convergence; sensitivity to seeds/hyperparameters.
87
Stability & Convergence ___________________ Impact on Algorithm Choice (General Trends) Value-fitting: ? Model-based: ? Policy Gradient: ? (No guarantee, Generally converges, High variance)
Value-fitting: Problematic with deep nets (no guarantee, moving targets). Model-based: Generally converges (fits dynamics, but less generalized). Policy Gradient: High variance (can destroy training).
88
Stability & Convergence ___________________ Key Considerations/Trade-offs
Improvements often add hyperparameters, hurting generalization. Local optima common.
89
Generalization ___________________ Definition/Core Concern
Performance on unseen tasks/situations.
90
Generalization ___________________ Impact on Algorithm Choice (General Trends) Model-based:? Model-free:? (Less generalizable, can generalize better)
Model-based: Less generalized (prone to errors in unfamiliar situations). Model-free: Can generalize better (learns complex policies directly).
91
Generalization ___________________ Key Considerations/Trade-offs
Trade-off with model fidelity: explicit models are efficient but brittle; direct policies are adaptable but less efficient.
92
Assumptions & Approximations ___________________ Definition/Core Concern
Underlying constraints (e.g., continuity, observability, action space type).
93
Assumptions & Approximations ___________________ Impact on Algorithm Choice (General Trends) DQN: ? Q-learning: ? Policy Gradient: ? Value-fitting/Model-based: ? ( How does it handle? Discrete vs Continuous)
DQN: Discrete, low-dim control. Q-learning: Difficult for continuous. Policy Gradient: Supports both discrete/continuous. Value-fitting/Model-based: Often assume continuity.
94
Assumptions & Approximations ___________________ Key Considerations/Trade-offs
Violating assumptions leads to failure. Problem structure dictates suitable algorithms.
95
Exploration ___________________ Definition/Core Concern
Effectiveness in discovering states/actions.
96
Exploration ___________________ Impact on Algorithm Choice (General Trends) Off-policy:? On-policy:? (More explicit control, Limited)
Off-policy: More explicit control (e.g., ε-greedy, replay buffer), broader understanding. On-policy: Limited, often relies on "hacks."
97
Exploration ___________________ Key Considerations/Trade-offs
Exploration quality trades off against policy optimality: on-policy exploration "hacks" (e.g., ε-greedy) bias the target policy, while off-policy methods decouple the behavior policy, allowing broader exploration.
98
Computational Simplicity & Parallelism ___________________ Definition/Core Concern
Ease of computation and ability to scale horizontally.
99
Computational Simplicity & Parallelism ___________________ Impact on Algorithm Choice (General Trends) Model-based: ? Value-based/Model-based: ? Evolutionary: ? Model-free: ? (Computationally - cheap to expensive, Parallelism - easy to difficult )
Model-based: Computationally expensive (iterative methods). Value-based/Model-based: Hard to parallelize. Evolutionary: Low computation, highly parallel. Model-free: Appealing when samples are cheap.
100
Computational Simplicity & Parallelism ___________________ Key Considerations/Trade-offs
High parallelism can reduce wall clock time even with low sample efficiency.
101
Bias vs. Variance ___________________ Definition/Core Concern
Error from assumptions vs. sensitivity to data fluctuations.
102
Bias vs. Variance ___________________ Impact on Algorithm Choice (General Trends) (How does variance affect each?) Monte-Carlo: ? 1-step TD: ? Policy Gradient: ? Off-policy TD: ?
Monte-Carlo: High variance, zero bias. 1-step TD: Low variance, biased. Policy Gradient: Vulnerable to variance. Off-policy TD: Lower variance, but hyperparameter sensitive
103
Bias vs. Variance ___________________ Key Considerations/Trade-offs
Reducing variance often hurts sample efficiency; conflicting objectives.
104
What is the importance of logging in RL?
--Debugging and Troubleshooting: RL agents often exhibit unpredictable behavior during training. Detailed logs allow researchers to pinpoint when and why performance degrades, identifying issues such as exploding gradients, policy collapse, or environmental interaction errors.
--Reproducibility: To ensure that experiments can be replicated, all relevant parameters, random seeds, and environmental interactions must be recorded. This is crucial for validating research findings and building upon previous work.
--Performance Analysis: Tracking key metrics over time provides insights into the learning progress, convergence, and overall effectiveness of the agent. This includes monitoring rewards, losses, and other custom metrics.
--Hyperparameter Tuning: RL algorithms are notoriously sensitive to hyperparameters. Logging the performance across different hyperparameter configurations is essential for systematic tuning and identifying optimal settings.
--Comparison and Benchmarking: Consistent logging practices enable fair comparisons between different algorithms or variations of the same algorithm.
105
What are the four things that should be monitored during an RL experiment?
Training Metrics, Environment Interactions, Agent State, System Information.
106
What are the five parts to the comprehensive logging of Training Metrics?
--Episode Rewards: Total reward accumulated per episode. This is often the primary indicator of agent performance.
--Episode Lengths: Number of steps per episode.
--Losses: Policy loss, value loss, and any other relevant loss functions from the neural networks.
--Learning Rates: Current learning rates for optimizers, especially if using schedules.
--Gradient Norms: To detect exploding or vanishing gradients.
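
A minimal sketch of logging these training metrics with TensorBoard (the log directory and tag names are illustrative choices):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/experiment_1")   # hypothetical run directory

def log_training_metrics(step, episode_reward, episode_length,
                         policy_loss, value_loss, lr, grad_norm):
    writer.add_scalar("rollout/episode_reward", episode_reward, step)
    writer.add_scalar("rollout/episode_length", episode_length, step)
    writer.add_scalar("loss/policy", policy_loss, step)
    writer.add_scalar("loss/value", value_loss, step)
    writer.add_scalar("train/learning_rate", lr, step)
    writer.add_scalar("train/grad_norm", grad_norm, step)
```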
107
What are the two parts to the comprehensive logging of Environment Interactions?
--Number of Steps/Frames: Total interactions with the environment.
--Replay Buffer Size: For off-policy methods.
108
What are the three parts to the comprehensive logging of Agent State?
--Hyperparameters: All hyperparameters used for the run (e.g., discount factor, GAE lambda, batch size, network architecture details).
--Random Seeds: For reproducibility.
--Model Checkpoints: Periodically save the agent's policy and value function weights.
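
A sketch of recording these three pieces of agent state, assuming a PyTorch agent (the file names, the hyperparameter dict, and the policy/value_fn modules referenced in the comment are hypothetical):

```python
import json
import random

import numpy as np
import torch

def set_seed(seed: int):
    """Fix all relevant random seeds for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

hparams = {"gamma": 0.99, "gae_lambda": 0.95, "batch_size": 64, "seed": 42}  # illustrative values
set_seed(hparams["seed"])

with open("hparams.json", "w") as f:
    json.dump(hparams, f, indent=2)   # record every hyperparameter alongside the run

# Inside the training loop, periodically checkpoint the networks, e.g.:
# torch.save({"policy": policy.state_dict(), "value": value_fn.state_dict(), "step": step},
#            f"checkpoint_{step}.pt")
```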
109
What are the three parts to the comprehensive logging of System Information?
--CPU/GPU Usage: Resource consumption during training.
--Memory Usage: To identify potential leaks or inefficiencies.
--Wall Clock Time: Total training duration.
110
Name at least two common logging Tools/approaches for RL experiments
--Python's logging Module: For basic text-based logging to console and files. Useful for detailed step-by-step information.
--TensorBoard: A powerful visualization tool from TensorFlow, widely adopted across various deep learning frameworks (including PyTorch). It allows for plotting scalars (rewards, losses), visualizing network graphs, embedding projections, and more.
--Weights & Biases (W&B): A popular platform for experiment tracking, visualization, and collaboration. It offers more advanced features like hyperparameter sweeps, artifact management, and interactive dashboards.
--MLflow: An open-source platform for managing the ML lifecycle, including experiment tracking, reproducible runs, and model deployment.
--Custom File Logging: Simple CSV or JSON files can be used to store tabular data for later analysis, especially for smaller experiments.