ML Fundamental Theory Flashcards
(8 cards)
Why does the L1 regularization push coefficients to zero but not L2?
TLDR:
The L1 norm pushes coefficients to zero because its gradient has constant magnitude and because its constraint region (a diamond) has corners that sit on the axes, where some coefficients are exactly zero.
The L2 norm shrinks coefficients but doesn’t push them to zero because its gradient weakens as coefficients get smaller.
L1 Norm (Lasso):
Penalty: The L1 norm adds a penalty proportional to the absolute values of the coefficients to the loss function.
Effect: Because the L1 penalty grows linearly, it encourages sparsity by setting some coefficients exactly to zero. This happens because the magnitude of the L1 norm’s gradient is constant (it doesn’t depend on the size of the coefficient), so the penalty keeps pushing small coefficients all the way to zero, effectively “shrinking” them out of the model.
Why Zero?: When a feature’s contribution to the loss doesn’t outweigh its L1 penalty, the optimization drives that coefficient to zero, leading to sparse models. This makes Lasso useful for feature selection, as it eliminates irrelevant features.
L2 Norm (Ridge):
Penalty: The L2 norm adds a penalty proportional to the squares of the coefficients to the loss function.
Effect: Because the L2 penalty’s gradient is proportional to the coefficient itself, the shrinking force weakens as a coefficient approaches zero. Coefficients are pulled toward zero but are never set exactly to zero.
Why Not Zero?: Near zero, the L2 penalty and its gradient become vanishingly small, so there is no remaining force to push a coefficient the rest of the way. All features stay in the model, just with smaller weights.
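A minimal numeric sketch of both behaviors (the penalty strength, step size, and coefficient values below are illustrative assumptions, not from the card): a single regularized update zeroes out the small coefficient under L1 but only rescales it under L2.

```python
import numpy as np

lam, lr = 0.1, 1.0                  # assumed penalty strength and step size
w = np.array([0.05, -0.5, 2.0])     # assumed coefficients

# L1 (soft-thresholding): the penalty's gradient has constant magnitude lam,
# so any coefficient with |w| <= lr * lam is set exactly to zero.
w_l1 = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
print(w_l1)   # [ 0.  -0.4  1.9] -- the small coefficient is now exactly zero

# L2: the penalty's gradient is proportional to w itself, so the update
# rescales w toward zero but never reaches it exactly.
w_l2 = w / (1.0 + 2 * lr * lam)
print(w_l2)   # every coefficient shrinks, none becomes zero
```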
What is the learning rate in gradient descent?
The learning rate controls the size of the step the algorithm takes on each iteration: each update moves the parameters opposite the gradient, scaled by the learning rate (θ ← θ − η ∇J(θ), where η is the learning rate).
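A toy sketch (an assumed example, not from the card): minimizing f(w) = w², whose gradient is 2w, with a small and a too-large learning rate.

```python
# Minimizing f(w) = w**2 by gradient descent; the gradient is 2*w.
for lr in (0.1, 1.1):
    w = 1.0
    for _ in range(5):
        w -= lr * 2 * w          # step size scales with the learning rate
    print(lr, round(w, 3))       # lr=0.1 heads toward 0; lr=1.1 overshoots and diverges
```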
What is Stochastic Gradient Descent? What are its pros and cons vs regular gradient descent?
Stochastic Gradient Descent (SGD) is a popular optimization algorithm used in various machine learning models, especially in training neural networks.
It’s a variant of the classic gradient descent algorithm but differs in how it updates the model parameters.
Key Idea:
In traditional gradient descent, you calculate the gradient (the direction of steepest ascent) of the cost function using the entire training dataset. However, with large datasets, this can be computationally expensive and slow.
SGD addresses this by randomly selecting a single training example (or a small mini-batch of examples) in each iteration and computing the gradient based only on this subset. This makes the updates much faster but introduces some randomness into the process.
Advantages:
Faster Convergence: SGD often converges faster than batch gradient descent, especially for large datasets, because it updates the parameters more frequently.
Escaping Local Minima: The randomness introduced by selecting random examples in each iteration can help the algorithm escape local minima and find a better global solution.
Online Learning: SGD can be used in online learning scenarios where data arrives in a continuous stream, as it can update the model incrementally with each new example.
Lower Memory Use: Only a single example (or a small mini-batch) needs to be held in memory per update, rather than the entire dataset.
Disadvantages:
Noisier Updates: Due to the random sampling, the updates in SGD are more noisy than in batch gradient descent, which can lead to fluctuations in the cost function during training.
Requires Tuning: The learning rate and batch size need to be carefully tuned to balance convergence speed and stability.
Convergence: While SGD often converges faster initially, it can take longer to reach the final optimal solution compared to batch gradient descent.
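A minimal sketch of SGD for linear regression (the data, learning rate, and variable names are illustrative assumptions, not from the card): each update uses the gradient from one randomly chosen example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # toy features (assumed)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

w, lr = np.zeros(3), 0.01
for epoch in range(5):
    for i in rng.permutation(len(X)):          # visit examples in random order
        err = X[i] @ w - y[i]                  # prediction error on ONE example
        w -= lr * err * X[i]                   # gradient step from that example alone
print(w)                                       # close to the true [2.0, -1.0, 0.5]
```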
What is mini-batch gradient descent? What are its benefits?
Mini-batch gradient descent (MBGD) is a widely used optimization algorithm in machine learning that strikes a balance between the speed of stochastic gradient descent (SGD) and the stability of batch gradient descent (BGD). Let’s explore its advantages and disadvantages:
Pros of Mini-batch Gradient Descent:
Faster Convergence: MBGD often converges faster than BGD, especially with large datasets. By calculating the gradient on smaller batches, it updates the model parameters more frequently, leading to quicker learning.
Reduced Noise: Compared to SGD, where updates are based on a single example, MBGD uses a batch of examples, resulting in a more stable estimate of the gradient and less noisy updates. This leads to smoother convergence and less oscillation around the minimum.
Efficient Computation: MBGD takes advantage of vectorization and parallelization capabilities of modern hardware, making the computation of gradients on mini-batches more efficient than calculating gradients on the entire dataset.
Improved Generalization: The noise introduced by using mini-batches can act as a form of regularization, helping the model generalize better to unseen data compared to BGD, which can be prone to overfitting.
Cons of Mini-batch Gradient Descent:
Tuning Hyperparameters: The performance of MBGD depends on the choice of the batch size. Choosing an optimal batch size can require experimentation and tuning.
Potential for Local Minima: While MBGD can escape shallow local minima due to the noise introduced by mini-batches, it can still get stuck in deeper local minima, especially if the batch size is too large or the learning rate is not appropriately adjusted.
Implementation Complexity: Implementing MBGD can be slightly more complex than SGD or BGD, as it requires dividing the dataset into batches and managing the iterations over them.
Choosing the Right Batch Size:
Small Batch Size (close to SGD): Faster updates but more noisy gradients, leading to potentially slower overall convergence and more oscillations.
Large Batch Size (close to BGD): Slower updates but smoother gradients, leading to potentially faster overall convergence but a higher risk of getting stuck in local minima.
A common strategy is to start with a moderate batch size (e.g., 32 or 64), where averaging over the batch already stabilizes the gradient estimate (by the Central Limit Theorem, the estimate’s variance shrinks as the batch grows), and experiment to find the value that balances speed and stability for your specific problem and dataset.
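A minimal sketch of the batching loop (the data and names are illustrative assumptions, reusing the linear-regression setup from the SGD sketch above): each update averages the gradient over one mini-batch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # toy features (assumed)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

w, lr, batch_size = np.zeros(3), 0.05, 32
for epoch in range(20):
    idx = rng.permutation(len(X))              # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]      # one mini-batch of indices
        err = X[b] @ w - y[b]
        w -= lr * X[b].T @ err / len(b)        # gradient averaged over the batch
print(w)                                       # close to the true [2.0, -1.0, 0.5]
```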
What is L2 and L1 regularization? When to use one vs the other?
L1 (Lasso):
Feature Selection: Preferred when you have a large number of features and believe that only a subset of them are truly relevant. Lasso can automatically identify and eliminate irrelevant features by setting their coefficients to zero.
Interpretability: The resulting model is often easier to interpret due to its sparsity (fewer non-zero coefficients).
Effect: Encourages sparsity by setting some coefficients exactly to zero. This makes L1 regularization useful for feature selection, as it can eliminate irrelevant features.
L2 (Ridge):
Multicollinearity: More suitable when you have correlated features. Ridge regression can help stabilize the model by shrinking the coefficients of correlated features towards each other.
Prediction Accuracy: If your goal is purely predictive accuracy and you don’t need feature selection, Ridge might perform slightly better than Lasso, especially when all features are somewhat relevant.
Effect: Encourages smaller coefficients but does not set any coefficients exactly to zero. It distributes the penalty uniformly among the coefficients, leading to shrinkage but maintaining all features in the model.
Hybrid Approach: Elastic Net
Elastic Net is a combination of L1 and L2 regularization. It allows you to control the balance between feature selection (L1) and handling multicollinearity (L2). This can be a good option when you have a mix of these situations in your data.
Important Considerations:
Regularization Strength: The strength of regularization is controlled by a hyperparameter (often called lambda or alpha). You need to tune this hyperparameter to find the optimal balance between model complexity and overfitting.
Data Scaling: It’s important to standardize or normalize your features before applying regularization, as the penalty terms are sensitive to the scale of the features.
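A minimal scikit-learn sketch tying these points together (the synthetic data and alpha values are illustrative assumptions): the pipeline standardizes features before the penalty is applied, and alpha is the regularization strength.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] * 3.0 + X[:, 1] * -2.0 + rng.normal(size=200)  # only 2 relevant features

for model in (Lasso(alpha=0.1), Ridge(alpha=1.0), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    pipe = make_pipeline(StandardScaler(), model)   # scale BEFORE penalizing
    pipe.fit(X, y)
    coefs = pipe[-1].coef_
    print(type(model).__name__, np.sum(coefs != 0), "non-zero coefficients")
# Lasso (and typically Elastic Net) zero out irrelevant coefficients;
# Ridge keeps all 10 features, just with smaller weights.
```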
What is gradient descent?
Gradient descent is an algorithm used to minimize a function by iteratively moving in the direction of steepest descent. Imagine you’re standing on a hill and want to find the lowest point. Gradient descent is like taking small steps downhill, following the direction of the steepest slope, until you reach the bottom.
In machine learning, the “hill” is a cost function that measures how well the model is performing.
The “steps” are adjustments made to the model’s parameters (weights and biases) in the direction that most rapidly reduces the cost. This process continues iteratively until the model converges to a minimum point, hopefully representing the optimal solution.
The size of the steps is determined by a learning rate. A larger learning rate means faster convergence, but it can also lead to overshooting the minimum. A smaller learning rate is more cautious but might take longer to find the optimal solution.
Gradient descent is a fundamental optimization technique in machine learning and is used in various algorithms, including linear regression, logistic regression, and neural networks.
Mathematical Basis
Cost Function: The cost function (or loss function) quantifies the error between the predicted values and the actual values. The objective of training a model is to find parameters that minimize this error.
Gradient: The gradient of the cost function with respect to the parameters indicates the direction of the steepest increase in the cost. By moving in the opposite direction of the gradient, we move towards the steepest decrease in the cost.
Convergence to Minimum
Local and Global Minima: Gradient descent is designed to converge to a local minimum of the cost function. In convex problems, this local minimum is also the global minimum.
Stopping Criteria: The algorithm typically includes stopping criteria, such as a maximum number of iterations or a threshold for the change in the cost function, to determine when to stop the iterative updates.
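A minimal sketch of batch gradient descent on a convex cost (mean squared error for linear regression; the data and stopping threshold are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                  # toy features (assumed)
y = X @ np.array([1.5, -0.5]) + rng.normal(scale=0.1, size=200)

def cost(w):
    return np.mean((X @ w - y) ** 2)           # MSE cost function

w, lr = np.zeros(2), 0.1
prev = cost(w)
for step in range(10_000):                     # max-iterations stopping criterion
    grad = 2 * X.T @ (X @ w - y) / len(X)      # gradient of the MSE
    w -= lr * grad                             # step opposite the gradient
    if abs(prev - cost(w)) < 1e-10:            # cost-change stopping criterion
        break
    prev = cost(w)
print(step, w)                                 # w close to the true [1.5, -0.5]
```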
What is a likelihood function?
The likelihood function is the probability of observing the actual outcomes in your data, given the model’s parameters (coefficients). In logistic regression, we assume that the outcomes for each observation are independent. Therefore, the likelihood function is the product of the probabilities for all observations:
L(β) = ∏ᵢ [P(Yᵢ=1 | Xᵢ)]^Yᵢ · [1 − P(Yᵢ=1 | Xᵢ)]^(1−Yᵢ)
Instead of working with the likelihood function directly, we often use the log-likelihood function. This is because the log function transforms the product into a sum, making it easier to compute and optimize:
log L(β) = ∑ᵢ [Yᵢ · log(P(Yᵢ=1 | Xᵢ)) + (1−Yᵢ) · log(1 − P(Yᵢ=1 | Xᵢ))]
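A minimal sketch of the log-likelihood for logistic regression (the function and variable names are illustrative assumptions, not a fixed API):

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Log-likelihood of logistic regression: y is a 0/1 vector,
    X has one row per observation."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))        # P(Y_i = 1 | X_i) via the sigmoid
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))  # the sum above
```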
What is a convex function and why is it important?
A function is convex if the line segment between any two points on its graph lies on or above the graph: f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y) for all λ in [0, 1]. Intuitively, it is bowl-shaped, with no separate dips.
Why it matters: a convex function has no local minima other than the global minimum. So when the cost function is convex (as in linear regression with squared error, or logistic regression with the log-likelihood above), gradient descent with a suitable learning rate converges to the global optimum, as noted in the gradient descent card.