Machine Learning Flashcards

1
Q

What is overfitting in machine learning, and how can it be prevented?

A

Overfitting occurs when a model learns the training data too well, including noise and random fluctuations, resulting in poor generalization to unseen data. It can be prevented through techniques such as cross-validation, regularization (e.g., L1 or L2 regularization), early stopping, and using simpler models.
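
A minimal sketch of the symptom and one fix, assuming scikit-learn and a synthetic dataset (all values are illustrative): an unregularized linear model with many features relative to the sample size scores far better on the training set than on the test set, while an L2-regularized (Ridge) model narrows that gap.

    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression, Ridge

    # Many features relative to the sample size: a setting prone to overfitting.
    X, y = make_regression(n_samples=80, n_features=60, noise=15.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for model in (LinearRegression(), Ridge(alpha=10.0)):  # alpha sets the L2 penalty strength
        model.fit(X_train, y_train)
        print(type(model).__name__,
              "train R^2:", round(model.score(X_train, y_train), 3),
              "test R^2:", round(model.score(X_test, y_test), 3))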

2
Q

Explain the difference between supervised and unsupervised learning.

A

In supervised learning, the model learns from labeled data, where each example is associated with a target label. The goal is to learn a mapping from input features to target labels. In unsupervised learning, the model learns from unlabeled data, aiming to discover hidden patterns or structures within the data without explicit guidance.

3
Q

What evaluation metrics would you use for a binary classification problem?

A

Common evaluation metrics for binary classification include accuracy, precision, recall (sensitivity), F1-score, specificity, area under the ROC curve (AUC-ROC), and area under the precision-recall curve (AUC-PR).
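
For reference, a hedged sketch of computing several of these metrics with scikit-learn; the labels, predictions, and scores below are made-up toy values.

    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, roc_auc_score, average_precision_score)

    y_true = [0, 0, 1, 1, 1, 0, 1, 0]                    # ground-truth labels
    y_pred = [0, 1, 1, 1, 0, 0, 1, 0]                    # hard class predictions
    y_score = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3]   # predicted probabilities for class 1

    print("accuracy :", accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))
    print("recall   :", recall_score(y_true, y_pred))
    print("F1       :", f1_score(y_true, y_pred))
    print("AUC-ROC  :", roc_auc_score(y_true, y_score))           # uses scores, not hard labels
    print("AUC-PR   :", average_precision_score(y_true, y_score)) # area under the precision-recall curve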

4
Q

What are the advantages and disadvantages of decision trees?

A

Decision trees are interpretable and easy to understand, making them suitable for explaining decision-making processes. They can handle both numerical and categorical data and require minimal data preprocessing. However, decision trees are prone to overfitting, especially with complex datasets, and may not generalize well to unseen data.

5
Q

Explain the bias-variance tradeoff.

A

The bias-variance tradeoff refers to the tradeoff between the error due to bias and the error due to variance in machine learning models. High bias models are overly simplistic and may underfit the data, while high variance models are overly complex and may overfit the data. Finding the right balance between bias and variance is crucial for achieving good generalization performance.

6
Q

What is cross-validation, and why is it important?

A

Cross-validation is a technique used to assess the performance of machine learning models by splitting the dataset into multiple subsets (folds), training the model on some folds, and evaluating it on the remaining fold. It helps to estimate how well the model will generalize to unseen data and reduces the risk of overfitting.
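
A minimal illustration with scikit-learn; the iris dataset and logistic regression are just placeholders for any estimator.

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    # 5-fold cross-validation: train on 4 folds, evaluate on the held-out fold, repeat.
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print("fold accuracies:", scores)
    print("mean / std     :", scores.mean(), scores.std())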

7
Q

Describe the difference between batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.

A

Batch gradient descent computes the gradient of the cost function with respect to the parameters using the entire training dataset.
Stochastic gradient descent (SGD) computes the gradient using only one randomly chosen training example at a time, making each update much cheaper but noisier than batch gradient descent.
Mini-batch gradient descent computes the gradient using a subset (mini-batch) of the training dataset, striking a balance between the efficiency of SGD and the stability of batch gradient descent.
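
An illustrative numpy-only sketch of mini-batch gradient descent for least-squares regression (data and hyperparameters are made up); setting batch_size to 1 recovers SGD, and setting it to the full dataset size recovers batch gradient descent.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    true_w = np.array([2.0, -1.0, 0.5])
    y = X @ true_w + rng.normal(scale=0.1, size=200)

    w = np.zeros(3)
    lr, batch_size = 0.1, 32
    for epoch in range(50):
        idx = rng.permutation(len(X))                         # shuffle each epoch
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)    # gradient on the mini-batch only
            w -= lr * grad
    print("estimated weights:", w)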

8
Q

Explain the difference between generative and discriminative models.

A

Generative models learn the joint probability distribution of the input features and the target labels, allowing them to generate new samples similar to the training data. Discriminative models, on the other hand, directly learn the decision boundary between different classes without modeling the underlying probability distribution.

9
Q

What are the key components of a support vector machine (SVM)?

A

The key components of an SVM include the kernel function, which computes the similarity between data points in a high-dimensional feature space; the margin, which represents the distance between the decision boundary and the nearest data points (the support vectors); and the regularization parameter, which controls the tradeoff between maximizing the margin and minimizing classification errors.

10
Q

What is the curse of dimensionality, and how does it affect machine learning algorithms?

A

The curse of dimensionality refers to the phenomenon whereby the volume of the feature space grows exponentially with the number of dimensions. This makes the data increasingly sparse, so machine learning algorithms struggle to learn effectively from it, especially with limited training examples. Dimensionality reduction techniques such as PCA or feature selection can help mitigate this issue.

11
Q

Explain the concept of feature engineering.

A

Feature engineering involves creating new features or transforming existing features to improve the performance of machine learning models. This may include techniques such as scaling, normalization, encoding categorical variables, creating interaction terms, and extracting relevant information from raw data.

12
Q

What is ensemble learning, and why is it useful?

A

Ensemble learning involves combining multiple base learners to improve the overall performance of the model. This can be achieved through techniques such as bagging, boosting, and stacking. Ensemble methods are useful because they reduce overfitting, increase model robustness, and often result in better generalization performance compared to individual base learners.

13
Q

Describe the difference between K-means clustering and hierarchical clustering.

A

K-means clustering is a partitioning algorithm that divides the data into K clusters by iteratively assigning data points to the nearest cluster centroid and updating each centroid to the mean of the points assigned to it. Hierarchical clustering instead builds a hierarchy of clusters, either by iteratively merging the most similar clusters (agglomerative) or by splitting clusters (divisive), producing a dendrogram that can be cut at any level to obtain the desired number of clusters.
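
A quick side-by-side with scikit-learn on synthetic blob data; both algorithms are asked for three clusters, and all settings are illustrative.

    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans, AgglomerativeClustering

    X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)   # centroid-based partitioning
    agglo = AgglomerativeClustering(n_clusters=3).fit(X)              # bottom-up merging
    print(kmeans.labels_[:10])
    print(agglo.labels_[:10])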

14
Q

What is regularization, and why is it important in machine learning?

A

Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function, which discourages overly complex models. Common regularization techniques include L1 regularization (Lasso), which encourages sparsity in the model parameters, and L2 regularization (Ridge), which penalizes the squared magnitude of the parameters.
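
A hedged sketch contrasting the two penalties with scikit-learn on synthetic data: Lasso drives many coefficients to exactly zero, while Ridge only shrinks them. The dataset and alpha values are arbitrary.

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso, Ridge

    X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                           noise=5.0, random_state=0)
    lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: sparse solution
    ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: small but nonzero weights
    print("non-zero Lasso coefficients:", np.sum(lasso.coef_ != 0))
    print("non-zero Ridge coefficients:", np.sum(ridge.coef_ != 0))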

15
Q

What is the difference between a hyperparameter and a parameter in machine learning models?

A

Hyperparameters are configuration settings that are external to the model and are typically set before the learning process begins (e.g., learning rate, regularization parameter). Parameters, on the other hand, are internal to the model and are learned from the training data (e.g., weights and biases in neural networks).

16
Q

Explain the concept of cross-entropy loss and its role in classification tasks.

A

Cross-entropy loss, also known as log loss, measures the difference between the predicted probability distribution and the true probability distribution of the target labels. It is commonly used as the loss function for binary and multiclass classification tasks, where the goal is to minimize the cross-entropy between the predicted and true labels.
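
A minimal numpy sketch of binary cross-entropy on toy values; the helper name and the eps clipping constant (a numerical guard against log(0)) are my own illustration.

    import numpy as np

    def binary_cross_entropy(y_true, p_pred, eps=1e-12):
        p = np.clip(p_pred, eps, 1 - eps)
        return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

    y_true = np.array([1, 0, 1, 1])
    p_pred = np.array([0.9, 0.2, 0.7, 0.4])   # predicted probability of class 1
    print(binary_cross_entropy(y_true, p_pred))   # lower is better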

17
Q

What is the purpose of feature scaling, and what are some common techniques for feature scaling?

A

Feature scaling is the process of standardizing or normalizing the range of features in the dataset to ensure that they have similar scales. This helps prevent certain features from dominating the learning process and improves the convergence of optimization algorithms. Common techniques for feature scaling include min-max scaling, z-score normalization, and robust scaling.
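
For illustration, two of these techniques applied to a tiny toy matrix with scikit-learn:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # features on very different scales
    print(MinMaxScaler().fit_transform(X))    # min-max scaling to [0, 1]
    print(StandardScaler().fit_transform(X))  # z-score normalization (mean 0, std 1)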

18
Q

Explain the concept of bagging and how it is used in ensemble learning.

A

Bagging, short for bootstrap aggregating, is an ensemble learning technique that involves training multiple base learners on different bootstrap samples of the training data and combining their predictions through averaging or voting. Bagging helps reduce variance and improve the stability of the resulting model.
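
A small scikit-learn sketch comparing a bagged ensemble of decision trees with a single tree on synthetic data; the dataset and the number of estimators are arbitrary.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, random_state=0)
    # 50 trees, each trained on a bootstrap sample, combined by majority vote.
    bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
    print("bagged trees:", cross_val_score(bag, X, y, cv=5).mean())
    print("single tree :", cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean())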

19
Q

What is the purpose of dropout in neural networks, and how does it work?

A

Dropout is a regularization technique used in neural networks to prevent overfitting by randomly dropping a proportion of neurons during training. This forces the network to learn more robust features and reduces the reliance on any single neuron. Dropout effectively simulates training multiple neural networks with shared parameters.
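
A minimal numpy sketch of inverted dropout on toy activations; the function name and shapes are illustrative. At test time no units are dropped and, with this inverted form, no extra rescaling is needed.

    import numpy as np

    def dropout(activations, p_drop, rng):
        mask = rng.random(activations.shape) >= p_drop   # keep each unit with prob 1 - p_drop
        return activations * mask / (1.0 - p_drop)       # rescale so the expected activation is unchanged

    rng = np.random.default_rng(0)
    a = np.ones((2, 8))                                  # toy activations
    print(dropout(a, p_drop=0.5, rng=rng))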

20
Q

Describe the difference between a parametric and a non-parametric machine learning algorithm.

A

Parametric machine learning algorithms make strong assumptions about the functional form of the underlying data distribution and have a fixed number of parameters that are learned from the training data. Non-parametric algorithms, on the other hand, make fewer assumptions about the data distribution and have a flexible number of parameters that grow with the size of the training data.

21
Q

What are some common techniques for handling missing data in machine learning?

A

Common techniques for handling missing data include imputation (e.g., replacing missing values with the mean, median, or mode), deletion (e.g., removing rows or columns with missing values), and using algorithms that can handle missing data directly (e.g., tree-based methods, K-nearest neighbors).
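
A tiny imputation example with scikit-learn's SimpleImputer; the toy matrix and the mean strategy are chosen arbitrarily.

    import numpy as np
    from sklearn.impute import SimpleImputer

    X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])
    imputer = SimpleImputer(strategy="mean")   # could also be "median" or "most_frequent"
    print(imputer.fit_transform(X))            # NaNs replaced by each column's mean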

22
Q

Explain the concept of feature importance and how it can be computed in machine learning models.

A

Feature importance measures the contribution of each feature to the predictive performance of the model. It can be computed using techniques such as permutation importance, which evaluates the decrease in model performance when the values of a feature are randomly permuted, or using model-specific techniques such as feature importance scores in decision trees or coefficients in linear models.
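
A hedged sketch of both approaches with scikit-learn; the dataset and model are placeholders.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # Permutation importance: drop in held-out accuracy when each feature is shuffled.
    result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
    print(result.importances_mean.round(3))
    # Tree ensembles also expose impurity-based importances directly:
    print(model.feature_importances_.round(3))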

23
Q

What is the trade-off between bias and variance, and how does it affect the performance of machine learning models?

A

The bias-variance trade-off refers to the trade-off between the error due to bias (underfitting) and the error due to variance (overfitting) in machine learning models. Models with high bias are too simplistic and may fail to capture the underlying patterns in the data, while models with high variance are too complex and may fit the noise in the data. Finding the right balance between bias and variance is essential for achieving good generalization performance.

24
Q

Explain the difference between a ROC curve and a precision-recall curve.

A

ROC (Receiver Operating Characteristic) curves plot the true positive rate (sensitivity) against the false positive rate (1-specificity) for different threshold values, showing the trade-off between sensitivity and specificity. Precision-recall curves, on the other hand, plot precision against recall (sensitivity) for different threshold values, focusing on the trade-off between precision and recall for imbalanced datasets.

25
Q

What is the difference between L1 and L2 regularization, and when would you use each?

A

L1 regularization (Lasso) adds a penalty term proportional to the absolute value of the model parameters, encouraging sparsity and feature selection. L2 regularization (Ridge) adds a penalty term proportional to the squared magnitude of the model parameters, preventing large weights and reducing overfitting. L1 regularization is often used when feature selection is desired, while L2 regularization is used for general regularization.

26
Q

Explain the concept of imbalanced datasets and how it can affect machine learning models.

A

Imbalanced datasets occur when one class or category is significantly more prevalent than others in the training data. This imbalance can lead to biased models that prioritize the majority class and perform poorly on the minority class. Techniques for handling imbalanced datasets include resampling (e.g., oversampling, undersampling), using different evaluation metrics (e.g., F1-score, AUC-ROC), and algorithmic approaches (e.g., cost-sensitive learning, ensemble methods).

27
Q

What is the difference between precision and recall, and when would you use each as an evaluation metric?

A

Precision measures the proportion of true positive predictions among all positive predictions made by the model, while recall measures the proportion of true positive predictions among all actual positive instances in the dataset. Precision is useful when minimizing false positive predictions is important (e.g., spam detection), while recall is useful when minimizing false negative predictions is important (e.g., cancer diagnosis).

28
Q

Describe the bias-variance decomposition of the expected prediction error.

A

The bias-variance decomposition of the expected prediction error decomposes the expected squared error of the model into three components: bias^2, variance, and irreducible error. Bias^2 represents the error due to the difference between the average prediction of the model and the true value, variance represents the variability of the model predictions across different training datasets, and irreducible error represents the noise inherent in the data that cannot be reduced by the model.
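
In symbols, for squared-error loss (where \hat{f} is the learned model and \sigma^2 the irreducible noise):

    E[(y - \hat{f}(x))^2] = \mathrm{Bias}[\hat{f}(x)]^2 + \mathrm{Var}[\hat{f}(x)] + \sigma^2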

29
Q

What is the difference between feature selection and feature extraction?

A

Feature selection involves selecting a subset of the original features in the dataset that are most relevant to the target variable, while discarding irrelevant or redundant features. Feature extraction, on the other hand, involves creating new features from the existing features in the dataset through techniques such as dimensionality reduction (e.g., PCA) or transformation (e.g., polynomial features).

30
Q

Explain the concept of kernel trick in support vector machines (SVM).

A

The kernel trick is a technique used in support vector machines (SVM) to implicitly map the input features into a higher-dimensional feature space without explicitly computing the transformation. This allows SVMs to efficiently model complex nonlinear relationships between the input features and the target variable by computing the dot product between data points in the transformed feature space.
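
A small illustration with scikit-learn: on data that is not linearly separable, an SVM with an RBF kernel typically outperforms one with a linear kernel. The dataset and kernels here are illustrative.

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    X, y = make_moons(n_samples=300, noise=0.2, random_state=0)  # not linearly separable
    for kernel in ("linear", "rbf"):
        acc = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
        print(kernel, "kernel accuracy:", round(acc, 3))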

31
Q

What are the assumptions of linear regression, and how can violations of these assumptions affect the model?

A

Linear regression assumes that the relationship between the independent variables and the target variable is linear, that the errors are normally distributed with constant variance (homoscedasticity), that the errors are independent, and that there is no multicollinearity among the independent variables. Violations of these assumptions can lead to biased parameter estimates, inflated standard errors, and unreliable predictions.

32
Q

Explain the concept of grid search and how it is used for hyperparameter tuning.

A

Grid search is a technique used for hyperparameter tuning, where a predefined set of hyperparameters is exhaustively searched over a grid of possible values, and the model is trained and evaluated using each combination of hyperparameters. The combination of hyperparameters that results in the best performance on a validation set is selected as the optimal set of hyperparameters for the model.
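
A minimal GridSearchCV sketch with scikit-learn; the estimator and the grid of C and gamma values are arbitrary examples.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}    # grid of candidate values
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)   # evaluates every combination with CV
    search.fit(X, y)
    print(search.best_params_, search.best_score_)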

33
Q

What are some common techniques for model evaluation and selection in machine learning?

A

Common techniques for model evaluation and selection include cross-validation, in which the dataset is split into multiple subsets (folds) and the model is trained and evaluated on each fold; hyperparameter tuning, in which the optimal set of hyperparameters is selected using techniques such as grid search or random search; and model comparison using evaluation metrics such as accuracy, precision, recall, F1-score, AUC-ROC, or AUC-PR.

34
Q

Explain the concept of feature importance in tree-based models such as decision trees or random forests.

A

Feature importance measures the contribution of each feature to the predictive performance of the model. In tree-based models such as decision trees or random forests, feature importance can be computed based on how often a feature is used to split the data or how much it reduces the impurity (e.g., Gini impurity or entropy) at each node of the tree. Features that are frequently used near the top of the tree or result in large impurity reductions are considered more important.

35
Q

Explain the concept of the EM (Expectation-Maximization) algorithm and its application in machine learning.

A

The EM (Expectation-Maximization) algorithm is an iterative optimization algorithm used to estimate the parameters of probabilistic models with latent variables. It consists of two main steps: the E-step, where the expected values of the latent variables are computed given the current model parameters, and the M-step, where the model parameters are updated to maximize the likelihood of the observed data given the expected values of the latent variables. The EM algorithm is commonly used in unsupervised learning tasks such as clustering (e.g., Gaussian mixture models) or latent variable models (e.g., factor analysis).
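
As one concrete application, scikit-learn's GaussianMixture fits a mixture of Gaussians with EM; the blob data below is synthetic and the number of components is chosen to match it.

    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
    gmm = GaussianMixture(n_components=3, random_state=0).fit(X)  # parameters estimated by EM
    print(gmm.means_.round(2))      # recovered component means
    print(gmm.weights_.round(2))    # mixing proportions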

36
Q

What are some common techniques for handling multicollinearity in linear regression models?

A

Multicollinearity occurs when two or more independent variables in a linear regression model are highly correlated with each other, leading to instability in the estimation of the model parameters. Common techniques for handling multicollinearity include removing one of the correlated variables, combining the correlated variables into a single composite variable (e.g., principal component analysis), or using regularization techniques such as ridge regression or LASSO regression to penalize the magnitude of the model parameters.

37
Q

What is the purpose of early stopping in machine learning, and how does it work?

A

Early stopping is a regularization technique used to prevent overfitting by stopping the training process before the model starts to overfit the training data. It works by monitoring the performance of the model on a separate validation set during training and stopping the training process when the performance on the validation set starts to degrade (e.g., when the validation loss stops decreasing or starts to increase). Early stopping helps prevent the model from memorizing the training data and encourages it to generalize to unseen data.
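
One concrete way to get this behaviour without writing the loop by hand: scikit-learn's MLPClassifier exposes early stopping through its early_stopping, validation_fraction, and n_iter_no_change arguments. The dataset and settings below are illustrative.

    from sklearn.datasets import make_classification
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=1000, random_state=0)
    # Hold out 10% of the training data as a validation set; stop once the
    # validation score fails to improve for 10 consecutive epochs.
    model = MLPClassifier(max_iter=500, early_stopping=True,
                          validation_fraction=0.1, n_iter_no_change=10,
                          random_state=0)
    model.fit(X, y)
    print("stopped after", model.n_iter_, "epochs")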

38
Q

What is the difference between batch normalization and layer normalization in neural networks?

A

Batch normalization and layer normalization are normalization techniques used in neural networks to improve the convergence and stability of training. Batch normalization normalizes each feature by subtracting the mean and dividing by the standard deviation computed across the examples in the mini-batch, so its statistics depend on the batch composition and size. Layer normalization normalizes each example by subtracting the mean and dividing by the standard deviation computed across the features of that single example, so it is independent of the batch. Both apply a learnable scale and shift after normalization. Batch normalization is more commonly used in feedforward and convolutional networks, while layer normalization is more commonly used in recurrent networks and transformers.
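
A numpy sketch of the difference on a toy (batch, features) matrix; the learnable scale and shift parameters are omitted for brevity.

    import numpy as np

    x = np.random.default_rng(0).normal(size=(4, 3))  # rows = examples in the batch, columns = features
    # Batch norm: statistics per feature, computed across the batch dimension.
    bn = (x - x.mean(axis=0)) / x.std(axis=0)
    # Layer norm: statistics per example, computed across that example's features.
    ln = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)
    print(bn.round(2))
    print(ln.round(2))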

39
Q

Explain the concept of word embeddings and their role in natural language processing (NLP).

A

Word embeddings are dense vector representations of words in a high-dimensional vector space, where words with similar meanings are represented by vectors that are close together in the space. Word embeddings capture semantic relationships between words and can be learned from large text corpora using techniques such as word2vec, GloVe (Global Vectors for Word Representation), or fastText. Word embeddings are commonly used as input features for various NLP tasks such as sentiment analysis, named entity recognition, machine translation, and text classification.

40
Q

Explain the concept of gradient descent and how it is used to train machine learning models.

A

Gradient descent is an optimization algorithm that minimizes the loss function of a machine learning model by iteratively updating the model parameters in the direction of steepest descent of the loss. At each iteration, the gradient of the loss with respect to the parameters is computed (analytically for simple models, or via backpropagation in neural networks), and the parameters are updated by subtracting the gradient scaled by the learning rate. Training continues until a convergence criterion is met, such as a maximum number of iterations or a sufficiently small change in the loss. Gradient descent can be applied to many machine learning models, including linear regression, logistic regression, neural networks, and support vector machines.
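
A numpy-only sketch of batch gradient descent fitting a least-squares linear regression; the data, learning rate, and number of steps are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = X @ np.array([3.0, -2.0]) + 1.0 + rng.normal(scale=0.1, size=100)
    Xb = np.hstack([X, np.ones((100, 1))])          # add a bias column

    w = np.zeros(3)
    learning_rate = 0.1
    for step in range(500):
        grad = 2 * Xb.T @ (Xb @ w - y) / len(y)     # gradient of the mean squared error
        w -= learning_rate * grad                   # step opposite the gradient
    print("learned weights and bias:", w.round(3))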