Machine Learning Flashcards

1
Q

What is overfitting in machine learning, and how can it be prevented?

A

Overfitting occurs when a model learns the training data too well, including noise and random fluctuations, resulting in poor generalization to unseen data. It can be prevented through techniques such as cross-validation, regularization (e.g., L1 or L2 regularization), early stopping, and using simpler models.
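
A minimal sketch of the symptom and one fix, assuming scikit-learn and a synthetic dataset (all values are illustrative): an unregularized linear model with many features relative to the sample size scores far better on the training set than on the test set, while an L2-regularized (Ridge) model narrows that gap.

    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression, Ridge

    # Many features relative to the sample size: a setting prone to overfitting.
    X, y = make_regression(n_samples=80, n_features=60, noise=15.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for model in (LinearRegression(), Ridge(alpha=10.0)):  # alpha sets the L2 penalty strength
        model.fit(X_train, y_train)
        print(type(model).__name__,
              "train R^2:", round(model.score(X_train, y_train), 3),
              "test R^2:", round(model.score(X_test, y_test), 3))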

2
Q

Explain the difference between supervised and unsupervised learning.

A

In supervised learning, the model learns from labeled data, where each example is associated with a target label. The goal is to learn a mapping from input features to target labels. In unsupervised learning, the model learns from unlabeled data, aiming to discover hidden patterns or structures within the data without explicit guidance.

3
Q

What evaluation metrics would you use for a binary classification problem?

A

Common evaluation metrics for binary classification include accuracy, precision, recall (sensitivity), F1-score, specificity, area under the ROC curve (AUC-ROC), and area under the precision-recall curve (AUC-PR).
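
For reference, a hedged sketch of computing several of these metrics with scikit-learn; the labels, predictions, and scores below are made-up toy values.

    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, roc_auc_score, average_precision_score)

    y_true = [0, 0, 1, 1, 1, 0, 1, 0]                    # ground-truth labels
    y_pred = [0, 1, 1, 1, 0, 0, 1, 0]                    # hard class predictions
    y_score = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3]   # predicted probabilities for class 1

    print("accuracy :", accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))
    print("recall   :", recall_score(y_true, y_pred))
    print("F1       :", f1_score(y_true, y_pred))
    print("AUC-ROC  :", roc_auc_score(y_true, y_score))           # uses scores, not hard labels
    print("AUC-PR   :", average_precision_score(y_true, y_score)) # area under the precision-recall curve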

4
Q

What are the advantages and disadvantages of decision trees?

A

Decision trees are interpretable and easy to understand, making them suitable for explaining decision-making processes. They can handle both numerical and categorical data and require minimal data preprocessing. However, decision trees are prone to overfitting, especially with complex datasets, and may not generalize well to unseen data.

5
Q

Explain the bias-variance tradeoff.

A

The bias-variance tradeoff refers to the tradeoff between the error due to bias and the error due to variance in machine learning models. High bias models are overly simplistic and may underfit the data, while high variance models are overly complex and may overfit the data. Finding the right balance between bias and variance is crucial for achieving good generalization performance.

6
Q

What is cross-validation, and why is it important?

A

Cross-validation is a technique used to assess the performance of machine learning models by splitting the dataset into multiple subsets (folds), training the model on some folds, and evaluating it on the remaining fold. It helps to estimate how well the model will generalize to unseen data and reduces the risk of overfitting.
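
A minimal illustration with scikit-learn; the iris dataset and logistic regression are just placeholders for any estimator.

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    # 5-fold cross-validation: train on 4 folds, evaluate on the held-out fold, repeat.
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print("fold accuracies:", scores)
    print("mean / std     :", scores.mean(), scores.std())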

7
Q

Describe the difference between batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.

A

Batch gradient descent computes the gradient of the cost function with respect to the parameters using the entire training dataset.
Stochastic gradient descent (SGD) computes the gradient using only one randomly chosen training example at a time, making each update much cheaper but noisier than batch gradient descent.
Mini-batch gradient descent computes the gradient using a subset (mini-batch) of the training dataset, striking a balance between the efficiency of SGD and the stability of batch gradient descent.
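
An illustrative numpy-only sketch of mini-batch gradient descent for least-squares regression (data and hyperparameters are made up); setting batch_size to 1 recovers SGD, and setting it to the full dataset size recovers batch gradient descent.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    true_w = np.array([2.0, -1.0, 0.5])
    y = X @ true_w + rng.normal(scale=0.1, size=200)

    w = np.zeros(3)
    lr, batch_size = 0.1, 32
    for epoch in range(50):
        idx = rng.permutation(len(X))                         # shuffle each epoch
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)    # gradient on the mini-batch only
            w -= lr * grad
    print("estimated weights:", w)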

8
Q

Explain the difference between generative and discriminative models.

A

Generative models learn the joint probability distribution of the input features and the target labels, allowing them to generate new samples similar to the training data. Discriminative models, on the other hand, directly learn the decision boundary between different classes without modeling the underlying probability distribution.

9
Q

What are the key components of a support vector machine (SVM)?

A

The key components of an SVM include the kernel function, which computes the similarity between data points in a high-dimensional feature space; the margin, which represents the distance between the decision boundary and the nearest data points (the support vectors); and the regularization parameter, which controls the tradeoff between maximizing the margin and minimizing classification errors.

10
Q

What is the curse of dimensionality, and how does it affect machine learning algorithms?

A

The curse of dimensionality refers to the phenomenon whereby the volume of the feature space grows exponentially with the number of dimensions. This makes the data increasingly sparse, so machine learning algorithms struggle to learn effectively from it, especially with limited training examples. Dimensionality reduction techniques such as PCA or feature selection can help mitigate this issue.

11
Q

Explain the concept of feature engineering.

A

Feature engineering involves creating new features or transforming existing features to improve the performance of machine learning models. This may include techniques such as scaling, normalization, encoding categorical variables, creating interaction terms, and extracting relevant information from raw data.

12
Q

What is ensemble learning, and why is it useful?

A

Ensemble learning involves combining multiple base learners to improve the overall performance of the model. This can be achieved through techniques such as bagging, boosting, and stacking. Ensemble methods are useful because they reduce overfitting, increase model robustness, and often result in better generalization performance compared to individual base learners.

13
Q

Describe the difference between K-means clustering and hierarchical clustering.

A

K-means clustering is a partitioning algorithm that divides the data into K clusters by iteratively assigning data points to the nearest cluster centroid and updating each centroid to the mean of the points assigned to it. Hierarchical clustering instead builds a hierarchy of clusters, either by iteratively merging the most similar clusters (agglomerative) or by splitting clusters (divisive), producing a dendrogram that can be cut at any level to obtain the desired number of clusters.
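
A quick side-by-side with scikit-learn on synthetic blob data; both algorithms are asked for three clusters, and all settings are illustrative.

    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans, AgglomerativeClustering

    X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)   # centroid-based partitioning
    agglo = AgglomerativeClustering(n_clusters=3).fit(X)              # bottom-up merging
    print(kmeans.labels_[:10])
    print(agglo.labels_[:10])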

14
Q

What is regularization, and why is it important in machine learning?

A

Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function, which discourages overly complex models. Common regularization techniques include L1 regularization (Lasso), which encourages sparsity in the model parameters, and L2 regularization (Ridge), which penalizes the squared magnitude of the parameters.
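
A hedged sketch contrasting the two penalties with scikit-learn on synthetic data: Lasso drives many coefficients to exactly zero, while Ridge only shrinks them. The dataset and alpha values are arbitrary.

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso, Ridge

    X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                           noise=5.0, random_state=0)
    lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: sparse solution
    ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: small but nonzero weights
    print("non-zero Lasso coefficients:", np.sum(lasso.coef_ != 0))
    print("non-zero Ridge coefficients:", np.sum(ridge.coef_ != 0))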

15
Q

What is the difference between a hyperparameter and a parameter in machine learning models?

A

Hyperparameters are configuration settings that are external to the model and are typically set before the learning process begins (e.g., learning rate, regularization parameter). Parameters, on the other hand, are internal to the model and are learned from the training data (e.g., weights and biases in neural networks).

16
Q

Explain the concept of cross-entropy loss and its role in classification tasks.

A

Cross-entropy loss, also known as log loss, measures the difference between the predicted probability distribution and the true probability distribution of the target labels. It is commonly used as the loss function for binary and multiclass classification tasks, where the goal is to minimize the cross-entropy between the predicted and true labels.
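
A minimal numpy sketch of binary cross-entropy on toy values; the helper name and the eps clipping constant (a numerical guard against log(0)) are my own illustration.

    import numpy as np

    def binary_cross_entropy(y_true, p_pred, eps=1e-12):
        p = np.clip(p_pred, eps, 1 - eps)
        return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

    y_true = np.array([1, 0, 1, 1])
    p_pred = np.array([0.9, 0.2, 0.7, 0.4])   # predicted probability of class 1
    print(binary_cross_entropy(y_true, p_pred))   # lower is better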

17
Q

What is the purpose of feature scaling, and what are some common techniques for feature scaling?

A

Feature scaling is the process of standardizing or normalizing the range of features in the dataset to ensure that they have similar scales. This helps prevent certain features from dominating the learning process and improves the convergence of optimization algorithms. Common techniques for feature scaling include min-max scaling, z-score normalization, and robust scaling.
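
For illustration, two of these techniques applied to a tiny toy matrix with scikit-learn:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # features on very different scales
    print(MinMaxScaler().fit_transform(X))    # min-max scaling to [0, 1]
    print(StandardScaler().fit_transform(X))  # z-score normalization (mean 0, std 1)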

18
Q

Explain the concept of bagging and how it is used in ensemble learning.

A

Bagging, short for bootstrap aggregating, is an ensemble learning technique that involves training multiple base learners on different bootstrap samples of the training data and combining their predictions through averaging or voting. Bagging helps reduce variance and improve the stability of the resulting model.
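
A small scikit-learn sketch comparing a bagged ensemble of decision trees with a single tree on synthetic data; the dataset and the number of estimators are arbitrary.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, random_state=0)
    # 50 trees, each trained on a bootstrap sample, combined by majority vote.
    bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
    print("bagged trees:", cross_val_score(bag, X, y, cv=5).mean())
    print("single tree :", cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean())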

19
Q

What is the purpose of dropout in neural networks, and how does it work?

A

Dropout is a regularization technique used in neural networks to prevent overfitting by randomly dropping a proportion of neurons during training. This forces the network to learn more robust features and reduces the reliance on any single neuron. Dropout effectively simulates training multiple neural networks with shared parameters.
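
A minimal numpy sketch of inverted dropout on toy activations; the function name and shapes are illustrative. At test time no units are dropped and, with this inverted form, no extra rescaling is needed.

    import numpy as np

    def dropout(activations, p_drop, rng):
        mask = rng.random(activations.shape) >= p_drop   # keep each unit with prob 1 - p_drop
        return activations * mask / (1.0 - p_drop)       # rescale so the expected activation is unchanged

    rng = np.random.default_rng(0)
    a = np.ones((2, 8))                                  # toy activations
    print(dropout(a, p_drop=0.5, rng=rng))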

20
Q

Describe the difference between a parametric and a non-parametric machine learning algorithm.

A

Parametric machine learning algorithms make strong assumptions about the functional form of the underlying data distribution and have a fixed number of parameters that are learned from the training data. Non-parametric algorithms, on the other hand, make fewer assumptions about the data distribution and have a flexible number of parameters that grow with the size of the training data.

21
Q

What are some common techniques for handling missing data in machine learning?

A

Common techniques for handling missing data include imputation (e.g., replacing missing values with the mean, median, or mode), deletion (e.g., removing rows or columns with missing values), and using algorithms that can handle missing data directly (e.g., tree-based methods, K-nearest neighbors).
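
A tiny imputation example with scikit-learn's SimpleImputer; the toy matrix and the mean strategy are chosen arbitrarily.

    import numpy as np
    from sklearn.impute import SimpleImputer

    X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])
    imputer = SimpleImputer(strategy="mean")   # could also be "median" or "most_frequent"
    print(imputer.fit_transform(X))            # NaNs replaced by each column's mean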

22
Q

Explain the concept of feature importance and how it can be computed in machine learning models.

A

Feature importance measures the contribution of each feature to the predictive performance of the model. It can be computed using techniques such as permutation importance, which evaluates the decrease in model performance when the values of a feature are randomly permuted, or using model-specific techniques such as feature importance scores in decision trees or coefficients in linear models.
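
A hedged sketch of both approaches with scikit-learn; the dataset and model are placeholders.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # Permutation importance: drop in held-out accuracy when each feature is shuffled.
    result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
    print(result.importances_mean.round(3))
    # Tree ensembles also expose impurity-based importances directly:
    print(model.feature_importances_.round(3))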

23
Q

What is the trade-off between bias and variance, and how does it affect the performance of machine learning models?

A

The bias-variance trade-off refers to the trade-off between the error due to bias (underfitting) and the error due to variance (overfitting) in machine learning models. Models with high bias are too simplistic and may fail to capture the underlying patterns in the data, while models with high variance are too complex and may fit the noise in the data. Finding the right balance between bias and variance is essential for achieving good generalization performance.

24
Q

Explain the difference between a ROC curve and a precision-recall curve.

A

ROC (Receiver Operating Characteristic) curves plot the true positive rate (sensitivity) against the false positive rate (1-specificity) for different threshold values, showing the trade-off between sensitivity and specificity. Precision-recall curves, on the other hand, plot precision against recall (sensitivity) for different threshold values, focusing on the trade-off between precision and recall for imbalanced datasets.

25
Q

What is the difference between L1 and L2 regularization, and when would you use each?

A

L1 regularization (Lasso) adds a penalty term proportional to the absolute value of the model parameters, encouraging sparsity and feature selection. L2 regularization (Ridge) adds a penalty term proportional to the squared magnitude of the model parameters, preventing large weights and reducing overfitting. L1 regularization is often used when feature selection is desired, while L2 regularization is used for general regularization.

26
Q

Explain the concept of imbalanced datasets and how it can affect machine learning models.

A

Imbalanced datasets occur when one class or category is significantly more prevalent than others in the training data. This imbalance can lead to biased models that prioritize the majority class and perform poorly on the minority class. Techniques for handling imbalanced datasets include resampling (e.g., oversampling, undersampling), using different evaluation metrics (e.g., F1-score, AUC-ROC), and algorithmic approaches (e.g., cost-sensitive learning, ensemble methods).

27
Q

What is the difference between precision and recall, and when would you use each as an evaluation metric?

A

Precision measures the proportion of true positive predictions among all positive predictions made by the model, while recall measures the proportion of true positive predictions among all actual positive instances in the dataset. Precision is useful when minimizing false positive predictions is important (e.g., spam detection), while recall is useful when minimizing false negative predictions is important (e.g., cancer diagnosis).

28
Q

Describe the bias-variance decomposition of the expected prediction error.

A

The bias-variance decomposition of the expected prediction error decomposes the expected squared error of the model into three components: bias^2, variance, and irreducible error. Bias^2 represents the error due to the difference between the average prediction of the model and the true value, variance represents the variability of the model predictions across different training datasets, and irreducible error represents the noise inherent in the data that cannot be reduced by the model.
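
In symbols, for squared-error loss (where \hat{f} is the learned model and \sigma^2 the irreducible noise):

    E[(y - \hat{f}(x))^2] = \mathrm{Bias}[\hat{f}(x)]^2 + \mathrm{Var}[\hat{f}(x)] + \sigma^2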

29
Q

What is the difference between feature selection and feature extraction?

A

Feature selection involves selecting a subset of the original features in the dataset that are most relevant to the target variable, while discarding irrelevant or redundant features. Feature extraction, on the other hand, involves creating new features from the existing features in the dataset through techniques such as dimensionality reduction (e.g., PCA) or transformation (e.g., polynomial features).

30
Q

Explain the concept of kernel trick in support vector machines (SVM).

A

The kernel trick is a technique used in support vector machines (SVM) to implicitly map the input features into a higher-dimensional feature space without explicitly computing the transformation. This allows SVMs to efficiently model complex nonlinear relationships between the input features and the target variable by computing the dot product between data points in the transformed feature space.
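
A small illustration with scikit-learn: on data that is not linearly separable, an SVM with an RBF kernel typically outperforms one with a linear kernel. The dataset and kernels here are illustrative.

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    X, y = make_moons(n_samples=300, noise=0.2, random_state=0)  # not linearly separable
    for kernel in ("linear", "rbf"):
        acc = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
        print(kernel, "kernel accuracy:", round(acc, 3))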

31
Q

What are the assumptions of linear regression, and how can violations of these assumptions affect the model?

A

Linear regression assumes that the relationship between the independent variables and the target variable is linear, that the errors are normally distributed with constant variance (homoscedasticity), that the errors are independent, and that there is no multicollinearity among the independent variables. Violations of these assumptions can lead to biased parameter estimates, inflated standard errors, and unreliable predictions.

32
Q

Explain the concept of grid search and how it is used for hyperparameter tuning.

A

Grid search is a technique used for hyperparameter tuning, where a predefined set of hyperparameters is exhaustively searched over a grid of possible values, and the model is trained and evaluated using each combination of hyperparameters. The combination of hyperparameters that results in the best performance on a validation set is selected as the optimal set of hyperparameters for the model.
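
A minimal GridSearchCV sketch with scikit-learn; the estimator and the grid of C and gamma values are arbitrary examples.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}    # grid of candidate values
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)   # evaluates every combination with CV
    search.fit(X, y)
    print(search.best_params_, search.best_score_)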

33
Q

What are some common techniques for model evaluation and selection in machine learning?

A

Common techniques for model evaluation and selection include cross-validation, in which the dataset is split into multiple subsets (folds) and the model is trained and evaluated on each fold; hyperparameter tuning, in which the optimal set of hyperparameters is selected using techniques such as grid search or random search; and model comparison using evaluation metrics such as accuracy, precision, recall, F1-score, AUC-ROC, or AUC-PR.

34
Q

Explain the concept of feature importance in tree-based models such as decision trees or random forests.

A

Feature importance measures the contribution of each feature to the predictive performance of the model. In tree-based models such as decision trees or random forests, feature importance can be computed based on how often a feature is used to split the data or how much it reduces the impurity (e.g., Gini impurity or entropy) at each node of the tree. Features that are frequently used near the top of the tree or result in large impurity reductions are considered more important.

35
Q

Explain the concept of the EM (Expectation-Maximization) algorithm and its application in machine learning.

A

The EM (Expectation-Maximization) algorithm is an iterative optimization algorithm used to estimate the parameters of probabilistic models with latent variables. It consists of two main steps: the E-step, where the expected values of the latent variables are computed given the current model parameters, and the M-step, where the model parameters are updated to maximize the likelihood of the observed data given the expected values of the latent variables. The EM algorithm is commonly used in unsupervised learning tasks such as clustering (e.g., Gaussian mixture models) or latent variable models (e.g., factor analysis).
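
As one concrete application, scikit-learn's GaussianMixture fits a mixture of Gaussians with EM; the blob data below is synthetic and the number of components is chosen to match it.

    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
    gmm = GaussianMixture(n_components=3, random_state=0).fit(X)  # parameters estimated by EM
    print(gmm.means_.round(2))      # recovered component means
    print(gmm.weights_.round(2))    # mixing proportions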

36
Q

What are some common techniques for handling multicollinearity in linear regression models?

A

Multicollinearity occurs when two or more independent variables in a linear regression model are highly correlated with each other, leading to instability in the estimation of the model parameters. Common techniques for handling multicollinearity include removing one of the correlated variables, combining the correlated variables into a single composite variable (e.g., principal component analysis), or using regularization techniques such as ridge regression or LASSO regression to penalize the magnitude of the model parameters.

37
Q

What is the purpose of early stopping in machine learning, and how does it work?

A

Early stopping is a regularization technique used to prevent overfitting by stopping the training process before the model starts to overfit the training data. It works by monitoring the performance of the model on a separate validation set during training and stopping the training process when the performance on the validation set starts to degrade (e.g., when the validation loss stops decreasing or starts to increase). Early stopping helps prevent the model from memorizing the training data and encourages it to generalize to unseen data.
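
One concrete way to get this behaviour without writing the loop by hand: scikit-learn's MLPClassifier exposes early stopping through its early_stopping, validation_fraction, and n_iter_no_change arguments. The dataset and settings below are illustrative.

    from sklearn.datasets import make_classification
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=1000, random_state=0)
    # Hold out 10% of the training data as a validation set; stop once the
    # validation score fails to improve for 10 consecutive epochs.
    model = MLPClassifier(max_iter=500, early_stopping=True,
                          validation_fraction=0.1, n_iter_no_change=10,
                          random_state=0)
    model.fit(X, y)
    print("stopped after", model.n_iter_, "epochs")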

38
Q

What is the difference between batch normalization and layer normalization in neural networks?

A

Batch normalization and layer normalization are normalization techniques used in neural networks to improve the convergence and stability of training. Batch normalization normalizes each feature by subtracting the mean and dividing by the standard deviation computed across the examples in the mini-batch, so its statistics depend on the batch composition and size. Layer normalization normalizes each example by subtracting the mean and dividing by the standard deviation computed across the features of that single example, so it is independent of the batch. Both apply a learnable scale and shift after normalization. Batch normalization is more commonly used in feedforward and convolutional networks, while layer normalization is more commonly used in recurrent networks and transformers.
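
A numpy sketch of the difference on a toy (batch, features) matrix; the learnable scale and shift parameters are omitted for brevity.

    import numpy as np

    x = np.random.default_rng(0).normal(size=(4, 3))  # rows = examples in the batch, columns = features
    # Batch norm: statistics per feature, computed across the batch dimension.
    bn = (x - x.mean(axis=0)) / x.std(axis=0)
    # Layer norm: statistics per example, computed across that example's features.
    ln = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)
    print(bn.round(2))
    print(ln.round(2))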

39
Q

Explain the concept of word embeddings and their role in natural language processing (NLP).

A

Word embeddings are dense vector representations of words in a high-dimensional vector space, where words with similar meanings are represented by vectors that are close together in the space. Word embeddings capture semantic relationships between words and can be learned from large text corpora using techniques such as word2vec, GloVe (Global Vectors for Word Representation), or fastText. Word embeddings are commonly used as input features for various NLP tasks such as sentiment analysis, named entity recognition, machine translation, and text classification.

40
Q

Explain the concept of gradient descent and how it is used to train machine learning models.

A

Gradient descent is an optimization algorithm that minimizes the loss function of a machine learning model by iteratively updating the model parameters in the direction of steepest descent of the loss. At each iteration, the gradient of the loss with respect to the parameters is computed (analytically for simple models, or via backpropagation in neural networks), and the parameters are updated by subtracting the gradient scaled by the learning rate. Training continues until a convergence criterion is met, such as a maximum number of iterations or a sufficiently small change in the loss. Gradient descent can be applied to many machine learning models, including linear regression, logistic regression, neural networks, and support vector machines.
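
A numpy-only sketch of batch gradient descent fitting a least-squares linear regression; the data, learning rate, and number of steps are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = X @ np.array([3.0, -2.0]) + 1.0 + rng.normal(scale=0.1, size=100)
    Xb = np.hstack([X, np.ones((100, 1))])          # add a bias column

    w = np.zeros(3)
    learning_rate = 0.1
    for step in range(500):
        grad = 2 * Xb.T @ (Xb @ w - y) / len(y)     # gradient of the mean squared error
        w -= learning_rate * grad                   # step opposite the gradient
    print("learned weights and bias:", w.round(3))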