DS - Concepts Flashcards

1
Q

[STATS] Bias

A

wrong assumptions when training → can’t capture underlying patterns → underfit

Bias is error due to overly simplistic assumptions in the learning algorithm. It can lead the model to underfit your data, making it hard to achieve high predictive accuracy or to generalize from the training set to the test set.

2
Q

[STATS] Variance

A

sensitive to fluctuations when training → can’t generalize on unseen data → overfit

Variance is error due to too much complexity in the learning algorithm. The algorithm becomes highly sensitive to small variations in the training data, which can lead the model to overfit. It carries too much noise from the training data to be very useful on test data.

3
Q

[STATS] Bias Variance Tradeoff

A

The bias-variance tradeoff attempts to minimize these two sources of error, through methods such as:
– Cross validation to generalize to unseen data
– Dimension reduction and feature selection
In general, as variance decreases, bias tends to increase.

The bias-variance decomposition expresses the expected error of a learning algorithm as the sum of the squared bias, the variance, and an irreducible error due to noise in the underlying data. If you make a model more complex and add more variables, you lose bias but gain variance. To minimize total error, you have to trade off bias against variance.
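
For reference, the decomposition can be written out for squared error. The notation is an assumption, not from the card: the data follow y = f(x) + ε with zero-mean noise of variance σ², and f̂ is the model fitted on a random training set.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```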

4
Q

[STATS] Precision

A

TP / (TP + FP) → percent correct when predicting positive

Positive predictive value: the number of correct positive predictions compared to the total number of positives the model predicts

5
Q

[STATS] Recall

A

Sensitivity

TP / (TP + FN) → percent of actual positives identified correctly (True Positive Rate)

Recall is the true positive rate: the number of positives your model correctly identifies compared to the actual number of positives in the data

6
Q

[STATS] Specificity

A

TN / (TN + FP) → percent of actual negatives identified correctly

7
Q

[STATS] F1 Score

A

2 * (Precision * Recall) / (Precision + Recall)

Useful when classes are imbalanced

The F1 Score is a measure of a model’s performance. It is the harmonic mean of the precision and recall of a model, with results tending to 1 being the best and those tending to 0 being the worst. You would use it in classification tests where true negatives don’t matter much.
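
A minimal sketch of precision, recall, specificity and F1 computed from confusion-matrix counts. The function name and the example counts are made up for illustration, and the sketch assumes none of the denominators is zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return precision, recall, specificity and F1 from confusion-matrix counts.

    Assumes every denominator is non-zero (no guard against empty classes).
    """
    precision = tp / (tp + fp)      # correct among predicted positives
    recall = tp / (tp + fn)         # correct among actual positives (TPR)
    specificity = tn / (tn + fp)    # correct among actual negatives
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for an imbalanced problem: 13 positives, 87 negatives.
metrics = classification_metrics(tp=8, fp=2, tn=85, fn=5)
print(metrics)
```

Note that the true negatives only enter specificity, which is why F1 suits cases where true negatives matter little.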

8
Q

[STATS] ROC Curve

A

plots TPR vs. FPR for every threshold α. Area Under the Curve (AUC) is the probability that the model scores a randomly chosen positive above a randomly chosen negative (perfect AUC = 1, random baseline = 0.5).

The ROC curve is a graphical representation of the contrast between true positive rates and false positive rates at various thresholds. It’s often used as a proxy for the trade-off between the sensitivity of the model (true positives) vs the fallout or probability it will trigger a false alarm (false positives).
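
One way to make the AUC concrete is the pairwise view: count how often a positive example gets a higher score than a negative one. The sketch below is a plain-Python illustration under that interpretation, not how a library computes it. The function name and the toy scores are assumptions.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Runs in O(P * N), fine for small data.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: one positive (0.35) is scored below one negative (0.4).
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```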

9
Q

[STATS] Precision-Recall Curve

A

Focuses on the correct prediction of the minority class, useful when data is imbalanced

10
Q

[STATS] P-Value

A

probability of observing an effect at least as extreme as the one measured, assuming the null hypothesis is true. If it is less than the significance level α, or equivalently if the test statistic is more extreme than the critical value, then reject the null.

A p-value measures how likely an observed difference at least this large would be under random chance alone. The lower the p-value, the greater the statistical significance of the observed difference.
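
As a small worked case, consider testing whether a coin is fair. The sketch computes a one-sided exact binomial p-value with the standard library. The coin scenario, the function name and the choice of α = 0.05 are illustrative assumptions.

```python
from math import comb


def binomial_p_value(heads, flips, p_null=0.5):
    """One-sided p-value: P(X >= heads) for X ~ Binomial(flips, p_null)."""
    return sum(
        comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
        for k in range(heads, flips + 1)
    )


alpha = 0.05                                    # significance level
p_value = binomial_p_value(heads=9, flips=10)   # equals 11/1024
reject_null = p_value < alpha
print(p_value, reject_null)
```

Seeing 9 or more heads in 10 flips of a fair coin has probability 11/1024, which is below α, so the null hypothesis of a fair coin is rejected.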

11
Q

[STATS] Type I Error

A

(False Positive, α) - rejecting a true null

A Type I error is a false positive: claiming something has happened when it hasn’t

Confidence Level (1 - α) - probability of correctly not rejecting a true null, i.e. avoiding a Type I error

12
Q

[STATS] Type II Error

A

(False Negative, β) - not rejecting a false null. For a fixed sample size, decreasing Type I Error causes an increase in Type II Error

A Type II error is a false negative: claiming nothing is happening when in fact something is

Power (1 - β) - probability of picking up on an effect that is present and avoiding a Type II Error

13
Q

[ML] Logistic Regression

A

Predicts the probability that y belongs to a binary class. Estimates β through maximum likelihood estimation (MLE) by fitting a logistic (sigmoid) function to the data. This is equivalent to minimizing the cross-entropy loss. The threshold α classifies predictions as either 1 or 0.
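
The pieces named above (sigmoid, cross-entropy loss, threshold) can be sketched in a few lines. This is an illustrative sketch with made-up function names. It shows prediction and the loss only, not the fitting procedure.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, intercept):
    """Probability of class 1: sigmoid of the linear score (the log-odds)."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def cross_entropy(y_true, y_prob):
    """Mean cross-entropy loss; minimizing it is the MLE objective."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, y_prob)
    ) / len(y_true)


def classify(prob, threshold=0.5):
    """Turn a probability into a 0/1 label using the threshold."""
    return 1 if prob >= threshold else 0


prob = predict_proba([2.0], weights=[1.5], intercept=-1.0)  # score z = 2.0
print(prob, classify(prob))
```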

14
Q

[ML] Logistic Regression Assumptions

A

– Linear relationship between X and log-odds of Y
– Independent observations
– Low multicollinearity

15
Q

[ML] Multiclass Classification

A
  • Multiclass can distinguish between more than two classes
  • SGD classifiers, Random Forest and Naïve Bayes can use multiclass by default
  • One-versus-the-rest (OvR) strategy for binary classifiers that don’t support multiclass natively
  • Multilabel Classification: Can predict multiple binary labels per instance in one model. One way to evaluate it is to average the F1 score across labels
  • Multioutput Classification: Generalization of multilabel classification where each label can be multiclass (have more than two possible values)
16
Q

[ML] Linear Regression

A

Models linear relationships between a continuous response and explanatory variables

Makes a prediction by computing a weighted sum of the input features plus a constant called the bias term (intercept term)
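
The prediction step is just this weighted sum. A minimal sketch, with illustrative names and values:

```python
def predict(features, weights, bias):
    """Linear regression prediction: bias term plus weighted sum of features."""
    return bias + sum(w * x for w, x in zip(weights, features))


# Hypothetical model with two features.
y_hat = predict([2.0, 3.0], weights=[0.5, -1.0], bias=4.0)
print(y_hat)  # 4.0 + 0.5*2.0 - 1.0*3.0 = 2.0
```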

17
Q

[ML] Linear Regression Assumptions

A

– Linear relationship and independent observations
– Homoscedasticity - error terms have constant variance
– Errors are uncorrelated and normally distributed
– Low multicollinearity

18
Q

[ML] Gradient Descent

A

Generic optimization algorithm capable of finding optimal solutions to a wide range of problems. It tweaks parameters iteratively in order to minimize a cost function.

Method: it measures the local gradient of the error function with regard to the parameter vector 𝜃 and it goes in the direction of descending gradient. Once the gradient is 0, a minimum has been reached (possibly only a local one).

Random Initialization: start by filling 𝜃 with random values

Learning Rate:
The parameter that determines the size of the steps.
If learning rate is too small, then the algorithm will have to go through many iterations to converge, which can take too long. If learning rate is too high, the algorithm might diverge, with larger and larger values, failing to find a good solution.
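
The effect of the learning rate can be seen on a one-parameter toy problem. The cost function J(θ) = (θ - 3)², the starting point and the two learning rates below are assumptions chosen for illustration. A fixed start is used instead of random initialization so the run is reproducible.

```python
def gradient_descent(gradient, theta, learning_rate, n_steps):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    for _ in range(n_steps):
        theta = theta - learning_rate * gradient(theta)
    return theta


def grad(theta):
    """Gradient of the toy cost J(theta) = (theta - 3)**2."""
    return 2.0 * (theta - 3.0)


# A moderate learning rate converges towards the minimum at theta = 3.
converged = gradient_descent(grad, theta=0.0, learning_rate=0.1, n_steps=100)

# A learning rate that is too high overshoots further at every step.
diverged = gradient_descent(grad, theta=0.0, learning_rate=1.1, n_steps=100)

print(converged, diverged)
```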

19
Q

[ML] Batch Gradient Descent

A
  • Partial Derivative: how much the cost function will change if you change 𝜃 just a little bit.
  • The gradient vector contains all the partial derivatives of the cost function ( one for each model parameter).
20
Q

[ML] Stochastic (Random) Gradient Descent

A
  • The main problem with batch gradient descent is that it uses the whole training set to compute the gradients at every step, making it very slow when the training set is large.
  • Stochastic GD picks a random instance in the training set at every step and computes the gradients based on only that single instance.
  • This makes the algorithm much faster and makes it possible to train on huge training sets.
  • However, it introduces a lot more variability, and the final parameter values are good but not optimal.
21
Q

[ML] Mini-Batch Gradient Descent

A
  • At each step, instead of computing the gradients based on the full training set (as in Batch GD) or based on just one instance (Stochastic GD), Mini-batch GD computes the gradients on small random sets of instances called mini batches.
  • Main Advantage: performance boost from hardware optimization of matrix operations, especially when using GPUs
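
The three variants differ only in how many instances feed each gradient step, which the sketch below exposes as a `batch_size` parameter. It fits a one-feature linear model on a tiny noise-free dataset. The function name, data, learning rate and epoch count are all illustrative assumptions.

```python
import random


def minibatch_gd(xs, ys, batch_size, learning_rate=0.05, n_epochs=500, seed=0):
    """Fit y ≈ w*x + b by gradient descent on MSE, one mini-batch per step.

    batch_size == len(xs) gives Batch GD, batch_size == 1 gives Stochastic GD.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(n_epochs):
        rng.shuffle(indices)  # visit instances in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Partial derivatives of the MSE, computed on this batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1 with no noise

w_batch, b_batch = minibatch_gd(xs, ys, batch_size=len(xs))
w_mini, b_mini = minibatch_gd(xs, ys, batch_size=2)
w_sgd, b_sgd = minibatch_gd(xs, ys, batch_size=1)
print((w_batch, b_batch), (w_mini, b_mini), (w_sgd, b_sgd))
```

Because this toy data has no noise, all three settings reach the same line. With noisy data the smaller batch sizes would keep bouncing around the optimum.
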
22
Q

[ML] Polynomial Regression

A
  • A method to fit a linear model to more complex, nonlinear data
  • Use Scikit-Learn’s PolynomialFeatures to transform data (such as adding the square of each feature)
  • When there are multiple features, Polynomial Regression is capable of finding relationships between features, which is something plain Linear Regression cannot do
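
To show what a degree-2 feature expansion produces, here is a hand-rolled stand-in written in plain Python. It is not Scikit-Learn's implementation, and the function name is hypothetical. The cross term `a*b` is what lets a linear model pick up a relationship between two features.

```python
from itertools import combinations_with_replacement


def degree2_features(row):
    """Return the original features followed by all degree-2 products."""
    expanded = list(row)
    for i, j in combinations_with_replacement(range(len(row)), 2):
        expanded.append(row[i] * row[j])
    return expanded


# For features (a, b) the result is [a, b, a*a, a*b, b*b].
print(degree2_features([2.0, 3.0]))
```

A plain linear regression fitted on the expanded features is then a polynomial regression on the original ones.
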
23
Q

[ML] Gradients

A

The generalization of the derivative for functions that take several inputs (or one input in the form of a vector or some other complex structure). The gradient of a function is the vector of its partial derivatives. Finding a partial derivative means taking the derivative with respect to one of the function’s inputs while treating all other inputs as constants.
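
The "vary one input, hold the others constant" idea translates directly into a numerical approximation. The sketch below uses central differences. The function name, step size and example function are assumptions for illustration.

```python
def numerical_gradient(f, point, eps=1e-6):
    """Approximate the gradient of f at point, one partial derivative per input."""
    gradient = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += eps      # nudge only input i, keep the others fixed
        down[i] -= eps
        gradient.append((f(up) - f(down)) / (2 * eps))
    return gradient


def f(v):
    """Example function f(x, y) = x**2 + 3*x*y."""
    return v[0] ** 2 + 3 * v[0] * v[1]


# Analytic gradient is (2x + 3y, 3x), which is (8, 3) at the point (1, 2).
g = numerical_gradient(f, [1.0, 2.0])
print(g)
```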

24
Q

[ML] Learning Curves

A
  • High-degree Polynomial Regression models can be prone to overfitting the training data, while linear models can underfit.
  • Learning curves are plots of the model’s performance on the training set and the validation set as a function of the training set size.
  • These plots show the size of the training set needed to stabilize the RMSE.
  • The learning curve of a complicated model shows two differences compared to a simple model:
      – The error on the training data is much lower than for the simple model
      – There is a gap between the curves. This means that the model performs significantly better on the training data than on the validation data.
25
Q

[ML] Regularization

A
  • For linear models, regularization is typically achieved by constraining the weights of the model. The three methods are: Ridge Regression, Lasso Regression and Elastic Net.
  • Add a penalty λ for large coefficients to the cost function, which reduces overfitting. Requires normalized data.
26
Q

[ML] Ridge Regression (L2)

A
  • Ridge regression is a regularized version of linear regression with a regularization term added to the cost function. This forces the learning algorithm to not only fit the data but also keep the model weights as small as possible. However, the regularization term is only added to the cost function during training. It is common to use a different cost function for training than the performance measure used for testing.
  • The hyperparameter α controls how much you want to regularize the model. If α = 0, then Ridge Regression is just Linear Regression. If α is very large, then all weights end up very close to zero and the result is a flat line going through the data’s mean. The bias term is not regularized
  • The penalty hyperparameter sets the type of regularization to use. Specifying ‘l2’ indicates that you want SGD to add a regularization term to the cost function equal to half the square of the l2 norm of the weight vector: this is simply Ridge Regression
  • Reduces effects of multicollinearity
  • L2 regularization tends to spread error among all the terms and corresponds to a Gaussian prior
27
Q

[ML] Lasso Regression (L1)

A
  • L1 regularization is sparser: it drives many weights to exactly 0 while the rest stay nonzero. It corresponds to setting a Laplace prior
  • Least Absolute Shrinkage and Selection Operator Regression
  • Adds regularization term to the cost function but it uses the l1 norm of the weight vector instead of half the square of the l2 norm
  • LASSO tends to eliminate the weights of the least important features
  • It automatically performs feature selection and outputs a sparse model (i.e. with few nonzero feature weights)
28
Q

[ML] Reduce Overfitting

A

The possibility of overfitting training data and carrying the noise of that data through to the test set, thereby producing inaccurate generalizations, is a fundamental problem.

There are three main methods to avoid overfitting:

  • Keep the model simpler: reduce variance by taking into account fewer variables and parameters, thereby removing some of the noise in the training data.
  • Use cross-validation techniques such as k-fold cross-validation.
  • Use regularization techniques such as LASSO that penalize certain model parameters if they’re likely to cause overfitting.
29
Q

[ML] Imbalanced Dataset

A

An imbalanced dataset is one where, for example, 90% of the data in a classification problem belongs to one class. That leads to problems: an accuracy of 90% can be misleading if you have no predictive power on the other category of data. A few tactics to solve this are:

  • Collect more data to even the imbalances in the dataset
  • Resample the dataset to correct for imbalances
  • Try a different algorithm altogether on your dataset
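
Resampling can take several forms. The sketch below shows one of them, random oversampling, which duplicates minority-class rows until every class is as frequent as the largest one. The function name and toy data are assumptions, and undersampling the majority class is an equally valid alternative.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [row for row, label in zip(rows, labels) if label == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy dataset: 9 rows of class 0 and a single row of class 1.
rows = [[float(i)] for i in range(10)]
labels = [0] * 9 + [1]
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Oversampling should be applied to the training split only, so duplicated rows do not leak into the evaluation data.
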
30
Q

[ML] Elastic Net

A
  • Is the middle ground between Ridge Regression and LASSO Regression
  • The regularization term is a simple mix of Ridge and LASSO and you can control the mix ratio r. When r = 0, Elastic Net is equivalent to Ridge and when r = 1, it is equivalent to LASSO.
  • Which regularization to use:
      – Generally avoid plain Linear Regression
      – If only a few features are suspected to be useful, prefer LASSO or Elastic Net because they tend to reduce the useless features’ weights down to zero
      – Elastic Net is generally preferred over LASSO because LASSO may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated
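
The mix ratio can be illustrated by computing the penalty term alone. The sketch assumes the common formulation in which the Ridge part is half the squared l2 norm and the LASSO part is the l1 norm, both scaled by a strength `alpha`. The function name and the example weights are hypothetical.

```python
def elastic_net_penalty(weights, alpha, r):
    """Regularization term: r * alpha * l1 + (1 - r) * alpha * (l2 squared / 2).

    The bias term is assumed to be excluded from `weights`.
    """
    l1 = sum(abs(w) for w in weights)
    half_l2_squared = 0.5 * sum(w * w for w in weights)
    return r * alpha * l1 + (1 - r) * alpha * half_l2_squared


weights = [1.0, -2.0]
ridge_penalty = elastic_net_penalty(weights, alpha=0.1, r=0.0)   # pure Ridge
lasso_penalty = elastic_net_penalty(weights, alpha=0.1, r=1.0)   # pure LASSO
mixed_penalty = elastic_net_penalty(weights, alpha=0.1, r=0.5)   # half and half
print(ridge_penalty, lasso_penalty, mixed_penalty)
```
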
31
Q

[FE] One-Hot Encoding

A

Transforming categorical features into several binary ones. Increases dimensionality of the feature vector.
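
A minimal plain-Python sketch of the transformation, with an illustrative function name and toy values. Each distinct category becomes one binary column.

```python
def one_hot_encode(values):
    """Return (encoded rows, category order); one binary column per category."""
    categories = sorted(set(values))
    encoded = [[1 if value == c else 0 for c in categories] for value in values]
    return encoded, categories


encoded, categories = one_hot_encode(["red", "green", "red", "blue"])
print(categories)  # column order
print(encoded)     # one row per input value
```

One feature with three categories turns into three binary features, which is the dimensionality increase mentioned above.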

32
Q

[FE] Binning

A

Process of converting a continuous feature into multiple binary features called bins or buckets, typically based on value range
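
A small sketch of range-based binning. The bin edges (an age feature split at 18, 40 and 65) and the function name are made up for illustration. A value equal to an edge is placed in the upper bin.

```python
from bisect import bisect_right


def bin_feature(value, edges):
    """Return one binary indicator per bin; len(edges) edges give len(edges) + 1 bins."""
    index = bisect_right(edges, value)  # number of edges that are <= value
    return [1 if i == index else 0 for i in range(len(edges) + 1)]


age_edges = [18, 40, 65]  # bins: <18, 18-39, 40-64, 65+
print(bin_feature(30, age_edges))
print(bin_feature(70, age_edges))
```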

33
Q

[FE] Normalization

A

Process of converting an actual range of values which a numerical feature can take, into a standard range of values, typically in the interval [-1,1] or [0,1]. This can lead to increased speed of learning.
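
Min-max scaling is one common way to do this. The sketch below is illustrative (the function name is an assumption) and assumes the feature is not constant, since a constant feature would cause a division by zero.

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Linearly rescale values so the minimum maps to low and the maximum to high."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min  # assumed non-zero
    return [low + (v - v_min) * (high - low) / span for v in values]


print(min_max_scale([10.0, 20.0, 30.0]))                     # into [0, 1]
print(min_max_scale([10.0, 20.0, 30.0], low=-1.0, high=1.0)) # into [-1, 1]
```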

34
Q

[FE] Standardization

A

The procedure during which the feature values are rescaled so that they have the properties of a standard normal distribution with μ = 0 and σ = 1, where μ is the mean (the average value of the feature, averaged over all examples in the dataset) and σ is the standard deviation from the mean.
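
In code this is the z-score: subtract the mean and divide by the standard deviation. A minimal sketch using the standard library. The function name and sample values are illustrative, and the population standard deviation is used here as one possible convention.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumed non-zero
    return [(v - mu) / sigma for v in values]


# This sample has mean 5 and population standard deviation 2.
z_scores = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(z_scores)
```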

35
Q

[FE] Data Imputation

A

1) Replace the missing values with the average value of the feature in the dataset.
2) Replace the missing values with a single fixed value outside the normal range of values.
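
Both strategies fit in a few lines. In this sketch missing values are represented as `None`. That representation, the function names and the sentinel value -1.0 are assumptions for illustration.

```python
def impute_mean(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]


def impute_sentinel(values, sentinel):
    """Replace None with a fixed value chosen outside the feature's normal range."""
    return [sentinel if v is None else v for v in values]


feature = [1.0, None, 3.0]
print(impute_mean(feature))
print(impute_sentinel(feature, sentinel=-1.0))
```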

36
Q

[ML] Decision Tree Steps

A

1) Take the entire data set as input
2) Calculate entropy of the target variable, as well as the predictor attributes
3) Calculate your information gain of all attributes (we gain information on sorting different objects from each other)
4) Choose the attribute with the highest information gain as the root node
5) Repeat the same procedure on every branch until the decision node of each branch is finalized
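
Steps 2 to 4 can be sketched for categorical attributes as follows. The toy weather-style dataset and all function names are hypothetical, and the sketch only picks the root attribute rather than building the full tree.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels):
    """Attribute with the highest information gain, used as the root node."""
    return max(rows[0].keys(), key=lambda a: information_gain(rows, labels, a))


rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["no", "no", "yes", "yes"]

print(information_gain(rows, labels, "outlook"))  # splits the classes perfectly
print(information_gain(rows, labels, "windy"))    # tells us nothing
print(best_attribute(rows, labels))
```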

37
Q

[ML] Random Forest Steps

A

1) Randomly select ‘k’ features from a total of ‘m’ features, where k ≪ m
2) Among the ‘k’ features, calculate the node D using the best split point
3) Split the node into daughter nodes using the best split
4) Repeat steps two and three until leaf nodes are finalized
5) Build forest by repeating steps one to four for ‘n’ times to create ‘n’ number of trees

38
Q

[ML] Overfit vs Underfit

A

With an overfit model, you get very accurate predictions on the training data but less accurate predictions on the test and real-world data. Overfitting occurs when the model is overly complex and captures the noise of the data. An underfit model is overly simple; it does not find the data’s underlying patterns, so inaccurate predictions are present in both the training and test results. Underfit models can be caused by data that does not sufficiently cover all combinations, or by improper randomization.

39
Q

[ML] Entropy

A

The measure of randomness/variance. The higher the value, the harder it is to draw conclusions. A result of 0 entropy means perfect classification. A greedy algorithm seeks to homogenize data quickly by reducing entropy.

40
Q

[STATS] Type I Error vs Type II Error

A

Type I → False Positive (value was classified as positive but is actually negative). Type II → False Negative (value was classified as negative but is actually positive)