DS - Concepts Flashcards
[STATS] Bias
wrong assumptions when training → can’t capture underlying patterns → underfit
Bias is error due to overly simplistic assumptions in the learning algorithm being used. This can lead to the model underfitting the data, making it hard to achieve high predictive accuracy and to generalize from the training set to the test set.
[STATS] Variance
sensitive to fluctuations when training→ can’t generalize on unseen data → overfit
Variance is error due to too much complexity in the learning algorithm you’re using. This makes the algorithm highly sensitive to fluctuations in the training data, which can lead the model to overfit. You’ll be carrying so much noise from the training data that the model won’t be very useful on the test data.
[STATS] Bias Variance Tradeoff
The bias-variance tradeoff attempts to minimize these two sources of error, through methods such as:
– Cross validation to generalize to unseen data
– Dimension reduction and feature selection
In general, as variance decreases, bias increases, and vice versa.
The bias-variance decomposition essentially decomposes the learning error of any algorithm into the sum of the bias (squared), the variance, and a bit of irreducible error due to noise in the underlying dataset. Essentially, if you make a model more complex and add more variables, you’ll lose bias but gain variance; to get the optimally reduced amount of error, you’ll have to trade off bias and variance.
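In symbols, the standard decomposition is: Expected Test Error = Bias² + Variance + Irreducible Error.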
[STATS] Precision
TP / (TP + FP) → percent correct when predicting positive
positive predictive value: of all the positives the model claims, the fraction that are actually positive
[STATS] Recall
Sensitivity
TP / (TP + FN) → percent of actual positives identified correctly (True Positive Rate)
Recall is the true positive rate: of all the actual positives in the data, the fraction your model correctly identifies
[STATS] Specificity
TN / (TN + FP) → percent of actual negatives identified correctly
[STATS] F1 Score
2 * (Precision * Recall) / (Precision + Recall)
Useful when classes are imbalanced
The F1 Score is a measure of a model’s performance. It is the harmonic mean of a model’s precision and recall, with results near 1 being the best and those near 0 being the worst. You would use it in classification tasks where true negatives don’t matter much.
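A minimal sketch computing all four metrics from raw confusion-matrix counts (the TP/FP/FN/TN values here are made-up numbers for illustration):

```python
# Hypothetical confusion-matrix counts, for illustration only
TP, FP, FN, TN = 80, 10, 20, 90

precision = TP / (TP + FP)    # percent correct when predicting positive
recall = TP / (TP + FN)       # percent of actual positives identified (TPR)
specificity = TN / (TN + FP)  # percent of actual negatives identified
f1 = 2 * (precision * recall) / (precision + recall)  # harmonic mean of P and R

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"specificity={specificity:.3f} f1={f1:.3f}")
```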
[STATS] ROC Curve
plots TPR vs. FPR at every classification threshold α. The Area Under the Curve (AUC) measures how well the model separates positives from negatives (perfect AUC = 1, random baseline = 0.5).
The ROC curve is a graphical representation of the contrast between true positive rates and false positive rates at various thresholds. It’s often used as a proxy for the trade-off between the sensitivity of the model (true positives) vs the fallout or probability it will trigger a false alarm (false positives).
[STATS] Precision-Recall Curve
Plots precision vs. recall at every threshold; focuses on correct prediction of the minority (positive) class, which is useful when the data is imbalanced
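A minimal sketch of both the ROC and Precision-Recall curves with scikit-learn (the synthetic imbalanced dataset and the logistic model are placeholder choices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve

# Synthetic, imbalanced binary data (placeholder for a real dataset)
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=42)
scores = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

fpr, tpr, _ = roc_curve(y, scores)                 # one (FPR, TPR) point per threshold
precision, recall, _ = precision_recall_curve(y, scores)
print("AUC:", roc_auc_score(y, scores))            # 1.0 = perfect, 0.5 = random baseline
```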
[STATS] P-Value
probability that an observed effect could have occurred by chance. If it is less than the significance level α, or if the test statistic is greater than the critical value, then reject the null.
A p-value is a measure of the probability that an observed difference could have occurred just by random chance. The lower the p-value, the greater the statistical significance of the observed difference.
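A minimal sketch of obtaining a p-value from a two-sample t-test with SciPy (both samples are synthetic placeholders):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=100)  # control group (synthetic)
b = rng.normal(loc=0.3, scale=1.0, size=100)  # treatment group (synthetic)

t_stat, p_value = stats.ttest_ind(a, b)       # two-sample t-test
alpha = 0.05                                  # significance level
print(p_value, "reject null" if p_value < alpha else "fail to reject null")
```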
[STATS] Type I Error
(False Positive α) - rejecting a true null
Type I error is a false positive: claiming something has happened when it hasn’t.
Confidence Level (1 - α) - probability of not rejecting a true null, i.e., avoiding a Type I error
[STATS] Type II Error
(False Negative β) - not rejecting a false null. Decreasing the Type I error rate increases the Type II error rate (for a fixed sample size).
Type II error is a false negative: claiming nothing has happened when in fact something has.
Power (1 - β) - probability of picking up on an effect that is present and avoiding a Type II Error
[ML] Logistic Regression
Predicts the probability that y belongs to a binary class. Estimates β through maximum likelihood estimation (MLE) by fitting a logistic (sigmoid) function to the data, which is equivalent to minimizing the cross-entropy loss. A threshold α then classifies predicted probabilities as either 1 or 0.
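A minimal scikit-learn sketch (the synthetic dataset is a placeholder; note that scikit-learn applies L2 regularization by default):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=4, random_state=0)  # placeholder data
clf = LogisticRegression().fit(X, y)    # fits β by maximizing the likelihood

proba = clf.predict_proba(X[:5])[:, 1]  # sigmoid outputs: P(y = 1 | x)
labels = (proba >= 0.5).astype(int)     # threshold α = 0.5 turns probabilities into 0/1
print(proba, labels)
```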
[ML] Logistic Regression Assumptions
– Linear relationship between X and log-odds of Y
– Independent observations
– Low multicollinearity
[ML] Multiclass Classification
- Multiclass classifiers can distinguish between more than two classes
- SGD classifiers, Random Forests, and Naïve Bayes handle multiclass natively
- One-versus-the-rest (OvR) strategy for those that don’t support multiclass (see the sketch after this list)
- Multilabel Classification: outputs multiple binary labels per instance. Use the F1 score, averaged across labels, to evaluate
- Multioutput Classification: Generalization of multilabel classification where each label can be multiclass (have more than two possible values)
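A minimal OvR sketch with scikit-learn (the dataset and the SVC base classifier are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)           # 3-class dataset
ovr = OneVsRestClassifier(SVC()).fit(X, y)  # trains one binary SVC per class
print(len(ovr.estimators_))                 # 3: one classifier per class
print(ovr.predict(X[:5]))
```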
[ML] Linear Regression
Models linear relationships between a continuous response and explanatory variables
Makes prediction by computing a weighted sum of input features plus a constant called the bias term (intercept term)
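A minimal sketch of fitting the weighted sum plus bias term, here via the closed-form least-squares solution (the synthetic data’s true intercept 4 and weight 3 are made up):

```python
import numpy as np

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))                # one synthetic feature
y = 4 + 3 * X[:, 0] + rng.normal(size=100)  # true intercept 4, weight 3, plus noise

X_b = np.c_[np.ones((100, 1)), X]           # prepend a column of 1s for the bias term
theta = np.linalg.pinv(X_b) @ y             # least-squares solution
print(theta)                                # ≈ [4, 3]
```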
[ML] Linear Regression Assumptions
– Linear relationship and independent observations
– Homoscedasticity - error terms have constant variance
– Errors are uncorrelated and normally distributed
– Low multicollinearity
[ML] Gradient Descent
Generic optimization algorithm capable of finding optimal solutions to a wide range of problems; it tweaks parameters iteratively in order to minimize a cost function.
Method: it measures the local gradient of the error function with regard to the parameter vector 𝜃 and goes in the direction of descending gradient. Once the gradient is 0, a minimum is reached.
Random Initialization: start by filling 𝜃 with random values
Learning Rate:
The parameter that determines the size of the steps.
If the learning rate is too small, the algorithm will need many iterations to converge, which can take too long. If it is too high, the algorithm might diverge, with larger and larger values, failing to find a good solution.
[ML] Batch Gradient Descent
- Partial Derivative: how much the cost function will change if you change 𝜃 just a little bit.
- The gradient vector contains all the partial derivatives of the cost function ( one for each model parameter).
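For linear regression with MSE, the gradient vector is ∇𝜃 MSE(𝜃) = (2/m) Xᵀ(X𝜃 − y). A minimal batch GD sketch on the same kind of synthetic data as above (the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
X_b = np.c_[np.ones((100, 1)), 2 * rng.random((100, 1))]  # features with bias column
y = 4 + 3 * X_b[:, 1] + rng.normal(size=100)

eta = 0.1                        # learning rate
theta = rng.random(2)            # random initialization
for _ in range(1000):
    gradients = 2 / len(X_b) * X_b.T @ (X_b @ theta - y)  # full-batch MSE gradient
    theta -= eta * gradients     # step in the direction of descending gradient
print(theta)                     # ≈ [4, 3]
```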
[ML] Stochastic (Random) Gradient Descent
- The main problem with batch gradient descent is that it uses the whole training set to compute the gradients at every step, making it very slow when the training set is large.
- Stochastic GD instead picks a random instance in the training set at every step and computes the gradients based on that single instance only.
- This makes the algorithm much faster and makes it possible to train on huge training sets.
- However, it introduces a lot more variability, and the final parameter values are good but not optimal.
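A minimal stochastic GD sketch on the same synthetic data, with a simple learning schedule to damp the variability (the schedule constants t0, t1 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
X_b = np.c_[np.ones((100, 1)), 2 * rng.random((100, 1))]
y = 4 + 3 * X_b[:, 1] + rng.normal(size=100)

theta, t0, t1 = rng.random(2), 5, 50              # arbitrary schedule constants
for epoch in range(50):
    for t in range(len(X_b)):
        i = rng.integers(len(X_b))                # pick one random instance
        xi, yi = X_b[i:i + 1], y[i:i + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)  # gradient from a single instance
        eta = t0 / (epoch * len(X_b) + t + t1)    # gradually shrink the step size
        theta -= eta * gradients
print(theta)                                      # ≈ [4, 3], with some noise
```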
[ML] Mini-Batch Gradient Descent
- At each step, instead of computing the gradients based on the full training set (as in Batch GD) or based on just one instance (Stochastic GD), Mini-batch GD computes the gradients on small random sets of instances called mini batches.
- Main Advantage: performance boost from hardware optimization of matrix operations, especially when using GPUs
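The same loop with a small random batch in place of a single instance (batch size 20 and the other constants are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
X_b = np.c_[np.ones((100, 1)), 2 * rng.random((100, 1))]
y = 4 + 3 * X_b[:, 1] + rng.normal(size=100)

theta, eta, batch_size = rng.random(2), 0.1, 20
for epoch in range(200):
    idx = rng.permutation(len(X_b))                        # shuffle each epoch
    for start in range(0, len(X_b), batch_size):
        batch = idx[start:start + batch_size]              # one random mini-batch
        xb, yb = X_b[batch], y[batch]
        gradients = 2 / len(xb) * xb.T @ (xb @ theta - yb)
        theta -= eta * gradients
print(theta)                                               # ≈ [4, 3]
```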
[ML] Polynomial Regression
- A method to fit linear regression on more complex, nonlinear data
- Use Scikit-Learn’s PolynomialFeatures to transform the data (such as adding the square of each feature)
- Polynomial Regression is capable of finding relationships between features (via interaction terms), which is something plain Linear Regression cannot do
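A minimal PolynomialFeatures sketch on synthetic quadratic data (the true coefficients 0.5, 1, 2 are made up):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = 6 * rng.random((100, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(size=100)  # quadratic + noise

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)             # adds the squared feature: [x, x^2]
model = LinearRegression().fit(X_poly, y)  # plain linear regression on expanded features
print(model.intercept_, model.coef_)       # ≈ 2, [1, 0.5]
```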
[ML] Gradients
the generalization of the derivative for functions that take several inputs (or one input in the form of a vector or some other complex structure). The gradient of a function is a vector of its partial derivatives. You can look at finding a partial derivative as the process of finding the derivative by focusing on one of the function’s inputs and treating all other inputs as constant values.
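For example, for f(w1, w2) = w1² + 3·w2: holding w2 constant gives ∂f/∂w1 = 2·w1, and holding w1 constant gives ∂f/∂w2 = 3, so ∇f = (2·w1, 3).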
[ML] Learning Curves
- High-degree Polynomial Regression models can be prone to overfitting the training data, while linear models can underfit.
- Learning curves are plots of the model’s performance on training set and the validation set as a function of the training set size.
- These plots show the size of the training set needed to stabilize the RMSE.
- The learning curve of a complicated model shows two differences compared to a simple model:
- The error on the training data is much lower than the simple model
- There is a gap between the curves. This means that the model performs significantly better on the training data than on the validation data.
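A minimal sketch of plotting learning curves by hand (the model and the synthetic quadratic data are placeholders; Scikit-Learn’s learning_curve utility is an alternative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    train_errors, val_errors = [], []
    for m in range(2, len(X_train)):  # grow the training set one instance at a time
        model.fit(X_train[:m], y_train[:m])
        train_errors.append(mean_squared_error(y_train[:m], model.predict(X_train[:m])))
        val_errors.append(mean_squared_error(y_val, model.predict(X_val)))
    plt.plot(np.sqrt(train_errors), label="train RMSE")
    plt.plot(np.sqrt(val_errors), label="validation RMSE")
    plt.xlabel("training set size"); plt.legend(); plt.show()

# Placeholder data: quadratic signal that a plain linear model will underfit
rng = np.random.default_rng(0)
X = 6 * rng.random((100, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(size=100)
plot_learning_curves(LinearRegression(), X, y)
```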