Introduction Flashcards
What is a categorical variable?
Qualitative variables are referred to as categorical variables.
What is the logic behind classification technique?
We predict the probability that an observation belongs to each category of a qualitative (categorical) variable.
What is variable selection?
Determining which predictors are associated with the response, in order to fit a single model involving only those predictors
How do we determine which model is best?
- Mallow’s Cp
- AIC
- BIC
- Adjusted R²
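A rough sketch of how these criteria can be compared in practice, assuming statsmodels is available; the data frame, column names, and formulas below are purely illustrative:

```python
# Illustrative sketch: comparing candidate models by AIC, BIC and adjusted R^2.
# Column names (x1, x2, x3, y) and the simulated data are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "x3"])
df["y"] = 2 * df["x1"] - df["x2"] + rng.normal(size=100)

# Fit each candidate model and print its criteria; lower AIC/BIC and
# higher adjusted R^2 indicate a better trade-off between fit and size.
for formula in ["y ~ x1", "y ~ x1 + x2", "y ~ x1 + x2 + x3"]:
    fit = smf.ols(formula, data=df).fit()
    print(formula, round(fit.aic, 1), round(fit.bic, 1), round(fit.rsquared_adj, 3))
```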
What are the two types of resampling method?
- Cross validation
- Bootstrap
LOOCV
Leave One Out Cross Validation
Here we use n − 1 observations to train the model and 1 observation to test it. Overall we have n validation runs (the model is fit n times).
It produces the same result every time, since there is no randomness in the split into training and validation sets. However, it is computationally expensive.
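A minimal LOOCV sketch with scikit-learn; the data set and model below are illustrative, not prescribed:

```python
# Minimal LOOCV sketch with scikit-learn; data and model are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X = np.random.rand(50, 2)
y = X @ np.array([1.5, -2.0]) + 0.1 * np.random.randn(50)

# One fit per observation: n models, each trained on the other n-1 points.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(), scoring="neg_mean_squared_error")
print("LOOCV estimate of test MSE:", -scores.mean())
```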
k-fold Cross validation
The whole data set is divided into k folds; the model is fit on k − 1 folds and validated on the remaining fold.
LOOCV is a special case of k-fold CV, where k = n
Generally k = 5 or 10
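A corresponding k-fold sketch (k = 5) with scikit-learn, again on illustrative data:

```python
# Minimal 5-fold CV sketch with scikit-learn; data and model are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X = np.random.rand(200, 3)
y = X @ np.array([1.0, 0.5, -1.0]) + 0.1 * np.random.randn(200)

# shuffle=True randomises which observations land in each of the k folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring="neg_mean_squared_error")
print("5-fold CV estimate of test MSE:", -scores.mean())
```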
Which gives lower bias: LOOCV or k-fold CV?
LOOCV,
because each training set contains n − 1 observations, so the model is fit on almost the entire data set.
Which gives lower variance: LOOCV or k-fold CV?
k-fold CV,
because in LOOCV every fit uses almost the same observations, so the n fitted models are highly (positively) correlated with each other, whereas in k-fold CV the fits are only somewhat correlated.
The mean of highly correlated quantities has higher variance than the mean of less correlated ones (for n estimates with common variance σ² and pairwise correlation ρ, Var(mean) = σ²(1 + (n − 1)ρ)/n, which grows with ρ). Therefore, LOOCV has higher variance than k-fold CV.
Bias
The error that comes from the model's inability to capture the true relationship in the data.
Variance
The amount by which the fitted model would change if it were trained on a different data set.
PCA
It is a feature extraction technique.
It is a dimension reduction technique: it transforms higher-dimensional data into lower-dimensional data while explaining as much of the variability in the data as possible.
First principal component
It is the line that is as close as possible to the data: it minimizes the sum of squared perpendicular distances between each point and the line.
It is the normalized linear combination of the features with the largest variance.
Second principal component
It is the linear combination of the variables, with maximal variance, that is uncorrelated with the first principal component.
The first two principal components of a data set span the plane that is as close to the observations as possible in terms of average squared Euclidean distance.
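A minimal PCA sketch with scikit-learn; the data are illustrative, and the features are standardized first since PCA is sensitive to scale:

```python
# Minimal PCA sketch with scikit-learn; the data here are illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 5)
# Standardize first, since PCA is sensitive to the scale of the features.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)          # projections onto PC1 and PC2
print("Loadings (normalized linear combinations):", pca.components_)
print("Variance explained by PC1 and PC2:", pca.explained_variance_ratio_)
```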
What is high dimensional data?
When p > n, i.e. the number of features exceeds the number of observations.
Can model evaluation metrics like Cp, AIC, BIC, and adjusted R² be used on high-dimensional data?
No.
Indirect performance metrics like Cp, AIC, and BIC are not appropriate in the high-dimensional setting, because estimating σ² is problematic and the RSS can be driven to zero, falsely suggesting a perfect fit.
Likewise, it is easy to obtain a model with an adjusted R² value of 1.
Is regularization effective in high-dimensional settings?
Yes. Methods that perform regularization or shrinkage, such as ridge regression and the lasso, are well suited to the high-dimensional setting.
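A small sketch of this in practice, using the lasso from scikit-learn on illustrative data with p > n (the dimensions and alpha are arbitrary choices, not recommendations):

```python
# Sketch of regularization when p > n: the lasso still gives a usable, sparse fit.
import numpy as np
from sklearn.linear_model import Lasso

n, p = 50, 200                     # high-dimensional setting: p > n
rng = np.random.default_rng(0)
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=n)  # only 2 true signals

lasso = Lasso(alpha=0.1).fit(X, y)
print("Non-zero coefficients:", np.sum(lasso.coef_ != 0))  # sparse solution
```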
What happens to the test error as the dimension increases?
It tends to increase, unless the additional features are truly associated with the response.
Curse of dimensionality
The quality of the model does not always improve as the number of predictor variables increases.
Adding variables that are truly associated with the response improves the fitted model, but adding noise features that are not truly associated with the response deteriorates the model and leads to overfitting.
Undersampling
Randomly sampling from the majority class to reduce its number of observations so that it equals the minority class.
Oversampling
Duplicating data points of the minority class until it equals the majority class, in the case of an imbalanced data set.
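A sketch of both approaches using the imbalanced-learn package (assuming it is installed; the data set is illustrative):

```python
# Sketch of random under- and over-sampling with the imbalanced-learn package;
# the data set is illustrative.
import numpy as np
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

X = np.random.rand(1000, 4)
y = np.array([0] * 950 + [1] * 50)            # 95% majority, 5% minority

X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
print(np.bincount(y_under), np.bincount(y_over))   # both are now balanced
```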
Advantages and disadvantages of undersampling
Advantages:
1. Handles class imbalance
2. Faster Training
Disadvantages:
1. Data loss
2. Sampling bias (mitigated by using random sampling)
Advantages and disadvantages of Oversampling
Advantages:
1. Handles class imbalance
Disadvantages:
1. Duplicated data may cause overfitting
SMOTE
Synthetic Minority Oversampling Technique:
Uses interpolation instead of duplication (a sketch follows the steps below)
- A KNN model is fit on the minority-class observations to find the k nearest neighbors of each point.
- Randomly select a minority data point and then randomly select one of its neighbors (for interpolation).
- Take the difference between the sample and the selected neighbor.
- Multiply the difference by a random number between 0 and 1.
- Add this scaled difference to the sample to generate a new synthetic example in feature space.
- Continue with further samples and neighbors until the minority class equals the majority class.
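A minimal sketch of these steps using NumPy and scikit-learn's NearestNeighbors; the function name, k, and seed are illustrative choices, not a reference implementation:

```python
# SMOTE-style interpolation sketch: generate synthetic minority points.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority points by interpolation (illustrative)."""
    rng = np.random.default_rng(seed)
    # Fit KNN on the minority class only (k + 1 because a point is its own neighbor).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, idx = nn.kneighbors(X_minority)               # idx[:, 0] is the point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))            # random minority sample
        j = rng.choice(idx[i, 1:])                   # random one of its k neighbors
        diff = X_minority[j] - X_minority[i]         # difference vector
        gap = rng.random()                           # random number in [0, 1)
        synthetic.append(X_minority[i] + gap * diff) # interpolated synthetic point
    return np.vstack(synthetic)
```

In practice the same idea is available ready-made, e.g. as SMOTE in the imbalanced-learn package.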