External: Dealing with Skewness Flashcards

1
Q

WHAT ARE THE EFFECTS OF SKEWED DATA ON DIFFERENT MODELS AND STATISTICAL METHODS?

A

Effects of skewed data:
+ Degrades model performance (especially for regression-based models)
+ Skewed data also does not work well with many statistical methods that assume normality
+ Tree-based models are not affected
(A quick skewness check is sketched below.)
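A minimal sketch of how you might detect skewed columns with pandas (the DataFrame and its values are illustrative, not part of the deck):

import pandas as pd

# Illustrative DataFrame: "income" is right-skewed, "age" is roughly symmetric.
df = pd.DataFrame({"income": [20, 22, 25, 30, 35, 40, 500],
                   "age":    [25, 30, 35, 40, 45, 50, 55]})

# pandas computes sample skewness per numeric column;
# an absolute skew above roughly 1 is commonly treated as strongly skewed.
print(df.skew())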

2
Q

WHAT ARE THE WAYS OF DEALING WITH SKEWED DATA?

A

1- Log transformation: use it only when your data is skewed to the right and strictly positive (np.log).
2- Removing outliers
3- Cube root: useful when values are very large; it can also be applied to negative values and zero.
4- Square root: applies only to non-negative values.
5- Normalize (min-max)
6- Reciprocal (1/x)
7- Square: apply to left-skewed data (np.square).
8- Box-Cox transformation: it requires strictly positive values, but you can shift the column by a constant so that all values become positive.
from sklearn.preprocessing import PowerTransformer
boxcoxTr = PowerTransformer(method="box-cox", standardize=True)
9- Yeo-Johnson transformation: it's Box-Cox with a boost! It also works for negative values and 0, and it is the default method for PowerTransformer (a usage sketch of these transforms follows this list).
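A minimal sketch of these transforms with NumPy and scikit-learn (the array x is illustrative; Box-Cox is shown only because the sample values are strictly positive):

import numpy as np
from sklearn.preprocessing import PowerTransformer

# Illustrative right-skewed, strictly positive column (2-D for scikit-learn).
x = np.array([1.0, 2.0, 2.5, 3.0, 4.0, 50.0, 120.0]).reshape(-1, 1)

log_x   = np.log(x)    # right skew, positive values only
sqrt_x  = np.sqrt(x)   # non-negative values only
cbrt_x  = np.cbrt(x)   # also fine for negative values and zero
recip_x = 1.0 / x      # reciprocal, undefined at zero

# Box-Cox needs strictly positive input (shift the column first if it is not).
boxcoxTr = PowerTransformer(method="box-cox", standardize=True)
x_boxcox = boxcoxTr.fit_transform(x)

# Yeo-Johnson is the default method and also handles zeros and negatives.
yeoTr = PowerTransformer(method="yeo-johnson", standardize=True)
x_yeo = yeoTr.fit_transform(x)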

3
Q

WHAT TO DO WITH SKEWNESS IN THE TARGET VARIABLE? (In Classification)

A

Use under-sampling or over-sampling to rebalance the classes (sketched below).
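A minimal sketch of random over-sampling with sklearn.utils.resample (the DataFrame and column names are illustrative assumptions):

import pandas as pd
from sklearn.utils import resample

# Illustrative imbalanced data: six majority-class rows vs two minority-class rows.
df = pd.DataFrame({"feature": range(8),
                   "target":  [0, 0, 0, 0, 0, 0, 1, 1]})

majority = df[df["target"] == 0]
minority = df[df["target"] == 1]

# Over-sample the minority class with replacement up to the majority size.
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)

balanced = pd.concat([majority, minority_upsampled])
print(balanced["target"].value_counts())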

4
Q

Does any machine learning algorithm assume a normal distribution of the data?

A

Data is not required to follow a normal distribution. Many ML models also work well on non-normally distributed data. Models like decision trees and XGBoost assume no normality and can work on raw data. Linear regression, in particular, is statistically valid when only the model errors are Gaussian, not the entire dataset.
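A minimal sketch of checking the Gaussian-errors point: fit a linear regression and test the residuals, not the raw features (the data here is synthetic and illustrative):

import numpy as np
from scipy.stats import shapiro
from sklearn.linear_model import LinearRegression

# Synthetic regression data with roughly Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Shapiro-Wilk tests the residuals for normality; a large p-value means
# no evidence against the Gaussian-error assumption.
stat, p_value = shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")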
