Final Flashcards
(27 cards)
Line of code to sort a df
df.sort_values(by='Feature', axis=0, ascending=True, inplace=False)
axis=0 sorts rows, axis=1 sorts columns; ascending and inplace each take True/False (inplace defaults to False)
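A minimal sketch of the sort call, assuming a small made-up DataFrame (the column names "name" and "score" are hypothetical). It shows that with the default inplace=False the call returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
df = pd.DataFrame({"name": ["a", "b", "c"], "score": [70, 95, 82]})

# Sort rows by "score", highest first. Because inplace defaults to False,
# a new DataFrame is returned and df keeps its original order.
sorted_df = df.sort_values(by="score", ascending=False)

print(sorted_df)
```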
Standardization vs Normalization
Standardized (preferred) = values rescaled to mean 0 and standard deviation 1, so a value of 1 is one standard deviation above the mean
Normalized = values scaled from 0 to 1
Standardized scales are always identical
FALSE
Normalized scales are always identical
TRUE, the values are always on a scale from 0 to 1.
Normalized Scale
(x - min)/(max - min)
How to Normalize in Python
from sklearn import preprocessing
preprocessing.MinMaxScaler().fit_transform(df)
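To see what the scaler is doing, here is a sketch that applies the (x - min)/(max - min) formula by hand with pandas instead of sklearn. The data and column names are made up. With MinMaxScaler's default 0-to-1 range the numbers should come out the same, though a constant column would divide by zero in this hand-rolled version.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0],
    "weight": [50.0, 75.0, 100.0],
})

# Apply (x - min) / (max - min) to every column at once.
# df.min() and df.max() are computed per column.
normalized = (df - df.min()) / (df.max() - df.min())

print(normalized)
```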
How to Standardize in Python
from sklearn import preprocessing
preprocessing.scale(df)
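A sketch of the same idea done by hand with pandas, on made-up data. It assumes the population standard deviation (ddof=0), which is the convention sklearn's scale uses; pandas defaults to the sample standard deviation (ddof=1), so ddof=0 has to be passed explicitly to match.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"x": [2.0, 4.0, 6.0]})

# Subtract each column's mean and divide by its population standard deviation.
standardized = (df - df.mean()) / df.std(ddof=0)

# After standardizing, each column has mean 0 and standard deviation 1.
print(standardized)
```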
Is the standardized value interchangeable with the standardized distance from the mean?
NO, the value is signed but the distance is not!
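A small illustration of the difference, using a made-up Series: two points on opposite sides of the mean get standardized values of opposite sign, but the same standardized distance once the sign is dropped with abs().

```python
import pandas as pd

# Hypothetical example data, symmetric around the mean of 20.
s = pd.Series([10.0, 20.0, 30.0])

# Standardized value: signed, negative below the mean and positive above it.
z = (s - s.mean()) / s.std(ddof=0)

# Standardized distance from the mean: the absolute value, never negative.
distance = z.abs()

print(z.tolist())
print(distance.tolist())
```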
Dirty Data
Missing, outlier, or duplicate data
You should only discard dirty data when there is a clear trend behind dirty instances
FALSE! This can introduce bias. Only discard dirty data that makes up a small share of the instances, seems to be random, or is a duplicate.
Pairwise discarding
An instance with a missing value is discarded only from analyses that use that feature, and kept for analyses of its other features.
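A sketch contrasting pairwise discarding with dropping whole rows, on a made-up DataFrame (the column names "age" and "income" are hypothetical). Dropping every row with any gap leaves one instance, while the pairwise approach lets each feature's analysis use every row that has a value for that feature.

```python
import numpy as np
import pandas as pd

# Hypothetical example data with one gap in each column.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "income": [30.0, 50.0, np.nan],
})

# Discarding whole rows: any row with a missing value is removed everywhere.
complete_rows = df.dropna()

# Pairwise discarding: each analysis drops only the rows missing ITS feature.
age_mean = df["age"].dropna().mean()        # uses rows 0 and 2
income_mean = df["income"].dropna().mean()  # uses rows 0 and 1

print(len(complete_rows), age_mean, income_mean)
```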
Mean Imputation
Replace a missing value with the mean of that feature, excluding any other missing values or outliers!
Replace missing values in Python
df.fillna(value=v)
You should calculate df mean with df.mean()
FALSE, you should also pass numeric_only=True: df.mean(numeric_only=True)
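Putting the two cards together, here is a sketch of mean imputation on a made-up DataFrame with one text column and one numeric column (both names are hypothetical). numeric_only=True keeps the text column out of the mean calculation, and passing the resulting Series to fillna fills each column with its own mean.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: "city" is text, "temp" has one missing value.
df = pd.DataFrame({
    "city": ["A", "B", "C"],
    "temp": [10.0, np.nan, 20.0],
})

# Column means for numeric columns only; missing values are skipped by default.
means = df.mean(numeric_only=True)

# fillna accepts a Series indexed by column name, so each numeric column
# is filled with its own mean and the text column is left alone.
filled = df.fillna(value=means)

print(filled)
```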
inner, outer, left, right merge
inner - intersection
outer - union
left - all rows from the left (calling) df
right - all rows from the right (passed-in) df
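A sketch comparing the four merge types on two tiny made-up DataFrames that share a hypothetical "key" column. It collects which keys survive each kind of merge.

```python
import pandas as pd

# Hypothetical example data: keys 2 and 3 appear in both frames.
left = pd.DataFrame({"key": [1, 2, 3], "l": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "r": ["x", "y", "z"]})

# Record the keys present in the result of each merge type.
keys = {
    how: sorted(left.merge(right, on="key", how=how)["key"].tolist())
    for how in ["inner", "outer", "left", "right"]
}

print(keys)
```

Rows with no match on the other side are kept with NaN in the columns that come from that side.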
Proportion
Number in Group/Total #
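One way to get proportions in pandas, sketched on a made-up Series: value_counts with normalize=True divides each group's count by the total number of values.

```python
import pandas as pd

# Hypothetical example data: 3 of the 4 values are "cat".
s = pd.Series(["cat", "dog", "cat", "cat"])

# normalize=True returns count / total for each group instead of raw counts.
proportions = s.value_counts(normalize=True)

print(proportions)
```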
Box plot
Visual 5-number summary: min, Q1, median, Q3, max
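The five numbers can be computed directly, sketched here with pandas quantile on made-up data. This assumes pandas' default linear interpolation for quartiles; other tools may use slightly different quartile rules. Plotting libraries also often draw the whiskers only out to 1.5 times the interquartile range and show anything beyond that as separate outlier points.

```python
import pandas as pd

# Hypothetical example data.
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Quantiles 0, 0.25, 0.5, 0.75 and 1 are min, Q1, median, Q3 and max.
summary = s.quantile([0, 0.25, 0.5, 0.75, 1.0])

print(summary)
```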
seaborn - histogram
sns.histplot(data=df, x='Feature')
seaborn - density plot
sns.kdeplot(data=df, x='Feature')
seaborn - bar chart
sns.countplot(data=df, x='Feature')
seaborn - box plot
sns.boxplot(data=df, x='Feature')
seaborn - violin plot
sns.violinplot(data=df, x='Feature')
seaborn - scatter plot
sns.scatterplot(data=df, x='Horizontal feature', y='Vertical feature')
seaborn - swarm plot
sns.swarmplot(data=df, x='Numerical feature', y='Categorical feature')