Final Flashcards by Lapis Lazuli

Line of code to sort a df

df.sort_values(by='feature', axis=0, ascending=True, inplace=False)
axis: 0 sorts rows (default), 1 sorts columns. ascending: True (default) or False. inplace: True or False (default).
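A minimal sketch of the call in use, assuming pandas is installed. The DataFrame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical example data; the column names are not from the course.
df = pd.DataFrame({"name": ["a", "b", "c"], "score": [2, 3, 1]})

# Sort rows by the 'score' column, largest first. With inplace=False
# (the default) a new DataFrame is returned and df is left unchanged.
sorted_df = df.sort_values(by="score", ascending=False)
```

With inplace=True the call would modify df itself and return None.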

Standardization vs Normalization

Standardized (preferred) = values rescaled to mean 0 and standard deviation 1, so a value of 1 is one standard deviation above the mean
Normalized = values scaled from 0 to 1

Standardized scales are always identical

FALSE, the range of standardized values depends on the data.

Normalized scales are always identical

TRUE, the values are always on a scale from 0 to 1.

Normalized Scale

(x - min)/(max - min)
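The formula can be sketched in plain Python as follows. The function name min_max_normalize is hypothetical, chosen only for this example.

```python
def min_max_normalize(values):
    """Scale numbers to the range 0 to 1 using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # max - min would be zero, so the formula is undefined.
        raise ValueError("all values are equal")
    return [(x - lo) / (hi - lo) for x in values]

normalized = min_max_normalize([10, 20, 30])
```

The smallest input maps to 0 and the largest to 1, which is why normalized scales always cover the same range.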

How to Normalize in Python

from sklearn import preprocessing
preprocessing.MinMaxScaler().fit_transform(df)

How to Standardize in Python

from sklearn import preprocessing
preprocessing.scale(df)

Is the standardized value interchangeable with the standardized distance from mean?

NO, value is signed but distance is not!!
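A small sketch of the difference, using only the standard library. The sample values are made up, and the population standard deviation is used here as an assumption of this example.

```python
from statistics import mean, pstdev

values = [2, 4, 6]
mu = mean(values)
sigma = pstdev(values)  # population standard deviation

# Standardized values keep their sign: below the mean is negative.
z_scores = [(x - mu) / sigma for x in values]

# Standardized distances drop the sign: only how far from the mean matters.
distances = [abs(z) for z in z_scores]
```

Here 2 and 6 have opposite standardized values but the same standardized distance.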

Dirty Data

Missing, outlier, or duplicate data

You should only discard dirty data when there is a clear trend behind dirty instances

FALSE! This can introduce bias. Only discard dirty data that makes up a small share of the instances, seems to be random, or is duplicated.

Pairwise discarding

An instance with a missing value is discarded only from analyses that use the affected feature, but kept in for analysis of its other features.
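One way to picture this in pandas, as a sketch with made-up data. Dropping whole rows with dropna() is shown only as a contrast.

```python
import pandas as pd

# Hypothetical data with one missing value in each feature.
df = pd.DataFrame({
    "height": [150.0, 160.0, None, 180.0],
    "weight": [50.0, None, 70.0, 80.0],
})

# Contrast: drop every row that has any missing value (2 rows remain).
complete_rows = df.dropna()

# Pairwise discarding: each analysis skips only the rows missing the
# feature it needs. pandas skips NaN per column by default.
height_mean = df["height"].mean()  # uses rows 0, 1, 3
weight_mean = df["weight"].mean()  # uses rows 0, 2, 3
```

Each mean is based on three instances, while dropping incomplete rows would leave only two.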

Mean Imputation

Replace a missing value with the mean of that feature, excluding any other missing values or outliers!

Replace missing values in Python

df.fillna(value=v)
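As a sketch, fillna can be combined with a column mean to do mean imputation. The column name is made up, and this example does not handle outliers.

```python
import pandas as pd

# Hypothetical data with one missing age.
df = pd.DataFrame({"age": [20.0, None, 40.0]})

# Series.mean() skips NaN by default, so only known values are averaged.
age_mean = df["age"].mean()

# Replace the missing value with that mean.
df["age"] = df["age"].fillna(age_mean)
```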

You should calculate df mean with df.mean()

FALSE, you should also pass numeric_only=True, as in df.mean(numeric_only=True), so that non-numeric columns are skipped.
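A short sketch with a mixed DataFrame, where the column names are made up. How df.mean() behaves without the argument depends on the pandas version, so that case is not shown.

```python
import pandas as pd

# Hypothetical data: one text column and one numeric column.
df = pd.DataFrame({"city": ["x", "y"], "temp": [10.0, 20.0]})

# numeric_only=True restricts the calculation to numeric columns,
# so the text column 'city' is left out of the result.
means = df.mean(numeric_only=True)
```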

inner, outer, left, right merge

inner - intersection
outer - union
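left - every row of the left (calling) DataFrame, plus matches from the right
right - every row of the right (passed-in) DataFrame, plus matches from the left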
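The four options can be compared on two small tables. This is a sketch with made-up data, assuming both tables share a key column called id.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "l": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "r": ["x", "y", "z"]})

inner = left.merge(right, on="id", how="inner")       # ids in both: 2, 3
outer = left.merge(right, on="id", how="outer")       # ids in either: 1, 2, 3, 4
left_join = left.merge(right, on="id", how="left")    # all left ids: 1, 2, 3
right_join = left.merge(right, on="id", how="right")  # all right ids: 2, 3, 4
```

Rows without a match on the other side get NaN in the columns that come from that side.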

Proportion

Number in group / total number
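In pandas, one way to get the proportion of each group is value_counts with normalize=True. A sketch with made-up data:

```python
import pandas as pd

# Hypothetical categorical data: 3 red, 1 blue.
colors = pd.Series(["red", "blue", "red", "red"])

# normalize=True divides each group's count by the total number of values.
proportions = colors.value_counts(normalize=True)
```

Here red appears 3 times out of 4, so its proportion is 0.75.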

Box plot

Visual 5-number summary: min, Q1, median, Q3, max
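The five numbers behind a box plot can be computed directly. A sketch with made-up data, using the default linear interpolation of Series.quantile:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Quantiles 0, 0.25, 0.5, 0.75, 1 give min, Q1, median, Q3, max.
five_num = s.quantile([0, 0.25, 0.5, 0.75, 1]).tolist()
```

Note that many plotting libraries, seaborn included, by default draw whiskers only out to 1.5 times the interquartile range and show points beyond that separately, so the whisker ends are not always the min and max.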

seaborn - histogram

sns.histplot(data=df, x='Feature')

seaborn - density plot

sns.kdeplot(data=df, x='Feature')

seaborn - bar chart

sns.countplot(data=df, x='Feature')

seaborn - box plot

sns.boxplot(data=df, x='Feature')

seaborn - violin plot

sns.violinplot(data=df, x='Feature')

seaborn - scatter plot

sns.scatterplot(data=df, x='Horizontal feature', y='Vertical feature')

seaborn - swarm plot

sns.swarmplot(data=df, x='Numerical feature', y='Categorical feature')

SQL

Structured Query Language

SQL - select multiple columns

Separate the column names with commas. Don't use NOT to exclude a column, just list all the other columns.
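A sketch of such a query, run from Python with the standard-library sqlite3 module. The table and column names are made up for illustration.

```python
import sqlite3

# In-memory database with a hypothetical 'people' table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, city TEXT)")
conn.execute("INSERT INTO people VALUES ('Ana', 30, 'Lima')")

# List the wanted columns, separated by commas; 'city' is simply left out.
rows = conn.execute("SELECT name, age FROM people").fetchall()
conn.close()
```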

Final Flashcards

(27 cards)