Final Flashcards

(27 cards)

1
Q

Line of code to sort a df

A

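df.sort_values(by='feature', axis=0, ascending=True, inplace=False)
by = column name(s) to sort on; axis=0 sorts rows (default), axis=1 sorts columns; ascending=True/False; inplace=True/False (default False)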
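A minimal sketch of the sort call on a small made-up frame. The column names `name` and `score` are illustrative assumptions, not part of the card.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"name": ["a", "b", "c"], "score": [2, 3, 1]})

# Sort rows by 'score', largest first. inplace defaults to False,
# so this returns a new DataFrame and leaves df unchanged.
sorted_df = df.sort_values(by="score", ascending=False)

# With inplace=True the original df is modified and the call returns None.
df.sort_values(by="score", inplace=True)
```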

2
Q

Standardization vs Normalization

A

Standardized (preferred) = values rescaled to mean 0 and standard deviation 1, so a value of 1 is one standard deviation above the mean
Normalized = values rescaled to run from 0 to 1

3
Q

Standardized scales are always identical

A

FALSE, the mean is 0 and the standard deviation is 1, but the min and max depend on the data.

4
Q

Normalized scales are always identical

A

TRUE, the values are always on a scale from 0 to 1.

5
Q

Normalized Scale

A

(x - min)/(max - min)
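A small sketch of the formula in plain Python. The function name `min_max_normalize` and the choice to raise an error when all values are equal are assumptions made for this example.

```python
def min_max_normalize(values):
    """Rescale values to the range [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # max - min would be zero, so the formula is undefined here.
        raise ValueError("all values are equal, cannot normalize")
    return [(x - lo) / (hi - lo) for x in values]

# The smallest value maps to 0 and the largest to 1.
normalized = min_max_normalize([10, 20, 40])
```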

6
Q

How to Normalize in Python

A

from sklearn import preprocessing
preprocessing.MinMaxScaler().fit_transform(df)

7
Q

How to Standardize in Python

A

from sklearn import preprocessing
preprocessing.scale(df)

8
Q

Is the standardized value interchangeable with the standardized distance from the mean?

A

NO, the value is signed but the distance is not!
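A short sketch of the difference, using only the standard library. The sample values are made up, and using the population standard deviation (`pstdev`) is an assumption; the card does not say which form is intended.

```python
from statistics import mean, pstdev

values = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample
mu = mean(values)                   # 5
sigma = pstdev(values)              # population standard deviation, 2.0

# Standardized value: keeps its sign (below the mean is negative).
z_scores = [(x - mu) / sigma for x in values]

# Standardized distance from the mean: the absolute value, never negative.
distances = [abs(z) for z in z_scores]
```

For the first value, 2, the standardized value is -1.5 while the distance is 1.5.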

9
Q

Dirty Data

A

Missing, outlier, or duplicate data

10
Q

You should only discard dirty data when there is a clear trend behind dirty instances

A

FALSE! This can introduce bias. Only discard dirty data that makes up a small share of the instances, appears to be random, or is duplicated.

11
Q

Pairwise discarding

A

An instance is discarded only from analyses that use its dirty feature, but kept for analyses of other features.

12
Q

Mean Imputation

A

Replace a missing value with the mean of that feature, excluding any other missing values or outliers!

13
Q

Replace missing values in Python

A

df.fillna(value=v)
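A minimal sketch of mean imputation with `fillna`. The frame and its column names (`age`, `city`) are made up for illustration.

```python
import pandas as pd

# Hypothetical data with one missing value per column.
df = pd.DataFrame({"age": [20.0, None, 40.0], "city": ["x", "y", None]})

# Means of the numeric columns only; missing entries are skipped by default.
col_means = df.mean(numeric_only=True)

# Passing a Series fills each column with the value under the matching label,
# so 'age' is filled with its mean and 'city' keeps its missing value.
filled = df.fillna(value=col_means)
```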

14
Q

You should calculate df mean with df.mean()

A

FALSE, use df.mean(numeric_only=True) so that non-numeric columns are skipped

15
Q

inner, outer, left, right merge

A

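inner - only keys found in both dfs (intersection)
outer - keys found in either df (union)
left - all keys from the left (origin) df
right - all keys from the right (external) df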
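A small sketch comparing the four merge types on two made-up frames that share an `id` column. The frames and column names are assumptions for this example.

```python
import pandas as pd

# Hypothetical frames: ids 2 and 3 appear in both.
left = pd.DataFrame({"id": [1, 2, 3], "a": ["x", "y", "z"]})
right = pd.DataFrame({"id": [2, 3, 4], "b": ["p", "q", "r"]})

# Collect the ids that survive each kind of merge.
ids = {
    how: sorted(left.merge(right, on="id", how=how)["id"].tolist())
    for how in ["inner", "outer", "left", "right"]
}
```

Here `inner` keeps ids 2 and 3, `outer` keeps 1 through 4, `left` keeps 1 through 3, and `right` keeps 2 through 4. Rows without a match get missing values in the columns from the other frame.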

16
Q

Proportion

A

Number in Group/Total #
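A quick sketch of the calculation using the standard library. The labels are made up for illustration.

```python
from collections import Counter

labels = ["cat", "dog", "cat", "bird"]   # hypothetical group labels
counts = Counter(labels)                 # number in each group

# Proportion = number in group / total number of instances.
proportions = {group: n / len(labels) for group, n in counts.items()}
```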

17
Q

Box plot

A

Visual 5-number summary:
min, Q1, median, Q3, max
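A sketch of computing the five numbers with the standard library. Quartile conventions differ between tools; the `inclusive` method used here is an assumption, and the data are made up.

```python
from statistics import median, quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical sample

# quantiles(n=4) returns the three cut points Q1, Q2 (the median), and Q3.
q1, _, q3 = quantiles(data, n=4, method="inclusive")

five_number_summary = (min(data), q1, median(data), q3, max(data))
```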

18
Q

seaborn - histogram

A

sns.histplot(data=df, x='Feature')

19
Q

seaborn - density plot

A

sns.kdeplot(data=df, x='Feature')

20
Q

seaborn - bar chart

A

sns.countplot(data=df, x='Feature')

21
Q

seaborn - box plot

A

sns.boxplot(data=df, x='Feature')

22
Q

seaborn - violin plot

A

sns.violinplot(data=df, x='Feature')

23
Q

seaborn - scatter plot

A

sns.scatterplot(data=df, x='Horizontal feature', y='Vertical feature')

24
Q

seaborn - swarm plot

A

sns.swarmplot(data=df, x='Numerical feature', y='Categorical feature')

25
Q

SQL

A

Structured Query Language

26
Q

SQL - select multiple columns

A

Separate the column names with commas. Don't try to exclude columns with NOT; list every column you want instead.

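A small sketch of a multi-column SELECT, run from Python with the built-in `sqlite3` module. The table `people` and its columns are made up for illustration.

```python
import sqlite3

# In-memory database with a hypothetical three-column table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [(1, "Ana", 30), (2, "Ben", 25)],
)

# Select two of the three columns by listing them, separated by commas.
rows = conn.execute("SELECT name, age FROM people ORDER BY id").fetchall()
conn.close()
```

The `id` column is left out simply by not listing it.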
27