Machine Learning for Classification Flashcards

1
Q

What is it called when a customer leaves a company for a competitor?

A

Churn. If we want to identify such customers, we call it churn prediction. If we can identify customers who are likely to churn, we can send them discounts so that they do not leave for the competitor.

2
Q

Binary Classification

A

Our target variable y can be either 0 (false) or 1 (true).
g(X) ~ y
g(X) is the likelihood of the customer churning, between 0 and 1.
X is the feature matrix.
y is the binary target variable.

3
Q

How can you look at all columns at the same time?

A

df.head().T  # transpose the preview: columns become rows, so all of them fit on screen

4
Q

What can you do in the data preparation step?

A

1) Standardize column names, e.g. lowercase them and replace spaces with underscores.
2) Check the dtypes of the columns, e.g. that numeric columns are integers or floats. Is there a column that should be a number but is not? Maybe it is stored as a string.
3) Check for missing values and replace them.
4) Check the target variable. In machine learning we are interested in numeric values, so we convert a string target variable, e.g. yes or no, to 1 or 0 (see the sketch below).
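
A minimal pandas sketch of these steps (the file name and the totalcharges column are illustrative assumptions):

import pandas as pd

df = pd.read_csv('data.csv')  # hypothetical file
df.columns = df.columns.str.lower().str.replace(' ', '_')  # 1) standardize names
df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce')  # 2) string column that should be numeric
df.totalcharges = df.totalcharges.fillna(0)  # 3) replace missing values
df.churn = (df.churn == 'yes').astype(int)  # 4) encode the target as 1/0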

5
Q

How can you see the docs of a function within a Jupyter notebook?

A

function_name? (e.g. train_test_split?)

6
Q

How do you use scikit-learn to set up the validation framework?

A

from sklearn.model_selection import train_test_split
# This function splits a dataset into two parts
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_train_full, test_size=0.25, random_state=11)
In order to get a validation set that is 20% of the original dataset, we need to split a bit more than 20% off df_train_full, which is why we choose 0.25: 20% / 80% = 1/4.

7
Q

What steps need to be taken to set up the validation framework?

A

1) Split the dataset into 60% training, 20% validation and 20% test.
2) Store the target outcomes in separate variables (y_train, y_val, y_test).
3) Delete the target variable from the training, validation and test feature DataFrames so it cannot leak into the features (see the sketch below).
4) Shuffle the dataset, which train_test_split does automatically.
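
A minimal sketch, assuming the churn target and the split from the previous card:

from sklearn.model_selection import train_test_split

df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_train_full, test_size=0.25, random_state=11)

y_train = df_train.churn.values  # 2) store the target outcomes
y_val = df_val.churn.values
y_test = df_test.churn.values

del df_train['churn']  # 3) remove the target from the feature DataFrames
del df_val['churn']
del df_test['churn']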

8
Q

What steps need to be taken in exploratory data analysis?

A

1) Check for missing values.
2) Look at the distribution of the target variable, e.g. df_train_full.churn.value_counts().
3) Look at the numerical and categorical variables, e.g. the number of unique values per variable.
4) Look at the value counts or percentages of the target outcome.
5) Check feature importance with respect to the target variable: if a feature visibly affects the target variable, it may be an important feature. (See the sketch below.)
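
A short sketch of these checks, where categorical is an assumed list of categorical column names:

df_train_full.isnull().sum()  # 1) missing values per column
df_train_full.churn.value_counts()  # 2) target distribution
df_train_full[categorical].nunique()  # 3) unique values per variable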

9
Q

How can you see percentages of the target variable?

A

You can use the value_counts method with the parameter normalize=True.
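
For example, with the churn target from the earlier cards:

df_train_full.churn.value_counts(normalize=True)  # fractions instead of counts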

10
Q

How to measure feature importance?

A

1) Take the difference between the global mean of the target and the group mean for each value of the feature. If (global_mean - group_mean) > 0, the group is less likely than average to have the target outcome; if it is < 0, the group is more likely. We are interested in large differences.
2) Risk ratio: divide the group mean by the global mean. If it is greater than 1, the group is more likely to have the target outcome; if it is less than 1, it is less likely (see the sketch below).
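
A pandas sketch of both measures; gender stands in for any categorical feature (the column name is an assumption):

global_mean = df_train_full.churn.mean()
group_mean = df_train_full.groupby('gender').churn.mean()
diff = global_mean - group_mean  # > 0: group less likely to churn
risk = group_mean / global_mean  # > 1: group more likely to churn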

11
Q

How do you display a DataFrame from within a for loop?

A

from IPython.display import display
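
A usage sketch (the column names are assumptions): expressions inside a loop are not auto-rendered by Jupyter, so call display() explicitly:

for col in ['gender', 'contract']:
    display(df_train_full.groupby(col).churn.mean().to_frame())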

12
Q

Mutual Information

A

We can measure the importance of a feature using mutual information. It's a concept from information theory: it tells us how much we learn about one variable if we know the value of another. It can be calculated with sklearn:

from sklearn.metrics import mutual_info_score

The higher the mutual information, the more we learn about the target variable by observing the value of the other variable.

Mutual information is a way to check the relative importance of categorical variables.
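
A minimal sketch, assuming the df_train_full DataFrame and the categorical column list from the earlier cards:

from sklearn.metrics import mutual_info_score

def mutual_info_churn_score(series):
    return mutual_info_score(series, df_train_full.churn)

mi = df_train_full[categorical].apply(mutual_info_churn_score)
mi.sort_values(ascending=False)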

13
Q

How can you convert the output of sort_values() to a DataFrame?

A

Using to_frame().
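
For example, continuing with the mutual information Series mi from the previous cards:

mi.sort_values(ascending=False).to_frame(name='MI')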

14
Q

Correlation

A

It's a way to measure feature importance for numerical variables.

The correlation coefficient (Pearson correlation) measures the dependency between two variables.

The values satisfy -1 <= r <= 1.
Negative correlation: an increase in one variable comes with a decrease in the other.
Positive correlation: an increase in one variable comes with an increase in the other.
Values with absolute value above 0.5, and especially close to 1, indicate a strong correlation.

Zero correlation means the variable has no effect on the target outcome.

e.g. x has values between 0 and 200, i.e. tenure;
y has values 0 or 1, i.e. churn.
Positive correlation: more tenure, higher churn.
Negative correlation: more tenure, less churn.
Zero correlation: tenure has no effect on churn.

Correlation can be calculated in pandas as df[cols].corrwith(df.target_col).to_frame('correlation')

15
Q

One-hot encoding

A

How can we encode categorical features before giving them to the machine learning algorithm?

Converting categorical variables into a set of binary variables is called one-hot encoding, and it can be done with scikit-learn:
from sklearn.feature_extraction import DictVectorizer
1) Convert the DataFrame into a records-based dict.
2) Use the fit method of DictVectorizer on it.
3) Use the fitted vectorizer to transform the dict.
4) DictVectorizer is smart enough to know which columns are numerical and leaves those unconverted (see the sketch below).
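
A minimal sketch, assuming the df_train split from the earlier cards:

from sklearn.feature_extraction import DictVectorizer

train_dict = df_train.to_dict(orient='records')  # 1) records-based dicts
dv = DictVectorizer(sparse=False)
dv.fit(train_dict)  # 2) learn the feature space
X_train = dv.transform(train_dict)  # 3) build the feature matrix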

16
Q

How can we combine fit and transform in scikit-learn?

A

We can use fit_transform.
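
For example, with the DictVectorizer and train_dict from the previous card:

X_train = dv.fit_transform(train_dict)  # fit and transform in one call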

17
Q

Logistic Regression

A

g(X) ~ y where g is the model, X is the feature matrix and y is the target variable.

Logistic regression is a supervised machine learning algorithm for binary classification.

In binary classification, the targets are 0s and 1s: 0 is negative and 1 is positive.

The model outputs a probability between 0 and 1.

It's similar to linear regression: we have a bias term and weights multiplied by features. The difference is in the output. Linear regression outputs any number from -inf to +inf, whereas logistic regression outputs a number between 0 and 1.
It does this by using the sigmoid function:

g(x) = sigmoid(w0 + w1*x1 + ... + wn*xn)

For negative inputs, the sigmoid is below 0.5, and for positive inputs it is above 0.5.

Sigmoid formula: 1/(1 + np.exp(-z))
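
A small NumPy sketch of the sigmoid:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

sigmoid(-5)  # ~0.007, below 0.5 for negative inputs
sigmoid(0)   # 0.5
sigmoid(5)   # ~0.993, above 0.5 for positive inputs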

18
Q

dot products in linear algebra are called

A

Linear operators.
Linear models deliver good quality and are fast to train.

19
Q

What are hard predictions and soft predictions?

A

Hard predictions are class labels, e.g. 1 or 0.
Soft predictions are probabilities.

20
Q

Output of predict_proba for classification model

A

It returns a matrix with two columns: the first is the probability of class 0, the second the probability of class 1.
We are usually interested in the second column.
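
For example, assuming a fitted model and validation features X_val:

y_pred = model.predict_proba(X_val)[:, 1]  # keep only the class-1 probabilities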

21
Q

Classification Accuracy

A

Checking how many of our predictions match the actual outputs, e.g.
(predictions == actual_values).mean()
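
Note that soft predictions must first be turned into hard ones with a threshold (0.5 here) before comparing, as in this sketch (y_pred and y_val assumed from the earlier cards):

churn_decision = y_pred >= 0.5  # hard predictions from probabilities
(churn_decision == y_val).mean()  # fraction of correct predictions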

22
Q

Model Interpretation

A

We can look at each feature and its corresponding weight:
dict(zip(dv.get_feature_names_out(), model.coef_[0].round(3)))

This gives the weight for each feature; the bias term is available separately as model.intercept_[0].

23
Q

What does the underscore mean in Jupyter?

A

It holds the output of the previous cell, so you can plug it into the next expression.

24
Q

Using the model

A

Train on both the 60% training dataset and the 20% validation dataset (i.e. the full 80%, df_train_full).
Apply the same preparation to the test dataset.
Predict probabilities on the test dataset.
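
A sketch of this final step, assuming the splits and the DictVectorizer from the earlier cards:

from sklearn.linear_model import LogisticRegression

y_full = df_train_full.churn.values
full_dict = df_train_full.drop(columns=['churn']).to_dict(orient='records')
X_full = dv.fit_transform(full_dict)  # same preparation as for training

model = LogisticRegression()
model.fit(X_full, y_full)

test_dict = df_test.to_dict(orient='records')  # same preparation on the test set
X_test = dv.transform(test_dict)
y_pred = model.predict_proba(X_test)[:, 1]  # probabilities on the test set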