A-kassen Flashcards

Question

Why did you choose to look solely at supervised machine learning algorithms?

Answer 1

We knew the target class and what we wanted to predict: churner or non-churner.

Answer 2

Through classification accuracy. However, accuracy does not capture what is important for the problem at hand.

Answer 3

As we had a label, classification was appropriate: classification involves selecting which of a set of labels should be assigned to some data, according to some meaningful indicators.

Answer 4

It means that we are working with two options.

Answer 5

Because we found some of them to be irrelevant to predictions related to the target "will churn" and "will not churn". We examined the degree to which the attributes we chose affect the target in the attribute selection section of the paper.

Answer 6

Because a dataset can be imbalanced, which was the case for our dataset, and in that case accuracy can be misleading. In our example we had 71.5% non-churners.
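A minimal sketch of why accuracy misleads here: a model that always predicts the majority class already reaches the base rate. The labels below are toy data with the same 71.5% split, not the real dataset, and the function name is only illustrative.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most common class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Toy labels with a 71.5% / 28.5% split (assumption: not the real data).
labels = ["no_churn"] * 715 + ["churn"] * 285
baseline = majority_baseline_accuracy(labels)
print(f"baseline accuracy: {baseline:.3f}")
```

Any classifier has to beat this number before its accuracy says anything, and even then it finds no churners unless its recall is above zero.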

Answer 7

We decided to use stratified cross validation in order to evaluate how generalisable our model was.

Answer 8

Stratified cross validation is a way of splitting, or partitioning, the entire dataset into "k" bins of equal size, with the class proportions preserved in each bin, so that both the training data and the test data get a fair share of data points.

Answer 9

Utilise the dataset more. By using stratified cross validation, you get more out of the training and test data, that is, the best possible validation and learning results. If we were to use only a single hold-out split, we would train on only part of the dataset and leave the rest behind for testing.

Answer 10

Because the class churn_loyal has a high concentration (71.5%), and as the classes are not equally represented, it can be argued that the dataset is imbalanced.

Answer 11

It is very common, but we knew that we needed to be aware that accuracy could be misleading, and therefore we chose not to focus much on accuracy.

Answer 12

We ensured that the machine learning algorithms were able to get the right data types.

Answer 13

To make the dataset as clean as possible; we could then either remove or adjust these.

Answer 14

We spotted a lot of duplicates, null values, inconsistent casing, and Danish special characters.

Answer 15

To make the dataset as clean as possible; we could then either remove or adjust these.

Answer 16

Defined a rule to decide whether an attribute brings value to predicting churn/no churn.

Answer 17

Used code to create some rules to decide whether an attribute brings value to predicting churn/no churn. For example, we cleaned some of the specific instances and categorised education names and communes into broader groups.

Answer 18

Education name, which has values such as Ha. IT and Ha.it.

Answer 19

Because the machine cannot tell that, for example, Ha.IT and Ha.it are the same value. After doing this, the number of unique values dropped from 841 to 796.
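A small sketch of this kind of normalisation, assuming that lower-casing and removing whitespace is enough to merge the variants; the actual cleaning rules are not given here, and "Jura" is a made-up extra value for illustration.

```python
def normalise_name(name):
    """Lower-case and remove whitespace so spelling variants collapse together."""
    return "".join(name.split()).lower()

# "Ha.IT", "Ha.it" and "Ha. IT" come from the cards; "Jura" is hypothetical.
raw = ["Ha.IT", "Ha.it", "Ha. IT", "Jura"]

unique_before = len(set(raw))
unique_after = len({normalise_name(n) for n in raw})
print(unique_before, unique_after)
```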

Answer 20

It was not possible to use churn month because it was null for most or all members in the dataset, since they have not churned, and it is not possible to calculate a time range from null values. We converted the registration date and churn month into a new attribute: days as member. That way we could calculate how many days each instance had been a member. The data is only representative within the given timeframe, but this is easy to change. We recommend updating it every day so AK has an accurate overview.
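One way the days-as-member attribute could be computed, as a sketch. It assumes that members without a churn date are measured up to a snapshot date (`as_of`); the function and parameter names are illustrative, not taken from the project.

```python
from datetime import date

def days_as_member(registration_date, churn_date=None, as_of=None):
    """Days from registration to churn; open memberships run up to `as_of`.

    Assumption: a missing churn date means the member is still active.
    """
    end = churn_date or as_of or date.today()
    return (end - registration_date).days

active = days_as_member(date(2015, 1, 1), as_of=date(2016, 1, 1))
churned = days_as_member(date(2015, 1, 1), churn_date=date(2015, 1, 31))
print(active, churned)
```

Re-running this daily with a fresh `as_of` keeps the attribute current, which matches the recommendation above.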

Answer 21

Introduced a new attribute, years_correct: age*365 - days as member. With this we calculated the critical range: if a person's number was below 6570 (18*365), it meant that they had become a member before turning 18 or the data was not correct.
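The check can be sketched as below. The formula and the 6570 threshold are from the card; the function names and the example members are hypothetical.

```python
ADULT_DAYS = 18 * 365  # 6570, the threshold from the card

def years_correct(age_years, days_as_member):
    """Approximate age in days at the time the person became a member."""
    return age_years * 365 - days_as_member

def is_suspicious(age_years, days_as_member):
    """True if the person appears to have joined before turning 18."""
    return years_correct(age_years, days_as_member) < ADULT_DAYS

print(is_suspicious(30, 2000))  # joined at roughly age 24.5
print(is_suspicious(20, 1500))  # joined at roughly age 15.9
```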

Answer 22

1. Map Danish names to machine-friendly names
2. Fill null values with meaningful values such as "other"
3. Drop low-quality instances based on our rule-based attribute selection

Answer 23

Accuracy, variance and standard deviation, confusion matrix metrics, ROC/AUC, base rate

Answer 24

A measure of how many instances the model has correctly guessed in total.

Answer 25

We often see this in datasets that are skewed, which is the case for us. The accuracy paradox arises when a model with lower accuracy is preferred because, based on our business understanding, it is more important to identify who may be churners on the basis of their attributes.

Answer 26

Tells us how much we can expect our model's performance to vary on new datasets. If we see a big variance between folds, it suggests that the model does not generalise well; low variance is preferred if the same model is to be used on another dataset.

Answer 27

Gives a quick overview of results. Many metrics can be calculated from the matrix, and one of them that we used is TPR, or recall.

Answer 28

Gives a quick overview of results. Many metrics can be calculated from the matrix, for example FPR and TPR.
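As a sketch of how TPR and FPR fall out of the confusion matrix, the code below counts the four cells by hand on toy predictions. The data and function names are illustrative only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

# Toy data: 1 = churner, 0 = non-churner.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
tpr = tp / (tp + fn)  # recall: share of real churners that were caught
fpr = fp / (fp + tn)  # share of non-churners wrongly flagged
print(tp, fp, tn, fn, tpr, fpr)
```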

Answer 29

For model performance. To set a baseline for the performance of the model.

Answer 30

To get a median over the population

Answer 31

We used stratified cross validation and fold 5 times so

Answer 32

We used stratified cross validation with 5 folds so we could train and test on each fold and measure the mean and variance, which can be used to estimate the expected performance on future datasets.
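A sketch of the summary step, assuming one score per fold has already been computed. The five numbers are hypothetical and are not results from the paper.

```python
from statistics import mean, variance

# Hypothetical per-fold scores from 5 folds (illustrative numbers only).
fold_scores = [0.66, 0.70, 0.68, 0.69, 0.67]

avg = mean(fold_scores)      # expected performance
var = variance(fold_scores)  # sample variance across folds
print(f"mean={avg:.3f} variance={var:.5f}")
```

A small variance relative to the mean suggests the score is stable across folds.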

Answer 33

If it is below the line, the model is not performing well; anything above is an improvement over random guessing.

Answer 34

(0,0) means that you never classify an instance as positive. (1,1) means that you classify all instances as positive.
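The two corner points can be reproduced with a small sketch: a threshold above every score never predicts positive, and a threshold at or below every score always does. The scores and labels are toy values.

```python
def roc_point(scores, y_true, threshold):
    """Return (FPR, TPR) when predicting positive for score >= threshold."""
    pairs = list(zip(scores, y_true))
    tp = sum(1 for s, t in pairs if s >= threshold and t == 1)
    fp = sum(1 for s, t in pairs if s >= threshold and t == 0)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives

scores = [0.9, 0.8, 0.4, 0.3, 0.1]  # toy model scores
y_true = [1, 0, 1, 0, 0]            # toy labels, 1 = positive

never_positive = roc_point(scores, y_true, threshold=1.1)
always_positive = roc_point(scores, y_true, threshold=0.0)
print(never_positive, always_positive)
```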

Answer 35

Feature engineering can lead to increased performance by refining the dataset with knowledge about the attributes. An example is days as member, where we created a new feature since we saw bigger potential this way. So instead of introducing more complex models, we refined our dataset and improved performance that way.

Answer 36

To limit the influence of numeric features with broader ranges. For example, days as member had a very wide range, which could overshadow other numeric features. Therefore we used feature scaling to ensure that our attributes contribute approximately proportionally.

Answer 37

StandardScaler, which standardized the numeric features.
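What standardization does can be sketched in plain Python: subtract the mean and divide by the population standard deviation, which is the same transformation scikit-learn's StandardScaler applies per feature. The values below are made-up days-as-member numbers.

```python
from statistics import mean, pstdev

def standardise(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

days = [100, 2000, 3900]  # hypothetical days-as-member values
scaled = standardise(days)
print(scaled)
```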

Answer 38

It is done on the dataset to even out differences in variance, so that no single feature has a disproportionate influence on the estimator and dominates the function.

Answer 39

How much information gain each attribute provides about the label. The closer to 1, the more information the attribute provides.
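A sketch of one common definition, entropy-based information gain, which for a balanced binary label ranges from 0 to 1 bit. Whether this is exactly the measure used in the paper is an assumption; the data is a toy example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Reduction in label entropy after splitting on the attribute."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

labels = [1, 1, 0, 0]  # toy churn labels
perfect = information_gain(["a", "a", "b", "b"], labels)  # separates classes
useless = information_gain(["a", "b", "a", "b"], labels)  # tells us nothing
print(perfect, useless)
```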

Answer 40

To choose which attributes are most important with regard to the label we wanted to predict. We defined a rule that would exclude the attributes that did not make a high enough contribution.

Answer 41

The correlation coefficient ranges from -1 to 1 and denotes how strongly a feature is correlated with the label. If the coefficient is 0, the feature is not linearly correlated with the label at all, meaning it has no linear predictive power.
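A sketch of the Pearson correlation coefficient on toy data, showing the three reference values 1, -1 and 0. The function name and the numbers are illustrative.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

feature = [1, 2, 3, 4]
rising = pearson(feature, [2, 4, 6, 8])     # perfectly correlated
falling = pearson(feature, [8, 6, 4, 2])    # perfectly anti-correlated
unrelated = pearson(feature, [1, -1, -1, 1])  # no linear relationship
print(rising, falling, unrelated)
```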

Answer 42

Very similar to ordinary cross validation, but stratified k-fold (SKF) ensures that each class makes up approximately the same proportion of every fold. Each fold is used once for testing and k-1 times for training.
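The idea can be sketched without any library: deal the indices of each class out to the folds round-robin, so every fold keeps roughly the overall class ratio. This is a simplified illustration (no shuffling), not scikit-learn's implementation, and the 70/30 labels are toy data.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign each index to one of k folds, dealing each class round-robin."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

labels = [0] * 70 + [1] * 30  # toy data: 70% non-churners, 30% churners
folds = stratified_folds(labels, k=5)

for test_fold in folds:
    # Each fold is the test set once; the other k-1 folds form the training set.
    train = [i for fold in folds if fold is not test_fold for i in fold]
    churners_in_test = sum(labels[i] for i in test_fold)
    print(len(train), len(test_fold), churners_in_test)
```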

Answer 43

Parametric and nonparametric

Answer 44

In order to utilise a broader toolset of classifiers on our dataset, it made sense to include methods from both categories.

Answer 45

The accuracy, TPR and TNR were identical, so we followed the advice from scikit-learn and used only Linear SVC.

Answer 46

To explore where our instances fit with regard to our target value.

Answer 47

With Euclidean distance.
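A minimal k-nearest-neighbours sketch using Euclidean distance, to show how the distance drives the prediction. The points, labels and choice of k=3 are toy values, not the project's configuration.

```python
from collections import Counter
from math import dist  # Euclidean distance between two points

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the majority label among the k training points nearest to query."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy, already-scaled feature vectors with hypothetical labels.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["stay", "stay", "stay", "churn", "churn", "churn"]

near_origin = knn_predict(points, labels, (0.5, 0.5))
far_corner = knn_predict(points, labels, (5.5, 5.5))
print(near_origin, far_corner)
```

Because the vote depends on raw distances, features with wide ranges dominate unless they are scaled first.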

Answer 48

Based on error rate. But how? With regard to recall, we chose 5? (67%)

Answer 49

Because Linear SVC achieved a TPR of only 14%, whereas KNN achieved 69%.

Answer 50

Determined which attributes are contributing factors when a member chooses to churn.

Answer 51

Our model can look at any given customer in AK and predict with 87% certainty whether the customer will churn or not.

Answer 52

Run the model against the customer database every day to identify groups of customers who will stay members, and also to identify a group that is likely to churn. With this, AK can engage in several efforts to retain customers.

Answer 53

No. It only gives a binary churn/no churn result.

Answer 54

- We defined the data set
- We acquired and prepared the data
- We used data mining techniques to extract value from data

Answer 55

Averaging over the folds is the best way to evaluate how generalizable our model was. Evaluation is always done on held-out data (test data). We used stratified cross validation.

Answer 56

Stratified cross validation is a way of splitting, or partitioning, the entire dataset into “k” bins of equal size, with the class proportions preserved in each bin, so that both the training data and the test data get a fair share of data points.

Answer 57

Large dataset. Time consuming to clean.

Answer 58

Python: a ten-line Python script to transform dates, understand something, and make the instances cleaner and therefore more intuitive and usable.

Answer 59

Basis for serious data science

Answer 60

Because the important results are on the test data because then you have trained the model and ....

(88 cards)