2. Main differences from descriptive modeling Flashcards

(22 cards)

1
Q

What are success criteria?

A

They are the criteria for a successful or useful project outcome from the business point of view.

2
Q

What are the main differences in CRISP-DM between descriptive modeling and predictive modeling?

A
  • High number of iterations (tends to increase exponentially between data preparation and modeling)
  • Evaluation of the business success criteria (a test must be designed to measure the business success criteria in the evaluation phase, which can be more challenging: it is harder to establish a goal for something we do not yet know than it is in descriptive modeling)
  • Cyclical nature (the model is harder to "close"; lessons learned can raise new issues, and things change as time goes by)
3
Q

What is performative prediction?

A

It happens when a prediction changes future outcomes.

Deployment turns the model from an observer into an actor.

4
Q

How does performative prediction work?

A
  1. We train a model on a data distribution
  2. Deployment causes a distribution shift
  3. This shift can change the model performance - for better or worse
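
The three steps can be illustrated with a toy sketch. Everything here is hypothetical (the loan scenario, the numbers, and the helper names `fit_threshold` and `accuracy`); it only assumes that applicants react to the deployed rule by inflating the feature the model uses.

```python
def accuracy(cutoff, incomes, repaid):
    """Share of applicants for whom 'income >= cutoff' matches the outcome."""
    hits = sum((x >= cutoff) == y for x, y in zip(incomes, repaid))
    return hits / len(incomes)


def fit_threshold(incomes, repaid):
    """Pick the income cutoff with the highest accuracy on the training data."""
    best_cutoff, best_acc = None, -1.0
    for candidate in sorted(set(incomes)):
        acc = accuracy(candidate, incomes, repaid)
        if acc > best_acc:
            best_cutoff, best_acc = candidate, acc
    return best_cutoff


# Step 1: train on the historical distribution.
train_income = [20, 25, 30, 35, 40, 45, 50, 55]
train_repaid = [False, False, False, False, True, True, True, True]
cutoff = fit_threshold(train_income, train_repaid)

# Step 2: deployment shifts the distribution. Assumed reaction: every
# applicant reports an income 10 units higher, while the true outcomes
# stay the same.
deployed_income = [x + 10 for x in train_income]

# Step 3: the unchanged model is evaluated on the shifted data.
acc_before = accuracy(cutoff, train_income, train_repaid)
acc_after = accuracy(cutoff, deployed_income, train_repaid)
print(cutoff, acc_before, acc_after)
```

In this toy case the shift hurts the model; a different reaction from the applicants could just as well leave performance unchanged or improve it.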
5
Q

How to handle performative prediction?

A
  • By default, assume that the model is affected
  • Monitor for distribution shift
  • Think of humans as agents who will react to the predictions
  • Analyze: are there feedback loops?
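
One possible way to monitor for distribution shift is to compare a feature's training-time histogram with its live histogram. The sketch below uses the Population Stability Index (PSI); the choice of PSI, the bin count, the `eps` floor, and the function name `psi` are assumptions for illustration, not something the cards prescribe.

```python
import math


def psi(reference, live, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference and a live sample.

    Bins are equal-width over the reference range; live values outside
    that range are counted in the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clip out-of-range values
            counts[idx] += 1
        # eps avoids log(0) and division by zero for empty bins
        return [max(c / len(values), eps) for c in counts]

    ref, cur = shares(reference), shares(live)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


reference = list(range(100))              # feature values seen at training time
live_same = list(range(100))              # live data with the same distribution
live_shifted = [x + 30 for x in range(100)]  # live data after a shift

print(psi(reference, live_same))     # no shift
print(psi(reference, live_shifted))  # clear shift
```

The alert threshold is a judgment call; a commonly quoted rule of thumb treats PSI values above roughly 0.2 as worth investigating.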
6
Q

What are the main issues to detect and take into consideration in the data preparation phase?

A
  1. Missing Values
  2. Outliers
  3. Categorical variables with high cardinality
  4. Categorical variables where one level represents most of the observations
  5. Strong correlations with the target variable
  6. Strong correlations between the different variables (features)
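
A minimal sketch of how some of these checks (missing values, cardinality, and a dominant level) might be profiled per column. The dataset `rows` and the helper `profile` are made up for illustration, and `None` is assumed to mark a missing value.

```python
from collections import Counter

# Hypothetical dataset: one dict per observation, None marks a missing value.
rows = [
    {"city": "Lisbon", "plan": "basic", "age": 34},
    {"city": "Porto", "plan": "basic", "age": None},
    {"city": "Faro", "plan": "basic", "age": 51},
    {"city": "Braga", "plan": "premium", "age": 29},
    {"city": "Coimbra", "plan": "basic", "age": None},
]


def profile(rows, column):
    """Summarise missing values, cardinality and the dominant level of a column."""
    values = [r[column] for r in rows]
    present = [v for v in values if v is not None]
    counts = Counter(present)
    top_level, top_count = counts.most_common(1)[0]
    return {
        "missing_share": 1 - len(present) / len(values),
        "cardinality": len(counts),               # number of distinct levels
        "top_level": top_level,
        "top_share": top_count / len(present),    # share of the dominant level
    }


for column in ("city", "plan", "age"):
    print(column, profile(rows, column))
```

Here `city` has one level per row (high cardinality relative to the data), `plan` is dominated by a single level, and `age` has missing values.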
7
Q

What is a target or label?

A

The column in the dataset that contains the values we want to estimate/predict, in line with the business objectives. It can be numeric or categorical.

8
Q

What is data leakage?

A

Data leakage occurs when information from outside the training dataset influences the model.

9
Q

Why is data leakage an issue?

A

It can lead to overly optimistic performance during training but poor performance in production.

10
Q

What is a sign that should make us suspect data leakage is occurring?

A

Unusually high accuracy (results that look too good to be true)

11
Q

What are the 2 types of data leakage?

A
  • Target leakage
  • Train-test contamination
12
Q

What is target leakage?

A

It occurs when predictive features include information about the target that would not be available at prediction time.
It can lead to unrealistically good model performance.

Example: predicting customer churn using post-churn data.

13
Q

What is train-test contamination?

A

It happens when information from the test set influences the training process. This can occur through:
* Data preprocessing (e.g., normalizing with statistics computed on the full dataset, test set included, instead of on the training set only)
* Feature selection
* Hyperparameter tuning
It results in overfitting and poor generalization.
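
The preprocessing case can be sketched as follows: scaling statistics are computed on the training split only and then reused on the test split. The data and the helper `standardize` are hypothetical.

```python
from statistics import mean, pstdev

values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0]
train, test = values[:6], values[6:]


def standardize(data, mu, sigma):
    """Scale data with the given mean and standard deviation."""
    return [(x - mu) / sigma for x in data]


# Contaminated: the statistics are computed on train + test, so the test
# set (including the extreme value 100.0) influences the training features.
mu_all, sigma_all = mean(values), pstdev(values)
train_contaminated = standardize(train, mu_all, sigma_all)

# Clean: the statistics come from the training split only and are then
# applied unchanged to the test split.
mu_train, sigma_train = mean(train), pstdev(train)
train_scaled = standardize(train, mu_train, sigma_train)
test_scaled = standardize(test, mu_train, sigma_train)

print(mu_all, mu_train)
```

The same principle applies to feature selection and hyperparameter tuning: anything that is fitted or chosen should see the training data only.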

14
Q

What is target encoding?

A

It replaces each category with a statistic of the target (usually the mean). It can cause overfitting, especially when there are few observations per category.
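
A minimal sketch of mean target encoding. The `smoothing` parameter, which pulls small categories toward the global mean, is an assumption added here as one common way to soften the overfitting risk; the function name and data are hypothetical.

```python
from collections import defaultdict


def target_encode(categories, targets, smoothing=0.0):
    """Map each category to the (optionally smoothed) mean of the target.

    With smoothing=0 this is the plain per-category mean. Larger values
    shrink categories with few observations toward the global mean.
    Fit on training data only to avoid leaking the test target.
    """
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }


categories = ["A", "A", "A", "A", "B"]
targets = [1, 0, 1, 0, 1]

plain = target_encode(categories, targets)
smoothed = target_encode(categories, targets, smoothing=2.0)
print(plain)
print(smoothed)
```

Category `B` has a single observation, so its plain encoding simply copies that one target value; the smoothed version moves it toward the global mean.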

15
Q

What is frequency encoding?

A

It replaces each category with the frequency with which it appears in the dataset. It can be less informative, but it avoids overfitting.
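
A minimal sketch of frequency encoding with the standard library; the example column `cities` is made up.

```python
from collections import Counter

cities = ["Lisbon", "Porto", "Lisbon", "Faro", "Lisbon", "Porto"]

# Relative frequency of each category in the (training) data.
counts = Counter(cities)
frequency = {city: n / len(cities) for city, n in counts.items()}

# Each category is replaced by its frequency; the target is never used.
encoded = [frequency[city] for city in cities]
print(encoded)
```

Because the target plays no role, two categories with the same frequency receive the same code, which is why this encoding can be less informative.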

16
Q

What does high cardinality mean?

A

Too many values/levels (some will have a very low number of observations)

17
Q

How to fix high cardinality?

A
  • Group the less frequent categories into "Other"
  • Use techniques such as target encoding or frequency encoding
  • Reduce the granularity (e.g., use region instead of city)
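
Grouping rare levels into a single "Other" level might look like the sketch below. The `min_count` cutoff and the helper name `group_rare` are assumptions chosen for illustration.

```python
from collections import Counter


def group_rare(values, min_count=2, other_label="Other"):
    """Replace levels seen fewer than min_count times with a single label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


cities = ["Lisbon"] * 5 + ["Porto"] * 4 + ["Faro", "Braga", "Évora"]
grouped = group_rare(cities, min_count=2)

print(len(set(cities)), "levels before")
print(len(set(grouped)), "levels after")
```

The set of frequent levels should be learned on the training data and reused at prediction time, so that unseen or rare levels also fall into "Other".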
18
Q

Why is high cardinality bad?

A
  • Makes the algorithm's inferences about the relationship between the target and each level noisy (the model captures false patterns)
  • May lead models to overfit
  • May complicate and slow down model training
19
Q

Why is redundancy bad?

A
  • Redundant variables add no new information
  • They can confuse the model
  • They increase training time with no benefit
20
Q

How to detect redundancy?

A
  • Numerical data (Pearson correlation, if the relationship is linear; covariance)
  • Numerical or ordinal data (Spearman correlation)
  • Categorical data (chi-square test)
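
The difference between the Pearson and Spearman checks can be seen in a small sketch. The helpers `pearson`, `ranks`, and `spearman` are written from scratch for illustration and assume there are no tied values; the chi-square test for categorical data is not covered here.

```python
import math


def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank positions starting at 1 (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5]
doubled = [2, 4, 6, 8, 10]   # linear in x
squared = [1, 4, 9, 16, 25]  # monotonic but not linear in x

print(pearson(x, doubled), pearson(x, squared), spearman(x, squared))
```

A feature that is a monotonic but non-linear transformation of another gets a Spearman correlation of 1 while its Pearson correlation stays below 1, so checking both can reveal redundancy that Pearson alone understates.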
21
Q

What is dimensionality reduction and why is it useful?

A

It consists of reducing the number of variables (features) in the dataset while retaining as much information as possible.
As the number of candidate variables for modeling increases, the number of observations must also increase (exponentially) to capture the high-dimensional patterns.
In high-dimensional spaces:
* the data become sparse
* the distance between points loses meaning
* the model is more prone to overfitting
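
The claim that distances lose meaning can be checked with a small experiment. The sketch below assumes points drawn uniformly at random from the unit hypercube and measures how much farther the farthest point is than the nearest one; `relative_contrast` is a hypothetical helper name.

```python
import math
import random


def relative_contrast(n_dims, n_points=200, seed=0):
    """(max distance - min distance) / min distance from one query point.

    Values near 0 mean the nearest and farthest points are almost
    equally far away, so distance carries little information.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


for n_dims in (2, 10, 100, 500):
    print(n_dims, round(relative_contrast(n_dims), 3))
```

As the number of dimensions grows, the contrast shrinks toward zero, which is one reason distance-based methods degrade in high-dimensional spaces.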

22
Q

How to perform dimensionality reduction?

A
  • PCA: Principal Component Analysis (transforms the original variables into a new set of uncorrelated variables)
  • t-SNE or UMAP - more advanced techniques for 2D and 3D visualizations
  • Feature selection