2. Main differences from descriptive modeling Flashcards

Question 1

Q

What are success criteria?

Answer

A

They are the criteria for a successful or useful outcome of the project from the business point of view.

Question 2

Q

What are the main differences in CRISP-DM between descriptive modeling and predictive modeling?

Answer

A

High number of iterations (tends to increase exponentially between data preparation and modeling)
Evaluation of the business success criteria (a test must be designed to measure the business success criteria in the evaluation phase, which can be more challenging: it is harder to establish a goal for something we do not yet know than it is in descriptive modeling)
Cyclical nature (the model is harder to “close”; lessons learned can raise new issues, and things change as time goes by)

Question 3

Q

What is performative prediction?

Answer

A

It happens when a prediction changes future outcomes.

Deployment changes the model from observer to actor

Question 4

Q

How does performative prediction work?

Answer

A

We train a model on a given data distribution
Deployment causes a distribution shift
This shift can change the model performance - for better or worse

Question 5

Q

How to deal with performative prediction?

Answer

A

By default, assume that the model is affected
Monitor for distribution shift
Think of humans as agents who will react to the predictions
Analyze: are there feedback loops?
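
One way to monitor for distribution shift is to compare a feature's values at training time with its values after deployment. The sketch below uses the Population Stability Index (PSI), a technique swapped in here for illustration and not named in these cards. The function name `psi`, the bin count, and the sample data are all assumptions.

```python
import math

def psi(reference, live, n_bins=5):
    """Population Stability Index between two numeric samples."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp so live values outside the reference range land in the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_p, live_p = proportions(reference), proportions(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_p, live_p))

training_feature = [0.1 * i for i in range(100)]        # hypothetical training data
unchanged = [0.1 * i for i in range(100)]               # same distribution
shifted = [0.1 * i + 4.0 for i in range(100)]           # distribution after deployment

psi_unchanged = psi(training_feature, unchanged)
psi_shifted = psi(training_feature, shifted)
```

A PSI near zero suggests the two samples are distributed alike, while larger values suggest a shift. The alert threshold is application-specific.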

Question 6

Q

What are the main issues to detect and take into consideration in the data preparation phase?

Answer

A

Missing Values
Outliers
Categorical variables with high cardinality
Categorical variables where one level represents most of the observations
Strong correlations with the target variable
Strong correlations between the different variables (redundancy)
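
Three of the checks above (missing values, high cardinality, and a dominant level) can be sketched for a single categorical column as follows. The function name `profile_column` and the two thresholds are hypothetical choices, not values from the course.

```python
from collections import Counter

def profile_column(values, max_levels=20, dominance=0.9):
    """Flag common data-preparation issues in one categorical column."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    top_share = max(counts.values()) / len(present) if present else 0.0
    return {
        "missing_share": 1 - len(present) / len(values),
        "n_levels": len(counts),
        "high_cardinality": len(counts) > max_levels,
        "dominant_level": top_share >= dominance,
    }

# Hypothetical column: one level covers most rows, two values are missing.
city = ["Lisbon"] * 95 + ["Porto"] * 3 + [None] * 2
report = profile_column(city)
```

Here `report["dominant_level"]` is true because "Lisbon" accounts for 95 of the 98 non-missing rows.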

Question 7

Q

What is a target or label?

Answer

A

The column in the dataset that contains the values we want to estimate/predict in order to meet the business objectives. It can be numeric or categorical.

Question 8

Q

What is data leakage?

Answer

A

Data leakage occurs when information from outside the training dataset influences the model.

Question 9

Q

Why is data leakage an issue?

Answer

A

It can lead to overly optimistic performance during training but poor performance in production.

Question 10

Q

What is a sign that we should suspect data leakage is occurring?

Answer

A

Unusually high accuracy (results that look too good to be true)

Question 11

Q

What are the two types of data leakage?

Answer

A

Target leakage
Train-test contamination

Question 12

Q

What is target leakage?

Answer

A

It occurs when predictive features include information about the target that would not be available at prediction time.
It can lead to unrealistically good model performance.

Example: predicting customer churn using post-churn data

Question 13

Q

What is train-test contamination?

Answer

A

It happens when information from the test set influences the training process. This can occur through:
* Data preprocessing (e.g., normalizing with statistics computed on data that includes the test set)
* Feature selection
* Hyperparameter tuning
It results in overfitting and poor generalization.
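
The preprocessing case can be sketched as follows: the scaling statistics are learned from the training split only and then reused on the test split. The helper names `fit_scaler` and `apply_scaler` and the data are hypothetical.

```python
import statistics

def fit_scaler(train_values):
    """Learn the mean and standard deviation from the training split only."""
    return statistics.mean(train_values), statistics.pstdev(train_values)

def apply_scaler(values, mean, std):
    """Standardize any split with statistics learned on the training split."""
    return [(v - mean) / std for v in values]

train = [10.0, 12.0, 14.0, 16.0]   # hypothetical training split
test = [30.0, 32.0]                # hypothetical test split

mean, std = fit_scaler(train)      # the test values are never seen here
train_scaled = apply_scaler(train, mean, std)
test_scaled = apply_scaler(test, mean, std)

# Contaminated alternative, shown only for contrast: statistics from train + test.
leaky_mean = statistics.mean(train + test)
```

With the clean version the mean is 13.0. The contaminated mean is 19.0, which means the test set has already influenced how the training data are represented.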

Question 14

Q

What is target encoding?

Answer

A

It replaces each category with a statistic of the target (usually the mean). It can cause overfitting, especially when there are few observations per category.
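
A minimal sketch of mean target encoding, assuming the mapping is fitted on training rows only and that unseen levels fall back to the global target mean (both are choices made for this sketch). The function names and the data are hypothetical.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets):
    """Map each category to the mean target observed in the training rows."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}

def transform(categories, mapping, default):
    """Encode categories; levels unseen during fitting get the default value."""
    return [mapping.get(cat, default) for cat in categories]

train_city = ["A", "A", "A", "B", "B", "C"]   # hypothetical training data
train_churn = [1, 0, 1, 0, 0, 1]

mapping = fit_target_encoding(train_city, train_churn)
global_mean = sum(train_churn) / len(train_churn)
encoded = transform(["A", "C", "D"], mapping, default=global_mean)
```

Level "C" appears once in training and is encoded as 1.0, its single target value. This is the overfitting risk with few observations per category.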

Question 15

Q

What is frequency encoding?

Answer

A

It replaces each category with the frequency with which it appears in the dataset. It can be less informative, but it helps avoid overfitting.
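
Frequency encoding can be sketched in a few lines. The function name, the data, and the choice of 0.0 for unseen levels are assumptions of this sketch.

```python
from collections import Counter

def fit_frequency_encoding(categories):
    """Map each category to its relative frequency in the training rows."""
    counts = Counter(categories)
    total = len(categories)
    return {cat: n / total for cat, n in counts.items()}

train_city = ["A", "A", "A", "B", "B", "C"]   # hypothetical training data
freq = fit_frequency_encoding(train_city)

# Levels not seen during fitting get frequency 0.0 (an assumption of this sketch).
encoded = [freq.get(cat, 0.0) for cat in ["A", "C", "D"]]
```

Because the encoding never looks at the target, two levels with the same frequency become indistinguishable, which is one reason it can be less informative.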

Question 16

Q

What does high cardinality mean?

Answer

A

Too many values/levels (some will have a very low number of observations)

Question 17

Q

How to fix high cardinality?

Answer

A

Group the less frequent categories into an “Other” level
Use techniques such as target encoding or frequency encoding
Reduce the granularity (e.g., use region instead of city)
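
Grouping infrequent levels into a single level can be sketched as follows. The function name `group_rare_levels`, the `min_count` threshold, and the data are hypothetical.

```python
from collections import Counter

def group_rare_levels(categories, min_count=2, other_label="Other"):
    """Replace levels seen fewer than min_count times with a single label."""
    counts = Counter(categories)
    return [c if counts[c] >= min_count else other_label for c in categories]

cities = ["Lisbon", "Lisbon", "Porto", "Porto", "Faro", "Braga"]   # hypothetical
grouped = group_rare_levels(cities)
```

In practice the set of levels to keep would be decided on the training data and then reused unchanged at prediction time.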

Question 18

Q

Why is high cardinality bad?

Answer

A

It makes the algorithm's inferences about the relationship between the target and each level noisy (the model picks up spurious patterns)
It may lead models to overfit
It may make model training harder and slower

Question 19

Q

Why is redundancy bad?

Answer

A

Redundant variables add no new information
They can confuse the model
They increase training time with no benefit

Question 20

Q

How to detect redundancy?

Answer

A

Numerical data (Pearson correlation, if the relationship is linear; covariance)
Numerical or ordinal data (Spearman correlation)
Categorical data (chi-square test)
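
The three measures can be written out by hand to show what each one captures. This is a sketch with made-up data. The chi-square function returns only the test statistic, which still has to be compared against a chi-square distribution to obtain a p-value.

```python
import math

def pearson(x, y):
    """Pearson correlation: strength of a linear relationship."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))

def chi_square(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

x = [1, 2, 3, 4, 5]
linear = [2 * v + 1 for v in x]    # perfectly linear in x
cubic = [v ** 3 for v in x]        # monotonic but not linear

r_linear = pearson(x, linear)
r_cubic = pearson(x, cubic)
rho_cubic = spearman(x, cubic)
chi2_independent = chi_square([[10, 20], [20, 40]])   # proportional rows
chi2_dependent = chi_square([[30, 0], [0, 30]])
```

For the cubic data, Pearson falls below 1 because the relationship is not linear, while Spearman stays at 1 because the relationship is monotonic.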

Question 21

Q

What is dimensionality reduction and why is it useful?

Answer

A

It consists of reducing the number of variables (features) in the dataset while keeping as much information as possible.
As the number of candidate variables for modeling increases, the number of observations must also increase (exponentially) to capture the high-dimensional patterns.
In high-dimensional spaces:
* the data become sparse
* the distance between points loses meaning
* the model is more prone to overfitting
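
The point about distances can be illustrated with a small simulation, assuming points drawn uniformly at random from a unit hypercube. The function name `distance_contrast` and all parameters are hypothetical.

```python
import math
import random

def distance_contrast(n_points, n_dims, seed=0):
    """Relative gap between the farthest and nearest point from a query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)

low_dim = distance_contrast(n_points=200, n_dims=2)
high_dim = distance_contrast(n_points=200, n_dims=500)
```

In two dimensions the farthest point is many times farther away than the nearest one. In 500 dimensions the two distances are nearly the same, so "nearest" carries much less information.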

Question 22

Q

How to combat high dimensionality?

Answer

A

PCA: Principal Component Analysis (transforms the original variables into a new set of uncorrelated variables)
t-SNE or UMAP: more advanced techniques for 2D and 3D visualization
Feature selection
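
To make the PCA idea concrete, the sketch below finds only the first principal component, using power iteration on the covariance matrix. Power iteration is a simplification chosen for this sketch and is not how libraries typically compute PCA. It also assumes the starting vector is not orthogonal to the leading component. The function name and the data are hypothetical.

```python
import math

def first_principal_component(rows, n_iter=100):
    """Direction of greatest variance, found by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    vec = [1.0] * d
    for _ in range(n_iter):
        # Multiply by the covariance matrix, then renormalize to unit length.
        vec = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    # Project each centered row onto the component: one new variable per row.
    scores = [sum(c[j] * vec[j] for j in range(d)) for c in centered]
    return vec, scores

# Hypothetical data: the second feature is exactly twice the first (fully redundant).
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, scores = first_principal_component(rows)
```

Because the two features are perfectly redundant, the single column `scores` retains all of the variance of the original two columns.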

(22 cards)