2. Main differences from descriptive modeling Flashcards
(22 cards)
What are success criteria?
The criteria for a successful or useful project outcome from the business point of view.
What are the main differences in CRISP-DM between descriptive modeling and predictive modeling?
- High number of iterations (tends to increase exponentially between data preparation and modeling)
- Evaluation of the business success criteria (a test must be designed in the evaluation phase to measure the business success criteria, which can be more challenging: it is harder to establish a goal for something we do not yet know than it is in descriptive modeling)
- Cyclical nature (the model is harder to “close”; lessons learned can raise new issues, and things change as time goes by)
What is performative prediction?
It happens when a prediction changes future outcomes.
Deployment changes the model from observer to actor.
How does performative prediction work?
- We train a model on a distribution
- Deployment causes a distribution shift
- This shift can change the model performance - for better or worse
How to deal with performative prediction
- By default, assume that the model is affected
- Monitor for distribution shift
- Think of humans as agents who will react to the predictions
- Analyze: are there feedback loops?
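One possible way to monitor for distribution shift is the Population Stability Index (PSI), which compares how a feature is distributed at training time and after deployment. The sketch below is an illustration, not part of the course material: the function name `psi`, the 10 bins, and the simulated data are all assumptions.

```python
import math
import random

def psi(expected, actual, bins=10):
    """Population Stability Index between two numeric samples (illustrative helper)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def shares(values):
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the training range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Simulated data (assumption): the deployed population drifts by one standard deviation.
rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(5000)]
same = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]

psi_same = psi(train, same)        # close to 0: no shift
psi_shifted = psi(train, shifted)  # much larger: the distribution moved
```

A commonly quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but the threshold should be chosen per project.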
What are the main issues to detect and take into consideration in the data preparation phase?
- Missing Values
- Outliers
- Categorical variables with high cardinality
- Categorical variables where one level represents most of the observations
- Strong correlations with the target variable
- Strong correlations between the different variables
What is a target or label?
The column in the dataset that contains the values we want to estimate/predict to meet the business objectives. It can be numeric or categorical.
What is data leakage?
Data leakage occurs when information from outside the training dataset influences the model.
Why is data leakage an issue?
It can lead to overly optimistic performance during training but poor performance in production.
What is a sign that we should suspect data leakage is occurring?
Unusually high accuracy
What are the 2 types of data leakage?
- Target leakage
- Train-test contamination
What is target leakage?
It occurs when predictive features include information about the target that would not be available at prediction time.
It can lead to unrealistically good model performance.
Example: predicting customer churn using post-churn data.
What is train-test contamination?
It happens when information from the test set influences the training process. This can occur through:
* Data preprocessing (e.g. normalization parameters computed with test data included, instead of on the training set alone)
* Feature selection
* Hyperparameter tuning
It results in overfitting and poor generalization.
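A minimal sketch of the preprocessing case, assuming standardization as the normalization step. The helper names `fit_scaler` and `transform` and the numbers are made up for illustration.

```python
import statistics

def fit_scaler(values):
    """Learn standardization parameters (mean, std) from the given values."""
    return statistics.fmean(values), statistics.pstdev(values)

def transform(values, mean, std):
    """Apply previously learned parameters to any split."""
    return [(v - mean) / std for v in values]

train = [10.0, 12.0, 14.0, 16.0, 18.0]
test = [30.0, 40.0]

# Correct: parameters come from the training split alone, then get reused on test.
mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)

# Contaminated: fitting on train + test lets test information shape the training data.
leaky_mean, leaky_std = fit_scaler(train + test)
```

The same principle applies to feature selection and hyperparameter tuning: anything that is "fitted" should see only the training data.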
What is target encoding?
It replaces each category with a statistic of the target (usually the mean). It can cause overfitting, especially with few observations per category.
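A small sketch of mean target encoding. The smoothing term is an addition not mentioned in the card: it is one common way to soften the overfitting risk for categories with few observations. The function name, the `smoothing` parameter, and the data are illustrative assumptions.

```python
from collections import defaultdict

def target_encode(categories, targets, smoothing=5.0):
    """Map each category to a smoothed mean of the target (illustrative helper)."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    # Categories with few rows are pulled toward the global mean.
    return {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }

cities = ["Lisbon", "Lisbon", "Lisbon", "Porto", "Porto", "Faro"]
churned = [1, 0, 1, 0, 0, 1]
encoding = target_encode(cities, churned, smoothing=2.0)
```

Here "Faro" has a single churned customer, so its raw mean would be 1.0; with smoothing it is pulled toward the global mean of 0.5. To avoid leakage, the encoding should be learned on the training set only.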
What is frequency encoding?
It replaces each category with the frequency with which it appears in the dataset. It can be less informative but avoids overfitting.
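A minimal sketch of frequency encoding using relative frequencies. The function name and the data are illustrative assumptions.

```python
from collections import Counter

def frequency_encode(categories):
    """Replace each category with its share of the rows."""
    counts = Counter(categories)
    total = len(categories)
    return [counts[c] / total for c in categories]

cities = ["Lisbon", "Lisbon", "Lisbon", "Porto", "Porto", "Faro"]
encoded = frequency_encode(cities)
```

Because the target is never used, this encoding cannot leak target information, but two categories with the same frequency become indistinguishable.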
What does high cardinality mean?
Too many values/levels (some will have very few observations)
How to fix high cardinality
- Group the less frequent categories into “Other”
- Use techniques such as target encoding or frequency encoding
- Reduce the granularity (e.g. use region instead of city)
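The grouping option can be sketched as follows. The `min_count` threshold, the `"Other"` label, and the data are illustrative assumptions.

```python
from collections import Counter

def group_rare(categories, min_count=2, other_label="Other"):
    """Keep frequent levels and collapse rare ones into a single label."""
    counts = Counter(categories)
    return [c if counts[c] >= min_count else other_label for c in categories]

cities = ["Lisbon", "Lisbon", "Porto", "Porto", "Faro", "Braga", "Coimbra"]
grouped = group_rare(cities)
# Five distinct levels shrink to three: Lisbon, Porto and Other.
```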
Why is high cardinality bad?
- It makes the algorithm's inferences about the relationship between the target and each level noisy (the model picks up false patterns)
- It may lead models to overfit
- It may complicate and slow down model training
Why is redundancy bad?
- Redundant variables add no new information
- They can confuse the model
- They increase training time with no benefit
How to detect redundancy
- Numerical data (Pearson correlation, if the relationship is linear; covariance)
- Numerical or ordinal data (Spearman correlation)
- Categorical data (chi-square test)
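A sketch of the two numeric cases only (the chi-square test is not covered here), written from the textbook definitions. It illustrates why Spearman suits monotonic but non-linear relationships. The data are made up.

```python
import math

def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]  # y = x**3: monotonic but not linear

r_pearson = pearson(x, y)    # high, but below 1
r_spearman = spearman(x, y)  # 1 up to floating-point error: the ordering is identical
```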
What is dimensionality reduction and why is it useful?
It consists of reducing the number of variables (features) in the dataset while keeping as much information as possible.
As the number of candidate variables for modeling increases, the number of observations must also increase (exponentially) to capture the high-dimensional patterns.
In high-dimensional spaces:
* the data become sparse
* the distance between points loses meaning
* the model is more prone to overfitting
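The claim that distances lose meaning can be illustrated with a small simulation: for uniformly random points, the nearest and farthest neighbours of a query point become almost equally distant as the number of dimensions grows. The function name, the sample size, and the choice of 2 versus 500 dimensions are assumptions made for this sketch.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min distance from one query point to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

low_dim = relative_contrast(2)     # large: near and far neighbours differ a lot
high_dim = relative_contrast(500)  # small: all points are roughly equally far away
```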
How to combat high dimensionality
- PCA: Principal Component Analysis (transforms the original variables into a new set of uncorrelated variables)
- t-SNE or UMAP - more advanced techniques for 2D and 3D visualizations
- Feature selection