Data Processing Flashcards

1
Q

Data adequacy

A

Historical data must reflect future behavior
Sample must be representative
Older data less relevant
Impactful events should be noted
Be aware of sampling bias

2
Q

Convert numeric to factor

A

Yes if:
-Variable has a small number of distinct values.
-Variable values are merely numeric labels (no sense of numeric order, e.g., group number).
-Variable has a complex relationship with the target → factor conversion gives models more flexibility to capture the relationship.

No if:
-Variable has a monotonic relationship with the target. Its effect can be captured by treating it as numeric.
-Values have a sense of numeric order that might help predict the target.
-Variable has a large number of distinct values, e.g., hour of day (conversion would cause high dimensionality and overfitting).
-Future observations will have new variable values (e.g., calendar year).
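As a minimal sketch of what factor conversion does to a numeric label, the snippet below turns each distinct value into its own indicator column. The variable `region_code`, the helper `one_hot`, and the data are hypothetical and used only for illustration.

```python
# Hypothetical data: region codes are numeric labels with no numeric order.
region_code = [3, 1, 2, 3, 1]

def one_hot(values):
    """Treat each distinct value as a factor level and build indicator columns.

    The first level (in sorted order) is the baseline and gets no column.
    """
    levels = sorted(set(values))
    return {
        f"level_{lvl}": [1 if v == lvl else 0 for v in values]
        for lvl in levels[1:]
    }

dummies = one_hot(region_code)
# One column per non-baseline level, so a model can fit a separate effect
# for each level instead of a single slope on the numeric code.
```

Each extra level adds a column, which is why a variable with many distinct values raises dimensionality when converted.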

3
Q

Sparse levels for categorical predictors

A

Combine levels where the target variable behaves similarly to form more representative and interpretable groups
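One possible way to apply this, sketched under assumptions: treat a level as sparse when it has fewer than `min_count` observations, and fold it into the non-sparse level whose mean target is closest. The threshold, the helper names, and the data are all hypothetical.

```python
from collections import defaultdict

# Hypothetical (level, target) observations; "C" and "D" are sparse.
data = [("A", 10.0), ("A", 12.0), ("A", 11.0),
        ("B", 30.0), ("B", 28.0), ("B", 32.0),
        ("C", 11.5), ("D", 29.0)]

def level_summary(rows):
    """Return {level: (count, mean target)}."""
    groups = defaultdict(list)
    for level, y in rows:
        groups[level].append(y)
    return {lvl: (len(ys), sum(ys) / len(ys)) for lvl, ys in groups.items()}

def combine_sparse(rows, min_count=2):
    """Map each sparse level to the non-sparse level with the closest mean target.

    Assumes at least one level meets min_count.
    """
    summary = level_summary(rows)
    dense = {lvl: mean for lvl, (n, mean) in summary.items() if n >= min_count}
    mapping = {}
    for lvl, (n, mean) in summary.items():
        if n >= min_count:
            mapping[lvl] = lvl
        else:
            mapping[lvl] = min(dense, key=lambda d: abs(dense[d] - mean))
    return mapping

mapping = combine_sparse(data)
```

Here "C" joins "A" and "D" joins "B" because their target means are close. In practice the grouping should also make sense in context, not just numerically.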

4
Q

AUC (area under the curve)

A

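Measures a classifier's predictive performance in classification problems: the area under the ROC curve, summarizing performance across all cutoffs (0.5 = purely random classifier, 1 = perfect classifier)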
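AUC can be read as the probability that a randomly chosen positive observation gets a higher predicted score than a randomly chosen negative one. The sketch below computes it that way by comparing every positive/negative pair. The function name `auc` and the data are hypothetical, and the approach is only practical for small samples.

```python
def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = positive) and predicted probabilities.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.35, 0.2, 0.8]
result = auc(labels, scores)  # 5 of the 6 pairs are ordered correctly
```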

5
Q

Model validation based on test data

A

Predicted vs. actual values of target: the two sets should be close (can check quantitatively or graphically)
Benchmark model: show that the recommended model outperforms a benchmark model (intercept-only GLM, purely random classifier)
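A minimal sketch of both checks on a numeric target, assuming RMSE as the quantitative measure and the training mean as the intercept-only benchmark's prediction. All values and names are hypothetical.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

# Hypothetical values for illustration only.
train_target = [10.0, 12.0, 14.0, 16.0]
test_actual = [11.0, 13.0, 15.0]
model_pred = [11.5, 12.5, 15.5]  # test predictions from the recommended model

# Intercept-only benchmark: predict the training mean for every test row.
train_mean = sum(train_target) / len(train_target)
benchmark_pred = [train_mean] * len(test_actual)

model_rmse = rmse(test_actual, model_pred)
benchmark_rmse = rmse(test_actual, benchmark_pred)
model_beats_benchmark = model_rmse < benchmark_rmse
```

A lower test RMSE than the benchmark supports the recommended model; a plot of predicted against actual values is the graphical counterpart.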

6
Q

Handling outliers (problems with skewness)

A

- Remove it
- Keep it (if outliers make up an insignificant proportion of the data)
- Modify it (e.g., change a negative value to zero)
- Use robust model forms: fit models by minimizing the absolute error (instead of the squared error) between predicted and observed values
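To see why absolute error is more robust, consider the simplest possible model, a single constant prediction. The mean minimizes squared error and the median minimizes absolute error. The sample below is hypothetical and contains one large outlier.

```python
import statistics

# Hypothetical sample with one large outlier.
y = [10.0, 11.0, 12.0, 13.0, 100.0]

def squared_error(values, c):
    """Total squared error when predicting the constant c."""
    return sum((v - c) ** 2 for v in values)

def absolute_error(values, c):
    """Total absolute error when predicting the constant c."""
    return sum(abs(v - c) for v in values)

mean_fit = statistics.mean(y)      # best constant under squared error
median_fit = statistics.median(y)  # best constant under absolute error
```

The outlier drags the squared-error fit far from the bulk of the data, while the absolute-error fit stays at 12.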

7
Q

Gaussian

A

Symmetric and allows negative values

8
Q

Continuous distribution

A

Normal → all real numbers

Gamma, inverse Gaussian → y > 0

9
Q

Poisson

A

Count, frequency
Non-negative integer values

10
Q

Tweedie

A

Mixed continuous / discrete
Non-negative real values (point mass at zero, continuous for y > 0)
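For a power parameter between 1 and 2, a Tweedie variable can be represented as a Poisson number of gamma-distributed amounts added together, which is where the mix of discrete and continuous behavior comes from. The sketch below simulates that construction. The parameter values, the seed, and the helper names are assumptions chosen for illustration.

```python
import math
import random

def sample_poisson(rng, lam):
    """Draw from Poisson(lam) using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def sample_compound(rng, lam, shape, scale):
    """Sum a Poisson(lam) number of Gamma(shape, scale) amounts."""
    n = sample_poisson(rng, lam)
    return sum((rng.gammavariate(shape, scale) for _ in range(n)), 0.0)

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_compound(rng, lam=1.0, shape=2.0, scale=50.0) for _ in range(1000)]

share_zero = sum(1 for d in draws if d == 0.0) / len(draws)
# A sizeable share of draws is exactly zero (no events); the rest are
# positive and continuous, e.g. aggregate losses on a policy.
```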
