CA Mock exams Flashcards

1
Q

How binarization might impact the feature selection in stepwise selection

A

Binarization helps simplify the model to only the predictors deemed necessary, as individual factor levels can be left out if they do not contribute significantly to the model (a high p-value indicates an insignificant level).

However, including only some of the dummy variables from a factor will result in the merging of certain levels. Since the resulting merger is purely performance-based, it can complicate the model's interpretability when unintuitive levels are combined.
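
A minimal sketch of the mechanics using Python/pandas (illustrative only; the "heating" column and its level names are made up, not from the source):

```python
import pandas as pd

# Hypothetical factor with three levels; names are illustrative only.
df = pd.DataFrame({"heating": ["gas", "electric", "solar", "gas", "electric"]})

# Binarization: one dummy column per non-baseline level ("electric" is
# dropped as the baseline via drop_first).
dummies = pd.get_dummies(df["heating"], prefix="heating", drop_first=True)
print(dummies.columns.tolist())  # ['heating_gas', 'heating_solar']

# If stepwise selection drops 'heating_solar' (high p-value), observations
# with heating == 'solar' fall back to the baseline coefficient, i.e. the
# 'solar' and 'electric' levels are effectively merged.
```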

2
Q

what a cutoff is and how it is involved in calculating the AUC

A

After a model produces predicted probabilities, we choose a cutoff to convert them into positive and negative predictions: all predicted probabilities above the cutoff are predicted as positive. If the cutoff is too high, most observations will be predicted negative, producing a high specificity but a low sensitivity; the cutoff dictates the values of sensitivity and specificity.

By plotting every possible pair of sensitivity and specificity values obtained by changing the cutoff, the result is an ROC curve. The points are connected from the bottom left of the plot, where sensitivity is 0 and specificity is 1, to the top right, where sensitivity is 1 and specificity is 0. The area under the ROC curve is the AUC.
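
A small illustration with scikit-learn on made-up labels and probabilities, showing how each cutoff yields one (sensitivity, specificity) pair and how the AUC summarizes all of them:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and predicted probabilities (illustrative values only).
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.55, 0.6])

# One point on the ROC curve per candidate cutoff: predictions above the
# cutoff are classified as positive.
for cutoff in [0.25, 0.5, 0.75]:
    y_pred = (y_prob > cutoff).astype(int)
    sens = (y_pred[y_true == 1] == 1).mean()  # sensitivity
    spec = (y_pred[y_true == 0] == 0).mean()  # specificity
    print(f"cutoff={cutoff}: sensitivity={sens:.2f}, specificity={spec:.2f}")

# roc_curve sweeps every possible cutoff; the AUC summarizes all of them.
fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # fpr = 1 - specificity
print("AUC:", roc_auc_score(y_true, y_prob))
```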

3
Q

GLM vs decision tree for numeric predictor?

A

For decision trees, a good numeric predictor has distinct intervals that lead to clear differences in the target; the tree identifies the split point where there is a big difference.

For a GLM, a good numeric predictor has a consistent slope against the target; if there is not a strong slope, a GLM should not be used.
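
An illustrative sketch on synthetic data: the target jumps at one point but has no slope elsewhere, so a one-split tree ("stump") fits well while a linear fit does not:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 300).reshape(-1, 1)

# Step-shaped target: a clear jump at x = 5 but no slope on either side --
# ideal for a tree split, poor for a single GLM slope.
y = np.where(x.ravel() < 5, 10, 20) + rng.normal(0, 1, 300)

tree = DecisionTreeRegressor(max_depth=1).fit(x, y)  # one split ("stump")
lin = LinearRegression().fit(x, y)

print("tree split point:", tree.tree_.threshold[0])  # approx. 5
print("tree R^2:", tree.score(x, y))                 # close to 1
print("linear R^2:", lin.score(x, y))                # noticeably lower
```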

4
Q

Difference in resulting features generated by principal components vs clustering

A

PCA results in numeric features called principal components (PCs). Each PC is a linear combination of the analyzed variables, which means a PC summarizes the variables by specifying how much each variable contributes to its calculation.

Clustering identifies clusters or groupings based on the analyzed variables, meaning it results in a factor. Similar observations are grouped into the same cluster, while dissimilar observations are grouped into different clusters.
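
A short scikit-learn sketch on synthetic data, contrasting the numeric PC scores with the factor-valued cluster labels:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))  # synthetic data: 4 numeric variables

# PCA: each PC is a numeric feature, a linear combination of the original
# variables; the loadings say how much each variable contributes.
pca = PCA(n_components=2).fit(X)
pc_scores = pca.transform(X)  # numeric features (100 x 2)
print("PC1 loadings:", pca.components_[0])

# Clustering: the result is a cluster label per observation -- a factor.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster labels (factor levels):", np.unique(labels))
```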

5
Q

Stepwise selection and regularization

A

Similarities:
Both perform feature selection, dropping predictors that do not contribute.
Both help avoid overfitting, especially when the number of observations is small compared to the number of predictors.
Both reduce model complexity.

Difference:
They measure flexibility differently: stepwise selection measures it by the number of predictors, while regularization measures it by the shrinkage parameter. Stepwise selection is typically guided by AIC or BIC, while regularization uses a model accuracy metric calculated from cross-validation.
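
A brief sketch of the regularization side using scikit-learn's LassoCV on synthetic data: the shrinkage parameter is chosen by cross-validation, and non-contributing coefficients are shrunk toward zero:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two predictors actually matter in this synthetic data.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

# The shrinkage parameter (alpha) is chosen by cross-validation, not AIC.
lasso = LassoCV(cv=5).fit(X, y)
print("chosen alpha:", lasso.alpha_)
print("coefficients:", lasso.coef_.round(2))  # most shrunk to ~0
```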

6
Q

best subset vs stepwise

A

Best subset selection is performed by fitting all p models that contain exactly one predictor and picking the model with the smallest deviance, then fitting all p choose 2 models that contain exactly two predictors and picking the model with the lowest deviance, and so forth. Then a single best model is selected from the models picked, using a metric such as AIC. The search space (2^p models in total) becomes quite large as p increases.

Stepwise selection is an alternative to best subset selection that is computationally more efficient,
since it considers a much smaller set of models. For example, forward stepwise selection begins with a
model containing no predictors, and then adds predictors to the model, one-at-a-time, until adding a
predictor leads to a worse model by a measure such as AIC. At each step the predictor that gives the
greatest additional improvement to the fit is added to the model. The best model is the one fit just
before adding a variable leads to a decrease in performance.
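
A minimal forward stepwise loop, sketched with Python/statsmodels and AIC on synthetic data (the helper fit_aic and the column names are illustrative, not from the source):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
y = 2 * X["a"] - X["c"] + rng.normal(size=200)

def fit_aic(cols):
    """AIC of an OLS model using the given predictor columns."""
    return sm.OLS(y, sm.add_constant(X[cols])).fit().aic

selected, remaining = [], list(X.columns)
best_aic = fit_aic(selected)  # intercept-only model
while remaining:
    # Try adding each remaining predictor; keep the best improvement.
    aics = {c: fit_aic(selected + [c]) for c in remaining}
    best_col = min(aics, key=aics.get)
    if aics[best_col] >= best_aic:
        break  # adding any predictor makes AIC worse, so stop
    best_aic = aics[best_col]
    selected.append(best_col)
    remaining.remove(best_col)

print("selected predictors:", selected)  # likely ['a', 'c']
```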

7
Q

Impurity measure in classification trees

A

Impurity measures are used to:
-decide which split in the decision tree (if any) should be made next
-decide which branches of the tree to prune back after building the tree, by removing branches that do not achieve a defined threshold of impurity reduction through cost-complexity pruning
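
For reference, small helper functions for the two most common impurity measures (a sketch; the formulas are Gini = sum of p_k(1 - p_k) and entropy = -sum of p_k log2 p_k over the class proportions p_k):

```python
import numpy as np

def gini(p):
    """Gini index for a vector of class proportions p (sums to 1)."""
    p = np.asarray(p)
    return float(np.sum(p * (1 - p)))

def entropy(p):
    """Entropy for class proportions p; zero proportions contribute 0."""
    p = np.asarray(p)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# A pure node has zero impurity; a 50/50 node is maximally impure.
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))  # 0.0 0.0
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5 1.0
```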

8
Q

single vs complete linkage

A

The clustering algorithm starts out with n clusters and fuses them together in an iterative process based on which observations are most similar. The complete linkage method considers the maximum intercluster dissimilarity, and the single linkage method uses the minimum intercluster dissimilarity. As such, the single method tends to fuse observations one at a time, resulting in less balanced clusters.
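
A quick comparison with SciPy's hierarchical clustering on synthetic two-group data; single linkage tends to peel off observations one at a time, giving less balanced cluster sizes:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# Same data, two linkage rules: 'complete' uses the maximum intercluster
# dissimilarity, 'single' uses the minimum.
for method in ["single", "complete"]:
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=4, criterion="maxclust")
    sizes = np.bincount(labels)[1:]
    print(method, "cluster sizes:", sizes)  # single tends to be unbalanced
```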

9
Q

calculation to determine a split in classification vs regression tree

A

Regression trees determine splits by measuring the residual sum of squares (RSS) between the target and predicted values; the split that reduces RSS the most is chosen.
Classification trees measure impurity using entropy, the Gini index, or classification error. These measures all attempt to increase the homogeneity of the target variable at each split.
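
A sketch of the regression-tree split search on synthetic data: every candidate split point is scored by the total RSS of the two resulting groups, and the minimizer is chosen (best_split is an illustrative helper, not a library function):

```python
import numpy as np

def best_split(x, y):
    """Exhaustively find the split point on x minimizing total RSS."""
    best = (None, np.inf)
    for s in np.unique(x)[1:]:  # candidate split points
        left, right = y[x < s], y[x >= s]
        rss = ((left - left.mean()) ** 2).sum() + \
              ((right - right.mean()) ** 2).sum()
        if rss < best[1]:
            best = (s, rss)
    return best

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.where(x < 3, 1.0, 5.0) + rng.normal(0, 0.5, 200)
print(best_split(x, y))  # split point near 3
```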

10
Q

accuracy vs auc

A

Accuracy is the ratio of the number of correct predictions to the total number of predictions made, with the classifications based on a fixed cutoff point.
AUC measures performance across the full range of cutoffs, while accuracy measures performance only at the selected cutoff.
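
A tiny illustration with scikit-learn on made-up values: accuracy moves as the cutoff moves, while the AUC is a single cutoff-free number:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.45, 0.4, 0.7, 0.3, 0.9])

# Accuracy depends on the chosen cutoff; AUC does not.
for cutoff in [0.25, 0.5]:
    acc = accuracy_score(y_true, (y_prob > cutoff).astype(int))
    print(f"cutoff={cutoff}: accuracy={acc:.2f}")
print("AUC:", roc_auc_score(y_true, y_prob))
```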

11
Q

Explain how changing the link function in the GLM impacts the model fitting and how
this can impact predictor significance

A

The link function specifies a functional relationship between the linear predictor and the mean of the
distribution of the outcome conditional on the predictor variables. Different link functions have different
shapes and can therefore fit to different nonlinear relationships between the predictors and the target
variable.
When the link function matches the relationship of a predictor variable, the mean of the outcome
distribution (the prediction) will generally be closer to the actual values for the target variable, resulting
in smaller residuals and more significant p-values.
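
An illustrative sketch with statsmodels on synthetic data whose mean is exponential in the predictor: the log link matches the relationship, so it should yield a smaller deviance and a more significant slope than the identity link (all names and parameter values here are made up):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, 300)
X = sm.add_constant(x)
# The true mean is exponential in x, so a log link matches the relationship.
y = rng.poisson(np.exp(0.5 + 1.2 * x)).astype(float)

# Same outcome distribution, two different link functions.
log_fit = sm.GLM(y, X, family=sm.families.Gaussian(sm.families.links.Log())).fit()
id_fit = sm.GLM(y, X, family=sm.families.Gaussian()).fit()  # identity link

# The matching link should give smaller residuals (deviance) and a more
# significant slope coefficient.
print("log link:      deviance=%.0f  slope p=%.1e" % (log_fit.deviance, log_fit.pvalues[1]))
print("identity link: deviance=%.0f  slope p=%.1e" % (id_fit.deviance, id_fit.pvalues[1]))
```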

12
Q

Proxy variable

A

Proxy variables are variables that are used in place of other information, usually because the desired information is either impossible or impractical to measure. For a variable to be a good proxy, it must have a close relationship with the variable of interest.

13
Q

why there are potential legal or ethical concerns, including whether proxy variables should be used

A

Data such as race, age, and income are generally considered sensitive information. Some jurisdictions
have legal constraints on the use of sensitive information. Before proceeding there should be
consideration of any applicable law. There are no clear rules for what ethical use of data is. Good
professional judgement must be used to ensure that inappropriate discrimination is not occurring within
the model or the project. Public perception should also be considered. The politician or the city could
suffer bad press if there is a belief that the project inappropriately discriminates.

14
Q

Stepwise selection and regularization pt 2

A

Stepwise selection takes iterative steps, starting either from a model with no predictors or from a model with all predictors. Predictors are added or dropped one at a time until there is no further improvement as measured by AIC.

Shrinkage methods fit coefficients for all predictors by optimizing a loss function that includes a penalty parameter that penalizes large coefficients. Shrinkage methods can reduce the size of coefficients without eliminating variables.
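
For reference, the penalized loss that shrinkage methods optimize, written out for linear regression (ridge shown; the lasso replaces the squared penalty with absolute values and can set coefficients exactly to zero):

```latex
% Shrinkage loss for linear regression: RSS plus a penalty on coefficient size.
% lambda is the shrinkage parameter; larger lambda means less flexibility.
\min_{\beta}\ \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^{2}
  + \lambda \sum_{j=1}^{p} \beta_j^{2}
  \qquad \text{(ridge; the lasso penalty is } \lambda \textstyle\sum_{j=1}^{p}\lvert\beta_j\rvert \text{)}
```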

15
Q

prescriptive analytics

A

The study that emphasizes the outcomes or consequences of decisions made in modeling or implementation.

16
Q

issue in a variable when one level takes up most of the observations

A

We see that heating has several factor levels that have very few observations. As a predictor, it would search for house price distinctions between its levels, but any pattern that might be found would be based on so few observations that it cannot be deemed reliable. We should consider improving heating from what it is now, e.g., by combining its sparse levels.
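
One possible fix, sketched with pandas on a made-up "heating" column: count the observations per level and combine the sparse levels into an "other" level:

```python
import pandas as pd

# Hypothetical 'heating' column; level names and counts are illustrative.
s = pd.Series(["gas"] * 90 + ["electric"] * 6 + ["solar"] * 3 + ["wood"] * 1)
print(s.value_counts())

# Combine the sparse levels into 'other' so that each remaining level has
# enough observations to support a reliable pattern.
counts = s.value_counts()
rare = counts[counts < 10].index
s_improved = s.where(~s.isin(rare), "other")
print(s_improved.value_counts())
```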

17
Q

Modeling impact of converting a numeric variable to a factor for a GLM

A

Numeric variables are assigned a single coefficient, so they assume a monotonic relationship (unless polynomial terms are added): consecutive unit changes have the same marginal impact on the linear predictor.
If changed to a factor, the variable will be binarized, and each level other than the baseline will be represented by a dummy variable with a separate coefficient. There is no defined order among the coefficient values, which allows a nonlinear relationship and captures the training data more effectively.

This increase in flexibility makes the model more prone to overfitting.
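
A small pandas sketch of the conversion (the "age" values and bin boundaries are made up): binning a numeric variable and binarizing the result gives each non-baseline level its own dummy:

```python
import pandas as pd

age = pd.Series([23, 37, 41, 58, 62, 29, 45, 71])

# As a numeric predictor, a GLM fits one coefficient: each extra year has
# the same marginal impact on the linear predictor.
# Converted to a factor (here by binning), each non-baseline level gets
# its own dummy and coefficient, allowing a nonlinear pattern.
age_factor = pd.cut(age, bins=[0, 30, 50, 100], labels=["young", "mid", "old"])
dummies = pd.get_dummies(age_factor, prefix="age", drop_first=True)
print(dummies.head())
```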

18
Q

overfitting decision trees vs random forest

A

Decision trees overfit because of the sequential fashion in which the splits are made: the effect of a poorly chosen split early on affects the rest of the fitted tree. This makes single trees sensitive to small changes in the training data.

Random forests construct multiple trees in parallel and average the results, which reduces variance and helps prevent overfitting. Also, by taking a random sample of predictors at each split, a random forest decorrelates the base trees, further reducing variance.
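
An illustrative comparison with scikit-learn on synthetic data; max_features controls the random sample of predictors considered at each split, which decorrelates the base trees:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=10, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
# Averaging many decorrelated trees reduces variance relative to one tree.
rf = RandomForestRegressor(n_estimators=200, max_features=3,
                           random_state=0).fit(X_tr, y_tr)

print("single tree test R^2:", round(tree.score(X_te, y_te), 3))
print("random forest test R^2:", round(rf.score(X_te, y_te), 3))
```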

19
Q

Weights

A

To use weights properly, the observations of the target should be averaged by exposure. The variance of each observation is inversely related to the size of its exposure, which serves as the weight for that observation. The weights do not affect the assumed mean of the target.
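
A sketch with statsmodels on synthetic claim counts (names and parameter values are made up): the target is averaged by exposure, and exposure enters through the var_weights argument:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
exposure = rng.integers(1, 50, 300).astype(float)  # e.g., policy-years per row
x = rng.uniform(0, 1, 300)
X = sm.add_constant(x)

# Target averaged by exposure: total claims divided by exposure.
total_claims = rng.poisson(exposure * np.exp(0.2 + 0.8 * x))
avg_claims = total_claims / exposure

# Larger exposure -> smaller variance of the averaged target, so exposure
# serves as the weight. Weights change the assumed variance, not the mean.
fit = sm.GLM(avg_claims, X, family=sm.families.Poisson(),
             var_weights=exposure).fit()
print(fit.params)  # should be near the true values (0.2, 0.8)
```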

20
Q

Offset

A

An offset is used when the impact of a variable on the target is known in advance: the mean of the target is a multiple of the value of that variable. If modeled with a log link, this implies the offset should be the natural log of the variable.
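
A companion sketch with statsmodels on synthetic data: the mean count is proportional to exposure, so with a log link the offset is log(exposure), whose coefficient is fixed at 1 rather than estimated:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
exposure = rng.integers(1, 50, 300).astype(float)
x = rng.uniform(0, 1, 300)
X = sm.add_constant(x)

# The mean count is a known multiple of exposure:
#   mu = exposure * exp(b0 + b1*x)  =>  log(mu) = log(exposure) + b0 + b1*x
y = rng.poisson(exposure * np.exp(0.2 + 0.8 * x))

# With a log link, the offset is the natural log of the known variable.
fit = sm.GLM(y, X, family=sm.families.Poisson(),
             offset=np.log(exposure)).fit()
print(fit.params)  # should be near (0.2, 0.8)
```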