Reducing Complexity in Data in Microsoft Azure Flashcards

1
Q

A model uses three features to make predictions. Two of the features are categorical and hold three unique values each. Evaluation shows that the model performs poorly. What is the most likely cause?

Overfitting

Underfitting

Algorithm choice

A

Underfitting

2
Q

Which of these common model evaluation metrics incorporates feature set complexity?

MSE - Mean Squared Error

BIC - Bayesian Information Criterion

AUC - Area Under ROC Curve

RMSE - Root Mean Squared Error

A

BIC
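
A minimal sketch of why BIC reflects feature set complexity while MSE does not. It assumes a least-squares fit with Gaussian errors, where BIC reduces (up to an additive constant) to n ln(RSS/n) + k ln(n). The function names and the residual values are illustrative only.

```python
import math

def gaussian_bic(residuals, n_params):
    """BIC for a Gaussian least-squares fit, up to an additive constant."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    # First term rewards fit; second term penalizes each extra parameter.
    return n * math.log(rss / n) + n_params * math.log(n)

def mse(residuals):
    """Mean squared error: depends only on the residuals."""
    return sum(r * r for r in residuals) / len(residuals)

# Hypothetical residuals shared by two models of different size.
residuals = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, 0.4]
bic_small = gaussian_bic(residuals, n_params=2)
bic_large = gaussian_bic(residuals, n_params=5)
```

With identical residuals the MSE is the same for both models, but the model with more parameters gets the higher (worse) BIC.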

3
Q

You want to remove categorical features that are unlikely to add value to your model. What would be an effective first step?

Hashing categorical features

One-hot encode categorical variables

Identify columns with no or low variance

A

Identify columns with no or low variance
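
One way this first step might look, as a rough sketch: flag any categorical column whose most frequent value covers nearly all rows. The helper name, the 0.95 threshold, and the sample rows are assumptions chosen for illustration.

```python
from collections import Counter

def low_variance_columns(rows, threshold=0.95):
    """Return columns whose most common value covers >= threshold of the rows."""
    flagged = []
    for col in rows[0]:
        counts = Counter(row[col] for row in rows)
        top_share = counts.most_common(1)[0][1] / len(rows)
        if top_share >= threshold:
            flagged.append(col)
    return flagged

# Hypothetical data: "country" never varies, so it cannot help the model.
rows = [
    {"country": "US", "plan": "free", "color": "red"},
    {"country": "US", "plan": "paid", "color": "blue"},
    {"country": "US", "plan": "free", "color": "green"},
    {"country": "US", "plan": "paid", "color": "red"},
]
flagged = low_variance_columns(rows)
```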

4
Q

Removing highly correlated features is an important step for which algorithm family?

k-means

Regression

Trees

A

Regression
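
A hedged sketch of one common way to remove highly correlated features before fitting a regression: keep a feature only if its absolute correlation with every feature already kept is below a threshold. The function name, the 0.9 threshold, and the synthetic data are assumptions, not part of the original card.

```python
import numpy as np

def drop_correlated(X, names, threshold=0.9):
    """Greedily keep features whose |correlation| with kept features is < threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return [names[j] for j in keep]

# Synthetic example: "a_scaled" is almost a multiple of "a".
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, 2 * a + 0.01 * rng.normal(size=200), b])
kept = drop_correlated(X, ["a", "a_scaled", "b"])
```

Which of two correlated features to keep is a design choice; this sketch simply keeps the one that appears first.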

5
Q

Which of the following is NOT a commonly used kernel in kPCA?

Polynomial

Sigmoid

Radial Basis

Beta

A

Beta
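
For reference, the three real kernels named in the options can be written as short functions. This is a sketch using their usual textbook forms; the parameter names `gamma`, `coef0`, and `degree` and their defaults are assumptions for illustration.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """(gamma * <x, y> + coef0) ** degree"""
    return (gamma * dot(x, y) + coef0) ** degree

def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    """tanh(gamma * <x, y> + coef0)"""
    return math.tanh(gamma * dot(x, y) + coef0)

def rbf_kernel(x, y, gamma=1.0):
    """exp(-gamma * ||x - y||^2)"""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

x, y = [1.0, 0.0], [0.0, 1.0]  # orthogonal example vectors
```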

6
Q

High cardinality categorical variables have:

A large number of rows with a single label

High correlation with another feature

Labels with long names

A large number of distinct labels

A

A large number of distinct labels - High cardinality means that the column contains a large number of distinct values.
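
A small sketch of how cardinality could be measured: count the distinct labels in a column. The column names and values below are made up for illustration.

```python
def cardinality(values):
    """Number of distinct labels in a categorical column."""
    return len(set(values))

# Hypothetical columns with the same number of rows.
user_id = [f"u{i}" for i in range(1000)]   # every row has its own label
country = ["US", "DE", "FR", "JP"] * 250   # only four labels repeat
```

Here `user_id` has high cardinality and `country` has low cardinality. One-hot encoding creates one column per distinct label, which is why high-cardinality features add so much complexity.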

7
Q

What does Linear Discriminant Analysis (LDA) aim to maximize?

Difference between groups

Difference between observations

Sparse representation

Value compression

A

Difference between groups
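
A rough two-class sketch of what "difference between groups" means here. It assumes Fisher's formulation of LDA, which picks the projection direction that maximizes between-group separation relative to within-group spread. The function names and synthetic data are illustrative.

```python
import numpy as np

def fisher_ratio(X0, X1, w):
    """Squared gap between projected group means over the summed projected variances."""
    p0, p1 = X0 @ w, X1 @ w
    return (p1.mean() - p0.mean()) ** 2 / (p0.var() + p1.var())

def lda_direction(X0, X1):
    """Two-class Fisher direction: Sw^-1 (mean1 - mean0)."""
    Sw = np.cov(X0, rowvar=False, bias=True) + np.cov(X1, rowvar=False, bias=True)
    return np.linalg.solve(Sw, X1.mean(axis=0) - X0.mean(axis=0))

# Two synthetic groups sharing a correlated covariance.
rng = np.random.default_rng(0)
cov = [[4.0, 3.0], [3.0, 4.0]]
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=300)
X1 = rng.multivariate_normal([2.0, 0.0], cov, size=300)

w_lda = lda_direction(X0, X1)
w_naive = X1.mean(axis=0) - X0.mean(axis=0)  # ignores within-group spread
```

Projecting onto `w_lda` separates the groups at least as well, by this ratio, as projecting onto the plain difference of means.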

8
Q

k-means clustering can be classified as:

Supervised, parametric

Unsupervised, parametric

Unsupervised, non-parametric

Supervised, non-parametric

A

Unsupervised, non-parametric

Nonparametric statistics refers to methods that do not assume the data come from a prescribed model determined by a small number of parameters; examples of such prescribed models include the normal distribution and linear regression. Nonparametric methods sometimes use ordinal data, which relies on a ranking or order rather than on numeric values.

9
Q

What Python library might you use to build an autoencoder?

TensorFlow

PyStan

Scikit-learn

Codec

A

TensorFlow

10
Q

What must you ensure about your features before you perform PCA?

Features are not collinear

Features are similarly scaled

Features are normally distributed

Features have high variance

A

Features are similarly scaled
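
A minimal sketch of why scaling matters for PCA, using NumPy's SVD instead of a dedicated PCA class. The feature names and distributions are invented for illustration: one feature is measured in meters, the other in currency units.

```python
import numpy as np

def first_component(X):
    """Direction of greatest variance, from the SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[0]

rng = np.random.default_rng(0)
height_m = rng.normal(1.7, 0.1, size=500)        # small numeric range
income = rng.normal(50_000, 10_000, size=500)    # large numeric range
X = np.column_stack([height_m, income])

raw_pc1 = first_component(X)

# Standardize each feature to zero mean and unit variance, then repeat.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_pc1 = first_component(X_std)
```

On the raw data the first component points almost entirely along `income`, simply because its numbers are larger. After standardizing, both features contribute to the component.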
