Reducing Complexity in Data in Microsoft Azure Flashcards

Question 1

Q

A model uses three features to make predictions. Two of the features are categorical, with three unique values each. On evaluation, the model performs poorly. What is the most likely cause?

Overfitting

Underfitting

Algorithm choice

Answer

A

Underfitting

Question 2

Q

Which of these common model evaluation metrics incorporates feature set complexity?

MSE - Mean Squared Error

BIC - Bayesian Information Criterion

AUC - Area Under ROC Curve

RMSE - Root Mean Squared Error

Answer

A

BIC - Bayesian Information Criterion - BIC adds a penalty that grows with the number of model parameters.

Question 3

Q

You want to remove categorical features that are unlikely to add value to your model. What would be an effective first step?

Hashing categorical features

One-hot encode categorical variables

Identify columns with no or low variance

Answer

A

Identify columns with no or low variance
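
For a categorical column, "low variance" can be read as one label dominating almost every row. The sketch below is one possible way to flag such columns with pandas. The helper name `low_variance_columns`, the 0.9 threshold, and the toy data are all assumptions made for illustration.

```python
import pandas as pd

def low_variance_columns(df, threshold=0.9):
    """Return columns whose most frequent value covers at least `threshold` of the rows."""
    flagged = []
    for col in df.columns:
        # value_counts sorts descending, so the first entry is the dominant label.
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged

# Hypothetical toy data: "country" is constant, "plan" is nearly constant.
toy = pd.DataFrame({
    "country": ["US"] * 20,
    "plan": ["free"] * 19 + ["pro"],
    "color": ["red", "blue"] * 10,
})
flagged_columns = low_variance_columns(toy)
```

Columns flagged this way are candidates for removal, not automatic deletions; a rare label can still matter if it lines up with the target.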

Question 4

Q

Removing highly correlated features is an important step for which algorithm family?

k-means

Regression

Trees

Answer

A

Regression
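
Highly correlated predictors make regression coefficients unstable, which is why they are usually pruned before fitting. A minimal sketch of one common approach, assuming pandas and NumPy: drop one column from every pair whose absolute Pearson correlation exceeds a cutoff. The helper name `drop_correlated`, the 0.9 cutoff, and the synthetic data are assumptions.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop one column from each pair whose absolute correlation exceeds `threshold`."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
features = pd.DataFrame({
    "x1": x1,
    "x2": 2.0 * x1 + rng.normal(scale=0.01, size=200),  # near-duplicate of x1
    "x3": rng.normal(size=200),                          # independent noise
})
reduced, dropped = drop_correlated(features)
```

Which column of a correlated pair to keep is a judgment call; this sketch simply keeps the one that appears first.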

Question 5

Q

Which of the following is NOT a commonly used kernel in kPCA?

Polynomial

Sigmoid

Radial Basis

Beta

Answer

A

Beta

Question 6

Q

High cardinality categorical variables have:

A large number of rows with a single label

High correlation with another feature

Labels with long names

A large number of distinct labels

Answer

A

A large number of distinct labels - High cardinality means that the column contains a large number of distinct values.
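
Cardinality can be checked by counting distinct values per column. The sketch below assumes pandas; the column names and data are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one identifier-like column and one small label set.
toy = pd.DataFrame({
    "user_id": [f"u{i}" for i in range(100)],
    "state": ["CA", "NY", "TX", "WA"] * 25,
})

cardinality = toy.nunique()        # distinct values per column
ratio = cardinality / len(toy)     # share of rows that carry a distinct value
```

Here `user_id` has one distinct value per row (high cardinality), while `state` has only four.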

Question 7

Q

What does Linear Discriminant Analysis (LDA) aim to maximize?

Difference between groups

Difference between observations

Sparse representation

Value compression

Answer

A

Difference between groups
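
As a rough illustration of group separation, the sketch below projects two synthetic classes onto a single discriminant axis using scikit-learn's `LinearDiscriminantAnalysis`. The data, class means, and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Two hypothetical classes, separated along the first feature only.
class_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
class_b = rng.normal(loc=[4.0, 0.0], scale=1.0, size=(100, 2))
X = np.vstack([class_a, class_b])
y = np.array([0] * 100 + [1] * 100)

# LDA is supervised: the labels y define the groups to separate.
lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)

# Distance between the projected class means on the discriminant axis.
mean_gap = abs(X_1d[y == 0].mean() - X_1d[y == 1].mean())
```

With two classes, LDA can produce at most one component, so the two features collapse to one axis chosen to keep the classes apart.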

Question 8

Q

k-means clustering can be classified as:

Supervised, parametric

Unsupervised, parametric

Unsupervised, non-parametric

Supervised, non-parametric

Answer

A

Unsupervised, non-parametric

Nonparametric statistics refers to statistical methods in which the data are not assumed to come from prescribed models determined by a small number of parameters; examples of such prescribed models include the normal distribution model and the linear regression model. Nonparametric statistics sometimes uses ordinal data, which relies on a ranking or order rather than on numeric values.
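
The "unsupervised" part is easy to see in code: the model is fitted on features alone, with no labels passed in. A minimal sketch, assuming scikit-learn's `KMeans` and two synthetic, well-separated blobs (the data and parameter choices are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two hypothetical, well-separated blobs of 50 points each.
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# No target vector is supplied; only the number of clusters is chosen up front.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
```

The cluster ids themselves are arbitrary, so only the grouping is meaningful, not which blob is called 0 or 1.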

Question 9

Q

What Python library might you use to build an autoencoder?

TensorFlow

PyStan

Scikit-learn

Codec

Answer

A

TensorFlow

Question 10

Q

What must you ensure about your features before you perform PCA?

Features are not collinear

Features are similarly scaled

Features are normally distributed

Features have high variance

Answer

A

Features are similarly scaled
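
PCA follows variance, so a feature measured in large units can dominate the first component purely because of its scale. The sketch below shows this with NumPy only, using an SVD in place of a library PCA class. The helper `first_component` and the two synthetic features are assumptions for illustration.

```python
import numpy as np

def first_component(X):
    """Return the first principal axis (unit vector) via SVD of the centered data."""
    centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

rng = np.random.default_rng(0)
height_m = rng.normal(1.7, 0.1, size=500)        # small numeric range
income = rng.normal(50_000, 15_000, size=500)    # large numeric range
X = np.column_stack([height_m, income])

# Unscaled: the first axis points almost entirely along income.
raw_axis = first_component(X)

# Standardized (zero mean, unit variance): both features can contribute.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_axis = first_component(X_scaled)
```

Comparing `raw_axis` with `scaled_axis` shows the loading on `height_m` going from near zero to a substantial share once both features are on the same scale.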

(10 cards)