Reducing Complexity in Data in Microsoft Azure Flashcards Preview

Flashcards in Reducing Complexity in Data in Microsoft Azure Deck (10)
1

A model uses three features to make predictions. Two of the features are categorical and hold three unique values each. On evaluation, we find that the model performs poorly. What is the most likely cause?

Overfitting

Underfitting

Algorithm choice

Underfitting

2

Which of these common model evaluation metrics incorporates feature set complexity?

MSE - Mean Squared Error

BIC - Bayesian Information Criterion

AUC - Area Under ROC Curve

RMSE - Root Mean Squared Error

BIC
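
BIC is commonly defined as k ln(n) - 2 ln(L), where k is the number of fitted parameters, n the number of observations, and L the maximized likelihood, so each extra feature adds a penalty of ln(n). The sketch below assumes a least-squares fit with Gaussian errors, where the formula reduces (up to an additive constant) to n ln(RSS/n) + k ln(n). The function name and the numbers are hypothetical.

```python
import math

def bic_gaussian(rss: float, n: int, k: int) -> float:
    """BIC for a least-squares fit with Gaussian errors, up to an additive constant."""
    return n * math.log(rss / n) + k * math.log(n)

# Two hypothetical models with identical fit quality but different feature counts.
small = bic_gaussian(rss=50.0, n=100, k=3)
large = bic_gaussian(rss=50.0, n=100, k=10)

# Lower BIC is better, so the model with fewer parameters is preferred here.
print(small < large)
```

MSE, RMSE, and AUC have no such term: they score predictions only, regardless of how many features produced them.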

3

You want to remove categorical features that are unlikely to add value to your model. What would be an effective first step?

Hashing categorical features

One-hot encode categorical variables

Identify columns with no or low variance

Identify columns with no or low variance
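
One possible way to screen categorical columns for low variance is to check how much of the column a single label covers. This is a minimal sketch in plain Python; the function name, the 0.9 threshold, and the sample data are all assumptions chosen for illustration.

```python
from collections import Counter

def low_variance_columns(table, threshold=0.9):
    """Return names of columns whose most frequent label covers >= threshold of rows."""
    flagged = []
    for name, values in table.items():
        top_count = Counter(values).most_common(1)[0][1]
        if top_count / len(values) >= threshold:
            flagged.append(name)
    return flagged

# Hypothetical data: one constant column, one nearly constant, one balanced.
data = {
    "country": ["US"] * 20,
    "plan": ["free"] * 19 + ["pro"],
    "color": ["red", "blue"] * 10,
}
flagged = low_variance_columns(data)
print(flagged)
```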

4

Removing highly correlated features is an important step for which algorithm family?

k-means

Regression

Trees

Regression
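
Highly correlated inputs (multicollinearity) make regression coefficients unstable and hard to interpret. A minimal NumPy sketch for spotting such pairs follows; the synthetic data and the 0.9 cutoff are assumptions, not values from the course.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.01, size=200)  # nearly a rescaled copy of x1
x3 = rng.normal(size=200)                         # unrelated feature
X = np.column_stack([x1, x2, x3])

# Pairwise Pearson correlations between columns.
corr = np.corrcoef(X, rowvar=False)

# Collect column index pairs above the cutoff (upper triangle only).
n_cols = corr.shape[0]
pairs = [(i, j) for i in range(n_cols) for j in range(i + 1, n_cols)
         if abs(corr[i, j]) > 0.9]
print(pairs)
```

One feature from each flagged pair is a candidate for removal before fitting the regression.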

5

Which of the following is NOT a commonly used kernel in kPCA?

Polynomial

Sigmoid

Radial Basis

Beta

Beta
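
For reference, the three real kernels listed above can be written as plain functions of two vectors. This is a sketch of their usual textbook forms; the parameter names (gamma, coef0, degree) and default values are assumptions for illustration.

```python
import numpy as np

def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """(gamma * <x, y> + coef0) ** degree"""
    return (gamma * np.dot(x, y) + coef0) ** degree

def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    """tanh(gamma * <x, y> + coef0)"""
    return np.tanh(gamma * np.dot(x, y) + coef0)

def rbf_kernel(x, y, gamma=1.0):
    """exp(-gamma * ||x - y||^2)"""
    return np.exp(-gamma * np.sum((x - y) ** 2))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(polynomial_kernel(a, b), sigmoid_kernel(a, b), rbf_kernel(a, b))
```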

6

High cardinality categorical variables have:

A large number of rows with a single label

High correlation with another feature

Labels with long names

A large number of distinct labels

A large number of distinct labels - High cardinality means that the column contains a large number of distinct values.
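
Cardinality can be checked by counting distinct labels per column. A minimal sketch in plain Python, with a hypothetical helper name and made-up data:

```python
def cardinality_report(table):
    """Map each column name to (distinct labels, distinct labels / rows)."""
    report = {}
    for name, values in table.items():
        distinct = len(set(values))
        report[name] = (distinct, distinct / len(values))
    return report

# Hypothetical data: an ID-like column and a small categorical column.
data = {
    "customer_id": [f"c{i}" for i in range(1000)],
    "plan": ["free", "pro", "team", "free"] * 250,
}
report = cardinality_report(data)
print(report)
```

Here customer_id is high cardinality (every row has its own label), while plan has only three labels.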

7

What does Linear Discriminant Analysis (LDA) aim to maximize?

Difference between groups

Difference between observations

Sparse representation

Value compression

Difference between groups
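
More precisely, LDA looks for directions that maximize the separation between group means relative to the spread within each group. For a single feature and two groups this ratio can be sketched as below; the function name and the sample values are assumptions for illustration.

```python
import numpy as np

def fisher_ratio(a, b):
    """Squared distance between group means divided by the summed within-group variance."""
    between = (np.mean(a) - np.mean(b)) ** 2
    within = np.var(a) + np.var(b)
    return between / within

group_a = np.array([1.0, 2.0, 3.0])
group_b_near = np.array([2.0, 3.0, 4.0])     # overlaps with group_a
group_b_far = np.array([11.0, 12.0, 13.0])   # well separated from group_a

near = fisher_ratio(group_a, group_b_near)
far = fisher_ratio(group_a, group_b_far)
print(near, far)
```

The well-separated pair scores far higher, which is the kind of projection LDA prefers.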

8

k-means clustering can be classified as:

Supervised, parametric

Unsupervised, parametric

Unsupervised, non-parametric

Supervised, non-parametric

Unsupervised, non-parametric

Nonparametric statistics refers to statistical methods that do not assume the data come from a prescribed model determined by a small number of parameters, such as the normal distribution model or the linear regression model. Nonparametric methods sometimes use ordinal data, which relies on a ranking or order rather than on numeric values.
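
The "unsupervised" part is easy to see in code: the algorithm receives only the feature matrix and the number of clusters, never any labels. Below is a minimal NumPy sketch of the standard assign-then-update loop, not a production implementation; the function name, iteration count, and toy data are assumptions.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Basic k-means: assign points to the nearest center, then recompute centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points (fancy indexing returns a copy of those rows).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave a center unchanged if its cluster is empty
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Two well-separated toy groups; no target column is supplied.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centers = kmeans(X, k=2)
print(labels)
```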

9

What Python library might you use to build an autoencoder?

TensorFlow

PyStan

Scikit-learn

Codec

TensorFlow

10

What must you ensure about your features before you perform PCA?

Features are not collinear

Features are similarly scaled

Features are normally distributed

Features have high variance

Features are similarly scaled
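
PCA follows variance, so a feature measured in large units can dominate the first component purely because of its scale. The NumPy sketch below illustrates this with two synthetic, independent features; the feature names, distributions, and helper function are assumptions for illustration.

```python
import numpy as np

def first_component(X):
    """Unit-length first principal axis of X, via SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[0]

rng = np.random.default_rng(0)
height_m = rng.normal(1.7, 0.1, size=500)        # spread of about 0.1
income = rng.normal(50_000, 10_000, size=500)    # spread of about 10,000
X = np.column_stack([height_m, income])

# Without scaling, the first component points almost entirely along income.
raw = first_component(X)

# After standardizing each column, both features carry equal weight.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = first_component(X_scaled)
print(np.abs(raw), np.abs(scaled))
```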