Terminology Flashcards

Question 1

Q

Overfitting

Answer

A

A phenomenon where a model does very well on training data but does poorly during validation or on new data

Question 2

Q

Underfitting

Answer

A

When a model does very poorly because it failed to capture important features or distinctions even on the training set

Question 3

Q

Validation set

Answer

A

A subset of the data, held out from training, used to estimate a model's accuracy while it is still being tuned.

Question 4

Q

Training set

Answer

A

A data set used to fit the model.

Question 5

Q

Testing set

Answer

A

A data set kept separate from training and tuning, used to provide an unbiased evaluation of the final model.
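The three data sets described in these cards can be produced from one collection of rows. The sketch below is a minimal illustration, assuming a simple shuffle-and-slice strategy; the function name `split_dataset` and the 20% fractions are hypothetical choices, not something the cards specify.

```python
import random


def split_dataset(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and testing sets."""
    shuffled = list(rows)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = split_dataset(range(10))
print(len(train), len(val), len(test))  # 6 2 2
```

The model is fitted on `train`, tuned against `val`, and scored once on `test` at the end.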

Question 6

Q

Model fitting

Answer

A

Adjusting a model's parameters so that it approximates the target function underlying the data.

Question 7

Q

Decision tree regression

Answer

A

Similar to a decision tree classifier, but used to predict a continuous value; mean squared error is used to decide where to split.
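To make the role of mean squared error concrete, here is a minimal sketch of how one split might be chosen on a single numeric feature. It assumes the common approach of picking the threshold that minimizes the size-weighted MSE of the two sides; the helper names `mse` and `best_split` are hypothetical.

```python
def mse(values):
    """Mean squared error of the values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(xs, ys):
    """Return (threshold, score) minimizing the weighted MSE of the two sides."""
    best_threshold, best_score = None, float("inf")
    # Try every distinct feature value except the largest as a threshold.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = (len(left) * mse(left) + len(right) * mse(right)) / len(ys)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]
print(best_split(xs, ys))  # (3, 0.0)
```

A full regression tree would apply this search recursively to each side until a stopping rule is met.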

Question 8

Q

Imputation

Answer

A

The filling of missing values in a dataset
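One common imputation strategy is to replace each missing value with the mean of the observed values in the same column. The sketch below assumes missing entries are represented as `None`; the function name `impute_mean` is hypothetical.

```python
def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


ages = [20, None, 40, None]
print(impute_mean(ages))  # [20, 30.0, 40, 30.0]
```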

Question 9

Q

Categorical attributes

Answer

A

Attributes whose values fall into a fixed set of ‘categories’ rather than a numeric range

Question 10

Q

What techniques are used to deal with categorical data?

Answer

A

1) Dropping categorical columns
2) Label encoding
3) One-hot encoding

Question 11

Q

Label encoding

Answer

A

Assigning a unique integer to each distinct categorical value
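As a small illustration, the sketch below maps each distinct category to an integer. It assumes the integers are assigned in sorted order of the category names, which is one possible convention; `label_encode` is a hypothetical helper.

```python
def label_encode(values):
    """Map each distinct category to an integer and encode the values."""
    categories = sorted(set(values))
    mapping = {category: index for index, category in enumerate(categories)}
    return [mapping[v] for v in values], mapping


colors = ["red", "green", "blue", "green"]
codes, mapping = label_encode(colors)
print(codes)    # [2, 1, 0, 1]
print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}
```

The integers imply an ordering (blue < green < red here), so this encoding fits best when the categories have a natural order.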

Question 12

Q

One-hot encoding

Answer

A

Creation of a new column for each unique categorical value, holding 1 where a row has that value and 0 otherwise
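A minimal sketch of one-hot encoding follows. It assumes the new columns are ordered by sorted category name; `one_hot_encode` is a hypothetical helper.

```python
def one_hot_encode(values):
    """Return the category columns and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == category else 0 for category in categories] for v in values]
    return categories, rows


colors = ["red", "green", "blue", "green"]
columns, rows = one_hot_encode(colors)
print(columns)  # ['blue', 'green', 'red']
print(rows)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Each row contains exactly one 1, so no ordering between categories is implied.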

Question 13

Q

Pipeline

Answer

A

An automated workflow that bundles preprocessing and modeling steps together so they run as a single unit
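The idea of bundling preprocessing and modeling can be sketched with a few small classes. Everything here is a hypothetical toy, assuming a single numeric feature with `None` for missing values: `MeanImputer`, `NearestNeighbor`, and `SimplePipeline` are illustrative names, not any library's API.

```python
class MeanImputer:
    """Preprocessing step: fill None with the mean learned during fit."""

    def fit(self, xs):
        observed = [x for x in xs if x is not None]
        self.mean_ = sum(observed) / len(observed)
        return self

    def transform(self, xs):
        return [self.mean_ if x is None else x for x in xs]


class NearestNeighbor:
    """Toy model: predict the target of the closest training point."""

    def fit(self, xs, ys):
        self.points_ = list(zip(xs, ys))
        return self

    def predict(self, xs):
        return [min(self.points_, key=lambda p: abs(p[0] - x))[1] for x in xs]


class SimplePipeline:
    """Run each preprocessing step, then the model, as one unit."""

    def __init__(self, preprocessors, model):
        self.preprocessors = preprocessors
        self.model = model

    def fit(self, xs, ys):
        for step in self.preprocessors:
            xs = step.fit(xs).transform(xs)
        self.model.fit(xs, ys)
        return self

    def predict(self, xs):
        # Reuse what each step learned during fit; nothing is refitted here.
        for step in self.preprocessors:
            xs = step.transform(xs)
        return self.model.predict(xs)


pipe = SimplePipeline([MeanImputer()], NearestNeighbor())
pipe.fit([1.0, None, 3.0], [10, 20, 30])
print(pipe.predict([None, 2.9]))  # [20, 30]
```

Because the imputer's mean is learned only inside `fit`, new data passed to `predict` is preprocessed with the same values the model was trained on.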

Question 14

Q

Cross validation

Answer

A

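Splitting the training data into several subsets (folds) and validating on each fold in turn, while training on the rest, to get a more reliable estimate of model performance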
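A minimal sketch of k-fold cross validation, assuming contiguous folds with no shuffling, is shown below. It only generates the index splits; `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


for train, validation in k_fold_indices(6, 3):
    print(train, validation)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```

A model would be fitted on each `train` split and scored on the matching `validation` split, and the scores averaged.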

Question 15

Q

Variance

Answer

A

How much a model's results change when it is trained or tested on different data sets.

Question 16

Q

Data leakage

Answer

A

When the training data contains a feature tied to the target in a way that will not hold at prediction time, so the model shows great accuracy on the training data but does poorly on unseen data

(16 cards)