Terminology Flashcards
(16 cards)
Overfitting
A phenomenon where a model does very well on training data but does poorly during validation or on new data
Underfitting
When a model does very poorly because it failed to capture important features or distinctions even on the training set
Validation set
A subset of the training data held out from fitting and used to estimate a model's accuracy while it is being tuned.
Training set
A data set used to fit the model.
Testing set
A data set, kept separate from training and tuning, used to provide an unbiased evaluation of the final model.
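A minimal sketch of how one data set might be divided into training, validation, and testing sets. The function name `split_dataset`, the 60/20/20 proportions, and the fixed seed are illustrative assumptions, not a prescribed recipe.

```python
import random

def split_dataset(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]                 # touched only for the final evaluation
    val = shuffled[n_test:n_test + n_val]    # used while tuning
    train = shuffled[n_test + n_val:]        # used to fit the model
    return train, val, test

train, val, test = split_dataset(range(10))
```

Every row lands in exactly one of the three sets, so the testing set stays unseen until the end.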
Model fitting
Adjusting a model's parameters so that it approximates the target function underlying the data.
Decision tree regression
Similar to a classification decision tree but used to predict a continuous value; mean squared error is commonly used to choose where each split is made.
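A rough sketch of how mean squared error can guide a regression tree: for a single feature, try each candidate threshold and keep the one with the lowest weighted MSE across the two sides. The helper names `mse` and `best_split` are hypothetical, and a real tree repeats this search recursively over many features.

```python
def mse(values):
    """Mean squared error of values around their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(xs, ys):
    """Return (threshold, weighted_mse) for the best single split on one feature."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Weight each side's MSE by the number of samples it holds.
        score = (len(left) * mse(left) + len(right) * mse(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]
threshold, score = best_split(xs, ys)  # threshold 6.5 separates the two groups
```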
Imputation
The filling of missing values in a dataset
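As an illustration, one simple imputation strategy replaces each missing value with the mean of the observed values. The function name `impute_mean` and the use of `None` to mark a missing value are assumptions made for this sketch.

```python
def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

ages = [20, None, 40, None, 30]
filled = impute_mean(ages)  # the two gaps are filled with 30.0
```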
Categorical attributes
Attributes whose values fall into a fixed set of categories rather than a numeric range
What techniques are used to deal with categorical data?
1) Dropping categorical columns
2) Label encoding
3) One-hot encoding
Label encoding
Assigning a unique integer to each distinct value of a categorical attribute
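A small sketch of label encoding. The function name `label_encode` and the choice to number categories in sorted order are assumptions; any consistent mapping from category to integer would do.

```python
def label_encode(values):
    """Map each distinct category to an integer, numbered in sorted order."""
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

colors = ["red", "green", "blue", "green"]
codes, mapping = label_encode(colors)
```

The integers imply an ordering (here blue < green < red), which a model may treat as meaningful even when the categories have no natural order.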
One-hot encoding
Creation of a new binary column for each unique category value, marking whether that value is present (1) or absent (0) in each row
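A small sketch of one-hot encoding, where each row gets a 1 in the column for its category and 0 elsewhere. The function name `one_hot_encode` and the sorted column order are assumptions for illustration.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

colors = ["red", "green", "blue", "green"]
categories, rows = one_hot_encode(colors)
```

Unlike label encoding, this imposes no ordering on the categories, at the cost of one extra column per category.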
Pipeline
Automation of a workflow that bundles preprocessing and modeling steps together
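To make the idea concrete, here is a toy pipeline that bundles an imputation step with a nearest-neighbor model on a single feature. All class names (`MeanImputer`, `NearestNeighborModel`, `SimplePipeline`) are hypothetical stand-ins written for this sketch and are not any library's API.

```python
class MeanImputer:
    """Preprocessing step: learns the mean on fit, fills None with it on transform."""
    def fit(self, xs):
        observed = [x for x in xs if x is not None]
        self.mean_ = sum(observed) / len(observed)
        return self

    def transform(self, xs):
        return [self.mean_ if x is None else x for x in xs]

class NearestNeighborModel:
    """Toy model: predicts the target of the closest training point."""
    def fit(self, xs, ys):
        self.points_ = list(zip(xs, ys))
        return self

    def predict(self, xs):
        return [min(self.points_, key=lambda p: abs(p[0] - x))[1] for x in xs]

class SimplePipeline:
    """Bundles preprocessing and modeling so both run with one fit/predict call."""
    def __init__(self, preprocessor, model):
        self.preprocessor = preprocessor
        self.model = model

    def fit(self, xs, ys):
        self.preprocessor.fit(xs)  # statistics come from the training data only
        self.model.fit(self.preprocessor.transform(xs), ys)
        return self

    def predict(self, xs):
        return self.model.predict(self.preprocessor.transform(xs))

pipeline = SimplePipeline(MeanImputer(), NearestNeighborModel())
pipeline.fit([1.0, None, 3.0], [10, 20, 30])
predictions = pipeline.predict([None, 2.9])
```

Because the imputer's mean is learned inside `fit`, new data is preprocessed with the training statistics, so the two stages cannot drift apart.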
Cross validation
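Repeatedly splitting the training data into subsets (folds), fitting on some folds and validating on the remaining one, then averaging the scores to get a more reliable measure of model quality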
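A sketch of how k-fold cross validation might partition the data: each fold takes one turn as the validation subset while the rest is used for training. The function name `k_fold_indices` is hypothetical, and shuffling is left out for brevity.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(6, 3))
```

A model would be fitted and scored once per fold, and the k scores averaged.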
Variance
How much a model's results change when it is trained or tested on different data sets; high variance is typical of overfitting.
Data leakage
When the training data contains information about the target that will not be available at prediction time, so the model appears highly accurate during training and validation but does poorly on unseen data