Machine Learning Flashcards
(28 cards)
Hierarchical clustering is most likely used when the problem involves
Classifying unlabeled data
and when the number of categories is unknown
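A minimal sketch of bottom-up (agglomerative) hierarchical clustering on one-dimensional data, assuming single linkage and a hypothetical `distance_threshold` stopping rule. The function name and sample values are illustrative only. The number of clusters is not supplied; it falls out of the data.

```python
def agglomerative(points, distance_threshold):
    """Bottom-up clustering of 1-D points with single linkage.

    Merging stops once the two closest clusters are farther apart than
    distance_threshold, so the number of clusters is an output, not an input.
    """
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > 1:
        best = None  # (distance, i, j) for the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the two nearest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > distance_threshold:
            break  # remaining clusters are too far apart to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


groups = agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], distance_threshold=1.0)
print(groups)  # -> [[1.0, 1.2], [5.0, 5.1], [9.0]]
```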
What is supervised machine learning?
It involves training an algorithm to take a set of inputs (X variables) and find a model that best relates them to outputs (Y variables).
Training algorithm - set of inputs - find a model that relates them to outputs.
What is unsupervised machine learning?
Like supervised learning, the algorithm learns from data, but it does not make use of labeled training data.
We give it data and expect the algorithm to make sense of it.
What is overfitting?
ML algorithms can produce overly complex models that fit the training data too well and therefore do not generalize well to new data.
The prediction model fitted to the training sample (in-sample data) is too complex.
The model learned from the training data does not work well with new data.
Name Supervised ML Algorithms
Penalized regression
Support Vector Machine (SVM)
K-Nearest Neighbor (KNN)
Classification and Regression Trees (CART)
Ensemble learning
Random Forest
Name unsupervised ML algorithms
Principal components analysis (PCA)
K-means clustering
Hierarchical clustering
High Bias Error in ML
High Bias Error means the model does not fit the training data well.
High Variance Error in ML
High variance error means the model does not predict well on the test data.
Name dimension reduction techniques in ML
Principal components analysis (unsupervised ML)
Penalized regression (supervised ML)
What does Penalized Regression do?
- Similar to maximizing adjusted R-squared.
- Dimension reduction
- Eliminates/minimizes overfitting
Regression coefficients are chosen to minimize the sum of the squared errors plus a penalty term that increases with the number of included features.
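The objective can be sketched in Python as follows, assuming a LASSO-style penalty (the sum of absolute coefficient values times a hypothetical weight `lam`). The card does not name a specific penalty, so this form is an assumption. The function only evaluates the objective; it does not search for the best coefficients.

```python
def lasso_objective(X, y, intercept, coefs, lam):
    """Sum of squared errors plus lam * sum(|coef|) (LASSO-style penalty)."""
    sse = 0.0
    for row, target in zip(X, y):
        prediction = intercept + sum(b * x for b, x in zip(coefs, row))
        sse += (target - prediction) ** 2
    penalty = lam * sum(abs(b) for b in coefs)
    return sse + penalty


# Illustrative data: y depends only on the first feature.
X = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
y = [2.0, 4.0, 6.0]

sparse = lasso_objective(X, y, 0.0, [2.0, 0.0], lam=0.5)  # second feature excluded
dense = lasso_objective(X, y, 0.0, [2.0, 0.1], lam=0.5)   # second feature included
print(sparse, dense)
```

With these made-up numbers, including the second feature raises the objective, so a minimizer would prefer the model that leaves it out.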
What is SVM?
Support Vector Machine
It is used for classification, regression, and outlier detection.
In its basic linear form, it classifies data that is not complex or non-linear (i.e., data that can be separated by a linear boundary).
It is a linear classifier that determines the hyperplane that optimally separates the observations into two sets of data points.
The basic (hard-margin) linear form does not require any hyperparameter; soft-margin and kernel variants do.
It maximizes the probability of making a correct prediction by determining the boundary that is furthest from all observations.
Outliers far from the boundary do not affect either the support vectors or the discriminant boundary.
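A small Python sketch of the margin idea, under stated assumptions: it does not train an SVM, it only compares two hypothetical candidate boundaries (`narrow` and `wide`) by their distance to the closest observation. All names and data points are made up for illustration.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def classify(w, b, x):
    """Assign +1 or -1 depending on which side of the hyperplane x lies."""
    return 1 if signed_distance(w, b, x) >= 0 else -1


def margin(w, b, points):
    """Smallest absolute distance from any observation to the boundary."""
    return min(abs(signed_distance(w, b, p)) for p in points)


# Two groups of observations; (9, 9) is an outlier far from the boundary.
points = [(1.0, 1.0), (2.0, 0.0), (4.0, 4.0), (5.0, 3.0), (9.0, 9.0)]

narrow = ([1.0, 1.0], -3.0)  # boundary x1 + x2 = 3
wide = ([1.0, 1.0], -5.0)    # boundary x1 + x2 = 5

print(margin(*narrow, points), margin(*wide, points))
print([classify(*wide, p) for p in points])
```

Both boundaries separate the two groups, but the second one sits further from the closest observations, which is the property an SVM looks for. Dropping the distant outlier leaves the margin unchanged, because only the closest observations (the support vectors) determine it.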
What is K-Nearest Neighbor (KNN)?
Classification
Classifies a new observation by finding similarities ("nearness") with the existing data.
Makes no assumption about the distribution of the data.
It is non-parametric.
KNN results can be sensitive to the inclusion of irrelevant or correlated features, so it may be necessary to select features manually.
Thereby removing less relevant information.
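A minimal KNN classifier in Python, assuming Euclidean distance and a simple majority vote. The function name, labels, and data are hypothetical. The only thing to choose is `k`; nothing is assumed about how the data is distributed.

```python
import math
from collections import Counter


def knn_classify(train, new_point, k):
    """train is a list of (features, label) pairs.

    Returns the majority label among the k training observations
    closest to new_point (Euclidean distance).
    """
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], new_point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B"),
]
print(knn_classify(train, (2.0, 2.0), k=3))
print(knn_classify(train, (6.0, 5.0), k=3))
```

Because every feature enters the distance calculation with equal weight, an irrelevant feature can pull the wrong neighbors closer, which is why feature selection matters here.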
What is CART?
Classification and Regression Trees
Part of supervised ML
Typically applied when the target is binary.
If the goal is regression, the prediction is the mean of the target values in the terminal node.
Makes no assumption about the characteristics of the training data, so if left unconstrained, it can potentially learn the training data perfectly.
To avoid overfitting, regularization parameters can be added, such as the maximum depth of the tree.
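A sketch of a regression tree on a single feature, assuming splits are chosen to minimize the sum of squared errors. The names `build_tree`, `predict`, and `max_depth`, and the data, are illustrative. Each terminal node predicts the mean of its targets, and `max_depth` is the regularization parameter.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(xs, ys, max_depth):
    """Grow a regression tree on one feature, stopping at max_depth."""
    if max_depth == 0 or len(set(xs)) == 1:
        return {"prediction": mean(ys)}  # terminal node: mean of its targets
    best = None  # (error, threshold) of the best split found so far
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        error = sse(left) + sse(right)
        if best is None or error < best[0]:
            best = (error, threshold)
    threshold = best[1]
    left_idx = [i for i, x in enumerate(xs) if x < threshold]
    right_idx = [i for i, x in enumerate(xs) if x >= threshold]
    return {
        "threshold": threshold,
        "left": build_tree([xs[i] for i in left_idx],
                           [ys[i] for i in left_idx], max_depth - 1),
        "right": build_tree([xs[i] for i in right_idx],
                            [ys[i] for i in right_idx], max_depth - 1),
    }


def predict(tree, x):
    while "prediction" not in tree:
        tree = tree["left"] if x < tree["threshold"] else tree["right"]
    return tree["prediction"]


xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]

shallow = build_tree(xs, ys, max_depth=1)   # constrained: one split, two leaves
deep = build_tree(xs, ys, max_depth=10)     # effectively unconstrained here
print(predict(shallow, 2.5), predict(shallow, 11.0))
print([predict(deep, x) for x in xs])
```

The deep tree reproduces every training target exactly, which illustrates how an unconstrained tree can memorize the training data.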
What are the 3 types of layers in a neural network?
- Input layer
- Hidden layer
- Output layer
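A forward pass through the three layer types can be sketched in Python as below. The weights, biases, and activation choices (ReLU for the hidden layer, sigmoid for the output) are hypothetical and fixed by hand; a real network would learn the weights from data.

```python
import math


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(weights . inputs + bias) per node."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights: 2 inputs -> 3 hidden nodes -> 1 output node.
hidden_w = [[0.5, -0.2], [0.3, 0.8], [-0.6, 0.1]]
hidden_b = [0.0, 0.1, 0.2]
output_w = [[1.0, -1.0, 0.5]]
output_b = [0.0]

input_layer = [1.0, 2.0]                                         # input layer
hidden_layer = dense(input_layer, hidden_w, hidden_b, relu)      # hidden layer
output_layer = dense(hidden_layer, output_w, output_b, sigmoid)  # output layer
print(hidden_layer, output_layer)
```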
What are non-linear functions more susceptible to?
Variance error and overfitting
What are linear functions more susceptible to?
Bias error and underfitting
The main distinction between clustering and classification algorithms is that
In clustering, the groups are determined by the data
In classification, the groups are determined by the analyst/researcher
What is K-Means clustering in ML?
K-means partitions observations into a fixed number, k, of non-overlapping clusters.
Each cluster is characterized by its centroid, and each observation is assigned by the algorithm to the cluster with the centroid to which that observation is closest.
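The assign-then-update loop can be sketched in Python as follows, assuming Euclidean distance, hand-picked starting centroids, and a fixed number of iterations in place of a convergence check. Function names and data are illustrative.

```python
import math


def assign(points, centroids):
    """Put each point into the cluster whose centroid is closest."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        clusters = assign(points, centroids)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid  # an empty cluster keeps its old centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, assign(points, centroids)


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, [(0.0, 0.0), (10.0, 10.0)])  # k = 2
print(centroids)
print(clusters)
```

Here k is fixed at 2 by the number of starting centroids, and every observation ends up in exactly one cluster.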
High bias error and high variance error are indicative of…
Underfitting
High bias error = the model does not fit the training data well.
High variance error = the model does not predict well on test data.
The combination of both results in an underfitted model.
Low bias error but high variance error is indicative of…
Overfitting
Low bias error = the model fits the training data well.
High variance error = the model does not predict well on test data.
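A rough illustration in Python of how in-sample and out-of-sample error separate the two cases, using made-up data and two deliberately extreme hypothetical models: a `memorizer` that copies the nearest training target and a `constant` model that always predicts the training mean.

```python
def mse(model, xs, ys):
    """Mean squared error of model's predictions on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [0.5, 1.5, 4.5, 5.5, 8.5]  # roughly y = 2x plus noise
test_x = [0.4, 1.4, 2.4, 3.4]
test_y = [0.8, 2.8, 4.8, 6.8]        # y = 2x, not seen during training


def memorizer(x):
    """Very flexible model: returns the target of the nearest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


train_mean = sum(train_y) / len(train_y)


def constant(x):
    """Very rigid model: always predicts the training mean."""
    return train_mean


for name, model in [("memorizer", memorizer), ("constant", constant)]:
    print(name, mse(model, train_x, train_y), mse(model, test_x, test_y))
```

The memorizer has zero error on the training data but a clearly positive error on the test data (the overfitting pattern), while the constant model already fits the training data poorly (the underfitting pattern).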
What are linear models more susceptible to?
Bias Error (underfitting)
What are non-linear models more prone to?
Variance Error
(overfitting)
What is Principal Components Analysis (PCA)?
It is part of unsupervised ML
Dimension Reduction
Used to reduce highly correlated features of data into a few main uncorrelated composite variables.
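A sketch of PCA for exactly two features in Python, using the closed-form eigenvalues of the 2x2 sample covariance matrix. The function name and data are hypothetical, and the eigenvector step assumes the two features are correlated (non-zero covariance).

```python
import math


def pca_2d(data):
    """PCA for two features via the eigenvalues of the 2x2 covariance matrix.

    Returns (explained_variance_ratios, first_component).
    """
    n = len(data)
    mean_x = sum(p[0] for p in data) / n
    mean_y = sum(p[1] for p in data) / n
    # sample covariance matrix [[a, b], [b, c]]
    a = sum((p[0] - mean_x) ** 2 for p in data) / (n - 1)
    c = sum((p[1] - mean_y) ** 2 for p in data) / (n - 1)
    b = sum((p[0] - mean_x) * (p[1] - mean_y) for p in data) / (n - 1)
    half_trace = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    eig1, eig2 = half_trace + spread, half_trace - spread
    # eigenvector for the larger eigenvalue (assumes b != 0)
    vx, vy = b, eig1 - a
    length = math.hypot(vx, vy)
    total = eig1 + eig2
    return (eig1 / total, eig2 / total), (vx / length, vy / length)


# Two highly correlated features: the second is roughly twice the first.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
ratios, first_component = pca_2d(data)
print(ratios)
print(first_component)
```

On this data, almost all of the variance lies along the first component, so the two correlated features could be replaced by a single composite variable with little loss of information.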
What are the 3 types of error in ML?
Bias error
Variance error
Base error