Machine Learning Flashcards
(89 cards)
What is Semi-supervised Machine Learning?
Semi-supervised learning is the blend of supervised and unsupervised learning. The algorithm is trained on a mix of labeled and unlabeled data. Generally, it is utilized when we have a very small labeled dataset and a large unlabeled dataset.
In simple terms, unsupervised (clustering-style) structure is combined with the existing labeled data to propagate labels to the rest of the unlabelled data. Semi-supervised algorithms rely on the continuity assumption, the cluster assumption, and the manifold assumption.
It is generally used to save the cost of acquiring labeled data. Examples include protein sequence classification, automatic speech recognition, and self-driving cars.
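A minimal sketch using scikit-learn's SelfTrainingClassifier (one of several semi-supervised approaches); the synthetic dataset and the roughly 10% labelling rate are assumptions made for the example.

```python
# Minimal sketch: semi-supervised self-training with scikit-learn.
# Unlabelled samples are marked with -1, following the scikit-learn convention.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Pretend we only have labels for ~10% of the data.
rng = np.random.RandomState(0)
y_partial = y.copy()
y_partial[rng.rand(len(y)) > 0.1] = -1           # -1 marks unlabelled samples

model = SelfTrainingClassifier(LogisticRegression())
model.fit(X, y_partial)                          # learns from labelled + unlabelled data
print(model.score(X, y))                         # accuracy against the full (true) labels
```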
What is the manifold assumption in semi-supervised learning?
The manifold assumption in semi-supervised learning states that
(a) the input space is composed of multiple lower-dimensional manifolds on which all data points lie and
(b) data points lying on the same manifold have the same label
What is the continuity assumption in semi-supervised learning?
The continuity assumption states that objects near each other tend to share the same group or label.
This assumption is also used in supervised learning, where classes are separated by decision boundaries.
In semi-supervised learning, the smoothness assumption additionally pushes decision boundaries towards low-density regions.
What is the cluster assumption in semi-supervised learning?
The cluster assumption states that data are divided into different discrete clusters, and that points in the same cluster share the output label.
How do you choose which algorithm to use for a dataset?
This depends on the business use case, the amount of labelled data available, and the application requirements.
What is supervised machine learning?
Supervised ML is when the algorithm is trained using a labeled dataset, which consists of pairs of input and output data.
The main classes are Regression and Classification.
What are regression based algorithms?
In regression, the target variable is a continuous value. The goal of regression is to predict the value of the target variable based on the input variables.
Linear regression, polynomial regression, and decision trees are some of the examples of regression algorithms.
What are classification based algorithms?
In classification, the target variable is a categorical value. The goal of classification is to predict the class or category of the target variable based on the input variables.
Some examples of classification algorithms include logistic regression, decision trees, support vector machines, and neural networks.
What is linear regression?
Linear regression is a type of regression algorithm that is used to predict a continuous output value. It is one of the simplest and most widely used algorithms in supervised learning. In linear regression, the algorithm tries to find a linear relationship between the input features and the output value. The output value is predicted based on the weighted sum of the input features.
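A minimal scikit-learn sketch; the toy data points are made up for the example.

```python
# Minimal sketch: fit a linear regression on toy data and predict a continuous value.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # single input feature
y = np.array([2.1, 4.0, 6.2, 7.9])           # roughly y = 2x

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)         # learned weight(s) and bias
print(model.predict([[5.0]]))                # predicted continuous output
```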
What is logistic regression?
Logistic regression is a type of classification algorithm that is used to predict a binary output variable. It is commonly used in machine learning applications where the output variable is either true or false, such as in fraud detection or spam filtering. In logistic regression, the algorithm tries to find a linear relationship between the input features and the output variable. The output variable is then transformed using a logistic function to produce a probability value between 0 and 1.
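A minimal scikit-learn sketch, assuming made-up toy data with a binary label:

```python
# Minimal sketch: logistic regression turns a weighted sum of features into a
# probability between 0 and 1 via the logistic (sigmoid) function.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])             # binary labels

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[2.0]]))          # class probabilities between 0 and 1
print(model.predict([[2.0]]))                # thresholded class label
```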
What are decision trees?
A decision tree is a type of algorithm that is used for both classification and regression tasks.
It consists of three components: a root node, decision nodes, and leaf nodes.
A decision tree algorithm divides a training dataset into branches, which further segregate into other branches. This sequence continues until a leaf node is attained. The leaf node cannot be segregated further.
The internal (decision) nodes test the attributes used for predicting the outcome, while each leaf node holds a possible outcome.
Decision trees are used to model decisions and their possible consequences, and can capture complex relationships between input features and output variables.
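A minimal scikit-learn sketch, assuming the Iris dataset and an arbitrary depth limit of 3:

```python
# Minimal sketch: fit a small decision tree and print its learned rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Internal nodes test a feature threshold; leaf nodes hold the predicted class.
print(export_text(tree))
```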
What are random forests?
Random forests are an ensemble learning technique that is used for both classification and regression tasks. They are made up of multiple decision trees that work together to make predictions. Each tree in the forest is trained on a different subset of the input features and data (data bagging). The final prediction is made by aggregating the predictions of all the trees in the forest.
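A minimal scikit-learn sketch, assuming the Iris dataset and 100 trees:

```python
# Minimal sketch: a random forest aggregates the votes of many decision trees,
# each trained on a bootstrap sample of the data with random feature subsets.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(forest.score(X_test, y_test))   # accuracy of the aggregated (majority-vote) prediction
```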
Explain the K Nearest Neighbor Algorithm
The K Nearest Neighbors (KNN) algorithm is a supervised learning method that uses proximity to classify data points or predict their values, so it can be used for both classification and regression. KNN is non-parametric, meaning it makes no underlying assumption about the data distribution.
In the KNN classifier (worked example with a new, unlabelled "white" point and k=5):
We find the k neighbours nearest to the white point; here we choose k=5.
To find the five nearest neighbours, we calculate the Euclidean distance between the white point and the other points, then take the 5 closest.
Among those 5 neighbours there are three red points and two green points. Since red has the majority, we assign the red label to the white point.
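A minimal scikit-learn sketch in the spirit of the example above; the toy points and colour labels are made up.

```python
# Minimal sketch: KNN classification with k=5 and Euclidean distance.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [6, 6], [6, 7], [7, 6], [7, 7]])
y = np.array(["red", "red", "red", "red", "green", "green", "green", "green"])

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X, y)
print(knn.predict([[2, 3]]))   # label decided by majority vote among the 5 nearest points
```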
Is it true that we need to scale our feature values when they vary greatly?
Yes. Many algorithms rely on the Euclidean distance between data points, and if one feature's values vary over a much larger range than the others, that feature dominates the distance and skews the results. Outliers similarly tend to make models perform worse on the test dataset.
We also use feature scaling to reduce convergence time: gradient descent takes longer to reach a minimum when features are not normalized.
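A minimal sketch with scikit-learn's StandardScaler; the toy feature values are made up so that one feature sits on a much larger scale than the other.

```python
# Minimal sketch: standardize features to zero mean and unit variance.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 20000.0],
              [2.0, 30000.0],
              [3.0, 50000.0]])            # second feature would dominate Euclidean distances

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)        # fit on training data; reuse the same scaler on test data
print(X_scaled)
```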
What are the different types of error present in machine learning?
There are mainly two types:
- Reducible errors: can be reduced to improve model accuracy; they can be further decomposed into bias and variance.
- Irreducible errors: will always be present in the model, since they come from noise inherent in the data.
What is the bias/variance trade off?
For an accurate model, algorithms need a low variance and low bias. But this is not possible because bias and variance are related to each other:
Decreasing variance will increase bias
Decreasing bias will increase variance
Ideally, we need a model that accurately captures the regularities in the training data and simultaneously generalizes well to unseen data. Unfortunately, it is not possible to do both perfectly at the same time.
High bias + low variance = Underfitting
Low bias + high variance = Overfitting
The Bias-Variance trade-off is about finding the sweet spot to make a balance between bias and variance errors.
What is bias?
Bias is the difference between the average prediction of our model and the correct value we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the problem (underfitting). It leads to high error on both training and test data.
What is variance?
Variance is the variability of the model's prediction for a given data point, which tells us how spread out the predictions are. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before (overfitting). As a result, such models perform very well on training data but have high error rates on test data.
How can you deal with overfitting due to low bias?
Low bias occurs when the model predicts values very close to the actual values on the training data; it is effectively mimicking the training dataset. Such a model does not generalize, which means it will give poor results when tested on unseen data.
Bagging (a parallel ensemble technique): randomly create subsets of the training data, train the same algorithm on each subset in parallel, and take the consensus result. Combining the models reduces variance and makes the result more reliable than a single model.
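A minimal sketch of bagging with scikit-learn, assuming decision trees as the base estimator and the breast-cancer dataset:

```python
# Minimal sketch: bagging trains the same estimator on bootstrap subsets of the
# training data and aggregates their predictions to reduce variance.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
bagged.fit(X_train, y_train)
print(bagged.score(X_test, y_test))   # typically more stable than a single tree
```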
How can you deal with overfitting due to high variance?
A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before (overfitting). Ways to address this include:
Regularization techniques: penalise large model coefficients to lower model complexity. Examples include LASSO, Ridge Regression and ElasticNet (see the sketch after this list).
Dimensionality reduction via feature selection
Boosting: an iterative ensemble technique that re-weights samples based on the previous round's predictions (assigning higher weight to inaccurately predicted samples)
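A minimal sketch of regularization with scikit-learn; the synthetic dataset and alpha values are arbitrary assumptions for the example.

```python
# Minimal sketch: the alpha parameter penalises large coefficients,
# lowering model complexity and helping to control variance.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients towards zero
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: can set some coefficients exactly to zero
print(ridge.coef_[:5])
print(lasso.coef_[:5])
```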
What is the interpretation of a ROC area under the curve?
Receiver operating characteristics (ROC) shows the trade-off between sensitivity and specificity.
- Sensitivity: the probability that the model predicts a positive outcome when the actual value is also positive (also called recall or the true positive rate).
- Specificity: the probability that the model predicts a negative outcome when the actual value is also negative.
The curve is plotted with the false positive rate (FP / (TN + FP)) on the x-axis and the true positive rate (TP / (TP + FN)) on the y-axis, across different classification thresholds.
The area under the curve (AUC) shows the model performance. If the area under the ROC curve is 0.5, then our model is completely random. The model with AUC close to 1 is the better model.
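A minimal sketch of computing the ROC curve and AUC with scikit-learn, assuming a logistic regression classifier on the breast-cancer dataset:

```python
# Minimal sketch: compute the ROC curve and its area for a binary classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]           # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)   # FPR = FP/(FP+TN), TPR = TP/(TP+FN)
print(roc_auc_score(y_test, scores))               # 0.5 = random; closer to 1 = better model
```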
What are the methods of reducing dimensionality?
For dimensionality reduction, we can use feature selection or feature extraction methods.
Dimensionality reduction will decrease the computational cost of training, decrease storage requirements, and may improve the generalisation performance of the model
What is feature selection?
Feature selection is the process of keeping the most relevant features and dropping irrelevant ones. We use Filter, Wrapper, and Embedded methods to analyze feature importance and remove less important features to improve model performance.
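A minimal sketch of a filter-style selection method with scikit-learn; the choice of k=5 and the ANOVA F-score are assumptions for the example.

```python
# Minimal sketch: keep the k features with the highest univariate ANOVA F-score.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)   # 30 features reduced to 5
```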
What is feature extraction?
Feature extraction transforms the original high-dimensional feature space into one with fewer dimensions. Some information may be lost in the process, but the goal is to retain as much useful structure as possible while using fewer resources to process the data. Common extraction techniques include Principal component analysis (PCA), Linear discriminant analysis (LDA), and Kernel PCA.
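A minimal sketch of feature extraction with PCA in scikit-learn; the dataset and the choice of 2 components are assumptions for the example.

```python
# Minimal sketch: project 30 correlated features onto 2 new components (PCA).
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

X, _ = load_breast_cancer(return_X_y=True)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                      # (569, 2)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```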