Machine Learning Flashcards
How to handle imbalanced datasets
1. Change the performance metric
As we saw above, accuracy is not the best metric for evaluating models on imbalanced datasets, as it can be very misleading. Metrics that can provide better insight include:
Confusion Matrix: a table showing correct predictions and types of incorrect predictions.
Precision: the number of true positives divided by all positive predictions. Precision is also called Positive Predictive Value. It is a measure of a classifier’s exactness. Low precision indicates a high number of false positives.
Recall: the number of true positives divided by the number of positive values in the test data. Recall is also called Sensitivity or the True Positive Rate. It is a measure of a classifier’s completeness. Low recall indicates a high number of false negatives.
F1 Score: the harmonic mean of precision and recall.
Let’s see what happens when we apply these metrics to our logistic regression from above.
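Since the original notebook isn’t reproduced here, the following is a minimal, self-contained sketch (synthetic imbalanced data rather than the original dataset) of how these metrics can be computed for a logistic regression with scikit-learn.

```python
# A minimal sketch: fit a logistic regression on a synthetic imbalanced problem
# and compare accuracy with precision, recall, and F1.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# roughly 95% negatives, 5% positives
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))        # rows: actual class, columns: predicted class
print(classification_report(y_test, y_pred))   # per-class precision, recall, and F1
```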
2. Change the algorithm
Trying a variety of algorithms is a good rule of thumb for any machine learning problem, but it can be especially beneficial with imbalanced datasets. Decision trees frequently perform well on imbalanced data: they learn a hierarchy of if/else questions, which can force both classes to be addressed.
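As a rough illustration (again on synthetic imbalanced data, not the original dataset), a decision tree can be tried the same way; class_weight="balanced" is an optional extra that re-weights the classes inversely to their frequency.

```python
# A minimal sketch: try a tree-based model on a synthetic imbalanced problem.
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

tree = DecisionTreeClassifier(class_weight="balanced", random_state=42).fit(X_train, y_train)
print(f1_score(y_test, tree.predict(X_test)))   # F1 on the held-out test set
```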
3. Resampling Techniques — Oversample minority class
This method is the first of our resampling techniques.
Oversampling can be defined as adding more copies of the minority class. Oversampling can be a good choice when you don’t have a ton of data to work with.
We will use the resample utility from Scikit-Learn (sklearn.utils.resample) to randomly replicate samples from the minority class.
Important Note
Always split into test and train sets BEFORE trying oversampling techniques! Oversampling before splitting the data can allow the exact same observations to be present in both the test and train sets. This can allow our model to simply memorize specific data points and cause overfitting and poor generalization to the test data.
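A minimal sketch of random oversampling with sklearn.utils.resample, using synthetic data and an illustrative "target" column name; note that the train/test split happens before any resampling, per the warning above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# synthetic imbalanced data; split BEFORE any resampling, per the note above
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

train = pd.DataFrame(X_train)
train["target"] = y_train            # "target" is an illustrative column name

minority = train[train["target"] == 1]
majority = train[train["target"] == 0]

minority_upsampled = resample(
    minority,
    replace=True,                    # sample with replacement
    n_samples=len(majority),         # match the majority class size
    random_state=42,
)
train_upsampled = pd.concat([majority, minority_upsampled])
print(train_upsampled["target"].value_counts())   # classes are now balanced
```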
4. Resampling techniques — Undersample majority class
Undersampling can be defined as removing some observations of the majority class. Undersampling can be a good choice when you have a ton of data (think millions of rows). But a drawback is that we are removing information that may be valuable. This could lead to underfitting and poor generalization to the test set.
We will again use Scikit-Learn’s resample utility, this time to randomly remove samples from the majority class.
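A minimal sketch of random undersampling, reusing the illustrative `train`, `minority`, and `majority` objects built in the oversampling sketch above.

```python
import pandas as pd
from sklearn.utils import resample

# `minority` and `majority` are the same illustrative DataFrames as above
majority_downsampled = resample(
    majority,
    replace=False,                   # sample without replacement
    n_samples=len(minority),         # match the minority class size
    random_state=42,
)
train_downsampled = pd.concat([majority_downsampled, minority])
print(train_downsampled["target"].value_counts())   # balanced, but with fewer rows
```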
5. Generate synthetic samples
A technique similar to upsampling is to create synthetic samples. Here we will use imblearn’s SMOTE (Synthetic Minority Oversampling Technique). SMOTE uses a nearest neighbors algorithm to generate new, synthetic samples we can use for training our model.
Again, it’s important to generate the new samples only in the training set to ensure our model generalizes well to unseen data.
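A minimal SMOTE sketch, assuming the third-party imbalanced-learn (imblearn) package is installed; the data and split below are synthetic stand-ins, and SMOTE is fit on the training split only.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# generate synthetic minority samples from the training data only
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(np.bincount(y_train), np.bincount(y_train_res))   # class counts before and after
```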
Other Methods
https://towardsdatascience.com/the-5-most-useful-techniques-to-handle-imbalanced-datasets-6cdba096d55a
AI vs. ML
What is artificial intelligence?
Artificial intelligence is a broad field, which refers to the use of technologies to build machines and computers that have the ability to mimic cognitive functions associated with human intelligence, such as being able to see, understand, and respond to spoken or written language, analyze data, make recommendations, and more.
Although artificial intelligence is often thought of as a system in itself, it is a set of technologies implemented in a system to enable it to reason, learn, and act to solve a complex problem.
What is machine learning?
Machine learning is a subset of artificial intelligence that automatically enables a machine or system to learn and improve from experience. Instead of explicit programming, machine learning uses algorithms to analyze large amounts of data, learn from the insights, and then make informed decisions.
Machine learning algorithms improve performance over time as they are trained—exposed to more data. Machine learning models are the output, or what the program learns from running an algorithm on training data. The more data used, the better the model will get.
Differences between AI and ML
Now that you understand how they are connected, what is the main difference between AI and ML?
While artificial intelligence encompasses the idea of a machine that can mimic human intelligence, machine learning does not. Machine learning aims to teach a machine how to perform a specific task and provide accurate results by identifying patterns.
Let’s say you ask your Google Nest device, “How long is my commute today?” In this case, you ask a machine a question and receive an answer about the estimated time it will take you to drive to your office. Here, the overall goal is for the device to perform a task successfully—a task that you would generally have to do yourself in a real-world environment (for example, research your commute time).
In the context of this example, the goal of using ML in the overall system is not to enable it to perform a task. For instance, you might train algorithms to analyze live transit and traffic data to forecast the volume and density of traffic flow. However, the scope is limited to identifying patterns, how accurate the prediction was, and learning from the data to maximize performance for that specific task.
Explanatory Algorithms
One of the biggest problems in machine learning is understanding how various models get to their end predictions. We often know the “what” but struggle to explain the “why”.
Explanatory algorithms help us identify the variables that have a meaningful impact on the outcome we are interested in. These algorithms allow us to understand the relationships between the variables in the model, rather than just using the model to make predictions about the outcome.
Algorithms
- Linear/Logistic Regression: a statistical method for modeling the relationship between a dependent variable and one or more independent variables. It can be used to understand the relationships between variables through the coefficients and their t-tests.
- Decision Trees: a type of machine learning algorithm that creates a tree-like model of decisions and their possible consequences. They are useful for understanding the relationships between variables by looking at the rules that split the branches.
- Principal Component Analysis (PCA): a dimensionality reduction technique that projects the data onto a lower-dimensional space while retaining as much variance as possible. PCA can be used to simplify the data or to examine feature importance.
- Local Interpretable Model-Agnostic Explanations (LIME): an algorithm that explains the predictions of any machine learning model by approximating the model locally around the prediction, constructing a simpler surrogate model with techniques such as linear regression or decision trees.
- Shapley values: a game-theoretic way of explaining a prediction by computing each feature’s contribution as its average “marginal contribution” over all possible feature coalitions. Exact Shapley values are accurate but expensive to compute.
- SHAP (SHapley Additive exPlanations): a method that efficiently approximates Shapley values to estimate the importance of each feature in a prediction. It is generally much faster than computing exact Shapley values, at the cost of some approximation error.
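As a rough illustration of the first two ideas, the sketch below fits a logistic regression and a shallow decision tree on synthetic data and inspects their coefficients and feature importances (the LIME and SHAP libraries have their own APIs and are not shown here).

```python
# A minimal sketch of model introspection on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

logreg = LogisticRegression(max_iter=1000).fit(X, y)
print("coefficients:", logreg.coef_)              # sign and size hint at each feature's effect

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print("importances:", tree.feature_importances_)  # how much each feature drives the splits
```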
Pattern Mining Algorithms
Pattern mining algorithms are a type of data mining technique that are used to identify patterns and relationships within a dataset. These algorithms can be used for a variety of purposes, such as identifying customer buying patterns in a retail context, understanding common user behaviour sequences for a website/app, or finding relationships between different variables in a scientific study.
Pattern mining algorithms typically work by analyzing large datasets and looking for repeated patterns or associations between variables. Once these patterns have been identified, they can be used to make predictions about future trends or outcomes or to understand the underlying relationships within the data.
Algorithms
- Apriori algorithm: an algorithm for finding frequent itemsets in a transactional database. It is efficient and widely used for association rule mining tasks.
- Recurrent Neural Network (RNN): a type of neural network designed to process sequential data, since it can capture temporal dependencies in the data.
- Long Short-Term Memory (LSTM): a type of recurrent neural network designed to remember information for longer periods of time. LSTMs can capture longer-term dependencies in the data and are often used for tasks such as language translation and language generation.
- Sequential Pattern Discovery Using Equivalence Class (SPADE): a method for finding frequent patterns in sequential data by grouping together items that are equivalent in some sense. It can handle large datasets and is relatively efficient, but may not work well with sparse data.
- PrefixSpan: an algorithm for finding frequent patterns in sequential data by constructing a prefix tree and pruning infrequent items. PrefixSpan can handle large datasets and is relatively efficient, but may not work well with sparse data.
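A minimal Apriori sketch, assuming the third-party mlxtend library is available; the tiny transaction list is made up purely for illustration.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.preprocessing import TransactionEncoder

# a made-up transaction list, purely for illustration
transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
]

# one-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# itemsets appearing in at least 50% of transactions
frequent_itemsets = apriori(onehot, min_support=0.5, use_colnames=True)
print(frequent_itemsets)
```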
Ensemble Learning
Ensemble algorithms are machine learning techniques that combine the predictions of multiple models in order to make more accurate predictions than any of the individual models. There are several reasons why ensemble algorithms can outperform traditional machine learning algorithms:
- Diversity: by combining the predictions of multiple models, ensemble algorithms can capture a wider range of patterns within the data.
- Robustness: ensemble algorithms are generally less sensitive to noise and outliers in the data, which can lead to more stable and reliable predictions.
- Reduced overfitting: by averaging the predictions of multiple models, ensemble algorithms can reduce the tendency of individual models to overfit the training data, which can lead to improved generalization to new data.
- Improved accuracy: ensemble algorithms have been shown to consistently outperform traditional machine learning algorithms in a variety of contexts.
Algorithms
- Random Forest: a machine learning algorithm that creates an ensemble of decision trees and makes predictions based on the majority vote of the trees.
- XGBoost: a gradient boosting algorithm that uses decision trees as its base model and is known as one of the strongest ML algorithms for predictions.
- LightGBM: another gradient boosting algorithm, designed to be faster and more efficient than other boosting algorithms.
- CatBoost: a gradient boosting algorithm specifically designed to handle categorical variables well.
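A minimal sketch contrasting a single decision tree with a random forest on synthetic data (XGBoost, LightGBM, and CatBoost are separate third-party packages with similar fit/predict interfaces).

```python
# A minimal sketch: an ensemble of trees usually beats a single tree.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)

print("single tree:", cross_val_score(single_tree, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())
```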
Clustering
Clustering is an unsupervised learning task in which algorithms group data into “clusters”. In contrast to supervised learning, where the target variable is known, there is no target variable in clustering.
This technique is useful for finding natural patterns and trends in data and is often used during the exploratory data analysis phase to gain further understanding of the data. Additionally, clustering can be used to divide a dataset into distinct segments based on various variables. A common application of this is in segmenting customers or users.
Algorithms
- K-modes clustering: a clustering algorithm specifically designed for categorical data. It handles high-dimensional categorical data well and is relatively simple to implement.
- DBSCAN: a density-based clustering algorithm that can identify clusters of arbitrary shape. It is relatively robust to noise and can identify outliers in the data.
- Spectral clustering: a clustering algorithm that uses the eigenvectors of a similarity matrix to group data points into clusters. It can handle non-linearly separable data and is relatively efficient.
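A minimal sketch of DBSCAN and spectral clustering on toy 2-D data (k-modes itself lives in the separate third-party kmodes package and is not shown here).

```python
from sklearn.cluster import DBSCAN, SpectralClustering
from sklearn.datasets import make_blobs

# toy 2-D data with three blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

db_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)              # -1 marks outliers/noise
spec_labels = SpectralClustering(n_clusters=3, random_state=0).fit_predict(X)
print(set(db_labels), set(spec_labels))                                 # the cluster labels found
```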
What is Gradient Descent
Gradient Descent (GD) is a popular optimization algorithm used in machine learning to minimize the cost function of a model. It works by iteratively adjusting the weights or parameters of the model in the direction of the negative gradient of the cost function, until the minimum of the cost function is reached.
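A minimal sketch of the idea on a toy cost function f(w) = (w - 3)^2, whose gradient is 2(w - 3); real models do the same thing over many parameters at once.

```python
# A minimal sketch of gradient descent on a one-parameter quadratic cost.
def gradient_descent(lr=0.1, n_steps=100):
    w = 0.0                       # initial parameter value
    for _ in range(n_steps):
        grad = 2 * (w - 3)        # gradient of f(w) = (w - 3)^2 at the current w
        w -= lr * grad            # step in the direction of the negative gradient
    return w

print(gradient_descent())         # converges toward the minimum at w = 3
```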
What is the Curse of Dimensionality
As the dimensionality increases, the number of data points required for good performance of any machine learning algorithm increases exponentially. The reason is that, as features are added, we need more data points to cover every combination of feature values, so a fixed-size dataset covers the feature space ever more sparsely.
What is Cross-Validation
Cross-validation (CV) is a technique used to test the effectiveness of a machine learning model; it is also a resampling procedure used to evaluate a model when data is limited. To perform CV, we keep aside a sample/portion of the data that is not used to train the model, and later use this sample for testing/validation.
K-Folds Cross Validation
The K-Folds technique is popular and easy to understand, and it generally results in a less biased model than other methods because it ensures that every observation from the original dataset has a chance of appearing in both the training and test sets. It is one of the best approaches when we have limited input data. The method follows the steps below.
1) Split the entire dataset randomly into K folds (the value of K shouldn’t be too small or too high; ideally we choose 5 to 10 depending on the data size). A higher value of K leads to a less biased model (but its larger variance might lead to overfitting), whereas a lower value of K makes the procedure similar to the train-test split approach we saw before.
2) Then fit the model using the K-1 (K minus 1) folds and validate the model using the remaining Kth fold. Note down the scores/errors.
3) Repeat this process until every fold has served as the test set. Then take the average of your recorded scores; that will be the performance metric for the model.
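A minimal sketch of 5-fold cross-validation with scikit-learn, which wraps the split/fit/score loop described above; the data here is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)           # K = 5 folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores, scores.mean())                                   # per-fold scores and their average
```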
6 Steps towards a Successful Machine Learning Project
- Project Initiation: Idea, Requirements, and Data Acquisition
- Data Exploration
- Data Processing and Feature Selection
- Model Development
- Model Evaluation
- Model Deployment
How to Prevent Overfitting
Early stopping
Early stopping pauses the training phase before the machine learning model learns the noise in the data. However, getting the timing right is important; otherwise the model will still not give accurate results.
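A minimal sketch of manual early stopping on synthetic data: train incrementally and stop once the validation score has not improved for a few epochs (SGDClassifier and the patience value are illustrative choices, not a prescribed recipe).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = SGDClassifier(random_state=0)
best_score, patience, bad_epochs = -np.inf, 5, 0

for epoch in range(100):
    model.partial_fit(X_train, y_train, classes=np.unique(y))   # one incremental pass
    score = model.score(X_val, y_val)                           # validation accuracy
    if score > best_score:
        best_score, bad_epochs = score, 0
    else:
        bad_epochs += 1
    if bad_epochs >= patience:                                   # no improvement for `patience` epochs
        print(f"stopping early at epoch {epoch}, best val accuracy {best_score:.3f}")
        break
```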
Pruning
You might identify several features or parameters that impact the final prediction when you build a model. Feature selection—or pruning—identifies the most important features within the training set and eliminates irrelevant ones. For example, to predict if an image is an animal or human, you can look at various input parameters like face shape, ear position, body structure, etc. You may prioritize face shape and ignore the shape of the eyes.
Regularization
Regularization is a collection of training/optimization techniques that seek to reduce overfitting. These methods try to eliminate those factors that do not impact the prediction outcomes by grading features based on importance. For example, mathematical calculations apply a penalty value to features with minimal impact. Consider a statistical model attempting to predict the housing prices of a city in 20 years. Regularization would give a lower penalty value to features like population growth and average annual income but a higher penalty value to the average annual temperature of the city.
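A minimal sketch of the idea using scikit-learn’s Ridge (L2) and Lasso (L1) on synthetic data with only a few informative features; the penalty shrinks the unhelpful coefficients, and Lasso can push them all the way to zero.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# 10 features, only 3 of which actually influence the target
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, noise=10, random_state=0)

print("OLS:  ", LinearRegression().fit(X, y).coef_.round(1))    # no penalty
print("Ridge:", Ridge(alpha=10.0).fit(X, y).coef_.round(1))     # L2: coefficients shrink
print("Lasso:", Lasso(alpha=5.0).fit(X, y).coef_.round(1))      # L1: unimportant ones go to zero
```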
Ensembling
Ensembling combines predictions from several separate machine learning algorithms. Some models are called weak learners because their results are often inaccurate. Ensemble methods combine all the weak learners to get more accurate results. They use multiple models to analyze sample data and pick the most accurate outcomes. The two main ensemble methods are bagging and boosting. Boosting trains different machine learning models one after another to get the final result, while bagging trains them in parallel.
Data augmentation
Data augmentation is a machine learning technique that changes the sample data slightly every time the model processes it. You can do this by changing the input data in small ways. When done in moderation, data augmentation makes the training sets appear unique to the model and prevents the model from memorizing their characteristics. For example, you might apply transformations such as translation, flipping, and rotation to input images.
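A minimal sketch of image-style augmentation with NumPy flips and rotations; a real pipeline would typically use a dedicated library (for example torchvision or Keras preprocessing), which is not shown here.

```python
import numpy as np

image = np.arange(9).reshape(3, 3)       # stand-in for a tiny grayscale image

augmented = [
    np.fliplr(image),                    # horizontal flip
    np.flipud(image),                    # vertical flip
    np.rot90(image),                     # 90-degree rotation
]
print(augmented)                         # slightly different versions of the same sample
```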
Types of Clustering Techniques
Centroid-Based clustering
Centroid-based clustering organizes the data into non-hierarchical clusters, in contrast to hierarchical clustering defined below. k-means is the most widely used centroid-based clustering algorithm, largely because it is efficient, effective, and simple. Centroid-based algorithms are efficient but sensitive to initial conditions and outliers.
Density-based Clustering
Density-based clustering connects areas of high example density into clusters. This allows for arbitrary-shaped distributions as long as dense areas can be connected. These algorithms have difficulty with data of varying densities and high dimensions. Further, by design, these algorithms do not assign outliers to clusters.
Distribution-based Clustering
This clustering approach assumes the data is composed of distributions, such as Gaussian distributions. A distribution-based algorithm might, for example, cluster the data into three Gaussian distributions. As distance from a distribution’s center increases, the probability that a point belongs to that distribution decreases. When you do not know the type of distribution in your data, you should use a different algorithm.
Hierarchical Clustering
Hierarchical clustering creates a tree of clusters. Hierarchical clustering, not surprisingly, is well suited to hierarchical data, such as taxonomies. See Comparison of 61 Sequenced Escherichia coli Genomes by Oksana Lukjancenko, Trudy Wassenaar & Dave Ussery for an example. Another advantage is that any number of clusters can be chosen by cutting the tree at the right level.
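A minimal sketch of the last two families above, using scikit-learn’s GaussianMixture (distribution-based) and AgglomerativeClustering (hierarchical) on toy data.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)   # distribution-based
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)            # hierarchical
print(set(gmm_labels), set(hier_labels))
```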
Bias Vs. Variance
What is bias?
Bias is the difference between the average prediction of our model and the correct value we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the model, which leads to high error on both training and test data.
What is variance?
Variance is the variability of model predictions for a given data point, a value which tells us the spread of our predictions. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn’t seen before. As a result, such models perform very well on training data but have high error rates on test data.
Can you explain the difference between deep learning and machine learning?
Deep learning is a subfield of machine learning that uses artificial neural networks with multiple layers to learn and represent complex patterns and relationships in data. In contrast, machine learning is a broader field that includes a variety of algorithms and techniques for training models to make predictions or decisions based on data.
The key difference between deep learning and traditional machine learning is the complexity and flexibility of the models.
Traditional machine learning algorithms, such as decision trees, support vector machines, and linear regression, typically rely on handcrafted features and are limited in their ability to handle large amounts of data or complex relationships between variables.
Deep learning, on the other hand, can automatically learn hierarchical representations of features from raw data, and can model highly nonlinear relationships between variables. Deep learning algorithms, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have been successfully applied to a wide range of tasks, such as image and speech recognition, natural language processing, and game playing.
In summary, deep learning is a subset of machine learning that uses deep neural networks with multiple layers to learn complex patterns in data. While traditional machine learning focuses on handcrafted features and simpler models, deep learning can automatically learn complex representations of data and is particularly effective in applications that require handling large amounts of data and complex relationships between variables.
Can you describe the steps involved in building a machine learning pipeline?
Building a machine learning pipeline involves several key steps, which can vary depending on the specific application and the data available. Here are some common steps involved in building a machine learning pipeline:
- Data preparation and preprocessing: This step involves collecting and cleaning the data, handling missing values, and converting the data into a format suitable for analysis. This may involve tasks such as data cleaning, normalization, feature scaling, and feature engineering.
- Feature selection: This step involves selecting the subset of features that are most relevant for the model. This can be done using various techniques such as correlation analysis, feature importance ranking, and principal component analysis.
- Model selection: This step involves selecting the appropriate model for the problem at hand. The choice of model depends on the specific problem, the data, and the performance metrics. Common machine learning models include decision trees, linear and logistic regression, support vector machines, and neural networks.
- Model training: This step involves training the selected model on the prepared data. This may involve splitting the data into training and testing sets, and using cross-validation techniques to estimate the model’s performance on new data.
- Hyperparameter tuning: This step involves selecting the optimal values for the model’s hyperparameters, which are parameters that are not learned from the data but are set prior to training. This can be done using techniques such as grid search or random search.
- Model evaluation: This step involves evaluating the performance of the trained model on the testing set, using appropriate performance metrics such as accuracy, precision, recall, and F1-score.
- Deployment and monitoring: Once the model is trained and evaluated, it can be deployed in production environments. This step involves integrating the model into a larger system, and monitoring its performance and accuracy over time.
Overall, building a machine learning pipeline requires a combination of domain knowledge, technical skills, and experience with data analysis and modeling techniques. Effective machine learning pipelines require careful attention to each of these steps, as well as ongoing refinement and optimization over time.
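As a rough end-to-end illustration, the sketch below ties several of these steps together with scikit-learn’s Pipeline and GridSearchCV on synthetic data: preprocessing, hyperparameter tuning with cross-validation, and a final evaluation on a held-out test set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# preprocessing + model bundled into a single pipeline
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))])

# hyperparameter tuning with 5-fold cross-validation on the training set
grid = GridSearchCV(pipe, param_grid={"clf__C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)                                       # selected hyperparameters
print(classification_report(y_test, grid.predict(X_test)))     # final held-out evaluation
```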