Data Science Terminology - assorted topics Flashcards

1
Q

Machine Learning Operations (MLOps)

A

A practice for collaboration and communication between data scientists and operations professionals to help manage production machine learning (ML) lifecycles. It seeks to provide a disciplined approach to manage and scale ML models, drawing on principles and practices from DevOps.

2
Q

Machine Learning Model

A

A Machine Learning (ML) model is a mathematical or computational representation of real-world processes or patterns based on data. It’s built by training an algorithm on a set of data. There are various types of machine learning models, each suited to different tasks. Once a machine learning model is trained, it can be used to make predictions or decisions without being explicitly programmed to do so. For example, an ML model trained on email data might be able to predict whether a new email is spam or not based on its content. ML models are not perfect and their accuracy heavily depends on the quality and quantity of the data they are trained on, as well as the suitability of the algorithm used for the task at hand.

3
Q

Data Science

A

Data science involves a blend of various tools, algorithms, and machine learning principles to extract patterns from raw data. It operates on the idea of using scientific methods, processes, and systems to gain insights from both structured and unstructured data.

4
Q

Goal of ML Ops

A

The goal of MLOps is to create a streamlined process for managing and deploying ML models at scale, improving the efficiency, reproducibility, and reliability of ML systems. It provides a conceptual framework to bridge the gap between development and operations in the ML lifecycle.

5
Q

Types of ML Models

A

Supervised, unsupervised, and reinforcement learning models.

6
Q

Supervised Learning Models

A

These are trained on labeled data, i.e., data that includes both the input and the desired output. They are used for tasks like regression (predicting a continuous output) and classification (predicting a categorical output). Examples include linear regression, decision trees, support vector machines, and neural networks.
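
A minimal sketch in Python (scikit-learn, with toy made-up data) of the supervised pattern: fit a model on labeled examples, then predict the label of a new input.

from sklearn.tree import DecisionTreeClassifier

# Toy labeled data: inputs are [hours_studied, hours_slept], labels are pass (1) / fail (0).
X = [[2, 9], [1, 5], [5, 8], [6, 4], [7, 7], [0, 6]]
y = [0, 0, 1, 1, 1, 0]

model = DecisionTreeClassifier(max_depth=2).fit(X, y)  # learn the input-to-label mapping
print(model.predict([[4, 6]]))                         # predict the label of a new, unseen example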

7
Q

Unsupervised Learning Models

A

These models learn from unlabeled data, finding structure and relationships within the data itself. They are used for tasks like clustering (grouping similar inputs) and dimensionality reduction (simplifying input by removing redundant features). Examples include k-means clustering and principal component analysis (PCA).

8
Q

Reinforcement Learning Models

A

These models learn by interacting with their environment, receiving rewards or penalties based on the actions they take. They are used for tasks where the model needs to make a series of decisions that lead to a final goal, like game playing or robot navigation.

9
Q

Unsupervised Learning

A

Unsupervised learning is a type of machine learning where an algorithm learns from unlabeled data. This means the algorithm is not given the correct output during training. Instead, it must discover patterns, relationships, or structure in the input data on its own.

10
Q

Clustering

A

Clustering is used to group similar data points together based on their characteristics. The algorithm determines the similarities between data points and clusters them accordingly. K-means and hierarchical clustering are popular examples of clustering algorithms.
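
A minimal k-means sketch with scikit-learn on made-up 2-D points (values chosen only to form two obvious groups):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # one dense group
              [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])  # another dense group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each point
print(km.cluster_centers_)  # coordinates of the two cluster centroids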

11
Q

Dimensionality Reduction

A

Dimensionality reduction is used to reduce the number of input features while retaining the essential information. This is often used to make the data more manageable, to remove redundant or irrelevant features, or for visualization purposes. Principal Component Analysis (PCA) is a popular dimensionality reduction technique.
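
A minimal PCA sketch with scikit-learn, projecting made-up 4-dimensional data down to 2 components:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))        # 100 samples with 4 features (toy data)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)     # same 100 samples, now described by 2 components
print(X_reduced.shape)               # (100, 2)
print(pca.explained_variance_ratio_) # share of variance captured by each component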

12
Q

Supervised Learning

A

Supervised learning is a type of machine learning where an algorithm learns a model from labeled training data. This means the algorithm is given input data along with the corresponding correct output. It uses this information to learn the relationship between the input and the output, which can then be used to predict the output for new, unseen input data.

13
Q

Linear Regression

A

Linear regression is used to predict a continuous target variable based on one or more input features. The model assumes a linear relationship between the input and the output.
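
A minimal scikit-learn sketch fitting the assumed linear relationship y ≈ w·x + b on toy data:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # single input feature
y = np.array([2.1, 3.9, 6.2, 8.0, 9.9])   # roughly y = 2x (toy data)

reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)          # learned slope and intercept
print(reg.predict([[6]]))                 # prediction for a new input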

14
Q

Decision Trees

A

Decision trees are used for both classification (predicting a categorical output) and regression (predicting a continuous output). They split the data into different branches based on feature values, allowing for more complex relationships between the input and the output.

15
Q

Neural Networks

A

Neural networks are complex models inspired by the human brain, capable of learning nonlinear relationships between the input and the output. They consist of layers of interconnected nodes or “neurons”, each of which applies a simple computation to the data. Deep learning, a subfield of machine learning, involves neural networks with many layers.
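
A minimal sketch using scikit-learn's MLPClassifier (one small hidden layer, sizes chosen arbitrarily) to learn XOR, a nonlinear relationship a single linear model cannot represent:

from sklearn.neural_network import MLPClassifier

# XOR: the output is 1 only when exactly one input is 1; not linearly separable.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

net = MLPClassifier(hidden_layer_sizes=(8,), activation="relu",
                    solver="lbfgs", max_iter=2000, random_state=0)
net.fit(X, y)
print(net.predict(X))   # ideally recovers [0, 1, 1, 0]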

16
Q

Descriptive Statistics

A

These are basic metrics that summarize and describe the main features of a dataset. They include measures such as mean, median, mode, range, variance, standard deviation, and percentiles.
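
A minimal pandas/NumPy sketch computing these statistics on made-up values:

import numpy as np
import pandas as pd

values = pd.Series([4, 8, 15, 15, 23, 42])                       # toy data

print(values.mean(), values.median(), values.mode().tolist())    # central tendency
print(values.var(), values.std())                                # sample variance and standard deviation
print(values.max() - values.min())                               # range
print(np.percentile(values, [25, 50, 75]))                       # quartiles (25th/50th/75th percentiles)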

17
Q

Data Visualization

A

This involves using graphical representations of data to understand trends, patterns, and outliers in the data. Common tools include bar graphs, histograms, scatter plots, box plots, and heat maps.

18
Q

Data Cleaning

A

This involves dealing with missing values, removing duplicates, correcting errors, and handling outliers in the data. It’s a crucial step to ensure reliable results.
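
A minimal pandas sketch of typical cleaning steps on a made-up DataFrame (the column names and the 0-120 age rule are purely illustrative):

import pandas as pd

df = pd.DataFrame({"age": [25, None, 31, 31, 240],
                   "city": ["NY", "LA", "LA", "LA", "NY"]})

df = df.drop_duplicates()                          # remove duplicate rows
df["age"] = df["age"].fillna(df["age"].median())   # impute missing values
df = df[df["age"].between(0, 120)]                 # drop implausible outliers
print(df)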

19
Q

Data Transformation

A

This involves converting data from one format or structure into another, such as normalizing numerical data, binning continuous variables, or encoding categorical variables.
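
A minimal sketch of two common transformations, standardizing a numeric column and one-hot encoding a categorical one (toy data):

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"income": [30000, 50000, 90000],
                   "segment": ["a", "b", "a"]})

df["income_z"] = StandardScaler().fit_transform(df[["income"]]).ravel()  # zero mean, unit variance
df = pd.get_dummies(df, columns=["segment"])                             # one-hot encode the category
print(df)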

20
Q

Feature Engineering

A

This involves creating new features from existing ones to improve the performance of machine learning models. Techniques might include polynomial features, interaction terms, or creating domain-specific features.
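
A minimal sketch of generating polynomial and interaction features with scikit-learn (toy data):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2, 3],
              [4, 5]])                      # two original features per row

poly = PolynomialFeatures(degree=2, include_bias=False)
X_new = poly.fit_transform(X)               # adds squared terms and the interaction x1*x2
print(poly.get_feature_names_out())         # names of the generated features
print(X_new)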

21
Q

Hypothesis Testing

A

This is a statistical method for making decisions from experimental data. It involves formulating a null hypothesis and an alternative hypothesis, then using a test statistic (and its p-value) to decide whether to reject or fail to reject the null hypothesis.
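
A minimal SciPy sketch of a two-sample t-test on made-up measurements, where the null hypothesis is that the two group means are equal:

from scipy import stats

group_a = [5.1, 4.9, 5.6, 5.2, 5.0]   # toy measurements
group_b = [5.9, 6.1, 5.8, 6.3, 6.0]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)
# If p_value is below the chosen significance level (e.g. 0.05), reject the null hypothesis.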

22
Q

Regression Analysis

A

This technique is used for predicting a continuous outcome variable based on one or more input variables.

23
Q

Classification

A

This technique is used to predict a categorical outcome variable based on one or more input variables.

24
Q

Clustering

A

This unsupervised learning method groups data points together based on their similarities in features.

25
Q

Dimensionality Reduction

A

This involves reducing the number of input variables for a dataset, either by selecting only the most relevant features (feature selection) or by creating new features that capture the essential information in a more condensed form (feature extraction).

26
Q

Machine Learning

A

This involves using algorithms that can learn patterns from data and make predictions or decisions. It encompasses various techniques like regression, classification, clustering, and reinforcement learning.

27
Q

Deep Learning

A

This is a subset of machine learning that focuses on artificial neural networks with many layers (“deep” networks). It’s particularly effective for complex tasks like image and speech recognition.

28
Q

Natural Language Processing (NLP)

A

This involves techniques for dealing with human language. It’s used in applications like sentiment analysis, machine translation, and speech recognition.

29
Q

Time Series Analysis

A

This involves analyzing data that is recorded over time to identify trends, seasonal patterns, and other temporal structures.

30
Q

Anomaly Detection

A

This technique is used to identify outliers in the data, which can indicate errors, unusual behavior, or interesting patterns.
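
A minimal z-score sketch with NumPy that flags points far from the mean (the 3-standard-deviation threshold is a common rule of thumb, not a universal one):

import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(loc=10, scale=1, size=50), 95.0)  # toy data plus one injected outlier
z = (x - x.mean()) / x.std()                               # standard score of each point
print(x[np.abs(z) > 3])                                    # flags the injected outlier (95.0)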

31
Q

Recommendation Systems

A

These systems suggest products or services to users based on their past behavior, preferences, and characteristics. Techniques include collaborative filtering and content-based filtering.

32
Q

A/B Testing

A

This is used to compare two versions of a webpage, marketing strategy, or other entity to determine which one performs better.
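
A minimal sketch of comparing two page variants with a chi-square test on made-up conversion counts; the null hypothesis is that conversion rate does not depend on the variant:

from scipy.stats import chi2_contingency

#            converted  not converted
table = [[120, 880],    # variant A: 1000 visitors (toy numbers)
         [155, 845]]    # variant B: 1000 visitors

chi2, p_value, dof, expected = chi2_contingency(table)
print(p_value)   # a small p-value is evidence the variants convert at different rates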

33
Q

Ensemble Methods

A

These combine the predictions of multiple machine learning models to improve the overall performance. Techniques include bagging, boosting, and stacking.

34
Q

Ensemble methods

A

Ensemble methods are techniques in machine learning that combine the decisions from multiple models to improve the overall performance. They’re based on the idea that a group of weak learners can come together to form a strong learner. Ensemble methods can be computationally expensive and complex to set up and understand, but they often achieve higher accuracy than individual models. By effectively combining multiple perspectives, they can lead to better generalization and more robust predictions.

35
Q

Bagging

A

Bagging, or Bootstrap Aggregating, involves creating multiple subsets of the original dataset, training a model on each subset, and then combining the outputs. The subsets are created with replacement, meaning that the same data point can appear in multiple subsets. The final prediction is typically the mode (for classification) or mean (for regression) of the predictions from each model. Random Forests is a popular example of bagging, where many decision tree models are combined.
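
A minimal bagging sketch with scikit-learn's BaggingClassifier (its default base learner is a decision tree) on synthetic data:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, random_state=0)  # synthetic labeled data

# Each of the 50 trees is trained on its own bootstrap sample of the data.
bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
bag.fit(X, y)
print(bag.predict(X[:5]))   # combined (majority-vote) predictions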

36
Q

Boosting

A

Boosting involves training multiple models in a sequential manner, where each new model attempts to correct the errors of the previous models. Each model in the sequence puts more emphasis on the instances that the previous models got wrong, aiming to improve upon them. After training, the models vote for the final prediction, with some votes carrying more weight than others based on the individual model’s performance. Examples of boosting algorithms include AdaBoost and Gradient Boosting.
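
A minimal boosting sketch with scikit-learn's GradientBoostingClassifier on synthetic data, where small trees are added one after another to correct the remaining errors:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=0)

boost = GradientBoostingClassifier(n_estimators=100,   # 100 small trees added sequentially
                                   learning_rate=0.1,  # how strongly each new tree corrects errors
                                   max_depth=3,
                                   random_state=0)
boost.fit(X, y)
print(boost.score(X, y))   # accuracy on the training data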

37
Q

Stacking

A

Stacking, or Stacked Generalization, involves training multiple different models and then combining their outputs with another model, called a meta-learner or a second-level learner. The meta-learner is trained to make a final prediction based on the outputs of the individual models. This method leverages the strengths of each individual model and can often lead to improved performance.
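
A minimal stacking sketch with scikit-learn: two different base models whose predictions feed a logistic-regression meta-learner (synthetic data):

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3)),
                ("svm", SVC())],               # base (level-0) models
    final_estimator=LogisticRegression())      # meta-learner trained on their outputs
stack.fit(X, y)
print(stack.score(X, y))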

38
Q

Voting

A

Voting can be used to combine predictions from multiple models. It can be hard (majority or plurality voting, where each model votes for one class and the class with the most votes is chosen) or soft (where each model outputs probabilities for each class, and the class with the highest average probability across models is chosen).
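
A minimal sketch of hard and soft voting with scikit-learn on synthetic data; soft voting requires base models that can output class probabilities:

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
models = [("lr", LogisticRegression(max_iter=1000)), ("nb", GaussianNB()),
          ("tree", DecisionTreeClassifier(max_depth=3))]

hard = VotingClassifier(models, voting="hard").fit(X, y)  # majority vote on predicted classes
soft = VotingClassifier(models, voting="soft").fit(X, y)  # average the predicted probabilities
print(hard.score(X, y), soft.score(X, y))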

39
Q

Bag of Models

A

This technique involves training multiple models using different subsets of the training data and different machine learning algorithms. The predictions from all models are then combined through a simple mechanism like averaging or majority voting.

40
Q

Transformers

A

A type of model architecture used in deep learning, particularly for handling sequential data such as natural language. Transformers were introduced in the paper “Attention Is All You Need” by Vaswani et al. (2017).

41
Q

Input Embedding

A

In the case of natural language processing (NLP), the input to a Transformer is a sequence of tokens, which are typically words or subwords. These tokens are first converted into vectors using an embedding layer.

42
Q

Positional Encoding

A

Since Transformers don’t inherently understand the order of the input data (unlike RNNs and LSTMs), they need a way to incorporate the position of each token in the input sequence. This is done through positional encodings, which are added to the input embeddings.
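
A minimal NumPy sketch of the sinusoidal positional encodings from the original Transformer paper; each position gets a fixed vector of sines and cosines at different frequencies, added to that position's token embedding:

import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                      # positions 0..seq_len-1
    i = np.arange(d_model)[None, :]                        # embedding dimensions
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                  # odd dimensions use cosine
    return pe

print(positional_encoding(seq_len=4, d_model=8).shape)     # (4, 8); added to the embeddings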

43
Q

Multi-Head Attention

A

This is the core component of the Transformer architecture. It allows the model to focus on different parts of the input sequence for each token. It calculates a weighted sum of the input vectors, where the weights are determined by the “attention scores”. This process is done multiple times in parallel (hence “multi-head”) to allow the model to focus on different features.
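
A minimal single-head scaled dot-product attention sketch in NumPy; multi-head attention runs several of these in parallel on different learned projections and concatenates the results:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # attention scores between tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V                                         # weighted sum of the value vectors

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(5, 8))                 # 5 tokens, 8-dim vectors (self-attention)
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 8)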

44
Q

Feed-Forward Neural Networks

A

Each position in the Transformer passes through the same feed-forward neural network, applied independently at each position. It consists of two linear transformations with a ReLU activation in between.

45
Q

Normalization and Residual Connections

A

After each multi-head attention block and feed-forward neural network, the Transformer uses layer normalization and residual connections to help stabilize training.

46
Q

Encoder and Decoder Blocks

A

A Transformer model typically consists of an encoder and a decoder, each of which is composed of multiple identical layers. The encoder takes in the input sequence and produces a sequence of vectors (the “encoder output”). The decoder takes the encoder output and the target sequence so far and produces the next token in the target sequence.

47
Q

Output Linear Layer and Softmax

A

The output of the final decoder block is passed through a linear layer followed by a softmax function to produce probabilities for each potential output token.

48
Q

Masking

A

Transformers use masking to prevent certain positions from contributing to the attention scores. In the decoder, a look-ahead (causal) mask prevents the model from attending to future tokens in the sequence, preserving the left-to-right, autoregressive nature of generation; padding masks are also used so the model ignores padding tokens.
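
A minimal NumPy sketch of a causal (look-ahead) mask: positions above the diagonal get a large negative score before the softmax, so each token can only attend to itself and earlier tokens:

import numpy as np

seq_len = 4
scores = np.zeros((seq_len, seq_len))                    # pretend attention scores
mask = np.triu(np.ones((seq_len, seq_len)), k=1) == 1    # True above the diagonal (future tokens)
scores[mask] = -1e9                                      # effectively zero weight after softmax
print(scores)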

49
Q

Random Forests

A

This technique is a type of bagging ensemble method that builds a set of decision trees, each from a random bootstrap sample of the training set (and a random subset of features at each split). It then aggregates the votes from the different decision trees to decide the final class of the test object. This not only helps to improve model accuracy but also helps to reduce overfitting.

50
Q

Bagging (Bootstrap Aggregating)

A

This technique involves creating multiple subsets of the original dataset, with replacement, training a model (for instance, a decision tree) on each, and then combining their predictions. The model’s final prediction is typically the mode (for classification) or mean (for regression) of the predictions from each model. Bagging helps to decrease the model’s variance.

51
Q

Extra-Trees (Extremely Randomized Trees)

A

Similar to random forests, in Extra Trees, a random subset of features is selected to split each node in the tree. However, unlike random forests, the best split is not chosen. Instead, a random split is selected, making Extra Trees “extremely randomized”. This can help reduce the variance even further than a random forest, at the cost of a slight increase in bias.

52
Q

AdaBoost (Adaptive Boosting)

A

Unlike bagging methods, boosting methods train models in sequence. Each new model is trained to correct the errors made by the previous models. AdaBoost achieves this by assigning higher weights to the instances that the previous model got wrong, making the new model focus more on these instances. Finally, the predictions from all models are combined through a weighted majority vote (or sum for regression) to produce the final prediction.

53
Q

Gradient Boosting

A

Gradient Boosting is another boosting method that trains models in sequence to correct the errors of the previous models. However, instead of modifying the instance weights, this method fits the new model to the residual errors made by the previous model. Then, it combines the predictions of all models through a sum. Examples of gradient boosting algorithms include Gradient Boosting Machine (GBM), XGBoost, LightGBM, and CatBoost.

54
Q

Stacking (Stacked Generalization)

A

This method involves training several different models and combining their predictions using another model, called a meta-learner or second-level learner. The base-level models are trained on the complete training set; the meta-model is then fitted on their outputs (typically out-of-fold predictions) to make the final prediction.

55
Q

Random Forests - S&W

A

Strengths:
Good performance: They often provide a very good predictive accuracy out-of-box.
Feature importance: They can provide a measure of feature importance.
Minimal data preprocessing: They require little data preprocessing, e.g., no need for feature scaling.
Handling missing data: They have methods for dealing with missing data.
Low risk of overfitting: Due to averaging of decision trees, overfitting risk is low.
Weaknesses:
Complexity: They create a large number of trees (though this can be controlled by the user) and hence are more complex and computationally demanding.
Interpretability: They are not easily interpretable like a decision tree as they involve an ensemble of trees.

56
Q

Bagging (Bootstrap Aggregating) S&W

A

Strengths:
Reduces overfitting: By averaging the results from multiple models, it reduces the chance of overfitting.
Good with high dimensional data: Bagging can be effective on high-dimensional datasets where individual models may overfit.
Weaknesses:
Weak learners should be diverse: Bagging relies on the diversity of the weak learners, so the base learner should be an unstable, high-variance model (e.g., a deep decision tree).
Computationally expensive: It might be computationally expensive if the base learner is complex.

57
Q

Extra-Trees (Extremely Randomized Trees) S&W

A

Strengths:
Reduces variance further: By introducing additional randomness in the selection of features and splits, Extra Trees can reduce the variance of the model further than a random forest.
Fast to train: Since it uses random thresholds for each feature rather than searching for the best possible thresholds (like Random Forests), it’s typically faster to train.
Weaknesses:
Increased bias: Introducing additional randomness can increase the bias of the model.
Not interpretable: Like Random Forests, Extra Trees are not easily interpretable.

58
Q

AdaBoost (Adaptive Boosting) S&W

A

Strengths:
Good performance: Can achieve good classification results with much less tweaking of parameters or settings.
No need for prior knowledge about weak learner: It automatically determines the weight of the weak classifiers based on their accuracy.
Weaknesses:
Sensitive to noisy data and outliers: AdaBoost can be sensitive to noisy data and outliers.
Computationally expensive: AdaBoost can be slower to train than bagging models as the model trains sequentially.

59
Q

Gradient Boosting S&W

A

Strengths:
High performance: Often provides excellent predictive accuracy, frequently among the strongest methods for structured/tabular data.
Flexibility: Can optimize on different loss functions and provides several hyperparameter tuning options.
Weaknesses:
Prone to overfitting: Without careful tuning, gradient boosting models can overfit the training data.
Computationally expensive: Gradient boosting can be computationally expensive and requires careful tuning.

60
Q

Stacking (Stacked Generalization) S&W

A

Strengths:
High performance: Can outperform any individual model due to its ability to optimize over multiple models.
Flexibility: Can use different types of models at base level.
Weaknesses:
Complexity: Stacking multiple models increases complexity of the model.
Computationally expensive: Training multiple layers of models can be time-consuming.
Risk of overfitting: If not careful, stacking can lead to overfitting especially when involving many different models or complex models.