Preprocessing & EDA Flashcards

1
Q

Augmentation

A

A data preprocessing technique used to artificially increase the size of a training dataset by applying various transformations to existing data samples. These transformations can include rotation, scaling, translation, cropping, and flipping, among others. Augmentation helps improve model generalization by exposing it to a wider range of variations in the input data.
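
A minimal sketch of simple image augmentations with NumPy and SciPy; the toy array img and the particular transformations are illustrative assumptions, not a fixed recipe:
import numpy as np
from scipy import ndimage
img = np.random.rand(64, 64)                             # stand-in for a real grayscale image
flipped = np.fliplr(img)                                 # horizontal flip
rotated = ndimage.rotate(img, angle=15, reshape=False)   # small rotation
shifted = ndimage.shift(img, shift=(2, -3))              # translation by a few pixels
cropped = img[4:60, 4:60]                                # crop of the central region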

2
Q

Bar Charts

A

Bar charts are graphical representations of categorical data using rectangular bars: each bar represents a discrete category or group, and its length or height reflects the frequency, relative frequency, or other numerical value associated with that category. Bar charts are useful for visually comparing the frequency or distribution of different categories and are commonly used for categorical data such as survey responses, product sales, or demographic characteristics. They are especially effective for displaying discrete data with a small number of categories.
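
A minimal matplotlib sketch; the category labels and counts are made up for illustration:
import matplotlib.pyplot as plt
categories = ["red", "green", "blue"]   # hypothetical survey responses
counts = [12, 7, 4]                     # frequency of each category
plt.bar(categories, counts)
plt.xlabel("Color")
plt.ylabel("Frequency")
plt.title("Bar chart of survey responses")
plt.show()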

3
Q

Binning

A

Used when you have a numerical feature but want to convert it into a categorical one. Binning (also called bucketing) is the process of converting a continuous feature into multiple binary features called bins or buckets, typically based on value ranges.

More generally, it is a data preprocessing technique used to group continuous numerical data into discrete intervals or bins. It involves partitioning the range of values into intervals (equal-width, equal-frequency, or custom boundaries) and assigning data points to their corresponding bins. Binning is commonly used to simplify complex datasets, reduce noise, and handle outliers.
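
A small pandas sketch of custom-boundary and equal-frequency binning; the column name 'age' and the bin edges are assumptions:
import pandas as pd
df = pd.DataFrame({"age": [3, 17, 25, 41, 67, 80]})
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 35, 60, 120],
                         labels=["child", "young", "middle", "senior"])  # custom edges
df["age_bucket"] = pd.qcut(df["age"], q=3, labels=False)                 # equal-frequency bins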

4
Q

Box Plots

A

Box plots, also known as box-and-whisker plots, are graphical representations of the distribution of numerical data through quartiles. They consist of a box that spans the interquartile range (IQR), with a line inside representing the median. “Whiskers” extend from the edges of the box to the minimum and maximum values within 1.5 times the IQR from the first and third quartiles, respectively. Potential outliers beyond the whiskers are often displayed as individual data points. Box plots are useful for visualizing the spread and skewness of data and identifying outliers in a dataset.

5
Q

Broadcasting

A

Used to perform operations on arrays of different shapes efficiently. It allows arrays with different dimensions to be combined or operated upon without explicit looping, improving computational performance and code readability. When you perform an operation between arrays, Python libraries like NumPy automatically adjust the smaller array's shape to match the shape of the larger one; it is like stretching or replicating the smaller array's elements so the two arrays can be combined element-wise. Broadcasting enables seamless operations across multidimensional data structures, facilitating tasks such as batch processing, data augmentation, and model training.
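
A minimal NumPy illustration: centering the columns of a matrix and rescaling its rows without an explicit loop (the toy values are assumptions):
import numpy as np
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])          # shape (2, 3)
col_means = X.mean(axis=0)               # shape (3,)
X_centered = X - col_means               # the (3,) vector is broadcast across both rows
row_scale = np.array([[10.0], [100.0]])  # shape (2, 1)
X_scaled = X * row_scale                 # the (2, 1) column is broadcast across the columns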

6
Q

Cardinality

A

The number of unique values in a categorical variable or feature. High-cardinality variables have a large number of distinct categories, while low-cardinality variables have few. Cardinality is an important consideration in feature engineering and can impact model performance and complexity.
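
A quick pandas check; the DataFrame and the 'city' column are assumptions:
import pandas as pd
df = pd.DataFrame({"city": ["Oslo", "Oslo", "Lima", "Pune", "Lima"]})
print(df["city"].nunique())        # cardinality: number of distinct categories
print(df["city"].value_counts())   # how often each category occurs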

7
Q

Categorical Plots

A

A type of graph used to visualize the distribution and relationships within categorical data. Categorical data is data that falls into distinct groups or categories.

Categorical plots are useful for:
- Distribution: showing how frequently each category occurs.
- Comparison: comparing different categories side-by-side.
- Relationships: investigating potential relationships between different categorical variables.

Common types of categorical plots:
- Bar plots
- Pie charts
- Count plots
- Box plots
- Strip plots
- Swarm plots

8
Q

Class imbalance

A

An unequal distribution of classes or categories in a classification dataset, where one class is significantly more prevalent than others. Class imbalance can lead to biased model predictions, as the model may have a tendency to favor the majority class and overlook minority classes. Addressing class imbalance often requires specific techniques such as resampling methods, cost-sensitive learning, or ensemble methods.

9
Q

Correlation Analysis

A

Statistical technique used to measure and assess the strength and direction of the relationship between two or more variables in a dataset. It quantifies the degree of association between variables using correlation coefficients, such as the Pearson correlation coefficient, Spearman's rank correlation coefficient, or Kendall's tau. Correlation analysis helps identify patterns and dependencies among variables (correlation alone does not establish causation), facilitating feature selection, model building, and predictive modeling in machine learning and data analysis.
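
A short pandas/seaborn sketch; the column names and values are illustrative, and the seaborn heatmap is optional:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame({"height": [1.6, 1.7, 1.8, 1.9],
                   "weight": [55, 68, 77, 90],
                   "shoe":   [37, 41, 43, 45]})
corr = df.corr(method="pearson")          # also accepts "spearman" or "kendall"
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()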

10
Q

Data Balancing

A

Data balancing, also known as class imbalance correction or oversampling/undersampling, is a preprocessing technique used in machine learning to address imbalanced datasets where one class is significantly more prevalent than others. It involves modifying the dataset to ensure that each class is represented fairly during model training. Techniques for data balancing include random undersampling, random oversampling, Synthetic Minority Over-sampling Technique (SMOTE), and ensemble methods. Data balancing is crucial for improving the performance and fairness of classification models, particularly in applications where class distribution is skewed.

11
Q

Data Cleaning

A

Process within data preparation that involves identifying and addressing errors, inconsistencies, outliers, and missing values within a dataset to enhance its quality and reliability.

12
Q

Data imputation

A

A process of filling in missing values in a dataset with estimated or predicted values. It is a common technique used to handle missing data before performing analysis or training machine learning models. Imputation methods can range from simple strategies like mean or median imputation to more complex techniques such as regression-based imputation or k-nearest neighbors imputation.
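
A minimal scikit-learn sketch of median imputation; the toy array is an assumption:
import numpy as np
from sklearn.impute import SimpleImputer
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])
imputer = SimpleImputer(strategy="median")   # also "mean", "most_frequent", "constant"
X_imputed = imputer.fit_transform(X)         # missing entries replaced by column medians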

13
Q

Data Preprocessing

A

Techniques used to condition raw data before feeding it into a machine learning model. This includes tasks like scaling, normalization, transforming data types, feature engineering, and reducing the number of features (dimensionality reduction).

14
Q

Data Quality Assessment

A

Process of evaluating the accuracy, completeness, consistency, and reliability of data to ensure that it meets the requirements of the intended use. It involves identifying and correcting errors, anomalies, and inconsistencies in the data, as well as assessing its fitness for specific purposes. Data quality assessment encompasses various techniques and methodologies, including data profiling, data cleansing, outlier detection, and validation. It is essential for ensuring the integrity and trustworthiness of data in decision-making, analysis, and modeling processes.

15
Q

Data Sampling

A

Process of selecting a subset of observations or data points from a larger dataset to represent the population or distribution of interest. Sampling techniques can be random or non-random and may involve techniques such as simple random sampling, stratified sampling, systematic sampling, or cluster sampling. Data sampling is widely used in statistics, survey research, and machine learning for estimating population parameters, reducing computational complexity, and generating training datasets for model training and evaluation.
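
A short sketch of stratified sampling via scikit-learn's train_test_split; X, y, and the split size are assumptions:
import numpy as np
from sklearn.model_selection import train_test_split
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 7 + [1] * 3)          # imbalanced labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)   # preserves class proportions in both splits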

16
Q

Data Transformation

A

Process of converting or modifying raw data into a more suitable format for analysis, modeling, or visualization. It involves operations such as normalization, standardization, scaling, encoding, imputation, aggregation, and feature engineering. Data transformation aims to improve the quality, interpretability, and performance of data in machine learning, statistical analysis, and data-driven decision-making processes. It plays a crucial role in preprocessing pipelines, where it prepares the data for subsequent tasks such as modeling, clustering, or classification.

17
Q

Data Visualization

A

Graphical representation of data and information to facilitate understanding and interpretation. It encompasses a wide range of techniques and tools for creating visual representations such as charts, graphs, maps, and dashboards. Data visualization is used to explore patterns, trends, and relationships in data, communicate insights, and support decision-making in various fields including business, science, and engineering. It plays a crucial role in exploratory data analysis, storytelling, and conveying complex information to diverse audiences.

18
Q

Dealing with missing features

A
  • removing rows or columns
  • imputing values
  • using domain knowledge to create derived features
  • getting new and more data from source or other data sets
19
Q

Decoder

A

A component or algorithm that transforms encoded data or representations back into their original format or domain. Decoders are commonly used in autoencoders, generative models, and communication systems to recover information from compressed or encoded representations. In natural language processing, decoders are used in sequence-to-sequence models for generating output sequences from encoded input representations.

20
Q

Density Plots

A

Density plots are best suited for continuous numeric data. A density plot is a smoothed version of a histogram, used to visualize the distribution of a continuous numerical variable. It shows the estimated probability density of the data: the plot consists of a curve representing the probability density function (PDF) of the variable, and the total area under the curve is always 1.

  • Peaks in the curve indicate regions where data points are more concentrated.
  • Valleys represent areas where data is less frequent.
  • The overall shape gives insights into the spread, skewness, and whether the distribution has multiple modes (peaks).

It is useful for identifying distributions with multiple peaks, which histograms might obscure.

The smoothness of a density plot is controlled by a parameter called the bandwidth. Experimenting with different bandwidths can change the level of detail revealed.
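
A minimal seaborn sketch (assuming seaborn >= 0.11, where bw_adjust scales the bandwidth); the bimodal toy data is an assumption:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
values = np.concatenate([np.random.normal(0, 1, 500),
                         np.random.normal(5, 1, 500)])   # bimodal sample
sns.kdeplot(x=values, bw_adjust=0.8)   # smaller bw_adjust -> more detail, larger -> smoother
plt.show()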

21
Q

Encoder

A

A component or algorithm that converts raw input data into a suitable format for processing, analysis, or modeling. Encoders transform data from one representation to another, such as converting categorical variables into numerical representations or compressing high-dimensional data into low-dimensional embeddings. In deep learning, encoders are commonly used in autoencoders, sequence-to-sequence models, and neural network architectures to learn compact and informative representations of input data for downstream tasks such as classification, regression, and generation.

22
Q

Encoding

A

A process of converting categorical variables or features into numerical representations that can be used as input for machine learning algorithms. Encoding allows categorical information to be effectively incorporated into machine learning models, which typically require numerical input data.

We can encode ordinal data (with an intrinsic order) or nominal data (without an intrinsic order). For ordinal data we often use label encoding; for nominal data, one-hot encoding. Other common techniques include binary encoding, frequency encoding, and target encoding.

23
Q

Feature Engineering

A

The problem of transforming raw data into a dataset of usable features is called feature engineering.

Feature Engineering is a process of creating new features or modifying existing features in a dataset to improve the performance of machine learning models. It involves selecting relevant features, transforming data, creating derived features, and reducing dimensionality. Effective feature engineering can enhance model interpretability, accuracy, and generalization to new data.

24
Q

Feature Importance Analysis

A

Feature importance analysis is a set of techniques used to rank the features (input variables) in a machine learning model based on how much they contribute to the model’s predictions. Feature importance doesn’t mean causation. Highly important features may be correlated with other important features.

Common Techniques
1) Permutation Importance:
Shuffle the values of a single feature randomly.
Re-evaluate the model’s performance.
A large drop in performance indicates that the feature is important.
Repeat for all features to see relative importance.

2) Mean Decrease in Impurity (Tree-based Models):
For decision trees and random forests, calculate how much each feature decreases the impurity (e.g., Gini index or entropy) across the splits in the trees.
Features that create purer splits are assigned higher importance.

3) Coefficients (in Linear Models):
For linear models like linear regression and logistic regression, the magnitude of feature coefficients indicates a feature’s impact (assuming features are scaled properly).

4) Partial Dependence Plots (PDP):
Show the marginal effect of one feature on the predicted outcome.
Helps visualize how changes in a feature influence the prediction, even if the relationship is non-linear.

5) Information Gain:
Used in decision trees and similar models to determine the most informative features for splitting nodes.
It measures the reduction in entropy (or increase in information) achieved by splitting data based on a particular feature.
Features with higher information gain are considered more important for classification tasks.

6) SHAP Values (SHapley Additive exPlanations):
Provides a unified measure of feature importance based on game theory concepts.
It calculates the contribution of each feature to the difference between the actual prediction and the average prediction across all samples.
Positive SHAP values indicate features that increase the prediction, while negative values indicate features that decrease the prediction.

7) L1 (Lasso) Regularization:
In regularized linear models like Lasso Regression, features with non-zero coefficients after regularization are considered important.
L1 regularization encourages sparsity by penalizing the absolute values of the coefficients, effectively selecting a subset of the most important features.

Libraries for Feature Importance
Scikit-learn (Python): Offers permutation importance and built-in methods for tree-based models.
ELI5 (Python): Great for explaining model predictions and visualizing feature importance.
DALEX (R): Provides a range of feature importance methods and explainers.
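
A minimal permutation-importance sketch with scikit-learn; the synthetic dataset and the random-forest model are assumptions:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)   # average drop in score when each feature is shuffled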

25
Q

Feature selection

A

A process of choosing a subset of relevant features from a larger set of features in a dataset. It aims to reduce dimensionality, improve model performance, and enhance interpretability by focusing on the most informative features. Feature selection techniques include filter methods, wrapper methods, and embedded methods, which assess feature importance based on statistical measures, model performance, or feature relevance to the target variable.

Filter Methods
- Information Gain: Measures how much information a feature provides about the target variable.
- Chi-Squared Test: Evaluates the independence between a feature and the target variable.
- Correlation Analysis: Identifies features that are highly correlated with the target variable or strongly correlated among themselves (potentially leading to redundancy).
- Variance Threshold: Removes features with low variance, as they are unlikely to carry much predictive information.

Wrapper Methods
- Forward Selection: Starts with an empty feature set and iteratively adds the feature that most improves model performance.
- Backward Elimination: Starts with all features and iteratively removes the least important feature until performance drops below a threshold.
- Recursive Feature Elimination (RFE): A variant of backward elimination that uses a model to rank feature importance and recursively eliminates the least important ones.

Embedded Methods
- Regularization (L1/Lasso, L2/Ridge): Penalizes model complexity, forcing coefficients of less important features towards zero.
- Decision Trees and Tree-Based Ensembles: Tree-based algorithms (like Random Forests) provide feature importance scores that can be used for selection.
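
A short sketch of recursive feature elimination (RFE) with scikit-learn; the synthetic dataset and the logistic-regression estimator are assumptions:
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
selector.fit(X, y)
print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # 1 = selected, larger numbers = eliminated earlier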

26
Q

Frequency Tables

A

A frequency table is a way to summarize and organize data by showing how often each unique value (or a range of values) appears within a dataset. One column lists all the unique values observed in your data, or groups them into intervals/categories if you have many distinct values. Another column lists the count of how many times each value or category appears in your data.

In classification problems, frequency tables of your target variable classes can reveal if you have a class imbalance issue (e.g., many more positive examples than negative). For categorical features, frequency can be a factor in determining possible one-hot encoding strategies. For instance, rare categories might be grouped into an “Other” category to avoid overly sparse features.

Frequency tables are fundamental in understanding text data. Analyzing word frequencies helps identify stop words (common words like “the”, “a”) for removal, and reveals which words are most significant in a corpus (body of text).

Below are common ways to compute frequencies in Python (data_series, df, and data are assumed to exist):
frequency_table = data_series.value_counts()
frequency_table = df.groupby('color')['size'].value_counts()
from collections import Counter
frequencies = Counter(data)

27
Q

Heatmaps

A

A graphical representation of data where values are represented by colors. It uses a color gradient (e.g., from blue to red) to show where there are high and low concentrations of values in a table-like arrangement. The human eye is excellent at picking out patterns in color variations. Heatmaps leverage this to reveal hot spots, trends, and clusters in data that might be hard to see in a plain table of numbers.

28
Q

Histogram equalization

A

A technique used in image processing to enhance the contrast and visibility of details in an image. It redistributes pixel intensities in the image histogram to achieve a more uniform distribution, which can improve the appearance of images with low contrast or uneven lighting conditions. Histogram equalization is commonly used as a preprocessing step in computer vision tasks.
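
A minimal NumPy sketch of histogram equalization for an 8-bit grayscale image; the toy input image is an assumption, and libraries such as OpenCV provide ready-made equalization functions:
import numpy as np
def equalize_histogram(img):
    # img: 2-D uint8 array with values in [0, 255]
    hist, _ = np.histogram(img.ravel(), bins=256, range=(0, 256))
    cdf = hist.cumsum().astype(np.float64)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())   # normalize the cumulative histogram to [0, 1]
    lut = (cdf * 255).astype(np.uint8)                  # lookup table of new intensities
    return lut[img]                                     # map every pixel through the table
img = (np.random.rand(32, 32) * 120).astype(np.uint8)   # low-contrast toy image
equalized = equalize_histogram(img)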

29
Q

Histograms

A

Histograms are graphical representations of the distribution of continuous numerical data, where data values are grouped into bins or intervals, and the height of each bar represents the frequency or relative frequency of data points falling within each bin. Histograms provide insights into the shape, central tendency, and spread of the data distribution, allowing for visual assessment of characteristics such as symmetry, skewness, and multimodality. They are commonly used for exploring the distribution of variables, identifying patterns or outliers, and assessing data quality. Histograms are particularly effective for visualizing large datasets and detecting patterns in continuous data.

30
Q

Imbalanced data (and techniques)

A

Imbalanced data refers to datasets where the classes you want to predict have a significantly unequal distribution – for instance, in fraud detection, most transactions are legitimate, with only a tiny percentage being fraudulent. This imbalance can cause standard classification algorithms to prioritize learning the majority class, leading to poor performance in identifying the less frequent but often more important minority class (like the fraudulent cases). Addressing this imbalance is crucial for building machine learning models that can accurately detect the patterns associated with the minority class and make reliable predictions in real-world scenarios.

1. Resampling Techniques
   - Oversampling: replicates samples from the minority class to increase its representation.
     - Random Oversampling: randomly duplicates minority class examples.
     - SMOTE (Synthetic Minority Over-sampling Technique): creates synthetic minority class samples based on similarities within the existing minority samples.
   - Undersampling: removes samples from the majority class to reduce its representation.
     - Random Undersampling: randomly deletes majority class samples.
     - NearMiss: identifies majority class samples closest to the minority class border and removes them, aiding better decision-boundary definition.
2. Algorithmic Techniques
   - Cost-Sensitive Learning: assigns higher misclassification costs to errors on the minority class, forcing the model to prioritize them.
   - Ensemble Methods: combining multiple models trained on balanced subsets of data often improves performance on the minority class.
3. Data-Level Approaches
   - Synthetic Data Generation: creates artificial samples for the minority class using techniques like SMOTE or deep learning approaches (e.g., GANs).
4. Hybrid Approaches
   - Combining resampling and algorithmic methods is often the most effective strategy (e.g., oversampling followed by cost-sensitive learning).
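
A minimal sketch of random oversampling with scikit-learn's resample utility; the toy DataFrame and its 'label' column are assumptions, and libraries such as imbalanced-learn provide SMOTE and related methods:
import pandas as pd
from sklearn.utils import resample
df = pd.DataFrame({"feature": range(10),
                   "label":   [0] * 8 + [1] * 2})          # 8 majority vs 2 minority examples
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_upsampled])        # classes now equally represented
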
31
Q

Intensity transformations

A

Intensity transformations are image processing techniques used to modify the brightness and contrast of images by adjusting pixel intensities. Common intensity transformations include gamma correction, logarithmic transformation, and contrast stretching. These transformations can enhance the visual quality of images and improve the performance of computer vision algorithms.

32
Q

Label Encoding

A

Label encoding is a technique used to convert categorical labels or target variables into numerical representations. Each unique label is assigned a unique integer value, allowing categorical data to be effectively used in machine learning algorithms that require numerical input. Label encoding is commonly used for ordinal categorical variables with inherent order.
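
A small sketch using an explicit mapping for an ordinal feature in pandas; the 'size' column and the chosen order are assumptions, and scikit-learn's LabelEncoder/OrdinalEncoder offer an alternative:
import pandas as pd
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})
order = {"small": 0, "medium": 1, "large": 2}     # encode the intrinsic order explicitly
df["size_encoded"] = df["size"].map(order)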

33
Q

Line Plots

A

Line plots are graphs that display data points connected by straight lines, representing changes in values over time or another continuous variable. They typically have an x-axis representing time or a continuous variable and a y-axis representing the values being measured. Line plots are commonly used to visualize trends, patterns, and relationships in time-series data or continuous data. They allow for the examination of how data changes over time or across different conditions.

34
Q

Multicollinearity

A

The presence of high correlation between predictor variables (features) in a regression analysis. It can lead to unstable estimates of regression coefficients, reduced interpretability of the model, and inflated standard errors. Multicollinearity can be detected using statistical measures such as correlation coefficients or variance inflation factors (VIF) and can be addressed through feature selection or regularization techniques.

35
Q

Negative sampling

A

Negative sampling is a clever training trick used in machine learning, particularly in natural language processing and recommender systems. Imagine you’re trying to teach a model to understand the vast world of words or products. Showing it every single true/positive example would be overwhelming and slow. Negative sampling is like strategic flashcards – instead of showing everything, it carefully selects a few “wrong” examples (negative samples) to contrast with the true ones. This helps the model learn the boundaries between categories more efficiently. For example, while teaching a word embedding model, you might show it “cat” as a positive example and a few random, unrelated words as negative examples.

Let’s use word embeddings as an example:
Positive Pair: You start with a target word (e.g., “cat”) and a true context word that appears near it in your real text data (e.g., “pet”). This is your positive example.
Generating Negative Samples: Instead of showing the model every other word that isn’t a context word, you strategically sample a few negative examples. This is where different methods exist:
Random Sampling: Simply pick a few random words from the vocabulary.
Frequency-based Sampling: Words that occur very frequently (like “the”, “and”) are more likely to be chosen as negative examples. This helps balance the focus on rarer words.
Training Update: The model is shown the positive pair and the negative samples. Its goal is to learn to:

Assign a high similarity score to the positive pair.
Assign low similarity scores to the negative pairs.

Why is this efficient? Reduced computation: instead of updating the model based on every single word it isn't associated with, we focus on a few informative negative examples. This speeds up training significantly for large vocabularies.
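
A rough sketch of frequency-based negative sampling with NumPy, using the word2vec-style 3/4 power to damp very frequent words; the vocabulary and counts are made up:
import numpy as np
vocab = ["the", "cat", "pet", "quantum", "and"]
counts = np.array([5000, 40, 35, 2, 4800], dtype=np.float64)
probs = counts ** 0.75                 # damp the most frequent words
probs /= probs.sum()
rng = np.random.default_rng(0)
negatives = rng.choice(len(vocab), size=5, p=probs)   # indices of sampled negative words
print([vocab[i] for i in negatives])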

36
Q

Normalization

A

Normalization is the process of converting an actual range of values which a numerical feature can take, into a standard range of values, typically in the interval [-1, 1] or [0, 1].

The term is also used for the process of dividing a frequency by the sample size to obtain a relative frequency (an estimate of a probability).
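
A minimal min-max normalization sketch with NumPy; scikit-learn's MinMaxScaler does the same per feature, and the toy values are an assumption:
import numpy as np
x = np.array([2.0, 5.0, 9.0, 11.0])
x_norm = (x - x.min()) / (x.max() - x.min())   # rescaled to [0, 1]
x_sym = 2 * x_norm - 1                         # rescaled to [-1, 1]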

37
Q

One-Hot Encoding

A

Used for encoding nominal data. One-hot encoding is a technique used to convert categorical variables or features into binary vectors, where each unique category is represented by a binary indicator variable. In the one-hot encoding scheme, a single attribute is set to 1 for the corresponding category and 0 for all other categories. One-hot encoding allows categorical information to be effectively incorporated into machine learning models without imposing ordinality or hierarchy among categories.

Usually this means creating separate columns for each value (treating them as features) and marking each record with a 0 or 1 to indicate whether that value is present in the record.
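
A minimal pandas sketch; the 'color' column and its categories are assumptions, and scikit-learn's OneHotEncoder is the pipeline-friendly alternative:
import pandas as pd
df = pd.DataFrame({"color": ["red", "green", "blue", "red"]})
one_hot = pd.get_dummies(df["color"], prefix="color")
df = pd.concat([df, one_hot], axis=1)   # adds color_blue, color_green, color_red indicator columns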

38
Q

Outliers

A

Outliers are data points that deviate significantly from the rest of the dataset. They may arise due to measurement errors, data corruption, or genuine but rare phenomena. Outliers can distort statistical analyses and machine learning models, leading to biased results or decreased predictive accuracy. Identifying and handling outliers is essential in data preprocessing to ensure the robustness and reliability of analyses and models.

Handling Outliers:
- Consider using domain-specific knowledge to decide whether the outliers are genuine and useful
- Winsorization: Replace outliers with the nearest non-outlier values (e.g., 5th and 95th percentiles).
- Trimming: Remove a certain percentage of data points from the tails of the distribution.
- Imputation: Replace outliers with estimated values based on interpolation or other statistical methods.
- Transform skewed data distributions using techniques such as logarithmic or power transformations.
- Standardize or normalize features to ensure they have similar scales and reduce the impact of outliers on the model.
- Focus on models that do well with outliers. Examples: decision tree, random forest, kernel regression etc.
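
A small sketch of the common 1.5 x IQR rule with pandas; the toy series is an assumption:
import pandas as pd
s = pd.Series([10, 12, 11, 13, 12, 95])        # 95 looks like an outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]        # flag points outside the whiskers
winsorized = s.clip(lower, upper)              # cap extreme values instead of dropping them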

39
Q

Padding

A

Padding is a technique used in image processing and natural language processing (NLP) to add extra information or space around the edges of data samples. In image processing, padding is often applied to ensure that all images have the same dimensions, facilitating batch processing and convolutional operations in neural networks. In NLP, padding is used to standardize the length of text sequences for efficient processing in recurrent neural networks (RNNs) and transformers.

40
Q

Pair Plots

A

Pair plots are grid-like arrangements of scatter plots that visualize pairwise relationships between different variables in a dataset. Each scatter plot in a pair plot represents the relationship between two variables, and the diagonal typically shows histograms or kernel density estimates of each variable. Pair plots are useful for identifying patterns, correlations, and potential interactions between multiple variables in a dataset. They allow for a comprehensive exploration of relationships within the dataset by examining how variables relate to each other.

41
Q

Parallel Coordinates Plots

A

Parallel coordinates plots are graphical representations of multivariate data using parallel axes, where each axis represents a different variable. Data points are represented as lines connecting values on each axis, allowing for the visualization of relationships and patterns across multiple variables simultaneously. Parallel coordinates plots are useful for exploring high-dimensional datasets and identifying clusters or trends across multiple variables. They provide a way to visualize the relationships between variables and identify patterns or outliers in the data.

42
Q

Pie Charts

A

Pie charts are circular statistical graphics divided into slices to illustrate numerical proportions. Each slice represents a proportion of the whole dataset, with the size of the slice corresponding to the relative magnitude of the proportion it represents. Pie charts are commonly used to visualize the distribution of categorical data and highlight the relative contributions of different categories to the whole. However, they can be less effective than other visualization types, such as bar charts, for accurately comparing proportions or displaying complex datasets.

43
Q

Resampling imbalanced dataset

A

A technique used to address class imbalance in classification datasets, where one class is significantly more prevalent than others. Resampling methods involve either oversampling the minority class (adding duplicates or synthetic samples) or undersampling the majority class (removing samples) to balance the class distribution. Resampling helps prevent bias towards the majority class and improves model performance on imbalanced datasets.

44
Q

Scatter Plots

A

Scatter plots are graphs that display individual data points as dots on a two-dimensional plane, with one variable plotted on the x-axis and another on the y-axis. They visually represent the relationship between two variables, showing patterns such as correlation, clustering, or outliers. Scatter plots are commonly used to explore relationships between two variables and identify trends or patterns in the data. They provide a visual way to examine the association between variables and identify any potential relationships or trends.

45
Q

Standardization

A

Standardization (or z-score normalization) is the procedure during which the feature values are rescaled so that they have the properties of a standard normal distribution with μ = 0 and σ = 1, where μ is the mean (the average value of the feature, averaged over all examples in the dataset) and σ is the standard deviation from the mean.
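
The transformation is z = (x − μ) / σ. A minimal sketch with scikit-learn; the toy matrix is an assumption:
import numpy as np
from sklearn.preprocessing import StandardScaler
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])
scaler = StandardScaler()
X_std = scaler.fit_transform(X)   # each column now has mean 0 and standard deviation 1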

46
Q

Standardization vs. Normalization

A

There’s no definitive answer to this question. Usually, if your dataset is not too big and you have time, you can try both and see which one performs better for your task.

If you don’t have time to run multiple experiments, as a rule of thumb:
* unsupervised learning algorithms, in practice, more often benefit from standardization than from normalization;
* standardization is also preferred for a feature if the values this feature takes are distributed close to a normal distribution (so-called bell curve);
* again, standardization is preferred for a feature if it can sometimes have extremely high or low values (outliers); this is because normalization will “squeeze” the normal values into a very small range;
* in all other cases, normalization is preferable.

47
Q

Summary Statistics

A

Numerical measures used to describe and summarize the main features of a dataset. Common summary statistics include measures of central tendency (e.g., mean, median, mode) and measures of dispersion or variability (e.g., standard deviation, range, interquartile range). Summary statistics provide insights into the distribution, spread, and shape of data, facilitating comparisons, hypothesis testing, and decision-making. They are essential tools for data exploration, interpretation, and communication in both descriptive and inferential statistics.
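
In pandas, most of these come from a single call; the DataFrame contents are an assumption:
import pandas as pd
df = pd.DataFrame({"age": [23, 35, 41, 29, 52], "income": [30, 52, 61, 45, 80]})
print(df.describe())                                 # count, mean, std, min, quartiles, max per column
print(df["age"].median(), df["age"].mode().iloc[0])  # individual summary statistics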

48
Q

Threshold and simple segmentation

A

Thresholding and simple segmentation are image processing techniques used to separate objects or regions of interest from the background in digital images. Thresholding involves setting a threshold value and classifying pixels as foreground (object) or background based on their intensity values. Simple segmentation techniques partition an image into regions based on certain criteria, such as color, texture, or intensity gradients.

49
Q

VIF (Variance Inflation Factor)

A

VIF, or the Variance Inflation Factor, is a metric used to measure multicollinearity in regression analysis. It quantifies how much the variance of the estimated regression coefficients is inflated due to collinearity among predictor variables. A high VIF value indicates strong multicollinearity, suggesting that the corresponding predictor variable may be redundant or highly correlated with other variables in the model. VIF values above a certain threshold (often 5 or 10) are considered indicative of multicollinearity issues that may affect the stability and reliability of regression estimates.
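
A short sketch using statsmodels' variance_inflation_factor; the synthetic predictors (x2 nearly collinear with x1) are assumptions:
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = pd.DataFrame({"x1": x1,
                  "x2": x1 * 0.9 + rng.normal(scale=0.1, size=100),   # nearly collinear with x1
                  "x3": rng.normal(size=100)})
vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(dict(zip(X.columns, vifs)))   # x1 and x2 should show much higher VIFs than x3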

50
Q

Violin Plots

A

Violin plots are graphical representations of the distribution of numerical data, combining aspects of box plots and kernel density plots. They show the median, quartiles, and kernel density estimation of the data, providing insights into both the central tendency and the spread of the data. Violin plots are useful for comparing distributions of multiple groups or variables and visualizing the shape and variability of the data. They offer a way to assess the distribution of data and compare groups or variables visually.

51
Q

Word Embedding

A

Word embedding is a technique used to represent words or tokens as dense vectors in a high-dimensional space, where semantically similar words are mapped to nearby vector representations. Word embeddings capture contextual relationships and the semantic meaning of words (learned from the contexts in which words co-occur in large text corpora), enabling better representation of language in natural language processing (NLP) tasks such as text classification, sentiment analysis, and machine translation. Popular word embedding models include Word2Vec, GloVe, and FastText.

52
Q

Word Embedding

A

Words are represented as points in a multi-dimensional space. Each dimension in this space represents a feature or attribute, and the position of a word in this space is determined by its relationship with those features.

The process of placing a word in a space of given features and determining its position is typically done through unsupervised learning algorithms, such as Word2Vec or GloVe. The relationship between words and features can be quantified as a vector, which has both magnitude and direction. The magnitude indicates the strength of the relationship, while the direction signifies the type of relation between the words. The distance between words in this multi-dimensional space reflects their semantic similarity or dissimilarity: words that are similar in meaning tend to be closer together, while those with different meanings are farther apart.

By analyzing the distances and directions between words and specific features, biases can be detected. Biases arise when certain words are disproportionately associated with particular attributes, such as gender or race. We can quantify bias by calculating metrics such as cosine similarity or distance between word embeddings representing sensitive attribute-related terms and non-sensitive terms.
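
A minimal cosine-similarity sketch with NumPy; the tiny 3-dimensional vectors stand in for embeddings learned by a model such as Word2Vec or GloVe:
import numpy as np
def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
cat = np.array([0.8, 0.1, 0.3])      # toy "embeddings"
dog = np.array([0.7, 0.2, 0.35])
car = np.array([-0.5, 0.9, 0.0])
print(cosine_similarity(cat, dog))   # high: semantically close
print(cosine_similarity(cat, car))   # lower: semantically distant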

53
Q

Wrangle

A

The process of transforming, cleaning, and preparing raw data into a usable format for analysis or building models. This involves tasks like handling missing values, correcting errors, and formatting the data.

54
Q

Zero padding

A

A technique used in convolutional neural networks (CNNs) and other signal processing applications to add zeros around the edges of input data. Padding ensures that the spatial dimensions of input data are preserved during convolutional operations, preventing information loss at the edges of the input. Zero padding is commonly used to control the spatial size of feature maps and to facilitate the application of convolutional filters across input data.
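
A tiny NumPy illustration; deep learning frameworks apply the same idea via a padding argument on convolution layers:
import numpy as np
feature_map = np.arange(9).reshape(3, 3)
padded = np.pad(feature_map, pad_width=1, mode="constant", constant_values=0)
print(padded.shape)   # (5, 5): a one-element border of zeros on every side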

55
Q

Hierarchical Clustering

A

clustering technique used to group similar data points into clusters based on their pairwise distances or similarities. Unlike partitioning methods like K-means clustering, hierarchical clustering organizes the data points into a hierarchical tree-like structure, known as a dendrogram. Hierarchical clustering can be agglomerative, where each data point starts as a single cluster and is successively merged with its nearest neighbor clusters, or divisive, where all data points start in a single cluster and are successively split into smaller clusters. Hierarchical clustering does not require specifying the number of clusters in advance, making it suitable for exploratory data analysis and visualization of hierarchical relationships within the data. It is commonly used in fields such as biology, ecology, and social sciences to analyze similarities and groupings in complex datasets.

It is an unsupervised machine learning algorithm that builds a hierarchy of clusters for your data. There are two main approaches:

- Agglomerative (bottom-up): starts with each data point as its own cluster and iteratively merges the closest clusters together until all data points belong to a single cluster.
- Divisive (top-down): starts with all data points in one cluster and iteratively splits the most dissimilar clusters into smaller ones until each data point is its own cluster.

How it works (focusing on agglomerative):
1) Distance calculation: choose a distance metric to measure how similar/dissimilar data points or clusters are (e.g., Euclidean distance, Manhattan distance).
2) Merging: find the two closest clusters based on the chosen metric and merge them into a new cluster.
3) Linkage criteria: decide how to calculate the distance between clusters that contain multiple points. Common methods:
   - Single-linkage: uses the distance between the closest pair of points from each cluster.
   - Complete-linkage: uses the distance between the furthest pair of points.
   - Average-linkage: uses the average distance between all point pairs from the two clusters.
4) Repeat: continue merging the closest clusters until the desired number of clusters is reached or a stopping condition is met.

Results: the dendrogram
The output of hierarchical clustering is often visualized as a dendrogram, a tree-like diagram showing the relationships between clusters. Height on the dendrogram represents the distance at which clusters were merged (higher up means clusters were less similar).

Advantages
- No need to pre-specify clusters: the number of clusters emerges naturally from the data.
- Dendrogram provides insights: the dendrogram offers a visual way to understand the hierarchical structure of the data.
- Flexibility: works with various distance metrics and linkage criteria.
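
A minimal agglomerative-clustering sketch with SciPy; the toy points and the 'average' linkage choice are assumptions:
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
X = np.array([[1, 2], [1, 3], [8, 8], [9, 8], [25, 30]], dtype=float)
Z = linkage(X, method="average", metric="euclidean")   # build the merge hierarchy
labels = fcluster(Z, t=3, criterion="maxclust")        # cut the tree into 3 clusters
# scipy.cluster.hierarchy.dendrogram(Z) can plot the hierarchy with matplotlib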

56
Q

cluster sampling

A

Cluster sampling is a probability sampling technique where a population is first divided into naturally occurring groups called clusters. These clusters should ideally be heterogeneous (diverse) within themselves, while being relatively homogeneous (similar) to each other, representing smaller versions of the overall population. Then, instead of selecting random individuals across the whole population, a researcher randomly selects a number of entire clusters.

Cluster sampling has both applications and considerations within the realm of machine learning (ML):

Applications:
Data Reduction: Cluster sampling can reduce the size of massive datasets. Instead of training an ML model on the entire dataset, you can train on representative samples from selected clusters, speeding up training and potentially improving model performance if clusters are well-defined.
Semi-Supervised Learning: In scenarios with limited labeled data, cluster sampling can focus labeling efforts. By labeling a few examples from each cluster, you provide the model with a broader picture of the data distribution, potentially improving its ability to generalize to unseen data.
Anomaly Detection: Cluster sampling can aid in identifying unusual patterns clustered together in specific regions of the dataset, highlighting potential anomalies that would be harder to spot in the entire dataset.

Considerations:
Biases: As with any sampling method, cluster sampling risks introducing bias if the clusters themselves don’t accurately reflect the overall population. Careful cluster definition is crucial to mitigate this.
Heterogeneity within Clusters: If clusters are too diverse internally, samples drawn from them may not be representative. Understanding the internal composition of your clusters is vital.
Computational Cost: While cluster sampling can reduce data size, the process of clustering itself can be computationally expensive for large datasets.

Overall, cluster sampling can be a useful tool in ML, but it’s important to use it strategically: consider the nature of your dataset, whether pre-existing clusters are meaningful, and potential biases before adopting this sampling technique.
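
A rough pandas sketch of one-stage cluster sampling; the 'school_id' cluster column and the number of clusters drawn are assumptions:
import numpy as np
import pandas as pd
df = pd.DataFrame({"school_id": np.repeat([1, 2, 3, 4, 5], 20),
                   "score": np.random.default_rng(0).normal(70, 10, 100)})
rng = np.random.default_rng(1)
chosen = rng.choice(df["school_id"].unique(), size=2, replace=False)   # pick whole clusters at random
sample = df[df["school_id"].isin(chosen)]                              # keep every unit in the chosen clusters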

57
Q

Downsampling

A

Data reduction technique used to decrease the number of samples or observations in a dataset by aggregating or summarizing the data. It involves selecting a subset of data points from the original dataset, typically by random selection or systematic sampling. Downsampling is often used to address class imbalance in machine learning tasks, where the number of samples in one class is significantly higher than in others. By reducing the number of samples in the majority class, downsampling helps to balance the class distribution and improve the performance of classification models.