Chapter 3 Flashcards

(120 cards)

1
Q

A machine learning pipeline is a method for fully automating a machine learning task’s ______.

A

workflow

2
Q

A general ML pipeline consists of data input, data models, ______, and predicted outcomes.

A

parameters

3
Q

The data analysis process includes Data Extraction, Data Preparation, Data Exploration & Visualization, Predictive Modeling, Model Validation, and ______.

A

Deploy

4
Q

A machine learning pipeline is a series of interconnected data processing and modeling steps designed to automate, standardize and streamline the process of building, training, evaluating and ______ machine learning models.

A

deploying

5
Q

Types of Data Sources include Databases (SQL, NoSQL), APIs and Web Scraping, IoT Devices and Sensors, and ______ Data Sets.

A

Public

6
Q

Key considerations for Data Collection are Data Relevance, Data Volume, and Data ______.

A

Quality

7
Q

Data Validation involves ensuring data integrity, ______, and consistency before modeling.

A

accuracy

8
Q

Verifying that numerical values fall within expected ranges is a part of Data ______.

A

Validation

9
Q

Data Cleaning involves handling ______ values and correcting errors.

A

missing

10
Q

Combining data from multiple sources is known as Data ______.

A

Integration

11
Q

Normalization, standardization, and encoding categorical variables are parts of Data ______.

A

Transformation

12
Q

Dimensionality reduction and feature selection are components of Data ______.

A

Reduction

13
Q

Choosing the most appropriate machine learning model for the task is called Model ______.

A

Selection

14
Q

Factors to consider in Model Selection include Type of Problem, Data Characteristics, Model Complexity vs. Interpretability, and ______ Metrics.

A

Performance

15
Q

Linear Models include Logistic Regression and ______ Regression.

A

Linear

16
Q

Decision Trees, Random Forests, and XGBoost are examples of ______-Based Models.

A

Tree

17
Q

Simple NN, CNN, and RNN are types of ______ Networks.

A

Neural

18
Q

Key metrics for classification model evaluation include Accuracy, Precision, Recall, F1 Score, and ______.

A

ROC-AUC

19
Q

For regression model evaluation, common metrics are MSE and ______.

A

RMSE

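A minimal sketch of computing the classification and regression metrics from the two cards above with scikit-learn; the label and prediction arrays are made-up toy values, used only to show the calls.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, mean_squared_error)

# Classification: true labels, hard predictions, and predicted probabilities.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
y_prob = [0.2, 0.9, 0.4, 0.1, 0.8]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_prob))

# Regression: MSE and its square root, RMSE.
y_true_r = [3.0, 5.0, 2.5]
y_pred_r = [2.8, 5.4, 2.1]
mse = mean_squared_error(y_true_r, y_pred_r)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
```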
20
Q

______-Validation is important for validating model performance on unseen data.

A

Cross

21
Q

Identifying and mitigating Overfitting vs. Underfitting is crucial in Model ______.

A

Evaluation

22
Q

Model ______ is a critical stage where the performance of a trained model is evaluated and interpreted.

A

Analysis

23
Q

Model ______ involves evaluating the model’s performance on a validation dataset to ensure it generalizes well to unseen data.

A

Validation

24
Q

Model ______ is the process of taking a trained machine learning model and making it available for use in a production environment.

A

Deployment

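One common deployment pattern is to serialize the trained model and load it inside a serving process; a minimal sketch using joblib, where the model, dataset, and file name are illustrative choices rather than anything prescribed by the chapter.

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

joblib.dump(model, "model.joblib")    # done once, at training time

loaded = joblib.load("model.joblib")  # done inside the (hypothetical) serving process
print(loaded.predict(X[:3]))          # serve predictions for incoming rows
```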
25
After deployment, it's important to continuously ______ the model's performance and retrain it as needed.
monitor
26
______ engineering is the process of creating new features or selecting relevant features from the data that can improve the model's predictive power.
Feature
27
______ extraction involves transforming data into a set of features that better represent the underlying patterns.
Feature
28
Transforming continuous, numerical values into categorical features is a technique called ______.
Binning
29
______ encoding maps categorical features to binary representations.
One-hot
30
PCA and SVD are techniques for ______ Reduction.
Dimensionality
31
Bag of Words and TF-IDF are examples of ______ Feature Extraction.
Text
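A minimal sketch of both text feature extractors in scikit-learn; the two example sentences are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Bag of Words: raw term counts per document.
bow = CountVectorizer()
print(bow.fit_transform(docs).toarray())
print(bow.get_feature_names_out())

# TF-IDF: term counts reweighted by inverse document frequency.
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray().round(2))
```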
32
______ selection denotes techniques for selecting a subset of the most relevant features to represent a model.
Feature
33
The ______ Coefficient measures the correlation between features and the target variable.
Correlation
34
The ______-Square Test assesses the relationship between categorical features and the target variable.
Chi
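A minimal sketch of the Chi-Square test used for feature selection via scikit-learn's SelectKBest; chi2 requires non-negative feature values, and the dataset and k here are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)   # all features are non-negative

# Score each feature's dependence on the categorical target and keep the top 2.
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print("chi2 scores:", selector.scores_.round(2))
print("kept shape :", X_selected.shape)
```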
35
______ Selection starts with no features and adds them one by one based on their contribution to model performance.
Forward
36
______ Elimination starts with all features and removes them one by one based on their impact on model performance.
Backward
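Both wrapper strategies from the two cards above are available in scikit-learn as SequentialFeatureSelector; a minimal sketch, where the estimator and the number of features to keep are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
est = LogisticRegression(max_iter=1000)

# Forward Selection: start with no features, add the ones that help most.
fwd = SequentialFeatureSelector(est, n_features_to_select=2, direction="forward")
# Backward Elimination: start with all features, drop the ones that hurt least.
bwd = SequentialFeatureSelector(est, n_features_to_select=2, direction="backward")

print("forward keeps :", fwd.fit(X, y).get_support())
print("backward keeps:", bwd.fit(X, y).get_support())
```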
37
Feature ______ involves standardizing or normalizing the range of independent variables so that they are on a comparable scale.
scaling
38
Min-Max Scaling formula is $X_{scaled}=\frac{X_{i}-X_{min}}{X_{max}-______}$.
$X_{min}$
39
Normalization formula is $X_{scaled}=\frac{X_{i}-X_{mean}}{X_{max}-______}$.
$X_{min}$
40
Standardization formula is $X_{scaled}=\frac{X_{i}-X_{mean}}{______}$.
$\sigma$
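The three scaling formulas above, written out as a small NumPy sketch; the input vector is arbitrary example data.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])   # arbitrary example values

# Min-Max Scaling: (x_i - x_min) / (x_max - x_min), mapped into [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Normalization (mean normalization): (x_i - x_mean) / (x_max - x_min).
normalized = (x - x.mean()) / (x.max() - x.min())

# Standardization: (x_i - x_mean) / sigma, giving mean 0 and std 1.
standardized = (x - x.mean()) / x.std()

print(min_max, normalized, standardized, sep="\n")
```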
41
Data ______ techniques can be applied to achieve a reduced representation of the data set.
reduction
42
The 'Curse of ______' refers to the obstacle that high dimensionality poses for the efficiency of most DM algorithms.
Dimensionality
43
Principal Components Analysis (PCA) is useful when there are too many independent variables and they show high ______.
correlation
44
The procedure for PCA includes normalizing input data, computing k orthonormal vectors (principal components), sorting PCs by their eigenvalues, and reducing data by removing ______ components.
weaker
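A minimal sketch of that procedure with scikit-learn's PCA, keeping only enough components to retain roughly 95% of the variance; the dataset is an illustrative built-in set.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# 1. Normalize the input data so each feature contributes comparably.
X_std = StandardScaler().fit_transform(X)

# 2-4. Compute principal components, sorted by explained variance, and keep
#      only enough of them to retain ~95% of the variance (weaker ones dropped).
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_std)

print("kept components:", pca.n_components_)
print("variance ratios:", pca.explained_variance_ratio_.round(3))
```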
45
In feature transformation, ______ essentially transforms continuous, numerical values into categorical features.
binning
46
One-hot encoding is the inverse of ______; it creates numerical features from categorical variables.
binning
47
Singular Value Decomposition (SVD) is a method to reduce dimensionality by decomposing the data ______ into component matrices.
matrix
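A minimal NumPy sketch of truncating an SVD to reduce dimensionality; the data matrix and the number of retained singular values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))          # illustrative data matrix

# Decompose X into U, singular values s, and V^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k strongest singular values/components.
k = 2
X_reduced = U[:, :k] * s[:k]         # reduced representation (6 x 2)
X_approx = X_reduced @ Vt[:k, :]     # rank-k reconstruction of X

print(X_reduced.shape, np.linalg.norm(X - X_approx).round(3))
```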
48
Image Feature Extraction often uses ______.
CNN
49
Feature selection aims to remove irrelevant, redundant, or ______ features.
noisy
50
The sampling distribution of the sample mean often follows a ______ Distribution.
Normal
51
A machine learning ______ is a method for fully automating a machine learning task's workflow.
pipeline
52
Data ______ is the first step in a typical machine learning pipeline.
Collection
53
Model ______ is the stage that follows model training and may lead to further refinement.
Analysis
54
Continuously monitoring the model's performance and retraining it as needed is part of ______ and maintenance.
monitoring
55
Manual feature design, deep networks, trees, and clustering are examples of ______ and model development approaches.
Algorithm
56
A ______ can be used to transform data into a higher dimension to make it linearly separable.
Kernel
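A minimal sketch of the kernel idea with scikit-learn's SVC: two concentric circles are not linearly separable in the original space, while an RBF kernel implicitly maps them to a higher dimension where they are; the dataset parameters are illustrative.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)  # implicit higher-dimensional mapping

print("linear kernel accuracy:", round(linear_acc, 2))
print("RBF kernel accuracy   :", round(rbf_acc, 2))
```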
57
Feature ______ is a crucial step in machine learning where relevant information is derived from raw data.
extraction
58
Feature ______ aims to select a subset of the most relevant features.
selection
59
Projecting original dimensions to new, fewer dimensions is a characteristic of feature ______.
extraction
60
Choosing important features and ignoring the rest is characteristic of feature ______.
selection
61
Linear Discriminant Analysis (LDA) is a technique for feature ______.
extraction
62
Scaling features ensures that each characteristic is given the same ______ during the learning process.
consideration
63
As the number of dimensions increases, the sample size needs to increase ______ in order to obtain an effective estimate of multivariate densities.
exponentially
64
PCA helps to find a set of ______ transformations of the original variables.
linear
65
In PCA, principal components are sorted according to their strength, given by their ______.
eigenvalues
66
The goal of data reduction is to provide the mining process with a mechanism to produce the same (or almost the same) outcome when it is applied over ______ data.
reduced
67
Ensuring data points are logically consistent is part of data ______.
validation
68
Correcting errors in data is part of data ______.
cleaning
69
The process of converting one feature type into another, more readable form for a particular model is called feature ______.
transformation
70
Equal-Frequency Binning and Equal ______ Binning are two methods for partitioning data.
Width
71
Smoothing by bin mean, smoothing by bin median, and smoothing by bin ______ are techniques used after binning.
boundaries
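A minimal pandas sketch of the two partitioning methods and smoothing by bin mean; the values and bin count are arbitrary examples.

```python
import pandas as pd

values = pd.Series([1, 3, 4, 7, 8, 12, 15, 21, 22, 30])  # arbitrary example

# Equal-Width Binning: bins span equal value ranges.
equal_width = pd.cut(values, bins=3)

# Equal-Frequency Binning: bins hold (roughly) equal numbers of points.
equal_freq = pd.qcut(values, q=3)

# Smoothing by bin mean: replace each value with the mean of its bin.
smoothed = values.groupby(equal_width, observed=True).transform("mean")

print(pd.DataFrame({"value": values, "width_bin": equal_width,
                    "freq_bin": equal_freq, "bin_mean": smoothed}))
```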
72
If a categorical feature 'color' has categories Red, Green, Yellow, the one-hot encoded vector for Red would be ______.
[1,0,0]
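The same Red/Green/Yellow example as a minimal pandas sketch, keeping the card's category order so that Red maps to [1, 0, 0].

```python
import pandas as pd

# Keep the card's category order: Red, Green, Yellow.
colors = pd.Categorical(["Red", "Green", "Yellow", "Red"],
                        categories=["Red", "Green", "Yellow"])

# One binary column per category; the first row (Red) becomes 1, 0, 0.
one_hot = pd.get_dummies(colors).astype(int)
print(one_hot)
```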
73
The ______ of dimensionality becomes a serious obstacle for the efficiency of most Data Mining algorithms.
Curse
74
It is usual to keep only the first few principal components that may contain ______% or more of the variance of the original data set.
95
75
Model ______ is the stage where a trained model is made available for use in a production environment.
deployment
76
Data ______ involves verifying that numerical values fall within expected ranges.
validation
77
Handling missing values is a key part of data ______.
cleaning
78
______ is a technique that transforms continuous numerical values into categorical features.
Binning
79
The process of selecting a subset of relevant features is known as feature ______.
selection
80
Min-Max scaling rescales features to a range, typically between 0 and ______.
1
81
Standardization scales data to have a mean of 0 and a standard deviation of ______.
1
82
When features are scaled, machine learning methods perform better or converge more ______.
quickly
83
Without scaling, ______ scale features could dominate the learning, producing skewed outcomes.
bigger
84
PCA is useful when independent variables show high ______.
correlation
85
In PCA, weaker components are ______ to reduce data.
removed
86
A general ML pipeline includes data input, data models, parameters, and predicted ______.
outcomes
87
Data ______ is the initial phase where data is gathered from various sources.
Collection
88
Data ______ includes cleaning, integration, transformation, and reduction.
Pre-processing
89
Model ______ involves choosing the best algorithm for the given problem and data.
Selection
90
Evaluating a model's performance on unseen data is done during model ______.
validation
91
The final step of putting a model into practical use is model ______.
deployment
92
Ensuring data accuracy and consistency is the goal of data ______.
validation
93
______ is a data pre-processing step that handles missing data and errors.
Data Cleaning
94
Combining data from different sources is called data ______.
integration
95
Feature ______ is the process of creating new features from existing ones.
engineering
96
ROC-AUC is a common metric for ______ models.
classification
97
MSE (Mean Squared Error) is a common metric for ______ models.
regression
98
______ is a technique to prevent overfitting by splitting data into training and testing sets multiple times.
Cross-Validation
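A minimal sketch of k-fold cross-validation with scikit-learn; the 5-fold split, estimator, and dataset are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Split the data into 5 folds; each fold serves once as the held-out test set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print("fold accuracies:", scores.round(3))
print("mean accuracy  :", scores.mean().round(3))
```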
99
When a model performs well on training data but poorly on unseen data, it is said to be ______.
overfitting
100
When a model is too simple and performs poorly on both training and unseen data, it is ______.
underfitting
101
The trade-off between a model's ability to minimize bias and variance is known as the ______-variance tradeoff.
bias
102
Continuously checking a deployed model's performance is called model ______.
monitoring
103
______ transforms data by subtracting the mean and dividing by the standard deviation.
Standardization
104
______ scales data to a specific range, often [0, 1].
Min-Max Scaling
105
The problem of having too many features, which can degrade model performance, is called the curse of ______.
dimensionality
106
______ is a dimensionality reduction technique that identifies principal components capturing the most variance.
PCA (Principal Component Analysis)
107
______ is a feature selection method that starts with all features and iteratively removes the least important ones.
Backward Elimination
108
______ is a feature selection method that starts with no features and iteratively adds the most important ones.
Forward Selection
109
The ______ test is used to assess the relationship between categorical features and a categorical target variable.
Chi-Square
110
Public datasets like UCI and ______ are common sources for machine learning data.
Kaggle
111
APIs and Web ______ are methods for collecting data from online sources.
Scraping
112
IoT Devices and ______ can be significant sources of real-time data.
Sensors
113
Model complexity versus ______ is an important factor in model selection.
interpretability
114
Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Naive Bayes are other types of machine learning ______.
Models
115
The process of adjusting model parameters to optimize performance is known as model ______.
tuning
116
Model ______ provides insights back to the data collection phase if issues are found.
feedback
117
A ______ machine learning pipeline automates the workflow from data to deployment.
standard
118
Feature ______ involves deriving new features from the original ones to improve model performance.
extraction
119
The step after data collection and before data preprocessing is often data ______.
validation
120
Model ______ ensures the model remains accurate and reliable in a real-world setting.
maintenance