Chapter 3 Flashcards

(120 cards)

1
Q

A machine learning pipeline is a method for fully automating a machine learning task’s ______.

A

workflow

2
Q

A general ML pipeline consists of data input, data models, ______, and predicted outcomes.

A

parameters

3
Q

The data analysis process includes Data Extraction, Data Preparation, Data Exploration & Visualization, Predictive Modeling, Model Validation, and ______.

A

Deploy

4
Q

A machine learning pipeline is a series of interconnected data processing and modeling steps designed to automate, standardize and streamline the process of building, training, evaluating and ______ machine learning models.

A

deploying

5
Q

Types of Data Sources include Databases (SQL, NoSQL), APIs and Web Scraping, IoT Devices and Sensors, and ______ Data Sets.

A

Public

6
Q

Key considerations for Data Collection are Data Relevance, Data Volume, and Data ______.

A

Quality

7
Q

Data Validation involves ensuring data integrity, ______, and consistency before modeling.

A

accuracy

8
Q

Verifying that numerical values fall within expected ranges is a part of Data ______.

A

Validation

9
Q

Data Cleaning involves handling ______ values and correcting errors.

A

missing

10
Q

Combining data from multiple sources is known as Data ______.

A

Integration

11
Q

Normalization, standardization, and encoding categorical variables are parts of Data ______.

A

Transformation

12
Q

Dimensionality reduction and feature selection are components of Data ______.

A

Reduction

13
Q

Choosing the most appropriate machine learning model for the task is called Model ______.

A

Selection

14
Q

Factors to consider in Model Selection include Type of Problem, Data Characteristics, Model Complexity vs. Interpretability, and ______ Metrics.

A

Performance

15
Q

Linear Models include Logistic Regression and ______ Regression.

A

Linear

16
Q

Decision Trees, Random Forests, and XGBoost are examples of ______-Based Models.

A

Tree

17
Q

Simple NN, CNN, and RNN are types of ______ Networks.

A

Neural

18
Q

Key metrics for classification model evaluation include Accuracy, Precision, Recall, F1 Score, and ______.

A

ROC-AUC

19
Q

For regression model evaluation, common metrics are MSE and ______.

A

RMSE

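A minimal sketch of computing the classification and regression metrics from the two cards above with scikit-learn; the label and prediction arrays are made-up toy values, used only to show the calls.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, mean_squared_error)

# Classification: true labels, hard predictions, and predicted probabilities.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
y_prob = [0.2, 0.9, 0.4, 0.1, 0.8]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_prob))

# Regression: MSE and its square root, RMSE.
y_true_r = [3.0, 5.0, 2.5]
y_pred_r = [2.8, 5.4, 2.1]
mse = mean_squared_error(y_true_r, y_pred_r)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
```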
20
Q

______-Validation is important for validating model performance on unseen data.

A

Cross

21
Q

Identifying and mitigating Overfitting vs. Underfitting is crucial in Model ______.

A

Evaluation

22
Q

Model ______ is a critical stage where the performance of a trained model is evaluated and interpreted.

A

Analysis

23
Q

Model ______ involves evaluating the model’s performance on a validation dataset to ensure it generalizes well to unseen data.

A

Validation

24
Q

Model ______ is the process of taking a trained machine learning model and making it available for use in a production environment.

A

Deployment

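One common deployment pattern is to serialize the trained model and load it inside a serving process; a minimal sketch using joblib, where the model, dataset, and file name are illustrative choices rather than anything prescribed by the chapter.

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

joblib.dump(model, "model.joblib")    # done once, at training time

loaded = joblib.load("model.joblib")  # done inside the (hypothetical) serving process
print(loaded.predict(X[:3]))          # serve predictions for incoming rows
```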
25
After deployment, it's important to continuously ______ the model's performance and retrain it as needed.
monitor
26
______ engineering is the process of creating new features or selecting relevant features from the data that can improve the model's predictive power.
Feature
27
______ extraction involves transforming data into a set of features that better represent the underlying patterns.
Feature
28
Transforming continuous, numerical values into categorical features is a technique called ______.
Binning
29
______ encoding maps categorical features to binary representations.
One-hot
30
PCA and SVD are techniques for ______ Reduction.
Dimensionality
31
Bag of Words and TF-IDF are examples of ______ Feature Extraction.
Text
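A minimal sketch of both text feature extractors in scikit-learn; the two example sentences are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Bag of Words: raw term counts per document.
bow = CountVectorizer()
print(bow.fit_transform(docs).toarray())
print(bow.get_feature_names_out())

# TF-IDF: term counts reweighted by inverse document frequency.
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray().round(2))
```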
32
______ selection denotes techniques for selecting a subset of the most relevant features to represent a model.
Feature
33
The ______ Coefficient measures the correlation between features and the target variable.
Correlation
34
The ______-Square Test assesses the relationship between categorical features and the target variable.
Chi
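A minimal sketch of the Chi-Square test used for feature selection via scikit-learn's SelectKBest; chi2 requires non-negative feature values, and the dataset and k here are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)   # all features are non-negative

# Score each feature's dependence on the categorical target and keep the top 2.
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print("chi2 scores:", selector.scores_.round(2))
print("kept shape :", X_selected.shape)
```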
35
______ Selection starts with no features and adds them one by one based on their contribution to model performance.
Forward
36
______ Elimination starts with all features and removes them one by one based on their impact on model performance.
Backward
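Both wrapper strategies from the two cards above are available in scikit-learn as SequentialFeatureSelector; a minimal sketch, where the estimator and the number of features to keep are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
est = LogisticRegression(max_iter=1000)

# Forward Selection: start with no features, add the ones that help most.
fwd = SequentialFeatureSelector(est, n_features_to_select=2, direction="forward")
# Backward Elimination: start with all features, drop the ones that hurt least.
bwd = SequentialFeatureSelector(est, n_features_to_select=2, direction="backward")

print("forward keeps :", fwd.fit(X, y).get_support())
print("backward keeps:", bwd.fit(X, y).get_support())
```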
37
Feature ______ involves standardizing or normalizing the range of independent variables so that they are on a comparable scale.
scaling
38
Min-Max Scaling formula is $X_{scaled}=\frac{X_{i}-X_{min}}{X_{max}-______}$.
$X_{min}$
39
Normalization formula is $X_{scaled}=\frac{X_{i}-X_{mean}}{X_{max}-______}$.
$X_{min}$
40
Standardization formula is $X_{scaled}=\frac{X_{i}-X_{mean}}{______}$.
$\sigma$
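The three scaling formulas above, written out as a small NumPy sketch; the input vector is arbitrary example data.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])   # arbitrary example values

# Min-Max Scaling: (x_i - x_min) / (x_max - x_min), mapped into [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Normalization (mean normalization): (x_i - x_mean) / (x_max - x_min).
normalized = (x - x.mean()) / (x.max() - x.min())

# Standardization: (x_i - x_mean) / sigma, giving mean 0 and std 1.
standardized = (x - x.mean()) / x.std()

print(min_max, normalized, standardized, sep="\n")
```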
41
Data ______ techniques can be applied to achieve a reduced representation of the data set.
reduction
42
The 'Curse of ______' refers to the obstacle that high dimensionality poses for the efficiency of most DM algorithms.
Dimensionality
43
Principal Components Analysis (PCA) is useful when there are too many independent variables and they show high ______.
correlation
44
The procedure for PCA includes normalizing input data, computing k orthonormal vectors (principal components), sorting PCs by their eigenvalues, and reducing data by removing ______ components.
weaker
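A minimal sketch of that procedure with scikit-learn's PCA, keeping only enough components to retain roughly 95% of the variance; the dataset is an illustrative built-in set.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# 1. Normalize the input data so each feature contributes comparably.
X_std = StandardScaler().fit_transform(X)

# 2-4. Compute principal components, sorted by explained variance, and keep
#      only enough of them to retain ~95% of the variance (weaker ones dropped).
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_std)

print("kept components:", pca.n_components_)
print("variance ratios:", pca.explained_variance_ratio_.round(3))
```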
45
In feature transformation, ______ essentially transforms continuous, numerical values into categorical features.
binning
46
One-hot encoding is the inverse of ______; it creates numerical features from categorical variables.
binning
47
Singular Value Decomposition (SVD) is a method to reduce dimensionality by decomposing the data ______ into component matrices.
matrix
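A minimal NumPy sketch of truncating an SVD to reduce dimensionality; the data matrix and the number of retained singular values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))          # illustrative data matrix

# Decompose X into U, singular values s, and V^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k strongest singular values/components.
k = 2
X_reduced = U[:, :k] * s[:k]         # reduced representation (6 x 2)
X_approx = X_reduced @ Vt[:k, :]     # rank-k reconstruction of X

print(X_reduced.shape, np.linalg.norm(X - X_approx).round(3))
```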
48
Image Feature Extraction often uses ______.
CNN
49
Feature selection aims to remove irrelevant, redundant, or ______ features.
noisy
50
The sampling distribution of the sample mean often follows a ______ Distribution.
Normal
51
A machine learning ______ is a method for fully automating a machine learning task's workflow.
pipeline
52
Data ______ is the first step in a typical machine learning pipeline.
Collection
53
Model ______ is the stage that follows model training and may lead to further refinement.
Analysis
54
Continuously monitoring the model's performance and retraining it as needed is part of ______ and maintenance.
monitoring
55
Manual feature design, deep networks, trees, and clustering are examples of ______ and model development approaches.
Algorithm
56
A ______ can be used to transform data into a higher dimension to make it linearly separable.
Kernel
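A minimal sketch of the kernel idea with scikit-learn's SVC: two concentric circles are not linearly separable in the original space, while an RBF kernel implicitly maps them to a higher dimension where they are; the dataset parameters are illustrative.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)  # implicit higher-dimensional mapping

print("linear kernel accuracy:", round(linear_acc, 2))
print("RBF kernel accuracy   :", round(rbf_acc, 2))
```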
57
Feature ______ is a crucial step in machine learning where relevant information is derived from raw data.
extraction
58
Feature ______ aims to select a subset of the most relevant features.
selection
59
Projecting original dimensions to new, fewer dimensions is a characteristic of feature ______.
extraction
60
Choosing important features and ignoring the rest is characteristic of feature ______.
selection
61
Linear Discriminant Analysis (LDA) is a technique for feature ______.
extraction
62
Scaling features ensures that each characteristic is given the same ______ during the learning process.
consideration
63
As the number of dimensions increases, the sample size needs to increase ______ in order to obtain an effective estimate of multivariate densities.
exponentially
64
PCA helps to find a set of ______ transformations of the original variables.
linear
65
In PCA, principal components are sorted according to their strength, given by their ______.
eigenvalues
66
The goal of data reduction is to provide the mining process with a mechanism to produce the same (or almost the same) outcome when it is applied over ______ data.
reduced
67
Ensuring data points are logically consistent is part of data ______.
validation
68
Correcting errors in data is part of data ______.
cleaning
69
The process of converting one feature type into another, more readable form for a particular model is called feature ______.
transformation
70
Equal-Frequency Binning and Equal ______ Binning are two methods for partitioning data.
Width
71
Smoothing by bin mean, smoothing by bin median, and smoothing by bin ______ are techniques used after binning.
boundaries
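A minimal pandas sketch of the two partitioning methods and smoothing by bin mean; the values and bin count are arbitrary examples.

```python
import pandas as pd

values = pd.Series([1, 3, 4, 7, 8, 12, 15, 21, 22, 30])  # arbitrary example

# Equal-Width Binning: bins span equal value ranges.
equal_width = pd.cut(values, bins=3)

# Equal-Frequency Binning: bins hold (roughly) equal numbers of points.
equal_freq = pd.qcut(values, q=3)

# Smoothing by bin mean: replace each value with the mean of its bin.
smoothed = values.groupby(equal_width, observed=True).transform("mean")

print(pd.DataFrame({"value": values, "width_bin": equal_width,
                    "freq_bin": equal_freq, "bin_mean": smoothed}))
```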
72
If a categorical feature 'color' has categories Red, Green, Yellow, the one-hot encoded vector for Red would be ______.
[1,0,0]
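The same Red/Green/Yellow example as a minimal pandas sketch, keeping the card's category order so that Red maps to [1, 0, 0].

```python
import pandas as pd

# Keep the card's category order: Red, Green, Yellow.
colors = pd.Categorical(["Red", "Green", "Yellow", "Red"],
                        categories=["Red", "Green", "Yellow"])

# One binary column per category; the first row (Red) becomes 1, 0, 0.
one_hot = pd.get_dummies(colors).astype(int)
print(one_hot)
```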
73
The ______ of dimensionality becomes a serious obstacle for the efficiency of most Data Mining algorithms.
Curse
74
It is usual to keep only the first few principal components that may contain ______% or more of the variance of the original data set.
95
75
Model ______ is the stage where a trained model is made available for use in a production environment.
deployment
76
Data ______ involves verifying that numerical values fall within expected ranges.
validation
77
Handling missing values is a key part of data ______.
cleaning
78
______ is a technique that transforms continuous numerical values into categorical features.
Binning
79
The process of selecting a subset of relevant features is known as feature ______.
selection
80
Min-Max scaling rescales features to a range, typically between 0 and ______.
1
81
Standardization scales data to have a mean of 0 and a standard deviation of ______.
1
82
When features are scaled, machine learning methods perform better or converge more ______.
quickly
83
Without scaling, ______ scale features could dominate the learning, producing skewed outcomes.
bigger
84
PCA is useful when independent variables show high ______.
correlation
85
In PCA, weaker components are ______ to reduce data.
removed
86
A general ML pipeline includes data input, data models, parameters, and predicted ______.
outcomes
87
Data ______ is the initial phase where data is gathered from various sources.
Collection
88
Data ______ includes cleaning, integration, transformation, and reduction.
Pre-processing
89
Model ______ involves choosing the best algorithm for the given problem and data.
Selection
90
Evaluating a model's performance on unseen data is done during model ______.
validation
91
The final step of putting a model into practical use is model ______.
deployment
92
Ensuring data accuracy and consistency is the goal of data ______.
validation
93
______ is a data pre-processing step that handles missing data and errors.
Data Cleaning
94
Combining data from different sources is called data ______.
integration
95
Feature ______ is the process of creating new features from existing ones.
engineering
96
ROC-AUC is a common metric for ______ models.
classification
97
MSE (Mean Squared Error) is a common metric for ______ models.
regression
98
______ is a technique to prevent overfitting by splitting data into training and testing sets multiple times.
Cross-Validation
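A minimal sketch of k-fold cross-validation with scikit-learn; the 5-fold split, estimator, and dataset are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Split the data into 5 folds; each fold serves once as the held-out test set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print("fold accuracies:", scores.round(3))
print("mean accuracy  :", scores.mean().round(3))
```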
99
When a model performs well on training data but poorly on unseen data, it is said to be ______.
overfitting
100
When a model is too simple and performs poorly on both training and unseen data, it is ______.
underfitting
101
The trade-off between a model's ability to minimize bias and variance is known as the ______-variance tradeoff.
bias
102
Continuously checking a deployed model's performance is called model ______.
monitoring
103
______ transforms data by subtracting the mean and dividing by the standard deviation.
Standardization
104
______ scales data to a specific range, often [0, 1].
Min-Max Scaling
105
The problem of having too many features, which can degrade model performance, is called the curse of ______.
dimensionality
106
______ is a dimensionality reduction technique that identifies principal components capturing the most variance.
PCA (Principal Component Analysis)
107
______ is a feature selection method that starts with all features and iteratively removes the least important ones.
Backward Elimination
108
______ is a feature selection method that starts with no features and iteratively adds the most important ones.
Forward Selection
109
The ______ test is used to assess the relationship between categorical features and a categorical target variable.
Chi-Square
110
Public datasets like UCI and ______ are common sources for machine learning data.
Kaggle
111
APIs and Web ______ are methods for collecting data from online sources.
Scraping
112
IoT Devices and ______ can be significant sources of real-time data.
Sensors
113
Model complexity versus ______ is an important factor in model selection.
interpretability
114
Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Naive Bayes are other types of machine learning ______.
Models
115
The process of adjusting model parameters to optimize performance is known as model ______.
tuning
116
Model ______ provides insights back to the data collection phase if issues are found.
feedback
117
A ______ machine learning pipeline automates the workflow from data to deployment.
standard
118
Feature ______ involves deriving new features from the original ones to improve model performance.
extraction
119
The step after data collection and before data preprocessing is often data ______.
validation
120
Model ______ ensures the model remains accurate and reliable in a real-world setting.
maintenance