L2 Flashcards

(33 cards)

1
Q

What are missing values in datasets?

A

Missing values are entries recorded as NaN, null, '' (empty string), or placeholder values in a dataset.

Missing values can affect data analysis and model performance.

2
Q

What is the purpose of missing values imputation?

A

To fill in missing data so that models receive complete datasets.

Most models cannot directly process missing data.

3
Q

What method replaces missing values with the mean or median?

A

Mean/Median imputation.

  • Replaces each missing value with the column mean or median
  • Good for numerical data; fast
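A minimal sketch of this method, assuming scikit-learn's SimpleImputer (the cards use scikit-learn names elsewhere, but no library is prescribed here):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])

# Replace each NaN with its column's mean; strategy="median" gives median imputation
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))  # [[1. 2.] [4. 3.] [7. 2.5]]
```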

4
Q

When is KNN imputation used?

A

When the dataset is small to medium-sized.

  • Find the 'k' nearest neighbors and average their values
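A small sketch with scikit-learn's KNNImputer; the data and n_neighbors value are illustrative:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 4.0], [8.0, 8.0]])

# Fill each missing entry with the average of that feature over the
# k nearest rows (NaN-aware Euclidean distance on observed features)
imputer = KNNImputer(n_neighbors=2)
# The NaN is filled with (2.0 + 4.0) / 2 = 3.0, from its two nearest rows
print(imputer.fit_transform(X))
```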

5
Q

What does model-based imputation involve?

A

Training a regression/classification model to predict missing values.

  • Suitable when relationships exist in the data
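A hand-rolled sketch of the idea, assuming a linear relationship between the made-up "age" and "income" columns:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"age": [25, 32, 47, 51, 38],
                   "income": [30, 40, np.nan, 80, np.nan]})

# Fit a regression on the complete rows, then predict the missing values
known = df.dropna()
model = LinearRegression().fit(known[["age"]], known["income"])
missing = df["income"].isna()
df.loc[missing, "income"] = model.predict(df.loc[missing, ["age"]])
print(df)
```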

6
Q

What is iterative imputation?

A

A method that repeatedly models each feature with missing values using the other features.

  • More accurate but computationally heavy
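A sketch using scikit-learn's IterativeImputer, which implements this round-robin per-feature modelling; note it is still marked experimental and needs an explicit enabling import:

```python
import numpy as np
# IterativeImputer is experimental; this import enables it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, 6.0], [4.0, 8.0],
              [np.nan, 3.0], [7.0, np.nan]])

# Each feature with missing values is regressed on the other features,
# and the fits are repeated until the estimates stabilize
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))
```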

7
Q

What is the purpose of feature selection?

A

Reduces overfitting, speeds up training, improves interpretability, and reduces storage.

Effective feature selection is crucial for model performance.

8
Q

What does univariate statistics do in feature selection?

A

Tests each feature individually for relevance to the target.

Tools include f_classif, f_regression, chi2, and mutual_info_classif.
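For illustration, a univariate selection sketch with SelectKBest and f_classif on a built-in dataset; k=10 is an arbitrary choice:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Score each feature independently against the target (ANOVA F-test)
# and keep the 10 highest-scoring features
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)  # (569, 30) -> (569, 10)
```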

9
Q

What is mutual information used for in feature selection?

A

Part of univariate statistics: it assesses relevance without assuming linearity, making it suitable for non-linear relationships.

It helps identify important features in complex datasets.

10
Q

What is model-based feature selection?

A

Using trained models to determine feature importance.

  • Build a model, then select the features most important to it (the best fit for that particular model)
  • Example models: Lasso regression (L1 penalty) / tree-based models (feature importance from splits)
  • Can be single-pass (a single fit) or iterative
  • Multivariate: linear models assume a linear relationship
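A single-pass sketch, assuming scikit-learn's SelectFromModel wrapped around a Lasso; the alpha value is illustrative:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# Fit a Lasso once (single pass); its L1 penalty drives some coefficients
# to exactly zero, and SelectFromModel keeps the surviving features
selector = SelectFromModel(Lasso(alpha=0.1)).fit(X, y)
print(selector.get_support())  # boolean mask of selected features
```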

11
Q

What is forward selection in iterative model-based methods?

A

Starting with no features and adding the best one at each step.

It helps build a model gradually based on feature importance.

12
Q

What does backward elimination involve?

A

Starting with all features and removing the least useful one at each step.

This method aims to simplify the model by eliminating unnecessary features.
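Both forward selection (previous card) and backward elimination can be sketched with scikit-learn's SequentialFeatureSelector (available since scikit-learn 0.24); the estimator and feature count here are arbitrary choices:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# direction="forward" starts with no features and adds the best one per step;
# direction="backward" starts with all features and drops the least useful one
sfs = SequentialFeatureSelector(
    KNeighborsClassifier(), n_features_to_select=2, direction="forward"
)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features
```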

13
Q

What is Recursive Feature Elimination (RFE)?

A

An automated form of backward removal that repeatedly refits the model and eliminates the weakest features using model feedback.

It is considered an expensive feature selection method.
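A short RFE sketch with scikit-learn; the estimator and target feature count are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Repeatedly fit the model and drop the weakest feature
# until only n_features_to_select remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)
print(rfe.ranking_)  # rank 1 = selected
```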

14
Q

What is categorical data?

A

Data that falls into categories, such as color, country, or product type.

Measurement levels: categorical (nominal) – ordinal – interval – ratio.

Categorical data is crucial for various types of analysis.

15
Q

What is One-Hot Encoding?

A

Adds one binary column per category.

  • Avoids ordinal assumptions
  • Works well for low-cardinality data (few unique values)

Avoid using raw integer labels like 1, 2, 3: they introduce a false order.
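A minimal sketch with pandas' get_dummies (scikit-learn's OneHotEncoder would work equally well):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One binary column per category; no false ordering is implied
print(pd.get_dummies(df, columns=["color"]))
```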

16
Q

What is Count-Based Encoding?

A

Replaces each category with a statistic of the target computed for that category.

  • For regression: the mean target value
  • For classification: class probabilities ("people in this state have likelihood p for class 1")
  • Useful for high-cardinality features (e.g., ZIP codes, countries)
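A hand-rolled sketch with pandas, since the card does not prescribe a library; in practice the per-category statistics should be computed on training data only to avoid target leakage:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "CA", "NY", "NY", "NY"],
    "target": [1, 0, 1, 1, 0],
})

# Replace each category with the mean target of its group; for a binary
# target this is the empirical likelihood of class 1 per state
means = df.groupby("state")["target"].mean()
df["state_encoded"] = df["state"].map(means)
print(df)  # CA -> 0.5, NY -> 0.6667
```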

17
Q

What are digital images made of?

A

Grids (matrices) of pixels.

  • Pixel = the smallest unit; stores brightness or color

18
Q

What is the storage requirement for binary images?

A

1 bit per pixel.

  • Each pixel is either 0 (black) or 1 (white)

19
Q

What is the storage requirement for grayscale images?

A

1 byte (8 bits) per pixel.

  • Pixel values range from 0 to 255 (brightness scale)

20
Q

What is the storage requirement for color (RGB) images?

A

3 bytes per pixel.

  • Each pixel contains three values: Red, Green, and Blue
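As a worked example, an uncompressed 640 × 480 RGB image takes 640 × 480 × 3 = 921,600 bytes, roughly 0.9 MB.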

21
Q

What are the types of image tasks in machine learning?

A
  1. Classification: Label whole image (e.g., Fracture / No fracture)
  2. Detection: Locate object within image
  3. Segmentation: Classify each pixel (e.g., tumor pixels vs normal)

An example is classifying an image as ‘Fracture’ or ‘No fracture’.

22
Q

What is the definition of precision in evaluation metrics?

A

Precision = TP / (TP + FP)

Precision measures the accuracy of positive predictions.

23
Q

What is the definition of recall in evaluation metrics?

A

Recall = TP / (TP + FN).

Recall measures the ability to find all relevant instances.

24
Q

What does F1 Score represent?

A

F1 score = harmonic mean of precision and recall.

F1 = 2 × (precision × recall) / (precision + recall)

It balances both metrics for better evaluation.
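A quick check of all three metrics with scikit-learn; the label vectors are made up for illustration:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 1]

# TP=3, FP=1, FN=1 -> precision 0.75, recall 0.75, F1 0.75
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
```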

25
Q

What is accuracy in evaluation metrics?

A

Accuracy = (TP + TN) / Total

Accuracy can be misleading with class imbalance.
26
Q

What is tokenization in text data processing?

A

Splitting text into words (tokens).

  • The list of tokens becomes the input for additional processing (parsing / text mining)
  • Tokenization can also swap out sensitive data (e.g., a payment card or bank account number is replaced with a randomized number in the same format)

It is a crucial step for further text analysis.
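A naive tokenization sketch using a plain regular expression; real pipelines typically use a library tokenizer:

```python
import re

text = "I passed the IML exam!"

# Naive word tokenizer: lowercase the text and keep runs of letters
tokens = re.findall(r"[a-z]+", text.lower())
print(tokens)  # ['i', 'passed', 'the', 'iml', 'exam']
```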
27
Q

What is stemming?

A

Reducing a word to its root form by trimming (e.g., "studying" → "study").
28
Q

What is lemmatization?

A

Using grammar and vocabulary to reduce a word to its proper base form (e.g., "studies" → "study").
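A sketch contrasting the two, assuming NLTK (which the cards do not name); note that the stemmer's blind trimming can produce non-words:

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")  # the lemmatizer needs WordNet's vocabulary

# Stemming trims suffixes mechanically; the result need not be a real word
print(PorterStemmer().stem("studies"))  # 'studi'

# Lemmatization uses vocabulary and part of speech to find the base form
print(WordNetLemmatizer().lemmatize("studies", pos="v"))  # 'study'
```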
29
Q

What does Bag-of-Words (BoW) represent?

A

Each sentence/document is represented by a vector with a value for each word in the vocabulary.

  • Binary: word present / absent in the document
  • Count: how often the word appears in the document
  • A popular weighting: Term Frequency × Inverse Document Frequency (TF-IDF)
  • Vector length = vocabulary size
  • Useful for sentiment analysis (e.g., recognizing words such as "great", "terrible")

Values can be binary, counts, or weights.
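A BoW sketch with scikit-learn's CountVectorizer on made-up documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie was great", "the movie was terrible"]

# Each document becomes a vector of word counts over the shared vocabulary
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray())
```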
30
Q

What does TF-IDF stand for?

A

Term Frequency-Inverse Document Frequency. It emphasizes rare but important words.

Strengths:
  • Reduces the impact of common words (e.g., "the")
  • Highlights informative terms
  • Highly interpretable: each word is an independent feature

Weaknesses:
  • All structure is lost, so crucial information may be lost (e.g., "I passed the IML exam, but I failed the Computational Linguistics exam")
  • Misspellings are counted as different words (e.g., "machine" vs. "machnie")
  • Some expressions consist of multiple words (e.g., a review containing "worth" reads very differently as "not worth" vs. "definitely worth")
31
Q

What is the formula for TF in TF-IDF?

A

TF(t, d) = (count of term t in document d) / (total terms in d)

It measures how frequently a term appears in a document.
32
Q

What is the formula for IDF in TF-IDF?

A

IDF(t) = log(total docs / # docs with term t)

  • The IDF of a rare word is high; the IDF of a frequent word is low, which highlights distinctive words
  • Combined score: TF-IDF(t, d) = TF × IDF

It highlights the importance of rare words.
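A sketch with scikit-learn's TfidfVectorizer on made-up documents; note that scikit-learn uses a smoothed variant of the IDF formula above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the exam was easy",
    "the exam was hard",
    "the weather was great",
]

# Words shared by every document ("the", "was") receive low weights,
# while rarer, more informative words receive high weights
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```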
33
Q

What is an n-gram in text processing?

A

Multi-word tokens that capture short phrases instead of single words.

  • 1-gram: "not", "worth"
  • 2-gram: "not worth" (captures sentiment better)
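An n-gram sketch, again assuming CountVectorizer; ngram_range=(1, 2) keeps both single words and adjacent word pairs:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["definitely worth watching", "not worth watching"]

# ngram_range=(1, 2) keeps 1-grams and 2-grams,
# so "not worth" survives as its own feature
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(docs)
print(vectorizer.get_feature_names_out())
```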