L2 Flashcards

(33 cards)

1
Q

What are missing values in datasets?

A

Missing values are entries recorded as NaN, null, '' (empty string), or placeholder values in a dataset.

Missing values can affect data analysis and model performance.

2
Q

What is the purpose of missing values imputation?

A

To fill in missing data so that models receive complete datasets.

Most models cannot directly process missing data.

3
Q

What method replaces missing values with the mean or median?

A

Mean/Median imputation.

  • Replaces each missing value with the column mean or median
  • Good for numerical data; fast
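A minimal sketch of this method, assuming scikit-learn's SimpleImputer (the cards use scikit-learn names elsewhere, but no library is prescribed here):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])

# Replace each NaN with its column's mean; strategy="median" gives median imputation
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))  # [[1. 2.] [4. 3.] [7. 2.5]]
```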

4
Q

When is KNN imputation used?

A

When the dataset is small to medium-sized.

  • Find the 'k' nearest neighbors and average their values
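A small sketch with scikit-learn's KNNImputer; the data and n_neighbors value are illustrative:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 4.0], [8.0, 8.0]])

# Fill each missing entry with the average of that feature over the
# k nearest rows (NaN-aware Euclidean distance on observed features)
imputer = KNNImputer(n_neighbors=2)
# The NaN is filled with (2.0 + 4.0) / 2 = 3.0, from its two nearest rows
print(imputer.fit_transform(X))
```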

5
Q

What does model-based imputation involve?

A

Training a regression/classification model to predict missing values.

  • Suitable when relationships exist in the data
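A hand-rolled sketch of the idea, assuming a linear relationship between the made-up "age" and "income" columns:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"age": [25, 32, 47, 51, 38],
                   "income": [30, 40, np.nan, 80, np.nan]})

# Fit a regression on the complete rows, then predict the missing values
known = df.dropna()
model = LinearRegression().fit(known[["age"]], known["income"])
missing = df["income"].isna()
df.loc[missing, "income"] = model.predict(df.loc[missing, ["age"]])
print(df)
```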

6
Q

What is iterative imputation?

A

A method that repeatedly models each feature with missing values using the other features.

  • More accurate but computationally heavy
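A sketch using scikit-learn's IterativeImputer, which implements this round-robin per-feature modelling; note it is still marked experimental and needs an explicit enabling import:

```python
import numpy as np
# IterativeImputer is experimental; this import enables it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, 6.0], [4.0, 8.0],
              [np.nan, 3.0], [7.0, np.nan]])

# Each feature with missing values is regressed on the other features,
# and the fits are repeated until the estimates stabilize
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))
```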

7
Q

What is the purpose of feature selection?

A

Reduces overfitting, speeds up training, improves interpretability, and reduces storage.

Effective feature selection is crucial for model performance.

8
Q

What does univariate statistics do in feature selection?

A

Tests each feature individually for relevance to the target.

Tools include f_classif, f_regression, chi2, and mutual_info_classif.
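For illustration, a univariate selection sketch with SelectKBest and f_classif on a built-in dataset; k=10 is an arbitrary choice:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Score each feature independently against the target (ANOVA F-test)
# and keep the 10 highest-scoring features
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)  # (569, 30) -> (569, 10)
```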

9
Q

What is mutual information used for in feature selection?

A

Part of univariate statistics: it assesses relevance without assuming linearity, making it suitable for non-linear relationships.

It helps identify important features in complex datasets.

10
Q

What is model-based feature selection?

A

Using trained models to determine feature importance.

  • Build a model, then select the features most important to it (the best fit for that particular model)
  • Example models: Lasso regression (L1 penalty) / tree-based models (feature importance from splits)
  • Can be single-pass (a single fit) or iterative
  • Multivariate: linear models assume a linear relationship
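A single-pass sketch, assuming scikit-learn's SelectFromModel wrapped around a Lasso; the alpha value is illustrative:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# Fit a Lasso once (single pass); its L1 penalty drives some coefficients
# to exactly zero, and SelectFromModel keeps the surviving features
selector = SelectFromModel(Lasso(alpha=0.1)).fit(X, y)
print(selector.get_support())  # boolean mask of selected features
```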

11
Q

What is forward selection in iterative model-based methods?

A

Starting with no features and adding the best one at each step.

It helps build a model gradually based on feature importance.

12
Q

What does backward elimination involve?

A

Starting with all features and removing the least useful one at each step.

This method aims to simplify the model by eliminating unnecessary features.
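Both forward selection (previous card) and backward elimination can be sketched with scikit-learn's SequentialFeatureSelector (available since scikit-learn 0.24); the estimator and feature count here are arbitrary choices:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# direction="forward" starts with no features and adds the best one per step;
# direction="backward" starts with all features and drops the least useful one
sfs = SequentialFeatureSelector(
    KNeighborsClassifier(), n_features_to_select=2, direction="forward"
)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features
```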

13
Q

What is Recursive Feature Elimination (RFE)?

A

An automated form of backward removal that repeatedly refits the model and eliminates the weakest features using model feedback.

It is considered an expensive feature selection method.
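A short RFE sketch with scikit-learn; the estimator and target feature count are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Repeatedly fit the model and drop the weakest feature
# until only n_features_to_select remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)
print(rfe.ranking_)  # rank 1 = selected
```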

14
Q

What is categorical data?

A

Data that falls into categories, such as color, country, or product type.

Measurement levels: categorical (nominal) – ordinal – interval – ratio.

Categorical data is crucial for various types of analysis.

15
Q

What is One-Hot Encoding?

A

Adds one binary column per category.

  • Avoids ordinal assumptions
  • Works well for low-cardinality data (few unique values)

Avoid using raw integer labels like 1, 2, 3: they introduce a false order.
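A minimal sketch with pandas' get_dummies (scikit-learn's OneHotEncoder would work equally well):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One binary column per category; no false ordering is implied
print(pd.get_dummies(df, columns=["color"]))
```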

16
Q

What is Count-Based Encoding?

A

Replaces each category with a statistic of the target computed for that category.

  • For regression: the mean target value
  • For classification: class probabilities ("people in this state have likelihood p for class 1")
  • Useful for high-cardinality features (e.g., ZIP codes, countries)
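A hand-rolled sketch with pandas, since the card does not prescribe a library; in practice the per-category statistics should be computed on training data only to avoid target leakage:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "CA", "NY", "NY", "NY"],
    "target": [1, 0, 1, 1, 0],
})

# Replace each category with the mean target of its group; for a binary
# target this is the empirical likelihood of class 1 per state
means = df.groupby("state")["target"].mean()
df["state_encoded"] = df["state"].map(means)
print(df)  # CA -> 0.5, NY -> 0.6667
```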

17
Q

What are digital images made of?

A

Grids (matrices) of pixels.

  • Pixel = the smallest unit; stores brightness or color

18
Q

What is the storage requirement for binary images?

A

1 bit per pixel.

  • Each pixel is either 0 (black) or 1 (white)

19
Q

What is the storage requirement for grayscale images?

A

1 byte (8 bits) per pixel.

  • Pixel values range from 0 to 255 (brightness scale)

20
Q

What is the storage requirement for color (RGB) images?

A

3 bytes per pixel.

  • Each pixel contains three values: Red, Green, and Blue
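As a worked example, an uncompressed 640 × 480 RGB image takes 640 × 480 × 3 = 921,600 bytes, roughly 0.9 MB.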

21
Q

What are the types of image tasks in machine learning?

A
  1. Classification: Label whole image (e.g., Fracture / No fracture)
  2. Detection: Locate object within image
  3. Segmentation: Classify each pixel (e.g., tumor pixels vs normal)

An example is classifying an image as ‘Fracture’ or ‘No fracture’.

22
Q

What is the definition of precision in evaluation metrics?

A

Precision = TP / (TP + FP)

Precision measures the accuracy of positive predictions.

23
Q

What is the definition of recall in evaluation metrics?

A

Recall = TP / (TP + FN).

Recall measures the ability to find all relevant instances.

24
Q

What does F1 Score represent?

A

F1 score = harmonic mean of precision and recall.

F1 = 2 × (precision × recall) / (precision + recall)

It balances both metrics for better evaluation.
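A quick check of all three metrics with scikit-learn; the label vectors are made up for illustration:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 1]

# TP=3, FP=1, FN=1 -> precision 0.75, recall 0.75, F1 0.75
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
```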

25
Q

What is accuracy in evaluation metrics?

A

Accuracy = (TP + TN) / Total

Accuracy can be misleading with class imbalance.
26
Q

What is tokenization in text data processing?

A

Splitting text into words (tokens).

  • The list of tokens becomes the input for additional processing (parsing / text mining)
  • Tokenization can also swap out sensitive data (e.g., a payment card or bank account number is replaced with a randomized number in the same format)

It is a crucial step for further text analysis.
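A naive tokenization sketch using a plain regular expression; real pipelines typically use a library tokenizer:

```python
import re

text = "I passed the IML exam!"

# Naive word tokenizer: lowercase the text and keep runs of letters
tokens = re.findall(r"[a-z]+", text.lower())
print(tokens)  # ['i', 'passed', 'the', 'iml', 'exam']
```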
27
Q

What is stemming?

A

Reducing a word to its root form by trimming (e.g., "studying" → "study").
28
Q

What is lemmatization?

A

Using grammar and vocabulary to reduce a word to its proper base form (e.g., "studies" → "study").
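A sketch contrasting the two, assuming NLTK (which the cards do not name); note that the stemmer's blind trimming can produce non-words:

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")  # the lemmatizer needs WordNet's vocabulary

# Stemming trims suffixes mechanically; the result need not be a real word
print(PorterStemmer().stem("studies"))  # 'studi'

# Lemmatization uses vocabulary and part of speech to find the base form
print(WordNetLemmatizer().lemmatize("studies", pos="v"))  # 'study'
```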
29
Q

What does Bag-of-Words (BoW) represent?

A

Each sentence/document is represented by a vector with a value for each word in the vocabulary.

  • Binary: word present / absent in the document
  • Count: how often the word appears in the document
  • A popular weighting: Term Frequency × Inverse Document Frequency (TF-IDF)
  • Vector length = vocabulary size
  • Useful for sentiment analysis (e.g., recognizing words such as "great", "terrible")

Values can be binary, counts, or weights.
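A BoW sketch with scikit-learn's CountVectorizer on made-up documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie was great", "the movie was terrible"]

# Each document becomes a vector of word counts over the shared vocabulary
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray())
```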
30
Q

What does TF-IDF stand for?

A

Term Frequency-Inverse Document Frequency. It emphasizes rare but important words.

Strengths:
  • Reduces the impact of common words (e.g., "the")
  • Highlights informative terms
  • Highly interpretable: each word is an independent feature

Weaknesses:
  • All structure is lost, so crucial information may be lost (e.g., "I passed the IML exam, but I failed the Computational Linguistics exam")
  • Misspellings are counted as different words (e.g., "machine" vs. "machnie")
  • Some expressions consist of multiple words (e.g., a review containing "worth" reads very differently as "not worth" vs. "definitely worth")
31
Q

What is the formula for TF in TF-IDF?

A

TF(t, d) = (count of term t in document d) / (total terms in d)

It measures how frequently a term appears in a document.
32
Q

What is the formula for IDF in TF-IDF?

A

IDF(t) = log(total docs / # docs with term t)

  • The IDF of a rare word is high; the IDF of a frequent word is low, which highlights distinctive words
  • Combined score: TF-IDF(t, d) = TF × IDF

It highlights the importance of rare words.
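A sketch with scikit-learn's TfidfVectorizer on made-up documents; note that scikit-learn uses a smoothed variant of the IDF formula above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the exam was easy",
    "the exam was hard",
    "the weather was great",
]

# Words shared by every document ("the", "was") receive low weights,
# while rarer, more informative words receive high weights
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```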
33
Q

What is an n-gram in text processing?

A

Multi-word tokens that capture short phrases instead of single words.

  • 1-gram: "not", "worth"
  • 2-gram: "not worth" (captures sentiment better)
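An n-gram sketch, again assuming CountVectorizer; ngram_range=(1, 2) keeps both single words and adjacent word pairs:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["definitely worth watching", "not worth watching"]

# ngram_range=(1, 2) keeps 1-grams and 2-grams,
# so "not worth" survives as its own feature
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(docs)
print(vectorizer.get_feature_names_out())
```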