Learning from Data Flashcards

(353 cards)

1
Q

What are the four Vs of Big Data?

A
  • Volume: Scale of Data
  • Variety: Different Forms of Data
  • Velocity: Analysis of Streaming Data
  • Veracity: Uncertainty of Data
2
Q

What is Structured Data?

A
  • Data that adheres to a data model
  • Conforms to a tabular format with relationship between the different rows and columns
  • Makes it easier to contextualise and understand the data
  • Examples include tables in SQL databases
  • Data elements are addressable for effective analysis
3
Q

What is unstructured data?

A
  • Data which is not organised according to a preset data model or schema, therefore cannot be stored in a traditional relational database
  • 80% - 90% of data generated and collected by organisations is unstructured. It is rich in content but not immediately usable without first being sorted
4
Q

What is semi-structured data?

A
  • Data that does not adhere to a data model, but has some level of structure
  • It contains tags, hierarchies, and other types of markers that give data structure
5
Q

What is the technology of different data types?

A
  • Structured: Based on relational database table
  • Semi-structured: Based on XML/RDF
  • Unstructured: Based on character and binary data
6
Q

What is the transaction management of different data types?

A
  • Structured: Matured transaction and concurrency techniques
  • Semi-structured: Transaction is adopted from DBMS, not matured
  • Unstructured: No transaction management and no concurrency
7
Q

What is the version management of different data types?

A
  • Structured: Versioning over tuples, rows, tables
  • Semi-structured: Versioning over tuples or graph is possible
  • Unstructured: Versioned as a whole
8
Q

What is the flexibility of different data types?

A
  • Structured: Schema dependent, less flexible
  • Semi-structured: More flexible than structured, but less than unstructured
  • Unstructured: More flexible and there is an absence of a schema
9
Q

What are the analysis methods of different data types?

A
  • Structured: SQL queries
  • Semi-structured: NoSQL query languages (e.g., for Cassandra, MongoDB)
  • Unstructured: Natural language processing, audio analysis, video analysis, text analysis
10
Q

What is the primary goal of data integration?

A

The goal of data integration is to combine data from heterogeneous sources into a single coherent data store, providing users with consistent access and delivery of data across various subjects and data structure types. This is particularly useful when data sources are disparate or siloed, such as across different hardware devices, software applications, or operating systems.

11
Q

Name and describe the five data integration strategies

A
  • Common User Interface (Manual Integration): Data managers manually handle every step of integration, from retrieval to presentation.
  • Middleware Data Integration: Uses middleware software to bridge communication between systems, especially legacy and newer systems.
  • Application-Based Integration: Software applications locate, retrieve, and integrate data by making it compatible across systems.
  • Uniform Data Access: Provides a consistent view of data without moving or altering it, keeping data in its original location.
  • Common Data Storage (Data Warehouse): Stores a duplicate copy of data in a central repository for uniform retrieval and presentation.
12
Q

What are the advantages and disadvantages of the Common User Interface strategy?

A
  • Advantages: Reduced cost, requires little maintenance, integrates a small number of data sources, and gives users total control.
  • Disadvantages: Data must be handled manually at each stage, scaling requires changing code, and the process is labor-intensive.
13
Q

What are the advantages and disadvantages of the Middleware Data Integration strategy?

A
  • Advantages: Middleware software conducts the integration automatically, and the same way each time
  • Disadvantages: Middleware needs to be deployed and maintained.
14
Q

What are the advantages and disadvantages of the Application-based Integration strategy?

A
  • Advantages: Simplified process, application allows systems to transfer information seamlessly, much of the process is automated.
  • Disadvantages: Requires specialist technical knowledge and maintenance, complicated setup.
15
Q

What are the advantages and disadvantages of the Uniform Access Integration strategy?

A
  • Advantages: Lower storage requirements, provides a simplified view of the data to the end user, easier data access
  • Disadvantages: Can compromise data integrity, data host systems are not designed to handle amount and frequency of data requests.
16
Q

What are the advantages and disadvantages of the Common Data Storage strategy?

A
  • Advantages: Reduced burden on the host system, increased data version management control, can run sophisticated queries on a stored copy of the data without compromising data integrity
  • Disadvantages: Need to find a place to store a copy of the data, increases storage cost, require technical experts to set up the integration, oversee and maintain the data warehouse.
17
Q

What percentage of their time do data scientists spend cleaning and organizing data?

A

Data scientists spend 60% of their time cleaning and organizing data, making it the most time-consuming part of their work.

18
Q

What is the least enjoyable part of data science according to surveys?

A

Cleaning and organizing data is the least enjoyable part, cited by 57% of respondents.

19
Q

What are the three main types of learning in machine learning?

A
  • Supervised Learning: Uses labeled data to learn a mapping from inputs to outputs.
  • Unsupervised Learning: Works with unlabeled data to find patterns or groupings.
  • Semi-Supervised Learning: Combines both labeled and unlabeled data.
20
Q

What is the difference between regression and classification in supervised learning?

A
  • Regression: Predicts a continuous quantitative response (e.g., income, stock price).
  • Classification: Predicts a qualitative response (e.g., marital status, cancer diagnosis).
21
Q

What is the general form of a supervised learning model?

A

The model learns a mapping function: y_p= f(Ω, x)

where:

y_p = predicted output
Ω = model parameters
x = input features

22
Q

What are hyperparameters in machine learning?

A

Hyperparameters are parameters not learned directly from the data but set before training (e.g., learning rate, number of layers in a neural network). They control the learning process and are often tuned for optimal performance.

23
Q

How is the quality of predictions measured in supervised learning?

A

A loss function J(y, y_p) quantifies the difference between predicted values y_p and actual values y. The goal is to minimize this function during training.

24
Q

What is the purpose of a loss function?

A

The loss function measures how well the model’s predictions match the actual data. It guides the optimization process to adjust model parameters (Ω) for better accuracy.

25
What is the equation for simple linear regression?
y_p(x) = β_0 + β_1·x + ϵ, where: β_0 = y-intercept, β_1 = slope, ϵ = error term.
26
How are the coefficients β_0 and β_1 calculated in linear regression?
β_1 = r × (S_y/S_x), where r is the Pearson correlation coefficient and S_x, S_y are the standard deviations of x and y. β_0 = ȳ − β_1 × x̄, where ȳ and x̄ are the sample means.
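A minimal numpy sketch of these formulas (not from the lecture; the data points are hypothetical):
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical data
y = np.array([6.0, 5.0, 7.0, 10.0])

r = np.corrcoef(x, y)[0, 1]                    # Pearson correlation coefficient
beta_1 = r * (y.std(ddof=1) / x.std(ddof=1))   # slope = r * (S_y / S_x)
beta_0 = y.mean() - beta_1 * x.mean()          # intercept = ȳ − β_1·x̄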
27
What does the Pearson correlation coefficient measure?
It measures the strength and direction of the linear relationship between two variables, ranging from -1 (perfect negative correlation) to +1 (perfect positive correlation).
28
What is an example of unsupervised learning?
Clustering houses based on electricity usage patterns to identify groups with similar consumption behaviors.
29
Why is logistic regression considered a regression method even though it’s used for classification?
Logistic regression estimates probabilities (a continuous output) for binary or multi-class responses, making it a regression method that is adapted for classification tasks.
30
What is the difference between features and labels in supervised learning?
- Features (inputs): Variables used to predict the outcome (e.g., movie budget). - Labels (outputs): The target variable being predicted (e.g., box office revenue).
31
What is the role of the error term (ϵ) in linear regression?
The error term captures the difference between the observed and predicted values, accounting for noise or unexplained variability in the data.
32
What is the purpose of a data warehouse in data integration?
A data warehouse stores a duplicate copy of data in a central repository, enabling sophisticated queries and analysis without compromising the integrity of the original data sources.
33
What are the disadvantages of uniform data access integration?
It can compromise data integrity and may overwhelm host systems not designed to handle high volumes or frequent data requests.
34
What is the Mean Squared Error (MSE), and how is it used in linear regression?
MSE averages the squared errors to measure model accuracy: MSE = (1/n) ∑ (y_i − y_p,i)². Purpose: minimizing MSE gives the optimal β_0 and β_1 (the least squares method).
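A minimal sketch of the computation, assuming hypothetical arrays of actual and predicted values:
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([6.0, 5.0, 7.0, 10.0])   # hypothetical actual values
y_pred = np.array([5.5, 6.0, 7.5, 9.0])    # hypothetical predictions

mse_manual = np.mean((y_true - y_pred) ** 2)       # average squared error
mse_sklearn = mean_squared_error(y_true, y_pred)   # same result via sklearn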
35
[REVISE EQUATIONS FROM LECTURE 2/3]
36
Compare regression for prediction vs. regression for interpretation.
- Prediction: Focus on minimizing error (e.g., MSE) to forecast y. Ignores interpretability (e.g., black-box models). - Interpretation: Focus on β coefficients to understand relationships (e.g., "How does marketing budget affect revenue?"). Example: Predicting customer churn (prediction) vs. analyzing feature importance in housing prices (interpretation).
37
What are the steps to implement linear regression in Python using sklearn?
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)   # Fit model
y_pred = model.predict(X_test)   # Predict
38
How do you handle an overdetermined system in linear regression?
An overdetermined system (more equations than unknowns) is solved by minimizing SSE using OLS. Example: For points (1,6), (2,5), (3,7), (4,10), OLS finds the line y = β_0 + β_1x that best fits all points.
39
What is the trade-off between model interpretation and prediction accuracy?
- Simple models (e.g., linear regression): Easier to interpret but may underfit. - Complex models (e.g., neural networks): Better prediction but harder to interpret ("black box"). Best practice: Choose based on goal (e.g., interpretability for policy-making, accuracy for forecasts).
40
What are the best modeling practices in supervised learning?
- Define a clear cost function (e.g., MSE for regression). - Train multiple models (e.g., different hyperparameters). - Compare performance using metrics (e.g., R^2, MSE). - Avoid overfitting by validating on test data.
41
Why is the sum of errors (∑ε_i) zero in OLS regression?
The OLS method ensures the mean error is zero by design (derivative of SSE w.r.t. β_0 enforces ∑ε_i =0). Implication: Positive and negative errors cancel out.
42
What is the key difference between linear regression and polynomial regression?
- Linear regression: Models the relationship as a straight line (y = β_0 + β_1x) - Polynomial regression: Extends linear regression by adding higher-order terms (y = β_0 + β_1x + β_2x^2 + …) to capture nonlinear patterns. Use case: Polynomial regression is preferred when data shows curvature (e.g., quadratic trends).
43
How do you determine the optimal polynomial order for a regression model?
Use the Bayesian Information Criterion (BIC): BIC = n·ln(SS_e) − n·ln(n) + p·ln(n), where SS_e = sum of squared errors, p = number of parameters, n = sample size. Rule: lower BIC indicates a better balance of fit and complexity.
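A minimal sketch applying this BIC formula to candidate polynomial degrees (hypothetical generated data, numpy polyfit for the fits):
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = 1 + 2 * x - 3 * x**2 + rng.normal(0, 0.1, size=30)   # hypothetical data

for degree in range(1, 6):
    coeffs = np.polyfit(x, y, degree)
    sse = np.sum((y - np.polyval(coeffs, x)) ** 2)
    n, p = len(y), degree + 1                  # parameters include the intercept
    bic = n * np.log(sse) - n * np.log(n) + p * np.log(n)
    print(degree, round(bic, 2))               # pick the degree with the lowest BIC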
44
Why is polynomial regression still considered a linear model?
Despite fitting curves, polynomial regression remains linear in its parameters (e.g., β_0, β_1, β_2). The "linear" refers to the model’s linearity in coefficients, not predictors.
45
What is data leakage, and how can it be avoided?
- Leakage: When information from the test set inadvertently influences the training process (e.g., scaling using test data). - Prevention: Split data into training/test sets before preprocessing, or use pipelines.
46
What is the purpose of a train-test split in machine learning?
To evaluate model performance on unseen data: Train set (e.g., 70%): fit the model. Test set (e.g., 30%): assess generalization. Python:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
47
What is k-fold cross-validation, and why is it better than a single train-test split?
Process: Split data into k folds (e.g., k=5); each fold serves as the test set once. Advantage: Reduces variance in performance estimates by averaging results across multiple splits. Python:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
48
What is stratified sampling, and when should it be used?
Definition: Ensures each split (train/test) maintains the same class proportions as the original dataset. Use case: Critical for imbalanced datasets (e.g., 90% Class A, 10% Class B). Python: train_test_split(X, y, test_size=0.2, stratify=y)
49
How does stratified k-fold cross-validation differ from standard k-fold?
- Standard k-fold: Randomly splits data, risking uneven class distribution in folds. - Stratified k-fold: Preserves class ratios in each fold. Example: For a binary classification with 60% positives, each fold will have ~60% positives.
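A minimal scikit-learn sketch of stratified k-fold on a hypothetical imbalanced dataset:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)
for train_idx, test_idx in skf.split(X, y):    # each fold keeps roughly the 90/10 class ratio
    model.fit(X[train_idx], y[train_idx])
    print(model.score(X[test_idx], y[test_idx]))   # per-fold accuracy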
50
What are interaction terms in regression, and how are they modeled?
Interaction terms capture how the effect of one predictor depends on another: y = β_0 + β_1x_1 + β_2x_2 + β_3(x_1 × x_2) Use case: Testing if the impact of x_1 on y changes with x_2 (e.g., drug efficacy varying by age).
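A minimal sketch of adding an interaction term with scikit-learn (the feature matrix and target are hypothetical):
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = np.array([[1, 2], [2, 1], [3, 4], [4, 3]], dtype=float)   # hypothetical x_1, x_2
y = np.array([5.0, 5.0, 19.0, 19.0])                          # hypothetical target

inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = inter.fit_transform(X)            # columns: x_1, x_2, x_1*x_2
model = LinearRegression().fit(X_inter, y)
print(model.coef_)                          # β_1, β_2, β_3 (interaction)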
51
What is the Bias-Variance Tradeoff, and how does it relate to model complexity?
- Bias: Error from overly simplistic assumptions (underfitting). - Variance: Error from excessive sensitivity to training noise (overfitting). Tradeoff: Increasing model complexity reduces bias but increases variance. Polynomial regression exemplifies this tradeoff.
52
How do you implement polynomial regression in Python?
Create polynomial features:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
Fit linear regression:
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X_poly, y)
53
What are the risks of using a high-degree polynomial in regression?
- Overfitting: The model fits training noise, leading to poor generalization. - Interpretability: Higher-order terms are harder to explain. Solution: Use BIC or cross-validation to select optimal degree.
54
Why is random sampling insufficient for imbalanced datasets?
Random splits may create training/test sets with skewed class distributions, biasing performance metrics (e.g., accuracy). Stratified sampling ensures representative splits.
55
What is the key output of cross_val_score in scikit-learn?
An array of scores (e.g., MSE, R^2) for each fold. The average score estimates model performance. Example:
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
mean_r2 = scores.mean()
56
How does holdout validation differ from k-fold cross-validation?
- Holdout: Single train-test split (fast but prone to high variance). - k-fold: Multiple splits (slower but more reliable performance estimate). Use holdout for large datasets; k-fold for small datasets.
57
How do training error and cross-validation error behave with increasing model complexity?
- Training error: Decreases as complexity increases (model fits training data better). - Cross-validation error: Initially decreases, then increases due to overfitting. Optimal complexity: Choose the point where CV error is minimized (before it starts rising).
58
What are the three sources of model error?
- Bias: Consistent deviation from true values (underfitting). - Variance: Sensitivity to small fluctuations in training data (overfitting). - Irreducible error: Noise inherent in the data (unavoidable).
59
How does polynomial degree relate to bias and variance?
- Low degree (e.g., 1): High bias (rigid model), low variance. - High degree (e.g., 14): Low bias, high variance (fits noise). - Optimal degree (e.g., 4): Balances bias and variance.
60
What is regularization, and why is it used?
- Definition: Technique to prevent overfitting by adding a penalty term to the loss function. - Purpose: Reduces variance by shrinking coefficients (introduces slight bias). Example methods: Ridge (L2) and LASSO (L1) regression.
61
Compare Ridge (L2) and LASSO (L1) regularization.
- Ridge: Penalizes squared coefficients (λ∑β_j^2). Shrinks coefficients but rarely zero. - LASSO: Penalizes absolute coefficients (λ∑∣β_j∣). Can drive coefficients to zero (feature selection). Use case: Ridge for correlated features; LASSO for feature selection.
62
How does the regularization parameter (λ) affect the model?
- Large λ: Strong penalty → simpler model (high bias, low variance). - Small λ: Weak penalty → complex model (low bias, high variance). Optimization: Choose λ via cross-validation.
63
What is the cost function for Ridge regression?
RSS+λ ∑_{j=1}^pβ_j^2 RSS: Residual sum of squares. λ∑β_j^2: L2 penalty term.
64
What is the cost function for LASSO regression?
RSS+λ ∑_{j=1}^p|β_j| RSS: Residual sum of squares. λ∑|β_j|: L1 penalty term (promotes sparsity).
65
How does LASSO perform feature selection?
By driving some coefficients to exactly zero, effectively removing those features from the model. This is due to the L1 penalty’s geometric properties (sharp corners at zero).
66
What is the geometric interpretation of L1 vs. L2 regularization?
- L1 (LASSO): Constraint region is a diamond (sparse solutions at corners). - L2 (Ridge): Constraint region is a circle (smooth shrinkage). Visual: L1 tends to intersect axes (zero coefficients); L2 does not.
67
How do you implement Ridge and LASSO regression in Python?
from sklearn.linear_model import Ridge, Lasso
ridge = Ridge(alpha=1.0).fit(X, y)   # alpha = λ
lasso = Lasso(alpha=1.0).fit(X, y)
Note: Always standardize features before regularization.
68
Why is feature scaling important for regularization?
Regularization penalizes coefficients equally. Unscaled features (e.g., age vs. income) would unfairly bias the penalty toward larger-scale features.
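A minimal sketch of scaling inside a pipeline before Ridge (generated data; the penalty then acts on comparably scaled coefficients and the scaler never sees test data):
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=0)
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print(model.named_steps["ridge"].coef_)   # coefficients on the standardized scale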
69
What is the irreducible error in modeling?
Error caused by noise or randomness in the data that cannot be reduced by any model. It sets a lower bound on the achievable prediction error.
70
How does cross-validation help in selecting λ?
By evaluating model performance across multiple train-test splits for different λ values, then choosing the λ that minimizes average validation error.
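A minimal sketch, assuming RidgeCV and hypothetical generated data; the same idea works for LASSO via LassoCV:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=100, n_features=10, noise=5, random_state=0)
alphas = np.logspace(-3, 3, 13)                  # candidate λ values
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)   # 5-fold CV over the grid
print(model.alpha_)                              # λ with the lowest average validation error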
71
What is the relationship between model complexity and the bias-variance tradeoff?
- Low complexity: High bias (underfitting), low variance. - High complexity: Low bias, high variance (overfitting). Goal: Find the "Goldilocks" complexity where total error (bias² + variance + irreducible error) is minimized
72
What is the Bayesian interpretation of Ridge regression?
Ridge regression is equivalent to placing a Gaussian prior on the coefficients (mean zero, variance 1/λ), encouraging small but non-zero values.
73
When would you prefer Ridge over LASSO?
When all features are potentially relevant and you want to retain them (e.g., correlated features in genomics). Ridge shrinks coefficients but rarely zeroes them.
74
When would you prefer LASSO over Ridge?
When you suspect many features are irrelevant and want automatic feature selection (e.g., high-dimensional data with sparse signals).
75
What is elastic net regularization, and when is it used?
Definition: Combines L1 and L2 penalties (λ_1∑|β_j| + λ_2∑β_j^2) Use case: When features are correlated AND you want feature selection (balances Ridge and LASSO).
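A minimal sketch, assuming scikit-learn's ElasticNet and hypothetical generated data:
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=100, n_features=20, noise=5, random_state=0)
model = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)   # l1_ratio balances the L1 vs L2 penalties
print((model.coef_ == 0).sum())                         # coefficients shrunk exactly to zero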
76
What are the three types of missing values in datasets?
- MCAR (Missing Completely At Random): Missingness is unrelated to any data (e.g., sensor random failure). - MAR (Missing At Random): Missingness relates to observed data (e.g., high wind causing sensor malfunctions). - MNAR (Missing Not At Random): Missingness relates to unobserved data (e.g., pollution tampering with sensors).
77
What are the four approaches to handling missing values?
- Keep as-is: For tools that handle missing values (e.g., KNN). - Remove rows: Risky for MNAR/MAR (may introduce bias). - Remove columns: When >25% values are missing in non-critical features. - Impute: Use mean/median (MCAR), subgroup means (MAR), or regression (MNAR).
78
How do you detect outliers using the IQR method?
Q1 = df['col'].quantile(0.25)
Q3 = df['col'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['col'] < (Q1 - 1.5*IQR)) | (df['col'] > (Q3 + 1.5*IQR))]
79
What are the strategies for handling outliers?
- Do nothing: For robust models (e.g., Random Forests). - Cap values: Replace with upper/lower bounds. - Log transform: For skewed data. - Remove rows: Last resort (risks losing information).
80
What is the difference between standardization and normalization?
- Standardization: Rescales data to mean=0, SD=1: X′ = (X−μ)/σ. Use case: Algorithms assuming Gaussian distributions (e.g., SVM, PCA). - Normalization: Rescales to [0, 1]: X′ = (X−X_min)/(X_max−X_min). Use case: Neural networks, distance-based algorithms (e.g., KNN).
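A minimal sketch of both rescalings on a hypothetical column:
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [5.0], [10.0], [50.0]])        # hypothetical feature column
standardized = StandardScaler().fit_transform(X)    # mean 0, SD 1
normalized = MinMaxScaler().fit_transform(X)        # rescaled to [0, 1]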
81
When should you use log transformation?
- Highly skewed data (e.g., income, city populations). - Data spanning orders of magnitude. - When analyzing ratios (log transforms multiplicative relationships to additive). Formula: X'=log(X).
82
How do you convert categorical data to numerical?
- One-Hot Encoding: Creates binary columns for each category (e.g., "Education Level" → "Bachelor", "Masters"). - Ordinal Encoding: Assigns ranks (e.g., "High School"=1, "PhD"=4). - Target Encoding: Replaces categories with mean target value (for supervised learning).
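A minimal pandas sketch of one-hot and ordinal encoding on a hypothetical education column:
import pandas as pd

df = pd.DataFrame({"education": ["High School", "Bachelor", "Masters", "PhD"]})
one_hot = pd.get_dummies(df["education"])              # one binary column per category
ranks = {"High School": 1, "Bachelor": 2, "Masters": 3, "PhD": 4}
df["education_ordinal"] = df["education"].map(ranks)   # ordinal encoding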
83
What is discretization, and when is it useful?
Definition: Converting continuous data into bins (e.g., age → "Child", "Adult"). Use cases: - Simplifying models (e.g., decision trees). - Handling non-linear relationships. - Improving interpretability.
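A minimal pandas sketch, binning a hypothetical age column:
import pandas as pd

ages = pd.Series([4, 17, 25, 42, 70])
groups = pd.cut(ages, bins=[0, 12, 18, 65, 120],
                labels=["Child", "Teen", "Adult", "Senior"])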
84
What is data smoothing, and how does a moving average work?
Purpose: Reduce noise to reveal trends. Moving Average: Replaces each point with the average of its neighbors. Use case: Time series analysis (e.g., stock prices).
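A minimal pandas sketch of a centred 3-point moving average on a hypothetical series:
import pandas as pd

prices = pd.Series([10, 12, 11, 15, 14, 18, 17])
smoothed = prices.rolling(window=3, center=True).mean()   # average of each point and its neighbours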
85
When should you avoid pie charts?
- Comparing >5 categories (hard to distinguish slices). - Differences between values are small (angles are hard to compare). - Data has similar proportions (use bar charts instead).
86
What are the best practices for bar charts?
- Limit bars: ≤10 categories. - Horizontal layout: For long category names. - Order bars: By value (ascending/descending). - Avoid 3D effects: Distorts proportions.
87
When should you use a scatter plot?
- Show relationships between two continuous variables (e.g., correlation). - Identify clusters or outliers. - Compare multiple groups (use colors/markers).
88
What makes a good line plot?
- Label axes clearly. - Use solid lines (no markers for many points). - Directly label lines (avoid legends if possible). - Highlight key events (e.g., policy changes).
89
What is a heatmap, and when is it useful?
Definition: Matrix where colors represent values. Use cases: - Correlation matrices. - Time-series patterns (e.g., temperature over months). - Geospatial data (e.g., population density).
90
What are systematic vs. random errors in data?
- Systematic: Consistent bias (e.g., faulty sensor calibration). Hard to detect. - Random: Unpredictable fluctuations (e.g., measurement noise). Averages out over time.
91
How do you impute missing values for MAR data?
Use subgroup means/medians (e.g., impute missing GPA for "Freshmen" using the median GPA of other Freshmen). Avoid global imputation to reduce bias.
92
Why is stratified sampling important in train-test splits?
Ensures the test set has the same class proportions as the training set, preventing skewed performance metrics (critical for imbalanced datasets).
93
What are the key steps in data preprocessing?
- Cleaning: Handle missing values, outliers, errors. - Transformation: Scale, normalize, encode. - Reduction: Feature selection, dimensionality reduction. - Visualization: Explore patterns, validate preprocessing.
94
What is the purpose of a train-test split in machine learning?
The train-test split divides the dataset into two parts: the training set (typically 80% of the data) and the test set (20%). The training set is used to train the model, while the test set evaluates its performance on unseen data. This helps assess the model's generalization ability and avoid overfitting.
95
How is classification different from regression?
Classification predicts discrete labels (e.g., "red" or "blue," "spam" or "not spam"), while regression predicts continuous values (e.g., temperature, price). Classification is used for categorical outcomes, whereas regression is used for numerical outcomes.
96
What is logistic regression, and when is it used?
Logistic regression is a classification algorithm that predicts the probability of a binary outcome (e.g., yes/no) using the logistic function: y = 1 / (1 + e^(−(ax+b))). It is used when the dependent variable is categorical and the relationship between features and outcome is nonlinear (sigmoidal).
97
What does the logistic regression model output represent?
The output is the probability that the input belongs to class "1" (e.g., "spam"). The decision boundary is typically set at 0.5: if the probability ≥ 0.5, the prediction is class 1; otherwise, it is class 0.
98
How do you implement logistic regression in Python using scikit-learn?
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
predictions = log_reg.predict(X_test)
probs = log_reg.predict_proba(X_test)
99
What is the cost function for logistic regression?
The cost function is the log loss (binary cross-entropy): J(β) = −(1/n) ∑ [y_i·ln(y_p,i) + (1−y_i)·ln(1−y_p,i)], which is minimized during training (see Lecture 7).
100
How is accuracy calculated for a classifier?
Accuracy is the ratio of correct predictions to total predictions: Accuracy = (TP+TN)/ (TP+TN+FP+FN) where TP = True Positives, TN = True Negatives, FP = False Positives, FN = False Negatives.
101
What is a perceptron, and how does it work?
A perceptron is a binary linear classifier. It takes inputs x_1, x_2, ..., x_K, applies weights w_1, w_2, ..., w_K, sums them with a bias b, and passes the result through a step function: y = f(b + ∑_{i=1}^{K} w_i·x_i), where f outputs 1 if the sum ≥ 0, else 0.
102
How is a perceptron trained?
The weights are updated iteratively: - Initialize weights to 0. - For each misclassified point x, update weights: w_(t+1) = w_t + α(y^data - y^pred)x
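A minimal numpy sketch of this update rule on hypothetical 2-D data (labels in {0, 1}, α = learning rate):
import numpy as np

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, 0, 0])

w, b, alpha = np.zeros(2), 0.0, 0.1
for _ in range(20):                           # fixed number of passes over the data
    for xi, yi in zip(X, y):
        y_pred = 1 if b + w @ xi >= 0 else 0  # step function
        w += alpha * (yi - y_pred) * xi       # no change when the point is classified correctly
        b += alpha * (yi - y_pred)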
103
What is the limitation of a perceptron?
Perceptrons can only classify linearly separable data. They fail for problems like XOR, where no straight line can separate classes.
104
What is a Multi-Layer Perceptron (MLP), and how does it solve the perceptron's limitation?
An MLP is a neural network with hidden layers between input and output layers. Each neuron uses nonlinear activation functions (e.g., ReLU, sigmoid), enabling the network to learn complex, non-linear decision boundaries.
105
How does an MLP differ from a single perceptron?
An MLP stacks multiple perceptrons (neurons) in layers, with nonlinear activation functions. This allows it to model intricate patterns, unlike a single perceptron, which is limited to linear separation.
106
What is the role of hidden layers in an MLP?
Hidden layers transform input data hierarchically, extracting higher-level features. Each layer’s neurons apply weighted sums and activation functions to progressively refine the model’s predictions.
107
Why is logistic regression considered a supervised learning method?
It requires labeled training data (ground truth) to learn the relationship between features and outcomes. The model is trained to minimize prediction error on this labeled data.
108
Give an example of a real-world classification problem.
Email spam detection: The classifier predicts whether an email is "spam" or "not spam" based on features like word frequency, sender, and subject line.
109
What is gradient descent, and why is it used in machine learning?
Gradient descent is an optimization algorithm used to minimize the loss function of a model by iteratively adjusting parameters in the direction of the steepest negative gradient. It is used because it efficiently finds optimal parameters for models like linear regression, logistic regression, and neural networks, especially when closed-form solutions are infeasible.
110
What is the difference between deterministic and stochastic parameter-fitting methods?
- Deterministic: Follows a fixed rule (e.g., gradient direction) to update parameters. No randomness; steps are calculated based on gradients. - Stochastic: Introduces randomness (e.g., random steps or sampling). Used in genetic algorithms or stochastic gradient descent (SGD) to escape local minima.
111
What are the L1 and L2 norms, and how are they used in model fitting?
- L1 norm (Manhattan distance): Sum of absolute errors. Robust to outliers but less smooth. - L2 norm (Euclidean distance): Square root of the sum of squared errors. Sensitive to outliers but differentiable. Both measure prediction error and serve as loss functions (e.g., L2 for linear regression).
112
How does gradient descent work step-by-step?
- Initialize parameters randomly. - Compute the loss (e.g., L2 norm). - Calculate gradients (partial derivatives of loss w.r.t. parameters). - Update parameters: θ_new = θ_old −α∇_θLoss, where α is the learning rate. - Repeat until convergence or max iterations.
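A minimal numpy sketch of these steps for simple linear regression with an MSE loss (hypothetical data; α is the learning rate):
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([6.0, 5.0, 7.0, 10.0])

b0, b1, alpha = 0.0, 0.0, 0.01
for _ in range(5000):
    error = (b0 + b1 * x) - y
    grad_b0 = 2 * error.mean()          # ∂MSE/∂b0
    grad_b1 = 2 * (error * x).mean()    # ∂MSE/∂b1
    b0 -= alpha * grad_b0               # step against the gradient
    b1 -= alpha * grad_b1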
113
What is a confusion matrix, and what metrics derive from it?
A table comparing predicted vs. actual classes: Predicted + / Actual + = TP; Predicted + / Actual − = FP; Predicted − / Actual + = FN; Predicted − / Actual − = TN. Metrics: - Accuracy: (TP+TN)/(TP+TN+FP+FN) - Precision: TP/(TP+FP) - Recall/Sensitivity: TP/(TP+FN) - Specificity: TN/(TN+FP)
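A minimal scikit-learn sketch with hypothetical true and predicted labels:
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)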
114
What is an ROC curve, and how is it interpreted?
An ROC curve plots the True Positive Rate (TPR, recall) vs. False Positive Rate (FPR, 1-specificity) across different classification thresholds. Interpretation: - Top-left corner (TPR=1, FPR=0): Perfect classifier. - Diagonal line: Random guessing. - Higher AUC (Area Under Curve) = Better performance.
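A minimal scikit-learn sketch with hypothetical labels and predicted probabilities:
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]          # predicted probabilities for class 1

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, scores)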
115
What is the F1 score, and why is it useful?
The F1 score is the harmonic mean of precision and recall: F1 = 2 × (Precision × Recall)/(Precision + Recall). It balances precision and recall, useful for imbalanced datasets where one class dominates.
116
How is AUC (Area Under the ROC Curve) used to evaluate classifiers?
AUC measures the classifier’s ability to distinguish classes: - AUC=1: Perfect separation. - AUC=0.5: No better than random. - AUC>0.7: Good model. It is threshold-independent and works for binary/multi-class problems.
117
What is the trade-off between sensitivity (recall) and specificity?
- High sensitivity: Few false negatives (e.g., correctly identifying all sick patients). - High specificity: Few false positives (e.g., correctly ruling out healthy patients). Adjusting the classification threshold shifts this trade-off (e.g., lower threshold increases sensitivity but reduces specificity).
118
How do you address class imbalance in a confusion matrix?
- Resampling: Oversample minority class or undersample majority class. - Cost-sensitive learning: Penalize misclassifications of the minority class more. - Use metrics like F1 or AUC: Less sensitive to imbalance than accuracy.
119
What is the difference between grid search and gradient descent for parameter optimization?
- Grid search: Exhaustively tests predefined parameter combinations. Computationally expensive but simple. - Gradient descent: Iteratively adjusts parameters based on gradients. Efficient for high-dimensional spaces but may get stuck in local minima.
120
Why might you use stochastic gradient descent (SGD) over batch gradient descent?
SGD updates parameters using a random subset (mini-batch) of data per iteration, offering: - Faster convergence for large datasets. - Better escape from local minima due to noise in updates. - Lower computational cost per iteration.
121
How does the learning rate (α) affect gradient descent?
- Too small: Slow convergence; may not reach optimum. - Too large: Overshoots minima, causing divergence. - Adaptive methods (e.g., Adam): Dynamically adjust α for faster, stable convergence.
122
What is the role of the threshold in binary classification?
The threshold converts predicted probabilities (e.g., from logistic regression) into class labels: - Threshold=0.5: Default for balanced classes. - Adjusting threshold: Increases sensitivity (lower threshold) or specificity (higher threshold).
123
How would you explain the ROC curve to a non-technical audience?
"The ROC curve shows how well a test (e.g., medical diagnosis) balances catching all true cases (sensitivity) vs. avoiding false alarms (specificity). A curve closer to the top-left means the test is highly accurate."
124
What is Natural Language Processing (NLP), and what are its common applications?
NLP is a field of AI focused on enabling computers to understand, interpret, and generate human language. Common applications include: - Text similarity: Comparing documents for plagiarism or search. - Sentiment analysis: Determining emotional tone (e.g., positive/negative reviews). - Topic extraction: Identifying key themes in large texts (e.g., news articles). - Spam detection: Classifying emails/texts as spam or not.
125
How does Bayesian spam detection work?
It calculates the probability that a message is spam based on word frequencies: P(spam∣message) = P(message∣spam)P(spam)/P(message) Example: For "send us your password," compute P("send"∣spam)×P("us"∣spam)×… Assumption: Words are independent (Naive Bayes).
126
What are tokenisation, stemming, and lemmatisation in NLP?
- Tokenisation: Splitting text into words/tokens (e.g., "Fear of the dark" → ["Fear", "of", "the", "dark"]). - Stemming: Reducing words to root form by chopping suffixes (e.g., "running" → "run"). - Lemmatisation: Using linguistics to convert words to base/dictionary form (e.g., "was" → "be").
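A minimal NLTK sketch (assumes the punkt and wordnet resources have been downloaded):
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

tokens = word_tokenize("Fear of the dark")              # tokenisation
stem = PorterStemmer().stem("running")                  # stemming, gives "run"
lemma = WordNetLemmatizer().lemmatize("was", pos="v")   # lemmatisation, gives "be"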
127
What is the difference between stemming and lemmatisation?
- Stemming: Crude heuristic (e.g., "adjustable" → "adjust"). Fast but may yield non-words. - Lemmatisation: Accurate base forms using dictionaries/grammar (e.g., "better" → "good"). Slower but precise.
128
Why is tokenisation challenging for languages like Vietnamese or Chinese?
These languages lack spaces between words, requiring advanced methods: - Vietnamese: "thời gian" (time) must be tokenized as one unit, not ["thời", "gian"]. - Chinese/Japanese: Use segmentation algorithms (e.g., Jieba for Chinese).
129
What is a bag-of-words (BoW) representation?
A text vectorization method where: - Each document is represented as word counts (e.g., {"fear": 3, "dark": 2}). - Limitation: Ignores word order/semantics but useful for simple models like Naive Bayes.
130
What is TF-IDF, and how is it calculated?
Term Frequency-Inverse Document Frequency measures word importance in a document relative to a corpus: - TF: (Word count in document)/(Total words in document) ​- IDF: log((Total documents)/(Documents containing the word)) - TF-IDF = TF × IDF. High scores indicate rare but significant words (e.g., "blood" in metal lyrics).
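A minimal scikit-learn sketch on a hypothetical mini-corpus (get_feature_names_out assumes a recent scikit-learn version):
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["fear of the dark", "dancing in the dark", "fear and fire"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)          # sparse document-term matrix of TF-IDF weights
print(vectorizer.get_feature_names_out())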
131
How does TF-IDF help identify "metal-like" words in lyrics?
By comparing word frequencies in metal (f_metal) vs. other genres (f_others): - High f_metal/f_others: Words like "burn," "fire" are metal-specific. - Low ratio: Common words (e.g., "the," "of") are ignored.
132
What is a sparse matrix in NLP, and why does it occur?
A matrix (e.g., song-word counts) where most entries are zero because: - Each document uses only a small subset of the vocabulary. - Example: A song may contain 10 unique words out of 10,000 in the corpus.
133
How can vectorized text (e.g., TF-IDF) be used in machine learning?
Converted vectors enable: - Clustering: Group similar documents (e.g., metal vs. pop lyrics). - Classification: Train models (e.g., spam detection). - Dimensionality reduction: PCA to visualize high-dimensional data.
134
What are the limitations of TF-IDF?
- Ignores semantics: "Happy" and "joyful" are treated as unrelated. - No word order: "Not good" vs. "good" may have similar vectors. - Domain dependence: Stopwords (e.g., "the") may be irrelevant in some contexts.
135
How would you preprocess text for a sentiment analysis task?
- Tokenisation: Split into words. - Lowercasing: Standardize case (e.g., "Fear" → "fear"). - Stopword removal: Drop common words (e.g., "the," "and"). - Stemming/Lemmatisation: Reduce inflectional forms. - Vectorization: Convert to BoW or TF-IDF.
136
What is the difference between term frequency (TF) and document frequency (DF)?
- TF: How often a word appears in a single document. - DF: How many documents in the corpus contain the word. - IDF downweights high-DF words (e.g., "the") to highlight rare terms.
137
Why might TF-IDF be better than raw word counts for text classification?
- Reduces bias from frequent but meaningless words (e.g., "the"). - Emphasizes discriminative words (e.g., "password" in spam). - Improves model performance by focusing on informative terms.
138
How would you handle multi-language text processing (e.g., English and Indonesian)?
- Language-specific tokenisers: Use spaCy for English, PySastrawi for Indonesian. - Stopword lists: Customize per language. - Lemmatisation: Requires language-specific dictionaries (e.g., NLTK for English).
139
What is topic modelling, and what problem does it solve in NLP?
Topic modelling is a statistical method to discover abstract "topics" in a collection of documents. It addresses: - High-dimensional sparse data: Reduces document-term matrices (e.g., TF-IDF) to lower-dimensional topic representations. - Summarization: Identifies dominant themes (e.g., "sports," "politics") without prior labeling. - Example: LDA (Latent Dirichlet Allocation) models documents as mixtures of topics, where each topic is a distribution over words.
140
How does Latent Dirichlet Allocation (LDA) work?
- Each document is a mix of topics (e.g., 60% sports, 40% politics). - Each topic is a distribution over words (e.g., "sports" → {"ball": 0.3, "team": 0.2}). - Process: Assign random topics to words. Iteratively update topic-word and document-topic distributions using Gibbs sampling or variational inference.
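A minimal scikit-learn sketch of LDA on a hypothetical mini-corpus (real topic models need far more documents):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = ["the team won the ball game", "players scored in the match",
          "the election and the vote", "parliament passed the new law"]
counts = CountVectorizer(stop_words="english").fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
doc_topics = lda.transform(counts)   # per-document topic mixtures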
141
What are the pros and cons of topic modelling?
Pros: - Language-agnostic (works with any document-term matrix). - Unsupervised (no labeled data needed). - Provides interpretable topics (e.g., "genetics," "space"). Cons: - Poor for short texts (e.g., tweets). - Prone to overfitting; requires tuning topic count. - Cannot generalize to unseen documents without retraining.
142
What is a document-term matrix, and why is it sparse?
A matrix where: - Rows: Documents. - Columns: Words (from the corpus vocabulary). - Values: Word counts or TF-IDF scores. - Sparsity: Most documents use only a small subset of the vocabulary (e.g., 50 words out of 10,000), resulting in many zeros
143
What is word embedding, and how does it differ from TF-IDF?
- Word Embedding: Dense vector representation capturing semantic/contextual relationships (e.g., Word2Vec, GloVe). Example: "king" - "man" + "woman" ≈ "queen." - TF-IDF: Sparse vector based on word frequency, ignoring semantics. Key Difference: Embeddings preserve meaning; TF-IDF focuses on word importance.
144
How does Word2Vec generate word embeddings?
- Skip-gram: Predicts context words given a target word (e.g., "cat" → "purrs," "meows"). - CBOW (Continuous Bag-of-Words): Predicts target word from context (e.g., ["purrs," "meows"] → "cat"). Training: A shallow neural network optimizes weights to maximize prediction accuracy, producing embeddings in the hidden layer.
145
What are the properties of good word embeddings?
- Semantic Similarity: Related words (e.g., "ocean," "sea") have similar vectors. - Analogies: Linear relationships reflect semantic rules (e.g., "king" - "man" + "woman" ≈ "queen"). - Multilingual Alignment: Embeddings can map similar words across languages (e.g., "ship" ↔ "navio").
146
What is polysemy, and how does it challenge word embeddings?
- Polysemy: Words with multiple meanings (e.g., "queen" as monarch vs. band). - Challenge: Single embedding may conflate meanings. - Solution: Sense embeddings (e.g., one vector per meaning) using labeled data.
147
How are sentence/document embeddings created?
Averaging Word Embeddings: Simple but loses context. Advanced Methods: - Doc2Vec: Extends Word2Vec to paragraphs. - BERT: Contextual embeddings using transformer networks. Use Case: Measures similarity between texts (e.g., "I traveled to Estonia" ≈ "She flew to Tallinn").
148
What are applications of sentence embeddings?
- Semantic Search: Find documents with similar meaning. - Paraphrase Detection: Identify equivalent sentences. - Multimodal Learning: Align text and images in shared space (e.g., "cat on table" ≈ image of a cat). - Automatic translation - Text summarization
149
Why is dimensionality reduction important in NLP?
- Efficiency: Sparse high-dimensional data (e.g., 10K-word vocab) is computationally expensive. - Noise Reduction: Removes irrelevant features (e.g., stopwords). - Visualization: Projects data to 2D/3D for exploration (e.g., t-SNE plots of topics).
150
Compare LDA and Word2Vec for text analysis.
- LDA: Pros: Interpretable topics, works with unlabeled data. Cons: Struggles with short texts, no word-level semantics. - Word2Vec: Pros: Captures word relationships, works for short texts. Cons: No document-level topics; requires large corpus.
151
How can topic models track trends over time?
By applying LDA to time-stamped documents (e.g., news articles) and plotting topic prevalence: Example: Rise of "quantum computing" topics in 2010s vs. "alchemy" in 1900s.
152
What is the "bag-of-words" limitation in embeddings?
Ignores: - Word Order: "dog bites man" vs. "man bites dog." - Context: "bank" (financial vs. river). - Solution: Contextual models like BERT.
153
How do you evaluate topic models?
- Coherence Score: Measures semantic consistency of top words in a topic (e.g., "ball," "team," "score" for sports). - Human Judgment: Manual review for interpretability. - Perplexity: Lower values indicate better generalization (rarely used due to poor correlation with quality).
154
What type of algorithm is K-Nearest Neighbours (KNN)?
KNN is a supervised learning algorithm commonly used for classification problems. It operates on the assumption that similar data points exist close to each other, using a distance metric (e.g., Euclidean distance) to determine similarity.
155
How does KNN determine the class of a new data point?
KNN finds the k nearest neighbours to the new data point and assigns the majority class among those neighbours. This majority vote determines the new point's class label.
156
What is the role of the parameter 'k' in KNN?
The value of 'k' is a hyperparameter that determines how many neighbours are considered when classifying a new data point. It directly influences the model's performance.
157
What steps are involved in the KNN classification process?
1. Compute the distance between the new point and all other points. 2. Identify the k nearest neighbours. 3. Determine the majority class among these neighbours. 4. Assign the new point to this majority class.
158
Why is KNN susceptible to outliers?
KNN considers the closest points regardless of how representative they are. Outliers can be mistakenly considered valid neighbours and thus mislead the classification due to their extreme or unrepresentative values.
159
How does class imbalance affect KNN performance?
If one class significantly outweighs others in the dataset, KNN may become biased toward predicting that majority class, especially for larger k values.
160
What is a common solution to mitigate class imbalance in KNN?
A common solution is weighted KNN, where each neighbour's vote is weighted inversely to its distance from the query point, giving more influence to closer, potentially more relevant neighbours.
161
What are two common strategies for choosing the optimal value of k?
1. Empirical testing: Start with k=1 and evaluate accuracy on test data, incrementally increasing k. 2. Heuristic approach: Use the square root of the number of training samples, ensuring k is odd to avoid ties.
162
Why should k be an odd number in binary classification?
To avoid ties during majority voting, where two classes could receive an equal number of votes.
163
What is Weighted KNN and how does it improve upon basic KNN?
Weighted KNN assigns higher weights to closer neighbours during classification, improving robustness especially in cases with class imbalance or when data density varies.
164
How does KNN compare to other algorithms in terms of learning?
KNN is an instance-based learning algorithm, meaning it does not build a model but rather makes decisions based on stored data. The hypothesis grows with data size, which can lead to performance issues with large datasets.
165
Why is KNN not ideal for high-dimensional datasets?
High dimensionality leads to increased computational cost and can reduce the effectiveness of distance metrics due to the curse of dimensionality. Feature selection or dimensionality reduction is often required.
166
Why is feature scaling important in KNN?
Since KNN relies on distance metrics, features must be on the same scale to prevent some from dominating. Normalization (e.g., scaling to [0,1]) is typically used.
167
In the weather prediction example, how are categorical features handled in KNN?
Categorical features are converted to numerical values using LabelEncoder from sklearn.preprocessing, enabling distance computation.
168
What is the process of building a KNN classifier using sklearn?
1. Encode categorical features. 2. Combine features into a dataset. 3. Create a KNeighborsClassifier object with a specified k. 4. Fit the classifier on the training set. 5. Use predict() on the test set to make predictions.
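A minimal sketch of these steps with a hypothetical weather dataset (the values and encodings are illustrative, not the lecture's exact data):
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier

weather = ["sunny", "overcast", "rainy", "sunny", "overcast"]
temperature = ["hot", "mild", "cool", "mild", "hot"]
play = ["no", "yes", "yes", "no", "yes"]

X = list(zip(LabelEncoder().fit_transform(weather),
             LabelEncoder().fit_transform(temperature)))   # encoded feature pairs
y = LabelEncoder().fit_transform(play)

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[0, 1]]))   # predict for one encoded (weather, temperature) pair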
169
In the weather example, what was the KNN prediction for (weather: overcast, temperature: mild)?
The classifier predicted that it would rain.
170
In the weather example, what was the prediction for (weather: sunny, temperature: hot)?
The classifier predicted that it would not rain.
171
What is the wine dataset used for in the KNN examples?
It is used to demonstrate multi-class classification using KNN. The dataset includes 13 features and 3 wine cultivars (class_0, class_1, class_2).
172
How is the wine dataset prepared for training and testing in the KNN example?
The dataset is loaded from sklearn, then split into training and test sets using train_test_split(), with 20% reserved for testing.
173
What was the accuracy of the wine classifier with k=5?
The accuracy was 72.22%.
174
What happened when k was increased to 7 in the wine dataset example?
The accuracy increased to 80.5%, showing that performance can improve with higher k, though this depends on the data.
175
What are key takeaways from using KNN?
1. KNN is intuitive and powerful for both binary and multiclass classification. 2. It is sensitive to outliers, class imbalance, and feature scaling. 3. Model complexity grows with data, making it unsuitable for very large or high-dimensional datasets without preprocessing. 4. Careful tuning of k and data normalization is critical for good performance.
176
What is a Support Vector Machine (SVM)?
SVM is a supervised machine learning model used for classification, regression, and clustering problems. It works by finding a hyperplane that best separates data points of different classes with the largest possible margin.
177
What is the objective of an SVM?
The main goal of an SVM is to find a decision boundary (hyperplane) that maximizes the margin, which is the shortest distance from the hyperplane to the closest data points of any class (support vectors).
178
What are support vectors in SVM?
Support vectors are the data points closest to the decision boundary. They determine the position and orientation of the hyperplane. Only these points are used in defining the margin.
179
Why do we maximize the margin in SVM?
A larger margin is believed to lead to better generalization on unseen data. It reduces the risk of overfitting.
180
What is a maximum margin classifier?
It is an SVM model that selects the decision boundary that maximizes the margin between classes. It assumes perfect separation without outliers.
181
Why is the maximum margin classifier sensitive to outliers?
Because it tries to perfectly separate all training data, including outliers, which can distort the decision boundary significantly and lead to poor generalization.
182
How does SVM deal with outliers?
SVM can use a soft margin, which allows some misclassification in the training data to prevent the model from being overly influenced by outliers.
183
What is the soft margin in SVM?
A soft margin permits misclassification of some training points, enabling better generalization by preventing overfitting to noise or outliers.
184
What are the implications of a soft margin on bias and variance?
A soft margin introduces higher bias but often lower variance, making the model more robust and better at generalizing to new data.
185
What is the purpose of nonlinear transformation in SVM?
Nonlinear transformation maps data from the original input space to a higher-dimensional feature space where a linear separator (hyperplane) can be applied.
186
Give an example of a nonlinear transformation that helps SVM separate data.
Applying the transformation f(x) = x² to 1D data adds a second dimension, making originally non-linearly separable data linearly separable in 2D.
187
What is the kernel trick in SVM?
The kernel trick is a method that allows SVM to compute the dot product in a high-dimensional space without explicitly transforming the data, significantly reducing computational complexity.
188
What is a kernel function?
A kernel function computes the dot product between two vectors in a higher-dimensional feature space without performing the actual transformation. It is defined as k(x, z) = ⟨f(x), f(z)⟩.
189
Why is the kernel trick useful?
It avoids the explicit transformation of data to high-dimensional space, which can be computationally expensive or intractable. Instead, similarity is computed directly using kernel functions.
190
Name a kernel function example and explain its use.
A common example is the second-degree polynomial kernel, which computes similarity in a 2D-transformed space using inputs from the original space, enabling linear separation of nonlinear data.
191
What is the IRIS dataset and how is it used in SVM?
The IRIS dataset is a classical dataset with 3 classes of iris flowers (Setosa, Versicolour, Virginica). It is used to demonstrate multi-class classification using SVM with different support vector classifiers (SVCs).
192
What does the regularization parameter C control in SVM?
The C parameter controls the trade-off between margin size and misclassification. A smaller C creates a wider margin allowing more misclassifications (higher bias), while a larger C reduces misclassifications but can overfit (higher variance).
193
What does the gamma parameter in SVM control?
Gamma defines how far the influence of a single training example reaches. Low gamma means far reach (generalization), high gamma means close reach (can lead to overfitting).
194
What is the purpose of using a mesh grid in SVM classification visualization?
A mesh grid is used to create a grid of points over the feature space to visualize decision boundaries produced by SVM models.
195
What dataset is used in the face recognition SVM example?
The Labeled Faces in the Wild (LFW) dataset is used, containing thousands of images of public figures with labeled identities.
196
How is dimensionality reduction used in the face recognition example?
Principal Component Analysis (PCA) is used to reduce the high-dimensional image data (nearly 3000 pixels per image) to 150 principal features before classification.
197
Why is a pipeline used in the SVM face recognition example?
A pipeline automates preprocessing (like PCA) followed by classification (SVM), ensuring reproducibility and simplification of the modeling process.
198
What is the role of grid search in training the SVM for face recognition?
GridSearchCV is used to find the best combination of hyperparameters (e.g., C and gamma) by evaluating performance through cross-validation.
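A minimal sketch of the same idea on the built-in IRIS data rather than LFW (the parameter grid values are illustrative):
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_)   # best C and gamma found by cross-validation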
199
What is a classification report and how is it used in the SVM example?
A classification report shows metrics such as precision, recall, and F1-score for each class, helping to assess the performance of the classifier on test data.
200
What are the key takeaways from the SVM lecture?
1. SVM separates data with a maximum-margin hyperplane. 2. It can be extended to non-linear cases via kernel tricks. 3. It is sensitive to outliers without a soft margin. 4. Useful in both binary and multiclass problems (e.g., IRIS, LFW). 5. Requires careful tuning of C and gamma for optimal results.
201
What is the fundamental idea behind Decision Trees in machine learning?
Decision Trees split the data into subsets based on feature values to form a tree-like structure. Each internal node represents a decision based on a feature, each branch represents an outcome of that decision, and each leaf node represents a class label (in classification) or a predicted value (in regression). They recursively divide the feature space to create increasingly homogeneous subsets with respect to the target variable.
202
Why are Decision Trees considered a type of greedy algorithm?
Decision Trees are greedy because they make the optimal decision at each node (i.e., selecting the best feature to split on) without considering future splits. This locally optimal choice aims to reduce impurity the most at each stage, rather than finding a globally optimal tree.
203
What is the hypothesis space in Decision Trees?
The hypothesis space for Decision Trees includes all possible trees that can be formed using the given features. Each tree represents a piecewise constant function that partitions the input space and assigns outputs based on the values in the leaf nodes.
204
What is the difference between regression and classification trees?
In classification trees, the output at the leaf nodes is a class label, and the model learns to classify inputs into discrete categories. In regression trees, the output is a continuous value, and the model predicts real-valued outputs.
205
How does a Decision Tree learn from training data?
A Decision Tree learns by recursively splitting the training data into subsets based on feature values that maximize some splitting criterion (e.g., information gain, Gini impurity, or variance reduction), and continues this process until a stopping criterion is met (like minimum leaf size, maximum depth, or pure leaves).
206
What is a decision stump?
A decision stump is a Decision Tree with only one level of decision—i.e., it consists of a single split on one feature, with two leaf nodes. It represents the simplest possible decision tree.
207
What are the key elements of Decision Tree structure?
- Root node: The top of the tree where the first split occurs.
- Internal nodes: Points where the data is split based on a feature.
- Branches: Outcomes of the split (conditions leading to sub-nodes).
- Leaf nodes: Terminal nodes that output the prediction.
208
What is the role of the splitting criterion in Decision Trees?
The splitting criterion determines how the tree chooses the best feature to split on at each node. It quantifies the "purity" of the resulting subsets. Common criteria include:
- Information Gain (entropy reduction)
- Gini Impurity
- Variance Reduction (for regression)
209
What is Information Gain and how is it used in Decision Trees?
Information Gain measures the reduction in entropy after a dataset is split on a feature. It is calculated as: IG(S, A) = Entropy(S) - ∑(|Sv|/|S|) * Entropy(Sv) where S is the current dataset and Sv are the subsets after splitting on attribute A. Features with higher information gain are preferred for splits.
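As a concrete illustration, here is a small self-contained sketch of these formulas; the function names and the tiny dataset are my own, not from the lecture.

```python
# Sketch: entropy and information gain for a categorical attribute.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """IG(S, A) = Entropy(S) - sum over values v of (|S_v|/|S|) * Entropy(S_v)."""
    n = len(labels)
    subsets = {}
    for label, value in zip(labels, attribute_values):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Hypothetical example: splitting play/no-play on a "windy" attribute.
play  = ["yes", "yes", "no", "no", "yes", "no"]
windy = ["false", "false", "true", "true", "false", "true"]
print(information_gain(play, windy))   # 1.0 bit: this split separates the classes perfectly
```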
210
What is Entropy in the context of Decision Trees?
Entropy measures the amount of uncertainty or impurity in a dataset. For a binary classification: Entropy(S) = -p+ log2(p+) - p- log2(p-), where p+ and p- are the proportions of positive and negative examples. A pure dataset (all one class) has entropy 0.
211
What is Gini Impurity and how does it compare to Entropy?
Gini Impurity is another metric for measuring the purity of a dataset. For a dataset with K classes: Gini(S) = 1 - ∑ pk^2 where pk is the proportion of class k. Like entropy, a lower Gini implies higher purity. It is often preferred due to computational simplicity.
212
How do Decision Trees handle continuous-valued features?
Continuous features are handled by finding a threshold that best splits the data. This is done by sorting the data values and evaluating potential split points between consecutive values, selecting the one that yields the highest information gain or lowest impurity.
213
How does a Decision Tree for regression differ in its splitting criterion?
Regression trees use variance reduction (or squared error reduction) instead of entropy or Gini impurity. The goal is to find splits that reduce the variance of the target variable within each resulting subset.
214
What is pruning in Decision Trees and why is it necessary?
Pruning reduces the size of a decision tree by removing sections that provide little predictive power. It helps prevent overfitting by simplifying the tree. Two types are:
- Pre-pruning: Halts tree growth early based on criteria like max depth.
- Post-pruning: Grows the full tree, then removes branches that don't improve validation performance.
215
What are some advantages of Decision Trees?
- Easy to interpret and visualize
- Capable of handling both numerical and categorical data
- Require little data preprocessing
- Able to model nonlinear relationships
216
What are the main disadvantages of Decision Trees?
- Prone to overfitting
- Unstable with small changes in data
- Can be biased toward features with more levels
- Often inferior performance compared to ensemble methods
217
What are common hyperparameters in Decision Trees?
- max_depth: Maximum depth of the tree
- min_samples_split: Minimum samples required to split a node
- min_samples_leaf: Minimum samples required at a leaf node
- max_features: Maximum features considered for a split
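For reference, a minimal sketch of where these hyperparameters appear in scikit-learn's DecisionTreeClassifier; the values shown are illustrative, not recommendations.

```python
# Sketch: the hyperparameters above as they appear in scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(
    criterion="gini",        # splitting criterion (or "entropy")
    max_depth=3,             # maximum depth of the tree
    min_samples_split=4,     # minimum samples required to split a node
    min_samples_leaf=2,      # minimum samples required at a leaf
    max_features=None,       # features considered when looking for a split
    random_state=0)
tree.fit(Xtr, ytr)
print(tree.score(Xte, yte))  # accuracy on the held-out split
```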
218
How does feature selection affect Decision Trees?
Feature selection directly impacts the structure of a decision tree, as the algorithm selects features for splitting. Irrelevant or redundant features may lead to unnecessary splits, increased complexity, and reduced generalization.
219
How does the greedy approach limit the optimality of Decision Trees?
Because decision trees use greedy splitting (locally optimal), they do not backtrack or consider global optimality. This means the final tree may not be the smallest or most accurate possible tree across all combinations of splits.
220
What is a social network in the context of data science?
A social network is a structure made up of nodes (representing people) and edges (representing relationships between those people). It is used to understand the topology, communities, and centrality of interactions among individuals.
221
What are the three main areas of study in social network analysis mentioned in the lecture?
The three main areas are: 1) Topology (structure of connections), 2) Communities (clusters of tightly connected nodes), and 3) Centrality (importance or influence of individual nodes).
222
How can online social networks be analyzed through data?
Online social networks can be analyzed using graph theory and network science tools like Gephi. Data such as user interactions, retweets, mentions, and co-occurrences can serve as proxies to reconstruct social networks.
223
What is a proxy social network and what are some examples of edges in it?
A proxy social network uses alternative data to define relationships. Examples include: Retweets, Mentions, Co-occurrences (e.g., in documents), Citations (academic works), and Co-authorships. These represent indirect social ties derived from activity data.
224
Define centrality in the context of network analysis.
Centrality measures the importance, influence, or connectedness of a node within a network. It can indicate access to information, control over communication, or prestige within the network.
225
What is degree centrality and how is it interpreted in directed networks?
Degree centrality is the number of edges connected to a node. In directed networks, it's divided into in-degree (incoming edges) and out-degree (outgoing edges). A higher degree may indicate greater influence or access to information.
226
When is degree centrality most useful and what are its limitations?
Degree centrality is most useful in simple analyses for identifying active nodes. However, it does not consider the importance of neighbors or indirect connections.
227
What is eigenvector centrality and what does it measure?
Eigenvector centrality accounts for both the quantity and quality of connections. A node is more central if it is connected to other highly central nodes. It’s particularly effective in undirected networks.
228
What is Katz centrality and how does it differ from eigenvector centrality?
Katz centrality extends eigenvector centrality by allowing for influence from distant nodes using a damping factor. It’s useful in directed networks where eigenvector centrality might not apply well.
229
What is PageRank and how is it related to Katz centrality?
PageRank is a variant of Katz centrality used by Google Search. It distributes importance based on the out-degree of neighbors, prioritizing links from more influential sources.
230
What is closeness centrality and what does it measure?
Closeness centrality measures the average length of the shortest paths from a node to all others. It identifies nodes that can quickly interact with others, useful for assessing communication efficiency.
231
What are the challenges of using closeness centrality?
Closeness centrality scores often fall within a narrow range, which makes comparisons between nodes less informative, and it is not well defined on disconnected networks (graphs with more than one component), since no finite shortest path exists between some pairs of nodes.
232
What is betweenness centrality and what does it indicate?
Betweenness centrality quantifies the number of times a node acts as a bridge along the shortest path between two other nodes. It measures control over communication and potential power in the network.
233
In what scenarios is betweenness centrality particularly useful?
It's useful in identifying brokers or gatekeepers in a network, i.e., nodes that control information flow between others. It’s effective in analyzing power and robustness.
234
How do centrality measures compare in identifying important nodes?
Different measures highlight different aspects: Degree focuses on direct connections, Eigenvector on influential neighbors, Closeness on speed of reach, and Betweenness on control of flow.
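A short networkx sketch comparing these measures on its built-in karate-club graph; the choice of graph is just for illustration.

```python
# Sketch: comparing centrality measures with networkx on a small built-in graph.
import networkx as nx

G = nx.karate_club_graph()

measures = {
    "degree":      nx.degree_centrality(G),       # direct connections
    "eigenvector": nx.eigenvector_centrality(G),  # connections to well-connected nodes
    "closeness":   nx.closeness_centrality(G),    # average shortest-path distance to others
    "betweenness": nx.betweenness_centrality(G),  # how often a node bridges shortest paths
    "pagerank":    nx.pagerank(G),                # Katz-style importance with damping
}
for name, scores in measures.items():
    top = max(scores, key=scores.get)
    print(f"{name:12s} most central node: {top} ({scores[top]:.3f})")
```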
235
What are historical examples of social network analysis?
Examples include JL Moreno's sociograms from 1934 analyzing student friendships and runaway behavior in girls' homes. They used directional and mutual attraction relationships to model behavior.
236
What was the significance of the Königsberg bridges in network science?
The Königsberg bridge problem inspired graph theory, showing how real-world problems can be modeled mathematically. Euler's solution laid the foundation for modern network analysis.
237
What is the Watts-Strogatz model and what does it explain?
The Watts-Strogatz model introduces 'small-world' networks, where most nodes are not neighbors but can be reached through a small number of steps. It explains phenomena like short path lengths in real-world networks.
238
What are scale-free networks and who introduced the concept?
Scale-free networks, introduced by Barabási and Bonabeau, follow a power-law degree distribution, meaning a few nodes have many connections while most have few. These networks are robust and naturally arise in many systems.
239
Why are networks considered to have emergent properties?
Networks exhibit emergent properties, where the whole is more than the sum of its parts. These properties, such as resilience, influence, or information spread, arise from the structure and interaction patterns (topology) rather than individual elements.
240
What is the definition of learning according to Herbert Simon?
Learning denotes changes in a system that are adaptive, enabling the system to perform the same or similar tasks more effectively in the future. This implies improvement, memory, and generalization.
241
What distinguishes unsupervised learning from supervised learning?
In supervised learning, the model is trained using labeled data (i.e., ground truth), while unsupervised learning works without labeled outputs, discovering hidden patterns or structures in input data.
242
What are common real-world applications of unsupervised learning?
Applications include customer segmentation, fraud detection, identifying new species, and pre-processing steps for supervised learning such as defining classes or topics.
243
What are the general steps involved in clustering?
Clustering involves iterating over data points, calculating distance or similarity between them, and grouping points that are closer to each other than to those outside their group.
244
How does clustering work conceptually?
Clustering identifies inherent groupings in data by measuring pairwise similarities and forming clusters with higher internal similarity and lower external similarity.
245
What is community detection and how is it related to clustering?
Community detection is the identification of modules or communities in network structures. It's conceptually similar to clustering but applied to nodes and edges in graphs, often used in network science.
246
Why is community detection important in social network analysis?
It helps uncover hidden structures in social networks such as user groups or communities that interact more heavily with each other, aiding in analysis of influence and behavior.
247
What is topic modelling and how does it apply unsupervised learning?
Topic modelling extracts abstract topics from text documents using clustering techniques. Each topic is represented by keywords. Algorithms like LDA (Latent Dirichlet Allocation) are commonly used (external source).
248
What are some examples of clustering algorithms?
- K-Means (user-defined K centroids)
- DBSCAN (density-based)
- Hierarchical Clustering (agglomerative or divisive trees)
Each has different assumptions and outcomes.
249
How does K-Means clustering work?
K-Means assigns each point to the nearest of K centroids and iteratively updates centroid positions to minimize within-cluster variance. The value of K is chosen by the user.
250
How does DBSCAN differ from K-Means?
DBSCAN groups points in high-density regions, automatically detecting the number of clusters and identifying outliers. It is useful for irregular cluster shapes and noisy data.
251
What is hierarchical clustering and how is it performed?
Hierarchical clustering creates a tree (dendrogram) of nested clusters. It can be agglomerative (bottom-up) or divisive (top-down), and doesn’t require specifying the number of clusters in advance.
252
What is the difference between hard and soft clustering?
Hard clustering assigns each object to a single cluster, while soft clustering allows objects to belong to multiple clusters with probabilities, similar to probabilistic models like Gaussian Mixture Models.
253
Why do different clustering algorithms produce different results?
Algorithms vary in their assumptions, initialization methods, and sensitivity to noise. Different algorithms may detect different structures in the same dataset.
254
What are the key ingredients for successful clustering?
Two critical components are: 1) How the data is represented and 2) The similarity or distance metric used to compare data points.
255
Why is data representation important in clustering?
The choice of features and how they are structured (e.g., vectors, categorical values) affect the clustering result. Poor representations can hide true groupings or create misleading clusters.
256
How does similarity or distance metric affect clustering?
Metrics define how close or far points are. Common ones include Euclidean distance (L2 norm), Manhattan distance (L1 norm), and Jaccard similarity for sets. The chosen metric impacts clustering shape and sensitivity.
257
What is Euclidean distance and when is it used?
Euclidean distance measures the straight-line distance between two points in Euclidean space. It’s widely used in algorithms like K-Means when data is continuous and real-valued.
258
What is Manhattan distance and how does it differ from Euclidean?
Manhattan distance (L1 norm) is the sum of absolute differences across dimensions. It's more robust to outliers and better for high-dimensional or grid-based data.
259
What is Jaccard similarity and when is it used?
Jaccard similarity measures overlap between two sets, calculated as intersection over union. It’s useful for binary or categorical data (e.g., tag co-occurrence, basket analysis).
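A small sketch computing the three measures above, using NumPy and plain Python sets; the vectors and sets are made up.

```python
# Sketch: Euclidean and Manhattan distance, and Jaccard similarity.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # L2 norm: straight-line distance
manhattan = np.sum(np.abs(a - b))           # L1 norm: sum of absolute differences

s1, s2 = {"milk", "bread", "eggs"}, {"milk", "eggs", "butter"}
jaccard = len(s1 & s2) / len(s1 | s2)       # intersection over union for sets

print(euclidean, manhattan, jaccard)        # ~3.606, 5.0, 0.5
```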
260
What are KL and Jensen-Shannon divergences used for?
These are metrics for comparing probability distributions. KL divergence is asymmetric, while Jensen-Shannon is a symmetric, smoothed version. Used in topic modelling and probabilistic clustering (external source).
261
Why is there often no single correct clustering answer?
Clustering is unsupervised, and different algorithms may reveal different but equally valid structures in the same dataset. The best choice depends on the problem and evaluation criteria.
262
How do social constructs and survey categories affect clustering results?
How data is labeled and categorized (e.g., UK ethnicity categories) can reflect biases and power dynamics. Understanding these contexts is vital in ethical data analysis. See: "Data Feminism" by D’Ignazio & Klein (external source).
263
Why is dimensionality reduction important?
It removes noise, focuses on the most informative features or combinations, and reduces computational complexity, making data analysis more efficient and interpretable.
264
What is feature selection in dimensionality reduction?
Feature selection is the process of selecting a subset of relevant features while removing less informative or redundant ones. It can use filter, wrapper, or embedded methods.
265
What is variance thresholding in feature selection?
A filter method that removes features with variance below a certain threshold, under the assumption that low-variance features contribute little information. Because variance depends on a feature's scale, features should first be brought to a comparable range (e.g., min-max normalization) so that their variances can be compared fairly.
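A minimal scikit-learn sketch of variance thresholding; the threshold and the toy matrix are made-up values.

```python
# Sketch: filter-style feature selection with a variance threshold.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 1.0, 0.1],
              [0.0, 2.0, 0.1],
              [0.0, 3.0, 0.2],
              [0.0, 4.0, 0.1]])   # column 0 is constant, column 2 barely varies

selector = VarianceThreshold(threshold=0.05)  # drop features with variance below 0.05
X_reduced = selector.fit_transform(X)
print(selector.variances_)   # per-feature variances
print(X_reduced.shape)       # (4, 1): only the middle column survives
```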
266
What is forward search in feature selection?
A wrapper method that starts with one feature, builds models incrementally by adding features that improve performance, and continues until the optimal number of features is selected.
267
What is recursive feature elimination (RFE)?
A wrapper method that starts with all features and iteratively removes the least important ones based on model performance until a specified number of features remain.
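A minimal RFE sketch with scikit-learn, here wrapped around a decision tree; the choice of estimator, dataset, and number of features to keep are illustrative.

```python
# Sketch: recursive feature elimination wrapped around a decision tree.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0),
          n_features_to_select=5)   # iteratively drop the least important features
rfe.fit(X, y)
print(rfe.support_)                 # boolean mask of the features that were kept
print(rfe.ranking_)                 # 1 = kept; larger numbers were eliminated earlier
```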
268
What is an embedded method for feature selection and give an example.
Embedded methods perform feature selection during model training. Example: Decision Trees, which choose features based on criteria like Gini impurity, information gain, or variance reduction.
269
What is the core idea behind feature extraction?
Feature extraction transforms the data into a new space by combining original features into new, more informative dimensions, often reducing redundancy and highlighting structure.
270
What is Principal Component Analysis (PCA)?
PCA is a linear transformation that identifies the directions (principal components) of maximum variance in the data. These are orthogonal and ranked by the amount of variance they explain.
271
How are principal components constructed in PCA?
Principal components are linear combinations of the original features. They are uncorrelated and ordered by their contribution to the data's total variance.
272
What are the mathematical steps to perform PCA?
1) Compute the covariance matrix from the data. 2) Diagonalize it to find eigenvalues and eigenvectors. 3) Eigenvectors are PCs, and eigenvalues represent variances. 4) Multiply the data by the eigenvectors to transform it.
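These steps can be reproduced directly in NumPy; a sketch on made-up data follows, with the step numbers in the comments matching the list above.

```python
# Sketch: PCA "by hand" with NumPy.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # toy data: 200 samples, 3 features
X = X - X.mean(axis=0)                 # centre the data first

cov = np.cov(X, rowvar=False)          # 1) covariance matrix (3 x 3)
eigvals, eigvecs = np.linalg.eigh(cov) # 2) diagonalize: eigenvalues and eigenvectors

order = np.argsort(eigvals)[::-1]      # 3) rank PCs by the variance they explain
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

X_pca = X @ eigvecs[:, :2]             # 4) project the data onto the top 2 components
print(eigvals / eigvals.sum())         # fraction of total variance explained by each PC
```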
273
What is the significance of eigenvalues in PCA?
Eigenvalues indicate the variance explained by each principal component. Higher eigenvalues correspond to more informative components.
274
What is the worst-case scenario for PCA?
When all variables are equally important and uncorrelated, PCA provides little to no dimensionality reduction benefit.
275
What are the limitations of PCA?
PCA assumes linear relationships and is sensitive to feature scaling. It also performs poorly when important structure lies in nonlinear relationships.
276
What is t-SNE and what does it do?
t-SNE (t-distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction method used for visualization. It preserves local structures by matching distance distributions between high- and low-dimensional spaces.
277
How does t-SNE work step-by-step?
1) Calculate pairwise distances in high-D space and fit Gaussian distribution. 2) Randomly scatter points in low-D space. 3) Fit t-distribution in low-D space. 4) Use gradient descent to minimize divergence between the two distributions.
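In practice the whole procedure is wrapped in a single library call; a hedged scikit-learn sketch, where the perplexity and initialization are illustrative choices.

```python
# Sketch: t-SNE for 2-D visualization of 64-dimensional digit images.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X_2d = TSNE(n_components=2, perplexity=30,
            init="pca", random_state=0).fit_transform(X)
print(X_2d.shape)   # (1797, 2): one low-dimensional point per digit
```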
278
What are strengths and weaknesses of t-SNE?
Strengths: excellent for visualizing high-dimensional data. Weaknesses: large (between-cluster) distances are unreliable, it is memory-intensive, results depend heavily on hyperparameters, and the resulting axes are not interpretable.
279
What is UMAP and how does it differ from t-SNE?
UMAP (Uniform Manifold Approximation and Projection) is a faster, more scalable nonlinear technique that preserves both local and global structure and can project to more than 3 dimensions.
280
What are benefits of UMAP over t-SNE?
UMAP runs faster, uses less memory, retains both local and global structure, and can be used beyond 2D or 3D visualizations.
281
What are common issues with t-SNE and UMAP?
Issues include high sensitivity to hyperparameters, misleading cluster sizes and distances, and lack of interpretability for the resulting axes.
282
Why is it important to learn how t-SNE and UMAP work?
Understanding their mechanics allows critical use and interpretation, helping users assess when they’re effective and when they might mislead. Tools must be evaluated, not just applied.
283
What is the John von Neumann quote and its relevance here?
“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” It highlights the dangers of overfitting and blindly trusting flexible models without understanding.
284
Why is hierarchical clustering a natural way to organize data?
Because it represents varying degrees of similarity through a tree structure, enabling flexible data partitioning at different levels.
285
What is a dendrogram?
A tree diagram used in hierarchical clustering to show nested groupings of data. Each node is a cluster; clusters of size one are singletons.
286
What is agglomerative (bottom-up) hierarchical clustering?
It starts with individual data points and iteratively merges the closest clusters until a full hierarchy (dendrogram) is built.
287
Why can't we try all dendrograms in agglomerative clustering?
The number of possible dendrograms grows super-exponentially with the number of data points, making brute-force approaches computationally infeasible.
288
What are the common methods for calculating distance between clusters?
- Single linkage: distance between closest points in clusters (can cause chaining)
- Complete linkage: distance between farthest points (can split large clusters)
- Average linkage: average distance between all points
- Centroid linkage: distance between cluster centroids (biased to spherical clusters)
- Ward’s method: minimizes total within-cluster variance (also spherical bias)
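A SciPy sketch comparing these linkage criteria on two made-up blobs, cutting each dendrogram into two flat clusters.

```python
# Sketch: agglomerative clustering with different linkage criteria (SciPy).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),          # two toy blobs
               rng.normal(4, 0.5, (20, 2))])

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)                     # build the dendrogram bottom-up
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut it into 2 flat clusters
    print(method, np.bincount(labels)[1:])            # resulting cluster sizes
```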
289
In what field is agglomerative hierarchical clustering historically significant?
Phylogenetics, for constructing evolutionary trees. Other methods in the field include Maximum Likelihood Estimates and Bayesian Inference.
290
How is the cutoff level in a dendrogram used?
It determines where to 'cut' the tree to form flat clusters. While sometimes clear, often many valid cutoffs exist, and judgment or evaluation metrics are needed.
291
When is DBSCAN more suitable than hierarchical clustering?
When clusters vary in shape and size, and when data contains noise. DBSCAN does not assume a particular shape and can exclude outliers.
292
What is the core idea behind DBSCAN?
A point belongs to a cluster if the density of its neighborhood exceeds a threshold. Clusters are formed from densely connected regions.
293
What are the two key hyperparameters in DBSCAN?
- eps: radius of the neighborhood
- MinPts: minimum number of points required within that radius to form a core point
294
What are the three types of points in DBSCAN?
- Core point: has at least MinPts neighbors within eps
- Border point: fewer than MinPts within eps, but close to a core point
- Noise point: neither core nor border; outliers
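A scikit-learn sketch; eps, min_samples (MinPts), and the toy dataset are illustrative, and noise points come back with the label -1.

```python
# Sketch: DBSCAN on two crescent-shaped clusters.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_                         # cluster id per point, -1 = noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", np.sum(labels == -1))
print("core points:", len(db.core_sample_indices_))   # indices of the core points
```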
295
What property does DBSCAN ensure for clusters?
All points in a cluster are density-connected: each can be reached from the others through chains of core points in which consecutive points lie within eps of each other. This is what allows clusters of arbitrary shape.
296
What are advantages of DBSCAN?
- Handles noise well
- Identifies clusters of various shapes and sizes
- Doesn’t require predefining the number of clusters
297
What are the limitations of DBSCAN?
- Sensitive to choice of eps and MinPts
- Struggles with datasets having clusters of varying densities
298
How do eps and MinPts interact in DBSCAN?
A smaller eps requires a denser neighborhood to qualify as a core point. Adjusting one affects the required density threshold set by the other.
299
Why is DBSCAN considered a density-based method?
Because it defines clusters based on the local density of data points rather than relying on distance to centroids or linkage heuristics.
300
What makes DBSCAN different from K-means?
DBSCAN does not require specifying the number of clusters, works with arbitrary shapes, and can exclude noise, unlike K-means which assumes spherical clusters and needs a predefined K.
301
What is partitional clustering?
Algorithms that partition the dataset into K non-overlapping, flat clusters. Unlike hierarchical methods, partitional clustering is faster (O(NK) vs. O(N²)) and includes algorithms like K-means.
302
What is the main advantage of partitional methods over hierarchical clustering?
Efficiency: partitional methods scale better by comparing each point to K centroids (O(NK)) rather than to all other points (O(N²)).
303
How does the K-means clustering algorithm work?
1. Choose K and initialize centroids randomly.
2. Assign each point to the nearest centroid.
3. Recompute centroids based on current cluster members.
4. Repeat steps 2–3 until convergence (minimal change in assignments or centroids).
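A minimal NumPy sketch of this loop; the data and K are made up, and the sketch does not handle the rare case of an empty cluster.

```python
# Sketch: a bare-bones K-means loop mirroring the steps above.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # 1) random initial centroids
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                           # 2) assign to nearest centroid
        new_centroids = np.array([X[labels == j].mean(axis=0)   # 3) recompute centroids
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):               # 4) stop when centroids settle
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.default_rng(1).normal(m, 0.3, (50, 2)) for m in (0, 3, 6)])
labels, centroids = kmeans(X, k=3)
print(centroids)
```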
304
What are common stopping criteria for K-means?
Few or no point reassignments, minimal change in centroid positions, or minimal change in Sum of Squared Errors (SSE).
305
What is the computational complexity of K-means?
O(NdKt), where N = number of points, d = dimensions, K = clusters, t = iterations.
306
What are the main pros of K-means?
It’s efficient, conceptually simple, and works well for large datasets where clusters are spherical and similar in size.
307
What are the major cons of K-means?
- Assumes centroids are meaningful (troublesome for categorical data)
- Sensitive to outliers
- Requires predefining K
- Struggles with clusters of different size, shape, or density
- Sensitive to initial centroids
308
What are some preprocessing steps to improve K-means?
Pre-processing: normalize or standardize the data and remove outliers. Post-processing: eliminate small clusters that may be outliers, split "loose" clusters with high within-cluster error, and merge clusters that are close to each other.
309
What is a robust alternative to K-means for outlier handling and flexible clustering?
Gaussian Mixture Models (GMM), which use soft assignments and probabilistic cluster representations.
310
What is a Gaussian Mixture Model (GMM)?
A probabilistic model where each cluster is represented by a Gaussian distribution. GMMs estimate the mean (μ), variance (σ²), and mixing coefficient (π) for each Gaussian.
311
How does GMM differ from K-means?
- GMM uses soft assignment (probabilities) vs. hard assignment in K-means
- GMM clusters can be elliptical vs. spherical for K-means
- GMM maximizes log-likelihood; K-means minimizes SSE
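A scikit-learn sketch contrasting the two on the Iris data; the number of components and other settings are illustrative.

```python
# Sketch: hard K-means labels vs. soft GMM responsibilities.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X, _ = load_iris(return_X_y=True)

hard = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # one label per point
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
soft = gmm.predict_proba(X)            # per-point probability of belonging to each Gaussian

print(hard[:3])                        # hard cluster labels for the first three points
print(soft[:3].round(3))               # each row sums to 1: responsibilities of each component
```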
312
What is P(x) in a GMM context?
The probability density of a point x under the model, defined as the weighted sum of all the Gaussian components: P(x) = Σ_k π_k · N(x | μ_k, σ_k²), where the mixing coefficients π_k sum to 1 and N(x | μ_k, σ_k²) is the k-th Gaussian density.
313
What are clustering validation metrics?
- External: Compare clusters to ground truth labels
- Internal: Measure cohesion/separation (e.g., Silhouette Score)
- Relative: Compare different clustering algorithms
314
What is external validation in clustering?
Using a known label (e.g. class labels) to assess how well the clustering aligns with ground truth using matrices like incidence or confusion matrices.
315
What is internal validation?
Evaluates clustering based on intrinsic structure, including:
- Cohesion (Within Sum of Squares): how compact clusters are
- Separation (Between Sum of Squares): how distinct clusters are
316
What is the silhouette coefficient and how is it calculated?
The silhouette coefficient s for a point is s = (b - a) / max(a, b), where:
- a = mean intra-cluster distance
- b = mean nearest-cluster distance
Range: -1 (bad) to +1 (good).
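A scikit-learn sketch of both the averaged and per-point versions; the choice of clustering and dataset is illustrative.

```python
# Sketch: silhouette scores for a K-means clustering of the Iris data.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = load_iris(return_X_y=True)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))        # mean s over all points (closer to +1 is better)
print(silhouette_samples(X, labels)[:5])  # per-point s = (b - a) / max(a, b)
```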
317
How is the silhouette coefficient used in practice?
Used for individual points or averaged over all points. Commonly visualized as bar plots grouped by clusters to assess cluster quality.
318
How do K-means and GMM perform on datasets like Iris?
K-means makes hard assignments with spherical boundaries, while GMM gives soft assignments and adapts better to elliptical cluster shapes.
319
What is the primary goal of computer vision?
To extract high-level information from images or videos, such as object recognition, scene understanding, and 3D reconstruction.
320
What are common applications of computer vision?
3D image reconstruction, object detection, image segmentation, panorama stitching, 3D terrain modelling, and position tracking (e.g., used by the NASA Spirit rover).
321
How is an image represented in computer vision?
As a matrix of pixel values. For a color image, it's typically represented by three matrices: one each for Red, Green, and Blue channels.
322
What aspect of the human brain inspired deep neural networks?
The layered structure of the visual cortex, where information flows through hierarchical layers. Deep neural networks try to mimic this structure.
323
What is a perceptron in neural networks?
A basic computational unit that takes multiple inputs, applies weights, sums them, and passes them through an activation function to produce a binary output.
324
What are the drawbacks of using fully connected neural networks on images?
They scale poorly (100x100 image = 10,000 weights per node), are sensitive to small changes in input, and do not exploit spatial correlations between pixels.
325
Who were Hubel and Wiesel, and what did they discover?
They were neuroscientists who found that specific neurons in a cat's visual cortex respond to edges and lines of certain orientations. This inspired feature detection in convolutional neural networks.
326
What is a convolution in the context of neural networks?
It's the process of applying a filter (kernel) to an image using a dot product and sliding window approach, generating a feature map that highlights the presence of specific patterns (like edges).
327
How is a feature map generated in a CNN?
By applying a filter to each region of the image (via convolution), summing the results, adding a bias, and storing the output in a new matrix—the feature map.
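A NumPy sketch of this sliding-window computation, using a 6x6 toy image and a hand-made vertical-edge filter (strictly this is cross-correlation, which is what CNN libraries compute).

```python
# Sketch: producing a feature map with a 3x3 filter, stride 1, no padding.
import numpy as np

def conv2d(image, kernel, bias=0.0):
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel) + bias   # dot product of patch and filter
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
vertical_edge = np.array([[1.0, 0.0, -1.0]] * 3)        # simple vertical-edge detector
feature_map = conv2d(image, vertical_edge)
print(feature_map.shape)                 # (4, 4)
print(np.maximum(feature_map, 0))        # a ReLU would then zero out negative responses
```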
328
What does a feature map represent in CNNs?
It indicates where a particular feature (e.g., edge or pattern) occurs in the input image. Positive values suggest the feature is present; negative values suggest absence.
329
Why are convolutional neural networks effective for images?
They reduce the number of parameters compared to fully connected networks, exploit spatial relationships, and detect hierarchical features efficiently.
330
How has computer vision evolved since ~1980?
Advancements include deeper architectures (deep CNNs), more data, GPU acceleration, cloud computing, and accessible libraries like TensorFlow, Keras, and PyTorch.
331
What is deep learning in the context of computer vision?
A subfield of machine learning involving neural networks with many layers, enabling high-level abstractions in data, particularly useful for image and video tasks.
332
What are limitations of deep learning in computer vision?
It requires large datasets, significant computational resources (e.g., GPUs), lacks uncertainty representation, is hard to optimize, and is often seen as a black box.
333
What are some solutions or directions to overcome deep learning limitations?
Research into interpretable models, efficient architectures (e.g., MobileNet), improved training strategies, uncertainty-aware models, and transfer learning are ongoing solutions.
334
What comes next after generating feature maps in a CNN?
Applying non-linear activation functions (e.g., ReLU), downsampling (e.g., pooling), and fully connected layers to perform tasks like classification or detection.
335
Why is backpropagation essential in CNNs?
It enables the network to learn the optimal weights (filters) by minimizing the loss function through gradient descent across all layers.
336
What does the ReLU activation function do?
ReLU (Rectified Linear Unit) outputs max(0, x), discarding all negative values and keeping positive ones unchanged. It introduces non-linearity and is computationally efficient.
337
Why is ReLU preferred in CNNs?
It avoids the vanishing gradient problem seen in sigmoid/tanh, speeds up training, and works well for many image-based tasks.
338
What is the purpose of pooling in CNNs?
Pooling downsamples feature maps by summarizing regions (e.g., max or average pooling), reducing dimensionality and computation while preserving key features.
339
What is max pooling and how does it work?
Max pooling selects the highest value in a patch of the feature map. It helps retain the most salient features and makes the model more robust to spatial variations.
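A NumPy sketch of 2x2 max pooling with stride 2 on a made-up feature map.

```python
# Sketch: 2x2 max pooling with stride 2.
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    h, w = feature_map.shape
    out_h, out_w = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = feature_map[i * stride:i * stride + size,
                                j * stride:j * stride + size]
            out[i, j] = patch.max()          # keep only the strongest activation per patch
    return out

fmap = np.array([[1.0, 3.0, 2.0, 0.0],
                 [4.0, 6.0, 1.0, 2.0],
                 [0.0, 2.0, 5.0, 1.0],
                 [3.0, 1.0, 2.0, 8.0]])
print(max_pool(fmap))                        # [[6. 2.], [3. 8.]]
```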
340
What is average pooling and when might it be used?
Average pooling computes the average value of a patch. It may be used when smoother feature representations are desired.
341
What is stride in CNNs?
Stride is the number of pixels the filter moves during convolution or pooling. A larger stride reduces output size more aggressively.
342
What is padding and why is it used?
Padding adds extra pixels (usually zeros) around the input image to control the spatial size of the output and preserve edge information.
343
What happens after feature extraction and pooling in CNNs?
The output is flattened and passed into traditional machine learning classifiers, like logistic regression or fully connected neural networks, for final prediction.
344
Why might ReLU not always be ideal?
It discards all negative values, which can lead to dead neurons (never activated). Alternative activations like Leaky ReLU can help.
345
What issue is associated with the tanh activation function?
The derivative of tanh tends to zero for large input values, causing vanishing gradients that slow down or halt learning during backpropagation.
346
What is the softmax function used for in neural networks?
It converts output scores into a probability distribution over classes, making it useful for multi-class classification.
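A NumPy sketch of a numerically stable softmax; the logits are made up.

```python
# Sketch: softmax turning raw scores into a probability distribution.
import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)        # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs, probs.sum())                    # roughly [0.659 0.242 0.099], sums to 1.0
```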
347
What is the role of learning rate in training neural networks?
It determines how much weights are updated during training. Too high can overshoot minima; too low may converge slowly or get stuck.
348
What are common problems with learning rates?
Large rates may skip minima; small ones may lead to slow training or local minima entrapment. Choosing the right rate is crucial.
349
What are solutions to learning rate problems?
Use decay (reduce over time), constant rates, or advanced optimizers like Adagrad, Adadelta, RMSprop, or Adam that adjust learning rates dynamically.
350
What is momentum in gradient descent?
A technique that adds a fraction of the previous update to the current one, helping accelerate learning and smooth out oscillations.
351
What are hyperparameters in neural networks?
Settings defined before training, including filter size, number of filters, padding, stride, learning rate, dropout, batch size, activation function, etc.
352
Why is hyperparameter tuning important in CNNs?
The performance of a CNN depends heavily on well-chosen hyperparameters. They influence learning speed, generalization, and model accuracy.
353