Learning from Data Flashcards

(353 cards)

1
Q

What are the four Vs of Big Data?

A
  • Volume: Scale of Data
  • Variety: Different Forms of Data
  • Velocity: Analysis of Streaming Data
  • Veracity: Uncertainty of Data
2
Q

What is Structured Data?

A
  • Data that adheres to a data model
  • Conforms to a tabular format with relationship between the different rows and columns
  • Makes it easier to contextualise and understand the data
  • Examples include tables in SQL databases
  • Data elements are addressable for effective analysis
3
Q

What is unstructured data?

A
  • Data which is not organised according to a preset data model or schema, therefore cannot be stored in a traditional relational database
  • 80% - 90% of data generated and collected by organisations is unstructured. It is rich in content but not immediately usable without first being sorted
4
Q

What is semi-structured data?

A
  • Data that does not adhere to a data model, but has some level of structure
  • It contains tags, hierarchies, and other types of markers that give data structure
5
Q

What is the technology of different data types?

A
  • Structured: Based on relational database table
  • Semi-structured: Based on XML/RDF
  • Unstructured: Based on character and binary data
6
Q

What is the transaction management of different data types?

A
  • Structured: Matured transaction and concurrency techniques
  • Semi-structured: Transaction is adopted from DBMS, not matured
  • Unstructured: No transaction management and no concurrency
7
Q

What is the version management of different data types?

A
  • Structured: Versioning over tuples, rows, tables
  • Semi-structured: Versioning over tuples or graph is possible
  • Unstructured: Versioned as a whole
8
Q

What is the flexibility of different data types?

A
  • Structured: Schema dependent, less flexible
  • Semi-structured: More flexible than structured, but less than unstructured
  • Unstructured: More flexible and there is an absence of a schema
9
Q

What are the analysis methods of different data types?

A
  • Structured: SQL queries
  • Semi-structured: NoSQL query languages (e.g., for Cassandra, MongoDB)
  • Unstructured: Natural language processing, audio analysis, video analysis, text analysis
10
Q

What is the primary goal of data integration?

A

The goal of data integration is to combine data from heterogeneous sources into a single coherent data store, providing users with consistent access and delivery of data across various subjects and data structure types. This is particularly useful when data sources are disparate or siloed, such as across different hardware devices, software applications, or operating systems.

11
Q

Name and describe the five data integration strategies

A
  • Common User Interface (Manual Integration): Data managers manually handle every step of integration, from retrieval to presentation.
  • Middleware Data Integration: Uses middleware software to bridge communication between systems, especially legacy and newer systems.
  • Application-Based Integration: Software applications locate, retrieve, and integrate data by making it compatible across systems.
  • Uniform Data Access: Provides a consistent view of data without moving or altering it, keeping data in its original location.
  • Common Data Storage (Data Warehouse): Stores a duplicate copy of data in a central repository for uniform retrieval and presentation.
12
Q

What are the advantages and disadvantages of the Common User Interface strategy?

A
  • Advantages: Reduced cost, requires little maintenance, integrates a small number of data sources, and gives users total control.
  • Disadvantages: Data must be handled manually at each stage, scaling requires changing code, and the process is labor-intensive.
13
Q

What are the advantages and disadvantages of the Middleware Data Integration strategy?

A
  • Advantages: Middleware software conducts the integration automatically, and the same way each time
  • Disadvantages: Middleware needs to be deployed and maintained.
14
Q

What are the advantages and disadvantages of the Application-based Integration strategy?

A
  • Advantages: Simplified process, application allows systems to transfer information seamlessly, much of the process is automated.
  • Disadvantages: Requires specialist technical knowledge and maintenance, complicated setup.
15
Q

What are the advantages and disadvantages of the Uniform Access Integration strategy?

A
  • Advantages: Lower storage requirements, provides a simplified view of the data to the end user, easier data access
  • Disadvantages: Can compromise data integrity, data host systems are not designed to handle amount and frequency of data requests.
16
Q

What are the advantages and disadvantages of the Common Data Storage strategy?

A
  • Advantages: Reduced burden on the host system, increased data version management control, can run sophisticated queries on a stored copy of the data without compromising data integrity
  • Disadvantages: Need to find a place to store a copy of the data, increases storage cost, require technical experts to set up the integration, oversee and maintain the data warehouse.
17
Q

What percentage of their time do data scientists spend cleaning and organizing data?

A

Data scientists spend 60% of their time cleaning and organizing data, making it the most time-consuming part of their work.

18
Q

What is the least enjoyable part of data science according to surveys?

A

Cleaning and organizing data is the least enjoyable part, cited by 57% of respondents.

19
Q

What are the three main types of learning in machine learning?

A
  • Supervised Learning: Uses labeled data to learn a mapping from inputs to outputs.
  • Unsupervised Learning: Works with unlabeled data to find patterns or groupings.
  • Semi-Supervised Learning: Combines both labeled and unlabeled data.
20
Q

What is the difference between regression and classification in supervised learning?

A
  • Regression: Predicts a continuous quantitative response (e.g., income, stock price).
  • Classification: Predicts a qualitative response (e.g., marital status, cancer diagnosis).
21
Q

What is the general form of a supervised learning model?

A

The model learns a mapping function: y_p= f(Ω, x)

where:

y_p = predicted output
Ω = model parameters
x = input features

22
Q

What are hyperparameters in machine learning?

A

Hyperparameters are parameters not learned directly from the data but set before training (e.g., learning rate, number of layers in a neural network). They control the learning process and are often tuned for optimal performance.

23
Q

How is the quality of predictions measured in supervised learning?

A

A loss function J(y, y_p) quantifies the difference between predicted values y_p and actual values y. The goal is to minimize this function during training.

24
Q

What is the purpose of a loss function?

A

The loss function measures how well the model’s predictions match the actual data. It guides the optimization process to adjust model parameters (Ω) for better accuracy.

25
What is the equation for simple linear regression?
y_p(x) = β_0 + β_1·x + ϵ, where: β_0 = y-intercept, β_1 = slope, ϵ = error term.
26
How are the coefficients β_0 and β_1 calculated in linear regression?
β_1 = r × (S_y/S_x), where r is the Pearson correlation coefficient and S_x, S_y are the standard deviations of x and y. β_0 = ȳ − β_1 × x̄, where ȳ and x̄ are the sample means.
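A minimal numpy sketch of these formulas (not from the lecture; the data points are hypothetical):
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical data
y = np.array([6.0, 5.0, 7.0, 10.0])

r = np.corrcoef(x, y)[0, 1]                    # Pearson correlation coefficient
beta_1 = r * (y.std(ddof=1) / x.std(ddof=1))   # slope = r * (S_y / S_x)
beta_0 = y.mean() - beta_1 * x.mean()          # intercept = ȳ − β_1·x̄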
27
What does the Pearson correlation coefficient measure?
It measures the strength and direction of the linear relationship between two variables, ranging from -1 (perfect negative correlation) to +1 (perfect positive correlation).
28
What is an example of unsupervised learning?
Clustering houses based on electricity usage patterns to identify groups with similar consumption behaviors.
29
Why is logistic regression considered a regression method even though it’s used for classification?
Logistic regression estimates probabilities (a continuous output) for binary or multi-class responses, making it a regression method that is adapted for classification tasks.
30
What is the difference between features and labels in supervised learning?
- Features (inputs): Variables used to predict the outcome (e.g., movie budget). - Labels (outputs): The target variable being predicted (e.g., box office revenue).
31
What is the role of the error term (ϵ) in linear regression?
The error term captures the difference between the observed and predicted values, accounting for noise or unexplained variability in the data.
32
What is the purpose of a data warehouse in data integration?
A data warehouse stores a duplicate copy of data in a central repository, enabling sophisticated queries and analysis without compromising the integrity of the original data sources.
33
What are the disadvantages of uniform data access integration?
It can compromise data integrity and may overwhelm host systems not designed to handle high volumes or frequent data requests.
34
What is the Mean Squared Error (MSE), and how is it used in linear regression?
MSE averages the squared errors to measure model accuracy: MSE = (1/n) ∑ (y_i − y_p,i)². Purpose: minimizing MSE gives the optimal β_0 and β_1 (the least squares method).
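A minimal sketch of the computation, assuming hypothetical arrays of actual and predicted values:
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([6.0, 5.0, 7.0, 10.0])   # hypothetical actual values
y_pred = np.array([5.5, 6.0, 7.5, 9.0])    # hypothetical predictions

mse_manual = np.mean((y_true - y_pred) ** 2)       # average squared error
mse_sklearn = mean_squared_error(y_true, y_pred)   # same result via sklearn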
35
[REVISE EQUATIONS FROM LECTURE 2/3]
36
Compare regression for prediction vs. regression for interpretation.
- Prediction: Focus on minimizing error (e.g., MSE) to forecast y. Ignores interpretability (e.g., black-box models). - Interpretation: Focus on β coefficients to understand relationships (e.g., "How does marketing budget affect revenue?"). Example: Predicting customer churn (prediction) vs. analyzing feature importance in housing prices (interpretation).
37
What are the steps to implement linear regression in Python using sklearn?
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)   # Fit model
y_pred = model.predict(X_test)   # Predict
38
How do you handle an overdetermined system in linear regression?
An overdetermined system (more equations than unknowns) is solved by minimizing SSE using OLS. Example: For points (1,6), (2,5), (3,7), (4,10), OLS finds the line y = β_0 + β_1x that best fits all points.
39
What is the trade-off between model interpretation and prediction accuracy?
- Simple models (e.g., linear regression): Easier to interpret but may underfit. - Complex models (e.g., neural networks): Better prediction but harder to interpret ("black box"). Best practice: Choose based on goal (e.g., interpretability for policy-making, accuracy for forecasts).
40
What are the best modeling practices in supervised learning?
- Define a clear cost function (e.g., MSE for regression). - Train multiple models (e.g., different hyperparameters). - Compare performance using metrics (e.g., R^2, MSE). - Avoid overfitting by validating on test data.
41
Why is the sum of errors (∑ε_i) zero in OLS regression?
The OLS method ensures the mean error is zero by design (derivative of SSE w.r.t. β_0 enforces ∑ε_i =0). Implication: Positive and negative errors cancel out.
42
What is the key difference between linear regression and polynomial regression?
- Linear regression: Models the relationship as a straight line (y = β_0 + β_1x) - Polynomial regression: Extends linear regression by adding higher-order terms (y = β_0 + β_1x + β_2x^2 + …) to capture nonlinear patterns. Use case: Polynomial regression is preferred when data shows curvature (e.g., quadratic trends).
43
How do you determine the optimal polynomial order for a regression model?
Use the Bayesian Information Criterion (BIC): BIC = n·ln(SS_e) − n·ln(n) + p·ln(n), where SS_e = sum of squared errors, p = number of parameters, n = sample size. Rule: lower BIC indicates a better balance of fit and complexity.
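A minimal sketch applying this BIC formula to candidate polynomial degrees (hypothetical generated data, numpy polyfit for the fits):
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = 1 + 2 * x - 3 * x**2 + rng.normal(0, 0.1, size=30)   # hypothetical data

for degree in range(1, 6):
    coeffs = np.polyfit(x, y, degree)
    sse = np.sum((y - np.polyval(coeffs, x)) ** 2)
    n, p = len(y), degree + 1                  # parameters include the intercept
    bic = n * np.log(sse) - n * np.log(n) + p * np.log(n)
    print(degree, round(bic, 2))               # pick the degree with the lowest BIC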
44
Why is polynomial regression still considered a linear model?
Despite fitting curves, polynomial regression remains linear in its parameters (e.g., β_0, β_1, β_2). The "linear" refers to the model’s linearity in coefficients, not predictors.
45
What is data leakage, and how can it be avoided?
- Leakage: When information from the test set inadvertently influences the training process (e.g., scaling using test data). - Prevention: Split data into training/test sets before preprocessing, or use pipelines.
46
What is the purpose of a train-test split in machine learning?
To evaluate model performance on unseen data: Train set (e.g., 70%): fit the model. Test set (e.g., 30%): assess generalization. Python:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
47
What is k-fold cross-validation, and why is it better than a single train-test split?
Process: Split data into k folds (e.g., k=5); each fold serves as the test set once. Advantage: Reduces variance in performance estimates by averaging results across multiple splits. Python:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
48
What is stratified sampling, and when should it be used?
Definition: Ensures each split (train/test) maintains the same class proportions as the original dataset. Use case: Critical for imbalanced datasets (e.g., 90% Class A, 10% Class B). Python: train_test_split(X, y, test_size=0.2, stratify=y)
49
How does stratified k-fold cross-validation differ from standard k-fold?
- Standard k-fold: Randomly splits data, risking uneven class distribution in folds. - Stratified k-fold: Preserves class ratios in each fold. Example: For a binary classification with 60% positives, each fold will have ~60% positives.
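A minimal scikit-learn sketch of stratified k-fold on a hypothetical imbalanced dataset:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)
for train_idx, test_idx in skf.split(X, y):    # each fold keeps roughly the 90/10 class ratio
    model.fit(X[train_idx], y[train_idx])
    print(model.score(X[test_idx], y[test_idx]))   # per-fold accuracy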
50
What are interaction terms in regression, and how are they modeled?
Interaction terms capture how the effect of one predictor depends on another: y = β_0 + β_1x_1 + β_2x_2 + β_3(x_1 × x_2) Use case: Testing if the impact of x_1 on y changes with x_2 (e.g., drug efficacy varying by age).
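A minimal sketch of adding an interaction term with scikit-learn (the feature matrix and target are hypothetical):
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = np.array([[1, 2], [2, 1], [3, 4], [4, 3]], dtype=float)   # hypothetical x_1, x_2
y = np.array([5.0, 5.0, 19.0, 19.0])                          # hypothetical target

inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = inter.fit_transform(X)            # columns: x_1, x_2, x_1*x_2
model = LinearRegression().fit(X_inter, y)
print(model.coef_)                          # β_1, β_2, β_3 (interaction)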
51
What is the Bias-Variance Tradeoff, and how does it relate to model complexity?
- Bias: Error from overly simplistic assumptions (underfitting). - Variance: Error from excessive sensitivity to training noise (overfitting). Tradeoff: Increasing model complexity reduces bias but increases variance. Polynomial regression exemplifies this tradeoff.
52
How do you implement polynomial regression in Python?
Create polynomial features:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
Fit linear regression:
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X_poly, y)
53
What are the risks of using a high-degree polynomial in regression?
- Overfitting: The model fits training noise, leading to poor generalization. - Interpretability: Higher-order terms are harder to explain. Solution: Use BIC or cross-validation to select optimal degree.
54
Why is random sampling insufficient for imbalanced datasets?
Random splits may create training/test sets with skewed class distributions, biasing performance metrics (e.g., accuracy). Stratified sampling ensures representative splits.
55
What is the key output of cross_val_score in scikit-learn?
An array of scores (e.g., MSE, R^2) for each fold. The average score estimates model performance. Example:
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
mean_r2 = scores.mean()
56
How does holdout validation differ from k-fold cross-validation?
- Holdout: Single train-test split (fast but prone to high variance). - k-fold: Multiple splits (slower but more reliable performance estimate). Use holdout for large datasets; k-fold for small datasets.
57
How do training error and cross-validation error behave with increasing model complexity?
- Training error: Decreases as complexity increases (model fits training data better). - Cross-validation error: Initially decreases, then increases due to overfitting. Optimal complexity: Choose the point where CV error is minimized (before it starts rising).
58
What are the three sources of model error?
- Bias: Consistent deviation from true values (underfitting). - Variance: Sensitivity to small fluctuations in training data (overfitting). - Irreducible error: Noise inherent in the data (unavoidable).
59
How does polynomial degree relate to bias and variance?
- Low degree (e.g., 1): High bias (rigid model), low variance. - High degree (e.g., 14): Low bias, high variance (fits noise). - Optimal degree (e.g., 4): Balances bias and variance.
60
What is regularization, and why is it used?
- Definition: Technique to prevent overfitting by adding a penalty term to the loss function. - Purpose: Reduces variance by shrinking coefficients (introduces slight bias). Example methods: Ridge (L2) and LASSO (L1) regression.
61
Compare Ridge (L2) and LASSO (L1) regularization.
- Ridge: Penalizes squared coefficients (λ∑β_j^2). Shrinks coefficients but rarely zero. - LASSO: Penalizes absolute coefficients (λ∑∣β_j∣). Can drive coefficients to zero (feature selection). Use case: Ridge for correlated features; LASSO for feature selection.
62
How does the regularization parameter (λ) affect the model?
- Large λ: Strong penalty → simpler model (high bias, low variance). - Small λ: Weak penalty → complex model (low bias, high variance). Optimization: Choose λ via cross-validation.
63
What is the cost function for Ridge regression?
RSS+λ ∑_{j=1}^pβ_j^2 RSS: Residual sum of squares. λ∑β_j^2: L2 penalty term.
64
What is the cost function for LASSO regression?
RSS+λ ∑_{j=1}^p|β_j| RSS: Residual sum of squares. λ∑|β_j|: L1 penalty term (promotes sparsity).
65
How does LASSO perform feature selection?
By driving some coefficients to exactly zero, effectively removing those features from the model. This is due to the L1 penalty’s geometric properties (sharp corners at zero).
66
What is the geometric interpretation of L1 vs. L2 regularization?
- L1 (LASSO): Constraint region is a diamond (sparse solutions at corners). - L2 (Ridge): Constraint region is a circle (smooth shrinkage). Visual: L1 tends to intersect axes (zero coefficients); L2 does not.
67
How do you implement Ridge and LASSO regression in Python?
from sklearn.linear_model import Ridge, Lasso
ridge = Ridge(alpha=1.0).fit(X, y)   # alpha = λ
lasso = Lasso(alpha=1.0).fit(X, y)
Note: Always standardize features before regularization.
68
Why is feature scaling important for regularization?
Regularization penalizes coefficients equally. Unscaled features (e.g., age vs. income) would unfairly bias the penalty toward larger-scale features.
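A minimal sketch of scaling inside a pipeline before Ridge (generated data; the penalty then acts on comparably scaled coefficients and the scaler never sees test data):
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=0)
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print(model.named_steps["ridge"].coef_)   # coefficients on the standardized scale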
69
What is the irreducible error in modeling?
Error caused by noise or randomness in the data that cannot be reduced by any model. It sets a lower bound on the achievable prediction error.
70
How does cross-validation help in selecting λ?
By evaluating model performance across multiple train-test splits for different λ values, then choosing the λ that minimizes average validation error.
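A minimal sketch, assuming RidgeCV and hypothetical generated data; the same idea works for LASSO via LassoCV:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=100, n_features=10, noise=5, random_state=0)
alphas = np.logspace(-3, 3, 13)                  # candidate λ values
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)   # 5-fold CV over the grid
print(model.alpha_)                              # λ with the lowest average validation error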
71
What is the relationship between model complexity and the bias-variance tradeoff?
- Low complexity: High bias (underfitting), low variance. - High complexity: Low bias, high variance (overfitting). Goal: Find the "Goldilocks" complexity where total error (bias² + variance + irreducible error) is minimized
72
What is the Bayesian interpretation of Ridge regression?
Ridge regression is equivalent to placing a Gaussian prior on the coefficients (mean zero, variance 1/λ), encouraging small but non-zero values.
73
When would you prefer Ridge over LASSO?
When all features are potentially relevant and you want to retain them (e.g., correlated features in genomics). Ridge shrinks coefficients but rarely zeroes them.
74
When would you prefer LASSO over Ridge?
When you suspect many features are irrelevant and want automatic feature selection (e.g., high-dimensional data with sparse signals).
75
What is elastic net regularization, and when is it used?
Definition: Combines L1 and L2 penalties (λ_1∑|β_j| + λ_2∑β_j^2) Use case: When features are correlated AND you want feature selection (balances Ridge and LASSO).
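A minimal sketch, assuming scikit-learn's ElasticNet and hypothetical generated data:
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=100, n_features=20, noise=5, random_state=0)
model = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)   # l1_ratio balances the L1 vs L2 penalties
print((model.coef_ == 0).sum())                         # coefficients shrunk exactly to zero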
76
What are the three types of missing values in datasets?
- MCAR (Missing Completely At Random): Missingness is unrelated to any data (e.g., sensor random failure). - MAR (Missing At Random): Missingness relates to observed data (e.g., high wind causing sensor malfunctions). - MNAR (Missing Not At Random): Missingness relates to unobserved data (e.g., pollution tampering with sensors).
77
What are the four approaches to handling missing values?
- Keep as-is: For tools that handle missing values (e.g., KNN). - Remove rows: Risky for MNAR/MAR (may introduce bias). - Remove columns: When >25% values are missing in non-critical features. - Impute: Use mean/median (MCAR), subgroup means (MAR), or regression (MNAR).
78
How do you detect outliers using the IQR method?
Q1 = df['col'].quantile(0.25)
Q3 = df['col'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['col'] < (Q1 - 1.5*IQR)) | (df['col'] > (Q3 + 1.5*IQR))]
79
What are the strategies for handling outliers?
- Do nothing: For robust models (e.g., Random Forests). - Cap values: Replace with upper/lower bounds. - Log transform: For skewed data. - Remove rows: Last resort (risks losing information).
80
What is the difference between standardization and normalization?
- Standardization: Rescales data to mean=0, SD=1: X′ = (X−μ)/σ. Use case: Algorithms assuming Gaussian distributions (e.g., SVM, PCA). - Normalization: Rescales to [0, 1]: X′ = (X−X_min)/(X_max−X_min). Use case: Neural networks, distance-based algorithms (e.g., KNN).
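A minimal sketch of both rescalings on a hypothetical column:
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0], [5.0], [10.0], [50.0]])        # hypothetical feature column
standardized = StandardScaler().fit_transform(X)    # mean 0, SD 1
normalized = MinMaxScaler().fit_transform(X)        # rescaled to [0, 1]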
81
When should you use log transformation?
- Highly skewed data (e.g., income, city populations). - Data spanning orders of magnitude. - When analyzing ratios (log transforms multiplicative relationships to additive). Formula: X'=log(X).
82
How do you convert categorical data to numerical?
- One-Hot Encoding: Creates binary columns for each category (e.g., "Education Level" → "Bachelor", "Masters"). - Ordinal Encoding: Assigns ranks (e.g., "High School"=1, "PhD"=4). - Target Encoding: Replaces categories with mean target value (for supervised learning).
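A minimal pandas sketch of one-hot and ordinal encoding on a hypothetical education column:
import pandas as pd

df = pd.DataFrame({"education": ["High School", "Bachelor", "Masters", "PhD"]})
one_hot = pd.get_dummies(df["education"])              # one binary column per category
ranks = {"High School": 1, "Bachelor": 2, "Masters": 3, "PhD": 4}
df["education_ordinal"] = df["education"].map(ranks)   # ordinal encoding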
83
What is discretization, and when is it useful?
Definition: Converting continuous data into bins (e.g., age → "Child", "Adult"). Use cases: - Simplifying models (e.g., decision trees). - Handling non-linear relationships. - Improving interpretability.
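A minimal pandas sketch, binning a hypothetical age column:
import pandas as pd

ages = pd.Series([4, 17, 25, 42, 70])
groups = pd.cut(ages, bins=[0, 12, 18, 65, 120],
                labels=["Child", "Teen", "Adult", "Senior"])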
84
What is data smoothing, and how does a moving average work?
Purpose: Reduce noise to reveal trends. Moving Average: Replaces each point with the average of its neighbors. Use case: Time series analysis (e.g., stock prices).
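A minimal pandas sketch of a centred 3-point moving average on a hypothetical series:
import pandas as pd

prices = pd.Series([10, 12, 11, 15, 14, 18, 17])
smoothed = prices.rolling(window=3, center=True).mean()   # average of each point and its neighbours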
85
When should you avoid pie charts?
- Comparing >5 categories (hard to distinguish slices). - Differences between values are small (angles are hard to compare). - Data has similar proportions (use bar charts instead).
86
What are the best practices for bar charts?
- Limit bars: ≤10 categories. - Horizontal layout: For long category names. - Order bars: By value (ascending/descending). - Avoid 3D effects: Distorts proportions.
87
When should you use a scatter plot?
- Show relationships between two continuous variables (e.g., correlation). - Identify clusters or outliers. - Compare multiple groups (use colors/markers).
88
What makes a good line plot?
- Label axes clearly. - Use solid lines (no markers for many points). - Directly label lines (avoid legends if possible). - Highlight key events (e.g., policy changes).
89
What is a heatmap, and when is it useful?
Definition: Matrix where colors represent values. Use cases: - Correlation matrices. - Time-series patterns (e.g., temperature over months). - Geospatial data (e.g., population density).
90
What are systematic vs. random errors in data?
- Systematic: Consistent bias (e.g., faulty sensor calibration). Hard to detect. - Random: Unpredictable fluctuations (e.g., measurement noise). Averages out over time.
91
How do you impute missing values for MAR data?
Use subgroup means/medians (e.g., impute missing GPA for "Freshmen" using the median GPA of other Freshmen). Avoid global imputation to reduce bias.
92
Why is stratified sampling important in train-test splits?
Ensures the test set has the same class proportions as the training set, preventing skewed performance metrics (critical for imbalanced datasets).
93
What are the key steps in data preprocessing?
- Cleaning: Handle missing values, outliers, errors. - Transformation: Scale, normalize, encode. - Reduction: Feature selection, dimensionality reduction. - Visualization: Explore patterns, validate preprocessing.
94
What is the purpose of a train-test split in machine learning?
The train-test split divides the dataset into two parts: the training set (typically 80% of the data) and the test set (20%). The training set is used to train the model, while the test set evaluates its performance on unseen data. This helps assess the model's generalization ability and avoid overfitting.
95
How is classification different from regression?
Classification predicts discrete labels (e.g., "red" or "blue," "spam" or "not spam"), while regression predicts continuous values (e.g., temperature, price). Classification is used for categorical outcomes, whereas regression is used for numerical outcomes.
96
What is logistic regression, and when is it used?
Logistic regression is a classification algorithm that predicts the probability of a binary outcome (e.g., yes/no) using the logistic function: y = 1 / (1 + e^(−(ax+b))). It is used when the dependent variable is categorical and the relationship between features and outcome is nonlinear (sigmoidal).
97
What does the logistic regression model output represent?
The output is the probability that the input belongs to class "1" (e.g., "spam"). The decision boundary is typically set at 0.5: if the probability ≥ 0.5, the prediction is class 1; otherwise, it is class 0.
98
How do you implement logistic regression in Python using scikit-learn?
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
predictions = log_reg.predict(X_test)
probs = log_reg.predict_proba(X_test)
99
What is the cost function for logistic regression?
The cost function is the log loss (binary cross-entropy): J(β) = −(1/n) ∑ [y_i·ln(y_p,i) + (1−y_i)·ln(1−y_p,i)], which is minimized during training (see Lecture 7).
100
How is accuracy calculated for a classifier?
Accuracy is the ratio of correct predictions to total predictions: Accuracy = (TP+TN)/ (TP+TN+FP+FN) where TP = True Positives, TN = True Negatives, FP = False Positives, FN = False Negatives.
101
What is a perceptron, and how does it work?
A perceptron is a binary linear classifier. It takes inputs x_1, x_2, ..., x_K, applies weights w_1, w_2, ..., w_K, sums them with a bias b, and passes the result through a step function: y = f(b + ∑_{i=1}^{K} w_i·x_i), where f outputs 1 if the sum ≥ 0, else 0.
102
How is a perceptron trained?
The weights are updated iteratively: - Initialize weights to 0. - For each misclassified point x, update weights: w_(t+1) = w_t + α(y^data - y^pred)x
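A minimal numpy sketch of this update rule on hypothetical 2-D data (labels in {0, 1}, α = learning rate):
import numpy as np

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, 0, 0])

w, b, alpha = np.zeros(2), 0.0, 0.1
for _ in range(20):                           # fixed number of passes over the data
    for xi, yi in zip(X, y):
        y_pred = 1 if b + w @ xi >= 0 else 0  # step function
        w += alpha * (yi - y_pred) * xi       # no change when the point is classified correctly
        b += alpha * (yi - y_pred)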
103
What is the limitation of a perceptron?
Perceptrons can only classify linearly separable data. They fail for problems like XOR, where no straight line can separate classes.
104
What is a Multi-Layer Perceptron (MLP), and how does it solve the perceptron's limitation?
An MLP is a neural network with hidden layers between input and output layers. Each neuron uses nonlinear activation functions (e.g., ReLU, sigmoid), enabling the network to learn complex, non-linear decision boundaries.
105
How does an MLP differ from a single perceptron?
An MLP stacks multiple perceptrons (neurons) in layers, with nonlinear activation functions. This allows it to model intricate patterns, unlike a single perceptron, which is limited to linear separation.
106
What is the role of hidden layers in an MLP?
Hidden layers transform input data hierarchically, extracting higher-level features. Each layer’s neurons apply weighted sums and activation functions to progressively refine the model’s predictions.
107
Why is logistic regression considered a supervised learning method?
It requires labeled training data (ground truth) to learn the relationship between features and outcomes. The model is trained to minimize prediction error on this labeled data.
108
Give an example of a real-world classification problem.
Email spam detection: The classifier predicts whether an email is "spam" or "not spam" based on features like word frequency, sender, and subject line.
109
What is gradient descent, and why is it used in machine learning?
Gradient descent is an optimization algorithm used to minimize the loss function of a model by iteratively adjusting parameters in the direction of the steepest negative gradient. It is used because it efficiently finds optimal parameters for models like linear regression, logistic regression, and neural networks, especially when closed-form solutions are infeasible.
110
What is the difference between deterministic and stochastic parameter-fitting methods?
- Deterministic: Follows a fixed rule (e.g., gradient direction) to update parameters. No randomness; steps are calculated based on gradients. - Stochastic: Introduces randomness (e.g., random steps or sampling). Used in genetic algorithms or stochastic gradient descent (SGD) to escape local minima.
111
What are the L1 and L2 norms, and how are they used in model fitting?
- L1 norm (Manhattan distance): Sum of absolute errors. Robust to outliers but less smooth. - L2 norm (Euclidean distance): Square root of the sum of squared errors. Sensitive to outliers but differentiable. Both measure prediction error and serve as loss functions (e.g., L2 for linear regression).
112
How does gradient descent work step-by-step?
- Initialize parameters randomly. - Compute the loss (e.g., L2 norm). - Calculate gradients (partial derivatives of loss w.r.t. parameters). - Update parameters: θ_new = θ_old −α∇_θLoss, where α is the learning rate. - Repeat until convergence or max iterations.
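A minimal numpy sketch of these steps for simple linear regression with an MSE loss (hypothetical data; α is the learning rate):
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([6.0, 5.0, 7.0, 10.0])

b0, b1, alpha = 0.0, 0.0, 0.01
for _ in range(5000):
    error = (b0 + b1 * x) - y
    grad_b0 = 2 * error.mean()          # ∂MSE/∂b0
    grad_b1 = 2 * (error * x).mean()    # ∂MSE/∂b1
    b0 -= alpha * grad_b0               # step against the gradient
    b1 -= alpha * grad_b1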
113
What is a confusion matrix, and what metrics derive from it?
A table comparing predicted vs. actual classes: Predicted + / Actual + = TP; Predicted + / Actual − = FP; Predicted − / Actual + = FN; Predicted − / Actual − = TN. Metrics: - Accuracy: (TP+TN)/(TP+TN+FP+FN) - Precision: TP/(TP+FP) - Recall/Sensitivity: TP/(TP+FN) - Specificity: TN/(TN+FP)
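A minimal scikit-learn sketch with hypothetical true and predicted labels:
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)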
114
What is an ROC curve, and how is it interpreted?
An ROC curve plots the True Positive Rate (TPR, recall) vs. False Positive Rate (FPR, 1-specificity) across different classification thresholds. Interpretation: - Top-left corner (TPR=1, FPR=0): Perfect classifier. - Diagonal line: Random guessing. - Higher AUC (Area Under Curve) = Better performance.
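A minimal scikit-learn sketch with hypothetical labels and predicted probabilities:
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]          # predicted probabilities for class 1

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, scores)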
115
What is the F1 score, and why is it useful?
The F1 score is the harmonic mean of precision and recall: F1 = 2 × (Precision × Recall)/(Precision + Recall). It balances precision and recall, useful for imbalanced datasets where one class dominates.
116
How is AUC (Area Under the ROC Curve) used to evaluate classifiers?
AUC measures the classifier’s ability to distinguish classes: - AUC=1: Perfect separation. - AUC=0.5: No better than random. - AUC>0.7: Good model. It is threshold-independent and works for binary/multi-class problems.
117
What is the trade-off between sensitivity (recall) and specificity?
- High sensitivity: Few false negatives (e.g., correctly identifying all sick patients). - High specificity: Few false positives (e.g., correctly ruling out healthy patients). Adjusting the classification threshold shifts this trade-off (e.g., lower threshold increases sensitivity but reduces specificity).
118
How do you address class imbalance in a confusion matrix?
- Resampling: Oversample minority class or undersample majority class. - Cost-sensitive learning: Penalize misclassifications of the minority class more. - Use metrics like F1 or AUC: Less sensitive to imbalance than accuracy.
119
What is the difference between grid search and gradient descent for parameter optimization?
- Grid search: Exhaustively tests predefined parameter combinations. Computationally expensive but simple. - Gradient descent: Iteratively adjusts parameters based on gradients. Efficient for high-dimensional spaces but may get stuck in local minima.
120
Why might you use stochastic gradient descent (SGD) over batch gradient descent?
SGD updates parameters using a random subset (mini-batch) of data per iteration, offering: - Faster convergence for large datasets. - Better escape from local minima due to noise in updates. - Lower computational cost per iteration.
121
How does the learning rate (α) affect gradient descent?
- Too small: Slow convergence; may not reach optimum. - Too large: Overshoots minima, causing divergence. - Adaptive methods (e.g., Adam): Dynamically adjust α for faster, stable convergence.
122
What is the role of the threshold in binary classification?
The threshold converts predicted probabilities (e.g., from logistic regression) into class labels: - Threshold=0.5: Default for balanced classes. - Adjusting threshold: Increases sensitivity (lower threshold) or specificity (higher threshold).
123
How would you explain the ROC curve to a non-technical audience?
"The ROC curve shows how well a test (e.g., medical diagnosis) balances catching all true cases (sensitivity) vs. avoiding false alarms (specificity). A curve closer to the top-left means the test is highly accurate."
124
What is Natural Language Processing (NLP), and what are its common applications?
NLP is a field of AI focused on enabling computers to understand, interpret, and generate human language. Common applications include: - Text similarity: Comparing documents for plagiarism or search. - Sentiment analysis: Determining emotional tone (e.g., positive/negative reviews). - Topic extraction: Identifying key themes in large texts (e.g., news articles). - Spam detection: Classifying emails/texts as spam or not.
125
How does Bayesian spam detection work?
It calculates the probability that a message is spam based on word frequencies: P(spam∣message) = P(message∣spam)P(spam)/P(message) Example: For "send us your password," compute P("send"∣spam)×P("us"∣spam)×… Assumption: Words are independent (Naive Bayes).
126
What are tokenisation, stemming, and lemmatisation in NLP?
- Tokenisation: Splitting text into words/tokens (e.g., "Fear of the dark" → ["Fear", "of", "the", "dark"]). - Stemming: Reducing words to root form by chopping suffixes (e.g., "running" → "run"). - Lemmatisation: Using linguistics to convert words to base/dictionary form (e.g., "was" → "be").
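A minimal NLTK sketch (assumes the punkt and wordnet resources have been downloaded):
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

tokens = word_tokenize("Fear of the dark")              # tokenisation
stem = PorterStemmer().stem("running")                  # stemming, gives "run"
lemma = WordNetLemmatizer().lemmatize("was", pos="v")   # lemmatisation, gives "be"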
127
What is the difference between stemming and lemmatisation?
- Stemming: Crude heuristic (e.g., "adjustable" → "adjust"). Fast but may yield non-words. - Lemmatisation: Accurate base forms using dictionaries/grammar (e.g., "better" → "good"). Slower but precise.
128
Why is tokenisation challenging for languages like Vietnamese or Chinese?
These languages lack spaces between words, requiring advanced methods: - Vietnamese: "thời gian" (time) must be tokenized as one unit, not ["thời", "gian"]. - Chinese/Japanese: Use segmentation algorithms (e.g., Jieba for Chinese).
129
What is a bag-of-words (BoW) representation?
A text vectorization method where: - Each document is represented as word counts (e.g., {"fear": 3, "dark": 2}). - Limitation: Ignores word order/semantics but useful for simple models like Naive Bayes.
130
What is TF-IDF, and how is it calculated?
Term Frequency-Inverse Document Frequency measures word importance in a document relative to a corpus: - TF: (Word count in document)/(Total words in document) ​- IDF: log((Total documents)/(Documents containing the word)) - TF-IDF = TF × IDF. High scores indicate rare but significant words (e.g., "blood" in metal lyrics).
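A minimal scikit-learn sketch on a hypothetical mini-corpus (get_feature_names_out assumes a recent scikit-learn version):
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["fear of the dark", "dancing in the dark", "fear and fire"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)          # sparse document-term matrix of TF-IDF weights
print(vectorizer.get_feature_names_out())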
131
How does TF-IDF help identify "metal-like" words in lyrics?
By comparing word frequencies in metal (f_metal) vs. other genres (f_others): - High f_metal/f_others: Words like "burn," "fire" are metal-specific. - Low ratio: Common words (e.g., "the," "of") are ignored.
132
What is a sparse matrix in NLP, and why does it occur?
A matrix (e.g., song-word counts) where most entries are zero because: - Each document uses only a small subset of the vocabulary. - Example: A song may contain 10 unique words out of 10,000 in the corpus.
133
How can vectorized text (e.g., TF-IDF) be used in machine learning?
Converted vectors enable: - Clustering: Group similar documents (e.g., metal vs. pop lyrics). - Classification: Train models (e.g., spam detection). - Dimensionality reduction: PCA to visualize high-dimensional data.
134
What are the limitations of TF-IDF?
- Ignores semantics: "Happy" and "joyful" are treated as unrelated. - No word order: "Not good" vs. "good" may have similar vectors. - Domain dependence: Stopwords (e.g., "the") may be irrelevant in some contexts.
135
How would you preprocess text for a sentiment analysis task?
- Tokenisation: Split into words. - Lowercasing: Standardize case (e.g., "Fear" → "fear"). - Stopword removal: Drop common words (e.g., "the," "and"). - Stemming/Lemmatisation: Reduce inflectional forms. - Vectorization: Convert to BoW or TF-IDF.
136
What is the difference between term frequency (TF) and document frequency (DF)?
- TF: How often a word appears in a single document. - DF: How many documents in the corpus contain the word. - IDF downweights high-DF words (e.g., "the") to highlight rare terms.
137
Why might TF-IDF be better than raw word counts for text classification?
- Reduces bias from frequent but meaningless words (e.g., "the"). - Emphasizes discriminative words (e.g., "password" in spam). - Improves model performance by focusing on informative terms.
138
How would you handle multi-language text processing (e.g., English and Indonesian)?
- Language-specific tokenisers: Use spaCy for English, PySastrawi for Indonesian. - Stopword lists: Customize per language. - Lemmatisation: Requires language-specific dictionaries (e.g., NLTK for English).
139
What is topic modelling, and what problem does it solve in NLP?
Topic modelling is a statistical method to discover abstract "topics" in a collection of documents. It addresses: - High-dimensional sparse data: Reduces document-term matrices (e.g., TF-IDF) to lower-dimensional topic representations. - Summarization: Identifies dominant themes (e.g., "sports," "politics") without prior labeling. - Example: LDA (Latent Dirichlet Allocation) models documents as mixtures of topics, where each topic is a distribution over words.
140
How does Latent Dirichlet Allocation (LDA) work?
- Each document is a mix of topics (e.g., 60% sports, 40% politics). - Each topic is a distribution over words (e.g., "sports" → {"ball": 0.3, "team": 0.2}). - Process: Assign random topics to words. Iteratively update topic-word and document-topic distributions using Gibbs sampling or variational inference.
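A minimal scikit-learn sketch of LDA on a hypothetical mini-corpus (real topic models need far more documents):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = ["the team won the ball game", "players scored in the match",
          "the election and the vote", "parliament passed the new law"]
counts = CountVectorizer(stop_words="english").fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
doc_topics = lda.transform(counts)   # per-document topic mixtures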
141
What are the pros and cons of topic modelling?
Pros: - Language-agnostic (works with any document-term matrix). - Unsupervised (no labeled data needed). - Provides interpretable topics (e.g., "genetics," "space"). Cons: - Poor for short texts (e.g., tweets). - Prone to overfitting; requires tuning topic count. - Cannot generalize to unseen documents without retraining.
142
What is a document-term matrix, and why is it sparse?
A matrix where: - Rows: Documents. - Columns: Words (from the corpus vocabulary). - Values: Word counts or TF-IDF scores. - Sparsity: Most documents use only a small subset of the vocabulary (e.g., 50 words out of 10,000), resulting in many zeros
143
What is word embedding, and how does it differ from TF-IDF?
- Word Embedding: Dense vector representation capturing semantic/contextual relationships (e.g., Word2Vec, GloVe). Example: "king" - "man" + "woman" ≈ "queen." - TF-IDF: Sparse vector based on word frequency, ignoring semantics. Key Difference: Embeddings preserve meaning; TF-IDF focuses on word importance.
144
How does Word2Vec generate word embeddings?
- Skip-gram: Predicts context words given a target word (e.g., "cat" → "purrs," "meows"). - CBOW (Continuous Bag-of-Words): Predicts target word from context (e.g., ["purrs," "meows"] → "cat"). Training: A shallow neural network optimizes weights to maximize prediction accuracy, producing embeddings in the hidden layer.
145
What are the properties of good word embeddings?
- Semantic Similarity: Related words (e.g., "ocean," "sea") have similar vectors. - Analogies: Linear relationships reflect semantic rules (e.g., "king" - "man" + "woman" ≈ "queen"). - Multilingual Alignment: Embeddings can map similar words across languages (e.g., "ship" ↔ "navio").
146
What is polysemy, and how does it challenge word embeddings?
- Polysemy: Words with multiple meanings (e.g., "queen" as monarch vs. band). - Challenge: Single embedding may conflate meanings. - Solution: Sense embeddings (e.g., one vector per meaning) using labeled data.
147
How are sentence/document embeddings created?
Averaging Word Embeddings: Simple but loses context. Advanced Methods: - Doc2Vec: Extends Word2Vec to paragraphs. - BERT: Contextual embeddings using transformer networks. Use Case: Measures similarity between texts (e.g., "I traveled to Estonia" ≈ "She flew to Tallinn").
148
What are applications of sentence embeddings?
- Semantic Search: Find documents with similar meaning. - Paraphrase Detection: Identify equivalent sentences. - Multimodal Learning: Align text and images in shared space (e.g., "cat on table" ≈ image of a cat). - Automatic translation - Text summarization
149
Why is dimensionality reduction important in NLP?
- Efficiency: Sparse high-dimensional data (e.g., 10K-word vocab) is computationally expensive. - Noise Reduction: Removes irrelevant features (e.g., stopwords). - Visualization: Projects data to 2D/3D for exploration (e.g., t-SNE plots of topics).
150
Compare LDA and Word2Vec for text analysis.
- LDA: Pros: Interpretable topics, works with unlabeled data. Cons: Struggles with short texts, no word-level semantics. - Word2Vec: Pros: Captures word relationships, works for short texts. Cons: No document-level topics; requires large corpus.
151
How can topic models track trends over time?
By applying LDA to time-stamped documents (e.g., news articles) and plotting topic prevalence: Example: Rise of "quantum computing" topics in 2010s vs. "alchemy" in 1900s.
152
What is the "bag-of-words" limitation in embeddings?
Ignores: - Word Order: "dog bites man" vs. "man bites dog." - Context: "bank" (financial vs. river). - Solution: Contextual models like BERT.
153
How do you evaluate topic models?
- Coherence Score: Measures semantic consistency of top words in a topic (e.g., "ball," "team," "score" for sports). - Human Judgment: Manual review for interpretability. - Perplexity: Lower values indicate better generalization (rarely used due to poor correlation with quality).
154
What type of algorithm is K-Nearest Neighbours (KNN)?
KNN is a supervised learning algorithm commonly used for classification problems. It operates on the assumption that similar data points exist close to each other, using a distance metric (e.g., Euclidean distance) to determine similarity.
155
How does KNN determine the class of a new data point?
KNN finds the k nearest neighbours to the new data point and assigns the majority class among those neighbours. This majority vote determines the new point's class label.
156
What is the role of the parameter 'k' in KNN?
The value of 'k' is a hyperparameter that determines how many neighbours are considered when classifying a new data point. It directly influences the model's performance.
157
What steps are involved in the KNN classification process?
1. Compute the distance between the new point and all other points. 2. Identify the k nearest neighbours. 3. Determine the majority class among these neighbours. 4. Assign the new point to this majority class.
158
Why is KNN susceptible to outliers?
KNN considers the closest points regardless of how representative they are. Outliers can be mistakenly considered valid neighbours and thus mislead the classification due to their extreme or unrepresentative values.
159
How does class imbalance affect KNN performance?
If one class significantly outweighs others in the dataset, KNN may become biased toward predicting that majority class, especially for larger k values.
160
What is a common solution to mitigate class imbalance in KNN?
A common solution is weighted KNN, where each neighbour's vote is weighted inversely to its distance from the query point, giving more influence to closer, potentially more relevant neighbours.
161
What are two common strategies for choosing the optimal value of k?
1. Empirical testing: Start with k=1 and evaluate accuracy on test data, incrementally increasing k. 2. Heuristic approach: Use the square root of the number of training samples, ensuring k is odd to avoid ties.
162
Why should k be an odd number in binary classification?
To avoid ties during majority voting, where two classes could receive an equal number of votes.
163
What is Weighted KNN and how does it improve upon basic KNN?
Weighted KNN assigns higher weights to closer neighbours during classification, improving robustness especially in cases with class imbalance or when data density varies.
164
How does KNN compare to other algorithms in terms of learning?
KNN is an instance-based learning algorithm, meaning it does not build a model but rather makes decisions based on stored data. The hypothesis grows with data size, which can lead to performance issues with large datasets.
165
Why is KNN not ideal for high-dimensional datasets?
High dimensionality leads to increased computational cost and can reduce the effectiveness of distance metrics due to the curse of dimensionality. Feature selection or dimensionality reduction is often required.
166
Why is feature scaling important in KNN?
Since KNN relies on distance metrics, features must be on the same scale to prevent some from dominating. Normalization (e.g., scaling to [0,1]) is typically used.
167
In the weather prediction example, how are categorical features handled in KNN?
Categorical features are converted to numerical values using LabelEncoder from sklearn.preprocessing, enabling distance computation.
168
What is the process of building a KNN classifier using sklearn?
1. Encode categorical features. 2. Combine features into a dataset. 3. Create a KNeighborsClassifier object with a specified k. 4. Fit the classifier on the training set. 5. Use predict() on the test set to make predictions.
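A minimal sketch of these steps with a hypothetical weather dataset (the values and encodings are illustrative, not the lecture's exact data):
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier

weather = ["sunny", "overcast", "rainy", "sunny", "overcast"]
temperature = ["hot", "mild", "cool", "mild", "hot"]
play = ["no", "yes", "yes", "no", "yes"]

X = list(zip(LabelEncoder().fit_transform(weather),
             LabelEncoder().fit_transform(temperature)))   # encoded feature pairs
y = LabelEncoder().fit_transform(play)

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[0, 1]]))   # predict for one encoded (weather, temperature) pair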
169
In the weather example, what was the KNN prediction for (weather: overcast, temperature: mild)?
The classifier predicted that it would rain.
170
In the weather example, what was the prediction for (weather: sunny, temperature: hot)?
The classifier predicted that it would not rain.
171
What is the wine dataset used for in the KNN examples?
It is used to demonstrate multi-class classification using KNN. The dataset includes 13 features and 3 wine cultivars (class_0, class_1, class_2).
172
How is the wine dataset prepared for training and testing in the KNN example?
The dataset is loaded from sklearn, then split into training and test sets using train_test_split(), with 20% reserved for testing.
173
What was the accuracy of the wine classifier with k=5?
The accuracy was 72.22%.
174
What happened when k was increased to 7 in the wine dataset example?
The accuracy increased to 80.5%, showing that performance can improve with higher k, though this depends on the data.
175
What are key takeaways from using KNN?
1. KNN is intuitive and powerful for both binary and multiclass classification. 2. It is sensitive to outliers, class imbalance, and feature scaling. 3. Model complexity grows with data, making it unsuitable for very large or high-dimensional datasets without preprocessing. 4. Careful tuning of k and data normalization is critical for good performance.
176
What is a Support Vector Machine (SVM)?
SVM is a supervised machine learning model used for classification, regression, and clustering problems. It works by finding a hyperplane that best separates data points of different classes with the largest possible margin.
177
What is the objective of an SVM?
The main goal of an SVM is to find a decision boundary (hyperplane) that maximizes the margin, which is the shortest distance from the hyperplane to the closest data points of any class (support vectors).
178
What are support vectors in SVM?
Support vectors are the data points closest to the decision boundary. They determine the position and orientation of the hyperplane. Only these points are used in defining the margin.
179
Why do we maximize the margin in SVM?
A larger margin is believed to lead to better generalization on unseen data. It reduces the risk of overfitting.
180
What is a maximum margin classifier?
It is an SVM model that selects the decision boundary that maximizes the margin between classes. It assumes perfect separation without outliers.
181
Why is the maximum margin classifier sensitive to outliers?
Because it tries to perfectly separate all training data, including outliers, which can distort the decision boundary significantly and lead to poor generalization.
182
How does SVM deal with outliers?
SVM can use a soft margin, which allows some misclassification in the training data to prevent the model from being overly influenced by outliers.
183
What is the soft margin in SVM?
A soft margin permits misclassification of some training points, enabling better generalization by preventing overfitting to noise or outliers.
184
What are the implications of a soft margin on bias and variance?
A soft margin introduces higher bias but often lower variance, making the model more robust and better at generalizing to new data.
185
What is the purpose of nonlinear transformation in SVM?
Nonlinear transformation maps data from the original input space to a higher-dimensional feature space where a linear separator (hyperplane) can be applied.
186
Give an example of a nonlinear transformation that helps SVM separate data.
Applying the transformation f(x) = x² to 1D data adds a second dimension, making originally non-linearly separable data linearly separable in 2D.
187
What is the kernel trick in SVM?
The kernel trick is a method that allows SVM to compute the dot product in a high-dimensional space without explicitly transforming the data, significantly reducing computational complexity.
188
What is a kernel function?
A kernel function computes the dot product between two vectors in a higher-dimensional feature space without performing the actual transformation. It is defined as k(x, z) = ⟨f(x), f(z)⟩.
189
Why is the kernel trick useful?
It avoids the explicit transformation of data to high-dimensional space, which can be computationally expensive or intractable. Instead, similarity is computed directly using kernel functions.
190
Name a kernel function example and explain its use.
A common example is the second-degree polynomial kernel, which computes similarity in a 2D-transformed space using inputs from the original space, enabling linear separation of nonlinear data.
191
What is the IRIS dataset and how is it used in SVM?
The IRIS dataset is a classical dataset with 3 classes of iris flowers (Setosa, Versicolour, Virginica). It is used to demonstrate multi-class classification using SVM with different support vector classifiers (SVCs).
192
What does the regularization parameter C control in SVM?
The C parameter controls the trade-off between margin size and misclassification. A smaller C creates a wider margin allowing more misclassifications (higher bias), while a larger C reduces misclassifications but can overfit (higher variance).
193
What does the gamma parameter in SVM control?
Gamma defines how far the influence of a single training example reaches. Low gamma means far reach (generalization), high gamma means close reach (can lead to overfitting).
194
What is the purpose of using a mesh grid in SVM classification visualization?
A mesh grid is used to create a grid of points over the feature space to visualize decision boundaries produced by SVM models.
195
What dataset is used in the face recognition SVM example?
The Labeled Faces in the Wild (LFW) dataset is used, containing thousands of images of public figures with labeled identities.
196
How is dimensionality reduction used in the face recognition example?
Principal Component Analysis (PCA) is used to reduce the high-dimensional image data (nearly 3000 pixels per image) to 150 principal features before classification.
197
Why is a pipeline used in the SVM face recognition example?
A pipeline automates preprocessing (like PCA) followed by classification (SVM), ensuring reproducibility and simplification of the modeling process.
198
What is the role of grid search in training the SVM for face recognition?
GridSearchCV is used to find the best combination of hyperparameters (e.g., C and gamma) by evaluating performance through cross-validation.
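A minimal sketch of the same idea on the built-in IRIS data rather than LFW (the parameter grid values are illustrative):
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_)   # best C and gamma found by cross-validation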
199
What is a classification report and how is it used in the SVM example?
A classification report shows metrics such as precision, recall, and F1-score for each class, helping to assess the performance of the classifier on test data.
200
What are the key takeaways from the SVM lecture?
1. SVM separates data with a maximum-margin hyperplane. 2. It can be extended to non-linear cases via kernel tricks. 3. It is sensitive to outliers without a soft margin. 4. Useful in both binary and multiclass problems (e.g., IRIS, LFW). 5. Requires careful tuning of C and gamma for optimal results.
201
What is the fundamental idea behind Decision Trees in machine learning?
Decision Trees split the data into subsets based on feature values to form a tree-like structure. Each internal node represents a decision based on a feature, each branch represents an outcome of that decision, and each leaf node represents a class label (in classification) or a predicted value (in regression). They recursively divide the feature space to create increasingly homogeneous subsets with respect to the target variable.
202
Why are Decision Trees considered a type of greedy algorithm?
Decision Trees are greedy because they make the optimal decision at each node (i.e., selecting the best feature to split on) without considering future splits. This locally optimal choice aims to reduce impurity the most at each stage, rather than finding a globally optimal tree.
203
What is the hypothesis space in Decision Trees?
The hypothesis space for Decision Trees includes all possible trees that can be formed using the given features. Each tree represents a piecewise constant function that partitions the input space and assigns outputs based on the values in the leaf nodes.
204
What is the difference between regression and classification trees?
In classification trees, the output at the leaf nodes is a class label, and the model learns to classify inputs into discrete categories. In regression trees, the output is a continuous value, and the model predicts real-valued outputs.
205
How does a Decision Tree learn from training data?
A Decision Tree learns by recursively splitting the training data into subsets based on feature values that maximize some splitting criterion (e.g., information gain, Gini impurity, or variance reduction), and continues this process until a stopping criterion is met (like minimum leaf size, maximum depth, or pure leaves).
206
What is a decision stump?
A decision stump is a Decision Tree with only one level of decision—i.e., it consists of a single split on one feature, with two leaf nodes. It represents the simplest possible decision tree.
207
What are the key elements of Decision Tree structure?
- Root node: The top of the tree where the first split occurs.
- Internal nodes: Points where the data is split based on a feature.
- Branches: Outcomes of the split (conditions leading to sub-nodes).
- Leaf nodes: Terminal nodes that output the prediction.
208
What is the role of the splitting criterion in Decision Trees?
The splitting criterion determines how the tree chooses the best feature to split on at each node. It quantifies the "purity" of the resulting subsets. Common criteria include:
- Information Gain (entropy reduction)
- Gini Impurity
- Variance Reduction (for regression)
209
What is Information Gain and how is it used in Decision Trees?
Information Gain measures the reduction in entropy after a dataset is split on a feature. It is calculated as: IG(S, A) = Entropy(S) - ∑(|Sv|/|S|) * Entropy(Sv) where S is the current dataset and Sv are the subsets after splitting on attribute A. Features with higher information gain are preferred for splits.
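As a concrete illustration, here is a small self-contained sketch of these formulas; the function names and the tiny dataset are my own, not from the lecture.

```python
# Sketch: entropy and information gain for a categorical attribute.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """IG(S, A) = Entropy(S) - sum over values v of (|S_v|/|S|) * Entropy(S_v)."""
    n = len(labels)
    subsets = {}
    for label, value in zip(labels, attribute_values):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Hypothetical example: splitting play/no-play on a "windy" attribute.
play  = ["yes", "yes", "no", "no", "yes", "no"]
windy = ["false", "false", "true", "true", "false", "true"]
print(information_gain(play, windy))   # 1.0 bit: this split separates the classes perfectly
```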
210
What is Entropy in the context of Decision Trees?
Entropy measures the amount of uncertainty or impurity in a dataset. For a binary classification: Entropy(S) = -p+ log2(p+) - p- log2(p-), where p+ and p- are the proportions of positive and negative examples. A pure dataset (all one class) has entropy 0.
211
What is Gini Impurity and how does it compare to Entropy?
Gini Impurity is another metric for measuring the purity of a dataset. For a dataset with K classes: Gini(S) = 1 - ∑ pk^2 where pk is the proportion of class k. Like entropy, a lower Gini implies higher purity. It is often preferred due to computational simplicity.
212
How do Decision Trees handle continuous-valued features?
Continuous features are handled by finding a threshold that best splits the data. This is done by sorting the data values and evaluating potential split points between consecutive values, selecting the one that yields the highest information gain or lowest impurity.
213
How does a Decision Tree for regression differ in its splitting criterion?
Regression trees use variance reduction (or squared error reduction) instead of entropy or Gini impurity. The goal is to find splits that reduce the variance of the target variable within each resulting subset.
214
What is pruning in Decision Trees and why is it necessary?
Pruning reduces the size of a decision tree by removing sections that provide little predictive power. It helps prevent overfitting by simplifying the tree. Two types are:
- Pre-pruning: Halts tree growth early based on criteria like max depth.
- Post-pruning: Grows the full tree, then removes branches that don't improve validation performance.
215
What are some advantages of Decision Trees?
- Easy to interpret and visualize
- Capable of handling both numerical and categorical data
- Require little data preprocessing
- Able to model nonlinear relationships
216
What are the main disadvantages of Decision Trees?
- Prone to overfitting
- Unstable with small changes in data
- Can be biased toward features with more levels
- Often inferior performance compared to ensemble methods
217
What are common hyperparameters in Decision Trees?
- max_depth: Maximum depth of the tree
- min_samples_split: Minimum samples required to split a node
- min_samples_leaf: Minimum samples required at a leaf node
- max_features: Maximum features considered for a split
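For reference, a minimal sketch of where these hyperparameters appear in scikit-learn's DecisionTreeClassifier; the values shown are illustrative, not recommendations.

```python
# Sketch: the hyperparameters above as they appear in scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(
    criterion="gini",        # splitting criterion (or "entropy")
    max_depth=3,             # maximum depth of the tree
    min_samples_split=4,     # minimum samples required to split a node
    min_samples_leaf=2,      # minimum samples required at a leaf
    max_features=None,       # features considered when looking for a split
    random_state=0)
tree.fit(Xtr, ytr)
print(tree.score(Xte, yte))  # accuracy on the held-out split
```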
218
How does feature selection affect Decision Trees?
Feature selection directly impacts the structure of a decision tree, as the algorithm selects features for splitting. Irrelevant or redundant features may lead to unnecessary splits, increased complexity, and reduced generalization.
219
How does the greedy approach limit the optimality of Decision Trees?
Because decision trees use greedy splitting (locally optimal), they do not backtrack or consider global optimality. This means the final tree may not be the smallest or most accurate possible tree across all combinations of splits.
220
What is a social network in the context of data science?
A social network is a structure made up of nodes (representing people) and edges (representing relationships between those people). It is used to understand the topology, communities, and centrality of interactions among individuals.
221
What are the three main areas of study in social network analysis mentioned in the lecture?
The three main areas are: 1) Topology (structure of connections), 2) Communities (clusters of tightly connected nodes), and 3) Centrality (importance or influence of individual nodes).
222
How can online social networks be analyzed through data?
Online social networks can be analyzed using graph theory and network science tools like Gephi. Data such as user interactions, retweets, mentions, and co-occurrences can serve as proxies to reconstruct social networks.
223
What is a proxy social network and what are some examples of edges in it?
A proxy social network uses alternative data to define relationships. Examples include: Retweets, Mentions, Co-occurrences (e.g., in documents), Citations (academic works), and Co-authorships. These represent indirect social ties derived from activity data.
224
Define centrality in the context of network analysis.
Centrality measures the importance, influence, or connectedness of a node within a network. It can indicate access to information, control over communication, or prestige within the network.
225
What is degree centrality and how is it interpreted in directed networks?
Degree centrality is the number of edges connected to a node. In directed networks, it's divided into in-degree (incoming edges) and out-degree (outgoing edges). A higher degree may indicate greater influence or access to information.
226
When is degree centrality most useful and what are its limitations?
Degree centrality is most useful in simple analyses for identifying active nodes. However, it does not consider the importance of neighbors or indirect connections.
227
What is eigenvector centrality and what does it measure?
Eigenvector centrality accounts for both the quantity and quality of connections. A node is more central if it is connected to other highly central nodes. It’s particularly effective in undirected networks.
228
What is Katz centrality and how does it differ from eigenvector centrality?
Katz centrality extends eigenvector centrality by allowing for influence from distant nodes using a damping factor. It’s useful in directed networks where eigenvector centrality might not apply well.
229
What is PageRank and how is it related to Katz centrality?
PageRank is a variant of Katz centrality used by Google Search. It distributes importance based on the out-degree of neighbors, prioritizing links from more influential sources.
230
What is closeness centrality and what does it measure?
Closeness centrality measures the average length of the shortest paths from a node to all others. It identifies nodes that can quickly interact with others, useful for assessing communication efficiency.
231
What are the challenges of using closeness centrality?
Closeness centrality scores often fall within a narrow range, which makes comparisons between nodes less informative, and it is not well defined on disconnected networks (graphs with more than one component), since no finite shortest path exists between some pairs of nodes.
232
What is betweenness centrality and what does it indicate?
Betweenness centrality quantifies the number of times a node acts as a bridge along the shortest path between two other nodes. It measures control over communication and potential power in the network.
233
In what scenarios is betweenness centrality particularly useful?
It's useful in identifying brokers or gatekeepers in a network, i.e., nodes that control information flow between others. It’s effective in analyzing power and robustness.
234
How do centrality measures compare in identifying important nodes?
Different measures highlight different aspects: Degree focuses on direct connections, Eigenvector on influential neighbors, Closeness on speed of reach, and Betweenness on control of flow.
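A short networkx sketch comparing these measures on its built-in karate-club graph; the choice of graph is just for illustration.

```python
# Sketch: comparing centrality measures with networkx on a small built-in graph.
import networkx as nx

G = nx.karate_club_graph()

measures = {
    "degree":      nx.degree_centrality(G),       # direct connections
    "eigenvector": nx.eigenvector_centrality(G),  # connections to well-connected nodes
    "closeness":   nx.closeness_centrality(G),    # average shortest-path distance to others
    "betweenness": nx.betweenness_centrality(G),  # how often a node bridges shortest paths
    "pagerank":    nx.pagerank(G),                # Katz-style importance with damping
}
for name, scores in measures.items():
    top = max(scores, key=scores.get)
    print(f"{name:12s} most central node: {top} ({scores[top]:.3f})")
```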
235
What are historical examples of social network analysis?
Examples include JL Moreno's sociograms from 1934 analyzing student friendships and runaway behavior in girls' homes. They used directional and mutual attraction relationships to model behavior.
236
What was the significance of the Königsberg bridges in network science?
The Königsberg bridge problem inspired graph theory, showing how real-world problems can be modeled mathematically. Euler's solution laid the foundation for modern network analysis.
237
What is the Watts-Strogatz model and what does it explain?
The Watts-Strogatz model introduces 'small-world' networks, where most nodes are not neighbors but can be reached through a small number of steps. It explains phenomena like short path lengths in real-world networks.
238
What are scale-free networks and who introduced the concept?
Scale-free networks, introduced by Barabási and Bonabeau, follow a power-law degree distribution, meaning a few nodes have many connections while most have few. These networks are robust and naturally arise in many systems.
239
Why are networks considered to have emergent properties?
Networks exhibit emergent properties, where the whole is more than the sum of its parts. These properties, such as resilience, influence, or information spread, arise from the structure and interaction patterns (topology) rather than individual elements.
240
What is the definition of learning according to Herbert Simon?
Learning denotes changes in a system that are adaptive, enabling the system to perform the same or similar tasks more effectively in the future. This implies improvement, memory, and generalization.
241
What distinguishes unsupervised learning from supervised learning?
In supervised learning, the model is trained using labeled data (i.e., ground truth), while unsupervised learning works without labeled outputs, discovering hidden patterns or structures in input data.
242
What are common real-world applications of unsupervised learning?
Applications include customer segmentation, fraud detection, identifying new species, and pre-processing steps for supervised learning such as defining classes or topics.
243
What are the general steps involved in clustering?
Clustering involves iterating over data points, calculating distance or similarity between them, and grouping points that are closer to each other than to those outside their group.
244
How does clustering work conceptually?
Clustering identifies inherent groupings in data by measuring pairwise similarities and forming clusters with higher internal similarity and lower external similarity.
245
What is community detection and how is it related to clustering?
Community detection is the identification of modules or communities in network structures. It's conceptually similar to clustering but applied to nodes and edges in graphs, often used in network science.
246
Why is community detection important in social network analysis?
It helps uncover hidden structures in social networks such as user groups or communities that interact more heavily with each other, aiding in analysis of influence and behavior.
247
What is topic modelling and how does it apply unsupervised learning?
Topic modelling extracts abstract topics from text documents using clustering techniques. Each topic is represented by keywords. Algorithms like LDA (Latent Dirichlet Allocation) are commonly used (external source).
248
What are some examples of clustering algorithms?
- K-Means (user-defined K centroids)
- DBSCAN (density-based)
- Hierarchical Clustering (agglomerative or divisive trees)
Each has different assumptions and outcomes.
249
How does K-Means clustering work?
K-Means assigns each point to the nearest of K centroids and iteratively updates centroid positions to minimize within-cluster variance. The value of K is chosen by the user.
250
How does DBSCAN differ from K-Means?
DBSCAN groups points in high-density regions, automatically detecting the number of clusters and identifying outliers. It is useful for irregular cluster shapes and noisy data.
251
What is hierarchical clustering and how is it performed?
Hierarchical clustering creates a tree (dendrogram) of nested clusters. It can be agglomerative (bottom-up) or divisive (top-down), and doesn’t require specifying the number of clusters in advance.
252
What is the difference between hard and soft clustering?
Hard clustering assigns each object to a single cluster, while soft clustering allows objects to belong to multiple clusters with probabilities, similar to probabilistic models like Gaussian Mixture Models.
253
Why do different clustering algorithms produce different results?
Algorithms vary in their assumptions, initialization methods, and sensitivity to noise. Different algorithms may detect different structures in the same dataset.
254
What are the key ingredients for successful clustering?
Two critical components are: 1) How the data is represented and 2) The similarity or distance metric used to compare data points.
255
Why is data representation important in clustering?
The choice of features and how they are structured (e.g., vectors, categorical values) affect the clustering result. Poor representations can hide true groupings or create misleading clusters.
256
How does similarity or distance metric affect clustering?
Metrics define how close or far points are. Common ones include Euclidean distance (L2 norm), Manhattan distance (L1 norm), and Jaccard similarity for sets. The chosen metric impacts clustering shape and sensitivity.
257
What is Euclidean distance and when is it used?
Euclidean distance measures the straight-line distance between two points in Euclidean space. It’s widely used in algorithms like K-Means when data is continuous and real-valued.
258
What is Manhattan distance and how does it differ from Euclidean?
Manhattan distance (L1 norm) is the sum of absolute differences across dimensions. It's more robust to outliers and better for high-dimensional or grid-based data.
259
What is Jaccard similarity and when is it used?
Jaccard similarity measures overlap between two sets, calculated as intersection over union. It’s useful for binary or categorical data (e.g., tag co-occurrence, basket analysis).
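A small sketch computing the three measures above, using NumPy and plain Python sets; the vectors and sets are made up.

```python
# Sketch: Euclidean and Manhattan distance, and Jaccard similarity.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # L2 norm: straight-line distance
manhattan = np.sum(np.abs(a - b))           # L1 norm: sum of absolute differences

s1, s2 = {"milk", "bread", "eggs"}, {"milk", "eggs", "butter"}
jaccard = len(s1 & s2) / len(s1 | s2)       # intersection over union for sets

print(euclidean, manhattan, jaccard)        # ~3.606, 5.0, 0.5
```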
260
What are KL and Jensen-Shannon divergences used for?
These are metrics for comparing probability distributions. KL divergence is asymmetric, while Jensen-Shannon is a symmetric, smoothed version. Used in topic modelling and probabilistic clustering (external source).
261
Why is there often no single correct clustering answer?
Clustering is unsupervised, and different algorithms may reveal different but equally valid structures in the same dataset. The best choice depends on the problem and evaluation criteria.
262
How do social constructs and survey categories affect clustering results?
How data is labeled and categorized (e.g., UK ethnicity categories) can reflect biases and power dynamics. Understanding these contexts is vital in ethical data analysis. See: "Data Feminism" by D’Ignazio & Klein (external source).
263
Why is dimensionality reduction important?
It removes noise, focuses on the most informative features or combinations, and reduces computational complexity, making data analysis more efficient and interpretable.
264
What is feature selection in dimensionality reduction?
Feature selection is the process of selecting a subset of relevant features while removing less informative or redundant ones. It can use filter, wrapper, or embedded methods.
265
What is variance thresholding in feature selection?
A filter method that removes features with variance below a certain threshold, under the assumption that low-variance features contribute little information. Because variance depends on a feature's scale, features should first be brought to a comparable range (e.g., min-max normalization) so that their variances can be compared fairly.
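A minimal scikit-learn sketch of variance thresholding; the threshold and the toy matrix are made-up values.

```python
# Sketch: filter-style feature selection with a variance threshold.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 1.0, 0.1],
              [0.0, 2.0, 0.1],
              [0.0, 3.0, 0.2],
              [0.0, 4.0, 0.1]])   # column 0 is constant, column 2 barely varies

selector = VarianceThreshold(threshold=0.05)  # drop features with variance below 0.05
X_reduced = selector.fit_transform(X)
print(selector.variances_)   # per-feature variances
print(X_reduced.shape)       # (4, 1): only the middle column survives
```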
266
What is forward search in feature selection?
A wrapper method that starts with one feature, builds models incrementally by adding features that improve performance, and continues until the optimal number of features is selected.
267
What is recursive feature elimination (RFE)?
A wrapper method that starts with all features and iteratively removes the least important ones based on model performance until a specified number of features remain.
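A minimal RFE sketch with scikit-learn, here wrapped around a decision tree; the choice of estimator, dataset, and number of features to keep are illustrative.

```python
# Sketch: recursive feature elimination wrapped around a decision tree.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0),
          n_features_to_select=5)   # iteratively drop the least important features
rfe.fit(X, y)
print(rfe.support_)                 # boolean mask of the features that were kept
print(rfe.ranking_)                 # 1 = kept; larger numbers were eliminated earlier
```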
268
What is an embedded method for feature selection and give an example.
Embedded methods perform feature selection during model training. Example: Decision Trees, which choose features based on criteria like Gini impurity, information gain, or variance reduction.
269
What is the core idea behind feature extraction?
Feature extraction transforms the data into a new space by combining original features into new, more informative dimensions, often reducing redundancy and highlighting structure.
270
What is Principal Component Analysis (PCA)?
PCA is a linear transformation that identifies the directions (principal components) of maximum variance in the data. These are orthogonal and ranked by the amount of variance they explain.
271
How are principal components constructed in PCA?
Principal components are linear combinations of the original features. They are uncorrelated and ordered by their contribution to the data's total variance.
272
What are the mathematical steps to perform PCA?
1) Compute the covariance matrix from the data. 2) Diagonalize it to find eigenvalues and eigenvectors. 3) Eigenvectors are PCs, and eigenvalues represent variances. 4) Multiply the data by the eigenvectors to transform it.
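These steps can be reproduced directly in NumPy; a sketch on made-up data follows, with the step numbers in the comments matching the list above.

```python
# Sketch: PCA "by hand" with NumPy.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # toy data: 200 samples, 3 features
X = X - X.mean(axis=0)                 # centre the data first

cov = np.cov(X, rowvar=False)          # 1) covariance matrix (3 x 3)
eigvals, eigvecs = np.linalg.eigh(cov) # 2) diagonalize: eigenvalues and eigenvectors

order = np.argsort(eigvals)[::-1]      # 3) rank PCs by the variance they explain
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

X_pca = X @ eigvecs[:, :2]             # 4) project the data onto the top 2 components
print(eigvals / eigvals.sum())         # fraction of total variance explained by each PC
```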
273
What is the significance of eigenvalues in PCA?
Eigenvalues indicate the variance explained by each principal component. Higher eigenvalues correspond to more informative components.
274
What is the worst-case scenario for PCA?
When all variables are equally important and uncorrelated, PCA provides little to no dimensionality reduction benefit.
275
What are the limitations of PCA?
PCA assumes linear relationships and is sensitive to feature scaling. It also performs poorly when important structure lies in nonlinear relationships.
276
What is t-SNE and what does it do?
t-SNE (t-distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction method used for visualization. It preserves local structures by matching distance distributions between high- and low-dimensional spaces.
277
How does t-SNE work step-by-step?
1) Calculate pairwise distances in high-D space and fit Gaussian distribution. 2) Randomly scatter points in low-D space. 3) Fit t-distribution in low-D space. 4) Use gradient descent to minimize divergence between the two distributions.
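In practice the whole procedure is wrapped in a single library call; a hedged scikit-learn sketch, where the perplexity and initialization are illustrative choices.

```python
# Sketch: t-SNE for 2-D visualization of 64-dimensional digit images.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X_2d = TSNE(n_components=2, perplexity=30,
            init="pca", random_state=0).fit_transform(X)
print(X_2d.shape)   # (1797, 2): one low-dimensional point per digit
```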
278
What are strengths and weaknesses of t-SNE?
Strengths: excellent for visualizing high-dimensional data. Weaknesses: large (between-cluster) distances are unreliable, it is memory-intensive, results depend heavily on hyperparameters, and the resulting axes are not interpretable.
279
What is UMAP and how does it differ from t-SNE?
UMAP (Uniform Manifold Approximation and Projection) is a faster, more scalable nonlinear technique that preserves both local and global structure and can project to more than 3 dimensions.
280
What are benefits of UMAP over t-SNE?
UMAP runs faster, uses less memory, retains both local and global structure, and can be used beyond 2D or 3D visualizations.
281
What are common issues with t-SNE and UMAP?
Issues include high sensitivity to hyperparameters, misleading cluster sizes and distances, and lack of interpretability for the resulting axes.
282
Why is it important to learn how t-SNE and UMAP work?
Understanding their mechanics allows critical use and interpretation, helping users assess when they’re effective and when they might mislead. Tools must be evaluated, not just applied.
283
What is the John von Neumann quote and its relevance here?
“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” It highlights the dangers of overfitting and blindly trusting flexible models without understanding.
284
Why is hierarchical clustering a natural way to organize data?
Because it represents varying degrees of similarity through a tree structure, enabling flexible data partitioning at different levels.
285
What is a dendrogram?
A tree diagram used in hierarchical clustering to show nested groupings of data. Each node is a cluster; clusters of size one are singletons.
286
What is agglomerative (bottom-up) hierarchical clustering?
It starts with individual data points and iteratively merges the closest clusters until a full hierarchy (dendrogram) is built.
287
Why can't we try all dendrograms in agglomerative clustering?
The number of possible dendrograms grows super-exponentially with the number of data points, making brute-force approaches computationally infeasible.
288
What are the common methods for calculating distance between clusters?
- Single linkage: distance between closest points in clusters (can cause chaining)
- Complete linkage: distance between farthest points (can split large clusters)
- Average linkage: average distance between all points
- Centroid linkage: distance between cluster centroids (biased to spherical clusters)
- Ward’s method: minimizes total within-cluster variance (also spherical bias)
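A SciPy sketch comparing these linkage criteria on two made-up blobs, cutting each dendrogram into two flat clusters.

```python
# Sketch: agglomerative clustering with different linkage criteria (SciPy).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),          # two toy blobs
               rng.normal(4, 0.5, (20, 2))])

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)                     # build the dendrogram bottom-up
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut it into 2 flat clusters
    print(method, np.bincount(labels)[1:])            # resulting cluster sizes
```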
289
In what field is agglomerative hierarchical clustering historically significant?
Phylogenetics, for constructing evolutionary trees. Other methods in the field include Maximum Likelihood Estimates and Bayesian Inference.
290
How is the cutoff level in a dendrogram used?
It determines where to 'cut' the tree to form flat clusters. While sometimes clear, often many valid cutoffs exist, and judgment or evaluation metrics are needed.
291
When is DBSCAN more suitable than hierarchical clustering?
When clusters vary in shape and size, and when data contains noise. DBSCAN does not assume a particular shape and can exclude outliers.
292
What is the core idea behind DBSCAN?
A point belongs to a cluster if the density of its neighborhood exceeds a threshold. Clusters are formed from densely connected regions.
293
What are the two key hyperparameters in DBSCAN?
- eps: radius of the neighborhood
- MinPts: minimum number of points required within that radius to form a core point
294
What are the three types of points in DBSCAN?
- Core point: has at least MinPts neighbors within eps
- Border point: fewer than MinPts within eps, but close to a core point
- Noise point: neither core nor border; outliers
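A scikit-learn sketch; eps, min_samples (MinPts), and the toy dataset are illustrative, and noise points come back with the label -1.

```python
# Sketch: DBSCAN on two crescent-shaped clusters.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_                         # cluster id per point, -1 = noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", np.sum(labels == -1))
print("core points:", len(db.core_sample_indices_))   # indices of the core points
```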
295
What property does DBSCAN ensure for clusters?
All points in a cluster are density-connected: each can be reached from the others through chains of core points in which consecutive points lie within eps of each other. This is what allows clusters of arbitrary shape.
296
What are advantages of DBSCAN?
- Handles noise well
- Identifies clusters of various shapes and sizes
- Doesn’t require predefining the number of clusters
297
What are the limitations of DBSCAN?
- Sensitive to choice of eps and MinPts
- Struggles with datasets having clusters of varying densities
298
How do eps and MinPts interact in DBSCAN?
A smaller eps requires a denser neighborhood to qualify as a core point. Adjusting one affects the required density threshold set by the other.
299
Why is DBSCAN considered a density-based method?
Because it defines clusters based on the local density of data points rather than relying on distance to centroids or linkage heuristics.
300
What makes DBSCAN different from K-means?
DBSCAN does not require specifying the number of clusters, works with arbitrary shapes, and can exclude noise, unlike K-means which assumes spherical clusters and needs a predefined K.
301
What is partitional clustering?
Algorithms that partition the dataset into K non-overlapping, flat clusters. Unlike hierarchical methods, partitional clustering is faster (O(NK) vs. O(N²)) and includes algorithms like K-means.
302
What is the main advantage of partitional methods over hierarchical clustering?
Efficiency: partitional methods scale better by comparing each point to K centroids (O(NK)) rather than to all other points (O(N²)).
303
How does the K-means clustering algorithm work?
1. Choose K and initialize centroids randomly.
2. Assign each point to the nearest centroid.
3. Recompute centroids based on current cluster members.
4. Repeat steps 2–3 until convergence (minimal change in assignments or centroids).
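A minimal NumPy sketch of this loop; the data and K are made up, and the sketch does not handle the rare case of an empty cluster.

```python
# Sketch: a bare-bones K-means loop mirroring the steps above.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # 1) random initial centroids
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                           # 2) assign to nearest centroid
        new_centroids = np.array([X[labels == j].mean(axis=0)   # 3) recompute centroids
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):               # 4) stop when centroids settle
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.default_rng(1).normal(m, 0.3, (50, 2)) for m in (0, 3, 6)])
labels, centroids = kmeans(X, k=3)
print(centroids)
```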
304
What are common stopping criteria for K-means?
Few or no point reassignments, minimal change in centroid positions, or minimal change in Sum of Squared Errors (SSE).
305
What is the computational complexity of K-means?
O(NdKt), where N = number of points, d = dimensions, K = clusters, t = iterations.
306
What are the main pros of K-means?
It’s efficient, conceptually simple, and works well for large datasets where clusters are spherical and similar in size.
307
What are the major cons of K-means?
- Assumes centroids are meaningful (troublesome for categorical data)
- Sensitive to outliers
- Requires predefining K
- Struggles with clusters of different size, shape, or density
- Sensitive to initial centroids
308
What are some preprocessing steps to improve K-means?
Pre-processing: normalize or standardize the data and remove outliers. Post-processing: eliminate small clusters that may be outliers, split "loose" clusters with high within-cluster error, and merge clusters that are close to each other.
309
What is a robust alternative to K-means for outlier handling and flexible clustering?
Gaussian Mixture Models (GMM), which use soft assignments and probabilistic cluster representations.
310
What is a Gaussian Mixture Model (GMM)?
A probabilistic model where each cluster is represented by a Gaussian distribution. GMMs estimate the mean (μ), variance (σ²), and mixing coefficient (π) for each Gaussian.
311
How does GMM differ from K-means?
- GMM uses soft assignment (probabilities) vs. hard assignment in K-means
- GMM clusters can be elliptical vs. spherical for K-means
- GMM maximizes log-likelihood; K-means minimizes SSE
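A scikit-learn sketch contrasting the two on the Iris data; the number of components and other settings are illustrative.

```python
# Sketch: hard K-means labels vs. soft GMM responsibilities.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X, _ = load_iris(return_X_y=True)

hard = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # one label per point
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
soft = gmm.predict_proba(X)            # per-point probability of belonging to each Gaussian

print(hard[:3])                        # hard cluster labels for the first three points
print(soft[:3].round(3))               # each row sums to 1: responsibilities of each component
```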
312
What is P(x) in a GMM context?
The probability density of a point x under the model, defined as the weighted sum of all the Gaussian components: P(x) = Σ_k π_k · N(x | μ_k, σ_k²), where the mixing coefficients π_k sum to 1 and N(x | μ_k, σ_k²) is the k-th Gaussian density.
313
What are clustering validation metrics?
- External: Compare clusters to ground truth labels
- Internal: Measure cohesion/separation (e.g., Silhouette Score)
- Relative: Compare different clustering algorithms
314
What is external validation in clustering?
Using a known label (e.g. class labels) to assess how well the clustering aligns with ground truth using matrices like incidence or confusion matrices.
315
What is internal validation?
Evaluates clustering based on intrinsic structure, including:
- Cohesion (Within Sum of Squares): how compact clusters are
- Separation (Between Sum of Squares): how distinct clusters are
316
What is the silhouette coefficient and how is it calculated?
The silhouette coefficient s for a point is s = (b - a) / max(a, b), where:
- a = mean intra-cluster distance
- b = mean nearest-cluster distance
Range: -1 (bad) to +1 (good).
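A scikit-learn sketch of both the averaged and per-point versions; the choice of clustering and dataset is illustrative.

```python
# Sketch: silhouette scores for a K-means clustering of the Iris data.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = load_iris(return_X_y=True)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))        # mean s over all points (closer to +1 is better)
print(silhouette_samples(X, labels)[:5])  # per-point s = (b - a) / max(a, b)
```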
317
How is the silhouette coefficient used in practice?
Used for individual points or averaged over all points. Commonly visualized as bar plots grouped by clusters to assess cluster quality.
318
How do K-means and GMM perform on datasets like Iris?
K-means makes hard assignments with spherical boundaries, while GMM gives soft assignments and adapts better to elliptical cluster shapes.
319
What is the primary goal of computer vision?
To extract high-level information from images or videos, such as object recognition, scene understanding, and 3D reconstruction.
320
What are common applications of computer vision?
3D image reconstruction, object detection, image segmentation, panorama stitching, 3D terrain modelling, and position tracking (e.g., used by the NASA Spirit rover).
321
How is an image represented in computer vision?
As a matrix of pixel values. For a color image, it's typically represented by three matrices: one each for Red, Green, and Blue channels.
322
What aspect of the human brain inspired deep neural networks?
The layered structure of the visual cortex, where information flows through hierarchical layers. Deep neural networks try to mimic this structure.
323
What is a perceptron in neural networks?
A basic computational unit that takes multiple inputs, applies weights, sums them, and passes them through an activation function to produce a binary output.
324
What are the drawbacks of using fully connected neural networks on images?
They scale poorly (100x100 image = 10,000 weights per node), are sensitive to small changes in input, and do not exploit spatial correlations between pixels.
325
Who were Hubel and Wiesel, and what did they discover?
They were neuroscientists who found that specific neurons in a cat's visual cortex respond to edges and lines of certain orientations. This inspired feature detection in convolutional neural networks.
326
What is a convolution in the context of neural networks?
It's the process of applying a filter (kernel) to an image using a dot product and sliding window approach, generating a feature map that highlights the presence of specific patterns (like edges).
327
How is a feature map generated in a CNN?
By applying a filter to each region of the image (via convolution), summing the results, adding a bias, and storing the output in a new matrix—the feature map.
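A NumPy sketch of this sliding-window computation, using a 6x6 toy image and a hand-made vertical-edge filter (strictly this is cross-correlation, which is what CNN libraries compute).

```python
# Sketch: producing a feature map with a 3x3 filter, stride 1, no padding.
import numpy as np

def conv2d(image, kernel, bias=0.0):
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel) + bias   # dot product of patch and filter
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
vertical_edge = np.array([[1.0, 0.0, -1.0]] * 3)        # simple vertical-edge detector
feature_map = conv2d(image, vertical_edge)
print(feature_map.shape)                 # (4, 4)
print(np.maximum(feature_map, 0))        # a ReLU would then zero out negative responses
```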
328
What does a feature map represent in CNNs?
It indicates where a particular feature (e.g., edge or pattern) occurs in the input image. Positive values suggest the feature is present; negative values suggest absence.
329
Why are convolutional neural networks effective for images?
They reduce the number of parameters compared to fully connected networks, exploit spatial relationships, and detect hierarchical features efficiently.
330
How has computer vision evolved since ~1980?
Advancements include deeper architectures (deep CNNs), more data, GPU acceleration, cloud computing, and accessible libraries like TensorFlow, Keras, and PyTorch.
331
What is deep learning in the context of computer vision?
A subfield of machine learning involving neural networks with many layers, enabling high-level abstractions in data, particularly useful for image and video tasks.
332
What are limitations of deep learning in computer vision?
It requires large datasets, significant computational resources (e.g., GPUs), lacks uncertainty representation, is hard to optimize, and is often seen as a black box.
333
What are some solutions or directions to overcome deep learning limitations?
Research into interpretable models, efficient architectures (e.g., MobileNet), improved training strategies, uncertainty-aware models, and transfer learning are ongoing solutions.
334
What comes next after generating feature maps in a CNN?
Applying non-linear activation functions (e.g., ReLU), downsampling (e.g., pooling), and fully connected layers to perform tasks like classification or detection.
335
Why is backpropagation essential in CNNs?
It enables the network to learn the optimal weights (filters) by minimizing the loss function through gradient descent across all layers.
336
What does the ReLU activation function do?
ReLU (Rectified Linear Unit) outputs max(0, x), discarding all negative values and keeping positive ones unchanged. It introduces non-linearity and is computationally efficient.
337
Why is ReLU preferred in CNNs?
It avoids the vanishing gradient problem seen in sigmoid/tanh, speeds up training, and works well for many image-based tasks.
338
What is the purpose of pooling in CNNs?
Pooling downsamples feature maps by summarizing regions (e.g., max or average pooling), reducing dimensionality and computation while preserving key features.
339
What is max pooling and how does it work?
Max pooling selects the highest value in a patch of the feature map. It helps retain the most salient features and makes the model more robust to spatial variations.
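A NumPy sketch of 2x2 max pooling with stride 2 on a made-up feature map.

```python
# Sketch: 2x2 max pooling with stride 2.
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    h, w = feature_map.shape
    out_h, out_w = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = feature_map[i * stride:i * stride + size,
                                j * stride:j * stride + size]
            out[i, j] = patch.max()          # keep only the strongest activation per patch
    return out

fmap = np.array([[1.0, 3.0, 2.0, 0.0],
                 [4.0, 6.0, 1.0, 2.0],
                 [0.0, 2.0, 5.0, 1.0],
                 [3.0, 1.0, 2.0, 8.0]])
print(max_pool(fmap))                        # [[6. 2.], [3. 8.]]
```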
340
What is average pooling and when might it be used?
Average pooling computes the average value of a patch. It may be used when smoother feature representations are desired.
341
What is stride in CNNs?
Stride is the number of pixels the filter moves during convolution or pooling. A larger stride reduces output size more aggressively.
342
What is padding and why is it used?
Padding adds extra pixels (usually zeros) around the input image to control the spatial size of the output and preserve edge information.
343
What happens after feature extraction and pooling in CNNs?
The output is flattened and passed into traditional machine learning classifiers, like logistic regression or fully connected neural networks, for final prediction.
344
Why might ReLU not always be ideal?
It discards all negative values, which can lead to dead neurons (never activated). Alternative activations like Leaky ReLU can help.
345
What issue is associated with the tanh activation function?
The derivative of tanh tends to zero for large input values, causing vanishing gradients that slow down or halt learning during backpropagation.
346
What is the softmax function used for in neural networks?
It converts output scores into a probability distribution over classes, making it useful for multi-class classification.
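A NumPy sketch of a numerically stable softmax; the logits are made up.

```python
# Sketch: softmax turning raw scores into a probability distribution.
import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)        # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs, probs.sum())                    # roughly [0.659 0.242 0.099], sums to 1.0
```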
347
What is the role of learning rate in training neural networks?
It determines how much weights are updated during training. Too high can overshoot minima; too low may converge slowly or get stuck.
348
What are common problems with learning rates?
Large rates may skip minima; small ones may lead to slow training or local minima entrapment. Choosing the right rate is crucial.
349
What are solutions to learning rate problems?
Use decay (reduce over time), constant rates, or advanced optimizers like Adagrad, Adadelta, RMSprop, or Adam that adjust learning rates dynamically.
350
What is momentum in gradient descent?
A technique that adds a fraction of the previous update to the current one, helping accelerate learning and smooth out oscillations.
351
What are hyperparameters in neural networks?
Settings defined before training, including filter size, number of filters, padding, stride, learning rate, dropout, batch size, activation function, etc.
352
Why is hyperparameter tuning important in CNNs?
The performance of a CNN depends heavily on well-chosen hyperparameters. They influence learning speed, generalization, and model accuracy.
353