Exam questions Flashcards

Concepts described by the teacher (84 cards)

1
Q

(L1): What distinguishes data science from traditional analytics or business intelligence?

A

Data science goes beyond reporting and dashboards; it focuses on extracting actionable insights from complex and often unstructured data using programming, statistical modeling, and machine learning. Unlike traditional business intelligence, which is retrospective and descriptive, data science is predictive, exploratory, and iterative.

2
Q

(L1): How does the data science workflow support data-driven decision making?

A

The workflow—typically based on models like CRISP-DM—guides the process from understanding a business problem to collecting, preparing, analyzing, and deploying data solutions. It ensures that insights are not only technically correct but also aligned with business objectives.

3
Q

(L1): What are the main types of data analytics (descriptive, diagnostic, predictive, prescriptive) and how are they applied in business?

A

Descriptive: What happened? (e.g., sales reports)

Diagnostic: Why did it happen? (e.g., churn analysis)

Predictive: What will happen? (e.g., demand forecasting)

Prescriptive: What should we do? (e.g., route optimization)

Each type supports decision-making at different stages, from insight generation to strategic planning.

4
Q

(L1): Why is programming an essential skill for modern data scientists, even in business contexts?

A

Programming enables data scientists to automate tasks, clean and manipulate large datasets, develop models, and customize analyses. It bridges the gap between raw data and strategic insights, empowering them to build scalable, repeatable solutions.

5
Q

(L1): What is R, and why is it widely used in data science?

A

R is a statistical programming language designed for data analysis, visualization, and modeling. It is popular because of its rich package ecosystem, strong community support, and strengths in exploratory and statistical work. It’s especially favored in academia and applied research.

6
Q

(L1): How does understanding programming improve your ability to collaborate across business and technical teams?

A

Programming literacy helps data scientists translate business questions into data problems, communicate with developers, and explain technical results to non-technical stakeholders. This ensures that analytical solutions are strategically relevant and implementable.

7
Q

(L2): What does it mean to say that all data is “socially constructed”?

A

To say that all data is “socially constructed” means that data does not exist independently as pure, objective facts. Instead, it is:
Collected, categorized, and defined by people based on specific goals, contexts, and assumptions.

Shaped by choices about what to measure, how to measure it, and for what purpose.

Embedded with values, biases, and power structures, often reflecting the interests of those who design the systems.

As emphasized in Rosenberg (2013), “raw data” is an oxymoron—data is never neutral; it is always filtered through human and institutional decisions.

8
Q

(L2): Why is it important to question the objectivity of data sources in business contexts?

A

Questioning the objectivity of data sources in business contexts is crucial because:
Data reflects design choices—what to collect, how, and from whom—which may introduce bias or omissions.

Business decisions based on biased data can lead to unfair outcomes (e.g., discriminatory models), misallocation of resources, or flawed strategies.

Contextual understanding is needed to avoid over-trusting data that appears neutral but is influenced by historical, cultural, or organizational factors.

It ensures ethical, accurate, and responsible use of data for modeling, forecasting, and decision-making.

In short: uncritical use of data risks turning flawed input into flawed conclusions.

9
Q

(L2): How can data collection methods introduce bias into an analysis?

A

Data collection methods can introduce bias into an analysis through:
Sampling bias – when the data does not represent the target population (e.g., only collecting data from active users).

Measurement bias – when the tools or definitions used to collect data skew results (e.g., vague survey questions, poorly calibrated sensors).

Exclusion bias – when important groups or variables are left out (e.g., ignoring non-digital consumers in online studies).

Observer or recording bias – when human judgment affects what is recorded or how (e.g., manual categorization or tagging).

Platform or algorithmic bias – when digital systems (e.g., search engines, social media) shape what data gets collected in the first place.

These biases distort findings, reduce generalizability, and can lead to misleading or harmful conclusions.

10
Q

(L2): What are some practical ways to document data provenance and transformation in a project?

A

Practical ways to document data provenance (origin) and transformation (changes) include:
Data dictionaries – Describe each variable: name, type, source, meaning, and units.
Metadata files – Record dataset origin, collection date, method, and context.
Version control (e.g., Git) – Track changes in datasets, scripts, and models over time.
Code-based workflows – Use reproducible scripts (e.g., R scripts, Jupyter notebooks) to log each data cleaning and transformation step.
CRISP-DM documentation – Follow structured steps to log business understanding, data preparation, modeling, and evaluation.
Data lineage diagrams – Visualize how raw data moves through transformations to final outputs.

These practices support transparency, reproducibility, and auditability.

11
Q

(L2): What ethical considerations should be made before collecting or using data for analysis?

A

Before collecting or using data for analysis, key ethical considerations include:
Consent – Was data collected with informed, voluntary consent?
Privacy – Does the analysis protect personal or sensitive information? Are anonymization and data minimization applied?
Purpose limitation – Is the data used strictly for its intended, declared purpose?
Bias and fairness – Could the data or analysis lead to discrimination or reinforce social inequalities?
Transparency – Are data sources, assumptions, and limitations clearly documented?
Accountability – Who is responsible for the outcomes of the analysis, especially if automated decisions are involved?
These issues are central to responsible data science and align with principles discussed in Lecture 9 (Ethics).

12
Q

(L2): What does the phrase “no such thing as raw data” mean in the context of data science?

A

The phrase “no such thing as raw data” means that data is never neutral, pure, or untouched—it is always the result of human decisions about what to observe, how to measure it, and why it matters.
In data science, this highlights that:
Data is constructed, not discovered.
It reflects assumptions, context, and bias from its collection and processing.
Analysts must treat data as rhetorical and interpretive, not as unquestionable fact.

This concept, emphasized by Rosenberg (2013), challenges the myth of objective data and calls for critical engagement with how data is created and used.

13
Q

(L2): How do social, technical, and political choices influence how data is collected and interpreted?

A

Social, technical, and political choices shape both what data is collected and how it is interpreted, in the following ways:
Social: Cultural norms and societal values influence what is deemed important to measure (e.g., gender categories, health metrics).
Technical: The tools and systems used (e.g., sensors, platforms, algorithms) define what can be captured and how accurately.
Political: Policies, funding, and power dynamics determine data priorities, access, and framing (e.g., census questions, surveillance practices).

These choices embed bias, exclusions, and power structures into data, affecting both the analysis and the decisions based on it. Data is never neutral—it reflects the worldviews of those who design its collection and use.

14
Q

(L2): Why is it important to document data provenance, metadata, and transformations?

A

Documenting data provenance, metadata, and transformations is essential because it ensures:
Transparency – Others can understand where the data came from and how it was processed.
Reproducibility – Analyses can be repeated and validated by others using the same steps.
Accountability – It’s clear who made which decisions, reducing errors and ethical risks.
Contextual understanding – Metadata provides meaning, helping analysts interpret data correctly.
Data quality control – Tracks issues like missing values or inconsistencies introduced during cleaning.
Without documentation, analysis becomes opaque, unreliable, and potentially misleading.

15
Q

(L2): What risks arise when we treat data as objective or neutral?

A

Treating data as objective or neutral introduces serious risks:
Bias reinforcement – Hidden biases in data can be mistaken for truths, leading to discriminatory models or decisions.
False legitimacy – Flawed conclusions gain credibility because “the data says so.”
Ethical blind spots – Ignoring the social context of data may result in privacy violations or harm to vulnerable groups.
Oversimplification – Complex social issues may be reduced to misleading numbers or categories.
Uncritical automation – Models trained on biased or incomplete data can make flawed decisions at scale.
In short: assuming neutrality masks the human choices behind data and undermines responsible analysis.

16
Q

(L2): How can ethical considerations be incorporated at the data collection stage?

A

Ethical considerations can be incorporated at the data collection stage by:
Obtaining informed consent – Ensure participants know what data is collected, why, and how it will be used.
Minimizing data – Collect only what is necessary to reduce privacy risks.
Ensuring anonymity – Remove or mask personal identifiers where possible.
Being inclusive and fair – Design sampling methods to represent diverse groups and avoid exclusion.
Clarifying purpose and ownership – Be transparent about who owns the data and for what purposes it will be used.
Following legal and ethical standards – Comply with data protection laws (e.g., GDPR) and institutional ethics guidelines.
These steps promote responsible, trustworthy data practices from the outset.

17
Q

(L3/4): What is the CRISP-DM model, and how does it structure a data science project?

A

CRISP-DM (Cross-Industry Standard Process for Data Mining) structures a data science project into six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. It provides a flexible, iterative framework to align technical work with business needs.

18
Q

(L3/4): How does business understanding translate into an analytical objective in CRISP-DM?

A

The business understanding phase involves clarifying goals, constraints, and success criteria. These are then translated into specific analytical tasks, like predicting customer churn or segmenting users, forming the basis for model development.

19
Q

(L3/4): Why is iteration essential in the data science solution framework?

A

Iteration allows for refinement as new insights emerge during data exploration or modeling. It ensures models remain relevant, reliable, and aligned with business goals, especially when assumptions or data quality issues are uncovered later in the process.

20
Q

(L3/4): What are the differences between analytical and operational deployment of models?

A

Analytical deployment refers to using the model for decision support (e.g., dashboards, ad hoc analysis).
Operational deployment means embedding the model into automated systems for real-time or repeated use (e.g., credit scoring in loan applications).

21
Q

(L3/4): How can you use proxy variables or composite measures in data preparation?

A

Proxy variables are substitutes when direct measures are unavailable (e.g., using zip code as a proxy for income). Composite measures combine multiple indicators into one (e.g., customer engagement index), often improving interpretability or predictive power.
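A small base-R sketch of building a composite measure from standardized indicators, using a hypothetical customers data frame (the column names are made up for the example):

# Hypothetical customer data (illustrative only)
customers <- data.frame(
  logins = c(12, 3, 25, 7),
  purchases = c(4, 0, 9, 2),
  support_tickets = c(1, 5, 0, 2)
)

# Standardize each indicator so they are on a comparable scale
scaled <- scale(customers[, c("logins", "purchases", "support_tickets")])

# Composite engagement index: average of the standardized indicators
# (support tickets enter negatively, treating many tickets as friction)
customers$engagement_index <- rowMeans(
  cbind(scaled[, "logins"], scaled[, "purchases"], -scaled[, "support_tickets"])
)

customers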

22
Q

(L3/4): Why is model evaluation tied to business success criteria, not just statistical accuracy?

A

A model can be statistically strong yet useless in practice if it doesn’t improve business outcomes. Evaluation should consider metrics like ROI, user adoption, or operational feasibility, not just accuracy or precision.

23
Q

(L3/4): How does maintaining a reproducible script structure (e.g., load packages → load data → analysis) improve project quality?

A

A clear structure improves readability, reusability, and makes debugging easier. It also supports replication, which is essential for validating results and collaborating with others on shared codebases.
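A minimal sketch of that structure in R, assuming the dplyr and ggplot2 packages and a hypothetical data/sales.csv file with month and revenue columns:

# --- 1. Load packages -------------------------------------------------
library(dplyr)    # data manipulation
library(ggplot2)  # visualization

# --- 2. Load data -----------------------------------------------------
# Hypothetical file; adjust the path to your own project
sales <- read.csv("data/sales.csv")

# --- 3. Analysis ------------------------------------------------------
monthly <- sales %>%
  group_by(month) %>%
  summarise(total_revenue = sum(revenue, na.rm = TRUE))

ggplot(monthly, aes(x = month, y = total_revenue)) +
  geom_col()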

24
Q

(L3/4): What is the Data Science Solution Framework (DSSF), and why is it useful?

A

The DSSF is a structured approach to solving business problems with data. It ensures alignment between analytical methods and business needs, guiding teams from problem definition to solution evaluation in a repeatable and goal-oriented way.

25
(L3/4): How do you move from a business problem to a data science research question?
You translate a broad business issue (e.g., declining retention) into a specific, measurable question (e.g., can we predict customer churn?). This involves identifying target variables, key features, and outcome definitions.
26
(L3/4): What distinguishes descriptive, diagnostic, predictive, and prescriptive analytics?
Descriptive: What happened? (Summarizes past data)
Diagnostic: Why did it happen? (Explains causes)
Predictive: What will happen? (Forecasts outcomes)
Prescriptive: What should be done? (Suggests actions)
Each type builds on the previous to increase decision-making power.
27
(L3/4): Why is it important to match the solution type to the decision context and data availability?
Different problems require different levels of insight. Using a predictive model for a problem without labeled data, for example, would be inappropriate. Aligning solution type ensures relevance, efficiency, and practical value.
28
(L3/4): What criteria should be used to evaluate whether a data science solution is “good”?
A good solution is: Accurate (technically sound), Actionable (supports decisions), Ethical (avoids harm or bias), Interpretable, and Feasible to implement and maintain in the business environment.
29
(L3/4): How can unintended consequences or ethical risks arise from poorly framed solutions?
If the problem or data is biased, models can reinforce inequality, misclassify vulnerable groups, or be misused. Lack of transparency or oversight can also lead to unethical outcomes or loss of trust.
30
(L3/4): What role does iteration play in building effective data science solutions?
Iteration allows continuous learning and adjustment to new findings, making solutions more accurate, robust, and aligned with evolving business needs. It’s key to navigating uncertainty and improving model relevance over time.
31
(Time series): What is time series analysis and why is it important in data science?
Time series analysis involves examining data points collected or recorded at successive points in time to identify patterns such as trends, seasonality, and cycles. Unlike cross-sectional data, time series data exhibit temporal dependence, meaning past values influence future ones. It's important in data science because many real-world phenomena—stock prices, weather, sales, or sensor readings—unfold over time. Understanding these temporal dynamics enables better forecasting, anomaly detection, and decision-making under uncertainty.
32
(Time series): Can you explain the concept of stationarity in time series analysis?
Stationarity refers to a time series whose statistical properties—mean, variance, and autocovariance—do not change over time. It is crucial because most forecasting models (e.g., ARMA, ARIMA) assume stationarity to produce valid and reliable predictions. A stationary series is predictable in a probabilistic sense, while a non-stationary series may exhibit trends or changing variance that can mislead model estimates.
33
(Time series): How does autocorrelation impact time series data, and how can it be identified and addressed?
Autocorrelation measures how current values in a time series relate to past values (lags). High autocorrelation implies dependency over time, which violates the assumption of independent observations in many models. It can be identified using the Autocorrelation Function (ACF) plot, where persistent spikes suggest autocorrelation. To address it, we can apply differencing, fit ARIMA/ARMA models, or use transformations to remove patterns and stabilize variance.
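A base-R sketch of spotting autocorrelation with an ACF plot and reducing it by differencing, using the built-in AirPassengers series:

# Built-in monthly airline passenger series (1949-1960)
y <- AirPassengers

# ACF plot: slowly decaying spikes indicate strong autocorrelation / trend
acf(y, main = "ACF of original series")

# First difference removes the trend; a seasonal difference (lag 12) removes seasonality
y_diff <- diff(diff(y), lag = 12)

# The ACF of the differenced series decays much faster
acf(y_diff, main = "ACF after differencing")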
34
(Time series): What are the key components of a time series?
The key components of a time series are:
Trend (Tₜ): A long-term upward or downward movement in the data.
Seasonality (Sₜ): Regular, periodic fluctuations (e.g., yearly, monthly).
Cyclicality: Irregular, longer-term oscillations not tied to a fixed calendar.
Irregular/Noise (εₜ): Random variation not explained by the above components.
These are often combined in additive (yₜ = Tₜ + Sₜ + εₜ) or multiplicative (yₜ = Tₜ × Sₜ × εₜ) models.
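A base-R sketch of separating these components with classical decomposition and STL, using the built-in AirPassengers series:

y <- AirPassengers

# Classical decomposition into trend, seasonal, and random components
# (multiplicative, since the seasonal swings grow with the level)
parts <- decompose(y, type = "multiplicative")
plot(parts)

# STL is a more flexible alternative (additive, so log-transform first)
fit_stl <- stl(log(y), s.window = "periodic")
plot(fit_stl)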
35
(Time series): What is the difference between time series analysis and cross-sectional analysis?
Time series analysis examines data points collected over time from a single subject or entity, focusing on temporal dependencies and patterns (e.g., trend, seasonality). Cross-sectional analysis, on the other hand, analyzes data collected at a single point in time across multiple entities (e.g., individuals, regions), assuming independence between observations. The key difference lies in temporal structure: time series accounts for time-order and autocorrelation, while cross-sectional does not.
36
(Time series): What are some common methods for forecasting in time series analysis?
Common forecasting methods in time series analysis include:
Naïve and Moving Average: Simple baselines using past values or smoothed averages.
Exponential Smoothing (e.g., Holt-Winters): Weighs recent observations more heavily.
ARIMA (AutoRegressive Integrated Moving Average): Combines autoregression, differencing, and moving average components.
Seasonal ARIMA (SARIMA): Extends ARIMA to model seasonality.
Machine Learning & Deep Learning: Methods like LSTM or neural networks for complex, non-linear patterns.
Each method suits different patterns like trend, seasonality, and noise.
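A sketch comparing a naïve baseline, exponential smoothing, and an automatically selected seasonal ARIMA, assuming the forecast package is installed:

library(forecast)

y <- AirPassengers

# Naive forecast: repeat the last observed value for the next 12 months
fc_naive <- naive(y, h = 12)

# Exponential smoothing (Holt-Winters family, model chosen automatically)
fc_ets <- forecast(ets(y), h = 12)

# Seasonal ARIMA with automatic order selection
fc_arima <- forecast(auto.arima(y), h = 12)

# Compare in-sample accuracy (RMSE, MAE, etc.)
accuracy(fc_naive)
accuracy(fc_ets)
accuracy(fc_arima)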
37
(Time series): Can you explain the difference between AR, MA, and ARIMA models?
AR (Autoregressive) models predict future values based on past values: the current value depends linearly on its previous observations.
MA (Moving Average) models predict current values based on past errors (shocks or residuals), smoothing out noise using past forecast errors.
ARIMA (AutoRegressive Integrated Moving Average) combines both AR and MA and adds differencing to handle non-stationary data by removing trends or seasonality before modeling.
Together, ARIMA captures memory (AR), noise structure (MA), and non-stationary patterns.
38
(Time series): How do you handle seasonality in time series data?
Seasonality in time series can be handled using several methods:
Decomposition: Break the series into trend, seasonal, and residual components (additive or multiplicative).
Differencing: Apply seasonal differencing (e.g., subtract the value from the same month last year) to stabilize seasonal patterns.
Modeling: Use models like Holt-Winters or SARIMA, which explicitly include seasonal components.
Transformation: Apply log or Box-Cox transforms when seasonal variance increases over time.
These techniques help isolate or remove seasonal effects for clearer modeling.
39
(Time series): What is a lag variable and how is it used in time series analysis?
A lag variable is simply a previous value in a time series—like using yesterday’s temperature to help predict today’s. In time series analysis, lag variables are used because past values often influence future ones. For example, in sales data, what happened last month might affect this month’s outcome. Models use these lags to recognize and learn from such patterns.
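A small base-R sketch of creating a lag variable and using it in a simple model, with hypothetical monthly sales numbers:

# Hypothetical monthly sales figures
sales <- c(100, 120, 130, 125, 150, 160)

# Lag by one period: the previous month's value aligned with the current month
sales_lag1 <- c(NA, head(sales, -1))

df <- data.frame(
  month = 1:6,
  sales = sales,
  sales_lag1 = sales_lag1
)

# A simple autoregressive-style model: this month's sales explained by last month's
fit <- lm(sales ~ sales_lag1, data = df)
summary(fit)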
40
(Time series): How can you test for stationarity in a time series data set?
You can test for stationarity using both visual and statistical methods:
Visual inspection: Plot the data and look for constant mean and variance over time.
ACF plot: Persistent autocorrelations suggest non-stationarity.
Statistical tests:
ADF test (Augmented Dickey-Fuller): Checks for a unit root; a low p-value suggests stationarity.
KPSS test: Tests the null hypothesis of stationarity.
PP test (Phillips-Perron): Similar to ADF, but adjusts for serial correlation and heteroscedasticity.
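A sketch of running the ADF and KPSS tests in R, assuming the tseries package:

library(tseries)

y <- AirPassengers

# ADF: null hypothesis = unit root (non-stationary); small p-value -> evidence of stationarity
adf.test(y)

# KPSS: null hypothesis = stationarity; small p-value -> evidence of non-stationarity
kpss.test(y)

# Re-test after differencing to see the effect of removing trend and seasonality
y_diff <- diff(diff(y), lag = 12)
adf.test(y_diff)
kpss.test(y_diff)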
41
(Time series): How are trends detected and handled in time series analysis?
Trends are detected by observing a consistent increase or decrease over time in a time series plot or through techniques like moving averages and regression lines. To handle trends:
Apply differencing (subtract each value from its previous one) to remove the trend.
Use detrending techniques like subtracting a fitted line or model.
Choose models (like ARIMA) that include a trend component.
These steps help make the series stationary, which is essential for many forecasting methods.
42
(Time series): How can you use time series decomposition to better understand a data set?
Time series decomposition separates a series into its trend, seasonal, and residual (noise) components. This helps you:
Identify patterns: Understand long-term movements and seasonal cycles.
Improve modeling: Handle each component individually (e.g., remove seasonality before forecasting).
Detect anomalies: Spot irregular spikes or drops in the residual component.
By breaking the series down, you gain clearer insights into what drives changes over time.
43
(Time series): What are the applications of time series analysis in real-world data science scenarios?
Time series analysis is widely applied in real-world data science for:
Finance: Stock price prediction, volatility modeling.
Retail & Marketing: Forecasting sales or customer demand over time.
Economics: Analyzing GDP, inflation, or unemployment trends.
Healthcare: Monitoring vital signs, disease outbreaks (e.g., COVID-19).
Operations: Predictive maintenance using sensor data.
Web & Tech: Traffic forecasting, anomaly detection in servers or apps.
These applications rely on modeling time-dependent behavior to guide decisions and improve outcomes.
44
(Time series): How is time series data different from other types of data?
Time series data differs from other data types mainly because it is ordered in time, and each observation depends on when it was recorded. Unlike cross-sectional or panel data: Time series has temporal dependence (current values are influenced by past ones). It often includes trends, seasonality, and autocorrelation, which violate the assumption of independence used in standard models. Specialized methods (like ARIMA or exponential smoothing) are required to handle its unique structure.
45
(Time series): What are some of the challenges faced while analyzing time series data?
Common challenges in analyzing time series data include:
Non-stationarity: Changing mean or variance complicates modeling and violates assumptions.
Seasonality and trends: Must be detected and removed for accurate forecasts.
Autocorrelation: Makes standard statistical methods invalid without adjustments.
Missing or irregular data: Time gaps disrupt analysis.
Noise and outliers: Can obscure patterns or distort models.
Overfitting: Especially in complex models that try to mimic noise rather than signal.
46
(Time series): What tools and techniques can be used to visualize time series data effectively?
To effectively visualize time series data, use these tools and techniques:
Line plots: Show overall trends and fluctuations over time.
Seasonal plots: Compare patterns across periods (e.g., months or years).
Lag plots: Reveal autocorrelation by plotting values against lagged versions.
ACF/PACF plots: Diagnose correlation structure at different lags.
Decomposition plots: Display trend, seasonal, and residual components.
Tools like R (ggplot2, forecast), Python (matplotlib, seaborn, plotly, statsmodels), and Tableau or Power BI support rich, interactive time series visualization.
47
(Machine Learning): What are the real-world applications of supervised learning?
Supervised learning is widely used in real-world scenarios where labeled data is available. Key applications include:
Fraud detection: Classify transactions as fraudulent or legitimate based on past examples.
Spam filtering: Predict whether emails are spam using labeled email data.
Customer churn prediction: Identify customers likely to leave a service (e.g., telecom, banking).
Loan approval: Predict creditworthiness based on financial history.
Medical diagnosis: Classify patients by disease type using clinical data.
Recommendation systems: Predict user preferences (e.g., Netflix, Amazon).
48
(Machine Learning): What is supervised learning in machine learning and how does it differ from unsupervised learning?
Supervised learning is a machine learning approach where the model is trained on a dataset containing input-output pairs—each input has a known label or outcome. The goal is to learn a function that maps inputs to correct outputs. In contrast, unsupervised learning works with unlabeled data, aiming to find hidden patterns, structures, or groupings (e.g., clustering or dimensionality reduction) without predefined outcomes. Key difference: Supervised learning answers "What will happen?" (e.g., prediction/classification), Unsupervised learning explores "What patterns exist?" in the data.
49
(Machine Learning): Can you explain the concept of a target or dependent variable in supervised learning?
In supervised learning, the target variable (also called the dependent variable or label) is the outcome the model aims to predict. It represents what you are trying to learn from the data. For example: In a classification task, the target might be categories like “spam” or “not spam.” In a regression task, it could be a continuous value like house price or temperature. The model learns the relationship between the input features (independent variables) and this target during training.
50
(Machine Learning): What is the difference between regression and classification in supervised learning?
The key difference lies in the type of target variable:
Regression predicts a continuous outcome, such as house prices, temperature, or sales volume. Example: predicting income based on education and age.
Classification predicts a categorical outcome, like “yes/no,” “spam/ham,” or “disease type.” Example: predicting if a customer will churn (yes or no).
Both are supervised learning tasks, but use different evaluation metrics and models suited to their outcome types.
51
(Machine Learning): How do you handle categorical variables in supervised learning models?
Categorical variables are handled by encoding them into numerical format, since most models require numeric inputs. Common methods include:
One-hot encoding: Creates a binary column for each category (e.g., "red", "blue" → [1, 0], [0, 1]).
Label encoding: Assigns a unique integer to each category (e.g., "red" = 1, "blue" = 2).
Target or frequency encoding: Replaces categories with the mean of the target or frequency count (mainly in tree-based models).
Choice of encoding depends on the model—e.g., linear models prefer one-hot, while decision trees can handle label encoding well.
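A base-R sketch contrasting one-hot and label encoding on a small hypothetical factor:

# Hypothetical categorical feature
df <- data.frame(color = factor(c("red", "blue", "green", "blue")))

# One-hot encoding: one binary column per category (the -1 drops the intercept)
one_hot <- model.matrix(~ color - 1, data = df)
one_hot

# Label encoding: each category mapped to an integer code
df$color_label <- as.integer(df$color)
df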
52
(Machine Learning): How do you choose an appropriate algorithm for a supervised learning task?
Choosing the right algorithm for a supervised learning task depends on several key factors:
Type of task: Use regression models (e.g., linear regression) for continuous targets; use classification models (e.g., logistic regression, decision trees) for categorical outcomes.
Size and quality of data: For small datasets, prefer simpler models (e.g., logistic regression); for large, complex data, consider tree-based models, SVMs, or neural networks.
Interpretability vs. performance: For transparency, use logistic regression or decision trees; for accuracy, use ensemble methods (e.g., random forests, boosting).
Data characteristics: Linear models work well if relationships are linear; SVMs and regularized models perform better in high dimensions.
Computational cost: KNN is expensive at prediction time; tree-based models are faster once trained.
Always complement algorithm choice with cross-validation and performance metrics (accuracy, AUC, F1-score, etc.).
53
(Machine Learning): Can you describe the process of training a model in supervised learning?
The process of training a model in supervised learning typically involves these steps:
Data collection: Gather labeled data with input features and known target values.
Data preprocessing: Clean the data, handle missing values, encode categorical variables, and possibly normalize or scale features.
Train-test split: Divide the dataset into training and test (or validation) sets to evaluate performance on unseen data.
Model selection: Choose an appropriate algorithm based on the problem (e.g., regression or classification).
Training: Use the training set to fit the model—i.e., adjust parameters to minimize error between predicted and actual target values.
Evaluation: Assess the model on the test set using metrics like accuracy, precision, recall, or RMSE, depending on the task.
Tuning: Optimize hyperparameters using techniques like grid search or cross-validation.
Final model: Retrain on the full dataset (if appropriate) and deploy the model for prediction.
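A compact base-R sketch of the split, fit, and evaluate cycle, using the built-in mtcars data with transmission type (am) as the target:

set.seed(42)
data(mtcars)

# Train-test split (roughly 70/30)
idx <- sample(seq_len(nrow(mtcars)), size = floor(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# Train a logistic regression: predict transmission type from weight and horsepower
fit <- glm(am ~ wt + hp, data = train, family = binomial)

# Evaluate on the held-out test set
probs <- predict(fit, newdata = test, type = "response")
pred  <- ifelse(probs > 0.5, 1, 0)
mean(pred == test$am)   # test-set accuracy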
54
(Machine Learning): What considerations do we need to pay attention to when preparing data for supervised learning?
Preparing data for supervised learning is critical for model performance. Key considerations include: Missing values: Handle them appropriately—drop, impute (mean/median/mode), or use model-specific strategies. Feature encoding: Convert categorical variables into numeric form using techniques like one-hot or label encoding. Scaling: Normalize or standardize features if required (important for distance-based models like KNN or SVM). Class imbalance: If target classes are skewed, use resampling (oversampling/undersampling) or adjust model evaluation metrics. Data leakage: Ensure no future information or target-related variables are used during training. Train-test split: Separate data into training and test sets before any transformations to avoid bias. Feature engineering: Create or select informative variables that improve model learning and interpretability.
55
(Machine Learning): What is cross-validation and why is it important in supervised learning?
Cross-validation is a technique for assessing how well a supervised learning model generalizes to unseen data. It involves splitting the training data into multiple folds (typically 5 or 10), training the model on some folds and validating it on the remaining ones, then averaging the results. It is important because: It provides a more reliable estimate of model performance than a single train-test split. It helps in hyperparameter tuning and model selection. It reduces the risk of overfitting, especially with limited data.
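A sketch of 5-fold cross-validation, assuming the caret package (with rpart installed for the tree model):

library(caret)
data(iris)

# 5-fold cross-validation: each fold serves once as the validation set
ctrl <- trainControl(method = "cv", number = 5)

# Cross-validated performance estimate for a decision tree classifier
fit <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)
fit  # accuracy averaged over the 5 folds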
56
(Machine Learning): How do you handle overfitting and underfitting in supervised learning models?
To handle overfitting and underfitting, you need to balance model complexity and generalization:
Overfitting (model too complex):
Use simpler models (e.g., fewer features or shallower trees).
Regularization: Apply L1 (lasso) or L2 (ridge) penalties.
Cross-validation: Tune hyperparameters to avoid over-tuning.
Prune decision trees or stop training early in neural nets.
Reduce variance with ensemble methods (e.g., bagging, random forest).
Underfitting (model too simple):
Use more complex models that can capture non-linear relationships.
Add relevant features or use feature engineering.
Decrease regularization if it's too strong.
Ensure sufficient training time or a better model architecture.
Monitoring both training and validation error helps diagnose and address these issues.
57
(Machine Learning): What is the concept of bias-variance tradeoff in supervised learning?
The bias-variance tradeoff explains the balance between two sources of error that affect model performance:
Bias: Error from overly simplistic models that cannot capture data complexity (e.g., a linear model for nonlinear data). Leads to underfitting.
Variance: Error from overly complex models that fit the training data too closely, capturing noise. Leads to overfitting.
A good model minimizes both:
High bias → low flexibility, poor performance on training and test data.
High variance → great training performance, poor generalization to new data.
The goal is to find a model with optimal complexity for the best generalization performance.
58
(Machine Learning): What role does balance play in supervised machine learning models?
Balance plays a critical role in supervised machine learning in several ways:
Class balance: If one class dominates (e.g., 95% vs. 5%), models may ignore the minority class. Address this using resampling techniques (oversampling/undersampling), class weights, or synthetic data (SMOTE).
Bias-variance balance: Models must avoid both underfitting (high bias) and overfitting (high variance) by selecting the right complexity and regularization.
Feature balance: Ensure features are scaled or normalized if needed (e.g., for KNN, SVM) and that no single variable dominates the model unfairly.
Data split balance: Training, validation, and test sets must be representative of the overall distribution to prevent biased evaluation.
Maintaining balance ensures that the model learns meaningfully, generalizes well, and fairly captures all relevant patterns.
59
(Machine Learning): What are the advantages and disadvantages of using complex machine learning models, such as neural networks, in supervised learning?
Advantages of complex models (e.g., neural networks):
High predictive power: Neural networks can capture complex, non-linear relationships in large and high-dimensional datasets.
Automatic feature learning: Deep models can learn abstract features from raw inputs (e.g., images, text) without manual engineering.
Scalability: They perform well with large datasets and are adaptable to many tasks (e.g., classification, regression, time series).
Disadvantages:
Lack of interpretability: Neural networks are often "black boxes"—hard to explain to stakeholders or debug.
Risk of overfitting: High flexibility makes them prone to fitting noise, especially with small datasets.
High computational cost: Training and tuning require significant time, hardware, and expertise.
Data hunger: They typically require large amounts of labeled data to perform well.
These models are powerful but should be used when their benefits outweigh the complexity and resource demands.
60
(Machine Learning): Can you explain the difference between parametric and non-parametric machine learning models?
The key difference lies in assumptions about the model structure and flexibility:
Parametric models: Assume a fixed form (e.g., linear, logistic) and only need to estimate a finite set of parameters (e.g., coefficients in linear regression). Faster and require less data, but less flexible—may underfit complex patterns. Examples: linear regression, logistic regression, Naive Bayes.
Non-parametric models: Do not assume a fixed structure; model complexity can grow with the data. More flexible and can fit complex relationships, but may overfit and require more data. Examples: k-nearest neighbors, decision trees, random forests, neural networks (in practice).
Choose parametric models for interpretability and speed, and non-parametric models for accuracy and flexibility in rich data environments.
61
(Machine Learning): What are the pros and cons of using linear regression in supervised learning?
Pros of linear regression:
Simple and interpretable: Coefficients clearly show the impact of each feature on the outcome.
Efficient: Fast to train and easy to implement, even with large datasets.
Theoretically well-understood: Solid statistical foundation with clear assumptions.
Good baseline model: Useful for benchmarking more complex methods.
Cons of linear regression:
Assumes linearity: Performs poorly when relationships between features and target are non-linear.
Sensitive to outliers: Can be skewed by extreme values.
Requires assumptions: Normality, homoscedasticity, and independence of errors must hold.
Limited flexibility: Cannot model interactions or non-linear effects without manual feature engineering.
It's best used when interpretability is needed and the data shows a roughly linear relationship.
62
(Machine Learning): How is logistic regression used in classification problems?
Logistic regression is used to classify data into categories, typically when there are two possible outcomes like “yes” or “no.” It works by estimating the probability that an observation belongs to a particular class based on its features. If the probability is above a chosen threshold (like 0.5), it assigns the observation to one class; otherwise, it assigns it to the other. It's especially useful when you want both predictions and insight into how each input feature influences the outcome.
63
(Machine Learning): Can you explain how decision trees work in supervised learning?
A decision tree works by splitting the data into smaller and smaller groups based on the values of input features, creating a tree-like structure. At each step (or “node”), the model selects the feature that best separates the data to improve prediction accuracy. This continues until it reaches a leaf node, where a final prediction is made. Decision trees are easy to interpret, can handle both numerical and categorical data, and work well for both classification and regression tasks.
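A short sketch of fitting and inspecting a classification tree in R, assuming the rpart package:

library(rpart)
data(iris)

# Fit a classification tree predicting species from all four measurements
tree <- rpart(Species ~ ., data = iris, method = "class")

# Inspect the splits chosen at each node
print(tree)

# Predict the class for a few observations
predict(tree, newdata = iris[c(1, 51, 101), ], type = "class")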
64
(Machine Learning): How does a support vector machine (SVM) work in supervised learning?
A Support Vector Machine (SVM) works by finding the best boundary (hyperplane) that separates data points of different classes. It chooses the boundary that maximizes the margin—the distance between the hyperplane and the nearest data points from each class (called support vectors). This helps the model generalize well to new data. SVMs can also handle non-linear patterns by using kernel functions to transform data into higher dimensions where it becomes separable.
65
(Machine Learning): Can you explain the concept of ensemble methods in supervised learning?
Ensemble methods combine the predictions of multiple models to produce a more accurate and robust final result than any single model alone. There are two main types: Bagging (e.g., Random Forest): Builds many models in parallel on different data samples and averages their predictions to reduce variance. Boosting (e.g., XGBoost, AdaBoost): Builds models sequentially, where each new model focuses on correcting the errors of the previous ones, reducing bias. Ensembles are powerful because they balance out weaknesses of individual models and often improve performance.
66
(Machine Learning): What is the difference between parameters and hyperparameters?
The key difference lies in what they control and when they are set: Parameters are the internal values a model learns from the data during training—like the coefficients in linear regression or the weights in a neural network. Hyperparameters are external settings that control the training process itself, such as the learning rate, number of trees in a forest, or the number of neighbors in KNN. They are set before training and tuned using techniques like cross-validation. In short: parameters are learned, hyperparameters are configured.
67
(Machine Learning): How do you fine-tune a supervised learning model?
To fine-tune a supervised learning model, you systematically adjust its hyperparameters to improve performance. The key steps are:
Choose hyperparameters to tune: For example, learning rate, number of trees, regularization strength, or depth of a decision tree.
Select a tuning method: Grid search tests all combinations in a defined parameter grid; random search randomly samples parameter combinations; Bayesian optimization (advanced) uses past results to choose better settings.
Use cross-validation: Evaluate each setting's performance reliably by splitting the training data into folds.
Pick the best model: Based on performance metrics like accuracy, F1-score, or AUC on the validation set.
Retrain on the full training data with the best settings and test on holdout data.
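A sketch of a simple grid search with cross-validation, assuming the caret package and a k-nearest-neighbours model:

library(caret)
data(iris)

# Grid of candidate hyperparameter values (k = number of neighbours in KNN)
grid <- expand.grid(k = c(3, 5, 7, 9, 11))

# Evaluate every candidate with 5-fold cross-validation and keep the best
ctrl <- trainControl(method = "cv", number = 5)
fit <- train(Species ~ ., data = iris,
             method = "knn",
             trControl = ctrl,
             tuneGrid = grid)

fit$bestTune  # the k with the best cross-validated accuracy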
68
(Machine Learning): How do you evaluate the performance of a supervised learning model?
To evaluate a supervised learning model, you compare its predictions to the true labels using appropriate performance metrics, depending on the task:
For classification:
Accuracy: Proportion of correct predictions.
Precision & recall: Useful for imbalanced data; precision measures exactness, recall measures completeness.
F1-score: Harmonic mean of precision and recall.
Confusion matrix: Shows true vs. predicted classes.
ROC curve & AUC: Evaluate the model's ability to rank predictions.
For regression:
Mean Squared Error (MSE) or Root MSE: Penalizes large errors.
Mean Absolute Error (MAE): Measures average prediction error.
R-squared: Proportion of variance explained by the model.
Always assess on validation or test data, not the training set, to measure generalization.
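A base-R sketch computing common metrics by hand from hypothetical predictions:

# Hypothetical classification results
actual <- c(1, 0, 1, 1, 0, 0, 1, 0)
pred   <- c(1, 0, 0, 1, 0, 1, 1, 0)

# Confusion matrix and accuracy
cm <- table(Predicted = pred, Actual = actual)
cm
accuracy <- sum(diag(cm)) / sum(cm)

# Precision, recall, and F1 for the positive class (1)
precision <- cm["1", "1"] / sum(cm["1", ])
recall    <- cm["1", "1"] / sum(cm[, "1"])
f1 <- 2 * precision * recall / (precision + recall)
c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)

# Hypothetical regression results
y_true <- c(10, 12, 15, 20)
y_hat  <- c(11, 11, 16, 18)
rmse <- sqrt(mean((y_true - y_hat)^2))
mae  <- mean(abs(y_true - y_hat))
c(rmse = rmse, mae = mae)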
69
(Machine Learning): What is the difference between evaluating the performance of a regression vs. a classification supervised learning model?
The main difference lies in the type of target variable and the metrics used to assess how well the model performs:
Classification models (predict categories): The goal is to assign inputs to discrete classes (e.g., spam or not spam). Common metrics include accuracy (% of correct predictions), precision/recall/F1-score (especially important for imbalanced data), the confusion matrix (counts of true/false positives and negatives), and ROC-AUC (the model's ability to distinguish between classes).
Regression models (predict continuous values): The goal is to estimate a numerical outcome (e.g., house price). Common metrics include Mean Absolute Error (MAE, average absolute difference), Mean Squared Error (MSE, average squared difference that penalizes large errors), Root Mean Squared Error (RMSE, the square root of MSE), and R-squared (the proportion of variance explained by the model).
In short: classification evaluates how well the model predicts labels, while regression evaluates how close its numerical predictions are to actual values.
70
(Text mining): What is text mining and how is it used in data science?
Text mining is the process of extracting meaningful information from unstructured text data using techniques from natural language processing (NLP), machine learning, and statistics. In data science, it is used to analyze large volumes of text—such as reviews, social media, or documents—for tasks like sentiment analysis, topic modeling, classification, and clustering, enabling data-driven insights from textual sources.
71
(Text mining): What are some real-world applications of text mining?
Real-world applications of text mining include:
Sentiment analysis: Monitoring customer opinions in reviews or social media.
Spam detection: Classifying emails as spam or not.
Topic modeling: Identifying themes in large document collections.
Chatbots: Understanding and responding to user queries.
Fraud detection: Analyzing textual patterns in insurance claims or financial reports.
Medical records analysis: Extracting symptoms, diagnoses, and treatments from clinical notes.
72
(Text mining): What is the difference between unstructured, semi-structured, and structured text data?
Structured text data: Organized in a fixed schema (e.g., databases, spreadsheets); each element is clearly defined and easy to query.
Semi-structured text data: Has some organizational properties but not a rigid schema (e.g., XML, JSON, HTML).
Unstructured text data: Lacks a predefined format, making it harder to process (e.g., emails, articles, social media posts).
73
(Text mining): How is text mining used for information extraction and retrieval?
Information extraction (IE) uses text mining to identify and extract structured information—such as names, dates, entities, or relationships—from unstructured text. Information retrieval (IR) uses text mining to find relevant documents or texts based on user queries, typically by ranking documents according to similarity or relevance (e.g., search engines). Both rely on natural language processing (NLP) techniques to process and understand natural language.
74
(Text mining): How does text classification differ from text clustering?
Text classification is a supervised learning task where texts are assigned to predefined categories based on labeled training data (e.g., spam vs. not spam). Text clustering is an unsupervised learning task where texts are grouped into clusters based on similarity, without predefined labels (e.g., discovering topics in a set of news articles). In short: classification uses known labels; clustering finds hidden structure.
75
(Text mining): Can you explain the process of text preprocessing and why it's important in text mining?
Text preprocessing is the process of cleaning and standardizing raw text before analysis. It is crucial because raw text is noisy and inconsistent. Key steps include:
Tokenization – splitting text into words or phrases.
Lowercasing – converting all text to lowercase for uniformity.
Removing punctuation and special characters – to reduce noise.
Stopword removal – eliminating common words (e.g., "the", "and") that add little meaning.
Stemming/lemmatization – reducing words to their root form (e.g., "running" → "run").
Removing numbers or rare words – depending on context.
Preprocessing ensures that the text data is consistent, reduces dimensionality, and improves model performance.
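A base-R sketch of basic preprocessing on two hypothetical reviews (packages such as tm or tidytext offer more complete pipelines):

docs <- c("The product is GREAT!!! Loved it :)",
          "Terrible support, would not recommend...")

# Lowercasing
docs <- tolower(docs)

# Remove punctuation, digits, and extra whitespace
docs <- gsub("[[:punct:]]+", " ", docs)
docs <- gsub("[[:digit:]]+", " ", docs)
docs <- gsub("\\s+", " ", trimws(docs))

# Tokenization: split on whitespace
tokens <- strsplit(docs, " ")

# Remove a small illustrative stopword list
stopwords <- c("the", "is", "it", "a", "and")
tokens <- lapply(tokens, function(x) x[!x %in% stopwords])
tokens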
76
(Text mining): Can you explain the concept of vectorization and how it works in text mining?
Vectorization is the process of converting text into numerical format so that it can be analyzed by machine learning algorithms. In text mining, this typically involves:
Bag-of-Words (BoW) – represents text as a vector of word counts or frequencies, ignoring word order.
TF-IDF (Term Frequency–Inverse Document Frequency) – adjusts word frequency by how unique a word is across documents, giving more weight to informative words.
Word embeddings (e.g., Word2Vec, GloVe) – represent words in dense, low-dimensional vectors that capture semantic meaning and relationships.
Vectorization is essential because machine learning models require numerical input, and it allows similarity, clustering, and classification tasks on text data.
77
(Text mining): What are word embeddings and why do we care about them?
Word embeddings are dense, low-dimensional vector representations of words that capture their semantic meaning based on context. Unlike Bag-of-Words or TF-IDF (which are sparse and ignore meaning), embeddings place similar words closer together in vector space (e.g., "king" and "queen"). We care about them because they:
Preserve semantic relationships (e.g., king - man + woman ≈ queen)
Improve model performance in tasks like sentiment analysis, translation, and text classification
Enable deep learning models to understand context and word similarity
Popular methods: Word2Vec, GloVe, FastText, BERT.
78
(Text mining): Can you describe the term frequency-inverse document frequency (TF-IDF) approach and its role in text mining?
TF-IDF (Term Frequency–Inverse Document Frequency) is a statistical method used to evaluate how important a word is in a specific document relative to a collection (corpus) of documents.
Components:
Term Frequency (TF): Measures how often a term appears in a document. TF(t, d) = (count of term t in document d) / (total terms in d).
Inverse Document Frequency (IDF): Measures how unique or rare a term is across all documents. IDF(t) = log(N / DF(t)), where N is the total number of documents and DF(t) is the number of documents containing term t.
TF-IDF score: TF-IDF(t, d) = TF(t, d) × IDF(t).
Role in text mining:
Highlights informative and discriminative words.
Reduces the influence of common words (e.g., "the", "and").
Used for document classification, clustering, and search ranking.
It transforms text into numerical vectors suitable for machine learning algorithms.
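A base-R sketch computing TF-IDF by hand for a tiny hypothetical corpus, following the formulas above:

docs <- list(
  d1 = c("data", "science", "uses", "data"),
  d2 = c("machine", "learning", "uses", "data"),
  d3 = c("text", "mining", "finds", "patterns")
)

vocab <- unique(unlist(docs))
N <- length(docs)

# Term frequency: count of term t in document d divided by document length
tf <- sapply(docs, function(d) {
  counts <- table(factor(d, levels = vocab))
  as.numeric(counts) / length(d)
})
rownames(tf) <- vocab

# Document frequency and inverse document frequency
df_t <- rowSums(tf > 0)
idf <- log(N / df_t)

# TF-IDF: weight each document's term frequencies by the IDF values
tfidf <- tf * idf
round(tfidf, 3)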
79
(Text mining): How can you normalize text data, and why is it relevant?
Text normalization is the process of transforming text into a standard, consistent format to reduce variability and improve analysis. Common techniques:
Lowercasing – to avoid case-sensitive mismatches (e.g., "Apple" vs. "apple").
Removing punctuation and special characters – cleans noise from data.
Removing stopwords – eliminates frequent but uninformative words (e.g., "and", "the").
Stemming/lemmatization – reduces words to their base/root form (e.g., "running" → "run").
Removing numbers or excessive whitespace – for cleaner representation.
Why it's relevant: normalization ensures that similar text elements are treated the same way, which improves the accuracy, efficiency, and interpretability of text mining models.
80
(Text mining): What is the role of n-grams in text mining?
N-grams are contiguous sequences of n items (typically words) from a text.
Types:
Unigram: a single word (e.g., "data")
Bigram: a two-word sequence (e.g., "data science")
Trigram: a three-word sequence (e.g., "machine learning model")
Role in text mining:
Capture context and word order, which unigrams miss.
Improve performance in tasks like text classification, sentiment analysis, and language modeling.
Help detect common phrases and co-occurrence patterns.
However, higher-order n-grams increase dimensionality and sparsity, so they require careful tuning.
81
(Text mining): How can a corpus be annotated and why is it useful for text mining?
A corpus can be annotated by adding labels or metadata to text elements to enrich the data for analysis.
Types of annotation:
Part-of-speech (POS) tagging – labels each word's grammatical role (e.g., noun, verb).
Named Entity Recognition (NER) – identifies entities like names, dates, locations.
Sentiment labels – assigns emotions or polarity (e.g., positive, negative).
Topic or category labels – for supervised classification tasks.
Why it's useful:
Enables supervised learning (e.g., training classifiers).
Adds structure and meaning to unstructured text.
Improves accuracy in tasks like information extraction, search, and NLP modeling.
82
(Text mining): What are some of the challenges associated with text mining from various sources like social media, websites, etc.?
Key challenges in text mining from sources like social media and websites include:
Noise and informality – Text is often informal, includes slang, emojis, typos, and abbreviations.
Unstructured format – Data lacks consistent structure; may include hashtags, links, or HTML.
High volume and velocity – Real-time data streams can be massive and rapidly changing.
Short context – Tweets or messages are brief, making interpretation harder.
Multilingual content – Requires language detection and separate processing pipelines.
Spam and bots – Can distort analysis if not filtered.
Privacy and ethics – Legal and ethical concerns about data usage and user consent.
These issues complicate preprocessing, modeling, and interpretation.
83
(Text mining): What is sentiment analysis and how is it useful in text mining?
Sentiment analysis is the process of identifying and categorizing the emotional tone or opinion expressed in text—typically as positive, negative, or neutral.
How it works:
Uses lexicons, machine learning, or deep learning to detect sentiment in words, phrases, or full texts.
Can be applied at the document, sentence, or aspect level (e.g., “battery life” in a product review).
Usefulness in text mining:
Tracks public opinion (e.g., brand perception, political sentiment).
Analyzes customer feedback to guide product improvements.
Monitors social media for crisis detection or campaign impact.
Supports automated decision-making in marketing, finance, and customer service.
84
(Text mining): Can you explain the concept of topic modeling and its application in text mining?
Topic modeling is an unsupervised machine learning technique used to discover hidden themes or topics in a collection of texts.
How it works:
Algorithms like Latent Dirichlet Allocation (LDA) assume each document is a mixture of topics, and each topic is a distribution of words.
It identifies groups of words that frequently occur together and assigns topic probabilities to each document.
Applications in text mining:
Summarizing large text corpora (e.g., news articles, academic papers).
Organizing content (e.g., clustering customer reviews by topic).
Recommending content based on dominant topics.
Discovering trends over time in social media or public discourse.
It enables efficient navigation and analysis of unstructured text data.