Exam questions Flashcards

Concepts described by the teacher (84 cards)

1
Q

(L1): What distinguishes data science from traditional analytics or business intelligence?

A

Data science goes beyond reporting and dashboards; it focuses on extracting actionable insights from complex and often unstructured data using programming, statistical modeling, and machine learning. Unlike traditional business intelligence, which is retrospective and descriptive, data science is predictive, exploratory, and iterative.

2
Q

(L1): How does the data science workflow support data-driven decision making?

A

The workflow—typically based on models like CRISP-DM—guides the process from understanding a business problem to collecting, preparing, analyzing, and deploying data solutions. It ensures that insights are not only technically correct but also aligned with business objectives.

3
Q

(L1): What are the main types of data analytics (descriptive, diagnostic, predictive, prescriptive) and how are they applied in business?

A

Descriptive: What happened? (e.g., sales reports)

Diagnostic: Why did it happen? (e.g., churn analysis)

Predictive: What will happen? (e.g., demand forecasting)

Prescriptive: What should we do? (e.g., route optimization)

Each type supports decision-making at different stages, from insight generation to strategic planning.

4
Q

(L1): Why is programming an essential skill for modern data scientists, even in business contexts?

A

Programming enables data scientists to automate tasks, clean and manipulate large datasets, develop models, and customize analyses. It bridges the gap between raw data and strategic insights, empowering them to build scalable, repeatable solutions.

5
Q

(L1): What is R, and why is it widely used in data science?

A

R is a statistical programming language designed for data analysis, visualization, and modeling. It is popular because of its rich package ecosystem, strong community support, and strengths in exploratory and statistical work. It’s especially favored in academia and applied research.

6
Q

(L1): How does understanding programming improve your ability to collaborate across business and technical teams?

A

Programming literacy helps data scientists translate business questions into data problems, communicate with developers, and explain technical results to non-technical stakeholders. This ensures that analytical solutions are strategically relevant and implementable.

7
Q

(L2): What does it mean to say that all data is “socially constructed”?

A

To say that all data is “socially constructed” means that data does not exist independently as pure, objective facts. Instead, it is:
Collected, categorized, and defined by people based on specific goals, contexts, and assumptions.

Shaped by choices about what to measure, how to measure it, and for what purpose.

Embedded with values, biases, and power structures, often reflecting the interests of those who design the systems.

As emphasized in Rosenberg (2013), “raw data” is an oxymoron—data is never neutral; it is always filtered through human and institutional decisions.

8
Q

(L2): Why is it important to question the objectivity of data sources in business contexts?

A

Questioning the objectivity of data sources in business contexts is crucial because:
Data reflects design choices—what to collect, how, and from whom—which may introduce bias or omissions.

Business decisions based on biased data can lead to unfair outcomes (e.g., discriminatory models), misallocation of resources, or flawed strategies.

Contextual understanding is needed to avoid over-trusting data that appears neutral but is influenced by historical, cultural, or organizational factors.

It ensures ethical, accurate, and responsible use of data for modeling, forecasting, and decision-making.

In short: uncritical use of data risks turning flawed input into flawed conclusions.

9
Q

(L2): How can data collection methods introduce bias into an analysis?

A

Data collection methods can introduce bias into an analysis through:
Sampling bias – when the data does not represent the target population (e.g., only collecting data from active users).

Measurement bias – when the tools or definitions used to collect data skew results (e.g., vague survey questions, poorly calibrated sensors).

Exclusion bias – when important groups or variables are left out (e.g., ignoring non-digital consumers in online studies).

Observer or recording bias – when human judgment affects what is recorded or how (e.g., manual categorization or tagging).

Platform or algorithmic bias – when digital systems (e.g., search engines, social media) shape what data gets collected in the first place.

These biases distort findings, reduce generalizability, and can lead to misleading or harmful conclusions.

10
Q

(L2): What are some practical ways to document data provenance and transformation in a project?

A

Practical ways to document data provenance (origin) and transformation (changes) include:
Data dictionaries – Describe each variable: name, type, source, meaning, and units.
Metadata files – Record dataset origin, collection date, method, and context.
Version control (e.g., Git) – Track changes in datasets, scripts, and models over time.
Code-based workflows – Use reproducible scripts (e.g., R scripts, Jupyter notebooks) to log each data cleaning and transformation step.
CRISP-DM documentation – Follow structured steps to log business understanding, data preparation, modeling, and evaluation.
Data lineage diagrams – Visualize how raw data moves through transformations to final outputs.

These practices support transparency, reproducibility, and auditability.

11
Q

(L2): What ethical considerations should be made before collecting or using data for analysis?

A

Before collecting or using data for analysis, key ethical considerations include:
Consent – Was data collected with informed, voluntary consent?
Privacy – Does the analysis protect personal or sensitive information? Are anonymization and data minimization applied?
Purpose limitation – Is the data used strictly for its intended, declared purpose?
Bias and fairness – Could the data or analysis lead to discrimination or reinforce social inequalities?
Transparency – Are data sources, assumptions, and limitations clearly documented?
Accountability – Who is responsible for the outcomes of the analysis, especially if automated decisions are involved?
These issues are central to responsible data science and align with principles discussed in Lecture 9 (Ethics).

12
Q

(L2): What does the phrase “no such thing as raw data” mean in the context of data science?

A

The phrase “no such thing as raw data” means that data is never neutral, pure, or untouched—it is always the result of human decisions about what to observe, how to measure it, and why it matters.
In data science, this highlights that:
Data is constructed, not discovered.
It reflects assumptions, context, and bias from its collection and processing.
Analysts must treat data as rhetorical and interpretive, not as unquestionable fact.

This concept, emphasized by Rosenberg (2013), challenges the myth of objective data and calls for critical engagement with how data is created and used.

13
Q

(L2): How do social, technical, and political choices influence how data is collected and interpreted?

A

Social, technical, and political choices shape both what data is collected and how it is interpreted, in the following ways:
Social: Cultural norms and societal values influence what is deemed important to measure (e.g., gender categories, health metrics).
Technical: The tools and systems used (e.g., sensors, platforms, algorithms) define what can be captured and how accurately.
Political: Policies, funding, and power dynamics determine data priorities, access, and framing (e.g., census questions, surveillance practices).

These choices embed bias, exclusions, and power structures into data, affecting both the analysis and the decisions based on it. Data is never neutral—it reflects the worldviews of those who design its collection and use.

14
Q

(L2): Why is it important to document data provenance, metadata, and transformations?

A

Documenting data provenance, metadata, and transformations is essential because it ensures:
Transparency – Others can understand where the data came from and how it was processed.
Reproducibility – Analyses can be repeated and validated by others using the same steps.
Accountability – It’s clear who made which decisions, reducing errors and ethical risks.
Contextual understanding – Metadata provides meaning, helping analysts interpret data correctly.
Data quality control – Tracks issues like missing values or inconsistencies introduced during cleaning.
Without documentation, analysis becomes opaque, unreliable, and potentially misleading.

15
Q

(L2): What risks arise when we treat data as objective or neutral?

A

Treating data as objective or neutral introduces serious risks:
Bias reinforcement – Hidden biases in data can be mistaken for truths, leading to discriminatory models or decisions.
False legitimacy – Flawed conclusions gain credibility because “the data says so.”
Ethical blind spots – Ignoring the social context of data may result in privacy violations or harm to vulnerable groups.
Oversimplification – Complex social issues may be reduced to misleading numbers or categories.
Uncritical automation – Models trained on biased or incomplete data can make flawed decisions at scale.
In short: assuming neutrality masks the human choices behind data and undermines responsible analysis.

16
Q

(L2): How can ethical considerations be incorporated at the data collection stage?

A

Ethical considerations can be incorporated at the data collection stage by:
Obtaining informed consent – Ensure participants know what data is collected, why, and how it will be used.
Minimizing data – Collect only what is necessary to reduce privacy risks.
Ensuring anonymity – Remove or mask personal identifiers where possible.
Being inclusive and fair – Design sampling methods to represent diverse groups and avoid exclusion.
Clarifying purpose and ownership – Be transparent about who owns the data and for what purposes it will be used.
Following legal and ethical standards – Comply with data protection laws (e.g., GDPR) and institutional ethics guidelines.
These steps promote responsible, trustworthy data practices from the outset.

17
Q

(L3/4): What is the CRISP-DM model, and how does it structure a data science project?

A

CRISP-DM (Cross-Industry Standard Process for Data Mining) structures a data science project into six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. It provides a flexible, iterative framework to align technical work with business needs.

18
Q

(L3/4): How does business understanding translate into an analytical objective in CRISP-DM?

A

The business understanding phase involves clarifying goals, constraints, and success criteria. These are then translated into specific analytical tasks, like predicting customer churn or segmenting users, forming the basis for model development.

19
Q

(L3/4): Why is iteration essential in the data science solution framework?

A

Iteration allows for refinement as new insights emerge during data exploration or modeling. It ensures models remain relevant, reliable, and aligned with business goals, especially when assumptions or data quality issues are uncovered later in the process.

20
Q

(L3/4): What are the differences between analytical and operational deployment of models?

A

Analytical deployment refers to using the model for decision support (e.g., dashboards, ad hoc analysis).
Operational deployment means embedding the model into automated systems for real-time or repeated use (e.g., credit scoring in loan applications).

21
Q

(L3/4): How can you use proxy variables or composite measures in data preparation?

A

Proxy variables are substitutes when direct measures are unavailable (e.g., using zip code as a proxy for income). Composite measures combine multiple indicators into one (e.g., customer engagement index), often improving interpretability or predictive power.
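A small base-R sketch of building a composite measure from standardized indicators, using a hypothetical customers data frame (the column names are made up for the example):

# Hypothetical customer data (illustrative only)
customers <- data.frame(
  logins = c(12, 3, 25, 7),
  purchases = c(4, 0, 9, 2),
  support_tickets = c(1, 5, 0, 2)
)

# Standardize each indicator so they are on a comparable scale
scaled <- scale(customers[, c("logins", "purchases", "support_tickets")])

# Composite engagement index: average of the standardized indicators
# (support tickets enter negatively, treating many tickets as friction)
customers$engagement_index <- rowMeans(
  cbind(scaled[, "logins"], scaled[, "purchases"], -scaled[, "support_tickets"])
)

customers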

22
Q

(L3/4): Why is model evaluation tied to business success criteria, not just statistical accuracy?

A

A model can be statistically strong yet useless in practice if it doesn’t improve business outcomes. Evaluation should consider metrics like ROI, user adoption, or operational feasibility, not just accuracy or precision.

23
Q

(L3/4): How does maintaining a reproducible script structure (e.g., load packages → load data → analysis) improve project quality?

A

A clear structure improves readability, reusability, and makes debugging easier. It also supports replication, which is essential for validating results and collaborating with others on shared codebases.
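A minimal sketch of that structure in R, assuming the dplyr and ggplot2 packages and a hypothetical data/sales.csv file with month and revenue columns:

# --- 1. Load packages -------------------------------------------------
library(dplyr)    # data manipulation
library(ggplot2)  # visualization

# --- 2. Load data -----------------------------------------------------
# Hypothetical file; adjust the path to your own project
sales <- read.csv("data/sales.csv")

# --- 3. Analysis ------------------------------------------------------
monthly <- sales %>%
  group_by(month) %>%
  summarise(total_revenue = sum(revenue, na.rm = TRUE))

ggplot(monthly, aes(x = month, y = total_revenue)) +
  geom_col()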

24
Q

(L3/4): What is the Data Science Solution Framework (DSSF), and why is it useful?

A

The DSSF is a structured approach to solving business problems with data. It ensures alignment between analytical methods and business needs, guiding teams from problem definition to solution evaluation in a repeatable and goal-oriented way.

25
(L3/4): How do you move from a business problem to a data science research question?
You translate a broad business issue (e.g., declining retention) into a specific, measurable question (e.g., can we predict customer churn?). This involves identifying target variables, key features, and outcome definitions.
26
(L3/4): What distinguishes descriptive, diagnostic, predictive, and prescriptive analytics?
Descriptive: What happened? (Summarizes past data)
Diagnostic: Why did it happen? (Explains causes)
Predictive: What will happen? (Forecasts outcomes)
Prescriptive: What should be done? (Suggests actions)
Each type builds on the previous to increase decision-making power.
27
(L3/4): Why is it important to match the solution type to the decision context and data availability?
Different problems require different levels of insight. Using a predictive model for a problem without labeled data, for example, would be inappropriate. Aligning solution type ensures relevance, efficiency, and practical value.
28
(L3/4): What criteria should be used to evaluate whether a data science solution is “good”?
A good solution is: Accurate (technically sound), Actionable (supports decisions), Ethical (avoids harm or bias), Interpretable, and Feasible to implement and maintain in the business environment.
29
(L3/4): How can unintended consequences or ethical risks arise from poorly framed solutions?
If the problem or data is biased, models can reinforce inequality, misclassify vulnerable groups, or be misused. Lack of transparency or oversight can also lead to unethical outcomes or loss of trust.
30
(L3/4): What role does iteration play in building effective data science solutions?
Iteration allows continuous learning and adjustment to new findings, making solutions more accurate, robust, and aligned with evolving business needs. It’s key to navigating uncertainty and improving model relevance over time.
31
(Time series): What is time series analysis and why is it important in data science?
Time series analysis involves examining data points collected or recorded at successive points in time to identify patterns such as trends, seasonality, and cycles. Unlike cross-sectional data, time series data exhibit temporal dependence, meaning past values influence future ones. It's important in data science because many real-world phenomena—stock prices, weather, sales, or sensor readings—unfold over time. Understanding these temporal dynamics enables better forecasting, anomaly detection, and decision-making under uncertainty.
32
(Time series): Can you explain the concept of stationarity in time series analysis?
Stationarity refers to a time series whose statistical properties—mean, variance, and autocovariance—do not change over time. It is crucial because most forecasting models (e.g., ARMA, ARIMA) assume stationarity to produce valid and reliable predictions. A stationary series is predictable in a probabilistic sense, while a non-stationary series may exhibit trends or changing variance that can mislead model estimates.
33
(Time series): How does autocorrelation impact time series data, and how can it be identified and addressed?
Autocorrelation measures how current values in a time series relate to past values (lags). High autocorrelation implies dependency over time, which violates the assumption of independent observations in many models. It can be identified using the Autocorrelation Function (ACF) plot, where persistent spikes suggest autocorrelation. To address it, we can apply differencing, fit ARIMA/ARMA models, or use transformations to remove patterns and stabilize variance.
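A base-R sketch of spotting autocorrelation with an ACF plot and reducing it by differencing, using the built-in AirPassengers series:

# Built-in monthly airline passenger series (1949-1960)
y <- AirPassengers

# ACF plot: slowly decaying spikes indicate strong autocorrelation / trend
acf(y, main = "ACF of original series")

# First difference removes the trend; a seasonal difference (lag 12) removes seasonality
y_diff <- diff(diff(y), lag = 12)

# The ACF of the differenced series decays much faster
acf(y_diff, main = "ACF after differencing")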
34
(Time series): What are the key components of a time series?
The key components of a time series are:
Trend (Tₜ): A long-term upward or downward movement in the data.
Seasonality (Sₜ): Regular, periodic fluctuations (e.g., yearly, monthly).
Cyclicality: Irregular, longer-term oscillations not tied to a fixed calendar.
Irregular/Noise (εₜ): Random variation not explained by the above components.
These are often combined in additive (yₜ = Tₜ + Sₜ + εₜ) or multiplicative (yₜ = Tₜ × Sₜ × εₜ) models.
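A base-R sketch of separating these components with classical decomposition and STL, using the built-in AirPassengers series:

y <- AirPassengers

# Classical decomposition into trend, seasonal, and random components
# (multiplicative, since the seasonal swings grow with the level)
parts <- decompose(y, type = "multiplicative")
plot(parts)

# STL is a more flexible alternative (additive, so log-transform first)
fit_stl <- stl(log(y), s.window = "periodic")
plot(fit_stl)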
35
(Time series): What is the difference between time series analysis and cross-sectional analysis?
Time series analysis examines data points collected over time from a single subject or entity, focusing on temporal dependencies and patterns (e.g., trend, seasonality). Cross-sectional analysis, on the other hand, analyzes data collected at a single point in time across multiple entities (e.g., individuals, regions), assuming independence between observations. The key difference lies in temporal structure: time series accounts for time-order and autocorrelation, while cross-sectional does not.
36
(Time series): What are some common methods for forecasting in time series analysis?
Common forecasting methods in time series analysis include:
Naïve and Moving Average: Simple baselines using past values or smoothed averages.
Exponential Smoothing (e.g., Holt-Winters): Weighs recent observations more heavily.
ARIMA (AutoRegressive Integrated Moving Average): Combines autoregression, differencing, and moving average components.
Seasonal ARIMA (SARIMA): Extends ARIMA to model seasonality.
Machine Learning & Deep Learning: Methods like LSTM or neural networks for complex, non-linear patterns.
Each method suits different patterns like trend, seasonality, and noise.
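A sketch comparing a naïve baseline, exponential smoothing, and an automatically selected seasonal ARIMA, assuming the forecast package is installed:

library(forecast)

y <- AirPassengers

# Naive forecast: repeat the last observed value for the next 12 months
fc_naive <- naive(y, h = 12)

# Exponential smoothing (Holt-Winters family, model chosen automatically)
fc_ets <- forecast(ets(y), h = 12)

# Seasonal ARIMA with automatic order selection
fc_arima <- forecast(auto.arima(y), h = 12)

# Compare in-sample accuracy (RMSE, MAE, etc.)
accuracy(fc_naive)
accuracy(fc_ets)
accuracy(fc_arima)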
37
(Time series): Can you explain the difference between AR, MA, and ARIMA models?
AR (Autoregressive) models predict future values based on past values: the current value depends linearly on its previous observations.
MA (Moving Average) models predict current values based on past errors (shocks or residuals), smoothing out noise using past forecast errors.
ARIMA (AutoRegressive Integrated Moving Average) combines both AR and MA and adds differencing to handle non-stationary data by removing trends or seasonality before modeling.
Together, ARIMA captures memory (AR), noise structure (MA), and non-stationary patterns.
38
(Time series): How do you handle seasonality in time series data?
Seasonality in time series can be handled using several methods:
Decomposition: Break the series into trend, seasonal, and residual components (additive or multiplicative).
Differencing: Apply seasonal differencing (e.g., subtract the value from the same month last year) to stabilize seasonal patterns.
Modeling: Use models like Holt-Winters or SARIMA, which explicitly include seasonal components.
Transformation: Apply log or Box-Cox transforms when seasonal variance increases over time.
These techniques help isolate or remove seasonal effects for clearer modeling.
39
(Time series): What is a lag variable and how is it used in time series analysis?
A lag variable is simply a previous value in a time series—like using yesterday’s temperature to help predict today’s. In time series analysis, lag variables are used because past values often influence future ones. For example, in sales data, what happened last month might affect this month’s outcome. Models use these lags to recognize and learn from such patterns.
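A small base-R sketch of creating a lag variable and using it in a simple model, with hypothetical monthly sales numbers:

# Hypothetical monthly sales figures
sales <- c(100, 120, 130, 125, 150, 160)

# Lag by one period: the previous month's value aligned with the current month
sales_lag1 <- c(NA, head(sales, -1))

df <- data.frame(
  month = 1:6,
  sales = sales,
  sales_lag1 = sales_lag1
)

# A simple autoregressive-style model: this month's sales explained by last month's
fit <- lm(sales ~ sales_lag1, data = df)
summary(fit)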
40
(Time series): How can you test for stationarity in a time series data set?
You can test for stationarity using both visual and statistical methods:
Visual inspection: Plot the data and look for constant mean and variance over time.
ACF plot: Persistent autocorrelations suggest non-stationarity.
Statistical tests:
ADF test (Augmented Dickey-Fuller): Checks for a unit root; a low p-value suggests stationarity.
KPSS test: Tests the null hypothesis of stationarity.
PP test (Phillips-Perron): Similar to ADF, but adjusts for serial correlation and heteroscedasticity.
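A sketch of running the ADF and KPSS tests in R, assuming the tseries package:

library(tseries)

y <- AirPassengers

# ADF: null hypothesis = unit root (non-stationary); small p-value -> evidence of stationarity
adf.test(y)

# KPSS: null hypothesis = stationarity; small p-value -> evidence of non-stationarity
kpss.test(y)

# Re-test after differencing to see the effect of removing trend and seasonality
y_diff <- diff(diff(y), lag = 12)
adf.test(y_diff)
kpss.test(y_diff)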
41
(Time series): How are trends detected and handled in time series analysis?
Trends are detected by observing a consistent increase or decrease over time in a time series plot or through techniques like moving averages and regression lines. To handle trends:
Apply differencing (subtract each value from its previous one) to remove the trend.
Use detrending techniques like subtracting a fitted line or model.
Choose models (like ARIMA) that include a trend component.
These steps help make the series stationary, which is essential for many forecasting methods.
42
(Time series): How can you use time series decomposition to better understand a data set?
Time series decomposition separates a series into its trend, seasonal, and residual (noise) components. This helps you:
Identify patterns: Understand long-term movements and seasonal cycles.
Improve modeling: Handle each component individually (e.g., remove seasonality before forecasting).
Detect anomalies: Spot irregular spikes or drops in the residual component.
By breaking the series down, you gain clearer insights into what drives changes over time.
43
(Time series): What are the applications of time series analysis in real-world data science scenarios?
Time series analysis is widely applied in real-world data science for:
Finance: Stock price prediction, volatility modeling.
Retail & Marketing: Forecasting sales or customer demand over time.
Economics: Analyzing GDP, inflation, or unemployment trends.
Healthcare: Monitoring vital signs, disease outbreaks (e.g., COVID-19).
Operations: Predictive maintenance using sensor data.
Web & Tech: Traffic forecasting, anomaly detection in servers or apps.
These applications rely on modeling time-dependent behavior to guide decisions and improve outcomes.
44
(Time series): How is time series data different from other types of data?
Time series data differs from other data types mainly because it is ordered in time, and each observation depends on when it was recorded. Unlike cross-sectional or panel data: Time series has temporal dependence (current values are influenced by past ones). It often includes trends, seasonality, and autocorrelation, which violate the assumption of independence used in standard models. Specialized methods (like ARIMA or exponential smoothing) are required to handle its unique structure.
45
(Time series): What are some of the challenges faced while analyzing time series data?
Common challenges in analyzing time series data include:
Non-stationarity: Changing mean or variance complicates modeling and violates assumptions.
Seasonality and trends: Must be detected and removed for accurate forecasts.
Autocorrelation: Makes standard statistical methods invalid without adjustments.
Missing or irregular data: Time gaps disrupt analysis.
Noise and outliers: Can obscure patterns or distort models.
Overfitting: Especially in complex models that try to mimic noise rather than signal.
46
(Time series): What tools and techniques can be used to visualize time series data effectively?
To effectively visualize time series data, use these tools and techniques:
Line plots: Show overall trends and fluctuations over time.
Seasonal plots: Compare patterns across periods (e.g., months or years).
Lag plots: Reveal autocorrelation by plotting values against lagged versions.
ACF/PACF plots: Diagnose correlation structure at different lags.
Decomposition plots: Display trend, seasonal, and residual components.
Tools like R (ggplot2, forecast), Python (matplotlib, seaborn, plotly, statsmodels), and Tableau or Power BI support rich, interactive time series visualization.
47
(Machine Learning): What are the real-world applications of supervised learning?
Supervised learning is widely used in real-world scenarios where labeled data is available. Key applications include:
Fraud detection: Classify transactions as fraudulent or legitimate based on past examples.
Spam filtering: Predict whether emails are spam using labeled email data.
Customer churn prediction: Identify customers likely to leave a service (e.g., telecom, banking).
Loan approval: Predict creditworthiness based on financial history.
Medical diagnosis: Classify patients by disease type using clinical data.
Recommendation systems: Predict user preferences (e.g., Netflix, Amazon).
48
(Machine Learning): What is supervised learning in machine learning and how does it differ from unsupervised learning?
Supervised learning is a machine learning approach where the model is trained on a dataset containing input-output pairs—each input has a known label or outcome. The goal is to learn a function that maps inputs to correct outputs. In contrast, unsupervised learning works with unlabeled data, aiming to find hidden patterns, structures, or groupings (e.g., clustering or dimensionality reduction) without predefined outcomes. Key difference: Supervised learning answers "What will happen?" (e.g., prediction/classification), Unsupervised learning explores "What patterns exist?" in the data.
49
(Machine Learning): Can you explain the concept of a target or dependent variable in supervised learning?
In supervised learning, the target variable (also called the dependent variable or label) is the outcome the model aims to predict. It represents what you are trying to learn from the data. For example: In a classification task, the target might be categories like “spam” or “not spam.” In a regression task, it could be a continuous value like house price or temperature. The model learns the relationship between the input features (independent variables) and this target during training.
50
(Machine Learning): What is the difference between regression and classification in supervised learning?
The key difference lies in the type of target variable:
Regression predicts a continuous outcome, such as house prices, temperature, or sales volume. Example: predicting income based on education and age.
Classification predicts a categorical outcome, like “yes/no,” “spam/ham,” or “disease type.” Example: predicting if a customer will churn (yes or no).
Both are supervised learning tasks, but use different evaluation metrics and models suited to their outcome types.
51
(Machine Learning): How do you handle categorical variables in supervised learning models?
Categorical variables are handled by encoding them into numerical format, since most models require numeric inputs. Common methods include:
One-hot encoding: Creates a binary column for each category (e.g., "red", "blue" → [1, 0], [0, 1]).
Label encoding: Assigns a unique integer to each category (e.g., "red" = 1, "blue" = 2).
Target or frequency encoding: Replaces categories with the mean of the target or frequency count (mainly in tree-based models).
Choice of encoding depends on the model—e.g., linear models prefer one-hot, while decision trees can handle label encoding well.
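A base-R sketch contrasting one-hot and label encoding on a small hypothetical factor:

# Hypothetical categorical feature
df <- data.frame(color = factor(c("red", "blue", "green", "blue")))

# One-hot encoding: one binary column per category (the -1 drops the intercept)
one_hot <- model.matrix(~ color - 1, data = df)
one_hot

# Label encoding: each category mapped to an integer code
df$color_label <- as.integer(df$color)
df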
52
(Machine Learning): How do you choose an appropriate algorithm for a supervised learning task?
Choosing the right algorithm for a supervised learning task depends on several key factors:
Type of task: Use regression models (e.g., linear regression) for continuous targets; use classification models (e.g., logistic regression, decision trees) for categorical outcomes.
Size and quality of data: For small datasets, prefer simpler models (e.g., logistic regression); for large, complex data, consider tree-based models, SVMs, or neural networks.
Interpretability vs. performance: For transparency, use logistic regression or decision trees; for accuracy, use ensemble methods (e.g., random forests, boosting).
Data characteristics: Linear models work well if relationships are linear; SVMs and regularized models perform better in high dimensions.
Computational cost: KNN is expensive at prediction time; tree-based models are faster once trained.
Always complement algorithm choice with cross-validation and performance metrics (accuracy, AUC, F1-score, etc.).
53
(Machine Learning): Can you describe the process of training a model in supervised learning?
The process of training a model in supervised learning typically involves these steps:
Data collection: Gather labeled data with input features and known target values.
Data preprocessing: Clean the data, handle missing values, encode categorical variables, and possibly normalize or scale features.
Train-test split: Divide the dataset into training and test (or validation) sets to evaluate performance on unseen data.
Model selection: Choose an appropriate algorithm based on the problem (e.g., regression or classification).
Training: Use the training set to fit the model—i.e., adjust parameters to minimize error between predicted and actual target values.
Evaluation: Assess the model on the test set using metrics like accuracy, precision, recall, or RMSE, depending on the task.
Tuning: Optimize hyperparameters using techniques like grid search or cross-validation.
Final model: Retrain on the full dataset (if appropriate) and deploy the model for prediction.
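A compact base-R sketch of the split, fit, and evaluate cycle, using the built-in mtcars data with transmission type (am) as the target:

set.seed(42)
data(mtcars)

# Train-test split (roughly 70/30)
idx <- sample(seq_len(nrow(mtcars)), size = floor(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# Train a logistic regression: predict transmission type from weight and horsepower
fit <- glm(am ~ wt + hp, data = train, family = binomial)

# Evaluate on the held-out test set
probs <- predict(fit, newdata = test, type = "response")
pred  <- ifelse(probs > 0.5, 1, 0)
mean(pred == test$am)   # test-set accuracy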
54
(Machine Learning): What considerations do we need to pay attention to when preparing data for supervised learning?
Preparing data for supervised learning is critical for model performance. Key considerations include: Missing values: Handle them appropriately—drop, impute (mean/median/mode), or use model-specific strategies. Feature encoding: Convert categorical variables into numeric form using techniques like one-hot or label encoding. Scaling: Normalize or standardize features if required (important for distance-based models like KNN or SVM). Class imbalance: If target classes are skewed, use resampling (oversampling/undersampling) or adjust model evaluation metrics. Data leakage: Ensure no future information or target-related variables are used during training. Train-test split: Separate data into training and test sets before any transformations to avoid bias. Feature engineering: Create or select informative variables that improve model learning and interpretability.
55
(Machine Learning): What is cross-validation and why is it important in supervised learning?
Cross-validation is a technique for assessing how well a supervised learning model generalizes to unseen data. It involves splitting the training data into multiple folds (typically 5 or 10), training the model on some folds and validating it on the remaining ones, then averaging the results. It is important because: It provides a more reliable estimate of model performance than a single train-test split. It helps in hyperparameter tuning and model selection. It reduces the risk of overfitting, especially with limited data.
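A sketch of 5-fold cross-validation, assuming the caret package (with rpart installed for the tree model):

library(caret)
data(iris)

# 5-fold cross-validation: each fold serves once as the validation set
ctrl <- trainControl(method = "cv", number = 5)

# Cross-validated performance estimate for a decision tree classifier
fit <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)
fit  # accuracy averaged over the 5 folds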
56
(Machine Learning): How do you handle overfitting and underfitting in supervised learning models?
To handle overfitting and underfitting, you need to balance model complexity and generalization:
Overfitting (model too complex):
Use simpler models (e.g., fewer features or shallower trees).
Regularization: Apply L1 (lasso) or L2 (ridge) penalties.
Cross-validation: Tune hyperparameters to avoid over-tuning.
Prune decision trees or stop training early in neural nets.
Reduce variance with ensemble methods (e.g., bagging, random forest).
Underfitting (model too simple):
Use more complex models that can capture non-linear relationships.
Add relevant features or use feature engineering.
Decrease regularization if it's too strong.
Ensure sufficient training time or a better model architecture.
Monitoring both training and validation error helps diagnose and address these issues.
57
(Machine Learning): What is the concept of bias-variance tradeoff in supervised learning?
The bias-variance tradeoff explains the balance between two sources of error that affect model performance:
Bias: Error from overly simplistic models that cannot capture data complexity (e.g., a linear model for nonlinear data). Leads to underfitting.
Variance: Error from overly complex models that fit the training data too closely, capturing noise. Leads to overfitting.
A good model minimizes both:
High bias → low flexibility, poor performance on training and test data.
High variance → great training performance, poor generalization to new data.
The goal is to find a model with optimal complexity for the best generalization performance.
58
(Machine Learning): What role does balance play in supervised machine learning models?
Balance plays a critical role in supervised machine learning in several ways:
Class balance: If one class dominates (e.g., 95% vs. 5%), models may ignore the minority class. Address this using resampling techniques (oversampling/undersampling), class weights, or synthetic data (SMOTE).
Bias-variance balance: Models must avoid both underfitting (high bias) and overfitting (high variance) by selecting the right complexity and regularization.
Feature balance: Ensure features are scaled or normalized if needed (e.g., for KNN, SVM) and that no single variable dominates the model unfairly.
Data split balance: Training, validation, and test sets must be representative of the overall distribution to prevent biased evaluation.
Maintaining balance ensures that the model learns meaningfully, generalizes well, and fairly captures all relevant patterns.
59
(Machine Learning): What are the advantages and disadvantages of using complex machine learning models, such as neural networks, in supervised learning?
Advantages of complex models (e.g., neural networks):
High predictive power: Neural networks can capture complex, non-linear relationships in large and high-dimensional datasets.
Automatic feature learning: Deep models can learn abstract features from raw inputs (e.g., images, text) without manual engineering.
Scalability: They perform well with large datasets and are adaptable to many tasks (e.g., classification, regression, time series).
Disadvantages:
Lack of interpretability: Neural networks are often "black boxes"—hard to explain to stakeholders or debug.
Risk of overfitting: High flexibility makes them prone to fitting noise, especially with small datasets.
High computational cost: Training and tuning require significant time, hardware, and expertise.
Data hunger: They typically require large amounts of labeled data to perform well.
These models are powerful but should be used when their benefits outweigh the complexity and resource demands.
60
(Machine Learning): Can you explain the difference between parametric and non-parametric machine learning models?
The key difference lies in assumptions about the model structure and flexibility:
Parametric models: Assume a fixed form (e.g., linear, logistic) and only need to estimate a finite set of parameters (e.g., coefficients in linear regression). Faster and require less data, but less flexible—may underfit complex patterns. Examples: linear regression, logistic regression, Naive Bayes.
Non-parametric models: Do not assume a fixed structure; model complexity can grow with the data. More flexible and can fit complex relationships, but may overfit and require more data. Examples: k-nearest neighbors, decision trees, random forests, neural networks (in practice).
Choose parametric models for interpretability and speed, and non-parametric models for accuracy and flexibility in rich data environments.
61
(Machine Learning): What are the pros and cons of using linear regression in supervised learning?
Pros of linear regression:
Simple and interpretable: Coefficients clearly show the impact of each feature on the outcome.
Efficient: Fast to train and easy to implement, even with large datasets.
Theoretically well-understood: Solid statistical foundation with clear assumptions.
Good baseline model: Useful for benchmarking more complex methods.
Cons of linear regression:
Assumes linearity: Performs poorly when relationships between features and target are non-linear.
Sensitive to outliers: Can be skewed by extreme values.
Requires assumptions: Normality, homoscedasticity, and independence of errors must hold.
Limited flexibility: Cannot model interactions or non-linear effects without manual feature engineering.
It's best used when interpretability is needed and the data shows a roughly linear relationship.
62
(Machine Learning): How is logistic regression used in classification problems?
Logistic regression is used to classify data into categories, typically when there are two possible outcomes like “yes” or “no.” It works by estimating the probability that an observation belongs to a particular class based on its features. If the probability is above a chosen threshold (like 0.5), it assigns the observation to one class; otherwise, it assigns it to the other. It's especially useful when you want both predictions and insight into how each input feature influences the outcome.
63
(Machine Learning): Can you explain how decision trees work in supervised learning?
A decision tree works by splitting the data into smaller and smaller groups based on the values of input features, creating a tree-like structure. At each step (or “node”), the model selects the feature that best separates the data to improve prediction accuracy. This continues until it reaches a leaf node, where a final prediction is made. Decision trees are easy to interpret, can handle both numerical and categorical data, and work well for both classification and regression tasks.
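A short sketch of fitting and inspecting a classification tree in R, assuming the rpart package:

library(rpart)
data(iris)

# Fit a classification tree predicting species from all four measurements
tree <- rpart(Species ~ ., data = iris, method = "class")

# Inspect the splits chosen at each node
print(tree)

# Predict the class for a few observations
predict(tree, newdata = iris[c(1, 51, 101), ], type = "class")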
64
(Machine Learning): How does a support vector machine (SVM) work in supervised learning?
A Support Vector Machine (SVM) works by finding the best boundary (hyperplane) that separates data points of different classes. It chooses the boundary that maximizes the margin—the distance between the hyperplane and the nearest data points from each class (called support vectors). This helps the model generalize well to new data. SVMs can also handle non-linear patterns by using kernel functions to transform data into higher dimensions where it becomes separable.
65
(Machine Learning): Can you explain the concept of ensemble methods in supervised learning?
Ensemble methods combine the predictions of multiple models to produce a more accurate and robust final result than any single model alone. There are two main types: Bagging (e.g., Random Forest): Builds many models in parallel on different data samples and averages their predictions to reduce variance. Boosting (e.g., XGBoost, AdaBoost): Builds models sequentially, where each new model focuses on correcting the errors of the previous ones, reducing bias. Ensembles are powerful because they balance out weaknesses of individual models and often improve performance.
66
(Machine Learning): What is the difference between parameters and hyperparameters?
The key difference lies in what they control and when they are set: Parameters are the internal values a model learns from the data during training—like the coefficients in linear regression or the weights in a neural network. Hyperparameters are external settings that control the training process itself, such as the learning rate, number of trees in a forest, or the number of neighbors in KNN. They are set before training and tuned using techniques like cross-validation. In short: parameters are learned, hyperparameters are configured.
67
(Machine Learning): How do you fine-tune a supervised learning model?
To fine-tune a supervised learning model, you systematically adjust its hyperparameters to improve performance. The key steps are:
Choose hyperparameters to tune: For example, learning rate, number of trees, regularization strength, or depth of a decision tree.
Select a tuning method: Grid search tests all combinations in a defined parameter grid; random search randomly samples parameter combinations; Bayesian optimization (advanced) uses past results to choose better settings.
Use cross-validation: Evaluate each setting's performance reliably by splitting the training data into folds.
Pick the best model: Based on performance metrics like accuracy, F1-score, or AUC on the validation set.
Retrain on the full training data with the best settings and test on holdout data.
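A sketch of a simple grid search with cross-validation, assuming the caret package and a k-nearest-neighbours model:

library(caret)
data(iris)

# Grid of candidate hyperparameter values (k = number of neighbours in KNN)
grid <- expand.grid(k = c(3, 5, 7, 9, 11))

# Evaluate every candidate with 5-fold cross-validation and keep the best
ctrl <- trainControl(method = "cv", number = 5)
fit <- train(Species ~ ., data = iris,
             method = "knn",
             trControl = ctrl,
             tuneGrid = grid)

fit$bestTune  # the k with the best cross-validated accuracy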
68
(Machine Learning): How do you evaluate the performance of a supervised learning model?
To evaluate a supervised learning model, you compare its predictions to the true labels using appropriate performance metrics, depending on the task:
For classification:
Accuracy: Proportion of correct predictions.
Precision & recall: Useful for imbalanced data; precision measures exactness, recall measures completeness.
F1-score: Harmonic mean of precision and recall.
Confusion matrix: Shows true vs. predicted classes.
ROC curve & AUC: Evaluate the model's ability to rank predictions.
For regression:
Mean Squared Error (MSE) or Root MSE: Penalizes large errors.
Mean Absolute Error (MAE): Measures average prediction error.
R-squared: Proportion of variance explained by the model.
Always assess on validation or test data, not the training set, to measure generalization.
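A base-R sketch computing common metrics by hand from hypothetical predictions:

# Hypothetical classification results
actual <- c(1, 0, 1, 1, 0, 0, 1, 0)
pred   <- c(1, 0, 0, 1, 0, 1, 1, 0)

# Confusion matrix and accuracy
cm <- table(Predicted = pred, Actual = actual)
cm
accuracy <- sum(diag(cm)) / sum(cm)

# Precision, recall, and F1 for the positive class (1)
precision <- cm["1", "1"] / sum(cm["1", ])
recall    <- cm["1", "1"] / sum(cm[, "1"])
f1 <- 2 * precision * recall / (precision + recall)
c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)

# Hypothetical regression results
y_true <- c(10, 12, 15, 20)
y_hat  <- c(11, 11, 16, 18)
rmse <- sqrt(mean((y_true - y_hat)^2))
mae  <- mean(abs(y_true - y_hat))
c(rmse = rmse, mae = mae)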
69
(Machine Learning): What is the difference between evaluating the performance of a regression vs. a classification supervised learning model?
The main difference lies in the type of target variable and the metrics used to assess how well the model performs:
Classification models (predict categories): The goal is to assign inputs to discrete classes (e.g., spam or not spam). Common metrics include accuracy (% of correct predictions), precision/recall/F1-score (especially important for imbalanced data), the confusion matrix (counts of true/false positives and negatives), and ROC-AUC (the model's ability to distinguish between classes).
Regression models (predict continuous values): The goal is to estimate a numerical outcome (e.g., house price). Common metrics include Mean Absolute Error (MAE, average absolute difference), Mean Squared Error (MSE, average squared difference that penalizes large errors), Root Mean Squared Error (RMSE, the square root of MSE), and R-squared (the proportion of variance explained by the model).
In short: classification evaluates how well the model predicts labels, while regression evaluates how close its numerical predictions are to actual values.
70
(Text mining): What is text mining and how is it used in data science?
Text mining is the process of extracting meaningful information from unstructured text data using techniques from natural language processing (NLP), machine learning, and statistics. In data science, it is used to analyze large volumes of text—such as reviews, social media, or documents—for tasks like sentiment analysis, topic modeling, classification, and clustering, enabling data-driven insights from textual sources.
71
(Text mining): What are some real-world applications of text mining?
Real-world applications of text mining include:
Sentiment analysis: Monitoring customer opinions in reviews or social media.
Spam detection: Classifying emails as spam or not.
Topic modeling: Identifying themes in large document collections.
Chatbots: Understanding and responding to user queries.
Fraud detection: Analyzing textual patterns in insurance claims or financial reports.
Medical records analysis: Extracting symptoms, diagnoses, and treatments from clinical notes.
72
(Text mining): What is the difference between unstructured, semi-structured, and structured text data?
Structured text data: Organized in a fixed schema (e.g., databases, spreadsheets); each element is clearly defined and easy to query.
Semi-structured text data: Has some organizational properties but not a rigid schema (e.g., XML, JSON, HTML).
Unstructured text data: Lacks a predefined format, making it harder to process (e.g., emails, articles, social media posts).
73
(Text mining): How is text mining used for information extraction and retrieval?
Information extraction (IE) uses text mining to identify and extract structured information—such as names, dates, entities, or relationships—from unstructured text. Information retrieval (IR) uses text mining to find relevant documents or texts based on user queries, typically by ranking documents according to similarity or relevance (e.g., search engines). Both rely on natural language processing (NLP) techniques to process and understand natural language.
74
(Text mining): How does text classification differ from text clustering?
Text classification is a supervised learning task where texts are assigned to predefined categories based on labeled training data (e.g., spam vs. not spam). Text clustering is an unsupervised learning task where texts are grouped into clusters based on similarity, without predefined labels (e.g., discovering topics in a set of news articles). In short: classification uses known labels; clustering finds hidden structure.
75
(Text mining): Can you explain the process of text preprocessing and why it's important in text mining?
Text preprocessing is the process of cleaning and standardizing raw text before analysis. It is crucial because raw text is noisy and inconsistent. Key steps include:
Tokenization – splitting text into words or phrases.
Lowercasing – converting all text to lowercase for uniformity.
Removing punctuation and special characters – to reduce noise.
Stopword removal – eliminating common words (e.g., "the", "and") that add little meaning.
Stemming/lemmatization – reducing words to their root form (e.g., "running" → "run").
Removing numbers or rare words – depending on context.
Preprocessing ensures that the text data is consistent, reduces dimensionality, and improves model performance.
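A base-R sketch of basic preprocessing on two hypothetical reviews (packages such as tm or tidytext offer more complete pipelines):

docs <- c("The product is GREAT!!! Loved it :)",
          "Terrible support, would not recommend...")

# Lowercasing
docs <- tolower(docs)

# Remove punctuation, digits, and extra whitespace
docs <- gsub("[[:punct:]]+", " ", docs)
docs <- gsub("[[:digit:]]+", " ", docs)
docs <- gsub("\\s+", " ", trimws(docs))

# Tokenization: split on whitespace
tokens <- strsplit(docs, " ")

# Remove a small illustrative stopword list
stopwords <- c("the", "is", "it", "a", "and")
tokens <- lapply(tokens, function(x) x[!x %in% stopwords])
tokens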
76
(Text mining): Can you explain the concept of vectorization and how it works in text mining?
Vectorization is the process of converting text into numerical format so that it can be analyzed by machine learning algorithms. In text mining, this typically involves:
Bag-of-Words (BoW) – represents text as a vector of word counts or frequencies, ignoring word order.
TF-IDF (Term Frequency–Inverse Document Frequency) – adjusts word frequency by how unique a word is across documents, giving more weight to informative words.
Word embeddings (e.g., Word2Vec, GloVe) – represent words in dense, low-dimensional vectors that capture semantic meaning and relationships.
Vectorization is essential because machine learning models require numerical input, and it allows similarity, clustering, and classification tasks on text data.
77
(Text mining): What are word embeddings and why do we care about them?
Word embeddings are dense, low-dimensional vector representations of words that capture their semantic meaning based on context. Unlike Bag-of-Words or TF-IDF (which are sparse and ignore meaning), embeddings place similar words closer together in vector space (e.g., "king" and "queen"). We care about them because they:
Preserve semantic relationships (e.g., king - man + woman ≈ queen)
Improve model performance in tasks like sentiment analysis, translation, and text classification
Enable deep learning models to understand context and word similarity
Popular methods: Word2Vec, GloVe, FastText, BERT.
78
(Text mining): Can you describe the term frequency-inverse document frequency (TF-IDF) approach and its role in text mining?
TF-IDF (Term Frequency–Inverse Document Frequency) is a statistical method used to evaluate how important a word is in a specific document relative to a collection (corpus) of documents.
Components:
Term Frequency (TF): Measures how often a term appears in a document. TF(t, d) = (count of term t in document d) / (total terms in d).
Inverse Document Frequency (IDF): Measures how unique or rare a term is across all documents. IDF(t) = log(N / DF(t)), where N is the total number of documents and DF(t) is the number of documents containing term t.
TF-IDF score: TF-IDF(t, d) = TF(t, d) × IDF(t).
Role in text mining:
Highlights informative and discriminative words.
Reduces the influence of common words (e.g., "the", "and").
Used for document classification, clustering, and search ranking.
It transforms text into numerical vectors suitable for machine learning algorithms.
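A base-R sketch computing TF-IDF by hand for a tiny hypothetical corpus, following the formulas above:

docs <- list(
  d1 = c("data", "science", "uses", "data"),
  d2 = c("machine", "learning", "uses", "data"),
  d3 = c("text", "mining", "finds", "patterns")
)

vocab <- unique(unlist(docs))
N <- length(docs)

# Term frequency: count of term t in document d divided by document length
tf <- sapply(docs, function(d) {
  counts <- table(factor(d, levels = vocab))
  as.numeric(counts) / length(d)
})
rownames(tf) <- vocab

# Document frequency and inverse document frequency
df_t <- rowSums(tf > 0)
idf <- log(N / df_t)

# TF-IDF: weight each document's term frequencies by the IDF values
tfidf <- tf * idf
round(tfidf, 3)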
79
(Text mining): How can you normalize text data, and why is it relevant?
Text normalization is the process of transforming text into a standard, consistent format to reduce variability and improve analysis. Common techniques:
Lowercasing – to avoid case-sensitive mismatches (e.g., "Apple" vs. "apple").
Removing punctuation and special characters – cleans noise from data.
Removing stopwords – eliminates frequent but uninformative words (e.g., "and", "the").
Stemming/lemmatization – reduces words to their base/root form (e.g., "running" → "run").
Removing numbers or excessive whitespace – for cleaner representation.
Why it's relevant: normalization ensures that similar text elements are treated the same way, which improves the accuracy, efficiency, and interpretability of text mining models.
80
(Text mining): What is the role of n-grams in text mining?
N-grams are contiguous sequences of n items (typically words) from a text.
Types:
Unigram: a single word (e.g., "data")
Bigram: a two-word sequence (e.g., "data science")
Trigram: a three-word sequence (e.g., "machine learning model")
Role in text mining:
Capture context and word order, which unigrams miss.
Improve performance in tasks like text classification, sentiment analysis, and language modeling.
Help detect common phrases and co-occurrence patterns.
However, higher-order n-grams increase dimensionality and sparsity, so they require careful tuning.
81
(Text mining): How can a corpus be annotated and why is it useful for text mining?
A corpus can be annotated by adding labels or metadata to text elements to enrich the data for analysis.
Types of annotation:
Part-of-speech (POS) tagging – labels each word's grammatical role (e.g., noun, verb).
Named Entity Recognition (NER) – identifies entities like names, dates, locations.
Sentiment labels – assigns emotions or polarity (e.g., positive, negative).
Topic or category labels – for supervised classification tasks.
Why it's useful:
Enables supervised learning (e.g., training classifiers).
Adds structure and meaning to unstructured text.
Improves accuracy in tasks like information extraction, search, and NLP modeling.
82
(Text mining): What are some of the challenges associated with text mining from various sources like social media, websites, etc.?
Key challenges in text mining from sources like social media and websites include:
Noise and informality – Text is often informal, includes slang, emojis, typos, and abbreviations.
Unstructured format – Data lacks consistent structure; may include hashtags, links, or HTML.
High volume and velocity – Real-time data streams can be massive and rapidly changing.
Short context – Tweets or messages are brief, making interpretation harder.
Multilingual content – Requires language detection and separate processing pipelines.
Spam and bots – Can distort analysis if not filtered.
Privacy and ethics – Legal and ethical concerns about data usage and user consent.
These issues complicate preprocessing, modeling, and interpretation.
83
(Text mining): What is sentiment analysis and how is it useful in text mining?
Sentiment analysis is the process of identifying and categorizing the emotional tone or opinion expressed in text—typically as positive, negative, or neutral.
How it works:
Uses lexicons, machine learning, or deep learning to detect sentiment in words, phrases, or full texts.
Can be applied at the document, sentence, or aspect level (e.g., “battery life” in a product review).
Usefulness in text mining:
Tracks public opinion (e.g., brand perception, political sentiment).
Analyzes customer feedback to guide product improvements.
Monitors social media for crisis detection or campaign impact.
Supports automated decision-making in marketing, finance, and customer service.
84
(Text mining): Can you explain the concept of topic modeling and its application in text mining?
Topic modeling is an unsupervised machine learning technique used to discover hidden themes or topics in a collection of texts.
How it works:
Algorithms like Latent Dirichlet Allocation (LDA) assume each document is a mixture of topics, and each topic is a distribution of words.
It identifies groups of words that frequently occur together and assigns topic probabilities to each document.
Applications in text mining:
Summarizing large text corpora (e.g., news articles, academic papers).
Organizing content (e.g., clustering customer reviews by topic).
Recommending content based on dominant topics.
Discovering trends over time in social media or public discourse.
It enables efficient navigation and analysis of unstructured text data.