Exploratory Data Analysis Flashcards
(40 cards)
Exploratory data analysis
Getting a feel for the data: it makes it easier to find mistakes, to guess what actually happened, and to spot outliers.
- Understand and gain insights into the data before selecting analysis techniques.
- Approach data without assumptions, often using visual methods.
We need to get to know the data
- Numeric data distributions (symmetric, normal, skewed, etc.)
- Data quality problems
- Find outliers
- Search for correlations and interrelationships
- Identify subsets of interest
- Suggest functional relationships
We can ask questions
- Descriptive stats: “Who is most profitable?”
- Hypothesis Testing: “Is there a difference between the value of these two customers?”
- Classification: “What are the common characteristics of customers?”
- Prediction: “Will this new customer become profitable?”
- Ultimately we must decide which models and techniques to use, given the problem context, the data, and the underlying assumptions.
Comparison with Hypothesis Testing
- EDA: Open-ended exploration with no or incomplete prior expectations.
- Hypothesis Testing: Tests pre-defined hypotheses.
Systematic Process
- Understand Data Context:
  - Who created the dataset, when, and why?
  - Size, number of fields, and their meanings.
- Initial Exploration:
  - Inspect familiar or interpretable records.
  - Compute summary statistics (e.g., mean, min, max, quartiles, outliers); see the sketch after this list.
- Visualization:
  - Plot variable distributions (e.g., box plots, time series).
  - Examine relationships via scatterplot matrices.
  - Visualize pairwise correlations and group breakdowns (e.g., gender, age).
- Transformations:
  - Transform variables as needed to reveal patterns and outliers.
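A minimal sketch of the initial-exploration step with pandas (the file name and dataset here are hypothetical):

```python
import pandas as pd

# Load the dataset (hypothetical file name).
df = pd.read_csv("customers.csv")

# Inspect a few interpretable records and the field types/meanings.
print(df.head())
print(df.dtypes)

# Summary statistics per numeric column: mean, min, max, quartiles.
print(df.describe())

# Quick data quality check: missing values per field.
print(df.isna().sum())
```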
Descriptive Statistics
Quantitatively describe the main features of the data:
- Measures of central tendency represent a center around which measurements are distributed (mean, median)
- Measures of variability represent the spread of data from the center (standard dev.)
- Measures of relative standing represent the ‘relative position’ of specific measurements in data (quantiles)
The mean
The average of all values. Badly affected by outliers, which makes it a poor measure of central tendency when the data are skewed or contain extreme values.
The median
Middle value when the values are ranked in order; it splits the data into two halves. Also known as the 50th percentile. Unaffected by outliers, making it a more robust measure of central tendency. In skewed data, the mean lies further towards the skew than the median.
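A quick sketch (standard library only) of the last two cards: a single outlier drags the mean but barely moves the median:

```python
from statistics import mean, median

values = [12, 14, 15, 16, 18]
with_outlier = values + [180]  # one extreme value

print(mean(values), median(values))              # 15 15
print(mean(with_outlier), median(with_outlier))  # 42.5 15.5
```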
The mode
Most common data point; there may be more than one mode.
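The multi-modal case in one line, using statistics.multimode from the standard library:

```python
from statistics import multimode

print(multimode([1, 2, 2, 3, 3, 4]))  # [2, 3]: two modes
```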
Variance
The average squared deviation from the mean; measures the spread around the mean. The lower the variance, the more consistent the data.
Standard Deviation
The square root of the variance; spread around the mean in the original units. A high standard deviation means more spread, less consistency, and less clustering around the mean.
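A numpy sketch of both spread measures; ddof=1 switches to the sample versions, which divide by n - 1:

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Population variance and standard deviation (the mean is 5 here).
print(np.var(data), np.std(data))  # 4.0 2.0

# Sample versions (divide by n - 1 instead of n).
print(np.var(data, ddof=1), np.std(data, ddof=1))
```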
Quartiles
One of the three values (Q1, Q2, Q3) that divide the ranked data into four equal parts. The median is the 2nd quartile (Q2) and divides the data in half.
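A numpy sketch of the quartiles; note that Q2 coincides with the median:

```python
import numpy as np

data = np.array([1, 3, 4, 7, 8, 10, 12, 15, 20])
q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(q1, q2, q3)             # 4.0 8.0 12.0
print(q2 == np.median(data))  # True: the median is Q2
```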
Common Visualizations
- Histograms/Bar Charts
- Box Plots
- Scatterplots
Histograms/Bar Charts
Used to display a frequency distribution: counts of data falling into various ranges. A histogram is used for numeric data, a bar chart for categorical data. Bin size selection is important: if bins are too small they may show false patterns; if too large they may hide important patterns. Several variations are possible: plot relative frequencies instead of raw frequencies, or make bar height equal to relative frequency divided by bin width (a density histogram).
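A matplotlib sketch of how bin count changes the picture (the data are synthetic):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=0, scale=1, size=500)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins in zip(axes, [5, 30, 200]):
    # density=True makes bar height relative frequency / bin width.
    ax.hist(data, bins=bins, density=True)
    ax.set_title(f"{bins} bins")
plt.show()
```

With 5 bins the shape is oversmoothed; with 200 bins noise starts to look like structure.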
Box plots
A plot of the five-number summary of the data: minimum, 1st quartile, median, 3rd quartile, maximum. Often used together with a histogram in EDA.
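A minimal sketch pairing a box plot with a histogram of the same (synthetic, skewed) data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=300)  # right-skewed sample

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.boxplot(data, vert=False)  # five-number summary at a glance
ax2.hist(data, bins=30)        # full shape of the distribution
plt.show()
```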
Scatterplots
2D graphs, useful for understanding the relationship between two attributes. The relationship is described by its strength, shape, direction, and the presence of outliers.
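A sketch of a scatterplot for two synthetic attributes, with the correlation coefficient as a numeric check on strength and direction:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=200)
y = 3 * x + rng.normal(scale=4, size=200)  # linear trend plus noise

print(np.corrcoef(x, y)[0, 1])  # close to +1: strong positive relationship

plt.scatter(x, y, alpha=0.5)
plt.xlabel("attribute x")
plt.ylabel("attribute y")
plt.show()
```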
Models Definition & Purpose
- Models encapsulate information into tools for forecasts/predictions.
- Key steps: Building, fitting, and validating.
- “All models are wrong, but some are useful.” — George Box
Philosophies of Models
- Occam’s Razor
- Bias Variance Trade-Off
Occam’s Razor
- Prefer simpler models when equally accurate, as they:
  - Make fewer assumptions, reducing overfitting risk.
  - Avoid memorizing features of the dataset.
- However, simplicity isn’t absolute:
  - Complex models like deep learning can be more predictive despite higher parameter counts.
  - Complexity comes with a trade-off between accuracy and cost.
Bias-Variance Trade-Off
- Bias: Error from overly simple assumptions (e.g., underfitting).
  - Performs poorly on both training and testing data.
- Variance: Error from excessive sensitivity to noise (e.g., overfitting).
  - Performs well on training data but generalizes poorly to new data.
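A scikit-learn sketch of the trade-off on synthetic data: a degree-1 polynomial underfits (high bias, poor everywhere), a degree-15 polynomial overfits (high variance, good on training data only):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(100, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.2, size=100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),  # training error
          mean_squared_error(y_te, model.predict(X_te)))  # test error
```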
Principles of Good Models
- Probabilistic Predictions: Assign probabilities to forecasts (e.g., a 50% chance of rain); report a probability distribution rather than a single point estimate
- Feedback Mechanism: Models should update dynamically and show how predictions evolve over time
- Consensus: Build multiple models with distinct methods for the same prediction
- Bayesian Reasoning: Update probabilities with new events. Requires prior probabilities from domain knowledge
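A sketch of a single Bayesian update; the prior and likelihoods are made-up numbers standing in for domain knowledge:

```python
# Prior belief that a customer is profitable (assumed from domain knowledge).
prior = 0.30

# Likelihood of the observed event under each hypothesis (assumed values).
p_event_given_profitable = 0.80
p_event_given_not = 0.20

# Bayes' rule: posterior = likelihood * prior / evidence.
evidence = (p_event_given_profitable * prior
            + p_event_given_not * (1 - prior))
posterior = p_event_given_profitable * prior / evidence
print(round(posterior, 3))  # 0.632: belief revised upward by the event
```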
Baseline Models Purpose
- Assess model effectiveness by comparison to simple, reasonable benchmarks.
- Only when models decisively outperform baselines can they be deemed effective.
Classification Baselines
- Random selection of labels (no prior distribution).
- Most common label in the training data.
- Best single-feature model.
- Compare against an existing, well-known model.
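A sketch of the first two classification baselines with scikit-learn's DummyClassifier (the dataset is chosen only for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# "uniform" = random labels, "most_frequent" = majority label.
for strategy in ["uniform", "most_frequent"]:
    baseline = DummyClassifier(strategy=strategy, random_state=0)
    baseline.fit(X_tr, y_tr)
    print(strategy, baseline.score(X_te, y_te))  # accuracy to beat
```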
Prediction Baselines
- Mean or median value of the target.
- Linear regression for linear relationships.
- Previous value (useful in time-series forecasting).
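A sketch of the mean baseline and the previous-value baseline on a synthetic time series:

```python
import numpy as np

# Synthetic series with a drift, split into train and test.
rng = np.random.default_rng(4)
y = np.cumsum(rng.normal(loc=0.5, scale=1.0, size=200))
train, test = y[:150], y[150:]

# Baseline 1: always predict the training mean.
mean_pred = np.full_like(test, train.mean())

# Baseline 2: predict the previous observed value (naive forecast).
prev_pred = np.concatenate(([train[-1]], test[:-1]))

for name, pred in [("mean", mean_pred), ("previous value", prev_pred)]:
    print(name, np.mean((test - pred) ** 2))  # mean squared error
```

On a drifting series the previous-value baseline is usually far harder to beat than the mean.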