Principles of Statistics Flashcards
What does analysing data with statistics do?
- Framework to uncover hidden patterns 🏗️
- Objective Perspective 🎯
- Test Hypotheses🧪
- Confident Decisions: Rely on Data > Assumptions e.g. lead time changes💪
How have you applied statistical testing when analysing data?
- Descriptive stats: mean, median etc.
- Inferential stats: hypothesis testing, e.g. Pearson’s correlation coefficient or regression
- Assess Model: RMSE, MAE
What is Hypothesis Testing?
- Inferential stats method 📈
- Assess a hypothesis about a larger population based on a sample 👥 🎛️
- 2 competing hypotheses: null (no significant correlation) and alternative (a significant correlation)❌🔀
- See if the observed result could be due to chance alone🔭🍀
What is Inferential Statistics?
- Field of Statistics🌾
- Analytical tools to draw conclusions about a whole population 🌍 based on a sample 🔬
What is Pearson’s correlation test?
Hypothesis test that determines whether a linear relationship exists between 2 variables (e.g. lead time and stock holding)
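A minimal sketch of this test in plain Python. The lead time and stock holding figures are made up for illustration (they are not the Frozen Warehouse data), and the p-value comes from a permutation test (shuffling one variable) rather than the t-distribution that statistics packages normally use.

```python
import math
import random

# Hypothetical sample: lead time (days) and stock holding (pallets)
lead_time = [7, 10, 14, 14, 21, 21, 28, 30, 35, 42]
stock_holding = [120, 90, 150, 110, 160, 130, 140, 200, 150, 210]

def pearson_r(x, y):
    """Pearson's correlation coefficient, between -1 and 1."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(x, y, n_shuffles=5000, seed=0):
    """Two-sided p-value: share of shuffles with |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)

r = pearson_r(lead_time, stock_holding)
p = permutation_p_value(lead_time, stock_holding)
print(f"r = {r:.3f}, p = {p:.4f}")
print("reject the null" if p < 0.05 else "fail to reject the null")
```

The coefficient r gives the strength and direction of the relationship; the p-value says whether a correlation that large could plausibly appear by chance.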
What is a t test?
Hypothesis test that compares the means of 2 groups
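A small sketch of the idea, using Welch's version of the t test (which does not assume equal variances). The two groups are hypothetical, and the code stops at the t statistic and degrees of freedom; a statistics package would turn these into a p-value.

```python
import math
from statistics import mean, variance

# Hypothetical stock holding (pallets) for two groups of suppliers
short_lead = [120, 90, 150, 110, 160]
long_lead = [130, 140, 200, 150, 210]

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of the mean of a
    vb = variance(b) / len(b)  # squared standard error of the mean of b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

t_stat, df = welch_t(short_lead, long_lead)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

The further the t statistic is from 0, the smaller the p-value and the stronger the evidence that the two means differ.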
What was the significance level that the P value was tested against?
5% significance level (p < 0.05)
What is a p-value?
- Statistical Measure 📏
- Probability of seeing results at least this extreme if the null hypothesis were true; used to judge whether the results are statistically significant⭐⭐⭐⭐⭐⭐⭐
- A low p-value (< 0.05) = reject the null hypothesis in favour of the alternative: there is evidence of an effect/relationship/difference
- A high p-value (> 0.05) = fail to reject the null hypothesis: not enough evidence of an effect/relationship/difference between the 2 variables
Interpret the P value results of the Pearson’s Correlation Test
- P Value < 0.05
- Reject Null
- Conclude Alternative
- WAS a significant relationship between Lead Time & Stock Holding
Interpret the correlation coefficient of the Pearson’s correlation test
- Strength of relationship
- -1 to 1
- Positive value, close to 0 (far from 1)
- Weak Positive relationship
- Could infer from the sample: a relationship did exist between lead time and stock holding in the Frozen Warehouse (Inferential Stats example)
Have you encountered a situation where a statistical method did not yield the desired results? How did you rectify it?
- Regression = high error & poor fit
- Due to small sample size, DQ issues or weak relationship
- Frozen suppliers did not adhere to lead times
- Summer build stock (irrespective of lead time)
- Customer demand, supplier shortages, warehouse space (not considered by model)
- External factors: historical data may be better
- Time series: identify patterns
What is linear regression?
- Stats method
- Predicts an outcome based on another
- By fitting a line of best fit to the data
- The equation of the line allows the model to make predictions
- E.g. if the lead time was 30 days, find 30 on the x axis, go up to the line and read off the corresponding y value (stock holding) as the prediction
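The line-of-best-fit idea can be sketched in plain Python with ordinary least squares. The data below is hypothetical, and `fit_line` and `predict` are illustrative helper names, not part of any library.

```python
# Hypothetical sample: lead time (days, x) and stock holding (pallets, y)
lead_time = [7, 10, 14, 14, 21, 21, 28, 30, 35, 42]
stock_holding = [120, 90, 150, 110, 160, 130, 140, 200, 150, 210]

def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(lead_time, stock_holding)

def predict(x_value):
    """Read the fitted line at a given lead time."""
    return intercept + slope * x_value

print(f"stock holding = {intercept:.2f} + {slope:.2f} * lead time")
print(f"prediction at 30 days: {predict(30):.1f}")
```

Calling `predict(30)` is the code equivalent of finding 30 on the x axis and reading the y value off the line.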
When did you use linear regression?
- To predict stock holding from lead time
- Lead Time as the independent variable (x axis)
- Stock Holding as the dependent variable (y axis)
What was the independent variable in your regression model?
Lead time on the x axis
What was the dependent variable on your regression model?
Stock holding on the y axis
What evaluation metrics did you use to determine the accuracy and effectiveness of your models?
- ROOT MEAN SQUARED ERROR - measures the typical size of the difference between actual and predicted values, penalising large errors more (lower value is better)
- MEAN ABSOLUTE ERROR - the average absolute difference between actual and predicted values (lower value is better)
- R SQUARED - most common - shows how much data variation is explained by the model. Ranges from 0 to 1. 1 = 100% of the variation is explained by the model. Closer to 1 = better fit💯✅
- Plotted predicted stock/lead time - not a straight 45° line, so the model was not performing well
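The three metrics can be computed by hand as below. This is a sketch on made-up actual and predicted values, just to show what each number measures.

```python
import math

# Hypothetical actual and predicted stock holding values
actual = [100, 150, 200, 250]
predicted = [110, 140, 210, 230]

def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: squares errors first, so big misses count more."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Share of the variation in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    return 1 - ss_res / ss_tot

print(f"MAE  = {mae(actual, predicted):.2f}")
print(f"RMSE = {rmse(actual, predicted):.2f}")
print(f"R^2  = {r_squared(actual, predicted):.3f}")
```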
What is R SQUARED and interpret your results
- Number that shows how well the line (LR Model) fits the data🔢
- Tells me how much of a difference in stock holding can be explained by lead time⏱️
- My R-squared was no bigger than 0.05, which means at most 5% of the differences in stock holding can be explained by lead time⚄
- Additionally, training and test scores were both low, which could suggest the model was too simple to capture the patterns in the data (underfitting)🤺🧪⚪️
What does overfitting mean?
- Model is too complex
- Fits the training data too well
- Performs poorly on new data that differs from the training data
What is a limitation of R squared?
sensitive to outliers and my data had a few that could have influenced the score
Why did you choose those error metrics?
MAE and RMSE together, as RMSE is more sensitive to outliers and using both gives more insight. E.g. RMSE much bigger than MAE = outliers exist that could throw the model off
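A quick illustration of why the pair is informative, using two made-up sets of prediction errors with the same MAE. Only the set containing one large miss pushes RMSE well above MAE.

```python
import math

def mae(errors):
    """Mean absolute error from a list of prediction errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error from a list of prediction errors."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

even_errors = [5, 5, 5, 5]      # hypothetical: every prediction off by the same amount
outlier_errors = [1, 1, 1, 17]  # hypothetical: one large miss

for name, errors in [("even", even_errors), ("outlier", outlier_errors)]:
    print(f"{name}: MAE = {mae(errors):.2f}, RMSE = {rmse(errors):.2f}")
```

RMSE can never be smaller than MAE, so it is the size of the gap between them that hints at outliers.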
What is a time series forecast?
Type of predictive analysis that predicts future values based on historical data collected at specific intervals. It analyses past trends, patterns and seasonal variations to make these predictions.
What tool did you use for the time series forecast model and why?
Python
- flexibility: can tune the exponential smoothing levels
- experiment with different models
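To show what a smoothing level does, here is a sketch of simple exponential smoothing written from scratch. It is not necessarily the library or model used for the actual forecast, and the demand figures are hypothetical.

```python
def simple_exp_smoothing(series, alpha):
    """Smoothed values; alpha in (0, 1] is the smoothing level."""
    smoothed = [series[0]]  # start from the first observation
    for value in series[1:]:
        # Blend the new observation with the previous smoothed value
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

demand = [100, 120, 110, 130]  # hypothetical weekly demand

for alpha in (0.2, 0.5, 1.0):
    fitted = simple_exp_smoothing(demand, alpha)
    # With this method, the forecast for the next period is the last smoothed value
    print(alpha, [round(v, 1) for v in fitted], "next forecast:", round(fitted[-1], 1))
```

A low smoothing level reacts slowly and smooths out noise; a level of 1.0 simply repeats the latest observation.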
How do you know if your forecast is accurate?
Root Mean Squared Error - typical size of the error between actual and predicted values
Mean Absolute Error - also measures the error between actual and predicted values
Also use the confidence intervals in the chart to see how certain the forecast is
What do the 4 plots on the decomposition plot show?
Observed - actual data
Trend - the long term upward or downward direction of the data
Seasonality - repeating patterns within specific time periods
Residual (noise) - random fluctuations that cannot be explained by trend or seasonality
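The four components can be sketched with a hand-written additive decomposition (observed = trend + seasonality + residual). This is a simplified illustration on a made-up quarterly series, not the routine that produced the original plot; `decompose_additive` is a hypothetical helper and needs at least two full cycles of data.

```python
def decompose_additive(series, period):
    """Split a series into trend, seasonal and residual parts (additive model)."""
    n = len(series)
    half = period // 2

    # Trend: centered moving average; None where the window does not fit
    trend = [None] * n
    for i in range(half, n - half):
        window = series[i - half : i + half + 1]
        if period % 2 == 0:
            # Even period: half-weight the two end points
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[i] = total / period

    # Seasonality: average detrended value for each position in the cycle
    buckets = [[] for _ in range(period)]
    for i in range(n):
        if trend[i] is not None:
            buckets[i % period].append(series[i] - trend[i])
    averages = [sum(b) / len(b) for b in buckets]
    offset = sum(averages) / period  # center so the pattern averages to zero
    seasonal = [averages[i % period] - offset for i in range(n)]

    # Residual: what trend and seasonality leave unexplained
    residual = [
        None if trend[i] is None else series[i] - trend[i] - seasonal[i]
        for i in range(n)
    ]
    return trend, seasonal, residual

# Hypothetical quarterly series: linear trend plus a repeating pattern, no noise
pattern = [3, -1, -2, 0]
observed = [10 + 2 * t + pattern[t % 4] for t in range(12)]

trend, seasonal, residual = decompose_additive(observed, period=4)
print("observed:", observed)
print("trend:   ", trend)
print("seasonal:", seasonal)
print("residual:", residual)
```

Because the example series has no noise, the residuals come out at (or extremely close to) zero; on real data they show what the trend and seasonal pattern fail to explain.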