Data Analysis Flashcards
(38 cards)
What is the difference between a bad data point and an outlier?
Bad data point: An observation that contains invalid or inaccurate data and may or may not be a statistical outlier.
Outlier: An observation that lies outside the overall pattern of a distribution.
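A minimal sketch of one common way to flag statistical outliers, the 1.5 x IQR rule. The rule, the function name iqr_outliers, and the cycle_times values are illustrative assumptions, not part of the deck. A flagged value still has to be checked to decide whether it is a bad data point or a valid but unusual observation.

```python
import statistics

def iqr_outliers(values):
    """Return values lying more than 1.5 * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical cycle times; 48 may be a typo or a genuine slow run.
cycle_times = [12, 13, 13, 14, 14, 15, 15, 16, 48]
print(iqr_outliers(cycle_times))  # -> [48]
```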
What are some causes of poor data quality?
- Manual entry errors
- Duplicate entries
- Query issues (nulls, blanks, white space, data formats)
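The causes above can be screened for with a few simple checks. The sketch below is only an illustration: the row layout, the field names id and city, and the function name clean are hypothetical.

```python
# Hypothetical rows showing white space, a blank, a duplicate, and a null.
raw_rows = [
    {"id": "101", "city": " Reading "},
    {"id": "102", "city": ""},
    {"id": "101", "city": "Reading"},
    {"id": "103", "city": None},
]

def clean(rows):
    """Trim white space, drop duplicates, and flag blank or null cities."""
    seen, cleaned, flagged = set(), [], []
    for row in rows:
        city = (row["city"] or "").strip()   # None and "" both become ""
        if not city:
            flagged.append(row["id"])        # needs manual follow-up
            continue
        key = (row["id"], city)
        if key in seen:                      # duplicate after trimming
            continue
        seen.add(key)
        cleaned.append({"id": row["id"], "city": city})
    return cleaned, flagged

cleaned, flagged = clean(raw_rows)
print(cleaned)   # -> [{'id': '101', 'city': 'Reading'}]
print(flagged)   # -> ['102', '103']
```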
What are the benefits of using experimental data rather than historical data?
- Better define the measurement system
- Create the factor settings to analyze
- Establish cause and effect
- Structured to reduce variation and enable analysis
What are some issues with historical data?
- Often lacks precision
- Includes Project Y data but not the X settings needed
- Does not establish cause and effect
When does it make sense to use historical data?
- Access to sufficient data
- Cost prohibitive to conduct an experiment
When does it make sense to use experimental data?
- No historical data exists
- Need to define the relationship between inputs and outputs
- Identify and measure interactions between the ‘Vital Few’ sources of variation
- Determine the best set-up conditions of X’s for improved Y performance
- Cost-justified
What is Data Analysis?
Using statistical methodology and tools to discover useful information and make better business decisions.
When does it make sense to use data analysis?
- Establish whether there is a relationship between an output and a suspected variable
- Determine the optimal setting for a confirmed variable
- Quantify the impact of controlling a confirmed variable
What are some key outcomes of the Data Analysis skillset?
- 4S/GAP Methodology
- Identifying Transfer Function(s)
- Basic Statistical Analyses
- Recommendations to the business based on data
What is the difference between discrete and continuous data?
Discrete (attribute) data take a finite or countable number of possible values, such as counts or categories. Continuous (variable) data can be measured along a continuous scale of values.
Why do we care what type of data we have?
- Different analysis tools apply to discrete and continuous data
- Continuous data typically requires fewer data points
- Continuous data provides a more complete picture of variation in a process
What does y = f(x) represent and mean?
Transfer function: the relationship that explains the output y in terms of the input(s) x.
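As an illustration, a simple transfer function with one x can be estimated by fitting a straight line with least squares. This is a sketch under assumptions: the linear form, the function name fit_line, and the temperature and yield values are hypothetical, and real transfer functions may involve several x's or a non-linear form.

```python
import statistics

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical data constructed so that y = 0.5 * x - 55 exactly.
temps = [150, 160, 170, 180, 190]
yields = [20.0, 25.0, 30.0, 35.0, 40.0]

slope, intercept = fit_line(temps, yields)
print(slope, intercept)  # -> 0.5 -55.0
```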
Describe the approach we use to analyze data at Penske.
4S (Stability, Shape, Spread, ‘S’enter) and GAP (Graphical, Analytical, Practical)
What is the difference between descriptive statistics and inferential statistics?
Descriptive: Analyzing data without drawing conclusions about a larger group. Inferential: Analyzing data from a sample group to draw conclusions about the population.
What is sampling?
The process of collecting a subset of the data and drawing conclusions about the total population from the subset.
What statistics do we use to look at variation and when do we use each?
- Standard deviation: when the data are normally distributed
- IQR (interquartile range): when the data are non-normal
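A short sketch of why the choice matters, using made-up numbers: one extreme value inflates the standard deviation but leaves the IQR unchanged. The function name spread and both data sets are hypothetical.

```python
import statistics

def spread(values):
    """Return (sample standard deviation, interquartile range)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return statistics.stdev(values), q3 - q1

symmetric = [4, 5, 5, 6, 6, 7, 7, 8]   # hypothetical, roughly symmetric
skewed = symmetric + [30]              # same data plus one extreme value

sd_sym, iqr_sym = spread(symmetric)
sd_skew, iqr_skew = spread(skewed)

print(round(sd_sym, 2), iqr_sym)    # standard deviation is small here
print(round(sd_skew, 2), iqr_skew)  # standard deviation jumps, IQR does not
```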
What statistics do we use to look at central tendency and when do we use each?
- Mean (average): when the data are normally distributed
- Median (middle value): when the data are non-normal
- Mode (most frequently occurring value): not used in hypothesis testing
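The three measures can be compared on a small skewed data set. The values below are hypothetical and chosen so that a single large value pulls the mean well above the median.

```python
import statistics

# Hypothetical right-skewed data: one large value among small ones.
wait_days = [2, 3, 3, 4, 5, 40]

mean = statistics.fmean(wait_days)     # pulled upward by the 40
median = statistics.median(wait_days)  # middle of the sorted values
mode = statistics.mode(wait_days)      # most frequently occurring value

print(mean, median, mode)  # -> 9.5 3.5 3
```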
What core question are we answering when using a hypothesis test for Spread or ‘S’enter?
Is there a statistical difference?
What is a p-value?
- Informally, the risk of being wrong when claiming a difference
- Formally, the probability of obtaining a result at least as extreme as the one observed if the null hypothesis is true
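One way to make the definition concrete is a permutation test, sketched below. It is an illustrative technique chosen here, not necessarily the test used in the course; the function name permutation_p_value, the group values, and the number of shuffles are assumptions. Shuffling the group labels simulates the null hypothesis of no difference, and the p-value is the share of shuffles whose difference in means is at least as extreme as the observed one.

```python
import random
import statistics

def permutation_p_value(a, b, n_perm=5000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                      # relabel under the null
        new_a, new_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(statistics.fmean(new_a) - statistics.fmean(new_b))
        if diff >= observed - 1e-12:             # small float tolerance
            hits += 1
    return hits / n_perm

# Hypothetical before/after measurements with a clear shift.
before = [10, 11, 12, 13, 14, 15]
after = [20, 21, 22, 23, 24, 25]
p_shift = permutation_p_value(before, after)

# Identical groups: observed difference is zero, so every shuffle counts.
p_same = permutation_p_value([1, 2, 3, 4], [1, 2, 3, 4])
print(p_shift, p_same)
```

A small p-value (p_shift here) means the observed difference would be rare if there were truly no difference; a large one (p_same) means the data are consistent with no difference.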
What is power?
The probability of detecting a difference when one truly exists.
What are confidence intervals and why do we use them?
A range that reflects how much variation we can expect in our estimates; used because a sample only approximates the population.
What does a 95% confidence interval mean?
If the sampling were repeated many times, about 95% of the confidence intervals calculated this way would contain the true population value.
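What the 95% refers to can be checked by simulation: draw many samples from a population with a known mean, build an interval from each sample, and count how often the interval contains the true mean. The sketch below makes simplifying assumptions that are not from the deck: a normal population, a known standard deviation (so a z-interval is used), and hypothetical values for mu, sigma, and the sample size.

```python
import random
import statistics

def coverage(n_trials=2000, n=30, mu=50.0, sigma=5.0, seed=1):
    """Share of 95% z-intervals that contain the true mean mu."""
    rng = random.Random(seed)
    z = statistics.NormalDist().inv_cdf(0.975)   # about 1.96
    half_width = z * sigma / n ** 0.5            # sigma assumed known
    hits = 0
    for _ in range(n_trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = statistics.fmean(sample)
        if x_bar - half_width <= mu <= x_bar + half_width:
            hits += 1
    return hits / n_trials

rate = coverage()
print(rate)  # expected to be close to 0.95
```

Each individual interval either contains the true mean or does not; the 95% describes the long-run success rate of the procedure.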
Does correlation imply causation?
No.
When referring to Project Y’s, what statements are typically true?
- Based on a sample of the population
- Focused on establishing cause and effect relationships
- Aim to create continuous Project Y’s.