Data Analysis Flashcards

(38 cards)

1
Q

What is the difference between a bad data point and an outlier?

A

Bad data point: An observation that contains invalid or inaccurate data and may or may not be a statistical outlier.
Outlier: An observation that lies outside the overall pattern of a distribution.
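
A minimal sketch of one common way to flag candidate outliers. The card does not name a rule, so Tukey's 1.5 × IQR fences are an assumption here, and `iqr_outliers` and the readings are hypothetical. A flagged value may be a bad data point or a genuine extreme observation; the rule alone cannot tell which.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences, an assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical readings; 95 sits far outside the overall pattern.
readings = [10, 12, 11, 13, 12, 11, 10, 95]
flagged = iqr_outliers(readings)
```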

2
Q

What are some causes of poor data quality?

A
  • Manual entry errors
  • Duplicate entries
  • Query issues (nulls, blanks, white space, data formats)
3
Q

What are the benefits of using experimental data rather than historical data?

A
  • Better define the measurement system
  • Create the factor settings to analyze
  • Establish cause and effect
  • Structured to reduce variation and enable analysis
4
Q

What are some issues with historical data?

A
  • Often lacks precision
  • Includes Project Y data but not the X settings needed
  • Does not establish cause and effect
5
Q

When does it make sense to use historical data?

A
  • Access to sufficient data
  • Cost prohibitive to conduct an experiment
6
Q

When does it make sense to use experimental data?

A
  • No historical data exists
  • Need to define the relationship between inputs and outputs
  • Identify and measure interactions between the ‘Vital Few’ sources of variation
  • Determine the best set-up conditions of X’s for improved Y performance
  • Cost-justified
7
Q

What is Data Analysis?

A

Using statistical methodology and tools to discover useful information and make better business decisions.

8
Q

When does it make sense to use data analysis?

A
  • Establish whether there is a relationship between an output and a suspected variable
  • Determine the optimal setting for a confirmed variable
  • Quantify the impact of controlling a confirmed variable
9
Q

What are some key outcomes of the Data Analysis skillset?

A
  • 4S/GAP Methodology
  • Identifying Transfer Function(s)
  • Basic Statistical Analyses
  • Recommendations to the business based on data
10
Q

What is the difference between discrete and continuous data?

A

Discrete (attribute) data take a finite or countable set of possible values, such as counts or categories. Continuous (variable) data are measured along a continuous scale of values, such as time or weight.

11
Q

Why do we care what type of data we have?

A
  • Different analysis tools apply to discrete and continuous data
  • Continuous data typically requires fewer data points
  • Continuous data provides a more complete picture of variation in a process
12
Q

What does y = f(x) represent and mean?

A

Transfer function. The relationship that explains y in terms of x(s).

13
Q

Describe the approach we use to analyze data at Penske.

A

4S (Stability, Shape, Spread, ‘S’enter) and GAP (Graphical, Analytical, Practical)

14
Q

What is the difference between descriptive statistics and inferential statistics?

A

Descriptive: Analyzing data without drawing conclusions about a larger group. Inferential: Analyzing data from a sample group to draw conclusions about the population.

15
Q

What is sampling?

A

The process of collecting a subset of the data and drawing conclusions about the total population from the subset.
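
A minimal sketch of simple random sampling, assuming a made-up population of delivery times (the names and numbers are hypothetical). It draws a subset without replacement and uses the sample mean as an estimate of the population mean.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 delivery times in minutes.
population = [rng.gauss(30, 5) for _ in range(10_000)]

# Simple random sample without replacement.
sample = rng.sample(population, k=200)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)  # estimate of the population mean
```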

16
Q

What statistics do we use to look at variation and when do we use each?

A
  • Standard deviation - for normally distributed data
  • IQR (interquartile range) - for non-normal data
17
Q

What statistics do we use to look at central tendency and when do we use each?

A
  • Mean (average) - for normally distributed data
  • Median (middle value) - for non-normal data
  • Mode (most frequently occurring value) - not used in hypothesis testing
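
A small sketch of the center and spread statistics from cards 16 and 17, using Python's standard `statistics` module. The helper `summarize` and the skewed data set are hypothetical; the point is that one extreme value pulls the mean and standard deviation far more than the median and IQR.

```python
import statistics

def summarize(values):
    """Center and spread statistics for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "iqr": q3 - q1,                     # interquartile range
    }

# Hypothetical right-skewed data: one large value dominates.
skewed = [1, 2, 2, 3, 3, 3, 4, 50]
summary = summarize(skewed)
```

Here the mean (8.5) sits well above the median (3), which is why the median and IQR are the usual choice for non-normal data.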
18
Q

What core question are we answering when using a hypothesis test for Spread or ‘S’enter?

A

Is there a statistically significant difference?

19
Q

What is a p-value?

A
  • The probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true
  • Informally, the risk of being wrong when claiming a difference; strictly, it is not the probability that the null hypothesis is true
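
One way to make the definition concrete is a permutation test, sketched below. This technique is an assumption for illustration, not necessarily the test the deck has in mind; `permutation_p_value` and both data sets are hypothetical. It shuffles the group labels many times and counts how often the shuffled difference in means is at least as extreme as the observed one.

```python
import random
import statistics

def permutation_p_value(a, b, n_resamples=5_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # relabel the observations at random
        diff = abs(statistics.fmean(pooled[:len(a)]) - statistics.fmean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_resamples + 1)

# Hypothetical cycle times before and after a process change.
before = [31.2, 29.8, 30.5, 32.1, 30.9, 31.7, 29.9, 31.0]
after = [28.1, 27.5, 28.9, 27.2, 28.4, 27.9, 28.8, 27.6]
p_value = permutation_p_value(before, after)
```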
20
Q

What is power?

A

The probability that a test detects a difference when one truly exists (1 − β).
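
Power can be estimated by simulation, as in this rough sketch. It assumes a two-sample z-test with a known standard deviation, which is a simplification chosen for illustration; `simulated_power` and its default values are hypothetical.

```python
import random
import statistics

def simulated_power(effect, sigma=1.0, n=20, alpha=0.05, trials=2_000, seed=3):
    """Share of simulated experiments in which a two-sided z-test rejects the null."""
    rng = random.Random(seed)
    z_crit = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    std_error = sigma * (2 / n) ** 0.5  # standard error of the difference in means
    rejections = 0
    for _ in range(trials):
        mean_a = statistics.fmean(rng.gauss(0.0, sigma) for _ in range(n))
        mean_b = statistics.fmean(rng.gauss(effect, sigma) for _ in range(n))
        if abs(mean_b - mean_a) / std_error > z_crit:
            rejections += 1
    return rejections / trials

power_no_effect = simulated_power(0.0)    # close to alpha: only false alarms
power_real_effect = simulated_power(1.0)  # high: a true difference is usually detected
```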

21
Q

What are confidence intervals and why do we use them?

A

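A range of plausible values for a population parameter, reflecting the amount of variation we can expect in our estimates. We use them because estimates come from samples rather than the whole population.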

22
Q

What does a 95% confidence interval mean?

A

If the sampling were repeated many times and a 95% confidence interval computed from each sample, about 95% of those intervals would contain the true population value.
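
The usual reading of "95%" is about repeated sampling: across many samples, about 95% of the intervals built this way contain the true population value. The sketch below checks that by simulation, assuming a normal population with a known standard deviation (a z-interval); `ci_coverage` and its parameters are hypothetical.

```python
import random
import statistics

def ci_coverage(true_mean=100.0, sigma=15.0, n=30, trials=2_000, seed=1):
    """Fraction of 95% z-intervals that contain the true mean."""
    rng = random.Random(seed)
    z = statistics.NormalDist().inv_cdf(0.975)  # about 1.96
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(trials):
        sample_mean = statistics.fmean(rng.gauss(true_mean, sigma) for _ in range(n))
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / trials

coverage = ci_coverage()  # expected to land near 0.95
```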

23
Q

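Does correlation imply causation?

A

No. Correlation shows that two variables move together; it does not by itself establish that one causes the other.
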
24
Q

When referring to Project Y’s, what statements are typically true?

A
  • Based on a sample of the population
  • Focused on establishing cause and effect relationships
  • Aim to create continuous Project Y’s.
25
Q

What does a Stability test practically tell you?

A

Whether you can trust your measure of ‘S’enter to be representative of the population over time.

26
Q

What does GAP help us do?

A

Provides a standard way to look at data and answers the hypothesis test question, ‘Is there a difference?’

27
Q

What main thing do we have to take into account when collecting sample data?

A

The sample must be representative of the population.

28
Q

When should you sample?

A
  • Collecting all data is impractical
  • Too costly or time-consuming
  • Measuring a high-volume process
29
Q

When should you not sample?

A

When a subset of the data cannot accurately depict the process.

30
Q

What is the Central Limit Theorem?

A

If you take repeated samples from a population, the means of those samples will be approximately normally distributed, regardless of the shape of the original distribution, provided the sample size is large enough.

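
A quick simulation sketch of the idea, assuming a strongly right-skewed exponential population (the population, helper names, and sample sizes are hypothetical). Skewness near zero is used here as a rough indicator of a symmetric, bell-like shape; it is not a full normality test.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is repeatable

def sample_means(n, trials=3_000):
    """Means of `trials` samples of size n from an exponential population."""
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
            for _ in range(trials)]

def skewness(values):
    """Population skewness: near 0 for a symmetric distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - mean) / sd) ** 3 for v in values)

skew_n1 = skewness(sample_means(1))    # single observations: strongly skewed
skew_n50 = skewness(sample_means(50))  # means of 50: much closer to symmetric
```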
31
Q

Practically, what does a p-value tell us?

A

Whether an observed difference is statistically significant, that is, unlikely to be due to chance alone. Whether that difference truly makes a difference to the business is a separate, practical judgment.

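32
Q

If your p-value is 0.05, would you fail to reject the null or reject the null?

A

Fail to reject the null, assuming a significance level (alpha) of 0.05 and the rule that the null is rejected only when the p-value is less than alpha.
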
33
Q

If I do not see a difference when completing my hypothesis test, what should I do?

A

Run a power analysis to determine how likely I would have been to see a difference if one existed.

34
Q

What is the difference between common cause and special cause variation?

A
  • Common: always present, expected, predictable, usual
  • Special: not always present, unexpected, unpredictable, unusual
35
Q

Why do we conduct MSAs?

A

A Measurement System Analysis (MSA) ensures that the variation seen in the data comes from the process and not from measurement error.

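36
Q

What are residuals and why do we check them?

A

Residuals are the differences between the actual response values and the fitted values; they estimate the error the model cannot explain. We check them to confirm the model fits adequately, for example that they show no obvious pattern.
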
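
A minimal sketch of computing residuals from a straight-line fit, assuming ordinary least squares with one input. The helper `fit_line` and the data are hypothetical.

```python
import statistics

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical input (x) and output (y) measurements.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]  # actual minus fitted
```

With an intercept in the model, least-squares residuals sum to zero, so the useful check is whether they show a pattern (trend, curvature, changing spread) rather than their total.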
37
Q

What is r and what does it tell us?

A

The correlation coefficient. It measures the strength and direction of the linear relationship between two variables and ranges from -1 to +1.

38
Q

What is r²?

A

The coefficient of determination; the proportion of variation in the output explained by the input(s).

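
A short sketch of r and r² for a single input, assuming the Pearson correlation coefficient. The helper `pearson_r` and the data are hypothetical. Squaring r gives the coefficient of determination only in simple linear regression with one input; with several inputs it is computed from the model's sums of squares instead.

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical input (x) and output (y) measurements.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)
r_squared = r ** 2  # share of variation in y explained by x (one-input case)
```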