Statistics Flashcards

1
Q

Population vs samples and parameters vs statistics

A

First step is to find out whether you are dealing with a population or a sample

Population:
All items of interest
Size denoted with N
Numbers obtained are called parameters

Sample:
Subset of the population
Size denoted with n (lower case)
Numbers obtained are called statistics

Populations are hard to define and hard to observe in real life

Samples, however, are less time-consuming and less costly to collect

2
Q

Randomness vs. representativeness

A

Randomness –> A random sample is collected when each member of the sample is chosen from the population strictly by chance

A sample is not random when a large portion of the population did not have the chance to be chosen

Representativeness –> A representative sample is a subset of the population that accurately reflects the members of the entire population

3
Q

Which types of data can we define along with their subcategories?

A

Categorical
- Categories, groups
- Yes/No questions

Numerical –> Represents numbers
- Discrete numbers –> Countable values, usually integers, like the number of children you will have
- Continuous numbers –> Infinitely many possible values, impossible to count –> Like weight, which we can only record as a rounded number

4
Q

What are the measurement levels of the data type categories?

A

Qualitative data
- Nominal –> Categories that cannot be put in any order
- Ordinal –> Follow a strict order –> Rating your lunch from 1 to 5 stars, for example

Quantitative data
- Interval –> Does not have a true zero, like temperature in Celsius or Fahrenheit (unlike Kelvin, which does have a true zero)
- Ratio –> Has a true zero, like distance or time

5
Q

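What is relative frequency in a histogram?

A

Relative frequency –> Share of all observations that falls in each interval (frequency of the interval / total number of observations), usually shown as a percentage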
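
A minimal Python sketch of how relative frequencies per interval could be computed. The function name, the bin width and the `ages` data are assumptions made up for illustration; they do not come from the cards.

```python
from collections import Counter

def relative_frequencies(values, bin_width):
    """Map each interval's lower bound to its share of all observations."""
    # Assign every value to the interval [k * bin_width, (k + 1) * bin_width).
    counts = Counter((v // bin_width) * bin_width for v in values)
    total = len(values)
    return {start: count / total for start, count in sorted(counts.items())}

# Hypothetical data, invented for illustration only.
ages = [1, 4, 7, 12, 15, 18, 19, 23, 25, 38]
freqs = relative_frequencies(ages, bin_width=10)

for start, share in freqs.items():
    print(f"[{start}, {start + 10}): {share:.0%}")
```

The shares always add up to 1 (100%), which is what makes histograms of datasets with different sizes comparable.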

6
Q

When are scatter plots used?

A

Scatter plots
Used when we are representing two numerical variables

Example:
Horizontal axis –> Reading scores
Vertical axis –> Writing scores
Both axes are numerical

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What is an outlier?

A

Data point that goes against the logic of the whole dataset

8
Q

Define mean

A

Simple average
Denoted with μ for a population
x̄ for sample

Downside: Easily disturbed by an outlier!

9
Q

Define median

A

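Middle number of the ordered dataset
Position of the median: (n+1) / 2 –> With an even number of observations, take the average of the two middle numbers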
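
A small Python sketch of the (n+1) / 2 position rule. The helper `median_by_position` and the `prices` data are hypothetical, written only to illustrate the rule; for an even n it assumes the usual convention of averaging the two middle values.

```python
import statistics

def median_by_position(values):
    """Find the median via the (n + 1) / 2 position rule."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2  # 1-based position in the ordered list
    if position.is_integer():
        return ordered[int(position) - 1]
    # Even n: the position ends in .5, so average the two neighbours.
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2

# Hypothetical prices with one outlier, invented for illustration.
prices = [1, 2, 3, 4, 100]
median_price = median_by_position(prices)  # position (5 + 1) / 2 = 3
mean_price = statistics.mean(prices)       # pulled upwards by the outlier
```

Here the median stays at the third value while the mean is dragged towards the outlier, which is why the median is often preferred for skewed data.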

10
Q

Define mode

A

Value that occurs most often

When each value appears only once –> We say there is NO mode

11
Q

What is skewness and what does it indicate?

A

Skewness indicates whether the data is concentrated on one side

12
Q

Right skew vs left skew

A

Right skew:
The mean is bigger than the median –> mean > median

The outliers are to the right

Mode –> Highest point in graph

Left skew:
mean < median

Outliers are to the left

13
Q

What does variance measure?

A

Variance measures the dispersion of a set of data points around their mean value

14
Q

Why do we square the differences when calculating the variance?

A

We always get non-negative values, so positive and negative differences do not cancel each other out

It amplifies the effect of large differences

15
Q

Population variance vs sample variance

A

Population variance: σ² = ∑ (xi - μ)² / N

Sample variance: s² = ∑ (xi - x̄)² / (n - 1)

Note: the sample formula uses x̄ and n - 1 instead of μ and N
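
A short Python sketch contrasting the two divisors, N for a population and n - 1 for a sample. The dataset is made up for illustration; `statistics.pvariance` and `statistics.variance` are the standard-library counterparts and are used here only as a cross-check.

```python
import statistics

# Hypothetical data, invented for illustration only.
data = [1, 2, 3, 4, 5]

mean = sum(data) / len(data)
squared_diffs = sum((x - mean) ** 2 for x in data)

# Treating the data as the whole population: divide by N.
population_variance = squared_diffs / len(data)

# Treating the data as a sample: divide by n - 1.
sample_variance = squared_diffs / (len(data) - 1)

print(population_variance, statistics.pvariance(data))
print(sample_variance, statistics.variance(data))
```

The sample variance is always the larger of the two, because it divides the same sum by a smaller number.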

16
Q

Population standard deviation vs sample standard deviation

A

Population standard deviation –> σ = √(σ²)

Sample standard deviation –> s = √(s²)

17
Q

What is the coefficient of variation?

A

Relative standard deviation: Standard deviation / mean

Population: Cv = σ / μ

Sample: Cv = s / x̄

18
Q

Why use coefficients of variation?

A

Standard deviation is the most common measure of variability for a single dataset

The coefficient of variation is a much better measure for comparing two datasets, because it has no unit of measurement
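
A Python sketch of why the coefficient of variation helps when comparing datasets. The same hypothetical prices are expressed in two units; the data and names are assumptions for illustration only.

```python
import statistics

def coefficient_of_variation(sample):
    """Sample standard deviation divided by the sample mean."""
    return statistics.stdev(sample) / statistics.mean(sample)

# Hypothetical prices, once in dollars and once in cents.
prices_usd = [1.0, 2.0, 3.0, 4.0, 5.0]
prices_cents = [p * 100 for p in prices_usd]

# The standard deviation changes with the unit of measurement...
stdev_usd = statistics.stdev(prices_usd)
stdev_cents = statistics.stdev(prices_cents)

# ...but the coefficient of variation does not.
cv_usd = coefficient_of_variation(prices_usd)
cv_cents = coefficient_of_variation(prices_cents)
```

The standard deviations differ by a factor of 100, while both coefficients of variation come out the same.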

19
Q

What is Covariance?

A

2-dimensional

Unlike the formulas for population variance and sample variance, the covariance formula also contains a y-component

Otherwise the formula has the same form for the population and the sample

Notice the sigma and s are NOT squared in the notation

Population: Cov(x,y) = σ(xy)

Sample: Cov(x,y) = s(xy)

20
Q

Covariance formula?

A
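
The answer to this card is missing from the export (it was probably an image). Assuming the standard textbook definitions, which match the description in the previous card, the population and sample covariance can be written as:

```latex
\sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}
\qquad
s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}
```

As with the variance, the sample version divides by n - 1 instead of N.
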
21
Q

Covariance meaning?

A

It gives a sense of direction in which the two variables are heading

> 0 means the two variables move together

<0 means the two variables move in opposite directions

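= 0 means there is no linear relationship between the two variables (independent variables always have a covariance of 0, but a covariance of 0 does not guarantee independence)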

22
Q

What does correlation do?

A

Adjusts covariance, so that the relationship between the two variables becomes easy and intuitive to interpret

This is either the sample or the population correlation, depending on the data you are working with

23
Q

How to calculate correlation coefficient?

A

Cov(x,y) = σ(xy)

Population: σ(xy) / ( σ(x) * σ(y) )

Sample: s(xy) / ( s(x) * s(y) )
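
A Python sketch of the sample version of this formula, written out by hand. The function names and the reading/writing scores are hypothetical; the sample covariance is assumed to use the usual n - 1 divisor.

```python
import math

def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def sample_correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))  # covariance of x with itself is its variance
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

# Hypothetical scores, invented for illustration only.
reading = [1, 2, 3, 4, 5]
writing = [2, 4, 6, 8, 10]   # moves perfectly together with reading
skipping = [10, 8, 6, 4, 2]  # moves perfectly opposite to reading

r_together = sample_correlation(reading, writing)
r_opposite = sample_correlation(reading, skipping)
```

Swapping the two arguments gives the same result, since the covariance in the numerator and the product in the denominator are both symmetric.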

24
Q

How to interpret correlation?

A

The correlation coefficient is always between -1 and 1

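1 –> Perfect positive linear relationship: the entire variability of one variable is explained by the other

Almost 1 –> Strong positive relationship between the 2 variables

0 –> No linear relationship between the 2 variables

Negative correlation –> When one variable increases, the other tends to decrease (-1 is a perfect negative linear relationship)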

25
Q

Is the correlation between X and Y the same as the correlation between Y and X?

A

Yes.
Hence: σ(xy) / ( σ(x) * σ(y) )
Where σ(xy) is the same as σ(yx), and the order of σ(x) and σ(y) in the product does not matter

26
Q

What is causality?

A

Causation indicates that one event is the result of the occurrence of the other event; i.e. there is a causal relationship between the two events. This is also referred to as cause and effect.

It is important to understand the direction of causal relationships

27
Q

When are correlations disregarded?

A

It is common practice to disregard correlations below 0.2 (in absolute value)

28
Q

How to calculate the Z-score

A

Z = (Y - μ) / σ
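
A Python sketch that applies this formula to every value in a dataset. The data and the helper name `standardize` are assumptions for illustration; the data is treated as a full population, so μ and σ are the population mean and standard deviation.

```python
import statistics

def standardize(values):
    """Turn every value Y into Z = (Y - mu) / sigma."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(y - mu) / sigma for y in values]

# Hypothetical data, invented for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]
z_scores = standardize(values)

# After standardizing, the mean is 0 and the standard deviation is 1.
z_mean = statistics.fmean(z_scores)
z_stdev = statistics.pstdev(z_scores)
```

Each Z-score says how many standard deviations a value lies above or below the mean.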

29
Q

What is the central limit theorem?

A

In probability theory, the central limit theorem (CLT) establishes that, in many situations, for independent and identically distributed random variables, the sampling distribution of the standardized sample mean tends towards the standard normal distribution even if the original variables themselves are not normally distributed.

30
Q

When do we speak of a sampling distribution?

A

A sampling distribution is a probability distribution of a statistic obtained from a larger number of samples drawn from a specific population. The sampling distribution of a given population is the distribution of frequencies of a range of different outcomes that could possibly occur for a statistic of a population.

31
Q

How to denote the sampling distribution?

A

The sampling distribution of the sample mean is denoted:
x̄ ~ N(μ, σ²/n)

This leads to the insight:

The bigger the sample size, the smaller the variance of the sample mean and the more accurate the results are

32
Q

What does the CLT allow us to do?

A

Make inferences using the normal distribution, even when the population is not normally distributed

33
Q

Standard error: Definition and formula

A

Standard deviation of the distribution formed by the sample means, which is:

√(σ²/n) = σ/√n

Means that:

Error decreases when sample size increases
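
A tiny Python sketch of how the standard error shrinks with sample size. The value σ = 10 and the sample sizes are made up for illustration.

```python
import math

def standard_error(sigma, n):
    """Standard deviation of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

# Hypothetical population standard deviation and sample sizes.
sigma = 10
errors = {n: standard_error(sigma, n) for n in (25, 100, 400)}

for n, se in errors.items():
    print(f"n = {n}: standard error = {se}")
```

Because of the square root, the sample has to become four times as large to cut the standard error in half.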

34
Q

Why is the standard error important?

A

Important because it is used in most statistical tests –> It shows how well you approximated the true mean

35
Q

What is an estimate?

A

An approximation based on sample information

36
Q

Which types of estimates can we distinguish?

A

Two types of estimates

Point estimates –> Single number
Confidence intervals –> Interval

Relation –> Point estimate is exactly in the middle of the confidence interval

Confidence intervals do provide much more information though

37
Q

How are x̅ and S² defined as estimates?

A

The sample mean (x̄) is a point estimate of the population mean (μ).

The sample variance (s²) is a point estimate of the population variance (σ²).

38
Q

Which two properties does an estimate have?

A

Bias
Efficiency

The goal is always to look for an unbiased estimator that is as efficient as possible

39
Q

Characteristics of an unbiased estimator?

A

Expected value = population parameter

x̄ has an expected value of μ

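Counterexample (biased estimator): Someone claims that you find the average height of Americans by taking a sample mean and adding a foot to it.

The expected value of x̄ + 1 ft is μ + 1 ft, not μ –> The bias is 1 ft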

40
Q

What is the most efficient estimator?

A

The most efficient estimator is the unbiased estimator with the smallest variance

41
Q

What is the confidence interval?

A

Range within which you expect the population parameter to be, with a stated level of confidence

42
Q

How is the confidence level denoted?

A

Denoted as 1 - α

α is a value between 0 and 1
If the confidence level is 95% then α is 5%

43
Q

How is the confidence interval denoted?

A

[ x̅ - Z(α/2) * (σ/√n), x̅ + Z(α/2) * (σ/√n) ]

44
Q

Case: Calculate the confidence interval (95%) from:

With a x̅ (sample mean) of 100200
And σ = 15000
And n = 30

A

α is then 0.05 –> Divided by 2 is 0.025

Then you have to look up the critical value Z(0.025)

In the Z-table you look up the value 1 - 0.025 = 0.975

This value sits in row 1.9 and column 0.06

Z(0.025) is therefore 1.9 + 0.06 = 1.96

Substitute the values in the formula:

100200 ± 1.96 * (15000/√30) = 100200 ± 5368

[94832, 105568]

Interpretation:

We are 95% confident that the average data scientist salary will be in the interval [94832, 105568]
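
A Python sketch that recomputes this case. Instead of a printed Z-table it uses `statistics.NormalDist` from the standard library to find the critical value; the function name `z_confidence_interval` is a hypothetical helper, not something from the cards.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for the mean when the population sigma is known."""
    alpha = 1 - confidence
    # Critical value Z(alpha/2): the point with 1 - alpha/2 of the area to its left.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Values from the case: sample mean 100200, sigma 15000, n 30.
low, high = z_confidence_interval(100200, 15000, 30)
print(round(low), round(high))
```

The bounds can differ by a unit or so from a hand calculation, because the table value 1.96 is itself rounded.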

45
Q

How useful are different confidence levels?

A

100% is useless –> Range is too big, because it has to include every possible value

99% –> Very wide interval, so often not insightful enough

5% –> Confidence is too low to be meaningful

95% is the accepted norm!

46
Q

Characteristics Student’s T

A

Small sample size approximation of a Normal Distribution

You use this when there is not enough data to rely on the normal distribution

Graph is also bell shaped but with fatter tails to accommodate the occurrence of values far away from the mean

Another key difference is that apart from mean and variance you must also define the degrees of freedom for the distribution

47
Q

What is the T-statistic

A

Just as the Z-statistic is related to the normal distribution

The T-statistic is related to the T distribution

48
Q

How to calculate the T-statistic?

A

T(n-1),α = (x̅ - µ) / (s / √n)

–> Approximation of the normal distribution

49
Q

How to find the T-statistic in a T-table?

A

Hence:
T(n-1), α = (x̅ - µ) / (s / √n)

With a sample of n observations –> We have n-1 degrees of freedom. So for 20 observations, the degrees of freedom is 19

The T-table:

Vertical axis: degrees of freedom
Horizontal axis: α

Note that after the 30th row the numbers don’t differ much from those in the Z-table

50
Q

Finding confidence interval for Student’s T distribution for known population variance and unknown population variance?

A

Unknown variance:
[ x̅ - T(n-1,α/2) * (S/√n), x̅ + T(n-1,α/2) * (S/√n) ]

Known variance:
[ x̅ - Z(α/2) * (σ/√n), x̅ + Z(α/2) * (σ/√n) ]

All we have to do is find the T-statistic in the table

51
Q

Is T-statistic related to the Z-statistic

A

Just as the Z-statistic is related to the normal distribution

The T-statistic is related to the T distribution

52
Q

How will the confidence interval change when we know the population variance?

A

When we know the population variance we get a narrower confidence interval. When we do not know the population variance there is higher uncertainty, so the interval is wider.

So: When we don’t know the population variance we can still make predictions, though they are less precise!

53
Q

How is Margin of Error defined?

A

ME = Reliability Factor * (σ/√n)

Meaning:
Higher reliability factor or standard deviation –> Higher margin of error

Bigger margin of error –> Wider confidence interval

Smaller margin of error –> Narrower confidence interval

Higher sample size will decrease the margin of error and vice versa

54
Q

Margin of Error for known and unknown population variance

A

Known population variance:
Margin of error –> Z(α/2) * (σ/√n)

Unknown population variance:
Margin of error –> T(n-1,α/2) * (S/√n)

55
Q

How can you define the confidence intervals with the margin of error?

A

x̄ ± ME

56
Q

What happens with a smaller margin of error?

A

Narrower confidence interval

57
Q

What is an example of two datasets, with two means, that are dependent samples from each other

A

Studying a person’s weight loss –> Same person

Habits of husbands and wives –> Coincide with each other

58
Q

Difference between dependent and independent samples

A

Dependent:

The observations in the two samples are linked: a before and after situation, or cause and effect

Tested with confidence intervals for dependent samples

Or with statistical methods like regressions

Independent, three cases are distinguished:

When population variance is known

Population variance is unknown but assumed to be equal

Population variance unknown but assumed to be different

59
Q

How to calculate confidence intervals for dependent samples?

A

We use d̄ instead of x̄

We calculate d̄ by taking the before and after difference for each observation and then taking the mean of those differences

You can use the T-statistic to build the confidence interval, with s(d) the standard deviation of the differences:

[ d̄ - T(n-1,α/2) * (s(d)/√n), d̄ + T(n-1,α/2) * (s(d)/√n) ]

Example of application: 10 patients test a medication, leading to before and after results. The differences between these results have a certain mean, which is defined as d̄.
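
A Python sketch of this calculation for 10 patients. The before and after measurements are invented for illustration, and the critical value 2.262 is the T-table entry for 9 degrees of freedom and α/2 = 0.025; it is passed in by hand because the standard library has no T-distribution.

```python
import math
import statistics

def paired_confidence_interval(before, after, t_critical):
    """Confidence interval for the mean difference of dependent samples."""
    differences = [a - b for b, a in zip(before, after)]
    n = len(differences)
    mean_diff = statistics.mean(differences)  # mean of the differences
    s_d = statistics.stdev(differences)       # sample std of the differences
    margin = t_critical * s_d / math.sqrt(n)
    return mean_diff - margin, mean_diff + margin

# Hypothetical before/after results for 10 patients.
before = [10, 12, 9, 11, 13, 10, 12, 11, 9, 13]
after = [12, 13, 12, 12, 15, 13, 13, 14, 10, 16]

# T-table value for n - 1 = 9 degrees of freedom and alpha/2 = 0.025.
low, high = paired_confidence_interval(before, after, t_critical=2.262)
print(round(low, 2), round(high, 2))
```

With this made-up data the whole interval lies above 0, which would suggest that the results really did go up after the medication.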

60
Q

Considerations for using either the Z or T-statistic

A

Sample size –> Big / Small

Are the population variances known –> Yes / No

Distribution type? –> Normal?

In case of Big sample size, known population variance and normal distribution –> Use the Z statistic

61
Q

How to calculate the variance of the difference between the means of two INDEPENDENT data sets with variance KNOWN?

A

σ²(diff) = σ(1)² / n(1) + σ(2)² / n(2)

62
Q

What is the confidence interval for two INDEPENDENT data sets with variance KNOWN?

A

( x̄ - ȳ ) ± Z(α/2) * √( σ(1)² / n(1) + σ(2)² / n(2) )

63
Q

What is the confidence interval for two INDEPENDENT data sets with variance UNKNOWN but assumed to be equal? And what is an example of a case like this?

A

In this case you use what is called the pooled variance formula

s(p)² = ( (n(x) - 1) * s(x)² + (n(y) - 1) * s(y)² ) / ( n(x) + n(y) - 2 )

Calculate the interval by using the T-statistic:

( x̄ - ȳ ) ± T(n(x)+n(y)-2, α/2) * √( s(p)² / n(x) + s(p)² / n(y) )

Example: You have 2 datasets but the sample size is not the same.
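
A Python sketch of the pooled variance and the resulting confidence interval for two samples of different size. The data and function names are assumptions for illustration. The interval uses the standard form (x̄ - ȳ) ± T * √(s(p)²/n(x) + s(p)²/n(y)), and the critical value 2.447 is the T-table entry for 5 + 3 - 2 = 6 degrees of freedom and α/2 = 0.025.

```python
import math
import statistics

def pooled_variance(sample_x, sample_y):
    """Weighted average of the two sample variances."""
    n_x, n_y = len(sample_x), len(sample_y)
    var_x = statistics.variance(sample_x)
    var_y = statistics.variance(sample_y)
    return ((n_x - 1) * var_x + (n_y - 1) * var_y) / (n_x + n_y - 2)

def pooled_confidence_interval(sample_x, sample_y, t_critical):
    """Confidence interval for the difference between two means."""
    n_x, n_y = len(sample_x), len(sample_y)
    s_p2 = pooled_variance(sample_x, sample_y)
    diff = statistics.mean(sample_x) - statistics.mean(sample_y)
    margin = t_critical * math.sqrt(s_p2 / n_x + s_p2 / n_y)
    return diff - margin, diff + margin

# Hypothetical samples of unequal size.
sample_x = [1, 2, 3, 4, 5]
sample_y = [2, 4, 6]

s_p2 = pooled_variance(sample_x, sample_y)
low, high = pooled_confidence_interval(sample_x, sample_y, t_critical=2.447)
```

With this made-up data the interval contains 0, so these two tiny samples give no evidence that the means differ.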

64
Q

Explain the usage of the T-statistic for two INDEPENDENT data sets with variance UNKNOWN

A

The degrees of freedom are equal to the total sample size minus the number of sample means that had to be estimated

Normally this would be n-1 because you had 1 sample, so 1 estimated mean

Because in this case you have 2 samples, there are 2 estimated means

Degrees of freedom is then: sample size 1 + sample size 2 - 2

65
Q

What is the interpretation of calculating the confidence interval when comparing two datasets?

A

Interpretation:

We are 95% confident that the true difference between the means of set A and set B lies in the interval (a, b)

66
Q

What are the steps when comparing 2 different groups?

A

Find out whether sets are independent or not

Find out whether the population variance is known, or unknown but assumed to be equal

In the latter case, calculate the pooled variance with the corresponding formula

You will get a confidence interval for every possible shoe size

67
Q

Name the two hypotheses types

A

Null hypothesis –> Denoted with H0 (subscript 0)

Alternative hypothesis –> Denoted with H1 or Ha

Null hypothesis:
Is like innocent until proven guilty
H0 is treated as true until rejected
The equality sign (=, ≤ or ≥) always needs to be in the H0 hypothesis

68
Q

How is α related to the null hypothesis?

A

Significance level. Defined as: The probability of rejecting the null hypothesis, if it’s true

69
Q

Steps for testing a hypothesis?

A
  1. Calculate a statistic (like x̄)
  2. Scale it with Z = (x̄ - µ0) / (s / √n), where µ0 is the hypothesized mean
  3. Check if Z is in the rejection region. Check whether the test is one or two-sided –> The number to look up for α depends on this.

Z is a point on the horizontal axis of the standard normal distribution. For a two-sided test with α = 0.05, look up 1 - α/2 = 0.975 in the Z-table and add the number on the left side (1.9) and the number on the top side (0.06) to get the critical value 1.96. Then check whether Z falls outside the range from -1.96 to 1.96; if it does, it is in the rejection region.
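
A Python sketch of these three steps for a two-sided test. The sample values and the function name `two_sided_z_test` are hypothetical, and `statistics.NormalDist` stands in for the printed Z-table. A one-sided test would look up α instead of α/2 and check only one tail.

```python
import math
from statistics import NormalDist

def two_sided_z_test(sample_mean, mu_0, std_dev, n, alpha=0.05):
    """Return the Z-score, the critical value and the reject decision."""
    # Step 2: scale the statistic with the standard error.
    z = (sample_mean - mu_0) / (std_dev / math.sqrt(n))
    # Step 3: the rejection region lies beyond +/- Z(alpha/2).
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    reject = abs(z) > critical
    return z, critical, reject

# Hypothetical numbers: sample mean 52, hypothesized mean 50,
# standard deviation 5, sample size 25.
z, critical, reject = two_sided_z_test(52, 50, 5, 25)
print(z, round(critical, 2), reject)
```

In this made-up case Z = 2.0 lies just beyond the critical value of about 1.96, so H0 would be rejected at the 5% level.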

70
Q

What is a Type 1 Error and what is a Type 2 Error?

A

Type I error:
When you reject a true null hypothesis

Also called a false positive

Probability: α

Type II error:
When you accept (fail to reject) a false null hypothesis

Also called a false negative

Probability: β –> Depends mainly on sample size n and variance σ²

Probability of rejecting a false null hypothesis: 1 - β –> Also called the power of the test

71
Q

What does the accept/reject quadrant look like

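A

H0 is true and accept –> Correct decision
H0 is false and accept –> Type II error (false negative)
H0 is true and reject –> Type I error (false positive)
H0 is false and reject –> Correct decision
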
72
Q

Example: You are in love with a girl, unsure if she likes you back

H0 –> She does not like you back

Fill in the blanks in the quadrants

A

H0 is true and accept(Do nothing) –> You do nothing and save yourself the embarrassment

H0 is false and accept(Do nothing) –> Missed opportunity

H0 is true and reject(Invite her) –> Embarrassment

H0 is false and reject(Invite her) –> Favourable for all

73
Q

Describe the P-value

A

Smallest level of significance at which we can still reject the null hypothesis, given the observed sample statistic

Check whether the p-value falls below the chosen significance level α. If it does, you can reject the null hypothesis

74
Q

What if you can’t find extreme values in Z-table?

A

Round up to the closest value available

75
Q

When must the hypothesis be rejected?

A

When P-value < α

76
Q

How to find p-value in Z-table?

A

One sided: 1 minus the number from the Z-table

Two sided: (1 minus the number from the Z-table) times 2
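
A Python sketch of both rules, with `statistics.NormalDist().cdf` playing the role of the Z-table lookup. The function name is a hypothetical helper; the example uses Z = -2.44, the test statistic that appears in a later card.

```python
from statistics import NormalDist

def p_value_from_z(z, two_sided=True):
    """P-value of a Z-score, following the two Z-table rules."""
    # The Z-table lists the area to the left of a positive Z-score.
    table_value = NormalDist().cdf(abs(z))
    one_sided = 1 - table_value
    return 2 * one_sided if two_sided else one_sided

p_two_sided = p_value_from_z(-2.44)
p_one_sided = p_value_from_z(-2.44, two_sided=False)
print(round(p_two_sided, 3), round(p_one_sided, 3))
```

The two-sided result is roughly 0.015, in line with the p-value quoted for that test in the later card.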

77
Q

What statistic to use when population variance unknown

A

T-statistic

78
Q

What does D0 stand for?

A

Hypothesized value difference

78
Q

Decision rule for accept/reject when using T-score

A

Accept if: The absolute value of the T-score < critical value t

Reject if: The absolute value of the T-score > critical value t

79
Q

H0 : D0 >= 0 is the same as writing?

A

H0: µb - µa >=0

D0 = Hypothesized value difference

80
Q

Steps for testing hypothesis - 11 steps

A

  1. Formulate the hypotheses
  2. Calculate the sample mean
  3. Calculate the standard deviation
  4. Calculate the standard error
  5. Determine which statistic to use, based on:
     Small / Big sample
     Assumed distribution
     Variance known / unknown
  6. Calculate the T-score, which (in this case) is equal to T = (d̄ - µ0) / standard error
  7. Decide whether you want to fix a level of significance up front; if not, work with the p-value
  8. In the T-table you can see in which significance range the T-score falls (for example, α between 0.025 and 0.01)
  9. Use online software to determine the p-value exactly
  10. Apply the decision rule
     Accept if: p > α
     Reject if: p < α
  11. Then choose the level of significance for the study

81
Q

Say these are your hypotheses:

Hypothesis: H0 : µe - µm = -4%
Hypothesis: H1 : µe - µm ≠ -4%

What if you want to know whether the difference is higher or lower than -4%?

A

–> The sign of the test statistic can give you that information

Negative sign of statistic means it’s smaller than the hypothesized value –> In this case, Z = -2.44, thus the difference is likely lower than -4%, like -5% or -6%

Positive sign of statistic means it’s higher than the hypothesized value

81
Q

Independent samples
Case example: On average, management outperforms engineering by 4%

Set up Hypothesis

A

Hypothesis: H0 : µe - µm = -4%
Hypothesis: H1 : µe - µm ≠ -4%

Look at sample sizes –> Whether they are equal

Determine difference between means

Determine standard error of the difference: √ ( σe² / ne + σm² / nm )

Determine which statistic to use –> Z statistic
Big samples
Known variances

Find Z-score –> Z statistic formula: ( (x̄e - x̄m) - µ0 ) / standard error of the difference
Notice sometimes there’s no µ0, because the H0 states that something is smaller/bigger without giving the number –> In that case µ0 is 0

P-value from online software –> 0.015

Interpretation:
At 5% significance we reject the null hypothesis –> 0.015 < 0.05
We say: There is enough statistical evidence that the mean difference is NOT -4%

82
Q

What to do with independent samples, variance unknown but assumed to be equal

A

Use the pooled variance

83
Q

What to do when the null hypothesis states that the difference between two means is 0, but you still want to know whether there’s a difference at all?

A

Check whether the T-score is positive or negative

Positive sign of statistic means it’s higher than the hypothesized value, negative sign means it’s lower
