Statistics Flashcards

Question 1

Q

Population vs samples and parameters vs statistics

Answer

A

First step is to find out whether you are dealing with a population or a sample

Population:
All items of interest
Size denoted with N (upper case)
Numbers obtained are called parameters

Sample:
Subset of population
Size denoted with n (lower case)
Numbers obtained are called statistics

Populations are hard to define and hard to observe in real life

Samples, however, are less time-consuming and less costly to collect

Question 2

Q

Randomness vs. representativeness

Answer

A

Randomness –> Random sample is collected when each member of the sample is chosen from the population strictly by chance

A group is not random when a large portion of the group did not have the chance to be chosen

Representative –> Sample is a subset of the population that accurately reflects the members of the entire population

Question 3

Q

Which types of data can we define along with their subcategories?

Answer

A

Categorical
- Categories, groups
- Yes/No questions

Numerical –> Represents numbers
- Discrete numbers –> Countable, usually integers, like the number of children you will have
- Continuous numbers –> Infinitely many possible values, impossible to count –> Weight, for example, which we only ever record as a rounded number

Question 4

Q

What are the measurement levels of the data type categories?

Answer

A

Qualitative data
- Nominal –> Like categorical data
- Ordinal –> Follow a strict order –> Rating your lunch for example from 1 to 5 stars

Quantitative data
- Interval –> Does not have a true zero, like temperature in Celsius or Fahrenheit (unlike Kelvin, which has a true zero)
- Ratio –> Have a true zero like distance or time

Question 5

Q

What is the histogram relative frequency?

Answer

A

Relative frequency –> Frequency of an interval divided by the total number of observations, expressed as a percentage

Question 6

Q

When are scatter plots used?

Answer

A

Scatter plots
Used when we are representing two numerical variables

Example:
Horizontal axis –> Reading scores
Vertical axis –> Writing scores
Both axes are numerical

Question 7

Q

What is an outlier?

Answer

A

Data point that goes against the logic of the whole dataset

Question 8

Q

Define mean

Answer

A

Simple average
Denoted with μ for a population
x̄ for sample

Downside: Easily disturbed by an outlier!

Question 9

Q

Define median

Answer

A

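Middle number of the ordered dataset
Located at position (n+1) / 2; when n is even, take the average of the two middle values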

Question 10

Q

Define mode

Answer

A

Value that occurs most often

When each value appears only once –> We say there is NO mode
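
The three measures of central tendency above can be compared side by side. The following is a minimal sketch using Python's standard `statistics` module; the price lists are hypothetical, and the "no mode" rule simply follows the convention stated in this card.

```python
from statistics import mean, median, multimode

# Hypothetical prices; the second list adds one deliberate outlier.
prices = [1, 2, 3, 3, 5, 6, 8]
prices_with_outlier = prices + [100]


def describe(values):
    """Return (mean, median, modes) for a list of numbers."""
    modes = multimode(values)
    # Convention from the card: if every value occurs once, there is no mode.
    if len(set(values)) == len(values):
        modes = []
    return mean(values), median(values), modes


print(describe(prices))
print(describe(prices_with_outlier))
```

The single outlier pulls the mean up sharply, while the median and the mode barely move.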

Question 11

Q

What is skewness and what does it indicate?

Answer

A

Skewness indicates whether the data is concentrated on one side

Question 12

Q

Right skew vs left skew

Answer

A

Right skew:
The mean is bigger than the median –> mean > median

The outliers are to the right

Mode –> Highest point in graph

Check video for graph

Left skew:
mean < median

Outliers are to the left
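
As a rough illustration of the mean-versus-median rule of thumb, here is a small sketch with two hypothetical datasets. The helper `skew_direction` is an illustrative name, not a library function, and the rule is a heuristic rather than a formal skewness measure.

```python
from statistics import mean, median

# Hypothetical data: one long tail on each side.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]
left_skewed = [-20, 1, 2, 2, 3, 3, 3, 4]


def skew_direction(values):
    """Classify skew by comparing the mean with the median (heuristic)."""
    m, med = mean(values), median(values)
    if m > med:
        return "right"
    if m < med:
        return "left"
    return "none"


print(skew_direction(right_skewed))
print(skew_direction(left_skewed))
```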

Question 13

Q

What does variance measure?

Answer

A

Variance measures the dispersion of a set of data points around their mean value

Question 14

Q

Why squaring the number for variance?

Answer

A

We always get non negative computations

Amplifies effect of large differences

Question 15

Q

Population variance vs sample variance

Answer

A

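Population variance: σ² = ∑ (xi - μ)² / N

Sample variance: S² = ∑ (xi - x̅)² / (n - 1)

Note: x̅ instead of μ, and n - 1 instead of N. There is no square root here; taking the square root gives the standard deviation.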
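
To make the difference between the two denominators concrete, here is a minimal sketch that computes both variances by hand and checks them against the standard library. The data values are hypothetical.

```python
from statistics import pvariance, variance

data = [1, 2, 3, 4, 5]  # hypothetical observations

center = sum(data) / len(data)
squared_deviations = sum((x - center) ** 2 for x in data)

# Population variance divides by N, sample variance by n - 1.
population_variance = squared_deviations / len(data)
sample_variance = squared_deviations / (len(data) - 1)

print(population_variance, pvariance(data))
print(sample_variance, variance(data))
```

The sample variance is always the larger of the two for the same data, because it divides by a smaller number.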

Question 16

Q

Population standard deviation vs sample standard deviation

Answer

A

Population standard deviation –> σ = SQRT(σ²)

Sample standard deviation –> S = SQRT(S²)

Question 17

Q

What is the coefficient of variation?

Answer

A

Relative standard deviation: Standard deviation / mean

Population: Cv = σ / μ

Sample: Cv = s / x̄

Question 18

Q

Why use coefficients of variation?

Answer

A

Standard deviation is the most common measure of variability for a single dataset

Coefficient is much better measure for comparing two datasets
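
One reason the coefficient of variation suits comparisons is that it has no unit. The sketch below, on hypothetical prices expressed in two different units, shows the standard deviations differing by a factor of 100 while the coefficients agree.

```python
from statistics import mean, stdev


def coefficient_of_variation(sample):
    """Sample coefficient of variation: s divided by the sample mean."""
    return stdev(sample) / mean(sample)


# The same hypothetical prices in dollars and in cents.
prices_usd = [1.0, 2.0, 3.0, 4.0, 5.0]
prices_cents = [x * 100 for x in prices_usd]

print(stdev(prices_usd), stdev(prices_cents))
print(coefficient_of_variation(prices_usd), coefficient_of_variation(prices_cents))
```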

Question 19

Q

What is Covariance?

Answer

A

2-dimensional

Unlike the formulas for population variance and sample variance, there is now also a y-component

Otherwise the formula has the same form for population and sample

Notice the sigma and s are NOT squared in the formula

Cov(x,y) = σ(xy)

Question 20

Q

Covariance formula?

Question 21

Q

Covariance meaning?

Answer

A

It gives a sense of direction in which the two variables are heading

> 0 means the two variables move together

<0 means the two variables move in opposite directions

=0 means there is no linear relationship between the two variables (independent variables always have a covariance of 0)
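
The sign rule can be checked with a small hand-rolled sample covariance. This is a sketch under the usual textbook definition (divide by n - 1); the function name and the data are hypothetical.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


xs = [1, 2, 3, 4, 5]
moves_together = [2, 4, 6, 8, 10]
moves_opposite = [10, 8, 6, 4, 2]

print(sample_covariance(xs, moves_together))   # positive
print(sample_covariance(xs, moves_opposite))   # negative
```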

Question 22

Q

What does correlation do?

Answer

A

Adjusts covariance, so that the relationship between the two variables becomes easy and intuitive to interpret

This is either the sample or the population version, depending on the data you are working with

Question 23

Q

How to calculate correlation coefficient?

Answer

A

Cov(x,y) = σ(xy)

Population: σ(xy) / (σ(x) * σ(y))

Sample: S(xy) / (Sx * Sy)
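
A minimal sketch of the sample formula, written out step by step. The reading and writing scores are hypothetical, and `correlation` is an illustrative helper rather than a library call.

```python
from math import sqrt


def correlation(xs, ys):
    """Sample correlation coefficient: S(xy) / (Sx * Sy)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)


reading = [1, 2, 3, 4, 5]
writing = [2, 4, 5, 4, 5]

print(correlation(reading, writing))
print(correlation(writing, reading))  # same value: the formula is symmetric
```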

Question 24

Q

How to interpret correlation?

Answer

A

The correlation coefficient is always between -1 and 1

1 –> Entire variability of one variable is explained by the other

Almost 1 –> Strong relationship between the 2 values

0 –> No linear relationship between the two variables

Negative correlation –> The variables move in opposite directions; -1 is a perfect negative relationship

Question 25

Q

Is the correlation between X and Y the same as the correlation between Y and X?

Answer

A

Yes.
Hence: σ(xy) / (σ(x) * σ(y))
Where σ(xy) is the same as σ(yx)

Question 26

Q

What is causality?

Answer

A

Causation indicates that one event is the result of the occurrence of the other event; i.e. there is a causal relationship between the two events. This is also referred to as cause and effect.

It is important to understand the direction of causal relationships

Question 27

Q

When are correlations disregarded?

Answer

A

It is common practice to disregard correlations below 0.2 in absolute value

Question 28

Q

How to calculate the Z-score

Answer

A

Z = (Y - μ) / σ
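
As a quick illustration, the sketch below standardizes a small hypothetical dataset with the formula above, treating the data as a full population (so it uses the population standard deviation).

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize each value: subtract the mean, divide by the std deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(y - mu) / sigma for y in values]


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical; mean 5, std deviation 2
standardized = z_scores(scores)

print(standardized)
print(mean(standardized), pstdev(standardized))  # approximately 0 and 1
```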

Question 29

Q

What is the central limit theorem?

Answer

A

In probability theory, the central limit theorem (CLT) establishes that, in many situations, for independent and identically distributed random variables, the sampling distribution of the standardized sample mean tends towards the standard normal distribution even if the original variables themselves are not normally distributed.

Question 30

Q

When do we speak of a sampling distribution?

Answer

A

A sampling distribution is the probability distribution of a statistic (such as the sample mean) obtained from a large number of samples of the same size drawn from a specific population. It describes the range of values the statistic could take and how frequently each would occur.

Question 31

Q

How to denote the sampling distribution?

Answer

A

Sampling distribution denoted:
~N(μ, σ²/n)

This leads to the insights:

The bigger the sample size the smaller the variance and the more accurate the results are
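
The N(μ, σ²/n) claim can be checked with a small simulation. This is a sketch under stated assumptions: the population is uniform on [0, 1) (so μ = 0.5 and σ² = 1/12, clearly not normal), and the seed, sample size and repetition count are arbitrary choices.

```python
import random
from statistics import mean, pvariance


def simulate_sample_means(n, repetitions, seed=0):
    """Draw `repetitions` samples of size n from Uniform[0, 1) and return their means."""
    rng = random.Random(seed)
    return [sum(rng.random() for _ in range(n)) / n for _ in range(repetitions)]


population_mean = 0.5
population_variance = 1 / 12

means = simulate_sample_means(n=30, repetitions=5000)

print(mean(means), population_mean)                 # close to mu
print(pvariance(means), population_variance / 30)   # close to sigma^2 / n
```

Increasing `n` shrinks the spread of the simulated means, which is the insight stated above.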

Question 32

Q

What does the CLT allow us to do?

Answer

A

Make inferences using the normal distribution, even when the population is not normally distributed

Question 33

Q

Standard error: Definition and formula

Answer

A

Standard deviation of the distribution formed by the sample means, which is:

√(σ²/n) = σ/√n

Means that:

Error decreases when sample size increases
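
A minimal sketch of the formula, showing how the standard error shrinks with sample size. The value σ = 15000 is only an illustrative input.

```python
from math import sqrt


def standard_error(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / sqrt(n)


for n in (30, 120, 480):
    print(n, standard_error(15000, n))
```

Because of the square root, the sample size has to be quadrupled to halve the standard error.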

Question 34

Q

Why is the standard error important?

Answer

A

Important because it is used in most statistical tests –> It shows how well you approximated the true mean

Question 35

Q

What is an estimate?

Answer

A

An approximation based on sample information

Question 36

Q

Which types of estimates can we distinguish?

Answer

A

Two types of estimates

Point estimates –> Single number
Confidence intervals –> Interval

Relation –> Point estimate is exactly in the middle of the confidence interval

Confidence intervals do provide much more information though

Question 37

Q

How are x̅ and S² defined as estimates?

Answer

A

The sample mean (x̄) is a point estimate of the population mean (μ).

The sample variance (s²) is a point estimate of the population variance (σ²).

Question 38

Q

Which two properties does an estimate have?

Answer

A

Bias
Efficiency

The goal is always to look for the most unbiased estimators

Question 39

Q

Characteristics of an unbiased estimator?

Answer

A

Expected value = population parameter

x̄ has an expected value of μ

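Counterexample (a biased estimator): Someone estimates the average height of Americans by taking a sample mean and adding a foot to it.

The expected value of x̄ + 1 ft is μ + 1 ft, not μ –> The estimator has a bias of 1 ft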

Question 40

Q

What is the most efficient estimator?

Answer

A

The most efficient estimator is the unbiased estimator with the smallest variance

Question 41

Q

What is the confidence interval?

Answer

A

Range within which you expect the population parameter to be

Question 42

Q

How is the confidence level denoted?

Answer

A

Denoted as 1 - α

α is a value between 0 and 1
If the confidence level is 95% then α is 5%

Question 43

Q

How is the confidence interval denoted?

Answer

A

[ x̅ - Z(α/2) * (σ/√n), x̅ + Z(α/2) * (σ/√n) ]

Question 44

Q

Case: Calculate the confidence interval (95%) from:

With a x̅ (sample mean) of 100200
And σ = 15000
And n = 30

Answer

A

α is then 0.05 –> Divided by 2 is 0.025

Then you have to look up the Z-score of Z(0.025)

You would have to look up in the table the value of 1 - 0.025 = 0.975

This value is found in row 1.9 and column 0.06

Z(0.025) is therefore 1.9 + 0.06 = 1.96

Substitute the values in the formula:

100200 ± 1.96 * (15000/√30) = 100200 ± 5368

[94832, 105568]

Interpretation:

We are 95% confident that the average data scientist salary will be in the interval [94832, 105568]
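
The table lookup and substitution can also be done in code. This is a minimal sketch using `statistics.NormalDist` from the Python standard library in place of the Z-table; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for the mean when the population std deviation is known."""
    alpha = 1 - confidence
    # Replaces the Z-table lookup of 1 - alpha/2.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin


low, high = z_confidence_interval(sample_mean=100200, sigma=15000, n=30)
print(round(low), round(high))
```

The computed critical value carries more decimals than the tabulated 1.96, so the bounds can differ from a hand calculation in the last digit.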

Question 45

Q

How useful are confidence level ranges?

Answer

A

100% is useless –> Range is too big

99% –> Same story. Not insightful enough

5% –> Too small to be meaningful

95% is the accepted norm!

Question 46

Q

Characteristics of Student’s T distribution

Answer

A

Small sample size approximation of a Normal Distribution

You use this when there’s not sufficient data for the normal distribution

Graph is also bell shaped but with fatter tails to accommodate the occurrence of values far away from the mean

Another key difference is that apart from the mean and variance you must also define the degrees of freedom for the distribution

Question 47

Q

What is the T-statistic

Answer

A

Just as the Z-statistic is related to the normal distribution

The T-statistic is related to the T distribution

Question 48

Q

How to calculate the T-statistic?

Answer

A

T(n-1),α = (x̅ - µ) / (s / √n)

–> Approximation of the normal distribution

Question 49

Q

How to find the T-statistic in a T-table?

Answer

A

Hence:
T(n-1), α = (x̅ - µ) / (s / √n)

With a sample of size n –> We have n-1 degrees of freedom. So for 20 observations, the degrees of freedom is 19

The T-table:

Vertical axis: degrees of freedom
Horizontal axis: α

Note that after the 30th row the numbers don’t differ much from those in the Z-statistic table

Question 50

Q

Finding confidence interval for Student’s T distribution for known population variance and unknown population variance?

Answer

A

Unknown variance:
[ x̅ - T(n-1,α/2) * (S/√n), x̅ + T(n-1,α/2) * (S/√n) ]

Known variance:
[ x̅ - Z(α/2) * (σ/√n), x̅ + Z(α/2) * (σ/√n) ]

All we have to do is find the T-statistic in the table
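
For the unknown-variance case, here is a minimal sketch. The Python standard library has no T distribution, so the critical value is passed in as looked up in a T-table; 2.093 is the tabulated value for 19 degrees of freedom and α/2 = 0.025. The sample itself is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def t_confidence_interval(sample, t_critical):
    """Confidence interval for the mean when the population variance is unknown."""
    n = len(sample)
    margin = t_critical * stdev(sample) / sqrt(n)
    return mean(sample) - margin, mean(sample) + margin


sample = list(range(1, 21))   # hypothetical sample of 20 observations
T_19_0025 = 2.093             # from a T-table: 19 degrees of freedom, alpha/2 = 0.025

low, high = t_confidence_interval(sample, T_19_0025)
print(round(low, 3), round(high, 3))
```

Using 1.96 instead of 2.093 on the same sample would give a narrower interval, which is the price of not knowing the population variance.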

Question 51

Q

Is T-statistic related to the Z-statistic

Answer

A

Just as the Z-statistic is related to the normal distribution

The T-statistic is related to the T distribution

Question 52

Q

How will the confidence interval change when we know the population variance?

Answer

A

When we know the population variance we get a narrower confidence interval. When we do not know the population variance there is higher uncertainty, so the interval is wider.

So: When we don’t know the population variance we can still make predictions, though less accurate ones!

Question 53

Q

How is Margin of Error defined?

Answer

A

ME = Reliability Factor * (σ/√n)

Meaning:
Higher reliability factor or standard deviation –> Higher margin of error

Bigger margin of error –> Wider confidence interval

Smaller margin of error –> Narrower confidence interval

Higher sample size will decrease the margin of error and vice versa

Question 54

Q

Margin of Error for known and unknown population variance

Answer

A

Known population variance:
Margin of error –> Z(α/2) * (σ/√n)

Unknown population variance:
Margin of error –> T(n-1,α/2) * (S/√n)

Question 55

Q

How can you define the confidence intervals with the margin of error?

Question 56

Q

What happens with a smaller margin of error?

Answer

A

Narrower confidence interval

Question 57

Q

What is an example of two datasets, with two means, that are dependent samples?

Answer

A

Studying a person’s weight loss –> Same person

Habits of husbands and wives –> Coincide with each other

Question 58

Q

Difference between dependent and independent samples

Answer

A

Dependent:

Instead of before and after situation we look at cause and effect

Testing with confidence intervals for dependent samples

Use statistical methods like regressions

Independent, can be applied for 3 cases:

When population variance is known

Population variance is unknown but assumed to be equal

Population variance unknown but assumed to be different

Question 59

Q

How to calculate confidence intervals for dependent samples?

Answer

A

We use d̄ (the mean of the paired differences) instead of x̅

We calculate d̄ by taking the before and after difference for each pair of observations and then taking the mean of those differences

You can use the T-statistic for applying it to the confidence interval:

[ d̄ - T(n-1,α/2) * (Sd/√n), d̄ + T(n-1,α/2) * (Sd/√n) ]

Example of application: 10 patients testing medication leading to before and after results. The differences of these results have a certain mean, which is defined as d̄.
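
The 10-patient setup can be sketched as follows. The before and after measurements are invented for illustration, and 2.262 is the tabulated T value for 9 degrees of freedom and α/2 = 0.025.

```python
from math import sqrt
from statistics import mean, stdev


def paired_confidence_interval(before, after, t_critical):
    """Confidence interval for the mean of the paired differences (after - before)."""
    differences = [a - b for b, a in zip(before, after)]
    n = len(differences)
    margin = t_critical * stdev(differences) / sqrt(n)
    return mean(differences) - margin, mean(differences) + margin


# Hypothetical measurements for 10 patients.
before = [10, 12, 9, 11, 13, 10, 12, 11, 9, 13]
after = [11, 14, 12, 13, 14, 12, 15, 13, 10, 16]
T_9_0025 = 2.262  # from a T-table: 9 degrees of freedom, alpha/2 = 0.025

low, high = paired_confidence_interval(before, after, T_9_0025)
print(round(low, 3), round(high, 3))
```

If the interval does not contain 0, the data suggest a real before/after difference at the chosen confidence level.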

Question 60

Q

Considerations for using either the Z or T-statistic

Answer

A

Sample size –> Big / Small

Are the population variances known –> Yes / No

Distribution type? –> Normal?

In case of Big sample size, known population variance and normal distribution –> Use the Z statistic

Question 61

Q

How to calculate the variance between two INDEPENDENT data sets with variance KNOWN?

Answer

A

σ²(diff) = σ(1)² / n(1) + σ(2)² / n(2)

Question 62

Q

What is the confidence interval for two INDEPENDENT data sets with variance KNOWN?

Answer

A

( x̅ - ȳ) ± Z(α/2) * √(σ(1)² / n(1) + σ(2)² / n(2))
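
A minimal sketch of this interval, again using `statistics.NormalDist` in place of the Z-table. All input numbers are hypothetical and chosen so that the variance of the difference comes out to exactly 4.

```python
from math import sqrt
from statistics import NormalDist


def difference_confidence_interval(mean_x, mean_y, var_x, var_y, n_x, n_y,
                                   confidence=0.95):
    """CI for the difference of two means, independent samples, known variances."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Variance of the difference: var_x / n_x + var_y / n_y.
    margin = z * sqrt(var_x / n_x + var_y / n_y)
    difference = mean_x - mean_y
    return difference - margin, difference + margin


low, high = difference_confidence_interval(
    mean_x=58, mean_y=65, var_x=100, var_y=75, n_x=100, n_y=25
)
print(round(low, 2), round(high, 2))
```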

Question 63

Q

What is the confidence interval for two INDEPENDENT data sets with variance UNKNOWN but assumed to be equal? And what is an example of a case like this?

Answer

A

In this case you use what is called the Pooled variance formula

S(p)² = ( (nx - 1)Sx² + (ny - 1)Sy² ) / (nx + ny - 2)

Calculate the interval by using the T-statistic with nx + ny - 2 degrees of freedom:

( x̅ - ȳ) ± T(nx+ny-2, α/2) * √( S(p)²/nx + S(p)²/ny )

Example: You have 2 datasets but the sample size is not the same.
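
The pooled variance and the resulting interval can be sketched as below. The interval uses the common textbook form (difference of means ± T * √(Sp²/nx + Sp²/ny)), which is an assumption here; the two samples are hypothetical and deliberately of different sizes, and 2.262 is the tabulated T value for 5 + 6 - 2 = 9 degrees of freedom at α/2 = 0.025.

```python
from math import sqrt
from statistics import mean, variance


def pooled_variance(xs, ys):
    """Weighted average of the two sample variances, weights n - 1."""
    n_x, n_y = len(xs), len(ys)
    return ((n_x - 1) * variance(xs) + (n_y - 1) * variance(ys)) / (n_x + n_y - 2)


def pooled_confidence_interval(xs, ys, t_critical):
    """CI for the difference of means, variances unknown but assumed equal."""
    sp2 = pooled_variance(xs, ys)
    margin = t_critical * sqrt(sp2 / len(xs) + sp2 / len(ys))
    difference = mean(xs) - mean(ys)
    return difference - margin, difference + margin


xs = [1, 2, 3, 4, 5]          # hypothetical, n = 5
ys = [2, 4, 6, 8, 10, 12]     # hypothetical, n = 6
T_9_0025 = 2.262              # from a T-table: 9 degrees of freedom, alpha/2 = 0.025

print(pooled_variance(xs, ys))
print(pooled_confidence_interval(xs, ys, T_9_0025))
```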

Question 64

Q

Explain the usage of the T-statistic for two INDEPENDENT data sets with variance UNKNOWN

Answer

A

The degrees of freedom are equal to the total sample size minus the number of variables

Normally this would be n-1 because you had 1 variable (sample size)

Because in this case you have 2 sample sizes, there’s 2 variables

Degrees of freedom is then Sample size 1 + sample size 2 - 2

Answer 63

A

Interpretation:

We are 95% confident that the difference between the means of set A and set B lies in the interval (a, b)

Answer 64

A

Find out whether sets are independent or not

Find out whether the population variance is known, or unknown but assumed to be equal

In the latter case calculate the pooled variance with the corresponding formula

You will get a confidence interval for every possible shoe size

Answer 65

A

Null hypothesis –> Denoted with H0 (small 0)

Alternative hypothesis –> Denoted with H1 or Ha

Null hypothesis:
Is like innocent until proven guilty
H0 is true until rejected
The = sign always needs to be in the H0 hypothesis

Answer 66

A

Significance level. Defined as: The probability of rejecting the null hypothesis, if it’s true

Answer 67

A

Calculate a statistic (like x̅)
Scale it with Z = (x̅ - µ) / (s / √n)
Check if Z is in the rejection region. Check whether the test is one- or two-sided –> The critical value depends on this (use α for one-sided, α/2 for two-sided).

Z is the point you place on the standard normal axis. For a two-sided test at α = 0.05, find the critical value by looking up 1 - α/2 = 0.975 in the Z-table and adding the row and column labels (1.9 + 0.06 = 1.96). Reject H0 if Z falls outside the interval [-1.96, 1.96].
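
The three steps can be sketched for the two-sided case as follows. The sample figures and both hypothesized means are hypothetical, and `statistics.NormalDist` stands in for the Z-table.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_z_test(sample_mean, mu_0, std_dev, n, alpha=0.05):
    """Return (z, critical value, reject?) for a two-sided test of H0: mu = mu_0."""
    z = (sample_mean - mu_0) / (std_dev / sqrt(n))
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    # Reject when Z lands in either tail beyond the critical value.
    return z, critical, abs(z) > critical


# Hypothetical sample: mean 100200, std deviation 15000, n = 30.
print(two_sided_z_test(100200, mu_0=105000, std_dev=15000, n=30))
print(two_sided_z_test(100200, mu_0=113000, std_dev=15000, n=30))
```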

Answer 68

A

Type I error:
When you reject a true null hypothesis

Also called a false positive

Probability: α

Type II error
Accept a false null hypothesis

False negative

Probability: β –> Depends mainly on sample size n and population variance σ²

Probability of rejecting a false null hypothesis: 1 - β –> Also called the power of the test
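
That the Type I error probability equals α can be illustrated by simulation. This is a sketch under stated assumptions: H0 is true by construction (standard normal population, known σ = 1), the test is two-sided, and the seed, sample size and trial count are arbitrary.

```python
import random
from math import sqrt
from statistics import NormalDist


def type_one_error_rate(alpha=0.05, n=30, trials=4000, seed=1):
    """Fraction of two-sided Z-tests that reject H0: mu = 0 although it is true."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = (sum(sample) / n - 0.0) / (1.0 / sqrt(n))
        if abs(z) > critical:
            rejections += 1  # a false positive: H0 is true but was rejected
    return rejections / trials


rate = type_one_error_rate()
print(rate)  # should land near alpha = 0.05
```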

Answer 69

A

H0 is true and accept(Do nothing) –> You do nothing and save yourself the embarrassment

H0 is false and accept(Do nothing) –> Missed opportunity

H0 is true and reject(Invite her) –> Embarrassment

H0 is false and reject(Invite her) –> Favourable for all

Answer 70

A

Smallest level of significance at which we can still reject the null hypothesis, given the observed sample statistic

Check whether the tested value falls within the significance region. If p falls outside it (p < α), you can reject the hypothesis

Answer 71

A

Round up to the closest value available

Answer 72

A

When P-value < α

Answer 73

A

One sided: 1 minus the number from the Z-table

Two sided: (1 minus the number from the Z-table) times 2
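
Both rules can be written as one small function. This sketch uses `statistics.NormalDist().cdf` as a stand-in for the Z-table; the table is read at the absolute value of Z, and Z = -2.44 is used as the example input.

```python
from statistics import NormalDist


def p_value(z, two_sided=True):
    """p-value of a Z-statistic: the tail area beyond |z| (doubled if two-sided)."""
    one_tail = 1 - NormalDist().cdf(abs(z))
    return 2 * one_tail if two_sided else one_tail


print(round(p_value(-2.44, two_sided=False), 4))
print(round(p_value(-2.44), 4))
```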

Answer 74

A

T-statistic

Answer 75

A

Hypothesized value difference

Answer 76

A

Accept if: The absolute value of the T-score < critical value t

Reject if: The absolute value of the T-score > critical value t

Answer 77

A

H0: µb - µa ≥ 0

D0 = Hypothesized value difference

Answer 78

A

Formulating the hypothesis

Calculate sample mean

Standard deviation

Standard error

Determine which statistic to use
Small / Big sample
Assuming which distribution
Variance known / unknown

T score (in this case) is equal to T = (d̄ - µ0) / standard error, where d̄ is the mean of the paired differences

Decide whether you want to fix a level of significance up front; if not, work with the p-value

In the T-table you can see between which significance levels the statistic falls (here α between 0.025 and 0.01)

Use an online calculator to determine the p-value exactly

Decision rule
Accept if: p > α
Reject if: p < α

Then choose the level of significance for the study

Answer 79

A

–> The sign of the test statistic can give you that information

Negative sign of the statistic means the observed value is smaller than the hypothesized value –> In this case, Z = -2.44, thus the difference is likely lower than -4%, like -5% or -6%

Positive sign of statistic means it’s higher than hypothesized value

Answer 80

A

Hypothesis: H0 : µe - µm = -4%
Hypothesis: H1 : µe - µm ≠ -4%

Look at sample sizes –> Whether they are equal

Determine difference between means

Determine standard error of the difference: √ ( σe² / ne + σm² / nm )

Determine which statistic to use –> Z statistic
Big samples
Known variances

Find Z-score –> Z statistic formula: (x̅ - µ0) / standard error (from step 4), where x̅ is here the difference between the sample means
Notice sometimes there’s no µ0, because H0 states that something is smaller/bigger without giving a number –> In that case µ0 is 0

P-value from online software –> 0.015

Interpretation:
At 5% significance we reject the null hypothesis –> 0.015 < 0.05
We say: There is enough statistical evidence that the mean difference is NOT -4%

Answer 81

A

Use the pooled variance

Answer 82

A

Checking if the T-score is positive or negative

Positive sign of statistic means it’s higher than hypothesized value
