L6 - Estimating from a Sample: The Sample Mean & Confidence Intervals Flashcards Preview

18ECA005 - Data Analysis II > L6 - Estimating from a Sample: The Sample Mean & Confidence Intervals > Flashcards


Why do we need Samples?

More often than not we use samples to infer about the population.

Why do we need samples in the first place?
- Only available data;
- Costly and time consuming to use the population, if not impossible;
- Could be even counter-productive (destructive sampling)


How precise is sample information?

- We need specialist sampling techniques to make sure the sample is
representative and accurate. These are beyond the scope of this module.

- Example of a bad sample: a poll run by a newspaper is biased towards people who read newspapers, and in particular towards readers of that paper, who tend to share its political views.


What is it called when we get information from a sample to find something out about the population?

- When we get information from a sample to find out something about the population we use what is called an estimator.
- For example, the sample mean X(bar) is an estimator for the population
mean (μ), and the sample variance (s^2) is an estimator for the population
variance (σ^2).
- μ and σ^2 are parameters: fixed values of the population that do not change from sample to sample


What is the formula for the estimator?

An estimator is a formula and it defines a variable:
- For example the estimator of μ is:
- X(bar) = (1/n) Σ_{i=1}^{n} x{i}

An estimate is the numerical value we get from applying that formula
to our sample of data
- For example if we get X(bar) = 4.22 then 4.22 is our estimate
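
The difference between the estimator (the formula) and the estimate (the number it produces) can be sketched in a few lines of Python. The five observations below are made-up values, chosen only so that the result matches the 4.22 used above; the function name `sample_mean` is likewise just an illustrative choice.

```python
def sample_mean(xs):
    """Estimator: X(bar) = (1/n) * sum of the observations x_i."""
    return sum(xs) / len(xs)

# Hypothetical sample of five observations (made-up values for illustration).
sample = [4.0, 4.5, 3.9, 4.6, 4.1]

# Applying the estimator to this particular sample gives the estimate.
estimate = sample_mean(sample)
print(round(estimate, 2))
```

A different sample passed to the same function would give a different estimate, which is why the estimator itself is treated as a variable.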


What is the Distribution of the Sample Mean?

- Consider a variable X ~ (μ, σ^2) and assume that:
- We know what σ^2 is
- We don’t know what μ is

- Ex: the height of students in a country; since we cannot measure them
all to compute μ we use a sample of students and calculate X(bar)
- The value we get clearly depends on the sample we used:
- Sample a will give a value of X^a(bar)
- Sample b will give a value of X^b(bar)
- Infinite samples --> infinite possible X(bar) values we could get

- This means that:
- the sample mean (like all estimators) is a variable;
as such it follows a distribution with a mean and a variance.


What would the sampling distribution of the sample mean most likely tend towards?

- So the sample mean is an estimator, i.e. a variable: its values are all the
possible values that I would get from infinite samples.
- If we did this, we would find that values close to μ are more likely, and
values far from μ less likely. I.e., more sample means would be closer to
the population mean, and fewer would be further away.

- It has been demonstrated that:
- X(bar) ~ (μ,σ^2/n)

- X(bar) is distributed with mean = μ and variance = σ^2/n where:
- σ^2 is the variance of X (the height of students in the whole population)
- n is the sample size
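
A small simulation can make this result more concrete. The sketch below assumes a normal population with μ = 170 and σ = 10 (made-up values for the student-height example), draws many samples of size n = 25, and compares the mean and variance of the resulting sample means with μ and σ^2/n. It uses a large but finite number of samples in place of "infinite samples".

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

MU, SIGMA = 170.0, 10.0   # assumed population mean and standard deviation
N = 25                    # size of each sample
REPS = 10_000             # number of repeated samples (stand-in for "infinite")

sample_means = []
for _ in range(REPS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    sample_means.append(statistics.fmean(sample))

# Mean and variance of the simulated sample means.
mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.pvariance(sample_means)

print(f"mean of sample means:     {mean_of_means:.2f} (theory: {MU})")
print(f"variance of sample means: {var_of_means:.2f} (theory: {SIGMA**2 / N})")
```

The simulated values should land close to, but not exactly on, the theoretical ones, because only a finite number of samples is drawn.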


When would the sampling distribution be normal?

- The distribution will be normal, i.e. X(bar) ~ N(μ,σ^2/n), if either:
- X ~ N (X itself is normally distributed), or
- n is large (e.g. n > 30), by the Central Limit Theorem
- Central Limit Theorem --> when independent variables are added together, their normalised sum tends to be Normal (i.e. it approaches normality as n --> ∞)
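
The Central Limit Theorem case can be checked with a rough simulation. The sketch below assumes a clearly non-normal population, an Exponential(1) variable with mean 1 and variance 1 (a choice made here for illustration, not taken from the notes), and counts how often the standardised sample mean falls within +/-1.96, which would be about 95% of the time if X(bar) were normal.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

N = 50          # sample size, above the n > 30 rule of thumb
REPS = 10_000   # number of repeated samples
MU = 1.0        # mean of an Exponential(1) population
SIGMA2 = 1.0    # variance of an Exponential(1) population

inside = 0
for _ in range(REPS):
    sample = [random.expovariate(1.0) for _ in range(N)]
    # Standardise the sample mean using its theoretical mean and variance.
    z = (statistics.fmean(sample) - MU) / math.sqrt(SIGMA2 / N)
    if -1.96 < z < 1.96:
        inside += 1

share_inside = inside / REPS
print(f"share of standardised sample means within +/-1.96: {share_inside:.3f}")
```

Even though the individual observations are strongly skewed, the share should come out near 0.95, which is what approximate normality of X(bar) predicts.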

So what does all this practically mean?
- If the distribution of X(bar) is centred around μ then on average we are going to "get it right": we are more likely to get values close to μ than far from it
- We write this as E[X(bar)] = μ and say that the sample mean is an unbiased estimator of the population mean


What is the property of consistency for the sample mean as an estimator of the population mean?

The variance of the sampling distribution σ^2/n will decrease with n; as n --> ∞
it tends to 0 and X(bar) --> μ: this is the property of consistency.
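
As a quick numerical illustration, the sketch below evaluates σ^2/n for increasing sample sizes, assuming a known population variance of 100 (a made-up value). The helper name `variance_of_sample_mean` is only an illustrative choice.

```python
SIGMA2 = 100.0  # assumed (known) population variance

def variance_of_sample_mean(sigma2, n):
    """Variance of X(bar) for a sample of size n: sigma^2 / n."""
    return sigma2 / n

# The variance of the sample mean shrinks towards 0 as n grows.
for n in (10, 100, 1_000, 10_000):
    print(n, variance_of_sample_mean(SIGMA2, n))
```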


What do you need to be careful about with the two types of variance?

Do not confuse the variance of the sample mean (σ^2/n ) with
the sample variance s^2 !!

One is the variance of the various sample means, the other the variance
within our own sample, e.g. the height of the 10 children selected.
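
The two quantities can be computed side by side to see that they answer different questions. In the sketch below the ten heights and the "known" population variance of 25 are made-up values for illustration only.

```python
import statistics

# Hypothetical heights (cm) of 10 children; made-up values for illustration.
heights = [120.0, 125.0, 130.0, 128.0, 122.0, 135.0, 127.0, 131.0, 124.0, 128.0]
n = len(heights)

# Sample variance s^2: spread of the observations within this one sample
# (statistics.variance divides by n - 1).
s2 = statistics.variance(heights)

# Variance of the sample mean: spread of X(bar) across repeated samples.
SIGMA2 = 25.0  # assumed known population variance
var_of_mean = SIGMA2 / n

print(f"s^2 = {s2:.2f}, sigma^2/n = {var_of_mean:.2f}")
```

The first number describes how much the children in this sample differ from each other; the second describes how much the sample mean would move around if the sampling were repeated.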


What is a Summary of the main logical points to calculate the Sample Mean?

- We start from a variable X ~ (μ,σ^2); (eg heights)
- We know what σ^2 is but not what μ is;
- We need to estimate μ, and we do this by using a sample.
- The value X(bar) that we get is an observation from the distribution of the
variable X(bar) which is the variable “sample mean”.
- X(bar) ~ N(μ,σ^2/n) if either:
- X ~ N, or
- n > 30


What is a confidence interval?

- Nature of the problem: for X ~ N(μ,σ^2), we want to find what μ is
- Assume we know σ^2
- We collect a sample and calculate X(bar), which is expected to be close to but not identical to μ
- What is the uncertainty around our point estimate? We build a
symmetric range with a certain probability (eg 95%) around it.
- This is called a confidence interval with confidence level (1-α)
- Let’s do this in steps starting from Z~N(0,1)


How do you calculate confidence intervals?

- First we choose the confidence level of the interval, e.g. 95%
- This leaves two tails summing to a total area α of 5%, i.e. each worth α/2 =
2.5%. This value α is called the significance level, and the C.I. has area (1-α).
- Central area (yellow) = C.I. = (1-α) = 95%
- Area in each tail (green): α/2 = 2.5%.
- We need to find the two critical points, which we know are +/-1.96 (from the second table of critical values):
- P(z{1} < Z < z{2}) = 0.95
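
Instead of reading the critical points from a table, they can be recovered numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the variable names are illustrative choices.

```python
from statistics import NormalDist

alpha = 0.05                       # significance level for a 95% C.I.
Z = NormalDist(mu=0.0, sigma=1.0)  # standard normal Z ~ N(0,1)

z2 = Z.inv_cdf(1 - alpha / 2)      # upper critical point (area alpha/2 to its right)
z1 = -z2                           # lower critical point, by symmetry

print(round(z1, 2), round(z2, 2))
# Area between the two critical points, i.e. P(z1 < Z < z2).
print(round(Z.cdf(z2) - Z.cdf(z1), 2))
```

Changing `alpha` to 0.10 or 0.01 gives the critical points for a 90% or 99% interval in the same way.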

- Now we use P(-1.96 < Z < +1.96) = 0.95 to work out a confidence interval for the mean μ using the sample mean X(bar):
- If X(bar) ~ N(μ,σ^2/n) then P(-1.96 < (X(bar) - μ)/sqrt(σ^2/n) < +1.96) = 0.95

Now we simply rearrange the inequality so that μ is in the middle:
P(X(bar) - 1.96sqrt(σ^2/n) < μ < X(bar) + 1.96sqrt(σ^2/n)) = 0.95

The two values for our range are then simply calculated as:
- X(bar) - 1.96sqrt(σ^2/n) = x{1}
- X(bar) + 1.96sqrt(σ^2/n) = x{2}
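
The two limits can be computed with a short function. This is a minimal sketch for the known-variance case described here; the function name `confidence_interval` and the input values (sample mean 170, variance 100, n = 25) are hypothetical and chosen only for illustration.

```python
import math

def confidence_interval(xbar, sigma2, n, z=1.96):
    """Return (x1, x2) = X(bar) -/+ z * sqrt(sigma^2 / n)."""
    half_width = z * math.sqrt(sigma2 / n)
    return xbar - half_width, xbar + half_width

# Hypothetical inputs: sample mean 170, known variance 100, sample size 25.
x1, x2 = confidence_interval(xbar=170.0, sigma2=100.0, n=25)
print(f"95% C.I.: ({x1:.2f}, {x2:.2f})")
```

With these inputs sqrt(σ^2/n) = 2, so the interval is 170 -/+ 3.92. A larger n or a smaller σ^2 narrows the interval.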


What is a Caveat for the Confidence interval interpretation?

- The interval looks like saying “there is a 95% probability that μ lies
between 126,300 and 233,700” but this is technically incorrect since μ
is a fixed value (a parameter, not a variable).
- Technically the meaning is that if we were to calculate the C.I. an infinite
number of times, using an infinite number of samples, 95% of the
intervals we get would contain the true value of μ.
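
This repeated-sampling interpretation can be illustrated by simulation. The sketch below assumes a normal population with μ = 170 and σ^2 = 100 (made-up values), builds a 95% interval from each of many samples, and counts how many of those intervals contain the fixed value μ.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

MU, SIGMA2, N = 170.0, 100.0, 25   # assumed population and sample size
REPS = 10_000                      # number of repeated samples
half_width = 1.96 * math.sqrt(SIGMA2 / N)

covered = 0
for _ in range(REPS):
    sample = [random.gauss(MU, math.sqrt(SIGMA2)) for _ in range(N)]
    xbar = statistics.fmean(sample)
    # mu stays fixed; it is the interval that changes from sample to sample.
    if xbar - half_width < MU < xbar + half_width:
        covered += 1

coverage = covered / REPS
print(f"share of intervals containing mu: {coverage:.3f}")
```

The share should be close to 0.95. Any single interval either contains μ or it does not; the 95% describes the procedure, not one particular interval.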