Week 2 Flashcards

1
Q

Why is probability important in statistics?

A

We want a random sample which represents our population.
This means no inherent biases in the sampling technique.
Variations in the sample data cause uncertainty in the statistical analysis results.
This is because no two random samples are exactly alike.
E.g. one random sample may be representative of the population of interest, while another random sample may be off purely due to chance.
Inferential statistics is concerned with measuring this degree of uncertainty, which can be quantified using the concept of probability, allowing us to draw conclusions about our population from the random sample.

We use probability to quantify how much we expect random samples to vary.

2
Q

Calculation of probability

A

The definition of probability depends on three things: a process or experiment that is repeated under identical conditions, the number of times the experiment is repeated, and the number of times an outcome of interest occurs.
Let us assume that an experiment is repeated n times, and that out of those n times an event of interest occurred m times. Then the probability that the event occurs is estimated by the relative frequency of the event. Thus the relative frequency definition of probability is given by: Relative frequency = m/n
E.g. 1000 smokers (n)
20 with lung cancer (m), so the relative frequency is 20/1000 = 0.02
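
The relative frequency definition can be sketched in a few lines of Python. This is only an illustration: the function name `relative_frequency` is made up for this card, and the figures are the 20 lung cancer cases among 1000 smokers used above.

```python
def relative_frequency(m: int, n: int) -> float:
    """Estimate a probability as m occurrences out of n repetitions."""
    if n <= 0:
        raise ValueError("n must be a positive number of repetitions")
    if not 0 <= m <= n:
        raise ValueError("m must lie between 0 and n")
    return m / n


# Figures from the card: 20 lung cancer cases among 1000 smokers.
p_lung_cancer = relative_frequency(20, 1000)
print(p_lung_cancer)  # 0.02
```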

3
Q

What is a parameter?

A

In a study you are aiming to ensure that your sample result is as close to the parameter as possible.
This allows the results of your study to be generalised to the entire (relevant) population.

A parameter is an unknown characteristic of interest in the true population.

General examples include the true mean, true proportion, and true standard deviation.
It is difficult to calculate for a large population due to financial and time constraints.
An example: Consider the BMI of Australians aged 30 to 60 years. If all the Australians within this age group are taken into account to calculate the mean and standard deviation of BMI, these are called the true (population) mean and standard deviation respectively.

4
Q

What is the difference between parameters and statistics?

A

A statistic and a parameter are very similar in the sense that they are both descriptions of groups. For example, “50% of cat owners prefer X brand cat food.” The difference between a statistic and a parameter is that:

Statistics describe a sample and are denoted by Latin (Roman) letters (e.g. s for the sample standard deviation)
Parameters describe an entire population and are denoted by Greek letters (e.g. μ, σ)

5
Q

Normal distribution

A

The normal distribution is appropriate only for continuous variables.
The normal distribution has two parameters: mean (μ) and standard deviation (σ). These parameters respectively describe the central value and spread in the data.

The shape of the distribution of sample observations depends on the shape of the distribution of the observations in the sampled population. In general, as the sample size increases, the distribution of the sample observations approaches the population distribution.

We can predict the shape of the data in the population by the shape of the observations in a large sample.

6
Q

Some features of the normal distribution

A

Normal distributions are symmetric around their mean.
The mean, median, and mode of a normal distribution are equal.
The area under the normal curve is equal to 1.0.
Normal distributions are denser in the centre and less dense in the tails.
Normal distributions are defined by two parameters, the mean (μ) and the standard deviation (σ).
Approximately 68% of the area of a normal distribution is within one standard deviation of the mean.
Approximately 95% of the area of a normal distribution is within two standard deviations of the mean.

7
Q

The total area under the curve above the horizontal axis is one square unit because the area under the curve is the cumulative sum of ____________ frequencies.

A

Relative

8
Q

normal distributions - Z distribution

A

For normally distributed data, the observations are clustered symmetrically around the central value of the distribution, i.e., around the population mean.

If the data follows the normal distribution with true mean and true standard deviation then:

About 68% of observations fall within 1 standard deviation (SD) of the mean (μ ± σ).
About 95% of observations fall within 2 SDs of the mean (μ ± 2σ).
About 99.7% of observations fall within 3 SDs of the mean (μ ± 3σ).

9
Q

Intervals

A

We can split up the individual intervals into their respective percentages. The ranges are called reference ranges for the normal distribution.
Equivalently, for a large sample (where the sample mean approaches the population mean), the same 68-95-99.7% rule applies.

Moving outwards from the mean on each side: 34% (mean to 1 SD) -> 13.5% (1 to 2 SD) -> 2.35% (2 to 3 SD) -> 0.15% (beyond 3 SD)
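
The band percentages can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the band labels are just illustrative names. Note that the exact areas differ slightly from the rounded figures of the 68-95-99.7 rule.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1

# Area of each band on ONE side of the mean.
bands = {
    "mean to 1 SD": z.cdf(1) - z.cdf(0),
    "1 SD to 2 SD": z.cdf(2) - z.cdf(1),
    "2 SD to 3 SD": z.cdf(3) - z.cdf(2),
    "beyond 3 SD": 1 - z.cdf(3),
}

for label, area in bands.items():
    print(f"{label}: {100 * area:.2f}%")

# The four bands on one side add up to half of the total area.
print(f"one side total: {sum(bands.values()):.2f}")
```

Run as written, the exact band areas come out at roughly 34.13%, 13.59%, 2.14% and 0.13%, which the rule of thumb rounds to 34%, 13.5%, 2.35% and 0.15%.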

10
Q

How do we calculate the areas for ranges that are not within the reference ranges?

A

First, what are our reference ranges?
-3SD, -2SD, -1SD, Mean, +1SD, +2SD, +3SD.
Answer: for values between these points, use the z-score.

11
Q

What is a Z-score?

A

A Z-score indicates how many standard deviations a data point is from the mean, which lets us find areas for ranges that fall outside the reference ranges.
A Z-score:
- tells us where a data point lies compared with the rest of the data set, in relation to the mean
- allows comparison of data points across different normal distributions
For example, we can compare the scores obtained by a student in two exams whose scores are normally distributed.

12
Q

How to calculate a Z-score?

A

A simple transformation of the variable can be useful to calculate the probability. This transformation requires the knowledge of the true (population) mean and true standard deviation for the variable.
The transformation is achieved by subtracting the true mean from the observed value and then dividing this difference by the true standard deviation.
The whole expression is denoted by Z and is known as the Z-score or standard score. The Z-score has a mean of 0 and a standard deviation of 1, and its distribution is known as the “standard normal distribution”.
Z-score = (Observed Value – True Mean)/(True Standard Deviation)
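
A minimal sketch of the transformation in Python. The function name `z_score` and the exam figures are hypothetical, chosen only to show how z-scores allow a comparison across two different normal distributions.

```python
def z_score(observed: float, true_mean: float, true_sd: float) -> float:
    """Number of standard deviations the observed value lies from the true mean."""
    if true_sd <= 0:
        raise ValueError("true_sd must be positive")
    return (observed - true_mean) / true_sd


# Hypothetical exam results for one student (all numbers are assumptions).
z_exam_a = z_score(observed=75, true_mean=65, true_sd=10)  # 1.0
z_exam_b = z_score(observed=70, true_mean=60, true_sd=5)   # 2.0

print(z_exam_a, z_exam_b)
```

Although the raw score is higher in exam A, the z-score is higher in exam B, so relative to the other students the performance in exam B was stronger.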

13
Q

If the true mean and SD are unknown….

A

If the true mean and SD are unknown, they can be replaced by the sample mean and SD respectively when the sample size is large.

For any individual, the z-score tells us how many standard deviations the raw score for that individual deviates from the mean and in what direction. A positive z-score indicates the individual is above average and a negative z-score indicates the individual is below average.

E.g. a Z-score of -1.5 for a BMI of 25 means that this BMI is 1.5 standard deviations below the true mean.

14
Q

Area under the normal distribution curve

A

The two main methods for calculating the probabilities for various z-scores are:

Normal distribution table
Statistical packages

In order to use the normal probability table, we must calculate the z-score.

15
Q

How to read a normal distribution table?

A

Consider the absolute value (always positive) of the Z-score and break it into two parts:
(a) the whole number and the tenth and
(b) the hundredth.

In this example, the absolute value of the Z-score is 1.50.
The whole number and the tenth is 1.5
and the hundredth is 0.00

The whole number and the tenth (1.5) are looked up along the first column and the hundredth (0.00) is looked up across the first row in the table.
The value in the intersection of the row and column is the probability from the absolute value of the Z-score to infinity (i.e. above the Z score). Thus from the table, the probability above 1.5 is 0.0668.
Note: Due to the symmetry of the normal distribution, the probability below -1.5 is the same as the probability above +1.5.

We are interested in the probability of a Z-score greater than -1.5.
The easiest way to calculate this is: 1 - 0.0668 (i.e. entire area - [area below -1.5]).
Thus the required probability is 0.9332 (1 - 0.0668).

Hence the probability that a randomly selected adult Australian is overweight (BMI > 25 kg/m²) is 93.32%.
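
The table lookup can be cross-checked with a statistical package. The sketch below is one possible check using `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, SD 1
z = -1.5

# What the table gives: area above |z|, i.e. above +1.5.
area_above_abs_z = 1 - std_normal.cdf(abs(z))

# What we want: area above z = -1.5.
prob_above_z = 1 - std_normal.cdf(z)

print(round(area_above_abs_z, 4))  # 0.0668
print(round(prob_above_z, 4))      # 0.9332
```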

16
Q

Sampling distribution - Definition

A

A sampling distribution is the distribution of a summary statistic (e.g. the sample mean) across repeated samples.
Medical research often involves acquiring data from a sample of individuals and using the information gathered from the sample to make inferences about a broader group of individuals.

17
Q

Steps for sampling distribution:

A

Step 1: Take repeated samples of size n from the true population of size N. Calculate the mean BMI for each sample.

Step 2: Record the sample means.

Step 3: Present the results (sample means) on a histogram. This histogram is called the “sampling distribution”.

Step 4: Calculate the mean and standard error (which is the standard deviation of the means) from the sample means.

For a sufficiently large sample the sample means arising from repeated samples will cluster around the population mean.
A sample mean from a larger sample is likely to lie closer to the population mean than a sample mean from a smaller sample.
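
The steps above can be sketched as a small simulation. Everything numeric here is an assumption made for illustration: a simulated population of 10,000 BMI values with mean 27 and SD 4, samples of size 50, and 2,000 repeated samples. The histogram in Step 3 is left out to keep the sketch dependency-free.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population of N = 10,000 BMI values (assumed mean 27, SD 4).
population = [rng.gauss(27.0, 4.0) for _ in range(10_000)]
true_mean = statistics.fmean(population)
true_sd = statistics.pstdev(population)

n = 50           # size of each sample
repeats = 2_000  # number of repeated samples

# Steps 1 and 2: draw repeated samples and record each sample mean.
sample_means = [
    statistics.fmean(rng.sample(population, n)) for _ in range(repeats)
]

# Step 4: mean of the sample means and their SD (the standard error).
mean_of_means = statistics.fmean(sample_means)
standard_error = statistics.stdev(sample_means)

print(f"true mean          : {true_mean:.2f}")
print(f"mean of the means  : {mean_of_means:.2f}")
print(f"SD of the means    : {standard_error:.3f}")
print(f"true SD / sqrt(n)  : {true_sd / math.sqrt(n):.3f}")
```

The mean of the sample means should sit close to the population mean, and the SD of the sample means should be close to the population SD divided by the square root of n.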

18
Q

Which of the following statement(s) is/are CORRECT regarding the mean of a sampling distribution?

It is the mean of the statistic for all of the samples in the distribution.

It depends on the sample size.

It is the same as the population parameter

A

It is the mean of the statistic for all of the samples in the distribution.

It is the same as the population parameter

19
Q

What is standard error?

A

In order to quantify the uncertainty in the sample mean we must calculate the standard error.

The spread of the sample means across repeated samples can be measured using their standard deviation, and this leads us to a special term: the Standard Error (SE) of the mean.
It is a measure of precision of the sample mean from a single sample in estimating the population mean. The smaller the SE, the more precisely the population mean is being estimated.

20
Q

How to calculate SE

A

The standard error is obtained by dividing the standard deviation by the square root of the sample size. Thus, a computational formula for the SE of the sample mean is given by: SE = sample SD / square root (sample size)

The SE is the value for the spread of the sampling distribution.
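
As a rough illustration, the formula can be applied to a single sample in Python. The function name `standard_error` and the five BMI values are hypothetical.

```python
import math
import statistics


def standard_error(sample: list[float]) -> float:
    """SE of the sample mean: sample SD divided by the square root of n."""
    if len(sample) < 2:
        raise ValueError("need at least two observations to estimate the SD")
    return statistics.stdev(sample) / math.sqrt(len(sample))


# Hypothetical sample of five BMI values (assumed numbers).
bmi_sample = [22.0, 25.0, 27.0, 30.0, 31.0]
se = standard_error(bmi_sample)

print(f"sample mean = {statistics.fmean(bmi_sample):.1f}, SE = {se:.2f}")
```

Here `statistics.stdev` is the sample SD (dividing by n - 1), which is the appropriate choice when the true SD is unknown.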

21
Q

What is the central limit theorem?

A
  • This concept demonstrates the effect of sample size on the shape of the sampling distribution of the mean.
  • For sufficiently large samples, the sampling distribution of the mean approximately follows the normal distribution regardless of the shape of the true population.
  • If you increase the sample size then the sampling distribution will be less spread out (standard error will be smaller).

Example

The distribution of sample means is skewed and spread out when the sample sizes are small. As the sample size increases the sampling distribution of the sample mean becomes more concentrated around the true mean.
Conclusion: Even though the population distribution of preoperative creatinine level is skewed to the right, the sampling distribution of the sample mean becomes normal for large samples.
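
This behaviour can be illustrated with a simulation. The sketch below draws from an exponential distribution purely as a stand-in for a right-skewed population (it is not the creatinine data), and the helper names `skewness` and `sample_means` are made up for the example.

```python
import random
import statistics


def skewness(values: list[float]) -> float:
    """Simple moment-based skewness: 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def sample_means(n: int, repeats: int, rng: random.Random) -> list[float]:
    """Means of `repeats` samples of size n from a right-skewed population."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(repeats)
    ]


rng = random.Random(0)  # fixed seed so the run is repeatable
results = {n: sample_means(n, 2_000, rng) for n in (5, 30, 100)}

for n, means in results.items():
    print(
        f"n={n:>3}: spread of means = {statistics.stdev(means):.3f}, "
        f"skewness = {skewness(means):.2f}"
    )
```

As n grows, both the spread and the skewness of the sample means shrink, which is the pattern the central limit theorem describes.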

22
Q

Overview of CLT:

A
  • For small sample sizes the distribution of the sample mean is spread out, and may be skewed, i.e., the sample means are far from the true mean and hence the precision of the sample mean is low.
  • The distribution of the sample mean becomes more concentrated around the true mean for larger sample sizes, i.e., the sample means are clustered around the true mean and hence the distribution of the sample mean is symmetric and normal. Also the precision of the sample mean is high.
  • The degree of clustering in the sampling distribution also depends on the variability of the data in the population.
23
Q

Describe t-distributions

A

The t distribution is similar to the normal distribution and is appropriate for continuous data; both are symmetric and bell shaped.
Tails are a bit longer for the t distribution compared to the normal distribution.
The shape of the t distribution depends on the sample size (through the degrees of freedom).
For large samples, the t distribution is more like the normal distribution.
The t-distribution is associated with the calculation of degrees of freedom (df).

24
Q

Degrees of freedom (df)

A

The t-distribution is associated with calculation of degrees of freedom (df).
It denotes the number of independent pieces of information available to estimate another piece of information.
In short, think of df as a mathematical restriction that we need to put in place when we calculate an estimate of one parameter from an estimate of another.

25
Q

Area under the sampling distribution

A

For large samples the sampling distribution for the sample mean follows the normal distribution.
If we draw a perpendicular from the peak to the horizontal axis, the meeting point on the horizontal axis is the population mean.
Because the distribution is symmetric and bell shaped, the perpendicular divides the total area of the curve into two equal parts (with 50% each).

The formula for the area under the sampling distribution depends on whether the true (population) standard deviation is known or unknown.

26
Q

According to the normal distribution probability law:

A

Approximately 68% of the sample means are expected to lie within one standard error of the true mean, that is, within +1SE and -1SE.

Approximately 95% of the sample means are expected to lie within 1.96 or approximately two standard errors of the true mean, that is, within -2SE and +2SE.

Approximately 99.7% of sample means are expected to lie within 2.97 or approximately three standard errors of the true mean, that is, within -3SE and +3SE.

27
Q

Calculate the area under the sampling distribution when the SD is known

A

When the true SD is known, then irrespective of the sample size the area (other than the reference ranges) under the sampling distribution of the sample mean can be calculated using the normal probability table.
This can be done by transforming the sample mean to a Z-score, where the Z-score is calculated by subtracting the true mean from the sample mean and dividing this difference by the standard error of the sample mean.
The value of the Z-score for a sampling distribution of the sample mean shows the number of SEs the sample mean is away from the true mean.
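
A short sketch of this calculation in Python, using assumed figures (true mean 27, true SD 4, sample size 64, observed sample mean 28) and a hypothetical helper `z_for_sample_mean`. The tail area comes from `statistics.NormalDist` rather than a printed table.

```python
import math
from statistics import NormalDist


def z_for_sample_mean(
    sample_mean: float, true_mean: float, true_sd: float, n: int
) -> float:
    """Z-score of a sample mean: distance from the true mean in units of SE."""
    se = true_sd / math.sqrt(n)  # SE uses the known true SD
    return (sample_mean - true_mean) / se


# Assumed figures: SE = 4 / sqrt(64) = 0.5, so z = (28 - 27) / 0.5 = 2.0.
z = z_for_sample_mean(sample_mean=28.0, true_mean=27.0, true_sd=4.0, n=64)
p_above = 1 - NormalDist().cdf(z)

print(z)                  # 2.0
print(round(p_above, 4))  # 0.0228
```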

28
Q

Z score formula

A

Z = (sample mean - true mean) / SE
or, for a single observation, z = (x - μ)/σ
x = observed value
μ = true (population) mean
σ = true (population) SD

29
Q

calculate the area under the sampling distribution when the SD is unknown

A

When the population SD is unknown, the SE is estimated from the sample SD and the standardised sample mean follows a t-distribution with df = n - 1, where n = sample size.
T score = (sample mean - true mean) / SE
As the sample size increases the distribution of t-score approaches the standard normal distribution.
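
The t-score and its degrees of freedom can be computed as below. The sample values and the true mean of 26 are assumptions for illustration, and `t_score` is a hypothetical helper. The Python standard library has no t-distribution, so the tail area itself would still come from a t table or a statistical package.

```python
import math
import statistics


def t_score(sample: list[float], true_mean: float) -> tuple[float, int]:
    """Return (t, df) for a sample mean when the population SD is unknown."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    se = statistics.stdev(sample) / math.sqrt(n)  # SE from the sample SD
    t = (statistics.fmean(sample) - true_mean) / se
    return t, n - 1


# Hypothetical sample of five BMI values and an assumed true mean of 26.
t, df = t_score([24.0, 26.0, 28.0, 30.0, 32.0], true_mean=26.0)

print(f"t = {t:.3f}, df = {df}")
```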

30
Q

PROBABILITY - simple

A

▪ A fundamental concept in statistics – always in terms of probabilities
▪ The proportion of times an event occurs (m times) when a process/experiment is repeated n times
under identical conditions
▪ Estimated from the relative frequency
▪ Ranges from 0 (never) to 1 (all the time)
▪ Probability (Pr) of an event = m / n
e.g. a study of 10000 smokers found that 54 developed lung cancer.
Thus the probability that a smoker develops lung cancer is:
54/10000 = 0.0054 = 0.54%

31
Q

uncertainty

A

▪ There is no uncertainty when we are making statements about the sample itself
▪ The uncertainty comes when we estimate population properties from our random sample
(draw conclusions, inferences)
▪ There is variation in the sample data which creates uncertainty in our estimates
* within & between subject differences
* measurement error, other sources
* chance
▪ We must account for the uncertainty and variability
▪ Use probability to measure the degree of uncertainty

32
Q

PARAMETER ESTIMATES vs SAMPLE

A
33
Q

BINOMIAL DISTRIBUTION

A
34
Q

BINOMIAL & NORMAL DISTRIBUTIONS

A
35
Q

STANDARD NORMAL DISTRIBUTION

A

▪ Normal distribution with mean of 0 & standard deviation of 1
▪ Any value can be transformed into a ‘standardised’ z-score

36
Q

Z-SCORE

A
37
Q

SAMPLING DISTRIBUTIONS- sim

A

▪ The sampling distribution is the probability distribution of a sample statistic (e.g. mean, median)
▪ Formed when samples of size n are repeatedly taken from a population
▪ If sample statistic is mean, then distribution is:
✔ The sampling distribution of sample mean

38
Q

Standard deviation of the sample mean is

A

SE = sample SD / square root (sample size)

39
Q

SD vs SEM

A

▪ SD: typical deviation of each observation from the mean (quantifies variation in the dataset);
parameter for the probability distribution of individual observations
▪ SEM: typical deviation of each sample mean from the central (true) mean; parameter for the sampling distribution
▪ SEM is smaller than SD, as it takes into account both SD and sample size (n): SEM = SD / square root (n)
▪ Exception is n = 1, where SE equals SD
▪ Larger n = smaller SE → the estimate becomes more precise
▪ SD does not systematically decrease with sample size; the sample SD settles around the population SD

40
Q

Confidence Intervals (CIs)

A
41
Q

CONFIDENCE INTERVALS vs. REFERENCE RANGES

A

▪ Not to be confused
▪ A 95% CI for the mean is narrower than the 95% reference range whenever n > 1 (because SE < SD)
▪ Reference range gives the spread of data in the population, e.g. 95% within 2 SD of the mean
▪ CI is estimating an unknown population parameter
95% CI = summary statistic +/- 1.96 x SE (uncertainty)
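
The contrast between the two intervals can be seen by computing both from the same summary figures. The numbers below (mean 27, SD 4, n = 100) are assumptions for illustration, and `interval` is a hypothetical helper.

```python
import math


def interval(centre: float, spread: float, multiplier: float = 1.96) -> tuple[float, float]:
    """Return (lower, upper) = centre -/+ multiplier x spread."""
    half_width = multiplier * spread
    return centre - half_width, centre + half_width


# Assumed summary figures from a hypothetical sample.
mean, sd, n = 27.0, 4.0, 100
se = sd / math.sqrt(n)  # 0.4

reference_range = interval(mean, sd)      # where ~95% of individuals lie
confidence_interval = interval(mean, se)  # plausible values for the true mean

print(f"95% reference range: {reference_range[0]:.2f} to {reference_range[1]:.2f}")
print(f"95% CI for the mean: {confidence_interval[0]:.2f} to {confidence_interval[1]:.2f}")
```

The reference range uses the SD and describes individuals, while the CI uses the SE and describes the uncertainty in the estimated mean, which is why it is much narrower here.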

42
Q

The t-distribution

A