Statistics Flashcards

(51 cards)

1
Q

What does PMCC measure?

A

The product moment correlation coefficient (PMCC) measures the strength and direction of the linear correlation between two variables, giving a value r between -1 and 1.

r = 1 indicates perfect positive linear correlation, r = -1 indicates perfect negative linear correlation, and r = 0 indicates no linear correlation
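
As a rough illustration, r can be computed from paired data with the standard formula r = Sxy / sqrt(Sxx × Syy). The function name `pmcc` and the data are made up for this sketch.

```python
from math import sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Made-up data: y rises exactly linearly with x, so r should be 1
r_up = pmcc([1, 2, 3, 4], [3, 5, 7, 9])
# Made-up data: y falls exactly linearly with x, so r should be -1
r_down = pmcc([1, 2, 3, 4], [8, 6, 4, 2])
print(r_up, r_down)
```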

2
Q

What is a regression line?

A

The line of best fit that minimises the sum of the squared residuals. It always passes through the point (mean of x, mean of y)
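
A minimal sketch of a least-squares regression line y = a + bx, assuming the usual formulas b = Sxy / Sxx and a = (mean of y) - b × (mean of x). The function name and data are hypothetical.

```python
def regression_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx           # gradient
    a = mean_y - b * mean_x   # intercept, chosen so the line passes through the means
    return a, b

# Made-up data with mean of x = 3 and mean of y = 4
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = regression_line(xs, ys)
print(a, b)        # intercept and gradient
print(a + b * 3)   # value of the line at the mean of x
```

The last line evaluates the fitted line at the mean of x, which should give the mean of y.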

3
Q

What is the equation for P(A|B)?

A

P(A|B) = P(A ∩ B) / P(B)

4
Q

What is the equation for P(A ∩ B)?

A

P(A ∩ B) = P(A) × P(B|A)
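
A small worked example of both conditional probability formulas, using a made-up bag of 3 red and 2 blue counters with two draws without replacement. A is "first counter is red" and B is "second counter is red"; all names here are illustrative.

```python
from fractions import Fraction

# Made-up scenario: 3 red and 2 blue counters, two draws without replacement
p_a = Fraction(3, 5)              # P(A): first counter is red
p_b_given_a = Fraction(2, 4)      # P(B|A): second is red given the first was red
p_b_given_not_a = Fraction(3, 4)  # P(B|A'): second is red given the first was blue

# Multiplication rule: P(A and B) = P(A) x P(B|A)
p_a_and_b = p_a * p_b_given_a

# Total probability of B, summing over both branches of the tree diagram
p_b = p_a_and_b + (1 - p_a) * p_b_given_not_a

# Conditional probability: P(A|B) = P(A and B) / P(B)
p_a_given_b = p_a_and_b / p_b
print(p_a_and_b, p_b, p_a_given_b)
```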

5
Q

What does it mean for two events to be independent?

A

The occurrence of one event doesn’t affect the probability of the other

6
Q

How could you test whether two events are independent?

A

Check whether P(A ∩ B) = P(A) × P(B). The events are independent if and only if this holds
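
The check can be sketched with a fair six-sided die, treating events as sets of equally likely outcomes. The events chosen and the helper names are assumptions for illustration only.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}   # fair die, all outcomes equally likely

def prob(event):
    """Probability of an event (a set of outcomes) on the fair die."""
    return Fraction(len(event & outcomes), len(outcomes))

def independent(a, b):
    """True if P(A and B) equals P(A) x P(B)."""
    return prob(a & b) == prob(a) * prob(b)

even = {2, 4, 6}
at_most_four = {1, 2, 3, 4}
at_most_three = {1, 2, 3}

check_1 = independent(even, at_most_four)    # 1/3 compared with 1/2 x 2/3
check_2 = independent(even, at_most_three)   # 1/6 compared with 1/2 x 1/2
print(check_1, check_2)
```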

7
Q

What does it mean for events to be mutually exclusive?

A

The events cannot occur at the same time

8
Q

How would you check if two events are mutually exclusive?

A

Check whether P(A ∩ B) = 0. If it is, then P(A ∪ B) = P(A) + P(B)

9
Q

What is the difference between a population, a sample, and a sampling frame?

A

A population is the whole group, while a sample is a selected group from the population. A sampling frame is a list of all the members of the population

10
Q

What are the different sampling methods?

A
  • Census
  • Simple random sampling
  • Systematic sampling
  • Stratified sampling
  • Quota sampling
  • Opportunity sampling
11
Q

What is a census? Advantages? Disadvantages?

A

Collects data about all the members of a population.

Gives accurate, unbiased results, but is time-consuming and expensive, and would use up the whole population if testing destroys the items (e.g. consumables)

12
Q

What are the advantages and disadvantages of using sampling over census?

A

Sampling is quicker and cheaper, and leaves less data to analyse, but the sample might not represent the population accurately and could introduce bias

13
Q

What is simple random sampling and how would you carry it out?

A

A sample of size n is taken where every member of the population has an equal probability of being selected.

Give every member of the population a unique number, then use a random number generator to select n different numbers
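
A minimal sketch of the procedure using Python's standard `random` module. The population of 200 numbered members and the sample size of 20 are made-up values, and the seed is fixed only so the run is repeatable.

```python
import random

# Made-up sampling frame: 200 members, each with a unique number
population = [f"member_{i}" for i in range(1, 201)]

rng = random.Random(42)  # seeded only to make this sketch repeatable

# Select 20 different members, each equally likely to be chosen
sample = rng.sample(population, k=20)
print(sample)
```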

14
Q

When should simple random sampling be used? Advantages? Disadvantages?

A

Should be used when you want a random sample to avoid bias.

Is unbiased and useful in a small population, but inconvenient for very large or spread out populations

15
Q

What is systematic sampling and how would you carry it out?

A

A sample is formed by choosing members of a population at regular intervals using a list.

Calculate the size of the interval k (population size / sample size), choose a random start point within the first k members, then select every kth member after it.
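
One way to sketch this, assuming the sampling frame is an ordered list and the interval is rounded down when the sizes don't divide exactly. The function name `systematic_sample` is hypothetical.

```python
import random

def systematic_sample(frame, n, rng):
    """Pick n members at regular intervals from an ordered sampling frame."""
    k = len(frame) // n       # interval: population size / sample size, rounded down
    start = rng.randrange(k)  # random start point within the first k members
    return [frame[start + i * k] for i in range(n)]

# Made-up frame of 200 numbered members, sample of 20, so the interval is 10
frame = list(range(1, 201))
sample = systematic_sample(frame, 20, random.Random(1))
print(sample)
```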

16
Q

When should systematic sampling be used? Advantages? Disadvantages?

A

Should be used when you want a random sample from a large population.

Useful when there is a natural order, but it can’t be used if it isn’t possible to list all members of the population, and the sample is only random if the sampling frame is in random order

17
Q

What is stratified sampling and how would you carry it out?

A

The population is divided into groups called strata, and a random sample is taken from each group.

Population could be split into strata by defining characteristics. Then the number of members to be sampled from a stratum = (size of sample / size of population) x number of members in the stratum
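
The allocation formula can be sketched as follows. The strata names and sizes are made up, and rounding to the nearest whole number is an assumption (rounded allocations may not always add up to the exact sample size).

```python
# Made-up strata: number of members in each group
strata_sizes = {"Year 12": 120, "Year 13": 80}
sample_size = 50

population_size = sum(strata_sizes.values())

# Number sampled from a stratum = (sample size / population size) x stratum size
allocation = {
    name: round(sample_size * size / population_size)
    for name, size in strata_sizes.items()
}
print(allocation)
```

A simple random sample of the allocated size would then be taken within each stratum.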

18
Q

When should stratified sampling be used? Advantages? Disadvantages?

A

Should be used when the population can be split into obvious groups of members.

Useful when there are very different groups of members within a population: the sample will be representative of the population structure, and the sample from each group is random. It can’t be used if the population can’t be divided into discrete groups

19
Q

What is quota sampling and how would you carry it out?

A

The population is split into groups and members of the population are selected until each quota is filled.

If a member doesn’t want to be included, another member is chosen instead, and the members don’t need to be selected randomly

20
Q

When should quota sampling be used? Advantages? Disadvantages?

A

Should be used when a small sample is needed to be representative of the population structure.

Useful when a sampling frame is not available, but can introduce bias as some members may choose to not be included

21
Q

What is opportunity sampling?

A

A sample is formed using the first available members of a population who fit the criteria

22
Q

When should opportunity sampling be used? Advantages? Disadvantages?

A

Should be used when a sample is needed quickly.

Useful when a list of the population is not possible, but unlikely to be representative of the population structure

23
Q

What is variance?

A

The standard deviation squared. A measure of the spread within a set of data.

24
Q

What is standard deviation?

A

Measures how far, on average, each data point is from the mean

25
What is the correction that needs to be made to the equation when calculating standard deviation of a sample and why?
Replace n in the denominator with n-1. Deviations are measured from the sample mean rather than the true population mean, so dividing by n tends to underestimate the spread of the population
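
Python's standard `statistics` module has both versions: `pstdev` divides by n and `stdev` divides by n-1. A quick comparison on made-up data:

```python
from statistics import pstdev, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data, mean 5

sd_n = pstdev(data)          # divides the sum of squared deviations by n
sd_n_minus_1 = stdev(data)   # divides by n - 1, so it comes out slightly larger
print(sd_n, sd_n_minus_1)
```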
26
What does a large standard deviation mean?
The data is more spread out
27
What are outliers? What are the two ways of finding outliers?
Data points that differ significantly from the rest of the data. A data point is treated as an outlier if it is more than 2 standard deviations from the mean, or if it is more than 1.5 x interquartile range above the upper quartile or below the lower quartile
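
Both rules can be sketched as below. The data are made up, the quartiles are passed in as already-known values (here taken as the medians of the lower and upper halves), and the function names are hypothetical.

```python
from statistics import mean, pstdev

def outliers_by_sd(data):
    """Points more than 2 standard deviations from the mean."""
    m, s = mean(data), pstdev(data)
    return [x for x in data if abs(x - m) > 2 * s]

def outliers_by_iqr(data, q1, q3):
    """Points more than 1.5 x IQR below Q1 or above Q3."""
    iqr = q3 - q1
    return [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

data = [10, 12, 12, 13, 14, 15, 15, 16, 40]  # made-up data
sd_outliers = outliers_by_sd(data)
iqr_outliers = outliers_by_iqr(data, q1=12, q3=15.5)  # assumed quartile values
print(sd_outliers, iqr_outliers)
```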
28
Describe the implications of the mean and standard deviation if a variable x is transformed using y=ax+b, where a and b are constants
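mean of y = a x (mean of x) + b. SD of y = |a| x (SD of x). Adding b shifts the data without changing its spread, and the SD can never be negative, so the scale factor is the size of a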
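
A quick numerical check of these rules on made-up data, using a negative value of a to show that the standard deviation scales by the size of a rather than by a itself:

```python
from statistics import mean, pstdev

xs = [2, 4, 6, 8]   # made-up data
a, b = -3, 10       # made-up coding constants

ys = [a * x + b for x in xs]

# Mean is scaled by a and shifted by b
print(mean(ys), a * mean(xs) + b)
# Standard deviation is scaled by |a| and unaffected by b
print(pstdev(ys), abs(a) * pstdev(xs))
```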
29
How would you plot a cumulative frequency graph from grouped data?
Cumulative frequency is the running total of the frequencies up to and including each class. Plot each cumulative frequency against the upper boundary of its class interval and join the points with a smooth curve. The curve often looks like an elongated S.
30
How would you find the median, upper and lower quartiles on a cumulative frequency graph?
For the upper quartile, find 75% of the total frequency on the y axis, draw across to the curve, then read off the corresponding x value. Repeat with 50% for the median and 25% for the lower quartile
31
What is a skew? What is a positive and negative skew?
Skew describes a distribution that isn't symmetrical. With positive skew, the median is closer to the lower quartile than to the upper quartile (the longer tail is on the right). With negative skew, the median is closer to the upper quartile (the longer tail is on the left)
32
Describe the properties of a histogram
  • Continuous data
  • Bars may be of unequal width
  • Vertical axis is frequency density
  • Area is proportional to frequency. A = kf so you don't always need a scale on the y axis
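
A small sketch of the bar heights, assuming the usual definition frequency density = frequency / class width (so here area equals frequency, i.e. k = 1). The classes and frequencies are made up.

```python
# Made-up grouped data: (lower boundary, upper boundary, frequency)
classes = [(0, 10, 5), (10, 15, 10), (15, 30, 15)]

# Bar height = frequency / class width
densities = [freq / (upper - lower) for lower, upper, freq in classes]

# Bar area = class width x height, which recovers the frequency
areas = [(upper - lower) * d for (lower, upper, _), d in zip(classes, densities)]
print(densities, areas)
```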
33
What does n! show?
n factorial is the number of ways you can order n different things. n! = n x (n-1) x (n-2) x ... x 2 x 1
34
What is the formula for nCr? What is another way of writing nCr?
nCr = n! / (r! x (n-r)!). It can also be written as (n r), with n above r inside brackets like a column vector
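
Python's standard `math` module provides `factorial` and `comb`, so the formula can be checked directly. The values n = 5 and r = 2 are just an example.

```python
from math import comb, factorial

n, r = 5, 2  # example values

# nCr from the factorial formula
from_formula = factorial(n) // (factorial(r) * factorial(n - r))

# nCr from the built-in function
from_builtin = comb(n, r)
print(from_formula, from_builtin)
```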
35
What is the formula for any term in the binomial expansion of (a+b)^n?
nCr x a^r x b^(n-r)
36
How would you write a binomial distribution?
X ∼ B(n, p), where n is the number of trials and p is the probability of the event (success) on each trial
37
What is the formula for P(X=r)?
nCr x p^r x (1-p)^(n-r) where n is the number of trials, and p is the probability of the event
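
The formula translates directly into a short function. The name `binomial_pmf` and the example values (10 trials, p = 0.5, r = 3) are assumptions for this sketch.

```python
from math import comb

def binomial_pmf(n, p, r):
    """P(X = r) for X ~ B(n, p)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

# Example: probability of exactly 3 heads in 10 fair coin tosses
prob_three = binomial_pmf(10, 0.5, 3)
print(prob_three)
```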
38
What are the conditions needed for a binomial distribution to be valid?
  • There is a fixed number of trials
  • There are only two outcomes
  • Trials are independent
  • Trials have a fixed probability
39
What are the formulae for the mean and variance of a binomial distribution?
mean = np, variance = np(1-p), where n is the number of trials and p is the probability of the event
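
As a sanity check, the two formulas can be compared with the mean and variance calculated directly from the probabilities P(X = r). The values n = 10 and p = 0.3 are made up.

```python
from math import comb

n, p = 10, 0.3  # made-up parameters

# P(X = r) for every possible r
pmf = [comb(n, r) * p**r * (1 - p) ** (n - r) for r in range(n + 1)]

# Mean and variance from first principles
mean_direct = sum(r * prob for r, prob in enumerate(pmf))
var_direct = sum(r**2 * prob for r, prob in enumerate(pmf)) - mean_direct**2

print(mean_direct, n * p)            # both should be about 3
print(var_direct, n * p * (1 - p))   # both should be about 2.1
```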
40
How would you work out P(2 < X <= 6)?
P(X <= 6) - P(X <= 2)
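
A short check that the difference of cumulative probabilities matches adding up P(X = 3) to P(X = 6) directly. The distribution B(10, 0.5) and the function name are assumptions for this sketch.

```python
from math import comb

def binomial_cdf(n, p, x):
    """P(X <= x) for X ~ B(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x + 1))

n, p = 10, 0.5  # made-up parameters

# P(2 < X <= 6) as a difference of cumulative probabilities
via_cdf = binomial_cdf(n, p, 6) - binomial_cdf(n, p, 2)

# The same probability by summing P(X = 3), ..., P(X = 6)
via_sum = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(3, 7))
print(via_cdf, via_sum)
```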
41
What is a significance level?
The probability of rejecting the null hypothesis when it is in fact true. It is the cut-off below which a result is considered too unlikely to be due to chance alone
42
How would you find the critical region?
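Assuming H0, use trial and error to find the values of x for which the tail probability (P(X <= x) for a lower tail, P(X >= x) for an upper tail) is less than the significance level. These values make up the critical region. For a two-tailed test, compare each tail with half the significance level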
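
A sketch of the search for a one-tailed (lower tail) test, using cumulative probabilities. The distribution B(20, 0.5) under H0 and the 5% significance level are made-up values.

```python
from math import comb

def binomial_cdf(n, p, x):
    """P(X <= x) for X ~ B(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x + 1))

n, p, alpha = 20, 0.5, 0.05  # made-up test: H0: p = 0.5, H1: p < 0.5, 5% level

# Lower-tail critical region: all x with P(X <= x) below the significance level
critical_region = [x for x in range(n + 1) if binomial_cdf(n, p, x) < alpha]
print(critical_region)  # [0, 1, 2, 3, 4, 5]
```

Here P(X <= 5) is about 0.021 and P(X <= 6) is about 0.058, so the critical region is X <= 5.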
43
How would you set up a hypothesis test?
H0: p = (default probability from question)
H1: p > or p < that same value, depending on the context of the question
Assuming H0, X ∼ B(n, p), using the value of p from H0
Calculate P(X >= x) or P(X <= x), where x is the observed value from the question
If this probability is less than the significance level, the result is significant, so reject H0: there is sufficient evidence to suggest that (context from question). Otherwise, do not reject H0: there is insufficient evidence. The direction of the inequalities depends on the context of the question
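
The steps can be sketched for a one-tailed (upper tail) test. Everything here is made up for illustration: H0: p = 0.5, H1: p > 0.5, 10 trials, an observed value of 9, and a 5% significance level.

```python
from math import comb

def upper_tail(n, p, x):
    """P(X >= x) for X ~ B(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x, n + 1))

n, p0 = 10, 0.5   # made-up test: H0: p = 0.5, H1: p > 0.5
observed = 9      # made-up observed value
alpha = 0.05      # 5% significance level

# Probability of a result at least as extreme as the one observed, assuming H0
p_value = upper_tail(n, p0, observed)
reject_h0 = p_value < alpha

print(p_value)
print("Reject H0" if reject_h0 else "Do not reject H0")
```

With these numbers P(X >= 9) = 11/1024, which is below 0.05, so H0 would be rejected.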
44
What are the 5 makes of car in the large data set?
  • BMW
  • Ford
  • Toyota
  • Vauxhall
  • Volkswagen
45
What are the 3 regions in the large data set?
  • London
  • North West
  • South West
46
What are the 5 propulsion types in the large data set? What is special about some of them?
  • Petrol
  • Diesel
  • Electric
  • Gas/Petrol
  • Electric/Petrol
Electric and gas/petrol only have 1 piece of data
47
What are the 4 keeper title IDs?
  • Male
  • Female
  • Unknown (Rev, Dr, etc)
  • Company
48
What are the 2 years included in the large data set?
  • 2002
  • 2016
49
What are the units used in the large data set?
  • g/km for emissions
  • kg for mass
  • cm^3 for engine size
50
What does the mass of the car include in the large data set?
Includes a 75kg driver (the actual mass of the car is the dataset value - 75)
51
What is a common error in the large data set and how should it be dealt with?
The mass of some cars has been put as 0, so these data must be excluded from mean and standard deviation calculations
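
A minimal sketch of this cleaning step. The list of masses is made up and does not come from the large data set.

```python
from statistics import mean, stdev

# Made-up masses in kg, including two erroneous zero entries
masses = [1250, 0, 1480, 1105, 0, 1390]

# Exclude the zero entries before calculating anything
valid_masses = [m for m in masses if m > 0]

mean_mass = mean(valid_masses)
sd_mass = stdev(valid_masses)  # sample standard deviation (n - 1)
print(valid_masses, mean_mass, sd_mass)
```

Leaving the zeros in would pull the mean down and inflate the standard deviation.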