Data Analytics Theory Flashcards

(333 cards)

1
Q

Which of the mean, mode and median are resistant to outliers?

A

The mean is very sensitive to the presence of outliers. The median and mode are very resistant to outliers.

2
Q

True or false - the median is calculated differently depending on whether the sample contains an even or odd number of observations?

A

True

3
Q

What are the steps for determining the mean?

A

The sum of all sample values (Xi) divided by the number of samples.

4
Q

What are the steps for determining the median?

A

Order the sample values in ascending order. For odd total n the median is found at (n+1)/2. For even total n, the median is the average of the value at n/2 and (n+2)/2.
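
The position rule above can be sketched in a few lines of Python. This is only an illustration: the function name `median_by_position` and the example values are made up for this card, and positions are counted from 1 as in the answer.

```python
def median_by_position(values):
    """Median using the 1-based position rule from the card."""
    ordered = sorted(values)              # step 1: ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    if n % 2 == 1:
        # odd n: the value at position (n + 1) / 2
        return ordered[(n + 1) // 2 - 1]
    # even n: average of the values at positions n/2 and (n + 2)/2
    return (ordered[n // 2 - 1] + ordered[(n + 2) // 2 - 1]) / 2


odd_result = median_by_position([7, 1, 5])      # sorted: 1, 5, 7
even_result = median_by_position([7, 1, 5, 3])  # sorted: 1, 3, 5, 7
print(odd_result, even_result)
```

The `- 1` in each index only converts the card's 1-based positions to Python's 0-based list indices.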

5
Q

What are the steps for determining the mode?

A

Create a frequency table, find the highest frequency, then read off the value (or values) that this frequency belongs to.

6
Q

What is the mode?

A

The observation that occurs most frequently in the dataset.

7
Q

Why should we not look at measures of centrality in isolation?

A

Comparing the measures of centrality between datasets may indicate that they are similar when in reality they have different amounts of dispersion.

8
Q

What are the measures of centrality?

A

Mean, mode, median

9
Q

What do measures of variability describe?

A

Measures of variability describe how dispersed observations in the univariate dataset are. They describe whether observations are tightly clustered or spread out.

10
Q

What are the measures of variability?

A

Variance and standard deviation. Range (though very sensitive to outliers). Five number summary provides basic information about variability.

11
Q

What are synonyms of the mean?

A

Arithmetic mean or average

12
Q

What is the formula for calculating the mean?

A

[See flashcard]

13
Q

What is the mean?

A

The mean is considered to be the central (typical) measurement of a collection of observations.

14
Q

What is the formula for calculating the standard deviation?

A

[See flashcard]

15
Q

What is the formula for calculating the variance?

A

[See flashcard]

16
Q

What is the variance?

A

The average squared distance of each observation from the mean. Measured in units squared.

17
Q

What units is the variance measured in?

A

Units squared

18
Q

What is the standard deviation?

A

The square root of the variance - it is useful for judging how close the observations are to the mean. Measured in the same units/same scale as the observations in the numerical variable.
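
A minimal sketch of the two definitions, assuming the "average squared distance" wording means dividing by n (the population form). The formula cards in this deck are not reproduced in the text, so that choice is an assumption; the sample form divides by n - 1 instead. The function name and data are illustrative only.

```python
import math


def population_variance(values):
    """Average squared distance of each observation from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


data = [2, 4, 4, 4, 5, 5, 7, 9]    # made-up observations, mean = 5
variance = population_variance(data)   # in units squared
std_dev = math.sqrt(variance)          # back in the original units
print(variance, std_dev)
```

Taking the square root is what brings the measure back to the scale of the observations.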

19
Q

What units is the standard deviation measured in?

A

Same units/scale as the observations in the numerical variable.

20
Q

How much of the data is usually within one standard deviation from the mean?

A

About 68% (for data that are approximately normally distributed)

21
Q

How much of the data is usually within two standard deviations from the mean?

A

About 95% (for data that are approximately normally distributed)

22
Q

What are order statistics?

A

Statistics based on sorted (ranked) data

23
Q

Define a quantile.

A

The value computed from a sorted collection of numerical measurements (in ascending order) that indicates an observation’s rank when compared to all other present observations. It can take a value between 0 and 1.

24
Q

What does the 0.5th quantile mean?

A

This is the median value, below which half (50%) of the measurements lie.

25
What values can a quantile take?
Between 0 and 1.
26
What values can a percentile take?
Between 0 and 100.
27
What is the relationship between a quantile and a percentile?
The percentile is the quantile expressed on a "percent scale" of 0 to 100, i.e. the p-th quantile (p between 0 and 1) is the (100 x p)-th percentile. For example, the 0.25 quantile is the 25th percentile.
28
Define percentile.
The percentile is the quantile expressed on a "percent scale" of 0 to 100, i.e. the p-th quantile (p between 0 and 1) is the (100 x p)-th percentile. The Pth percentile is the cutoff point such that at least P percent of the observations in the dataset take on this value or less.
29
What does the 80th percentile represent?
The 80th percentile is the cutoff point which indicates that 80% of observations in the dataset may be found at this point or below.
30
What are quartiles?
Quartiles are three cut off points that divide the dataset into four equal groups (Q1, Q2, Q3)
31
Define the first quartile
Q1 = 0.25th quantile = 25th percentile. This is the middle value between the smallest observation and the median. Ie it is the median of the lower half of the dataset.
32
Define the second quartile.
Q2 = 0.5th quantile = 50th percentile. This is the median of the dataset (the value which splits the dataset in half).
33
Define the third quartile.
Q3 = 0.75th quantile = 75th percentile. This is the middle value between the median and the highest observation in the dataset. Ie it is the median of the upper half of the dataset.
34
Define the range.
The range is the difference between the smallest and largest observations in a numerical variable. It is extremely sensitive to outliers and therefore not very useful as a general measure of dispersion in the data.
35
Why is the range not very useful as a general measure of dispersion in the data?
It is extremely sensitive to outliers - its calculation involves the use of extreme values.
36
What is the five number summary?
This provides basic information about variability in the dataset. It consists of the 0th percentile (minimum), 25th percentile (Q1), 50th percentile (Q2), 75th percentile (Q3) and 100th percentile (maximum). Ie it is the quartiles plus the maximum and minimum values.
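A possible sketch of the five number summary, using the "median of the lower/upper half" definitions of Q1 and Q3 from the quartile cards. One assumption is made here: for an odd number of observations the median itself is left out of both halves. Software packages use several different quartile conventions, so results can differ slightly. All names and values are illustrative.

```python
def median_of_sorted(sorted_values):
    """Median of a list that is already in ascending order."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)               # order statistics need sorted data
    n = len(ordered)
    lower_half = ordered[: n // 2]         # observations below the median position
    upper_half = ordered[(n + 1) // 2:]    # observations above the median position
    return (
        ordered[0],
        median_of_sorted(lower_half),
        median_of_sorted(ordered),
        median_of_sorted(upper_half),
        ordered[-1],
    )


summary = five_number_summary([13, 1, 9, 5, 3, 11, 7])
iqr = summary[3] - summary[1]              # Q3 - Q1
print(summary, iqr)
```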
37
What is the interquartile range?
The interquartile range (IQR) measures the width of the "middle 50 percent" of the data. It is the range of values between Q1 (0.25 quantile) and Q3 (0.75 quantile). It is very resistant to outliers as it doesn't consider the extremes where outliers are present.
38
Why is the IQR resistant to outliers?
The IQR measures the range across the middle 50% of the data, and therefore unlike the range it doesn't consider the extremes where the outliers are present.
39
What is the first step to carry out before determining order statistics?
Sort the data in ascending order.
40
What is covariance?
Covariance measures joint variability — the extent of variation between two random variables. It quantifies how two variables vary together.
41
What are the possible outcomes for covariance and what does each mean?
R = 0 - there is no linear relationship between numerical variables x and y. R > 0 - there is a positive linear relationship between numerical variables x and y (as x increases, y increases and vice versa). R < 0 - there is a negative linear relationship between numerical variables x and y (as x increases, y decreases and vice versa)
42
What does a positive linear relationship mean?
R > 0 - as x increases, y increases and vice versa
43
What does a negative linear relationship mean?
R < 0 - as x increases, y decreases and vice versa
44
Does correlation or covariance measure how strong a relationship is?
Correlation
45
Why does calculating the covariance not tell us how strong a relationship is?
Covariance can tell us if there is a relationship between two variables, but it cannot measure how strong the relationship is as there is no scale to compare the value of r to.
46
What type of variable can covariance and correlation be calculated for?
Numerical variables.
47
What is the problem with covariance?
We cannot use it to quantify the strength of the linear relationship between two variables, because the covariance has no upper or lower limit.
48
What does correlation measure?
The direction and strength of an association between two variables. It is used to interpret the covariance.
49
What coefficient do we use for correlation?
Pearson’s product-moment correlation coefficient (ρxy, "rho xy").
50
What are the interpretations of the absolute strength of the Pearson’s product-moment correlation coefficient?
There are guidelines available to interpret the value of rho (L.R. = linear relationship): |rho| = 0.0 – no L.R.; 0.0 < |rho| <= 0.19 – very weak L.R.; 0.20 <= |rho| <= 0.39 – weak L.R.; 0.40 <= |rho| <= 0.59 – moderate L.R.; 0.60 <= |rho| <= 0.79 – strong L.R.; 0.80 <= |rho| < 1.0 – very strong L.R.; |rho| = 1.0 – perfect L.R.
51
What are the basic interpretations of the Pearson’s product-moment correlation coefficient?
If rho = 1, there is a perfect positive linear relationship between variables x and y. If 0 < rho < 1, there is a positive linear relationship between x and y. The closer to 1 the stronger it is. If rho = -1, there is a perfect negative linear relationship between x and y. If -1 < rho < 0, there is a negative linear relationship between x and y. The closer to -1 the stronger it is. If rho = 0, there is no linear relationship between x and y.
52
What values can Pearson’s product-moment correlation coefficient take on?
Rho is between -1 and 1.
53
Why are we able to say how strong the relationship is using Pearson’s product-moment correlation coefficient?
It is scaled between - 1 and 1.
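The difference between covariance and correlation can be illustrated with a small sketch. It assumes the sample form of the covariance (dividing by n - 1), since the deck's formula cards are not reproduced in the text, and it uses made-up height and weight values. Changing the units of x rescales the covariance but leaves rho unchanged.

```python
import math


def covariance(xs, ys):
    """Sample covariance (n - 1 in the denominator, an assumption here)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


height_m = [1.5, 1.6, 1.7, 1.8, 1.9]          # hypothetical data
weight_kg = [50, 58, 65, 74, 80]
height_cm = [h * 100 for h in height_m]       # same heights, different units

print(covariance(height_m, weight_kg), covariance(height_cm, weight_kg))
print(pearson(height_m, weight_kg), pearson(height_cm, weight_kg))
```

The covariance grows by a factor of 100 when metres become centimetres, which is why its size alone says nothing about strength, while rho stays on the fixed -1 to 1 scale.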
54
What is a frequency table?
A statistical technique used to get more insight into the properties of categorical variables.
55
What are the columns of a frequency table?
1 - category 2 - frequency column (F) - the number of occurrences of each categorical variable. Will total to n 3 - relative frequency (RF) - the proportion of occurrences of each categorical variable. (F/n). The sum of all relative frequencies when written as proportions must be equal to 1. 4 - percentages (P) - proportions multiplied by 100. The sum of this column must equal 100.
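The four columns can be built with the standard library as a rough sketch. The function name `frequency_table` and the colour data are invented for illustration.

```python
from collections import Counter


def frequency_table(observations):
    """Rows of (category, F, RF, P), most frequent category first."""
    counts = Counter(observations)
    n = len(observations)
    return [
        (category, f, f / n, 100 * f / n)
        for category, f in counts.most_common()
    ]


colours = ["red", "blue", "red", "green", "red", "blue", "red", "green"]
table = frequency_table(colours)
for row in table:
    print(row)

total_f = sum(row[1] for row in table)     # should equal n
total_rf = sum(row[2] for row in table)    # should equal 1
total_p = sum(row[3] for row in table)     # should equal 100
```

Because the rows are sorted by frequency, the first row also gives the mode of the variable.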
56
What does the relative frequency column of a frequency table sum to?
1
57
Why are frequency tables useful?
They help us to summarise large amounts of data and display this information clearly. We can see the most/least common categories and can calculate proportions.
58
What are contingency tables used for?
A contingency table summarises data for two categorical variables (table of counts by category). Each value in the table represents the number of times a particular combination of variable outcomes occurred.
59
What is the relationship between a frequency table and contingency table?
Both tables are used to summarise information on categorical variables. A frequency table summarises information on a single categorical variable, whereas a contingency table summarises the data for two categorical variables.
60
What kind of tool can be used to answer questions like "what proportion of spam emails contains text without numbers?"
Two categorical variables - contingency table
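A question like this can be answered from the cell counts of a contingency table. The sketch below uses a handful of invented (email type, number content) pairs, so the resulting proportion is purely illustrative.

```python
from collections import Counter

# Hypothetical observations: one (type, number) pair per email.
emails = [
    ("spam", "none"), ("spam", "none"), ("spam", "some"), ("spam", "none"),
    ("not spam", "some"), ("not spam", "some"), ("not spam", "none"),
    ("not spam", "some"),
]

# Each cell counts how often one combination of the two variables occurred.
table = Counter(emails)

# Row total for spam, then the share of spam emails with no numbers.
spam_total = sum(count for (kind, _), count in table.items() if kind == "spam")
proportion = table[("spam", "none")] / spam_total
print(proportion)
```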
61
What are bar charts used to visualise?
Categorical variables. This can be represented as frequency or proportion.
62
How are categorical variables visualised?
Bar charts - this can be by frequency or proportion.
63
What are the different axes of a bar chart?
The x-axis represents the different symbols (categories) of a categorical variable. The y-axis represents the frequency or proportion of the occurrence of each category.
64
What is a mosaic plot?
A graphical representation of the information in a contingency table. It is similar to a bar plot.
65
How many variables can a mosaic plot represent?
A mosaic plot can be used to visualise one or two categorical variables from a contingency table.
66
How do mosaic plots represent the number of observations?
Mosaic plots use the area of each box to represent the number of observations in that box.
67
What is used to visualise contingency tables?
A mosaic plot
68
How does a two-variable mosaic plot represent the two variables?
One category (x) is used to create an initial one variable mosaic plot where the area represents the number of observations for that category. The second category (y) is represented by splitting each bar proportionally according to the fractions of y.
69
What types of variables are plotted on a scatterplot?
Numerical variables
70
What is a scatterplot?
A plot that provides a case-by-case view of data for two numerical variables.
71
What are scatterplots useful for?
Scatterplots are helpful in quickly spotting associations between two numerical variables.
72
What is a box plot?
A visualisation technique used for explaining important features of the distribution of the target numerical variable. It provides insight into centrality, spread, skewness and possible outliers.
73
What does a box plot show?
Centrality (median), spread (quartiles), skewness and possible outliers.
74
Do the whiskers of a box plot represent the full range?
No, the whiskers may not capture the maximum and minimum values. How the whiskers are determined depends on the software package used, e.g. extending at most 1.5 x IQR beyond the quartiles.
75
What can box plots be useful for?
Identifying outliers.
76
If a box plot shows a lot of outliers above the maximum whisker (high positive), what does this indicate about the skew of the data?
Right-skewed
77
If a box plot shows a lot of outliers below the minimum whisker, what does this indicate about the skew of the data?
Left-skewed
78
How can you identify suspected outliers on a box plot?
Suspected outliers are the observations beyond the maximum reach of the whiskers.
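As a rough sketch, suspected outliers can be flagged numerically with the 1.5 x IQR whisker convention mentioned on the whisker card. That convention is only one of several, and `statistics.quantiles` uses its own default quartile method, so the exact fences here are an assumption. The data are invented.

```python
import statistics


def suspected_outliers(values, k=1.5):
    """Observations beyond Q1 - k*IQR or Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)   # three quartile cut points
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [x for x in values if x < lower_fence or x > upper_fence]


data = [10, 12, 13, 14, 15, 16, 18, 60]    # made-up sample with one extreme value
outliers = suspected_outliers(data)
print(outliers)
```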
79
What is an outlier?
An outlier is an observation that appears extreme relative to the rest of the data
80
Why is it important to look for outliers?
- To identify a strong skew in the distribution - To identify data collection or entry errors - To get an insight into interesting properties of the data
81
What are side-by-side box plots used for?
Side-by-side box plots are a traditional tool for comparing numerical observations across categories. They are particularly useful for comparing centrality and spread of numerical observations between categories.
82
What visualisation technique can you use for exploring the distribution of numerical and categorical variables together?
Side-by-side box plots
83
What measures are side-by-side box plots particularly useful for?
Comparison of centrality and spread of numerical observations between categories.
84
How should you answer questions describing graphs?
- Describe what you see - Relate this to the question (ie what does this mean in real life) - Support with figures from the graph
85
What are histograms?
Histograms are plots that are used for describing the shape of the data distribution of the target numerical variable. They also provide a view of the data density of the target numerical variable (higher bars represent where data is more common).
86
What kind of data type is plotted in a histogram?
Numerical
87
What kind of visualisation describes the shape of the data distribution of a numerical variable?
Histogram
88
What does a higher bar in a histogram represent?
Where the data are relatively more common.
88
What kind of visualisation describes the data density of a numerical variable?
Histogram - where higher bars represent where the data are relatively more common.
89
What are the similarities of bar charts and histograms?
They use bars to represent frequencies / they both measure frequencies.
90
What are the differences of bar charts and histograms?
- Histograms are used for displaying distributions of numerical variables while bar charts are used for categorical variables. - Both measure frequencies, but in histograms, observations first need to be "binned"
91
What is a "bin" in a histogram?
A defined interval (used to group individual numerical values). The number of observations that fall within each interval are counted and this frequency is used to determine the height of the bar for that interval.
92
Why does bin width matter when plotting histograms?
The chosen bin width can alter the story that the histogram is telling. Increasing the bin width may reduce the number of modes that are visible.
93
What are the steps of constructing a histogram?
1 - define the bins and bin sizes (software may determine this) 2 - once defined, count how many observations fall into each interval 3 - plot
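Steps 1 and 2 can be sketched without any plotting library. The sketch assumes equal-width bins that include their left edge and exclude their right edge, which is a choice made here and not something stated on the card. The ages are invented.

```python
def histogram_counts(values, start, width, n_bins):
    """Count observations per bin; bin i covers [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for x in values:
        index = int((x - start) // width)
        if 0 <= index < n_bins:            # ignore values outside the binned range
            counts[index] += 1
    return counts


ages = [21, 22, 25, 31, 34, 35, 38, 41, 47, 58]    # hypothetical observations
narrow = histogram_counts(ages, start=20, width=10, n_bins=4)
wide = histogram_counts(ages, start=20, width=20, n_bins=2)
print(narrow, wide)
```

With the wider bins the peak in the 30s can no longer be told apart from the 20s, which is the bin-width effect described on the previous card.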
94
How is the mode represented in a histogram?
The mode is represented by a prominent peak in the distribution.
95
What can histograms show?
Histograms can show how many and what the modes of a distribution are. - Unimodal / bimodal / multimodal
96
Describe a right-skewed distribution.
When data trails off to the right ie observations are clustered on the left of the axis and there is a long tail to the right.
97
Describe a left-skewed distribution.
When data trails off to the left ie observations are clustered on the right of the axis and there is a long tail to the left.
98
If observations are clustered on the left of a histogram and there is a long tail to the right - what kind of skew is this?
Right-skewed
99
If observations are clustered on the right of a histogram and there is a long tail to the left - what kind of skew is this?
Left-skewed
100
How do you describe a dataset that shows roughly equal trailing off in both directions?
Symmetric
101
What is a symmetric distribution?
A dataset that shows roughly equal trailing off in both directions.
102
Why is it important to check if data is normally distributed?
A lot of statistical inference relies on data being normally distributed.
103
If the distribution of a dataset is symmetric, what measures should you use to describe the centre and spread?
Mean and standard deviation
104
What kind of distribution is best described by the mean and standard deviation?
Symmetric
105
If the distribution of a dataset is skewed, what measures should you use to describe the centre and spread?
Median and IQR - they are robust to outliers.
106
What is the relationship between the median, mean and mode of a symmetric distribution?
mean ~ median ~ mode
107
What is the relationship between the median, mean and mode of a right-skewed distribution?
mode < median < mean
108
What is the relationship between the median, mean and mode of a left-skewed distribution?
mean < median < mode
109
Why does mean ~ median ~ mode not hold for skewed data?
The mean is pulled in the direction of the tail, towards the extremes. The mode is pulled in the opposite direction (where the data is clustered)
110
If data is right-skewed, what kind of transformations can result in new samples which are less skewed?
y = sqrt(x); y = ln(x); y = -1/x. Listed in increasing order of the severity of skew they are suited to.
111
If data is left-skewed, what kind of transformations can result in new samples which are less skewed?
y = x^2; y = x^3. Listed in increasing order of the severity of skew they are suited to.
112
Why is the bin width choice important?
Depending on bin size, the story the graph tells can change. If the bin size is too wide, it may mislead you into thinking that the data is normally distributed.
113
What are the options to have on the y-axis of a histogram?
Absolute frequency or relative frequency (F/n)
114
What is the difference in shape of the relative histogram in relation to the absolute frequency histogram?
They have the same shape. The difference is the scale of the y-axis: in the relative frequency histogram the bar heights (F/n) add up to one.
115
How do you calculate the relative frequency?
The absolute frequency divided by the total number of observations
116
When do we use the relative frequency histogram over the absolute frequency histogram?
Use the relative frequency histogram when we want to investigate whether the proportion is less than or greater than a certain value. Ie we want to look at proportion rather than frequency.
117
How do you answer a question to determine how many people fall in a certain interval, if the bin widths are too big to answer this accurately?
We can't determine an exact answer with these bin widths; we can only estimate. To answer accurately we need a histogram with narrower bins.
118
If you keep changing the bin widths of a histogram to become smaller, what happens?
The histogram forms a smoother curve, approaching the density curve.
119
What is a density curve?
A density curve is a smoothed version of the relative frequency histogram. It is used for the visualisation of continuous variables or very large populations. It also represents a probability density function. The area under the curve is equal to 1.
120
What kind of variable is visualised in a density curve or probability density function?
A continuous variable.
121
What does the area under a density curve represent?
The area corresponds to measuring probabilities. The total area is equal to 1. Similar to the bars in a relative frequency diagram.
122
From a probability density curve, what is the probability that x = a particular value from the continuous distribution?
The probability that x is equal to some value from the continuous distribution is ALWAYS equal to 0. This happens because a single point on the density curve diagram has a width of 0 and therefore we can't obtain the area underneath the curve at a single point.
123
Which distribution is the most common?
The normal curve or normal distribution.
124
What are the properties of the normal distribution?
- It is unimodal and symmetric around its mean (a bell-shaped curve) - Mean, mode and median are equal - It is determined by two parameters (mu and sigma), usually denoted as N(mu, sigma) - The area under the normal curve is 1
125
What parameters determine the normal distribution?
Mu and sigma - N(mu, sigma)
126
What is the standard normal distribution?
A normal distribution where mu = 0 and sigma = 1, represented as N(0,1)
127
Which normal distribution is represented by N(mu = 0, sigma = 1)?
The standard normal distribution
128
We want our dataset to be normally distributed, but in practice our data comes from lots of different types of distributions, with lots of different influencing parameters. How do we account for this?
Transform our dataset onto the standard normal distribution. This enables us to refer to the standardised tables.
129
What parameters determine the shape of the normal distribution?
Mu (mean) - the centre of the curve; changing mu shifts the curve left / right. Sigma (standard deviation) - the width of the curve; changing sigma stretches or constricts the curve.
130
Which rule describes how many observations lie within different numbers of standard deviations from the mean in the normal distribution?
68-95-99.7 Rule - 68% of observations lie within 1 SD away from the mean in the normal distribution - 95% of observations lie within 2 SDs - 99.7% of observations lie within 3 SDs
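The three percentages can be checked numerically with Python's `statistics.NormalDist`, which provides the cumulative distribution function of a normal distribution. This is only a quick check on the standard normal N(0,1); the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()     # mu = 0, sigma = 1

# Share of the distribution within k standard deviations of the mean.
shares = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}
for k, share in shares.items():
    print(f"within {k} SD: {share:.4f}")
```

The results are roughly 0.683, 0.954 and 0.997, which the rule rounds to 68%, 95% and 99.7%.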
131
How many observations lie within 1/2/3 SDs away from the mean in the normal distribution?
68%, 95%, 99.7%
132
How do we analyse normally distributed data?
We should convert available observations into the standard deviation units and measure their distances from the mean. To perform this type of conversion we use the standardisation technique called Z-score.
133
What is a Z-score?
The Z-score of an observation is the number of standard deviations it falls above or below the mean. It is used to analyse normally distributed data.
134
How do we calculate the Z score?
For an observation x that follows the normal distribution N(mu, sigma): Z = (x - mu) / sigma. By calculating a Z-score we "convert" the data value from its normal distribution N(mu, sigma) to a value from the standard normal distribution N(0,1), in such a way that it keeps its position relative to the rest of the data.
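The Z-score formula is short enough to sketch directly. The two distributions and the observed values below are hypothetical, chosen only to show how observations on different scales become comparable.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma


# Two observations from two different, made-up normal distributions.
z_a = z_score(1800, mu=1500, sigma=300)
z_b = z_score(24, mu=21, sigma=5)
print(z_a, z_b)

# The observation with the larger absolute Z-score is the more unusual one.
more_unusual = "a" if abs(z_a) > abs(z_b) else "b"
```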
135
What does a Z score of 1 mean?
The observation is one standard deviation above the mean.
136
If an observation is 1.5 standard deviations below the mean, what is its Z-score?
z = -1.5.
137
What can comparing the Z-scores of two observations allow you to determine?
You can use Z-scores to roughly identify which observations are more unusual than others. The observation whose Z-score has the larger absolute value is more unusual - |z1| > |z2| means observation 1 is more unusual than observation 2.
138
How can you tell which of two observations is more unusual?
The more unusual observation will have the larger absolute Z-score, ie it will be more standard deviations away from the mean.
139
What does the magnitude and sign of the Z-score indicate?
Magnitude - the number of standard deviations away from the mean the observation is. Sign - whether the observation is above (positive) or below (negative) the mean.
140
If a random variable X~N(mu, sigma), then what can we say about the random variable Z = (x-mu)/sigma?
Z ~ N(0,1), ie the transformed variable follows the standard normal distribution.
141
How do we calculate percentiles for a N(mu, sigma) distribution?
We transform it to the standard normal distribution (Z scores) and use the N(0,1) percentiles, which are listed in a normal probability table to determine the percentile based on the Z score.
142
What are the steps to follow for solving normal probability problems?
1 – draw and label a picture of the normal distribution (doesn’t need to be exact) 2 – shade in the region of interest 3 – calculate the Z-score of the cutoff value 4 – look up the percentile for the Z-score in the normal probability table 5 – do you need to subtract from 1? Always verify that the final answer makes sense with the picture you drew.
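Steps 3 to 5 can be sketched in code, with `statistics.NormalDist().cdf` standing in for the normal probability table lookup. The distribution N(1500, 300) and the cutoff of 1800 are hypothetical values picked for illustration.

```python
from statistics import NormalDist

mu, sigma = 1500, 300      # hypothetical N(mu, sigma)
cutoff = 1800              # hypothetical cutoff value

z = (cutoff - mu) / sigma              # step 3: Z-score of the cutoff
below = NormalDist().cdf(z)            # step 4: percentile, replaces the table lookup
above = 1 - below                      # step 5: subtract from 1 for the upper tail
print(round(z, 2), round(below, 4), round(above, 4))
```

Whether `below` or `above` is the answer depends on which region was shaded in step 2.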
143
What is the textbook definition of a Z-score?
Z-score is a statistical measurement that describes a value's relationship to the mean of a group of values. Z-score is measured in terms of standard deviations from the mean.
144
What are the two types of categorical variables?
Nominal categorical variables - have no implied order. Ordinal categorical variables - have a natural ordering.
145
What are the two most popular approaches to evaluating whether sample data follows the normal distribution or not?
- Statistical tests - Visualisation techniques Ideal to use both
146
What are some statistical approaches of evaluating whether a given sample of data follows the normal distribution?
- Shapiro-Wilk test - Kolmogorov – Smirnov test - Anderson – Darling test, etc.
147
What is a drawback of relying on statistical tests for evaluating if data follows the normal distribution?
Statistical tests are very sensitive to the presence of outliers. If a certain number of outliers are present in a normally distributed data set, statistical tests may report that the data set is not drawn from a normal distribution. Visualisation techniques may help overcome this problem.
148
What are two visualisation techniques for normality assessment?
- Histograms with the best fitting normal curve overlaid on the plot - The normal probability plot (quantile-quantile plot or QQ plot)
149
What is the quantile-quantile (QQ) plot a synonym for?
The normal probability plot.
150
What does a histogram with the best fitting normal curve overlaid on the plot show?
This is used to visualise normality assessment. The sample mean and SD are used as the parameters for the best fitting normal curve. The closer the curve is to the histogram, the more reasonable the normal model assumption is.
151
What does the normal probability plot show?
This is used to visualise normality assessment. Data are plotted on the y-axis of the plot and theoretical quantiles (following normal distribution) are plotted on the x-axis. The closer the points are to a perfect straight line, the more confident we can be that the data follow the normal model.
152
When using a histogram with the best fitting normal curve overlaid or the normal probability plot to visualise normality, how does sample size affect this?
A smaller sample size will show more variability around the curve. A larger sample size increases the confidence.
153
When using a histogram with the best fitting normal curve overlaid to visualise normality, what does a curve close to the histogram represent?
A curve closer to the histogram means it is more reasonable to assume the data is normally distributed.
154
How is right-skew observed on a normal probability plot?
Points bend up and to the left of the line.
155
How is left-skew observed on a normal probability plot?
Points bend down and to the right of the line.
156
If visualisations show that the data isn't normally distributed what should you do?
Perform further analysis, eg different visualisations or investigating if and why there are outliers.
157
How are short tails visualised on a normal probability plot?
Short tails (narrower than the normal distribution) - points follow an S-shaped curve.
158
How are long tails visualised on a normal probability plot?
Long tails (wider than the normal distribution) - points start below the line, bend to follow it, and end above it.
159
What is the purpose of statistical inference?
To draw conclusions about and assess population parameters for a specific population based on a sample of data taken from that population.
160
Why do we use sample statistics?
Sample statistics (mean, proportions etc) are used as point estimates for the unknown population parameters of interest, as it is difficult (or impossible) to collect data from the complete population.
161
What is a point estimate?
In statistics, a point estimate is a single value that is calculated from sample data to estimate an unknown population parameter. It is a "best guess" or "best estimate" of the population parameter. Point estimates generally vary from one sample to another, and this sampling variation suggests our estimates may be close to, but not exactly equal to, the true population parameter.
162
What does variation in point estimates between different samples suggest?
This sampling variation suggests that the estimate is not exactly equal to the true population parameter.
163
What does the sampling distribution represent?
The distribution of point estimates based on samples of a fixed size from a certain population.
164
What are the parameters of a sampling distribution?
The central "balance" point of a sampling distribution is its mean. The standard deviation of a sampling distribution is referred to as a standard error.
165
What is the standard error?
The standard deviation of a sampling distribution. Reflects the fact that probabilities are no longer tied to raw measurements/observations, but rather to a quantity calculated from a sample of such observations. The standard error of an estimate describes how far the point estimate is from the true population parameter eg how far the typical estimate is away from the actual population mean.
166
What is the difference between the standard deviation and the standard error?
The standard deviation measures the variability of individual data points inside the sample. The standard error measures how far a typical point estimate is from the population parameter.
167
What is the central limit theorem?
If a sample consists of at least 30 independent observations and the data are not strongly skewed, then the distribution of the sample mean is approximated well by the normal distribution. The central limit theorem says that the sampling distribution of the mean will be approximately normal, as long as the sample size is large enough. Regardless of whether the population has a normal, Poisson, binomial, or any other distribution (with finite variance), the sampling distribution of the mean will be approximately normal.
168
What is the normal model which approximates the distribution of the sample mean? For both known standard deviation of the population and unknown.
[See flashcard]
169
What conditions need to be met for the central limit theorem (CLT) to apply (to the mean)?
Independence - sample observations must be independent. Sample size/skew - either the population distribution is normal, or if the population distribution is skewed, the sample size is large.
170
For the central limit theorem (CLT), independence is difficult to verify, but it is more likely if:
- Random sampling / assignment is used
- If sampling without replacement, n is less than 10% of the population
171
To apply the central limit theorem (CLT), if the population distribution is skewed, what do we need to ensure?
The more skewed the population distribution, the larger the sample size we need in order to apply the CLT. For moderately skewed distributions, n > 30 is a widely used rule of thumb.
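The effect of sample size on a skewed population can be explored with a small simulation. This is a sketch only: the exponential population (mean 1), the number of repetitions and the seed are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def sample_means(n, reps=2000):
    """Means of `reps` samples of size n drawn from a right-skewed exponential population."""
    return [
        statistics.fmean(random.expovariate(1.0) for _ in range(n))
        for _ in range(reps)
    ]


for n in (2, 30, 200):
    means = sample_means(n)
    # The centre stays near the population mean (1.0) while the spread shrinks with n.
    print(n, round(statistics.fmean(means), 3), round(statistics.stdev(means), 3))
```

Plotting a histogram of `means` for each n would show the shape moving from skewed towards a bell curve as n increases.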
172
For the central limit theorem (CLT), it is difficult to verify the skew / if it is normally distributed, but how can we check it?
We can check it using the sample data and assume that the sample mirrors the population.
173
What conditions need to be met for the central limit theorem (CLT) to apply (for the single proportion)?
- Independence - sampled observations must be independent.
- Sample size / skew - at least 10 success and 10 failure observations, eg for the marathon example, at least 10 who ran < 2 hours and 10 who ran > 2 hours.
174
If you report a single point estimate, such as the sample mean, are you likely to capture the exact population parameter, eg the population mean?
It is very likely that we will not capture the exact population parameter. Instead, if we report a range of the plausible values, we have a good chance to capture a true population parameter. A plausible range of values for the population parameter is called a confidence interval.
175
What is a confidence interval?
A plausible range of values for the population parameter. Confidence intervals may be constructed in different ways, depending on the type of statistic and therefore the shape of the corresponding sampling distribution.
176
For symmetrically distributed sample statistics, like those involving means and proportions, what is the general formula of the confidence interval?
[See flashcard]
177
What is Z* and what does it depend on?
Z* is the critical value and can have a different value depending on the confidence level.
178
What is Z* x SE in the confidence interval?
The margin of error. For a given sample the margin of error changes as the confidence level changes.
179
What is the formula for the margin of error?
Z* x SE
180
How do we adjust the confidence level?
Adjust Z* in the formula
181
What are the two most commonly used confidence intervals in practice?
95% confidence interval: Z* = 1.96. 99% confidence interval: Z* = 2.58.
182
How do we find Z* values to use in the calculation of confidence intervals?
Use the normal Z-table, eg to find the Z* needed to be 96% confident.
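As an alternative to reading the table, the critical value can be computed directly. The sketch below uses Python's standard-library `statistics.NormalDist`; the helper name `z_star` is hypothetical.

```python
from statistics import NormalDist


def z_star(confidence):
    """Two-sided critical value that leaves (1 - confidence) / 2 in each tail."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


for level in (0.95, 0.96, 0.99):
    print(level, round(z_star(level), 2))
```

For a 96% confidence level the upper tail area is 0.02, so the lookup is the 0.98 quantile of the standard normal distribution.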
183
If we want to be more certain that we will capture the population parameter, how does our confidence interval need to change?
The confidence interval needs to increase ie become wider. This will increase our confidence level. Too wide an interval may not be very informative.
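The widening can be seen by computing intervals at two confidence levels for the same sample. This sketch assumes the general form point estimate ± Z* x SE with the normal model and SE = s / sqrt(n); the function name and the data are hypothetical.

```python
import math
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, confidence=0.95):
    """Return (lower, upper) = sample mean -/+ z* x SE under the normal model."""
    n = len(sample)
    point_estimate = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * se  # margin of error
    return point_estimate - margin, point_estimate + margin


# Hypothetical sample of 36 values (n >= 30, so the normal model is a reasonable assumption).
data = [50 + (i % 7) for i in range(36)]
lo95, hi95 = mean_confidence_interval(data, 0.95)
lo99, hi99 = mean_confidence_interval(data, 0.99)
print(f"95%: [{lo95:.2f}, {hi95:.2f}]  99%: [{lo99:.2f}, {hi99:.2f}]")
```

The 99% interval contains the 95% interval: only Z* changes, so only the margin of error grows.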
184
Why may too wide of a confidence interval be a problem?
It may not be very informative eg weather example.
185
How should we interpret a confidence interval [l,u]?
We are XY% (eg 95%) confident that the true population parameter is between the lower bound (l) and upper bound (u) of our confidence interval.
186
What do confidence intervals attempt to do?
Confidence intervals try to capture the population parameter - they say nothing about the confidence of capturing individual observations, a proportion of observations or about capturing point estimates.
187
What are the steps for significance tests / statistical inference?
1 - formulation of the practical problem in terms of statistical hypotheses
2 - construction of a test statistic
3 - description of a critical region and/or the calculation of the p-value
4 - significance level or size of the test
5 - further assessment
188
What is the null hypothesis?
The null hypothesis H0 represents what we currently hold as true. H0 is basically a standard with which the evidence for HA can be compared.
- One-sample: there is no difference from our previous knowledge (maintenance of status quo).
- Two-sample: there is no difference between the populations being compared.
189
What is the alternative hypothesis?
HA represents what we want to test. It expresses the range of situations that we wish the test to be able to diagnose. Depending upon the outcome of the test we may take action.
190
What language is used to summarise the outcome of the null hypothesis?
Language - is there enough evidence to reject the null hypothesis (we never accept it)? Either "H0 is rejected in favour of HA" or "There is insufficient evidence to reject H0 in favour of HA".
191
What is a test statistic?
The test statistic is a number calculated from a statistical test of a hypothesis. It shows how closely your observed data match the distribution expected under the null hypothesis of that statistical test. The test statistic is used to calculate the p value of your results, helping to decide whether to reject your null hypothesis. It is a function of the data plus the information in the hypothesis H0.
192
What properties should a test statistic satisfy?
1 - its probability distribution must be calculable (at least approximately) under the assumption that H0 is true
2 - it should behave differently when H0 is true from when HA is true
193
What is the critical region?
A region of values of the test statistic t which support our preference for HA rather than H0
194
If the calculated value of t (calculated under the assumption that H0 is true) falls in a suitable critical region, what do we do?
We reject H0 in favour of HA. Otherwise, we are unable to reject H0 in favour of HA.
195
If the calculated value of t (calculated under the assumption that H0 is true) does not fall in a suitable critical region, what do we do?
We are unable to reject H0 in favour of HA.
196
How are tests constructed in significance tests?
So that the lack of information, particularly too little data, tends to result in non-critical values of the test statistic. Hence, it is unwise to talk positively about "accepting H0". Lack of strong evidence to reject H0 in favour of HA may indicate that we have not collected enough data to reject it.
197
Why do we not "accept H0".
Lack of information, particularly too little data, tends to result in non-critical values of the test statistic. Lack of strong evidence to reject H0 in favour of HA may indicate that we have not collected enough data to reject it.
198
What is the p-value?
A p-value, or probability value, is the probability of obtaining data at least as extreme as the observed data, assuming the null hypothesis of a statistical test is true. The p-value quantifies the strength of the evidence against the null hypothesis H0 and in favour of the alternative hypothesis HA.
199
What results in a small p-value?
Either H0 is true and an improbable event has occurred, or HA is true.
200
How do you interpret the p-value?
If the p-value is small, H0 is rejected in favour of HA. If the p-value is not "small", the evidence does not support the rejection of H0 in favour of HA.
201
What are two approaches to investigating the null hypothesis?
1 - Calculate the p-value. 2 - Investigate the test statistic and the critical region.
202
In making a test of H0 against HA we can make two kinds of error - what are they?
Type 1 - false positive. H0 is rejected when in fact it is true. Type 2 - false negative. H0 is not rejected when in fact it is false.
203
If conducting a serious clinical trial would you want the significance level alpha to be larger or smaller?
Would choose a smaller significance level - we would rather have 1 in 100 errors than 5 in 100 errors.
204
What is the significance level alpha?
The significance level alpha is the threshold below which a p-value is treated as "small". It is the probability of rejecting H0 when in fact it is true, ie committing a Type 1 error.
205
What influences the choice of the significance level alpha?
It depends on the particular problem and how serious it is if a true H0 is rejected (false positive), eg medical trials.
206
What does a significance level (alpha) of 0.05 mean?
When H0 is true, we accept a 5% chance of incorrectly rejecting it, ie about 5 incorrect rejections in every 100 tests of a true H0.
207
What are the basic conclusions when working with a significance level of 5%?
- p <= 5% (p <= 0.05) - the test is significant at the 5% level and H0 is rejected in favour of HA
- p > 5% (p > 0.05) - the test is not significant at the 5% level and H0 is not rejected in favour of HA
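The decision rule can be sketched in code. This is a minimal example assuming a two-sided test whose statistic follows the standard normal distribution under H0; the helper names `two_sided_p_value` and `decide` are hypothetical.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """P(|Z| >= |z|) when Z is standard normal under H0."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def decide(p_value, alpha=0.05):
    """Apply the basic conclusion at significance level alpha."""
    if p_value <= alpha:
        return "H0 is rejected in favour of HA"
    return "insufficient evidence to reject H0 in favour of HA"


for z in (0.8, 1.96, 3.1):
    p = two_sided_p_value(z)
    print(f"z = {z}: p = {p:.4f} -> {decide(p)}")
```

Note that the wording never "accepts" H0; the second outcome only reports insufficient evidence.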
208
What are the further assessments that can be made when working with a significance level of 5%
- p > 10% - there is no (or very little) evidence for rejecting H0 in favour of HA
- 5% < p <= 10% - on the available evidence, we cannot reject H0 in favour of HA but we have some suspicion (ie we would like to obtain more evidence), eg you didn't reject the null due to a small dataset
- 1% < p <= 5% - significant at the 5% level and H0 is rejected in favour of HA. If the decision to change is important, we should probably seek further evidence
- 0.1% < p <= 1% - highly significant at the 5% level. There is considerable evidence for rejection of H0 in favour of HA
- p <= 0.1% - very highly significant at the 5% level. We are very confident that HA is to be preferred to H0
209
When testing a single mean using the Z statistic, what are H0 and HA? How are the test statistic and confidence interval calculated?
[See flashcard]
210
When testing the comparison of two means using the Z statistic, what are H0 and HA? How are the test statistic and confidence interval calculated?
[See flashcard]
211
For testing a single proportion, what is the formula for the test statistic Z and the confidence interval?
[See flashcard]
212
For testing the comparison of two proportions, what is the formula for the test statistic Z and the confidence interval?
[See flashcard]
213
What is a t-distribution?
The t-distribution, also known as the Student’s t-distribution, is a statistical function that creates a probability distribution. The t-distribution is similar to the normal distribution, with its bell shape, but it has heavier tails. It is used for estimating population parameters for small sample sizes or unknown variances. T-distributions have a greater chance for extreme values than normal distributions, and as a result have fatter tails.
214
How does the shape of the t-distribution compare to the standard normal distribution?
They are both bell-shaped curves centred at 0. The t-distribution has fatter tails, meaning observations are more likely to fall further away from the mean (over 2 SDs from the mean). The thicker tails are helpful for resolving our problem with a less reliable estimate of the standard error (since n is small).
215
What are the conditions for the t-distribution?
When the population SD is unknown and we have a small data sample (n<30) we address the uncertainty of the standard error using the t distribution.
216
What influences the shape of the t-distribution.
It is centred at zero and influenced by one parameter, the degrees of freedom (df). The larger the degrees of freedom, the more closely the t-distribution resembles the standard normal model. When df >= 30, it is nearly indistinguishable from the normal distribution.
217
What are degrees of freedom?
Degrees of freedom are the maximum number of logically independent values, which may vary in a data sample. For a single-sample t-test, degrees of freedom are calculated by subtracting one from the number of items within the data sample (df = n - 1).
218
What is the cut off value of n for the t-distribution and why?
n < 30 - for n >= 30, the t-distribution and the normal distribution are nearly indistinguishable
219
Describe the t-table
A t table is a reference statistical table that contains critical values of the t distribution, also known as the t score or t value. Each row represents a t-distribution with different degrees of freedom. The columns correspond to tail probabilities.
220
What are the formulas for obtaining the t-statistic and confidence intervals for a single mean?
[See flashcard]
221
What is a paired comparison t-test?
The Paired Samples t Test compares the means of two measurements taken from the same individual, object, or related units. Each subject has two observations.
222
What are the formulas for obtaining the t-statistic and confidence intervals for a paired comparison?
[See flashcard]
223
What is the formula for the test based on the t-distribution for comparison of two means - independent samples?
[See flashcard]
224
What is an assumption made in the formula for the test based on the t-distribution for comparison of two means - independent samples?
Use the pooled variance in the calculations
224
What is the formula for S-pooled in the test based on the t-distribution for comparison of two means - independent samples?
[See flashcard]
225
What is the chi squared test?
Goodness-of-fit test for classified data. The distribution of a categorical variable in a sample often needs to be compared with an expected distribution or with the distribution of that variable in another sample. A chi-squared test is a statistical hypothesis test used in the analysis of contingency tables when the sample sizes are large. In simpler terms, it is primarily used to examine whether two categorical variables are independent of each other.
226
What is a difference of the chi squared test from the test of proportions?
In chi squared tests we don't assume normal distribution.
227
In a chi squared test, observations are classified into classes. What is the condition for this?
Each observation is classified into k mutually exclusive and exhaustive classes ie each observation belongs to one and only one class.
228
Where does the critical region of the chi squared distribution lie and why?
The critical region lies in the right hand tail only. This is because, if H0 is not true, we would expect the Eis to be quite different from the Ois, resulting in a larger than expected phi squared value. Small phi squared results when Eis and Ois are in good agreement - we wouldn't want to reject H0 in this case.
229
What is the difference between the distribution of phi squared and chi squared?
The exact distribution of phi squared is discrete and is approximated by the continuous chi squared distribution.
- For this approximation to be reasonable, Ei should be > 5 for each class
- If not, combine adjacent classes with resultant loss of one or more degrees of freedom
230
To approximate the phi squared distribution (discrete) as the chi squared distribution (continuous) to be reasonable, what needs to be in place?
Ei should be > 5 for each class. If not, combine adjacent classes with the resultant loss of one or more degrees of freedom.
231
In the chi squared test, if the number of observations of Ei is not >5, what must you do?
Combine adjacent classes with the resultant loss of one or more degrees of freedom.
232
What is the formula for the test statistic phi in the chi squared test?
[See flashcard]
233
What adjustment do you have to make in the test statistic phi in the chi squared test if there is only 1 degree of freedom?
Yates' continuity correction - take the magnitude (absolute value) of each difference Oi - Ei and subtract 1/2 before squaring [See flashcard]
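The flashcard formulas are not reproduced in this text, so the sketch below assumes the standard Pearson form, the sum over classes of (Oi - Ei)^2 / Ei, with the usual Yates variant (|Oi - Ei| - 1/2)^2 / Ei. The function name and the counts are hypothetical.

```python
def chi_squared_statistic(observed, expected, yates=False):
    """Sum of (O - E)^2 / E over classes; with yates=True use (|O - E| - 0.5)^2 / E."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of classes")
    total = 0.0
    for o, e in zip(observed, expected):
        diff = abs(o - e)
        if yates:
            diff -= 0.5  # continuity correction, intended for 1 degree of freedom
        total += diff ** 2 / e
    return total


# Hypothetical die example: 60 rolls, so each face is expected 10 times under H0.
print(chi_squared_statistic([8, 12, 9, 11, 10, 10], [10] * 6))

# Hypothetical two-class example (1 degree of freedom), with and without the correction.
print(chi_squared_statistic([60, 40], [50, 50]))
print(chi_squared_statistic([60, 40], [50, 50], yates=True))
```

The corrected statistic is smaller than the uncorrected one, which makes the test slightly more conservative.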
234
What parameters influence the chi squared distribution?
The chi squared distribution has just one parameter called the degrees of freedom (df) which influence the shape, centre and spread of the distribution.
235
How does changing the degrees of freedom influence the shape of the chi squared distribution?
Higher degrees of freedom – the distribution shifts to the right and becomes flatter
236
How do the t-table and chi-square table differ?
One important difference from the t-table is that the chi-square table only provides upper tail values
237
What is an ANOVA test?
ANOVA, or Analysis of Variance, is a test used to determine differences between results from three or more unrelated samples or groups. ANOVA is used to assess whether the mean of the outcome variable is different for different levels of a categorical variable.
238
How do we compare the means of 2 groups? How do we compare the means for 3 groups?
- 2 groups: a Z or a T statistic
- 3 groups: Analysis of Variance (ANOVA) and a new statistic called F
239
What are the conditions to be met for ANOVA?
1 - The observations should be independent within and between groups. If the data are a simple random sample from less than 10% of the population, the condition is satisfied. Eg no pairing
2 - The observations within each group should be nearly normal (important when sample sizes are small)
3 - The variability across the groups should be about equal (especially important when the sample sizes differ between groups)
240
What test statistic is used for ANOVA?
F statistic
241
What is the purpose of carrying out statistical tests to compare statistics?
Compare to see whether they are so far apart that the observed difference cannot reasonably be attributed to sampling variability.
242
With only two groups, how do the t-test and ANOVA compare?
They are equivalent, but only if we use a pooled variance estimate in the denominator of the test statistic.
243
With more than two groups, what does ANOVA compare the sample mean to?
An overall grand mean
244
What is the formula for the F statistic?
F = variability between sample groups / variability within sample groups
245
In order to reject H0, what size does the F statistic need to be?
A large F statistic is needed for the p-value to be small to reject the H0. A large F statistic means the variability between sample groups is greater than the variability within sample groups.
246
What are the different degrees of freedom associated with the ANOVA table?
- Group: dfG = k - 1
- Total: dfT = n - 1
- Error: dfE = dfT - dfG, ie the difference between the total and the group degrees of freedom
247
What are the different sum squares columns in the ANOVA table and how are they calculated?
- SSG - sum of squares between groups, measures the variability between the groups [see flashcard]
- SST - sum of squares total, measures the total variability in the dataset [see flashcard]
- SSE - sum of squares error, measures variability within groups. SSE = SST - SSG
247
How do you compare the F value to the F tables / probability value?
From the F-tables, find the critical value F* in the column for dfG and the row for dfE. If F > F*, the statistic is in the critical region, so the result is significant and at least one group mean is different. The p-value can also be computed. A larger F value corresponds to a smaller p-value, so if F > F* then p < alpha (eg 0.05).
248
What is the Mean Sq column in the ANOVA table?
The mean square. Calculated for the group and error rows as sum of squares / degrees of freedom, ie MSG = SSG / dfG and MSE = SSE / dfE.
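The ANOVA table columns can be assembled step by step. The SSG and SST formulas are on the flashcards and are not reproduced in this text, so this sketch assumes the standard definitions: SSG is the sum over groups of n_i x (group mean - grand mean)^2, and SST is the sum over all observations of (x - grand mean)^2. The function name and data are hypothetical.

```python
import statistics


def anova_table(groups):
    """Build the one-way ANOVA quantities from a list of groups of observations."""
    all_values = [x for g in groups for x in g]
    n, k = len(all_values), len(groups)
    grand_mean = statistics.fmean(all_values)

    ssg = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    sst = sum((x - grand_mean) ** 2 for x in all_values)
    sse = sst - ssg

    df_g, df_t = k - 1, n - 1
    df_e = df_t - df_g

    msg, mse = ssg / df_g, sse / df_e  # mean squares = sum of squares / df
    return {"dfG": df_g, "dfE": df_e, "dfT": df_t,
            "SSG": ssg, "SSE": sse, "SST": sst,
            "MSG": msg, "MSE": mse, "F": msg / mse}


# Hypothetical data: three groups of three observations each.
table = anova_table([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
for name, value in table.items():
    print(name, round(value, 3))
```

The resulting F would then be compared with the critical value F* for (dfG, dfE) from the F-tables.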
249
What adjustments do we make to the t-test following an ANOVA?
Use common variance (MSE from the ANOVA table) instead of each group's variances in the calculation of the SE. Use common degrees of freedom (dfE from the ANOVA table). Use a modified significance level, this resolves the issue of increasing the type I error rate if we run too many tests (false positives).
250
What is the scenario of testing many pairs of groups with a t test called?
Multiple comparisons
251
What significance level adjustment is used for the post-ANOVA t-test?
The Bonferroni correction, which is a more stringent significance level: alpha* = alpha / K, where K is the number of comparisons being considered. For k groups, K = k(k-1) / 2.
252
How do you calculate the significance level of the Bonferroni correction?
alpha* = alpha / K, where K is the number of comparisons being considered. For k groups, K = k(k-1) / 2.
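A short worked example of the correction, using the formulas on this card. The function name `bonferroni_alpha` is hypothetical.

```python
def bonferroni_alpha(alpha, k):
    """Corrected significance level for all pairwise comparisons among k groups."""
    if k < 2:
        raise ValueError("need at least two groups to compare")
    comparisons = k * (k - 1) // 2  # K = k(k-1)/2 pairs of groups
    return alpha / comparisons


for k in (3, 4, 5):
    print(k, k * (k - 1) // 2, round(bonferroni_alpha(0.05, k), 5))
```

With 4 groups there are 6 pairwise comparisons, so each t-test is judged against 0.05 / 6 rather than 0.05.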
253
What is the formula for the standard error of the differences in two means after ANOVA?
[see flashcard]
254
What is linear regression?
Linear regression is a statistical technique that can be used for prediction and for evaluating whether there is a linear relationship between two numerical variables x and y. Linear regression assumes that the relationship between two variables can be modelled by a straight line.
255
Linear regression assumes the relationship between two variables can be modelled by what straight line?
y = B0 + B1x
- x - predictor variable (explanatory variable, independent variable)
- y - response variable (dependent variable)
- B0 - intercept (expected value of the response variable when the predictor is 0)
- B1 - slope parameter (the change in the mean response for each one-unit increase in the predictor)
256
What does it mean if the slope of the linear regression model line is 0?
The predictor x has no effect on the value of the response y
257
How are the parameters B0 and B1 of the linear regression model estimated?
Using data - these are point estimates b0 and b1
258
What does y hat represent in the linear regression model determined from point estimates?
y_hat indicates it is a collection of estimated (predicted) observations of observed variable y, based on the input collection of predictor observations x
258
How do we rewrite the linear regression model using the point estimate from the data?
y_hat = b0 + b1x
259
What are the differences between observed and estimated values termed in the linear regression model?
Residuals (epsilon). There is one residual per observation, so n data points give n residuals.
260
What are residuals (epsilon) in the linear regression model?
The differences between the observed and estimated values.
261
What is the residual of the i-th observation (xi, yi)
The difference of the observed response (yi) and the response we would predict based on the model fit (y_hati) Ei = yi - y_hati
262
When the regression line represents a good approximation of our dataset, what happens to the residuals?
The residuals are small. The best fitting regression line is the line that has the smallest possible residuals. A poorly fitting regression line has large residuals.
263
What is one of the most common approaches of finding the line with the smallest possible residuals?
Ordinary least squares regression (OLS)
264
What is the goal of OLS?
OLS - ordinary least squares regression. The goal is to find the line that minimises the least squares criterion, ie minimises the sum of the squared residuals [see flashcard]. The line that minimises this criterion is usually called the least squares line.
265
What is the line that minimises the least squares criterion?
The least squares line
266
How do you find the least squares line?
[see flashcard]
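The flashcard formula is not reproduced in this text. As a sketch, the code below assumes the usual closed-form solution for simple linear regression: b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2) and b0 = y_bar - b1 * x_bar. The function name and data are hypothetical.

```python
import statistics


def least_squares_line(xs, ys):
    """Return point estimates (b0, b1) of the intercept and slope."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical observations that roughly follow a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares_line(xs, ys)
y_hat = [b0 + b1 * x for x in xs]                # predicted responses
residuals = [y - yh for y, yh in zip(ys, y_hat)]  # observed minus predicted

print(f"y_hat = {b0:.3f} + {b1:.3f}x")
print([round(e, 3) for e in residuals])
```

A property of the least squares line with an intercept is that its residuals sum to zero, which makes a handy sanity check.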
267
What is the assumption of linearity for the least squares line?
The data should show a linear trend. If there is a nonlinear trend, an advanced regression method should be applied.
267
When fitting a least squares line, what conditions need to be met?
- Linearity - Nearly normal residuals - Constant variability
267
What can we do once we have a formula of the least squares line?
We can use input values of x to get predicted values y_hat. With a fitted simple linear model, you're able to calculate a point estimate y_hati of the mean response at xi.
268
What is the assumption of nearly normal residuals for the least squares line?
Generally, the residuals must be nearly normal. When this condition is found to be unreasonable, it is usually because of outliers or concerns about influential points. As a rough check, the residuals should be scattered symmetrically around 0 without extreme values.
269
What is the assumption of constant variability for the least squares line?
The variability of the points around the least squares line remains roughly constant
270
What should we do when we have estimated the regression coefficient b0 and b1?
We want to determine how good our model is. One approach is to use the coefficient of determination R^2. R^2 describes the proportion of the variation in the response that can be attributed to the predictor, ie is explained by the least squares line. Formula [see flashcard]. If we can calculate how much of the variability in the response is left in the residuals, we can calculate how much is explained by the model.
271
What is the coefficient of determination?
We want to determine how good our model is. One approach is using the coefficient of determination R^2. R^2 describes the proportion of the variation in the response that can be attributed to the predictor ie is explained by the least squares line.
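The flashcard formula is not reproduced in this text; a common form, assumed in this sketch, is R^2 = 1 - SS_residual / SS_total. The function name and the values are hypothetical.

```python
import statistics


def r_squared(observed, predicted):
    """Proportion of the variation in the response explained by the fitted line."""
    y_bar = statistics.fmean(observed)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    ss_tot = sum((y - y_bar) ** 2 for y in observed)
    return 1 - ss_res / ss_tot


# Hypothetical observed responses and the predictions of a fitted line.
observed = [2.1, 3.9, 6.2, 7.8, 10.1]
predicted = [2.04, 4.03, 6.02, 8.01, 10.0]
print(round(r_squared(observed, predicted), 4))
```

Perfect predictions give R^2 = 1, while a model that always predicts the mean of the response gives R^2 = 0.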
272
What is one of the first steps of data analysis?
Descriptive analysis - this helps to understand how the data is distributed and provides important information for further steps.
273
How does R match the input values to the function arguments?
By position or by name
274
What is data science?
Turning raw data into understanding, insight and knowledge
275
What is a variable?
A quantity, quality or property that you can measure. (values may vary from measurement to measurement)
276
What are synonyms of "variable"?
- Table column - Field - Attribute - Property - Feature - Vector - Dimension
277
What are the two basic types of variable?
Numeric Categorical
278
What are numeric variables?
Variables whose values are recorded as numbers (integer or real values)
279
What are categorical variables?
Variables whose values are recorded as symbols. Eg - gender Eg - countries
280
What are the types of numeric variables?
Discrete - numeric variables that may only take on certain (distinct) numeric values. Usually obtained by counting, eg people in a class. Synonyms: integer, count. Continuous - numeric variables that may take any real value in some interval. Synonyms: float, double, interval, numeric
281
What are discrete variables?
Discrete - numeric variables that may only take on certain (distinct) numeric values. Usually obtained by counting, eg people in a class. Synonyms: integer, count.
282
What are continuous variables?
Continuous - numeric variables that may take any real value in some interval. Synonyms: float, double, interval, numeric
283
What are the two types of categorical variables?
Ordinal - categorical variables whose values can be naturally ranked (eg education levels, driving speed categories). Nominal - categorical variables whose values cannot be naturally ranked (eg eye colour, gender)
284
What are ordinal variables?
Ordinal - categorical variables whose values can be naturally ranked (eg education levels, driving speed categories).
285
What are nominal variables?
Nominal - categorical variables whose values cannot be naturally ranked (eg eye colour, gender)
286
What is a dataset?
How we store collections of variables
287
What are the different types of datasets?
- Univariate dataset - a dataset consisting of measurements that correspond to a single variable
- Multivariate dataset - a dataset consisting of measurements that correspond to two or more variables. Most relevant when individual components aren't as useful when considered on their own, eg spatial coordinates. Allows us to think about two or more variables together
Corresponding data analysis:
- Univariate data analysis - the analysis performed on a single variable
- Multivariate data analysis - the simultaneous analysis of two or more variables
288
What are observations?
Measurements made under similar conditions
289
What is a tabular dataset?
A set of values, each associated with a variable and an observation. Variables are table columns. Observations are table rows.
290
What is tidy tabular data?
Tabular data - a set of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own "cell" - each variable in its own column, each observation in its own row.
291
What is the size of a dataset?
Defined by the number of observations (rows) in the table
292
What is the dimensionality of a dataset?
Defined by the number of variables (columns) in the table
293
How do we describe a dataset?
Size - observations (row) Dimensionality - variables (columns)
294
What is the population?
The (usually) large pool of observational units that we are interested in.
295
What is a sample?
A smaller collection of observational units selected from the population.
296
What is sampling?
Sampling refers to the process of selecting observations from a population. Common strategies: simple random sampling, stratified sampling, cluster sampling, multistage sampling.
297
What are the four common sampling strategies covered in this module?
- Simple random sampling - Stratified sampling - Cluster sampling - Multistage sampling
298
Why do we sample?
It doesn't make sense to collect data for the whole population - it is probably impossible to collect and calculate the actual population mean so we need a sample.
299
Define a representative sample.
A sample is said to be a representative sample if the characteristics of the observational units selected are a good approximation of the characteristics of the original population. Meal analogy.
300
What is bias?
Bias corresponds to a favouring of one group in a population over another group
301
Define generalisability.
Generalisability refers to the largest group about which it makes sense to make inferences from the sample collected. This is directly related to how the sample was selected.
302
What are parameters and statistics?
Parameters and statistics are calculations based on the population and sample respectively. The difference is shown in the notation used:
- Population - parameter - Greek letters
- Sample - statistic - Latin letters
303
What is a parameter?
A calculation based on one or more variables measured in the population. Denoted by Greek letters.
304
What is a statistic?
A calculation based on one or more variables measured in the sample. Denoted by lower case Latin letters (sometimes in combination with other symbols)
305
Describe Simple Random Sampling.
A sampling strategy where the individuals are selected from the list of units in the population, by means of some random process, in such a way that each individual has an equal chance of being selected. Eg random number tables or pseudo-random number generators. Selection can be performed sequentially (one at a time without replacement, so that at each stage the remaining individuals in the population have the same probability of being selected).
306
What is sequential selection?
In simple random sampling, selection can be performed sequentially. Individuals can be selected from the population one at a time without replacement, so that at each stage the remaining individuals in the population have the same probability of being selected.
307
Why is selection with replacement less common in practice?
There is usually an assumption that all observations are independent of each other - replacing them would lose this.
308
What is stratified sampling?
Stratified sampling is a divide-and-conquer sampling strategy. The population is divided into groups called strata. The sample of individuals is then drawn from each stratum using some other random sampling process, usually simple random sampling. Strata are chosen so that units in each stratum are as alike as possible and units in different strata are as different as possible. This sampling strategy is used in cases when it is known that the population is heterogeneous with respect to one or more variables which may have a bearing on the factor being studied. Eg if there was a difference in height by gender, you know to take it into consideration. This ensures each group is well represented.
309
Which sampling strategy is described as a divide-and-conquer strategy?
Stratified sampling
310
How are strata chosen for stratified sampling?
Strata are chosen so that units in each stratum are as alike as possible and units in different strata are as different as possible.
311
What are the purposes of stratification?
1 - to increase the accuracy and precision of the overall population estimates. 2 - to ensure that domains of study are adequately represented.
312
What is cluster sampling?
A sampling strategy where the population is divided into many groups, called clusters, and then we sample a fixed number of clusters and include all observations from each of those clusters in the sample. [Clusters are formed based on convenience, not on a measure of interest, ie the measure of interest is not why a unit is in that cluster.] Eg divide the class into tables and pick a sample of two tables.
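A minimal sketch of cluster sampling, following the class-and-tables example. The table names, student names and the function cluster_sample are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick n_clusters clusters at random and keep every unit in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = [unit for name in chosen for unit in clusters[name]]
    return chosen, sample

# Hypothetical class split into tables.
tables = {
    "table_A": ["Ana", "Ben", "Cy"],
    "table_B": ["Dee", "Eli"],
    "table_C": ["Fay", "Gus", "Hal", "Ida"],
    "table_D": ["Jo", "Kit", "Lou"],
}
chosen, sample = cluster_sample(tables, n_clusters=2, seed=3)
print(chosen, sample)
```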
313
What is multistage sampling?
A sampling strategy where the population is divided into many groups, called clusters, a number of clusters is selected, and then a random sample is collected within each selected cluster. It is similar to cluster sampling, but rather than keeping all observations in each selected cluster, we keep only a random sample from it.
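A minimal two-stage sketch of multistage sampling: first choose clusters at random, then draw a simple random sample inside each chosen cluster. The neighbourhood names, household IDs and the function two_stage_sample are hypothetical.

```python
import random

def two_stage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: sample clusters. Stage 2: sample units within each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    result = {}
    for name in chosen:
        members = clusters[name]
        # Never ask for more units than the cluster contains.
        k = min(n_per_cluster, len(members))
        result[name] = rng.sample(members, k)
    return result

# Hypothetical neighbourhoods, each with 10 household IDs.
neighbourhoods = {
    "north": list(range(100, 110)),
    "south": list(range(200, 210)),
    "east": list(range(300, 310)),
    "west": list(range(400, 410)),
}
picked = two_stage_sample(neighbourhoods, n_clusters=2, n_per_cluster=4, seed=7)
print(picked)
```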
314
Why might cluster or multistage sampling be preferred?
Sometimes they can be more economical than the alternative sampling techniques. They are most helpful when there is a lot of case-to-case variability within each cluster but the clusters themselves do not look very different from one another, eg neighbourhoods as clusters.
315
What are negatives of cluster/multistage sampling?
More advanced analysis techniques are typically required.
316
What should you consider when selecting a sampling strategy?
The situation, time and money. Simple random sampling may be the best to get representation but it can be expensive. Multistage sampling can reduce the costs without reducing reliability.
317
What are the stages of data science?
1 - Collect data, then process and clean it. 2 - Carry out exploratory data analysis (EDA) and apply machine learning, algorithms and statistical models. 3 - Communicate: produce visualisations and report findings [which leads to making decisions]. 4 - Build a data product. Data science is a cyclical process - once you build the data product, more data becomes available.
318
What is exploratory data analysis (EDA)?
A creative process of exploring data sets for patterns and relationships. Starting with lots of visualisations and summaries is a good idea.
319
What are the goals of EDA?
1 - Develop an understanding of the data by formulating questions. 2 - Search for answers using visualisation techniques and summary statistics. 3 - Use the answers obtained to refine questions and/or generate new questions.
320
What are 5 techniques used in EDA to search for answers?
Using visualisations and summary techniques: - Visualise distributions of all variables (using box plots and histograms) - Visualise time series of data - Investigate all pairwise relationships between variables using scatterplots - Perform data cleaning and variable transformation - Compute summary statistics (mean, median, lower and upper quartiles, minimum and maximum values) and identify missing data, errors and outliers
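A minimal sketch of the last technique, using Python's statistics module (quantiles needs Python 3.8 or later). The data values are made up, None stands for a missing entry, and the 1.5 x IQR rule for flagging outliers is an assumption of this sketch, not something the card specifies.

```python
import statistics

values = [4.1, 5.0, None, 5.6, 6.2, 7.3, 30.0]  # hypothetical measurements

# Separate observed values from missing ones.
observed = [v for v in values if v is not None]
n_missing = len(values) - len(observed)

# Lower quartile, median and upper quartile.
q1, q2, q3 = statistics.quantiles(observed, n=4, method="inclusive")

summary = {
    "min": min(observed),
    "lower_quartile": q1,
    "median": q2,
    "upper_quartile": q3,
    "max": max(observed),
    "mean": statistics.mean(observed),
}

# Assumed rule: flag values more than 1.5 * IQR beyond the quartiles.
iqr = q3 - q1
outliers = [v for v in observed
            if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print(summary)
print("missing:", n_missing, "outliers:", outliers)
```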
321
What kind of questions do you need to ask at the beginning of the EDA process?
Start simple. It is difficult to ask revealing questions at the start of an analysis because you do not yet know what insights are hidden in your dataset. There are no universal rules about which questions to ask to guide research. Useful starting points: - What type of variation occurs within my variables? - What is the relationship between variables?
322
What are summary statistics?
Statistics used to quantitatively describe a collection of measurements by summarising them in the form of a single value.
323
Describe the summary statistics and visualisation techniques for numerical variables.
Summary statistics: - Measures of centrality (mean, mode, median) ie the most typical values - Measures of variability (variance, standard deviation, range, quantiles, five number summary) ie the spread of the data Visualisation techniques: - Histograms - Boxplots
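A minimal sketch of the measures of centrality and variability, using Python's statistics module. The heights are made up, and variance and stdev here are the sample versions (divisor n - 1), which is an assumption about which definition is intended.

```python
import statistics

heights = [150, 160, 160, 165, 170, 175, 180]  # hypothetical heights in cm

# Measures of centrality
mean = statistics.mean(heights)
median = statistics.median(heights)
mode = statistics.mode(heights)

# Measures of variability (sample versions, divisor n - 1)
variance = statistics.variance(heights)
stdev = statistics.stdev(heights)
data_range = max(heights) - min(heights)

print(mean, median, mode)
print(variance, stdev, data_range)
```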
323
How do you answer the question: what type of variation occurs within my variables?
Summary statistics and visualisation techniques. Numeric: - Measures of centrality - Measures of variability - Histograms and box plots Categorical: - Counts - Percentages - Proportions - Bar charts
324
Describe the summary statistics and visualisation techniques for categorical variables
Summary statistics - Counts - Percentages - Proportions Visualisation techniques - Bar charts
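A minimal sketch of counts, proportions and percentages for a categorical variable, with a rough text bar chart. The colour data is made up for illustration.

```python
from collections import Counter

colours = ["red", "blue", "red", "green", "red", "blue"]  # hypothetical data

counts = Counter(colours)
n = len(colours)
proportions = {level: count / n for level, count in counts.items()}
percentages = {level: 100 * p for level, p in proportions.items()}

# Text bar chart: one '#' per observation in each category.
for level, count in counts.most_common():
    print(f"{level:<6} {'#' * count} ({percentages[level]:.1f}%)")
```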
325
How do you investigate the relationship between variables?
Summary statistics - Covariance and correlation (N-N) - Contingency tables (C-C) Visualisation techniques - Scatterplots (N-N) - Paired boxplots (N-C) - Paired histograms (N-C) - Mosaic plots (C-C)
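A minimal sketch of the summary statistics above: sample covariance and Pearson correlation for two numeric variables, and a contingency table for two categorical variables. All data values and variable names are hypothetical, and the functions are written by hand so the sketch does not depend on a particular Python version.

```python
import math
from collections import Counter

def covariance(x, y):
    """Sample covariance (divisor n - 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Numeric vs numeric (N-N)
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
print(covariance(x, y), correlation(x, y))

# Categorical vs categorical (C-C): count each pair of levels.
gender = ["F", "M", "F", "M", "F"]
smoker = ["yes", "no", "no", "no", "yes"]
table = Counter(zip(gender, smoker))
print(table)
```

Here y is an exact linear function of x, so the correlation is 1 up to floating-point rounding.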
326