Module 1: Study Design and Inference Flashcards

(54 cards)

1
Q

Statistics Definition (Summarised)

A

The science of learning from data, and of how to control and communicate uncertainty.

This involves study design, data collection, analysis, and interpretation, for the purpose of drawing conclusions and presenting uncertainty.

2
Q

Scope of Inference

A

Whether the results from the sample can be generalised to the broader population.

“What group of items do we want our conclusion to be valid for?”

3
Q

Study Types

A

Survey, Experimental, Observational

4
Q

Goal of Survey Studies

A

Sampling a population in order to describe that population’s characteristics.

5
Q

Survey Studies, the ‘Researcher’s Role’

A

The researcher has control over the collection method of the sample.

6
Q

Goal of Experimental Studies

A

Establish a causal relationship, by assigning treatments to experimental units.

7
Q

Experimental Studies, the ‘Researcher’s Role’

A

The experimenter has control over treatment assignment.

8
Q

Goal of Observational Studies

A

Investigate relationships (associations) between variables, as they occur in nature.

9
Q

Observational Studies, the ‘Researcher’s Role’

A

The researcher has control over which data to include and how to investigate those data.

10
Q

Ways to reduce noise (“haphazard variation”):

A
  • clever design that yields more precise results for the same cost/sample size.
  • measuring covariates, to explain more variation.
  • increase sample size.
11
Q

Ways to avoid bias:

A
  • appropriate study design, analysis, and interpretation.
  • Collecting a sample that’s representative of the population.
12
Q

Population of Inference is defined by…

A
  • Scope of Inference
  • Sampling Frame
  • And all items within the PoI follow a Probabilistic Sampling Scheme.
13
Q

Probabilistic Sampling Scheme Definition

A

Each sampling unit in the population has a definable probability of being included in the sample.

14
Q

Design-based Inference vs. Model-based Inference

A

Design-based inference relies on randomly selecting units from the population into the sample (so that the sample is representative of the population).
Model-based inference, by contrast, relies on distributional assumptions (e.g. normal vs non-normal).

15
Q

Sampling Types

A
  • Simple Random Sampling
  • Stratified Random Sampling
  • Cluster Sampling (I vs II)
  • Systematic Sampling
  • Convenience / Ad-Hoc Sampling
16
Q

Simple-Random Sampling

A

Every item, and every possible subgroup of items of a given size, has the same chance of being selected.
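
A minimal sketch of the idea in Python, assuming a hypothetical sampling frame of 100 unit IDs. The function name `simple_random_sample` and the seed are illustrative only.

```python
import random

def simple_random_sample(units, n, seed=None):
    """Draw n units without replacement; every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(list(units), n)

# Hypothetical sampling frame: unit IDs 1..100.
population = list(range(1, 101))
sample = simple_random_sample(population, 10, seed=1)
print(sample)
```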

17
Q

Benefits and Cons:
Simple-Random Sampling

A

This sampling method is cheap and easy to implement.

If the population is very large, the sample drawn may not be truly representative (mismatch), potentially leading to inaccuracies.

18
Q

Stratified-Random Sampling

A

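The population is divided into subgroups (strata) and units are randomly sampled within each. In the sample, the sizes of the subgroups are proportional to their sizes in the population.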
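
A rough sketch of proportional allocation in Python. The strata names, their sizes, and the function `stratified_sample` are hypothetical; with awkward stratum sizes the rounding may not sum exactly to n.

```python
import random

def stratified_sample(strata, n, seed=None):
    """Sample within each stratum, with stratum sample size proportional to stratum size."""
    rng = random.Random(seed)
    total = sum(len(units) for units in strata.values())
    chosen = {}
    for name, units in strata.items():
        k = round(n * len(units) / total)  # proportional allocation (rounded)
        chosen[name] = rng.sample(units, k)
    return chosen

# Hypothetical population of 1000 students split into three strata.
strata = {
    "first_year": list(range(0, 600)),
    "second_year": list(range(600, 900)),
    "third_year": list(range(900, 1000)),
}
chosen = stratified_sample(strata, 50, seed=1)
sizes = {name: len(units) for name, units in chosen.items()}
print(sizes)  # {'first_year': 30, 'second_year': 15, 'third_year': 5}
```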

19
Q

Benefits and Cons:
Stratified-Random Sampling

A

The sample will be proportionally representative of the population, and smaller strata can be over-sampled to get stratum-specific estimates.

There is a potential for misclassification, and it may not be possible to identify every subgroup.

20
Q

Cluster Sampling

A

Used when the population is made up of similar, naturally occurring groups (clusters), which are themselves sampled.
There are two types: single-stage and two-stage.

21
Q

Benefits and Cons:
Cluster Sampling

A

This sampling method is cheaper, and clusters can be easier to study than individuals, e.g. a school’s average vs a student’s average.

But it is less precise and can have multiple levels of variation.

22
Q

Single-Stage Cluster Sampling

A

Every unit within a cluster is sampled, e.g. everyone in the household.

23
Q

Two-Stage Cluster Sampling

A

Random selection of units within each sampled cluster, e.g. one member within the household.
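
The difference between the two cluster designs can be sketched as follows, assuming a hypothetical frame of 20 households with 3 people each. The function names `single_stage` and `two_stage` are illustrative.

```python
import random

def single_stage(clusters, n_clusters, rng):
    """Pick clusters at random, then take every unit inside each picked cluster."""
    picked = rng.sample(sorted(clusters), n_clusters)
    return [unit for c in picked for unit in clusters[c]]

def two_stage(clusters, n_clusters, n_per_cluster, rng):
    """Pick clusters at random, then randomly sub-sample units inside each one."""
    picked = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for c in picked:
        k = min(n_per_cluster, len(clusters[c]))
        sample.extend(rng.sample(clusters[c], k))
    return sample

# Hypothetical frame: 20 households of 3 people each.
households = {
    f"house_{i}": [f"house_{i}_person_{j}" for j in range(3)] for i in range(20)
}
rng = random.Random(1)
everyone = single_stage(households, 5, rng)   # 5 households, all members
one_each = two_stage(households, 5, 1, rng)   # 5 households, one member each
print(len(everyone), len(one_each))  # 15 5
```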

24
Q

Systematic Sampling

A

Selection of each unit is consistently spaced (interspersion), e.g., select every 4th entry on a list.
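
A small sketch of "every 4th entry" in Python. The random starting point is an assumption added here (the card only specifies the spacing), and the list of entries is hypothetical.

```python
import random

def systematic_sample(units, step, seed=None):
    """Take every `step`-th unit, starting from a random offset in [0, step)."""
    rng = random.Random(seed)
    start = rng.randrange(step)  # assumed random start; a fixed start also works
    return units[start::step]

entries = list(range(1, 41))  # hypothetical list of 40 entries
picked = systematic_sample(entries, 4, seed=1)
print(picked)
```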

25
Benefits and Cons: Systematic Sampling
This sampling method provides even sampling coverage of the population, and can be an alternative when randomisation is impossible. There is a heavy reliance on the assumption that units are independent and random, relative to the spacing. If this assumption fails, the design can be badly wrong.
26
Simple-Random vs Stratified-Random vs Clustered Sampling
Stratified random sampling gives a more precise estimate of the population than simple random sampling for the same sample size. Cluster sampling is typically less precise than either for the same sample size.
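The stratified vs simple random comparison can be checked with a small simulation. This is a sketch under strong assumptions: a made-up population of three strata whose means differ a lot, proportional allocation, and 500 repeated samples of size 50. Cluster sampling is not simulated here.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: three strata with very different means.
strata = {
    "A": [rng.gauss(10, 1) for _ in range(500)],
    "B": [rng.gauss(20, 1) for _ in range(300)],
    "C": [rng.gauss(30, 1) for _ in range(200)],
}
population = [y for units in strata.values() for y in units]
n = 50

def srs_mean():
    """Sample mean from one simple random sample of size n."""
    return statistics.mean(rng.sample(population, n))

def stratified_mean():
    """Sample mean from one proportionally allocated stratified sample."""
    pooled = []
    for units in strata.values():
        k = n * len(units) // len(population)  # 25, 15, 10
        pooled.extend(rng.sample(units, k))
    return statistics.mean(pooled)

# Spread of the estimate across repeated samples: smaller means more precise.
srs_sd = statistics.stdev([srs_mean() for _ in range(500)])
strat_sd = statistics.stdev([stratified_mean() for _ in range(500)])
print(round(srs_sd, 2), round(strat_sd, 2))
```

The gain from stratifying depends on how different the strata are; if they were all alike, the two designs would perform about the same.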
27
Convenience / Ad-Hoc Sampling
Selecting units close at hand, e.g. surveying students in front of the Business building.
28
Benefits and Cons: Convenience / Ad-Hoc Sampling
This sampling method is cheap, efficient, and easy to implement. But it involves no probabilistic sampling scheme and can be heavily biased.
29
Why is having a 'probabilistic sampling scheme' important?
Without it, you can't use a design-based justification for extrapolating from sample to population, because the extrapolation then rests on untestable assumptions.
30
Why is Study Design important?
Using the correct study design allows:
  • Extrapolation justified by the study design.
  • More robust results.
  • Identification of issues that may arise during sampling and analysis, e.g. pseudo-replication.
  • Reduced mismatch between sample and population.
31
What does Mismatch mean in study design?
When the sample is not representative of the population, so generalisations may be inaccurate.
32
Why can Mismatch in study design occur?
When the population of interest is not clearly defined, no probabilistic sampling scheme is applied, or voluntary/non-response bias occurs.
33
Selection Bias, and types.
Occurs when selected units systematically differ from the population of interest. This can occur as sampling bias, non-response bias, or voluntary bias.
34
Sampling Bias
When there is a mismatch that is not justified or accounted for in analysis and interpretation.
35
Non-Response Bias
When a particular subset of the population is less likely to respond.
36
Voluntary Bias
When certain participants self-select their involvement.
37
Response/ Information Bias, and examples.
When the measurement of units is systematically biased or incorrect, e.g. poorly calibrated instruments, leading questions, or true values that are not observed (such as an "other" option in surveys).
38
Two ways bias can be reduced through analysis.
Post-hoc stratification of respondents (categorising them), or explaining variation with covariates using linear regression. Both changes lead to more model-based inference.
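A minimal sketch of post-hoc stratification, assuming the population share of each stratum is known. The strata, responses, and shares below are invented for illustration; the "old" group is over-represented among respondents.

```python
def post_stratified_mean(responses, population_shares):
    """Weight each stratum's mean by that stratum's share of the population."""
    return sum(
        population_shares[stratum] * (sum(values) / len(values))
        for stratum, values in responses.items()
    )

# Hypothetical respondents: "old" is over-represented relative to a 50/50 population.
responses = {"young": [2, 4], "old": [6, 8, 10, 8]}
population_shares = {"young": 0.5, "old": 0.5}

all_values = [y for values in responses.values() for y in values]
raw_mean = sum(all_values) / len(all_values)
adjusted_mean = post_stratified_mean(responses, population_shares)
print(round(raw_mean, 2), adjusted_mean)  # 6.33 5.5
```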
39
What MUST experimental studies have?
Randomisation and Replication
40
How can the impact of 'unwanted variation' be reduced?
  • Known variation can be controlled by matching or grouping (stratification, paired studies, etc.).
  • Unknown variation ('random error') can be reduced by increasing replication.
  • The 'signal to noise ratio' can be increased through study design.
41
Why is replication important?
Replication establishes valid experimental results by ensuring reproducibility, providing robustness against aberrant results, and allowing experimental error (uncertainty) to be estimated. It also makes results more precise because, as replication increases, uncertainty decreases.
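One way to see "as replication increases, uncertainty decreases" is the standard error of a mean, s / sqrt(n). The standard deviation of 2.0 below is a made-up value.

```python
import math

def standard_error(sd, n):
    """Standard error of a sample mean from n independent replicates."""
    return sd / math.sqrt(n)

# Hypothetical replicate-to-replicate standard deviation of 2.0.
errors = {n: standard_error(2.0, n) for n in (4, 16, 64)}
print(errors)  # {4: 1.0, 16: 0.5, 64: 0.25}
```

Quadrupling the number of replicates halves the standard error, so precision improves with replication but with diminishing returns.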
42
Sampling Distribution
The distribution a statistic would have if we could repeatedly sample from the population. This is a theoretical property of a statistic that is used to make inferences.
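Although the sampling distribution is theoretical, it can be approximated by simulation. The sketch below assumes a made-up population of 10,000 uniform values and uses the sample mean (n = 25) as the statistic.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population of 10,000 values.
population = [rng.uniform(0, 100) for _ in range(10_000)]
n = 25

# Repeatedly sample and record the statistic (here, the sample mean).
means = [statistics.mean(rng.sample(population, n)) for _ in range(2000)]

pop_mean = statistics.mean(population)
pop_sd = statistics.pstdev(population)
print(round(statistics.mean(means), 2), round(pop_mean, 2))
print(round(statistics.stdev(means), 2), round(pop_sd / n ** 0.5, 2))
```

The sample means centre on the population mean, and their spread is close to the population standard deviation divided by sqrt(n).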
43
Statistical Inference
The application of the methods of probability to the analysis and interpretation of data. In inference we often infer properties of an unknown probability distribution using collected data.
44
What are we saying when we apply a model to our data?
The data will have certain characteristics and properties (dependent on the model used). Therefore, we can check how well the model fits by running diagnostic tests.
45
What are statistical models? And what are we interested in?
Approximations of reality that describe the data-generating process. When we use them, we are interested in the model parameters, as they often describe the underlying variation.
46
Simple Linear Regression, equations and explanation.
y = β₀ + β₁x + ε
ε ~ Norm(0, σ²)
The first line describes the mean (μ) of y as a straight line in x. The second line describes the variability around that line.
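The two parts of the model can be seen by simulating from it. This is a sketch only; the parameter values and x values are hypothetical.

```python
import random

rng = random.Random(0)

# Hypothetical parameter values for illustration.
beta0, beta1, sigma = 2.0, 0.5, 1.0
xs = [float(x) for x in range(10)]

# Mean structure: the expected value of y is a straight line in x.
mu = [beta0 + beta1 * x for x in xs]

# Error structure: normal scatter with standard deviation sigma around the line.
ys = [m + rng.gauss(0, sigma) for m in mu]

for x, m, y in zip(xs, mu, ys):
    print(x, m, round(y, 2))
```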
47
Explain the effect of different variances on a distribution when the mean is unchanged.
Higher variance makes the curve shorter and more spread out. Lower variance makes the curve taller and narrower.
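For a normal distribution this can be checked directly from the density, whose peak height is 1 / (σ·sqrt(2π)). The sketch below compares σ = 1 with σ = 2 at the same mean; the values are illustrative.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

peak_low_var = normal_pdf(0, 0, 1)    # taller peak
peak_high_var = normal_pdf(0, 0, 2)   # shorter peak
tail_low_var = normal_pdf(3, 0, 1)    # thin tail
tail_high_var = normal_pdf(3, 0, 2)   # fatter tail
print(round(peak_low_var, 3), round(peak_high_var, 3))
```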
48
What is the purpose of a Hypothesis Test?
Testing whether our observed data are consistent with a given hypothesis.
49
What is used when the population standard deviation is unknown?
s (the sample standard deviation) in place of σ, and a t distribution instead of a normal distribution.
50
Types of Estimators
Interval estimate: e.g. a 95% confidence interval. Point estimate: a single value (in effect a 0% confidence interval).
51
Name two methods of getting estimates
Ordinary Least Squares and Maximum Likelihood
52
How does Ordinary Least Squares (OLS) work?
Finds estimates that minimise the sum of squared differences between observed and expected (fitted) values.
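For a straight-line model the minimising estimates have a closed form. The sketch below uses a small invented data set; `ols_fit` and `sum_of_squares` are illustrative names.

```python
def ols_fit(xs, ys):
    """Closed-form least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def sum_of_squares(xs, ys, b0, b1):
    """Total squared difference between observed and fitted values."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = ols_fit(xs, ys)
print(round(b0, 2), round(b1, 2))  # 0.05 1.99
```

Any other intercept or slope gives a larger sum of squares, which is what "least squares" means.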
53
How does Maximum Likelihood (ML) work?
Finds the parameter estimates that maximise the likelihood of observing the data we actually measured (e.g. the weights) under our model. (Most commonly used.)
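A rough illustration using a normal model with a known standard deviation and a simple grid search over candidate means. The weights, the standard deviation of 0.25, and the grid are all assumptions made for the example.

```python
import math

def normal_log_likelihood(data, mu, sigma):
    """Log-likelihood of the data under a normal model with mean mu and sd sigma."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2 * sigma ** 2)
        for y in data
    )

# Hypothetical measured weights.
weights = [4.8, 5.1, 5.3, 4.9, 5.4]

# Candidate values for the mean: 4.50, 4.51, ..., 5.50.
candidates = [4.5 + 0.01 * i for i in range(101)]
best_mu = max(candidates, key=lambda mu: normal_log_likelihood(weights, mu, 0.25))

sample_mean = sum(weights) / len(weights)
print(round(best_mu, 2), round(sample_mean, 2))
```

For this model the maximum-likelihood estimate of the mean coincides with the sample mean, so the grid search lands on (approximately) 5.1.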
54
Using a Simple T-Test instead of a Paired T-Test: what issues occurred from the naive analysis? What would we observe if we used the correct analysis? (lecture 6 example)
Because the simple t-test ignores the sampling scheme and treats all the cockles as independent, there are two flaws in our interpretation.
  • Flaw 1: incorrect analysis (analysing cockles instead of quadrats), causing pseudo-replication.
  • Flaw 2: ignoring a controllable source of uncertainty (quadrat-to-quadrat variation).
With the correct analysis we would see the CI narrow, because more uncertainty is controlled by accounting for quadrat-to-quadrat variation.
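The effect of ignoring pairing (Flaw 2) can be sketched with invented numbers. These are not the lecture 6 data: the example assumes five quadrats, each measured twice, with large quadrat-to-quadrat differences. It does not illustrate pseudo-replication (Flaw 1).

```python
import math
import statistics

# Hypothetical paired measurements: one pair per quadrat.
first = [10, 20, 30, 40, 50]
second = [12, 21, 33, 42, 53]
n = len(first)

# Unpaired analysis: treats the two sets of values as independent samples.
se_unpaired = math.sqrt(
    statistics.variance(first) / n + statistics.variance(second) / n
)

# Paired analysis: works on within-quadrat differences, removing
# quadrat-to-quadrat variation from the uncertainty.
diffs = [b - a for a, b in zip(first, second)]
mean_diff = statistics.mean(diffs)
se_paired = statistics.stdev(diffs) / math.sqrt(n)

print(mean_diff, round(se_unpaired, 2), round(se_paired, 2))
```

Both analyses estimate the same mean difference, but the paired standard error is far smaller, so the resulting confidence interval is narrower.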