STAT Notes Flashcards

(184 cards)

1
Q

Define descriptive statistics

A

Methods used to summarize or describe our observations

2
Q

Describe inferential statistics

A

Using observations as a basis for making estimates or predictions

3
Q

What two methods can be used to ensure random sampling is truly random?

A

Mechanical
Blind

4
Q

Define mechanical sampling

A

Assigning every individual in the population a number, then randomly generating numbers to select the sample

5
Q

Define stratified random sampling

A

Divides the population into subgroups (strata) by a characteristic, then samples randomly from each stratum in proportion to that characteristic's share of the population

6
Q

Define dispersion of data

A

How spread out the data are around a given average (such as the mean)

7
Q

How is sample variance calculated?

A

Σ(xi - x̄)^2 ÷ (n-1) where xi is each value, x̄ is the sample mean and n is the number of observations

8
Q

How is standard deviation calculated?

A

√variance (the square root of the variance)

9
Q

How is standard error calculated?

A

sx ÷ √n where n is the number of observations and sx is standard deviation of a sample
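The three formulas on cards 7 to 9 build on each other, which a short sketch can make concrete. This is a minimal Python illustration, not part of the original deck; the function names and the `heights` data are made up for the example.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def standard_deviation(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))

def standard_error(values):
    """Sample standard deviation divided by the square root of n."""
    return standard_deviation(values) / math.sqrt(len(values))

# Hypothetical observations, used only to exercise the functions.
heights = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(sample_variance(heights), standard_deviation(heights), standard_error(heights))
```

Note that the standard error shrinks as n grows even when the standard deviation stays the same.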

10
Q

Define confidence interval

A

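A range of values, calculated from the sample, that is expected to contain the true population mean with a specified level of certainty (e.g. 95%), assuming a normal distribution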
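As a rough sketch of how such an interval might be computed: take the sample mean plus or minus a multiple of the standard error. The code below assumes a normal (z) multiplier rather than a t multiplier, which is a simplification that suits larger samples; the function name and data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def confidence_interval(values, confidence=0.95):
    """Return (lower, upper) bounds: mean +/- z * standard error.

    Assumption: uses the standard normal quantile, not Student's t.
    """
    centre = mean(values)
    standard_error = stdev(values) / sqrt(len(values))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return centre - z * standard_error, centre + z * standard_error

# Hypothetical sample
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
low, high = confidence_interval(sample)
print(low, high)
```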

11
Q

What proportion of sample means lies within one standard error of the population mean?

A

68%

12
Q

What proportion of sample means lies within two standard errors of the population mean?

A

95%

13
Q

What proportion of sample means lies within three standard errors of the population mean?

A

99.7%
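The 68%, 95% and 99.7% figures on cards 11 to 13 are properties of the normal curve. A minimal check using Python's standard library is sketched below; the helper name `proportion_within` is made up for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def proportion_within(k):
    """Area under the standard normal curve between -k and +k."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, proportion_within(k))
```

The "two" on card 12 is a rounding: exactly 95% of the area lies within about 1.96 units of the centre.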

14
Q

What function shows perfect normal distribution?

A

Gaussian function

15
Q

Define nominal data

A

Classified by name only, with no inherent order

16
Q

Define ordinal data

A

Classified into categories that have a meaningful order

17
Q

What are the two types of variables?

A

Categorical
Numeric

18
Q

What are the two sub categories of categorical data?

A

Ordinal
Nominal

19
Q

How is categorical data referred to in R?

A

factor()

20
Q

What are the two sub categories of numeric data?

A

Discrete
Continuous

21
Q

How is discrete data referred to in R?

A

integer

22
Q

How is continuous data referred to in R?

A

numeric

23
Q

Define skewed distribution

A

An asymmetrical distribution, with one tail longer than the other (skewness is the measure of this asymmetry)

24
Q

Define bimodal distribution

A

There are two modes (can be symmetrical or asymmetrical)

25
Define a bin
An interval (range of values) into which data points are grouped, such as one bar of a histogram
26
Define central tendency
The central or typical value of a dataset (e.g. mean, median or mode)
27
Define probability
Proportion of times a particular outcome will occur from a large sample of trials or the likelihood of a particular outcome of an event
28
What does a P=0 (probability=0) suggest?
Impossible
29
What does a P=1 (probability=1) suggest?
Certainty
30
What does it mean if trials are independent?
The outcome of one trial has no impact on the outcome of any other trial
31
What is the probability of a OR b where they are both mutually exclusive?
P(a)+P(b)
32
What is the sum of the probabilities of all possible mutually exclusive outcomes?
1
33
When we use OR when describing mutually exclusive probabilities how do we combine these values?
Add
34
What is a probability distribution?
Graphical distribution of theoretical relative probabilities y=probability, x=potential outcomes
35
What is true about the area of any sections of a probability distribution graph?
Equivalent to the relative probability
36
How can we draw probabilities with multiple trials but limited outcomes?
Table
Probability tree
37
How do we combine independent events using AND?
Multiply probabilities together
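The OR and AND rules can be checked with a fair six-sided die. The sketch below is illustrative only: it adds probabilities for mutually exclusive outcomes (OR) and multiplies them for independent events (AND). The die and variable names are assumptions made for the example.

```python
from fractions import Fraction

# OR with mutually exclusive outcomes: rolling a 1 or a 2 on a single throw.
p_one = Fraction(1, 6)
p_two = Fraction(1, 6)
p_one_or_two = p_one + p_two

# AND with independent events: rolling a 6 on each of two separate throws.
p_six = Fraction(1, 6)
p_two_sixes = p_six * p_six

print(p_one_or_two, p_two_sixes)
```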
38
Define probability distribution
Theoretical probability of each outcome
39
Define frequency distribution
Observed frequency of each outcome
40
After more trials what becomes true about the frequency distribution and probability distribution?
Frequency distribution approaches probability distribution
41
When can we use binomial statistics?
When there are two groups of outcomes (such as A and B, or pass and fail)
NOTE: we can create these groups by defining some outcomes as "success" and the others as "failure", and classifying all other outcomes beneath these banners
42
Give examples for which type of questions binomial distribution may be used for
Predict the probability of success in a single trial
Predict the proportion of successes in n trials
43
What are requirements for binomial statistics?
- 2 outcomes (P(success)=p and P(failure)=q) and p+q=1
- Each trial is independent with equal p
- Fixed no. of trials
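Under those requirements, the probability of exactly k successes in n trials is C(n, k) × p^k × q^(n-k). A minimal sketch follows; the function name and the choice of n = 10, p = 0.5 are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    q = 1 - p  # probability of failure, so p + q = 1
    return comb(n, k) * p ** k * q ** (n - k)

# Hypothetical example: 10 fair coin tosses, success = heads.
n, p = 10, 0.5
distribution = [binomial_pmf(k, n, p) for k in range(n + 1)]
print(distribution)
```

The probabilities over all possible values of k sum to 1, as expected for a complete set of mutually exclusive outcomes.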
44
As number of trials increases what becomes true of discrete data?
Begins to resemble continuous data
45
How can we approximate binomial distribution?
With a normal distribution (when the number of trials is large)
46
How can we find the probability up to any point (normal distribution)?
Area under the graph up until that point
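A sketch of the idea: for a large number of trials, the exact binomial probability of "up to 55 successes" is close to the area under a normal curve with mean n×p and standard deviation √(n×p×q). The numbers are made up, and the 0.5 continuity correction is an assumption added here, not something the cards mention.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 100, 0.5  # hypothetical: 100 fair coin tosses
q = 1 - p

# Exact binomial probability of 55 or fewer successes.
exact = sum(comb(n, k) * p ** k * q ** (n - k) for k in range(0, 56))

# Normal approximation: area under the curve up to 55.5
# (55 plus a continuity correction of 0.5, an added assumption).
approximation = NormalDist(mu=n * p, sigma=sqrt(n * p * q)).cdf(55.5)

print(exact, approximation)
```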
47
Rules for hypothesis testing
- Understand the certainty of a hypothesis test
- Don't base scientific decisions on hypothesis tests alone
- Consider the wider picture and plausibility of results
48
Which letter denotes significance level?
Alpha
49
Which two hypotheses are needed for a hypothesis test?
H0: null hypothesis (no change)
HA: alternative hypothesis (covers all other possibilities)
These hypotheses must be mutually exclusive
50
What do we assume about H0 in a hypothesis test?
That H0 is true
51
What is referred to as the critical region?
The area of the distribution beyond the critical value, where results are extreme enough to reject H0 (its total area equals alpha)
52
When is the null hypothesis rejected in hypothesis testing?
When p < alpha (the p value is below the significance level)
53
What is a tail?
Area at the end of the distribution
54
How do we test both tails?
Two-tailed test
55
How many critical regions are present in a two tailed test?
2
56
If alpha=0.05 and a two-tailed test is performed, what % of values lie outside the critical region?
95%
57
What is a p value?
The probability of getting a result at least as extreme as the one observed, assuming the null hypothesis is true
58
What is a contingency table?
A table that shows every combination of the true state (H0 true or HA true) and the test outcome (reject or fail to reject H0)
59
If H0 is true and we reject it, what is true?
False positive
Type I error
We do not know what is true
60
If H0 is true and we fail to reject H0 what is true?
There is a true negative
H0 is true
61
If HA is true and we fail to reject H0 what is true?
False negative
Type II error
HA was true
62
If HA is true and we reject H0 what is true?
True positive
H0 is untrue; this does not confirm HA
63
If H0 is true, what are the possible outcomes/errors?
True negative (H0 is true and we fail to reject H0)
Type I error (H0 is true and we reject H0)
64
At an alpha value of 0.05, how often would we expect a Type I error, if H0 is true?
5% Type I error (95% true negative)
65
What does statistical power tell us about a test?
How powerful a test is at detecting true positives when there really is a difference to detect
66
When the HA is true, when do we fail to reject the null hypothesis? What type of error is this?
When the result falls outside the critical region (on the H0 side of the critical value)
This is a type II error and is shown where the HA graph overlaps with the H0 graph
67
How is beta defined graphically?
The area under the HA graph that overlaps the H0 graph outside the critical region, i.e. on the H0 side of the critical value (where HA is true)
68
How is power of a test calculated?
Power=1-beta
69
If the power of a test is 0.979, how often do you get type II error?
2.1% of the time
70
If HA is closer to H0, is the power greater or smaller?
Smaller
It is more difficult to detect a true effect
71
If a test has high power, what is true about the likely error rate?
There will be a lower rate of false negatives (type II error)
72
How can the power of a test be increased graphically?
Increase effect size:
Make the curves skinnier so that they separate
Increase the distance between peaks
73
What is true of power if effect size is increased?
Power increases (less type II error)
74
How is effect size increased?
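Increased trials decrease curve dispersion (a smaller standard error), which separates the curves; strictly, this raises power without changing the underlying difference between the means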
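Cards 68 to 74 can be tied together with a small calculation. The sketch below assumes a one-tailed z test with a known standard deviation and an HA mean above the H0 mean; the function name and all numbers are hypothetical. Beta is the area of the HA curve below the critical value, and power is 1 - beta.

```python
from math import sqrt
from statistics import NormalDist

def power_one_tailed(mu_h0, mu_ha, sigma, n, alpha=0.05):
    """Power of an upper-tailed z test (assumes mu_ha >= mu_h0, sigma known)."""
    standard_error = sigma / sqrt(n)
    # Critical value: cuts off the top alpha of the H0 sampling distribution.
    critical_value = NormalDist(mu_h0, standard_error).inv_cdf(1 - alpha)
    # Beta: area of the HA curve that falls below the critical value.
    beta = NormalDist(mu_ha, standard_error).cdf(critical_value)
    return 1 - beta

# More trials make both curves skinnier, so power rises.
print(power_one_tailed(100, 105, 15, n=10))
print(power_one_tailed(100, 105, 15, n=50))
# Moving HA further from H0 also raises power.
print(power_one_tailed(100, 110, 15, n=10))
```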
75
What must be present for hypothesis testing?
There must be two hypotheses:
H0 - null hypothesis (no change/effect)
HA - alternative hypothesis (mutually exclusive and covers all other options (different for one and two-tailed tests))
76
Why is the p value not the probability of a false positive?
It is only the probability of a false positive if the null hypothesis is true. We cannot know whether the null hypothesis is true; we can only speculate based on evidence
77
Define power
Proportion of true positives for a particular HA
78
Define Multiple Testing
Comparing and testing several conditions or treatments
79
When is a two-sample t test performed?
When comparing two samples with each other (e.g. control and drug)
80
When is a one-sample t-test performed?
When comparing a sample mean to a known or hypothesised mean
81
When may a paired t test be performed?
When samples are closely related to one another (such as before and after a treatment)
82
What are the assumptions of a t test?
The outcome (dependent) variable is continuous and the experimental (independent) variable is bivariate
Normal distribution
Equal variance
83
What is a bivariate variable?
One that contains two groups (more commonly called a binary variable)
84
What is a Q-Q plot?
A normal quantile-quantile plot compares quantiles of your data to theoretical quantiles for a normal distribution (if these match closely the data is normally distributed)
85
What is the danger of performing many tests?
There is an increase in the probability of false positives (the family-wise error rate, FWER)
86
Define FWER
Family-wise error rate is the probability of getting at least one false positive across a set of tests if the null hypotheses are true
87
What calculation gives the probability of not getting a false positive in n tests?
(1-alpha)^n
88
What calculation gives the probability of at least one false positive in n tests?
1-(1-alpha)^n
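The two formulas on cards 87 and 88 can be evaluated directly. This sketch assumes the n tests are independent, which the formulas also require; the function names are made up for illustration.

```python
def prob_no_false_positive(alpha, n_tests):
    """Probability that none of n independent tests gives a false positive."""
    return (1 - alpha) ** n_tests

def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one false positive in n independent tests."""
    return 1 - prob_no_false_positive(alpha, n_tests)

for n_tests in (1, 5, 10, 20):
    print(n_tests, family_wise_error_rate(0.05, n_tests))
```

With alpha = 0.05, ten tests already carry roughly a 40% chance of at least one false positive.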
89
Define F-test
Compares several samples with each other and compares variance within samples with that between samples
90
What are other names for an F-test?
Analysis of variance (ANOVA)
91
Why is an ANOVA done?
Compare means with one another to find statistical difference
92
Define overall mean
Mean of sample means (Add all means and divide by number of groups)
93
What are the two types of studies?
Observational
Experimental
94
Define observational study
Makes observations without intervention
95
Define interventional study
A study where an intervention is made to test a hypothesis
96
Define statistical or scientific variable
Any relevant condition, characteristic, number or quantity that can be measured, assessed or counted
97
What is another name for independent variable?
Explanatory variable
98
What is another name for dependent variable?
Response variable
99
Define a confounding variable
One that could impact the measurement from your dependent variable in addition to your independent variable
100
Define error
The difference between the result for a whole population and the result from our sample or experiment.
101
What are the two main types of error we can control for?
Sampling error
Bias
102
Define sampling error
The possibility that the sample is not a perfect representation of the population
103
What distribution is shown by sampling error?
Normal (allowing for statistical testing)
104
What are the main techniques for controlling error?
Replication
Balance
Blocking
105
Why is replication a method of decreasing sampling error?
The more data we collect, the more insignificant errors become
106
What are the two types of replicates?
Technical
Biological
107
Define technical replicates
These are additional measurements or analyses taken from the same sample. They help account for variability introduced by the measurement process itself.
108
Define biological replicates
These involve separate samples that are independently manipulated or tested under identical conditions
109
Define Blocking
Grouping experimental units with similar properties
110
Define Balance
This is the process of comparing groups of similar sizes
111
Define bias
Error caused by a systematic difference between the estimate from the sample and the true value for the whole population
112
In what stages of an investigation may bias occur?
Any (Design, data collection, analysis, publication etc...)
113
How can bias be controlled for?
Simultaneous control groups
Blinding
Randomisation
114
Define a simultaneous control group
A group of subjects not exposed to the experimental treatment but treated the same in all other ways
115
What are the two types of control treatments?
Untreated control
Vehicle control
116
What is an untreated control?
Subject in its native state with no treatment
117
What is a vehicle control?
Subject undergoes treatment with everything but the exact thing being tested (e.g. the drug)
118
What is a best-available therapy control?
Testing against a pre-existing drug as opposed to a vehicle control
119
Define a positive control
A control which defines what a positive result looks like
120
Define a negative control
A control which defines what a negative result looks like
121
Describe blinding
The process of obscuring who has which treatment to limit the placebo effect
122
Define randomisation
Assigning individuals to groups or treatments at random so as not to introduce further sampling bias
123
What methods are used to investigate the relationships between 2 continuous variables?
Correlation
Regression
124
What may correlation tell us about a relationship?
Its strength and direction
125
What is denoted by "r"?
Correlation coefficient
126
What is the range of "r"?
-1 to +1
127
What would be the "r" value of a perfect positive linear correlation?
+1
128
What would be the "r" value of a perfect negative linear correlation?
-1
129
What does an "r" value between +/- 0-0.2 suggest?
Very weak or negligible correlation between the two variables
130
What does an "r" value between +/- 0.2-0.4 suggest?
Weak or low correlation between the two variables
131
What does an "r" value between +/- 0.4-0.7 suggest?
Moderate correlation between the two variables
132
What does an "r" value between +/- 0.7-0.9 suggest?
Strong, high and marked correlation between the two variables
133
What does an "r" value between +/- 0.9-1.0 suggest?
Very strong and very high correlation between the two variables
134
What does the r^2 value tell us?
How much of the variation in one variable can be explained by the other
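As an illustration of r and r^2, the sketch below computes Pearson's correlation coefficient by hand. The function name and the `dose` and `response` data are invented for the example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / sqrt(ss_x * ss_y)

# Hypothetical data
dose = [1.0, 2.0, 3.0, 4.0, 5.0]
response = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson_r(dose, response)
r_squared = r ** 2  # share of variation in one variable explained by the other
print(r, r_squared)
```

Here r is about 0.77 (a strong correlation on the scale above) and r^2 is 0.6, so about 60% of the variation in one variable is explained by the other.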
135
In which types of experiments do we compare continuous variables?
1. Looking for an association between variables where neither is experimentally manipulated
2. Experimentally manipulating one variable and looking to see whether the other variable changes too
136
What can we use to predict the value of a variable when we know its correlation to another?
Regression
137
What makes a regression prediction more confident?
A higher correlation coefficient
138
What is true of values around the line of best fit when there is a strong correlation?
There is little variability about the line of best fit
139
When can a y=mx+c regression line be drawn?
When there is a linear correlation
140
What is goodness of fit?
Assessment of how well a linear regression line fits data
141
How can we judge how well a regression equation fits data?
Using the r^2 value
Looking at the residuals
142
How is a linear regression drawn?
As a straight line through the data points
143
Define fitted value
The value of y that the regression line predicts for a given value of x
144
Define the residual
The vertical distance between a given point and its fitted value
145
How can we use the residual to check the goodness of fit of a linear regression?
Plot a residual plot - residual against fitted value - and observe if there are any patterns
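A minimal sketch of fitted values and residuals for a y=mx+c line, using ordinary least squares. The data and function name are hypothetical; the plot itself is left out, but the `fitted` and `residuals` lists are what a residual plot would display.

```python
def fit_line(xs, ys):
    """Least-squares slope (m) and intercept (c) for y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

slope, intercept = fit_line(xs, ys)
fitted = [slope * x + intercept for x in xs]       # value on the line at each x
residuals = [y - f for y, f in zip(ys, fitted)]    # observed minus fitted
print(list(zip(fitted, residuals)))
```

For a least-squares line the residuals always sum to (approximately) zero, so it is their pattern against the fitted values, not their total, that indicates a poor fit.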
146
What does a pattern on a residual plot suggest?
A linear equation may not be appropriate for the data presented
147
What does a residual plot look like where the linear relationship was the best possible fit?
Points are evenly scattered on either side of the zero line, with no pattern
148
If a is explained by b, with a known value of b, can we predict a?
Yes using the linear regression
149
If a is explained by b, with a known value of a, can we predict b?
No, we need to create a regression in the other direction to describe b in terms of a
150
Define questionable research practices (QRPs)
Refers to a number of activities, often related to the misinterpretation of statistics, that occur in published scientific work
151
Define cherry picking as a QRP
The practice of cherry picking refers broadly to only presenting one side of the story. Specifically in relation to statistics, this translates as choosing not to report parts of your analysis which do not agree with the story you are trying to tell. This is often used to "tidy up" or create a "convincing" story
152
Define P-hacking as a QRP
Ultimately manipulating your data or analysis to result in a significant p value
153
Give examples of P-hacking
- checking the statistical significance before deciding whether to collect more data
- stopping data collection as soon as results reflect those desired
- excluding data after checking the impact on significance
- adjusting models on the basis of whether or not a significant result is obtained, without proper justification
- rounding a p-value to the threshold
- hidden multiple testing, and therefore no p-value adjustments
154
Define HARKing as a QRP
HARKing (Hypothesising After the Results are Known) is presenting results that have been discovered as if they were expected, or as if they were the main study aim (overstating prior knowledge of the study). Presenting ad hoc or unexpected results in this way is misleading
155
Define ad hoc
An unplanned or supplementary analysis conducted to explore specific aspects of data that weren't the primary focus of the study. This is done on an as-needed basis to investigate particular comparisons or relationships not initially accounted for in the main analysis.
156
Are QRPs evidence of academic misconduct?
No, they are questionable but not misconduct
157
What are the two main forms of research misconduct?
Fabrication and falsification
158
Define fabrication
Making up data or results
159
Define falsification
The manipulation of research materials, data or results
160
161
162
What are the assumptions of an ANOVA?
Data needs to be normally distributed
Data should be from independent observations, which means that there is no relationship between the observations in each group or between the groups themselves
Equal variances between groups (homogeneity of variances, homoscedasticity)
163
Define homoscedastic
The fundamental assumption that the variance of the errors (or residuals) should be constant across all levels of the independent variable(s) (Violated homoscedasticity is known as heteroscedasticity)
164
Define homogeneity
Refers to the similarity or uniformity of certain characteristics within a group or between groups.
165
When doing an ANOVA how do you find the degrees of freedom (DF) between groups?
K-1
Where K is the number of groups being compared
166
When doing an ANOVA how do you find the degrees of freedom (DF) within groups?
N-K
Where K is the number of groups being compared and N is the total number of observations/data points collected.
167
Define the sum of squares
Quantifies variability between the groups of interest and within groups of interest in separate rows
168
What is the overall sum of squares?
The sum of the squared differences between each datapoint and the overall mean, also called SST, for sum of squares (total).
169
Define SSW
The sum of squares within the groups is defined as the sum of the squared differences between each datapoint and the mean of the group it belongs to. This shows the variation within each individual group.
170
Define SSB
The sum of squares between the groups is defined as the sum, over every datapoint, of the squared difference between the mean of its group and the overall mean. This shows the variation between the groups.
171
What is the maximum value of a datapoint before it is considered an outlier
Q3+1.5 IQR
172
What is the minimum value of a datapoint before it is considered an outlier
Q1-1.5 IQR
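The two fences on cards 171 and 172 can be computed as sketched below. Software packages differ in how they interpolate quartiles; `method="inclusive"` is an assumption chosen for this example, and the data are made up.

```python
from statistics import quantiles

def outlier_fences(values):
    """Return (lower, upper) fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    # Assumption: "inclusive" quartile interpolation; other conventions
    # give slightly different Q1 and Q3.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
lower, upper = outlier_fences(data)
outliers = [x for x in data if x < lower or x > upper]
print(lower, upper, outliers)
```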
173
What is the nature of a binomial distribution?
The binomial distribution is discrete, dealing with the number of successes in a fixed number of trials.
174
What is the nature of a normal distribution?
The normal distribution is continuous and is often associated with the distribution of measurements in a population.
175
Which parameters are common in binomial distribution?
The binomial distribution is characterized by the number of trials (n) and the probability of success (p).
176
Which parameters are common in normal distribution?
The normal distribution is characterized by the mean (μ) and standard deviation (σ).
177
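What must be borne in mind when calculating the critical value for a 2 tailed test?
Alpha is split between the two tails: use alpha/2 at each end, so the critical values sit at the alpha/2 and 1-(alpha/2) quantiles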
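For example, with a standard normal test statistic (an assumption made for this sketch; a t test would use the t distribution instead), the two-tailed critical values for alpha = 0.05 can be found as follows.

```python
from statistics import NormalDist

alpha = 0.05
standard_normal = NormalDist()

# Half of alpha goes in each tail, so the upper critical value sits at
# the 1 - alpha/2 quantile and the lower one mirrors it.
upper_critical = standard_normal.inv_cdf(1 - alpha / 2)
lower_critical = standard_normal.inv_cdf(alpha / 2)

print(lower_critical, upper_critical)
```

This gives roughly -1.96 and +1.96, compared with about 1.645 for a one-tailed test at the same alpha.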
178
What is the difference in the analysis of variance between a boxplot and an ANOVA?
A boxplot is a qualitative analysis whilst an ANOVA is quantitative
179
What is the Mean sq. and how is it calculated (ANOVA)?
The mean square is a variance estimate shown in the ANOVA output; it is used to calculate the F-statistic (the next column)
Calculated by dividing the Sum of Squares by the DF on the same row
180
What is an F-statistic (ANOVA) and how is it calculated?
This is defined as the ratio between the Mean Squares between and within. Calculated by Mean squares of row 1/mean squares of row 2.
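The ANOVA cards above (degrees of freedom, SSB, SSW, mean squares and the F-statistic) can be followed through in one hand calculation. This is an illustrative sketch with made-up groups and a hypothetical function name, not the output of any particular statistics package.

```python
def one_way_anova(groups):
    """Return (ssb, ssw, df_between, df_within, f_statistic)."""
    all_values = [x for group in groups for x in group]
    overall_mean = sum(all_values) / len(all_values)
    k = len(groups)            # number of groups
    n_total = len(all_values)  # total number of observations

    ssb = 0.0  # sum of squares between groups
    ssw = 0.0  # sum of squares within groups
    for group in groups:
        group_mean = sum(group) / len(group)
        ssb += len(group) * (group_mean - overall_mean) ** 2
        ssw += sum((x - group_mean) ** 2 for x in group)

    df_between = k - 1
    df_within = n_total - k
    mean_sq_between = ssb / df_between
    mean_sq_within = ssw / df_within
    return ssb, ssw, df_between, df_within, mean_sq_between / mean_sq_within

# Hypothetical data: three groups of three observations.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
ssb, ssw, df_between, df_within, f_statistic = one_way_anova(groups)
print(f"F({df_between}, {df_within}) = {f_statistic:.2f}")
```

The p value is omitted because it needs the F distribution, which Python's standard library does not provide.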
181
How do we use an F-statistic (ANOVA)?
If it is above a threshold (critical) value, the null hypothesis can be rejected
182
What does a high F-statistic mean (ANOVA)?
More likely to be a statistically significant difference between groups.
183
How do we report an ANOVA?
F(df between, df within) = F-statistic, p = p value
184
What test is used to determine where the differences between groups lie?
Post-hoc tests such as Tukey's Honest Significant Difference (HSD) test