Critical numbers - statistics Flashcards
What is the target population and sample population?
The target population is the whole group we want to learn about. We can't collect information from everyone, so we take a subset of the whole population; this is known as the sample population.
What is sampling bias?
What is recall bias?
Social desirability bias?
Information bias?
Sampling bias = some individuals are more or less likely to be included in the study than others
Recall bias = individuals cannot remember the specifics asked about in a question
Social-desirability bias = individuals give us incorrect information because they feel societal pressure
Information bias = measurement bias (the data are measured or recorded inaccurately)
What is a background/confounding factor?
Something that is responsible for the outcome and related to the exposure.
Example: screen use and poor vision. Confounder = lack of natural light.
Types of study design:
Experimental vs observational
Retrospective vs prospective
Individual vs population
Experimental = researcher changes something/ has intervened
Observational = researcher just collected data
Retrospective = look back to see if exposure caused outcome
Prospective = collect information to see if current exposure leads to outcome
Individual = info collected on an individual - usual study design
Population/ecological = whole populations looked at
Types of study:
Case control
Look at individuals with the outcome (cases) and matched individuals without it (controls), then look back to see who had the exposure.
Good for investigating rare disease
Cross sectional study:
Look at what is happening now (snapshot of time)
Who currently has exposure and the outcome
Difficult to establish order of events
Cohort study:
Collect information on a sample: some have the exposure, some do not, and no one has the outcome yet. Then follow up to see whether those with the exposure go on to have more outcomes.
Time consuming, expensive
Randomised controlled trial:
Have multiple groups(also known as arms)
Give a different exposure to each group
Compare the outcomes between the groups
Steps to avoid bias:
Blinding - single and double
Randomisation - e.g. flipping a coin
Placebos
Matching - groups identical, with the only difference being the exposure
Gold standard, but expensive and not suitable for all exposures
Crossover trial:
Extension of an RCT where everyone in the study receives all of the different exposures, so each participant's results can be compared with their own.
The order in which they receive the treatments/exposures is randomised
Not always suitable as there may be carry-over effects
What is a variable?
A quantitative measure of something that varies.
What is a categoric variable and what are the subtypes?
Categoric variables fit into a particular category.
Binary = 2 categories - yes or no
Ordinal = more than 2 categories with a natural ordering e.g. low, medium and high
Nominal = more than 2 categories with no ordering e.g. hair colour, ethnicity
What is a numeric variable and what are the subtypes?
A variable that is measured on a scale.
Discrete = can only take a distinct set of values e.g. age in years
Continuous = can take any value within its limits e.g. weight
What is descriptive statistics?
Collection of statistical measures used to describe the data sample we have.
Definitions of:
Proportion
Probability
Odds
Rate
Proportion = number with outcome/total number
Probability = the proportion read as the chance of the outcome (between 0 and 1); proportion x 100 gives the percentage
Odds = number with outcome/number without
Rate = number of times something happens per specified quantity e.g. x per 100 people
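The definitions above can be sketched in a few lines of Python. The counts (20 people with the outcome out of 100) and the function names are made up purely for illustration.

```python
# Minimal sketch of proportion, percentage, odds and rate.
# All counts below are hypothetical, chosen only for illustration.

def proportion(with_outcome, total):
    """Number with the outcome divided by the total number."""
    return with_outcome / total

def odds(with_outcome, total):
    """Number with the outcome divided by the number without it."""
    return with_outcome / (total - with_outcome)

def rate_per(with_outcome, total, per=100):
    """Number of events expressed per a chosen quantity (e.g. per 100 people)."""
    return with_outcome / total * per

p = proportion(20, 100)          # 0.2 (a probability between 0 and 1)
percentage = p * 100             # 20 percent
o = odds(20, 100)                # 20 / 80 = 0.25
r = rate_per(20, 100, per=1000)  # 200 per 1000 people
```

Note that the proportion (0.2) and the odds (0.25) differ even though they describe the same data; the gap grows as the outcome becomes more common.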
What is the risk difference?
Risk ratio?
Odds ratio?
Risk difference = subtraction of one proportion from the other
“Risk with X …% higher than with Y”
Risk ratio = Group A/Group B proportion or percentage
The focus on the top compared to on bottom
If greater than 1, the risk in group A is larger than in B; if 1, it is the same; if less than 1, it is smaller.
1.85 shows an increased risk of 85%
0.80 shows a decreased risk of 20%
Odds ratio = Group A odds/ Group B odds
Odds increased or decreased by X
Remember an odds ratio of 2 is only a 100% increase in odds
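As a worked illustration of the three comparisons, here is a small sketch. The two groups (30 of 100 with the outcome in group A, 20 of 100 in group B) are hypothetical numbers, not data from any study.

```python
# Sketch: risk difference, risk ratio and odds ratio for two groups.
# Counts are hypothetical and used only to illustrate the formulas.

def risk(events, total):
    return events / total

def odds(events, total):
    return events / (total - events)

a_events, a_total = 30, 100   # group A (assumed)
b_events, b_total = 20, 100   # group B (assumed)

risk_difference = risk(a_events, a_total) - risk(b_events, b_total)  # about 0.10
risk_ratio = risk(a_events, a_total) / risk(b_events, b_total)       # about 1.5
odds_ratio = odds(a_events, a_total) / odds(b_events, b_total)       # about 1.71

# Percentage change implied by a ratio: (ratio - 1) x 100
risk_increase_percent = (risk_ratio - 1) * 100                       # about 50
```

Here the risk ratio of 1.5 reads as a 50% increase in risk, while the absolute difference is only 10 percentage points.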
Odds ratio and risk ratio can cause what?
They can cause unnecessary panic: a 200% increase may sound large, but the actual (absolute) risk could still be very small.
What does standard deviation show?
Shows the spread of data about the mean
Sigma is the symbol for
Standard deviation
Variance =
SD squared
Mean =
Sum of numbers/total number
Median =
middle number of data set
If there are 2 middle numbers, take their average
In a perfect symmetric distribution the mean and median are…
Equal
When the distribution is not symmetric you are said to have a…
Skewed distribution
It can be right or left skewed depending on the position of the outlier.
The outlier will skew it in that direction
What are the three measures of spread we have learnt about?
Range
SD
Interquartile range
How to work out range?
Is it useful?
Largest minus smallest value
Not good for data with outliers
IQR
What is it and is it good?
Represents the middle 50% of the data.
To calculate, order the values, then calculate the 25th and 75th percentiles; either report both or subtract the 25th from the 75th.
Associated with the median and is better for data with outliers.
Standard deviation
What is it and is it good?
Spread of data about the mean
Again it can be skewed by outliers as it takes into account the mean
However it is more powerful as it uses all the data. Therefore it should be used in statistics unless the data is skewed. If the data is skewed then IQR should be used.
In a symmetric distribution what should be used to summarise data?
In a non symmetric distribution what should be used to summarise the data?
Symmetric = use mean and SD
Non-symmetric = use median and IQR as they are not affected by outliers.
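A quick sketch of why the median and IQR are preferred for skewed data. The two small data sets are invented; they are identical except that the largest value is replaced by an outlier. The quartiles use the default method of Python's `statistics.quantiles`, which is only one of several quartile conventions, so other software may give slightly different values.

```python
import statistics

# Hypothetical data: the second list swaps the largest value for an outlier.
clean = [2, 3, 4, 5, 6, 7, 8, 9, 10]
skewed = [2, 3, 4, 5, 6, 7, 8, 9, 100]

def summarise(values):
    """Return mean, SD, median and IQR for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # 25th, 50th, 75th percentiles
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }

clean_summary = summarise(clean)
skewed_summary = summarise(skewed)

# The mean and SD move a lot with the outlier; the median and IQR do not.
print(clean_summary)
print(skewed_summary)
```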
What is a normal distribution?
Certain numerical values, e.g. weight, will follow a normal distribution when plotted.
This is because most people have values that are in the middle around the mean with only a few extremes either side.
The shape of the curve is bell shaped and hence it is sometimes referred to as a bell curve.
The mean is in the middle and the larger the SD the flatter and wider the curve will be.
Using a normal distribution mean and SD what can we work out?
We know that 1 SD either side of the mean = 68% of data
1.96 SD either side of the mean = 95% of values
3 SD = values further from the mean than this are potential outliers
To work out the reference range what do we do and what does it show?
The reference range is 95% of the population.
We do mean - 1.96 x SD
And mean + 1.96 x SD
This shows in our sample that 95% of observed values fall between … and …
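The reference range calculation can be sketched as below. The mean of 70 and SD of 10 are assumed values for illustration, and the coverage check relies on the data being normally distributed.

```python
from statistics import NormalDist

def reference_range(mean, sd):
    """95% reference range: mean - 1.96 x SD to mean + 1.96 x SD."""
    return mean - 1.96 * sd, mean + 1.96 * sd

# Hypothetical sample summary (e.g. weight in kg).
lower, upper = reference_range(70, 10)   # roughly 50.4 to 89.6

# For a truly normal distribution, about 95% of values fall inside the range.
dist = NormalDist(mu=70, sigma=10)
coverage = dist.cdf(upper) - dist.cdf(lower)
```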
When not to use the reference range?
If the graph is not normally distributed and there are outliers present, you may get a range that is not possible or is factually incorrect.
There will not be 2.5% of values on either side.
To quantify the difference of numerical data you can?
Look at the differences in means.
If not possible you can use the difference in medians.
What is Pearsons correlation coefficient?
What does 1, 0 and -1 show?
A statistic that measures linear correlation between two variables X and Y. It has a value between +1 and −1.
1 shows a perfect positive linear association
0 shows no correlation
- 1 shows a perfect negative linear association
1 = as x increases y increases
-1 as x increases y decreases
Closer to 1 or -1 will show a stronger correlation
What should a graph have?
6 points for a good graph
- The amount of information should be maximised for the minimum amount of ink
- Figures should have a title explaining what is being displayed
- Axes should be clearly labelled
- Gridlines should be kept to a minimum
- Avoid 3-D charts as these can be difficult to read
- The number of observations should be included
What types of graph should categorical data be displayed as?
Bar chart - frequency vs categories - better
Pie chart - always 2d
What types of graph should numerical data be displayed as?
And give an overview of each type…
Dot plots - dots on a continuous scale
Histograms - frequency density vs continuous non overlapping categories
Can see distribution from this graph.
Box plots - min, LQ, median, UQ, max, plus any outliers (points more than 1.5 box lengths from the box)
Scatter plot - used to see association between 2 variables - correlation
Dependent variable on the y-axis, independent variable on the x-axis
What is the symbol for the sample mean and what is the symbol for the population mean?
…
Is the sample mean the best guess of our target mean?
Yes, whenever we do a study our sample mean is our best guess of the target population mean.
What is standard error?
An estimate of precision of our mean (in this case)
= SD/square root of n for mean
Makes sense: the smaller our SD, the more similar the values in the sample population, and the larger the sample size, the better the representation, hence both lead to a smaller SE
What is the best way to reduce SE?
Increase n
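A short sketch of the SE formula showing the effect of sample size. The SD of 10 and the sample sizes are assumed values.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)

# Hypothetical SD of 10 with increasing sample sizes.
se_25 = standard_error(10, 25)     # 2.0
se_100 = standard_error(10, 100)   # 1.0
se_400 = standard_error(10, 400)   # 0.5
```

Because of the square root, the sample size has to be quadrupled to halve the SE.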
How do you use SE?
2 ways?
Can compare the SE of two groups: the smaller the SE, the more precise the mean of that group.
Also used in confidence intervals!
What is a confidence interval?
How is it calculated?
A 95% confidence interval is the range of values we are 95% confident the true (population) mean lies between.
It is calculated by mean + or - 1.96*SE
Sum it up by saying: my estimate is my best guess of the mean, and I am 95% confident that the true mean is between the two limits.
Would a 99% confidence interval have a narrower or wider CI?
Wider as more confident it contains the true mean
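A minimal sketch of the confidence interval calculation, comparing 95% and 99% intervals. The sample summary (mean 120, SD 15, n = 100) is hypothetical, and the multiplier is taken from the normal distribution, which gives about 1.96 for 95%.

```python
import math
from statistics import NormalDist

def confidence_interval(mean, sd, n, level=0.95):
    """Normal-based CI for a mean: mean +/- z x SE, with SE = SD / sqrt(n)."""
    se = sd / math.sqrt(n)
    z = NormalDist().inv_cdf((1 + level) / 2)  # about 1.96 for a 95% interval
    return mean - z * se, mean + z * se

# Hypothetical sample summary.
ci95 = confidence_interval(120, 15, 100)              # roughly 117.06 to 122.94
ci99 = confidence_interval(120, 15, 100, level=0.99)  # wider than the 95% CI

width95 = ci95[1] - ci95[0]
width99 = ci99[1] - ci99[0]
```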
When can CI not be used?
If the data are not normally distributed we can't use SE and therefore we can't use CI
The sample size also has to be greater than 20
What can CI be used for?
Means
Difference between 2 means (Ho = no difference)
Relative risk
Etc
If 1 doesn’t show statistically significant results but another does then the difference may still be statistically significant
What does a reference range show?
It shows where 95% of the data lies between and is calculated by mean + or - 1.96*SD
What is the point of a significance test?
To see if any observed difference is important/ significant.
What is the null hypothesis?
There is no difference between x and y…
E.g there is no difference in IQ between students at UoS or SHU
Start off by believing this hypothesis.
What can probability range between?
What is the percentage probability of 0.01?
Can range between 0-1
= 1% probability of occurring
On a normal distribution the further away from the mean the smaller the area under the curve and smaller the p value…
…
A value observed near the null hypothesis will have a p value of…
Far away from the null hypothesis will have a p value of…
Near 1 (the data are consistent with the null hypothesis - no evidence to reject it, any deviation could be due to chance)
Close to 0 (there is evidence to reject the null hypothesis - statistically significant difference)
The lower the P value, the more the results are…
Statistically significant
Is SE related to P?
Yes, for the same observed difference, the smaller the SE, the smaller the P
What does a p value of 0.001 mean?
If the null were true, we would see a difference at least this large by chance 1 in 1000 times
If the null hypothesis value (no difference value) is not in the 95% CI then…
If the null value is in the 95% CI then your p value will most likely be
Then your p value will be less than 0.05
Greater than 0.05
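The link between the 95% CI and the 0.05 threshold can be sketched as follows. This uses a normal approximation (a z test) rather than a t test, and the differences and SE are made-up numbers, so treat it as an illustration of the idea rather than the exact test any software would run.

```python
from statistics import NormalDist

def z_test_summary(difference, se, null_value=0.0):
    """Two-sided p value and 95% CI for an estimate, using a normal approximation."""
    z = (difference - null_value) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    lower = difference - 1.96 * se
    upper = difference + 1.96 * se
    null_inside_ci = lower <= null_value <= upper
    return p_value, (lower, upper), null_inside_ci

# Hypothetical differences in means, both with SE = 2.
p_far, ci_far, inside_far = z_test_summary(5, 2)     # CI excludes 0, p below 0.05
p_near, ci_near, inside_near = z_test_summary(3, 2)  # CI includes 0, p above 0.05
```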
A mean against a hypothesised value - this test is called?
One-sample t test
Are all statistical significant findings clinically relevant?
No -
What is a p value?
The probability of seeing the difference you have if the null hypothesis was true
What does a correlation of 1 show?
It shows a perfect positive linear association between two continuous variables (as one increases so does the other)
It is a measure of strength
However, it does not take into account the gradient of the line.
A shallow (nearly flat) line could have the same r value as a steep line.
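To illustrate that r measures strength and not gradient, here is a small sketch with two invented perfectly linear data sets, one shallow and one steep. The `pearson_r` helper is written out by hand for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
shallow = [0.1 * v for v in x]   # gradient 0.1
steep = [10 * v for v in x]      # gradient 10
falling = [-2 * v for v in x]    # gradient -2

r_shallow = pearson_r(x, shallow)  # approximately 1
r_steep = pearson_r(x, steep)      # approximately 1
r_falling = pearson_r(x, falling)  # approximately -1
```

A perfectly horizontal line is a special case: y has no variation, so r is undefined rather than 1.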
What is regression?
What is the meaning of the line?
It is an extension of correlation which can be used to make predictions
It takes the general form y = mx + c
Y = outcome - dependent variable
X = predictor - independent variable
C or a = y intercept, when x = 0
M or b = gradient/ coefficient
How is a regression line calculated?
Use our software to fit a regression line to the data.
What is the most interesting value of the regression line?
What can you do with it?
M also known as b (the gradient)
Every 1 you go across you go up or down by m
You can do inferential statistics
SE, P value, confidence intervals
And hence see if this value is significant and hence if the relationship between the 2 variables is significant.
Ho: m = 0 (no relationship between x and y)
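Statistical software normally fits the line, but a least-squares fit of y = mx + c can be sketched by hand. The data points below are invented for illustration, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + c; returns (m, c)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx            # gradient: change in y per 1 unit of x
    c = mean_y - m * mean_x  # intercept: value of y when x = 0
    return m, c

# Hypothetical data lying close to the line y = 2x + 1.
x = [1, 2, 3, 4, 5]
y = [2.9, 5.2, 7.1, 8.8, 11.0]

m, c = fit_line(x, y)      # m is about 1.98, c is about 1.06
prediction = m * 6 + c     # predicted y at x = 6, about 12.94
```

This sketch only gives the point estimates; the SE, p value and confidence interval for m would come from the software output.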
What is linear regression?
A regression model where the outcome is a single continuous variable.
What is a multiple regression?
Why is it beneficial?
It is a regression model with several predictor variables.
One Y, but lots of Xs (predictors)
As you are using lots of independent variables, it accounts for confounding factors!!! - see clearly the relationship between the main x and y.
The p value and CI for the main x are adjusted for the new x variables; if these have a large effect on y, the p value for the main x will drop, because they explain variation that was previously treated as random noise.
What might an author say when using a regression model with multiple predictors?
After accounting for other variables in the model…
What are categoric variables and are they used in regression?
They are variables that fit into categories, and they can be used in regression models. The important thing to remember is that they always have a reference category.
Each coefficient is in comparison to the reference category: a positive coefficient means a positive association and a negative coefficient a negative association. Again, they can have p values.
As we include more factors that have a relationship with y what happens to our x p value?
It will decrease.
What does prevalence mean?
Proportion of a population with a disease at a point in time
= number of cases at a point of time/ total population
What does incidence mean?
And equation?
Rate at which new cases occur in a population in a certain time period.
= number of new cases/ population at risk
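The two formulas can be sketched as below. The population figures are hypothetical: 1000 people, 50 existing cases at the start of the year, and 19 new cases during the year among the 950 people still at risk.

```python
def prevalence(cases_at_time_point, total_population):
    """Proportion of the population with the disease at a point in time."""
    return cases_at_time_point / total_population

def incidence(new_cases, population_at_risk):
    """New cases during the period divided by the population at risk."""
    return new_cases / population_at_risk

# Hypothetical figures for one year.
total_population = 1000
existing_cases = 50
at_risk = total_population - existing_cases  # existing cases cannot become new cases
new_cases = 19

prev = prevalence(existing_cases, total_population)  # 0.05, i.e. 5%
inc = incidence(new_cases, at_risk)                  # 0.02, i.e. 20 per 1000 at risk per year
```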
What is an ecological study?
An ecological study is an observational study defined by the level at which data are analysed, namely at the population or group level, rather than individual level.
Looking at rates of smoking in a country and then rates of lung cancer
What are the advantages of an ecological study?
• Uses routinely collected data - quick, cheap
• Units of analysis are populations - groups of people
• Can examine patterns of ill-health by age, sex, ethnicity, country and/or by time
• Few ethical issues
• Useful for generating hypotheses
What are the disadvantages of using an ecological study?
• No link between individual exposure and effect
• Bias - variation in diagnostic criteria
• Absence of records of individual attributes
• Unsuitable format of records
• Inconsistency in data presentation
Advantages of using a cross-sectional study?
• Results used to generate hypotheses
• Rapid feedback of current events in the community
• Quick and cheap
• Few ethical problems
Disadvantages of using a cross-sectional study?
• Could just be reporting a medical oddity
• Prone to bias, e.g. sampling, subject and observer variation
• No time reference
Advantages of a case control study?
- By concentrating effort on the identification of affected individuals and recruiting controls from the unaffected population, the number of subjects required to obtain significant results is kept to a minimum (so good for rare diseases)
- Results can be obtained relatively quickly because the investigation does not have to wait for the disease to develop (compare this with Cohort studies – see later) and can look for multiple causes
- It is a relatively inexpensive type of study
Disadvantages of a case control study?
• Generally rely on retrospective data, which has its own dangers. The ability of individuals to recall past events tends to be unreliable due to a tendency for memory to be selective. Records of past events may be incomplete.
• Because data are collected retrospectively, it is difficult to say if an association is causal or not. This is less of a problem when the exposure is highly specific or where the time between exposure and disease is short
• Prone to selection and information biases
• There can be difficulties choosing controls
• The incidence of disease within a population cannot be calculated from this type of study
Advantages of a cohort study?
- The main advantage is that it is possible to distinguish antecedent causes from concurrent associated factors (cause comes before effect)
- Since incidence can be determined for both exposed and non-exposed groups, we can determine absolute, relative and attributable risks
- We can study more than one outcome to the same exposure
- There is less chance of bias since exposure is measured before development of disease
What are the disadvantages of a cohort study?
• Cannot be certain that exposures are causal - this requires controlled studies
• Long periods of study, and large populations mean that cohort studies are expensive
• Follow-up can be a problem, especially if the period of study is long - this needs to be considered in the design of the study
• Diagnosis of cases may change over the years as medical science becomes more advanced - better at detecting the disease or with different criteria for a diagnosis
What are the advantages of a RCT?
• Randomisation should mean that confounding factors (age, sex etc.) are equally distributed. This helps to concentrate the study on the effect of the intervention
• By randomly allocating patients to interventions, it is likely that staff and patients will not break the blinding
• Statistical tests for significance are easier to interpret when the study design removes confounders
• Confounders and many biases minimised
What are the disadvantages of a RCT?
- To allow sufficient numbers to balance confounders these tend to be large and expensive trials. They are often multicentre and may even be multinational
- There is always a chance that volunteer bias will be a problem: what about people that refuse to be included in the trial or those that are never asked.
- There may be ethical difficulties in withholding treatment from the control group or offering what is believed to be an inferior treatment to one group
- May lose statistical power if poor compliance
There are specific questions you can ask to help critically appraise an article - see lecture for more details.
Also Axis and CASP - will be focussed upon
..
What is AXIS?
20 questions that assess different key aspects of an article.
How should you display the data collection methods?
Via a flow diagram:
Number approached
Who left and why
Who was analysed
What is a parametric test, what is a non-parametric test and when are they used?
Parametric test = a test that relies on particular assumptions; if these are not met then a non-parametric alternative should be used.
However, parametric tests use all the data, whereas non-parametric tests only use the ranks and are therefore less powerful.
Why is it important to have critical appraisal skills as a doctor?
Patients may read an article - Worry and ask questions
You may need to read the article, appraise it and see if it is relevant to their concern