Statistics Flashcards
What is descriptive statistics?
- Describes the range of values
- Identify central tendency e.g. average, median
- Describe the distribution of the whole set e.g. varied, similar
- Identify outliers
- Describe percentages
Must know type of data to do this
What is categorical data?
Nominal
- Discrete categories that are mutually exclusive and unordered e.g. sex, blood group
Ordinal
- Discrete categories that are mutually exclusive and ordered (ranked) e.g. disease stage –> cannot be in more than one category
Used in quantitative research
What is continuous data?
‘Scale’ variables e.g. counts and measures
Numerical and discrete
- e.g. counts of days
Numerical and continuous
- e.g. age, height
Used in quantitative research
How can data be summarised?
Bar charts
Box + whisker plot –> e.g. when presenting median values
Line graph –> continuous data changing over time
Scatter plot –> 2 sets of continuous data e.g. grip strength vs arm strength
Pie chart
Histogram –> similar to a bar chart, but shows the distribution of continuous data grouped into intervals (bins)
How do you describe central tendency?
Mean
- Sum of all values divided by the sample size
- Can be misleading as a measure of central tendency when the data are not normally distributed (e.g. skewed or with outliers)
Median
- The middle value of the ordered data, i.e. the 1/2(n+1)th value
- Can be used where there isn’t normal distribution
Mode
- The most frequently occurring value
- Can be used in ordinal data
If data is normally distributed use mean and SD
If data is skewed use median and interquartile range or mode and ranges
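A minimal sketch of these measures using Python's standard statistics module. The data are hypothetical (a made-up, right-skewed set of hospital stays) and are only there to show how one large value pulls the mean away from the median.

```python
import statistics

# Hypothetical, right-skewed sample: length of hospital stay in days
stay_days = [2, 3, 3, 4, 5, 6, 30]

mean_stay = statistics.mean(stay_days)      # sum of all values / sample size
median_stay = statistics.median(stay_days)  # the middle, 1/2(n+1) = 4th ordered value
mode_stay = statistics.mode(stay_days)      # most frequently occurring value

# Interquartile range: distance between the 1st and 3rd quartiles
q1, q2, q3 = statistics.quantiles(stay_days, n=4)
iqr_stay = q3 - q1

print(f"mean={mean_stay:.2f} median={median_stay} mode={mode_stay} IQR={iqr_stay}")
```

The single 30-day stay drags the mean above most of the values, which is why the median and interquartile range are the safer summary for skewed data.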
Describe standard deviation
Used for normally distributed data to describe the distribution of the values
Describes the spread of the values of the whole group around the mean
A small SD indicates that most values are close to the mean
A z-score is the number of SDs a value lies away from the overall mean
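As a rough illustration, the sample SD and z-scores can be worked out as below. The heights are hypothetical, and the sketch assumes the sample SD (dividing by n-1), which is what statistics.stdev computes.

```python
import statistics

# Hypothetical sample of heights in cm
heights_cm = [160, 165, 170, 175, 180]

mean_height = statistics.mean(heights_cm)
sd_height = statistics.stdev(heights_cm)  # sample SD, divides by n - 1

# z-score: how many SDs each value lies from the mean
z_scores = [(h - mean_height) / sd_height for h in heights_cm]

for h, z in zip(heights_cm, z_scores):
    print(f"{h} cm -> z = {z:+.2f}")
```

A value equal to the mean has a z-score of 0; values above the mean have positive z-scores and values below it have negative ones.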
What central tendency is used for different distributions of data?
If data is normally distributed use mean and SD
If data is skewed use median and interquartile range or mode and ranges
What are confidence intervals?
Identify a range in which we can be confident that the ‘true’ population value will lie
A 95% CI is the range within which we can be 95% confident the ‘true’ population value (e.g. the population mean) lies; it is not the range containing 95% of individuals
95% CI = mean ± 1.96 × standard error
A wide 95% CI indicates a high degree of uncertainty in the results
Confidence limits define the lower and upper values of a confidence interval
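The formula above can be sketched as follows. The blood pressure readings are hypothetical, and the standard error is taken to be the sample SD divided by the square root of the sample size. The 1.96 multiplier is the normal approximation from the card; for a sample this small a slightly larger t-based multiplier would usually be preferred.

```python
import math
import statistics

# Hypothetical sample: systolic blood pressure (mmHg) for 10 people
systolic_bp = [118, 122, 125, 127, 130, 131, 134, 136, 140, 147]

n = len(systolic_bp)
mean_bp = statistics.mean(systolic_bp)
standard_error = statistics.stdev(systolic_bp) / math.sqrt(n)  # SD / sqrt(n)

# Confidence limits: lower and upper values of the 95% CI
lower_limit = mean_bp - 1.96 * standard_error
upper_limit = mean_bp + 1.96 * standard_error

print(f"mean = {mean_bp:.1f}, 95% CI = {lower_limit:.1f} to {upper_limit:.1f}")
```

Because the standard error shrinks as n grows, a larger sample gives a narrower interval for the same SD.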
What is inferential statistics?
The process of using data obtained from a small group of elements (sample) to make estimates and test hypotheses about the characteristics of a larger group of elements (population)
Sample must accurately represent the population
Used in quantitative data with an appropriate research question in an appropriate research design with an adequate sample size
How can a study be underpowered?
If the sample size is too small, or there are confounding data, so that you cannot confidently support or refute the null hypothesis –> the stats will be underpowered
Statistical tests can still be run, but the results must be reported as trends only
What can you interpret from inferential statistics?
- The relationship /association between variables e.g. correlation coefficients
- The difference between two or more groups
- The likelihood that the result has occurred by chance (p-values)
What are P-values?
What do they show?
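The probability of getting a result at least as extreme as the one observed if the null hypothesis were true, i.e. how likely it is that the result occurred by chance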
p=0.5 (a 50% chance)
p=0.05 (a 5% chance)
The lower the p-value, the less likely that any observed effect is due to chance
The cut-off for significance is known as the alpha value, usually 0.05. A p-value larger than this is not significant
p<0.05 ‘significant’
p<0.01 ‘highly significant’
p<0.001 ‘very highly significant’
The p-value indicates how compatible the results are with the null hypothesis: the smaller the p-value, the stronger the evidence against it
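One way to see what a p-value measures is a coin-flip example, sketched below. The scenario is hypothetical: the null hypothesis is that the coin is fair, and the function name two_sided_binomial_p is made up for this illustration. It counts how often a fair coin would give a result at least as far from 50:50 as the one observed.

```python
from math import comb

def two_sided_binomial_p(heads, flips):
    """Two-sided p-value for `heads` out of `flips` under a fair-coin null hypothesis."""
    observed_distance = abs(heads - flips / 2)
    # Count every outcome at least as far from the expected 50:50 split
    extreme_outcomes = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= observed_distance
    )
    return extreme_outcomes / 2 ** flips

# Hypothetical result: 9 heads in 10 flips
p_value = two_sided_binomial_p(9, 10)
print(f"p = {p_value:.4f}")  # 22/1024, about 0.0215
```

Here p is below 0.05, so 9 heads in 10 flips would be called ‘significant’ evidence against a fair coin, although it would still happen by chance about 2% of the time.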
What is the difference between parametric and non-parametric data?
Parametric tests have more power than non-parametric i.e. you are less likely to make a type II error
Parametric data:
- Assumed normal distribution
- Assumed homogeneity of variance across groups (the spread of scores around the mean is equal)
- Data sets are independent
- Data are numerical and scale
- Data sets are… interval, continuous with an equal distance between values OR ratio, continuous with an equal distance between values and a true zero
Parametric tests are for those with a normal distribution
Non parametric tests are for non-normal distributions
Non-parametric data:
- Skewed
- Bimodal
- Small sample size
- Flat or very pointed distribution
How do you find out if you have normal data?
Use the Shapiro-Wilk test for fewer than 2000 cases
For more than 2000 cases use the Kolmogorov-Smirnov test
Levene’s test compares the spread (variance) of two or more groups to check homogeneity of variance; it does not test the shape of the distribution
What is a T-test?
Used with parametric data
Compares two sample groups by comparing their two means relative to their distributions
Tests the probability that the samples come from the same population
Can be
Independent - two groups made up of different people
Paired - same people measured twice
Two tailed testing means the differences between the groups are tested for in either direction
Takes the two means and SDs and compares how much the distributions of the two groups overlap
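A minimal sketch of the independent t-test statistic, assuming the pooled-variance (Student's) form, which matches the homogeneity of variance assumption for parametric data. The grip strength values and the function name independent_t are hypothetical. The standard library has no t distribution, so the sketch stops at the t value and degrees of freedom rather than a p-value.

```python
import math
import statistics

def independent_t(group_a, group_b):
    """Return (t, degrees of freedom) for two independent groups, pooled variance."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)

    # Pool the two sample variances, weighted by their degrees of freedom
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df

    # Difference in means relative to the standard error of that difference
    se_diff = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / se_diff, df

# Hypothetical grip strength (kg) in two groups made up of different people
trained = [30, 32, 34, 36, 38]
untrained = [26, 28, 30, 32, 34]

t_value, dof = independent_t(trained, untrained)
print(f"t = {t_value:.2f}, df = {dof}")  # t = 2.00, df = 8
```

The t value is then compared with a critical value from a t table (about 2.306 for 8 degrees of freedom, two-tailed, at the 0.05 level), so this hypothetical difference would not reach significance.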
What is an ANOVA test?
Analysis of variance
Used for more than two groups
Made up of factors, different categories being compared
Outputs are effects
Interactions between factors can be measured as well as main effects
A significant p-value doesn’t tell you which group is different from another, only that at least one of the groups is different
A post hoc test is needed to identify which groups differ; it gives a p-value for each pairwise comparison
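For a single factor, the F statistic behind a one-way ANOVA can be sketched as the ratio of between-group to within-group variance. The scores and the function name one_way_f are hypothetical, and the sketch covers only the simplest case (one factor, no interactions, no post hoc test).

```python
import statistics

def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [value for group in groups for value in group]
    grand_mean = statistics.mean(all_values)

    # Variation of the group means around the grand mean
    ss_between = sum(
        len(group) * (statistics.mean(group) - grand_mean) ** 2 for group in groups
    )
    # Variation of individual values around their own group mean
    ss_within = sum(
        (value - statistics.mean(group)) ** 2 for group in groups for value in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical pain scores under three treatments (different people in each group)
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value, df_between, df_within = one_way_f(groups)
print(f"F({df_between}, {df_within}) = {f_value:.2f}")  # F(2, 6) = 13.00
```

A large F says the group means differ more than the spread within groups would explain, but not which group is responsible.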
What are some nonparametric tests for inferences between groups?
Mann-Whitney U test
- For comparison of two groups with different subjects in each group
Wilcoxon signed-rank test
- Comparison of data where the same subject has been tested twice - giving two groups
Kruskal Wallis test
- Comparison of more than two groups with different subjects in each group
Friedman ANOVA
- Comparison of data where the subject has been tested more than twice
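As one example from the list above, the Mann-Whitney U statistic can be sketched by comparing every value in one group with every value in the other. The data and the function name mann_whitney_u are hypothetical, and the sketch returns only U (the smaller of the two counts), not a p-value.

```python
def mann_whitney_u(group_a, group_b):
    """Return the Mann-Whitney U statistic (smaller of the two U values)."""
    # Count the pairs where the value from group_a beats the value from group_b;
    # a tie counts as half a win
    u_a = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in group_a
        for b in group_b
    )
    u_b = len(group_a) * len(group_b) - u_a
    return min(u_a, u_b)

# Hypothetical ordinal scores from two groups with different subjects
group_a = [3, 5, 8]
group_b = [1, 2, 4, 6]

u_value = mann_whitney_u(group_a, group_b)
print(f"U = {u_value}")  # U = 3.0
```

Only the ordering of the values matters here, not their size, which is why the test suits skewed or ordinal data. A U of 0 would mean the two groups do not overlap at all.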
What is Chi-Squared test?
For non parametric categorical data
A measure of the difference between observed (actual) and expected frequencies
Tests the association or difference between two categorical variables
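The expected frequencies assume there is no difference between the sets of results; if observed and expected match exactly, the chi-squared value = 0
The larger the difference, the greater the chi-squared value
Calculated from counts (frequencies) in each category, not percentages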
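The observed versus expected comparison can be sketched as below. The 2 x 2 table of counts and the function name chi_squared are hypothetical, and no continuity correction is applied.

```python
def chi_squared(observed):
    """Chi-squared statistic for a table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the two variables were not associated
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (obs - expected) ** 2 / expected
    return chi2

# Hypothetical counts: rows = treatment / control, columns = improved / not improved
table = [[30, 20],
         [20, 30]]

chi2_value = chi_squared(table)
print(f"chi-squared = {chi2_value:.2f}")  # chi-squared = 4.00
```

For a 2 x 2 table (1 degree of freedom) the critical value at the 0.05 level is about 3.84, so this hypothetical table would just reach significance.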
How do you test for association/ correlation?
Test relationship between two variables
Usually done with scale data
Where there is a linear relationship there is said to be a correlation
Non Parametric - Spearman’s correlation
Parametric - Pearson’s correlation
r = 1: there is a perfect linear relationship (-1 if it is a negative relationship)
r = 0.6-0.8: a high correlation
r = 0.2-0.4: a low correlation
Significance depends on the sample size: for the same correlation, the more people, the smaller the p-value
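A rough sketch of both coefficients, written out by hand. The grip and arm strength values are hypothetical, the function names are made up for this illustration, and the simple ranking helper assumes there are no tied values.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / math.sqrt(ss_x * ss_y)

def ranks(values):
    """Rank values from 1 (smallest). Assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(x, y):
    """Spearman's correlation: Pearson's r calculated on the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical grip strength vs arm strength (kg) for five people
grip = [20, 25, 30, 35, 40]
arm = [22, 24, 33, 34, 50]

r_value = pearson_r(grip, arm)
rho_value = spearman_rho(grip, arm)
print(f"Pearson r = {r_value:.2f}, Spearman rho = {rho_value:.2f}")
```

Spearman's rho comes out as exactly 1 here because arm strength always rises with grip strength, while Pearson's r is slightly lower because the rise is not perfectly linear.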
What are some errors in inferential statistics?
Type 1 error (α)
- rejecting a true null hypothesis, a false positive
- the probability that you will accept something as statistically significant when it’s actually not
- the significance level (α) sets the probability of making this error, e.g. 0.05 = a 5% chance
Type 2 error
- failing to reject a false null hypothesis, a false negative
- the alternative hypothesis is true but the results have failed to show it
- the probability of retaining the null hypothesis when it is in fact false
1-β is the power
Power of 0.8 = 80% chance of detecting an effect if one does in fact exist
β = 0.2
The probability of making a type 2 error is 20%
There is a 20% chance of not identifying an effect when there is one
What are power calculations?
1-β is the power of a test to correctly reject the null hypothesis, often set at 0.8
Power can be determined by
Significance level (the p-value threshold, α)
Effect size - a clinically meaningful difference in means of the outcome measure
Sample size
Standard deviation
The equation can be rearranged to find sample size for a study if you already have other values
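The card does not give the equation itself, so the sketch below assumes one common textbook form for comparing two means with equal group sizes, based on the normal approximation: n per group = 2 × ((z for α/2 + z for power) × SD / effect size)². The function name and the example numbers are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size, sd, alpha=0.05, power=0.8):
    """Approximate n per group to detect a difference in means (two-tailed)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for power = 0.8
    n = 2 * ((z_alpha + z_power) * sd / effect_size) ** 2
    return math.ceil(n)  # round up to whole participants

# Hypothetical: detect a 5-point difference when previous studies report SD = 10
n_needed = sample_size_per_group(effect_size=5, sd=10)
print(f"{n_needed} participants per group")  # 63 participants per group
```

Doubling the effect size to 10 points cuts the requirement to 16 per group under the same assumptions, which shows why small effects need large studies.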
When is power calculated?
Ideally before setting up the study as it informs sample size
You need data from previous similar studies as the basis of the calculation
Can do a post hoc calculation
How do you increase power?
Increase sample size of the study
Use a less stringent significance level e.g. p<0.05 rather than p<0.01 - this could increase the chance of a type 1 error
Replication of the study and its findings by independent researchers
Reduces the chance of type 2 errors
What are the different types of clinical importance?
Statistically significant but clinically unimportant
- if a difference is found to be statistically significant then it may well be real but is not necessarily clinically important
Not statistically significant but clinically important
- if a difference is found not to be statistically significant then it may still be real (due to a type 2 error) and clinically important