Flashcards in Lectures Notes Deck (124):

1

## what are the two types of categorical data?

### nominal and ordinal

2

## what is nominal data? give examples

###
categorical data with no natural order

e.g. blood group, sex

3

## what is ordinal data? give examples

###
ordered categorical data.

e.g. pain severity, social class, grade of breast cancer

4

## what are the two types of numerical data?

### discrete and continuous

5

## what is binary data?

### a form of nominal categorical data, where there are only two categories

6

## how might you display categorical data?

### bar chart, pie chart

7

## how might you display numerical data?

### dot plot, stem and leaf, histogram, box and whisker

8

## how do you calculate the mean of a data set?

### total all the values, and divide by the number of values

9

## how do you calculate the median of a data set?

###
order the values, median is the middle value.

if there is an even number of values, take the mean of the middle two values.

10

## how do you calculate the mode of a data set?

### the most common value observed
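
A quick sketch of the three averages using Python's standard `statistics` module. The data set is made up for illustration and includes one large value, which pulls the mean up but leaves the median alone.

```python
from statistics import mean, median, mode

# Hypothetical data set (made up for illustration), with one large value.
values = [2, 3, 3, 5, 7, 10, 40]

print(mean(values))    # total of all the values divided by the number of values
print(median(values))  # middle value once the data are ordered
print(mode(values))    # most common value observed

# With an even number of values, the median is the mean of the middle two.
print(median([2, 3, 5, 7]))
```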

11

## what's the main advantage of using a median of a data set?

### robust to outliers.

12

## what's the main advantage of using the mean of a data set?

###
uses all the data.

13

## when would you use median vs mean?

###
symmetrical data = mean

skewed data = median

14

## list the three main approaches to quantifying variability

###
range

interquartile range

standard deviation

15

## what is the interquartile range of a data set, and how could you display it graphically?

###
the middle 50% of your data.

upper quartile - lower quartile.

box and whisker plot.
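
A minimal sketch of the interquartile range using `statistics.quantiles`. The data are hypothetical, and textbooks use slightly different quartile conventions, so hand calculations on small samples may differ a little from this.

```python
from statistics import quantiles

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  # hypothetical data

# quantiles(..., n=4) returns the three cut points that split the data into quarters.
lower_quartile, middle, upper_quartile = quantiles(values, n=4)

# IQR = upper quartile - lower quartile (the spread of the middle 50% of the data).
iqr = upper_quartile - lower_quartile

print(lower_quartile, upper_quartile, iqr)
```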

16

## how do you calculate variance?

###
1. draw a table

2. calculate difference between observed value and mean for each value

3. square each of these values

4. calculate the total of the squared differences from mean

5. divide this by n-1

(n= number of values)

17

## how do you calculate standard deviation (SD)?

### square root of variance
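
The table method for variance and SD can be sketched step by step as below. The values are hypothetical; the result is checked against the `statistics` module, which also divides by n-1.

```python
from math import sqrt
from statistics import mean, stdev, variance

values = [4, 8, 6, 5, 7]  # hypothetical observations

m = mean(values)
differences = [x - m for x in values]    # observed value minus the mean
squared = [d ** 2 for d in differences]  # square each difference
total = sum(squared)                     # total of the squared differences
n = len(values)
var = total / (n - 1)                    # divide by n-1
sd = sqrt(var)                           # SD is the square root of the variance

print(var, sd)
print(variance(values), stdev(values))   # library versions, for comparison
```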

18

## how many decimal places should you use when calculating SD?

### usually 2 or 3 more decimal places than the original data

19

## what is the relationship between mean and SD in Normally distributed data?

###
mean ± 1 SD covers approximately 68% of data

mean ± 2 SD covers approximately 95% of data

20

## how do you calculate the 'normal reference range' of an investigation?

### mean ± 2 SDs
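
A small sketch of the reference range calculation, plus a check of how much of a Normal distribution sits within k SDs of the mean. The mean of 140 and SD of 10 are hypothetical, and the helper names are made up for illustration.

```python
from statistics import NormalDist

def reference_range(mean, sd, k=2):
    """Return (lower, upper) limits as mean - k*SD and mean + k*SD."""
    return (mean - k * sd, mean + k * sd)

def normal_coverage(k):
    """Proportion of a Normal distribution lying within k SDs of its mean."""
    z = NormalDist()  # standard Normal: mean 0, SD 1
    return z.cdf(k) - z.cdf(-k)

low, high = reference_range(140.0, 10.0)  # hypothetical mean and SD
print(low, high)

for k in (1, 1.96, 2):
    print(k, round(normal_coverage(k), 3))
```

The "2 SDs = 95%" rule is a rounding of 1.96 SDs; exactly 2 SDs covers slightly more than 95%.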

21

## what is the relationship between the mean and the median in Normally distributed data?

### will be the same!

22

## formula for risk

### risk = no. events observed / number in the group

23

## formula for risk difference

### RD = risk (exposed group) - risk (unexposed group)

24

## what's the difference between risk difference and ABSOLUTE risk difference?

### in ARD you ignore the sign - so it's always expressed as a positive number, but it might represent an increase or decrease in risk

25

## your 95% CI for a relative risk includes 1.00 - what does this mean?

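### the data are consistent with no difference between groups (RR = 1), so the result is NOT statistically significant at the 5% level. it does not prove that there is no difference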

26

## formula for number needed to treat (or harm!)

### 1/ARD
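
The risk formulas chain together as in this sketch. The event counts are invented for illustration.

```python
def risk(events, group_size):
    """Risk = number of events observed / number in the group."""
    return events / group_size

# Hypothetical trial: 15/100 events in the exposed group, 25/100 in the unexposed group.
risk_exposed = risk(15, 100)
risk_unexposed = risk(25, 100)

rd = risk_exposed - risk_unexposed  # risk difference (negative here: risk went down)
ard = abs(rd)                       # absolute risk difference ignores the sign
nnt = 1 / ard                       # number needed to treat (or harm)

print(rd, ard, nnt)
```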

27

## formula for odds

### no. people with disease / no. people without

28

## formula for odds ratio

### odds (exposed) / odds (unexposed)

29

## if an event is rare, what is the relationship between odds and risk ratios?

### they'll basically be the same - but for a common event, they can be really different
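
A sketch comparing the odds ratio with the risk ratio for a rare and a common event. All counts are hypothetical; the risk ratio is taken as risk (exposed) / risk (unexposed).

```python
def odds(with_disease, without_disease):
    return with_disease / without_disease

def risk(with_disease, without_disease):
    return with_disease / (with_disease + without_disease)

def odds_ratio(exposed, unexposed):
    """Each argument is a (with disease, without disease) pair of counts."""
    return odds(*exposed) / odds(*unexposed)

def risk_ratio(exposed, unexposed):
    return risk(*exposed) / risk(*unexposed)

rare = ((2, 998), (1, 999))    # hypothetical rare event
common = ((60, 40), (30, 70))  # hypothetical common event

print(odds_ratio(*rare), risk_ratio(*rare))      # nearly identical
print(odds_ratio(*common), risk_ratio(*common))  # clearly different
```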

30

## ____ is a useful measure of spread when data is distributed symmetrically

### standard deviation

31

## if data is symmetrically distributed, what percentage of data lies within 2 SD of the mean?

### approximately 95% (this holds when the data are Normally distributed; symmetry alone is not enough)

32

## what would you see on a histogram of positively skewed data?

###
peak of data is at the left, tail extends to the right.

the mean will be greater than the median.

33

## how can you tell which direction data is skewed in from the mean and median?

###
mean = median : symmetrical

mean > median : positive skew

mean < median : negative skew
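
The mean-versus-median rule of thumb can be written as a tiny helper. The function name and the data sets are made up for illustration, and real data will rarely have a mean exactly equal to the median.

```python
from statistics import mean, median

def skew_direction(values):
    """Rule of thumb only: compare the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "positive skew"
    if m < med:
        return "negative skew"
    return "symmetrical"

right_tail = [1, 2, 2, 3, 3, 3, 4, 10, 25]        # hypothetical: long tail to the right
left_tail = [1, 16, 22, 23, 23, 24, 24, 25, 25]   # hypothetical: long tail to the left

print(skew_direction(right_tail))
print(skew_direction(left_tail))
print(skew_direction([1, 2, 3, 4, 5]))
```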

34

## what would you see on a histogram of negatively skewed data?

###
peak of data is at the right, tail extends to the left.

mean will be less than the median.

35

## what summary measures would you use for positive/negative skewed data?

### median and interquartile range

36

## formula for the addition rule of probability

### P(A or B) = P(A) + P(B), if A and B are mutually exclusive (otherwise subtract P(A and B))

37

## formulae for the multiplication rule of probability

### P(A and B) = P(A) x P(B), if A and B are independent
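
A sketch checking both rules by counting outcomes for a fair die (a made-up example). The addition rule in this simple form needs mutually exclusive events, and the multiplication rule in this form needs independent events.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)  # a fair six-sided die

# Addition rule: rolling a 1 and rolling a 2 cannot both happen on one roll.
p_one_or_two = Fraction(1, 6) + Fraction(1, 6)
counted_one_or_two = Fraction(sum(1 for f in faces if f in (1, 2)), 6)

# Multiplication rule: two separate rolls are independent of each other.
p_both_six = Fraction(1, 6) * Fraction(1, 6)
rolls = list(product(faces, repeat=2))  # all 36 equally likely pairs
counted_both_six = Fraction(sum(1 for a, b in rolls if a == 6 and b == 6), len(rolls))

print(p_one_or_two, counted_one_or_two)
print(p_both_six, counted_both_six)
```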

38

## if an event has a probability of 0, what does this mean? what about 1?

###
0 = it can never happen

1 = it definitely happens

39

## define standard error

### estimate of the precision of a sample estimate - a measure of how far from the true population value a sample estimate is likely to be.

40

## what type of distribution will a set of sample means take, given a large enough sample size?

### Normal

41

## formula for standard error of a mean

### SD / sq root of n

42

## formula for standard error of a proportion

###
square root of:

p(1-p) / n

(p = sample proportion)
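
Both standard error formulas as short functions, with hypothetical inputs. Quadrupling the sample size halves the standard error of the mean.

```python
from math import sqrt

def se_mean(sd, n):
    """Standard error of a mean: SD / square root of n."""
    return sd / sqrt(n)

def se_proportion(p, n):
    """Standard error of a proportion: square root of p(1-p)/n."""
    return sqrt(p * (1 - p) / n)

print(se_mean(10, 25))          # hypothetical SD of 10 with n = 25
print(se_mean(10, 100))         # same SD, four times the sample size
print(se_proportion(0.2, 100))  # hypothetical sample proportion of 0.2
```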

43

## how do you calculate the standard error of the difference between two sample means

###
(SD of first sample)² / n for that sample

+

(SD of second sample)² / n for that sample

square root the answer
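
A sketch of the standard error of a difference between two means in its usual form, sqrt(SD1²/n1 + SD2²/n2): each SD is squared (giving the variance) before dividing by n. The inputs are hypothetical.

```python
from math import sqrt

def se_difference(sd1, n1, sd2, n2):
    """SE of the difference between two independent sample means."""
    return sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

# Hypothetical samples: SD 3 with n = 9, and SD 4 with n = 16.
print(se_difference(3, 9, 4, 16))
```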

44

## what does a large standard error mean?

### that your estimate of a population mean is imprecise

45

## what does a small standard error mean?

### that your estimate of a population mean is precise

46

## if sample size increases, does standard error go up or down?

### down - we get a more precise estimate

47

## what is the general formula for calculating a 95% CI?

### mean ± (1.96xSE)
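
The general formula as a short function. The mean of 125 and standard error of 2.5 are hypothetical.

```python
def ci95(estimate, se):
    """95% confidence interval: estimate ± 1.96 x SE."""
    margin = 1.96 * se
    return (estimate - margin, estimate + margin)

lower, upper = ci95(125, 2.5)  # hypothetical mean and standard error
print(lower, upper)
```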

48

## what is the technical definition for a 95% confidence interval?

### if the study were to be repeated 100 times, of the 100 resulting 95% CIs, we would expect 95 of these to include the population mean

49

## what is the correct way to interpret a 95% CI of 120-130mmHg, mean 125

### We are 95% confident that the true population mean sys. BP lies between 120 and 130, but the best estimate we have is 125.

50

## explain the difference between standard deviation and standard error

### SD describes the variability of the observations in the sample, whereas SE is a measure of the precision of an estimate of the population mean.

51

## what are the four main steps in hypothesis testing?

###
1. state null hypothesis

2. choose a significance level

3. obtain P-value

4. use P-value to decide whether to reject your null hypothesis

52

## how should you interpret a P-value?

### the probability of observing your results, or more extreme, if the null hypothesis is true

53

## how do we obtain P-values?

###
carry out a statistical significance test, and that generates a test statistic.

we then use the test statistic and distribution tables to find the P-value.

54

## what is the general formula for a test statistic?

###
observed value - hypothesis value

all divided by standard error

55

## define the "power" of a study

###
the probability of rejecting the null hypothesis when it is actually false.

i.e. probability of concluding that there is a difference, when a difference truly does exist.

= 1 - beta

(beta = type II error)

56

## what is type II error?

### same as a false negative - probability of not rejecting the null hypothesis, when it is in fact false.

57

## what is type I error?

###
this is set by the significance level (alpha), NOT the P-value!

same as false positive - probability of rejecting null hypothesis when it is in fact true

58

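## If a 95% confidence interval for a difference includes 0, will this be a statistically significant result?

###
NO.

but if it doesn't include 0, you don't need the P-value, as the CI shows the result is statistically significant at the 5% level. (for ratios, e.g. relative risk, the null value is 1 rather than 0)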

59

## what are parametric tests?

### a type of statistical test that assumes the data follow a specific distribution (e.g. Normal distribution)

60

## give some examples of parametric tests

###
t-test

analysis of variance (ANOVA)

linear regression techniques

61

## what are non-parametric tests?

###
type of statistical test that does not assume the data follow a specific distribution. used when you can't meet the assumptions for a parametric test.

useful for data that is skewed, ranked or ordinal.

robust to outliers.

based on ranks of the data, not the actual data.

62

## explain the difference between paired and unpaired data

###
paired data = same individuals studied at two different times (or individually matched pairs).

unpaired (independent) data = data collected from two separate groups.

63

## what are the two assumptions underlying a paired t-test?

###
- that the differences between values are Normally distributed (e.g. difference in PHQ9 score at 0 and 4 months)

- that the differences are independent of each other

64

## what is the Wilcoxon (matched pairs) signed rank test, and when would you use it?

###
non-parametric equivalent of the paired t-test!

used when you can't meet the assumptions of the paired t-test.

65

## name two tests you can use when your data consists of more than 2 groups?

###
- analysis of variance (ANOVA) - parametric.

- Kruskal-Wallis test (non-parametric version)

66

## what tests might you use to compare categorical data from two independent groups?

### Chi-squared, difference in proportions, or Fisher's exact test

67

## when can you compare independent groups using the difference in proportions?

###
when the sample is large enough.

np and n(1-p) should both be greater than 5.

n = total no. individuals in both samples.

p = proportion of individuals with the condition (regardless of group)

68

## when would you use Chi-Squared test?

###
- two nominal categorical variables that can form a r x c contingency table

- at least 80% of expected cell counts >5

- all expected cell counts >1
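
A sketch that checks the two expected-count conditions for a contingency table. It assumes the usual formula for an expected count, row total x column total / grand total, which is not stated on the card. The tables and function names are hypothetical.

```python
def expected_counts(table):
    """Expected count per cell = row total x column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_squared_ok(table):
    """True if at least 80% of expected counts are >5 and all are >1."""
    cells = [e for row in expected_counts(table) for e in row]
    share_above_5 = sum(e > 5 for e in cells) / len(cells)
    return share_above_5 >= 0.8 and all(e > 1 for e in cells)

big = [[20, 30], [25, 25]]  # hypothetical 2x2 table of observed counts
small = [[1, 2], [3, 4]]    # hypothetical table with very small counts

print(expected_counts(big), chi_squared_ok(big))
print(expected_counts(small), chi_squared_ok(small))
```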

69

## when is Yates' correction used for Chi-squared tests?

### should be used for all chi-squared tests on 2x2 tables

70

## when would you use Fisher's exact test?

### when values are too small to do chi-squared!

71

## what test is used to compare paired proportions?

### McNemar's test

72

## what is bivariate data?

### data where there are two variables, either categorical or numerical

73

## when would you use correlation over regression?

### when you aren't implying an order or causation, just an association

74

## when would you use regression over correlation?

### when one variable (Y) is a response to another variable (X) - you could use value of X to predict Y

75

## what is meant by "correlation coefficient"?

###
it's a measure of the linear association between two variables

(cannot use to predict one variable from another)

76

## what are the properties of Pearson's correlation coefficient (r)

###
r must be between -1 and +1

+1 = perfect positive linear association.

-1 = perfect negative linear association.

0 = no linear relation at all.
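
A hand-rolled sketch of Pearson's r, to show the three landmark values. The data are made up; the last example has a clear curved relationship but r = 0, because r only measures linear association.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired observations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2, 4, 6, 8, 10]))  # perfect positive linear association
print(pearson_r(xs, [10, 8, 6, 4, 2]))  # perfect negative linear association
print(pearson_r(xs, [4, 1, 0, 1, 4]))   # U-shaped: no LINEAR relation
```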

77

## how does regression work?

###
draw a scatter plot with the X (predictor/explanatory) variable on the X axis, and the Y (response) variable on the Y axis.

then find the line of best fit using the "least squares" method.

the equation of that line can be used to predict Y from X.

78

## what is the generic regression equation?

###
Y = a + bX

a = intercept

b = slope
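
A minimal least squares sketch for Y = a + bX, using the standard formulas b = Sxy / Sxx and a = mean(Y) - b x mean(X). The data points are invented for illustration.

```python
def least_squares(xs, ys):
    """Return (a, b): intercept and slope of the least squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b

xs = [1, 2, 3, 4, 5]            # hypothetical predictor values
ys = [2.1, 3.9, 6.2, 7.8, 10.0]  # hypothetical responses

a, b = least_squares(xs, ys)
predicted = a + b * 6  # predict Y for a new X of 6
print(a, b, predicted)
```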

79

## what is multiple regression?

### form of regression used when there are multiple variables influencing the outcome variable.

80

## give 3 reasons for carrying out a multiple regression analysis?

###
1. to identify any explanatory variables that may be associated with the Y variable

2. to investigate extent to which 1+ variables are linearly related to Y, after adjusting for other variables

3. to predict value of Y from X variables

81

## how does a multiple regression equation work?

### Y = a + then multiple 'b's (coefficients), which you multiply by each corresponding variable (X), i.e. Y = a + b1X1 + b2X2 + ...

82

## what is the conventional minimum power for a study?

### 0.80

83

## how do you calculate the standardised effect size?

### difference in means between intervention and control group, divided by the standard deviation of the outcomes

84

## what are the 4 ingredients needed for a sample size calculation?

###
1. target/anticipated effect size (δ)

2. standard deviation of the outcome data (σ)

3. power (typically 80-90%)

4. significance level (0.05)
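
A sketch of how the four ingredients combine. It assumes a common Normal-approximation formula for comparing two means, n per group = 2 x (z for significance + z for power)² / (standardised effect size)², which is not given on the card; exact t-based calculations give slightly larger numbers. The function name and inputs are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sigma, power=0.80, alpha=0.05):
    """Approximate sample size per group (Normal approximation, two-sided test)."""
    d = delta / sigma                              # standardised effect size
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)

print(n_per_group(delta=5, sigma=10))              # baseline
print(n_per_group(delta=5, sigma=10, power=0.90))  # more power: larger n
print(n_per_group(delta=5, sigma=10, alpha=0.01))  # smaller alpha: larger n
print(n_per_group(delta=2.5, sigma=10))            # smaller effect: larger n
print(n_per_group(delta=5, sigma=5))               # less variable outcome: smaller n
```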

85

## what happens to sample size needed as significance level gets smaller?

### goes up

86

## what happens to sample size needed as power increases?

###
goes up.

(this is the same as saying type II error decreases)

87

## what happens to sample size needed as anticipated effect size decreases?

### goes up

88

## what happens to sample size needed as variability of outcome data decreases?

### goes down

89

## list the 3 factors determining the sample size needed for a survey

###
1. how precise should estimate be? e.g. within ±5%

2. probability that estimate is close to the population parameter

3. some idea of the prevalence in the population under study

90

## formula for sensitivity

### = true positives / no. people with disease

91

## formula for specificity

### = true negatives / no. people without disease

92

## definition of sensitivity

### given that the patient has the disease, sensitivity is the proportion of times the test is positive

93

## definition of specificity

### given that the subject doesn't have the disease, specificity is the proportion of times the test will be negative

94

## definition of PPV

### probability that someone has the disease when the test is positive

95

## formula for PPV

### true positives / no. positive results

96

## definition of NPV

### probability that someone is without disease when the test is negative

97

## formula for NPV

### true negatives / no. negative results

98

## formula for accuracy of test

### (true positives + true negatives) / no. people tested
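
All five diagnostic measures can be read off a 2x2 table, as in this sketch. The counts are hypothetical: 100 people with the disease and 200 without.

```python
def diagnostic_summary(tp, fp, fn, tn):
    """tp/fp/fn/tn = true positives, false positives, false negatives, true negatives."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives / no. people with disease
        "specificity": tn / (tn + fp),  # true negatives / no. people without disease
        "ppv": tp / (tp + fp),          # true positives / no. positive results
        "npv": tn / (tn + fn),          # true negatives / no. negative results
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

summary = diagnostic_summary(tp=90, fp=30, fn=10, tn=170)  # hypothetical counts
for name, value in summary.items():
    print(name, round(value, 3))
```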

99

## how can we decide on a diagnostic cut-off value for tests with continuous outcomes?

###
can use the Receiver Operating Characteristic (ROC) curve - plots sensitivity vs 1-specificity for each distinct cut-off value.

best cut-off point is the one nearest the top left-hand corner.

an ROC curve lying on the 45 degree line is no better than chance!

100

## how do you calculate the likelihood ratio of a positive result?

###
sensitivity / (1 - specificity)

this is the probability of getting this result, if patient is truly diseased vs if they were healthy.

interpret as you would any other ratio!

101

## how do you calculate the likelihood ratio for a negative result?

###
NOT simply the inverse of LR(+)

(1 - sensitivity) / specificity

102

## how can you interpret likelihood ratios?

###
a large LR(+) e.g. >10 = test could be useful in ruling IN a diagnosis.

small LR(-), close to 0 = test could be useful in ruling OUT a diagnosis
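
A sketch of both likelihood ratios using the standard definitions, LR(+) = sensitivity / (1 - specificity) and LR(-) = (1 - sensitivity) / specificity. The sensitivity and specificity values are hypothetical.

```python
def lr_positive(sensitivity, specificity):
    """How much more likely a positive result is in diseased vs healthy people."""
    return sensitivity / (1 - specificity)

def lr_negative(sensitivity, specificity):
    """How much more likely a negative result is in diseased vs healthy people."""
    return (1 - sensitivity) / specificity

sens, spec = 0.90, 0.85  # hypothetical test characteristics
print(round(lr_positive(sens, spec), 2))
print(round(lr_negative(sens, spec), 3))
```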

103

## which test would you use to compare two independent groups, with continuous, Normally distributed data?

### Independent samples t-test

104

## which test would you use to compare two independent groups with continuous, but not Normally distributed data?

### Mann-Whitney U

105

## which test would you use to compare two independent groups, with ordinal data?

###
Mann-Whitney U

or, Chi-squared test for trend

106

##
which test would you use to compare two independent samples of nominal data with:

>2 categories

large sample

most expected frequencies >5

### Chi-squared test

107

## which test would you use to compare two independent samples of binary data, with a large sample and all expected frequencies >5?

###
Comparison of two proportions

OR

Chi-Squared

108

##
which test would you use to compare two independent samples of binary data when you do NOT have a large sample with all expected frequencies >5?

###
Chi-squared with Yates' correction

OR

Fisher's exact test

109

## which test would you use for paired, continuous, Normally distributed data?

### paired t-test

110

## which test would you use for paired, continuous, but not Normally distributed data?

### Wilcoxon matched pairs test

111

## which test would you use for paired, ordinal data?

### Sign test or Wilcoxon matched pairs test

112

## which test would you use for paired, binary data?

### McNemar's test

113

## which test would you use for paired, nominal data with >2 categories?

### trick question! consult statistician!

114

## how many degrees of freedom do you use for the independent t-test?

### (n1 + n2) - 2

115

## how many degrees of freedom do you use for the paired t-test?

### n-1

116

## assumptions for independent t-test

###
1. two independent groups

2. continuous outcome variable

3. outcome data Normally distributed in both groups

4. outcome data in both groups have similar SDs

117

## assumptions for paired t-test

###
1. the differences between pairs are plausibly Normally distributed (not the actual data itself)

2. the differences between pairs are independent of each other
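
A sketch of the paired t-test statistic, built on the differences between pairs rather than the raw data. It applies the general form (observed - hypothesised) / SE with a hypothesised mean difference of 0 and n-1 degrees of freedom. The before/after scores are invented.

```python
from math import sqrt
from statistics import mean, stdev

before = [10, 12, 9, 14, 11]  # hypothetical scores at baseline
after = [8, 11, 7, 10, 9]     # hypothetical scores at follow-up, same individuals

differences = [b - a for b, a in zip(before, after)]  # one difference per pair
n = len(differences)
se = stdev(differences) / sqrt(n)   # standard error of the mean difference
t = (mean(differences) - 0) / se    # null hypothesis: mean difference is 0
df = n - 1                          # degrees of freedom for the paired t-test

print(round(t, 3), df)
```

The P-value would then come from the t distribution with `df` degrees of freedom, using tables or statistical software.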

118

## When calculating the sample size for a randomised controlled trial to compare a new treatment to a standard treatment for a particular disease the power of the study is _____?

### the probability of NOT making a type II error

119

## what are the assumptions required to make linear regression models valid?

###
1. variance (and therefore standard deviation) of Y is same at each value of X

2. observations are independent of each other

3. relationship between two variables is linear

4. residuals are Normally distributed for each value of X

120

## list the sample size ingredients for a continuous outcome?

###
1. target/anticipated effect size

2. SD of outcome

3. power

4. significance

121

## with increasing significance level (alpha), sample size ____

### decreases

122

## with increasing power, sample size _____

### increases

123

## with increasing effect size, sample size_____

### decreases

124