Descriptive Statistics Flashcards

1
Q

Categorical Data (factor, nominal)

A
  • No particular relationship between the different possibilities
  • Example: what prison sentence does someone have?
  • Answers might be suspended, determinate, indeterminate
  • Can’t average them or do maths with them
  • Doesn’t make sense to talk about “average” prison sentence
  • But you could talk about the most / least frequently occurring prison sentence
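
A minimal R sketch of the point above: categories can be counted but not averaged. The `sentence` vector is made up for illustration and is not course data.

```r
# Hypothetical sentence types (illustration only, not real data)
sentence <- factor(c("suspended", "determinate", "determinate",
                     "indeterminate", "determinate", "suspended"))

# Counting how often each category occurs is meaningful
counts <- table(sentence)
counts

# The most frequently occurring category (or categories, if tied)
most_common <- names(counts)[counts == max(counts)]
most_common
```

Calling `mean(sentence)` on a factor like this returns `NA` with a warning, which matches the idea that an "average" sentence type is not defined.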
2
Q

Continuous data (interval)

A
  • Goes in a specific order
  • Example: what is a patient’s weight today?
  • Can do maths with interval data
  • It would make sense to talk about average weight or weight increase or decrease
  • It’s meaningful to say someone who is 60kg is 10kg heavier than someone who is 50kg
3
Q

Ordinal Data (ordered categorical, ordered factor, Likert scale)

A

• Like categorical but there is an order to the sequence
• Example: how tired are you feeling right now? Pick one of the following options
1. Very tired
2. Tired
3. Alert
4. Very alert
• Can’t do maths with ordinal data
• Like categorical data, we can talk about the most chosen and least chosen options, but not the average tiredness
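
One way to represent this kind of data in R is an ordered factor. The sketch below uses made-up responses to the tiredness question; the variable names are assumptions for illustration.

```r
# Hypothetical responses to the tiredness question (illustration only)
responses <- c("Tired", "Alert", "Very tired", "Tired", "Very alert", "Tired")

# ordered = TRUE records that the levels have a sequence
tiredness <- factor(responses,
                    levels = c("Very tired", "Tired", "Alert", "Very alert"),
                    ordered = TRUE)

# Counts are reported in level order, from "Very tired" to "Very alert"
table(tiredness)

# Order comparisons are allowed for ordered factors:
# is the first response ("Tired") below the second ("Alert") in the ordering?
tiredness[1] < tiredness[2]
```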

4
Q

Mode

A

• The score / value / number / response that happens most often

• You can have more than one mode
• One mode = unimodal
• Two modes = bimodal
• Can take modes of continuous data too
• Example question: what is the mode of the variable bdi.8m, shown on the right? Interpret the result in the context of the data
• Example answer: a score of 0 is the mode (it appears 7 times, i.e. has 7 “counts”, in the data)
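
Base R has no built-in function for the statistical mode (its `mode()` reports how an object is stored), so a frequency table is one common workaround. This is a sketch on made-up scores, not the bdi.8m data.

```r
# Hypothetical scores (illustration only, not the bdi.8m data)
scores <- c(0, 0, 0, 2, 3, 3, 5, 9)

# Count how often each value occurs
counts <- table(scores)

# Keep every value whose count equals the highest count,
# so tied values (e.g. bimodal data) are all returned
modes <- as.numeric(names(counts)[counts == max(counts)])
modes
```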

5
Q

median

A
  • The middle number
  • Only useful for continuous data
  • Sort data in ascending order
  • Find the number in the middle of the dataset
  • You cannot have more than one median
  • If there is no single middle value (an even number of datapoints), take the average of the two middle values
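
The two cases can be checked with R's built-in `median()`. The numbers below are made up for illustration.

```r
# Odd number of values: sorted they are 1 5 7, so the middle value is 5
odd_scores <- c(7, 1, 5)
median(odd_scores)

# Even number of values: sorted they are 1 3 5 7,
# so the median is the average of the two middle values, 3 and 5
even_scores <- c(7, 1, 5, 3)
median(even_scores)
```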
6
Q

mean

A

• What most people mean when they say “average”
• Only useful for continuous data
• It is the sum (total) of all the values divided by the number of values
• Would need to know more about the measurement scale used
• Let’s formalise that in a formula: $\bar{X} = \frac{\sum x}{n}$
• The mean is affected by outliers
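
A short R sketch of the formula, with made-up numbers. It also shows, as an illustration rather than course output, how one extreme value pulls the mean much more than the median.

```r
# Hypothetical values (illustration only)
x <- c(2, 4, 6, 8)

# The formula written out: sum of the values divided by how many there are
sum(x) / length(x)

# The built-in function gives the same answer
mean(x)

# Add one extreme value and compare the mean with the median
x_outlier <- c(x, 100)
mean(x_outlier)
median(x_outlier)
```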
7
Q

variability

A
  • So far we have talked about centres, or “averageness”, in the data
  • Measures of variability are another type of statistic we need to calculate to understand our data
  • How spread out the data are
  • How far away from the mean or median the datapoints tend to be
  • The bdi.pre mean was 23.33: how near to this value do most of the patients’ depression scores tend to be?
8
Q

range

A
  • Simply the highest value minus the lowest value
  • Example: 14 - 0 = 14
  • Tells us the boundaries of our data
  • Useful to detect outliers or data input errors
  • But doesn’t tell you how common really high or low numbers are
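
In R, `range()` returns the lowest and highest values rather than their difference, so the subtraction is a separate step. The scores below are made up, chosen so that the result matches the 14 - 0 example.

```r
# Hypothetical scores (illustration only)
scores <- c(0, 3, 5, 9, 14)

# range() returns the minimum and maximum as a pair
range(scores)

# diff() subtracts the minimum from the maximum
diff(range(scores))
```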
9
Q

interquartile range

A
  • Split our data into quarters
  • Each quarter contains 25% of the datapoints
  • To do that we need to find the quartiles
  • The three points that split the data into the quarters
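
The quartiles and the IQR can be sketched in R as follows. The data are made up, and `quantile()` uses R's default interpolation method (type 7), which can differ slightly from a by-hand method.

```r
# Hypothetical scores (illustration only)
scores <- 1:9

# The three points that split the data into quarters
quartiles <- quantile(scores, probs = c(0.25, 0.50, 0.75))
quartiles

# The interquartile range: third quartile minus first quartile
IQR(scores)
```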
10
Q

interpreting the IQR

A
  • The first 13 people in the variable bdi.8m have a depression score IQR of 9
  • The IQR is the range (max-min) of the middle 50% of the data
  • IQR plays a key role in data visualization (boxplots)
  • IQR is useful as it is less affected by outliers than the measures that follow (variance and standard deviation)
11
Q

variance

A
  • How far numbers are spread out from the mean
  • Big number that isn’t useful on its own
  • Feeds into other statistics that we’ll use lots
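
The card does not give a formula, so the sketch below assumes the sample variance that R's `var()` computes: the sum of squared deviations from the mean, divided by n - 1. The data are made up.

```r
# Hypothetical values (illustration only)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Step 1: how far each value is from the mean
deviations <- x - mean(x)

# Step 2: square the deviations so negatives do not cancel positives
squared_deviations <- deviations^2

# Step 3: add them up and divide by n - 1 (assumption: sample variance)
manual_variance <- sum(squared_deviations) / (length(x) - 1)
manual_variance

# The built-in function should agree
var(x)
```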
12
Q

interpreting the variance

A
  • Variance is:
  • A really big number
  • Not in the original units (it is in squared units)
  • It is difficult to interpret, which is a problem with using the measure
  • It is not useful to say that the spread of depression scores before treatment was 89.12 around the mean
  • So we need to ‘undo’ the squaring that we did earlier, by taking the square root
  • That means the spread is interpretable in the same units as the data
  • Which gives us the standard deviation…
13
Q

standard deviation

A

Just the square root of the variance
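
A quick check in R, on made-up data, that `sd()` matches the square root of `var()`:

```r
# Hypothetical values (illustration only)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Square root of the variance
sqrt(var(x))

# The built-in standard deviation gives the same number
sd(x)
```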

14
Q

interpreting standard deviation

A
  • Roughly the average distance between the values in the dataset and the mean of that dataset
  • Most often used to understand the variability in continuous / ordinal data
15
Q

Calculating skew: Pearson’s coefficient of skewness

A
  • Negative number means data are negatively skewed
  • Positive number means data are positively skewed
  • Symmetrical data has skew of zero
μ = mean
ν = median 
σ = standard deviation
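
The formula itself is not shown on the card. Given the symbols listed (mean, median, standard deviation), the sketch below assumes Pearson's second (median) skewness coefficient, 3 × (mean − median) / standard deviation. The function name `pearson_skew` and all data are hypothetical.

```r
# Assumption: Pearson's second (median) skewness coefficient
pearson_skew <- function(x) {
  3 * (mean(x) - median(x)) / sd(x)
}

# Long tail to the right: mean above median, so the result is positive
right_tail <- c(1, 2, 2, 3, 3, 3, 4, 10)
pearson_skew(right_tail)

# Long tail to the left: mean below median, so the result is negative
left_tail <- c(1, 7, 8, 8, 9, 9, 9)
pearson_skew(left_tail)

# Symmetrical data: mean equals median, so the result is zero
symmetric <- c(1, 2, 3, 4, 5)
pearson_skew(symmetric)
```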
16
Q

RStudio

A

describe(data, mean = mean(dataset), stdev = sd(dataset))

17
Q

RStudio, arrange the descriptive stats by group

A

describe(data, mean = mean(dataset), stdev = sd(dataset), by = grouping_variable)

18
Q

RStudio min and max

A

describe(data = data, mean_dataset = mean(dataset), SD_dataset = sd(dataset), max_dataset = max(dataset), min_dataset = min(dataset), by = Condition)

19
Q

RStudio example

A

describe(data = tetris, mean_intrusion = mean(Intrusion), SD_Intrusion = sd(Intrusion), max_Intrusion = max(Intrusion), min_Intrusion = min(Intrusion), by = Condition)
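
The cards do not say which package provides `describe()` with these arguments. As a cross-check that needs no extra packages, the same per-group summaries can be sketched in base R. The `tetris` data frame below is a made-up stand-in with the same column names, not the real dataset.

```r
# Hypothetical stand-in for the tetris data (values are made up)
tetris <- data.frame(
  Condition = rep(c("Control", "Tetris"), each = 3),
  Intrusion = c(4, 6, 8, 1, 2, 3)
)

# One summary value per Condition, using base R only
mean_intrusion <- tapply(tetris$Intrusion, tetris$Condition, mean)
sd_intrusion   <- tapply(tetris$Intrusion, tetris$Condition, sd)
max_intrusion  <- tapply(tetris$Intrusion, tetris$Condition, max)
min_intrusion  <- tapply(tetris$Intrusion, tetris$Condition, min)

# Combine into one table with a row per condition
cbind(mean_intrusion, sd_intrusion, max_intrusion, min_intrusion)
```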

20
Q

RStudio correlation

A

cor(data)

gives you something like:
              exercise_mins stai_state
exercise_mins     1.0000000 -0.3985458
stai_state       -0.3985458  1.0000000

21
Q

Calculate the variance explained in RStudio

A

r_exercise_anxiety = 0.3985458
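
The variance explained is the square of the correlation coefficient (r²). A minimal sketch, using the signed value from the `cor()` output on the previous card; the name `variance_explained` is an assumption for illustration.

```r
# Correlation taken from the cor() output on the previous card
r_exercise_anxiety <- -0.3985458

# Squaring r gives the proportion of variance explained;
# the sign drops out when squaring
variance_explained <- r_exercise_anxiety^2
variance_explained

# The same value as a percentage, rounded to one decimal place
round(variance_explained * 100, 1)
```

With this value of r, roughly 16% of the variance in one variable is shared with the other.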