Working With Data Flashcards

1
Q

Descriptive statistics

A

They create a picture of the data: they summarize and describe the data

2
Q

Bar plot with significance stars (not effective)

A
  • hides all observed values
  • effect size not shown
  • precision of effect size not shown
  • confidence and likelihood of effect size not shown
  • creates a false dichotomy with the “significance asterisk”
3
Q

Alternative to the bar plot: box plot

A

Good:
- medians, quartiles, minima, and maxima are shown

Bad:
- all observed values still missing
- effect size not shown
- precision of effect size not shown
- confidence and likelihood of effect size not shown
- creates a false dichotomy with the p-value threshold

4
Q

Jitter plot

A

Good:
- all observed values are shown

Bad:
- underlying distribution not accurately depicted
- effect size not shown
- precision of effect size not shown
- confidence and likelihood of effect size not shown
- creates false dichotomy with significance test

5
Q

Alternative to bar plots

A

Dot plots
Probability-distribution plots such as raincloud or violin plots
Pirate plots

6
Q

Two-groups estimation plot (beneficial)

A
  • all observed values are shown
  • effect size is shown
  • precision of effect size is shown
  • confidence and likelihood of effect size are shown
  • no false dichotomy arising from significance testing
7
Q

Data screening

A

The process that all of our data must go through:
- verify that the data are in fact what we want them to be
- correct any problems that present themselves
- verify that the assumptions we are making are in fact satisfied

8
Q

Data structure

A

Data are often messy and chaotic. If you designed the data collection yourself, hopefully they won’t be chaotic
- but data collected from online records or medical records often contain mistakes, so screening them is an important process

  • online records often need to be rearranged
  • data need to be closely screened
  • garbage in -> garbage out
9
Q

What does data screening involve? (steps)

A

Make sure you have met all your assumptions
- check for outliers
- check for errors
- each kind of analysis requires its own type of data screening

10
Q

You want to keep as much as possible of the data you have collected

A
  1. It is easier to keep data than to collect new data. The more data you have, the more statistical power you have, and the more confident you can be that your sample represents the population you are sampling from
  2. If you delete data, you might not be deleting for the right reasons, so deletion requires very strict criteria
11
Q

Hypothesis testing

A

Traditionally uses p < 0.05 as the threshold (a p-value below 0.05 is what we are looking for, because it is conventionally called statistically significant)

12
Q

Hypothesis testing for data screening

A

Use p < 0.001, because we want to be sure something is really extreme before we fix or delete it

Generally we need to accept the data we have, whether we like it or not, so the bar for changing the data is high

  • if the probability is less than 0.1% (p < .001), the data really are skewed or different and our assumptions are violated. This needs to be fixed
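
One way to see what the p < .001 threshold means in practice is to convert it into a z-score cutoff. This is only a sketch under assumptions the card does not state: that scores are screened with z-scores, that the test is two-tailed, and that the reference distribution is standard normal. The helper `extreme_values` and the scores are made up for illustration.

```python
from statistics import NormalDist, fmean, stdev

ALPHA = 0.001  # screening threshold from the card above

# Two-tailed cutoff: the |z| beyond which a value has p < ALPHA
# under a standard normal distribution (an assumption of this sketch).
z_cutoff = NormalDist().inv_cdf(1 - ALPHA / 2)

def extreme_values(scores, cutoff=z_cutoff):
    """Return the scores whose absolute z-score exceeds the cutoff."""
    m, s = fmean(scores), stdev(scores)
    return [x for x in scores if abs((x - m) / s) > cutoff]

# Made-up scores with one clearly extreme value.
scores = [9, 10, 11, 10, 9, 11, 10, 10, 9, 11,
          10, 10, 9, 11, 10, 10, 9, 11, 10, 50]
print(round(z_cutoff, 2))      # 3.29
print(extreme_values(scores))  # [50]
```

For comparison, the same calculation with the traditional 0.05 threshold gives a cutoff of about 1.96, so the screening criterion is much stricter.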
13
Q

Data screening process (order)

A

Accuracy
Missing data
Outliers
Assumptions:
- Additivity
- Normality
- Linearity
- Homogeneity/homoscedasticity

14
Q
  1. Accuracy (data screening)
A
  1. Check for technical problems during data recording
  2. Check for typos and other problems with the dataset
  3. Generally, look for values that are out of the expected or logical range
    - check the min and max to see whether they are within what you would expect
    - fix or delete that data point
    - DO NOT DELETE THE PERSON, JUST THE WRONG DATA POINT
    - if there are too many wrong points for the same person, then you can delete the person
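
The range check above can be sketched as follows. Everything here is hypothetical: the 1-7 scale, the item names, and the participants are invented for illustration. The sketch blanks an out-of-range value (sets it to `None`) and keeps the participant, in line with the rule above.

```python
# Hypothetical survey data: one dict per participant, items on a 1-7 scale.
VALID_RANGE = (1, 7)  # assumed expected range for every item

participants = [
    {"id": "p1", "q1": 4, "q2": 7, "q3": 2},
    {"id": "p2", "q1": 5, "q2": 77, "q3": 3},  # 77 is above the scale maximum
    {"id": "p3", "q1": 0, "q2": 6, "q3": 1},   # 0 is below the scale minimum
]

def screen_accuracy(rows, valid_range):
    """Blank out-of-range values (None) but keep every participant."""
    low, high = valid_range
    cleaned = []
    for row in rows:
        fixed = dict(row)
        for key, value in row.items():
            if key != "id" and not (low <= value <= high):
                fixed[key] = None  # delete the data point, not the person
        cleaned.append(fixed)
    return cleaned

cleaned = screen_accuracy(participants, VALID_RANGE)

# Check min and max per item before and after screening.
for item in ("q1", "q2", "q3"):
    before = [r[item] for r in participants]
    after = [r[item] for r in cleaned if r[item] is not None]
    print(item, "before:", min(before), max(before),
          "after:", min(after), max(after))
```

If the correct value can be recovered from the original record, fixing it is preferable to blanking it; the sketch only shows the deletion path.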
15
Q

Missing data: two types

A

MCAR - missing completely at random (you want this)
- probably caused by skipping a question or missing a trial
MNAR - missing not at random (this may mean we have a problem with our experimental design)
- not random; maybe people do not want to answer the question or don’t understand it

16
Q

Missing data: what can you do?

A

MNAR - exclude or eliminate the data
MCAR - replace the data with a special function

17
Q

What should I replace? (data)

A

Do not replace categorical variables
Do not replace demographic variables
Do not replace data that are MNAR
Most people replace continuous variables (interval, ratio)

18
Q

How much can I replace?

A

Depends on your sample size: in large datasets, replacing <= 5% is OK
Applies to variables or participants

Small samples: you may need to collect more data
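
A minimal sketch of checking the 5% rule of thumb, both per participant and per variable. The dataset, the use of `None` for missing values, and the helper `proportion_missing` are all made up for illustration. With only four items per participant, any single missing answer already exceeds 5%, which shows why the rule is aimed at large datasets.

```python
# Hypothetical dataset: None marks a missing value.
data = {
    "p1": {"q1": 4, "q2": 5, "q3": 3, "q4": 4},
    "p2": {"q1": 2, "q2": None, "q3": 3, "q4": 5},
    "p3": {"q1": None, "q2": None, "q3": None, "q4": 1},
    "p4": {"q1": 5, "q2": 4, "q3": 4, "q4": 2},
}
MAX_MISSING = 0.05  # the "<= 5%" rule of thumb from the card above

def proportion_missing(values):
    """Fraction of the given values that are missing (None)."""
    values = list(values)
    return sum(v is None for v in values) / len(values)

# Proportion missing for each participant (across variables).
by_participant = {pid: proportion_missing(row.values())
                  for pid, row in data.items()}

# Proportion missing for each variable (across participants).
variables = list(next(iter(data.values())))
by_variable = {var: proportion_missing(row[var] for row in data.values())
               for var in variables}

too_many = [pid for pid, p in by_participant.items() if p > MAX_MISSING]
print(by_participant)
print(by_variable)
print("participants over the limit:", too_many)
```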

19
Q

Old ways of replacing: mean substitution

A

Used to be popular because it is easy
People no longer recommend this
It artificially lowers the variance
Could lead to artificially low p-values, giving you a kind of artificial power from your missing data
- not ideal
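
The drop in variance is easy to see with a small example. The scores are made up and `None` marks a missing value; this is only a sketch of the effect, not of any package's imputation routine.

```python
import statistics

# Made-up scores; None marks a missing value.
scores = [2, 4, None, 6, 8, None, 10, 12, None, 14]

observed = [x for x in scores if x is not None]
mean_observed = statistics.fmean(observed)

# Mean substitution: every missing value becomes the observed mean.
filled = [mean_observed if x is None else x for x in scores]

print("mean before:", mean_observed, "after:", statistics.fmean(filled))
print("variance before:", statistics.variance(observed))
print("variance after: ", statistics.variance(filled))
```

The mean is unchanged, but the sample variance falls because the filled-in values add cases without adding any spread.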

20
Q

Old ways to replace: regression

A

Same problems as mean substitution, but we are now replacing the data in two dimensions
Same process, solution and problems as mean substitution
- not ideal

21
Q

Problems with the old ways

A

Conservative, doesn’t change the mean values used to find significant differences
Unchanged mean could be seen as cherry picking
Does reduce the variance, which may cause significance tests to change when a lot of data are missing …

22
Q

Expectation-maximization (EM) algorithm (considered the best at replacing data)

A

Uses multiple imputation
Creates a set of expected values for each missing point
Using matrix algebra, the program estimates the probability of each value and picks the highest one

23
Q

Outlier issues

A

An extreme value in one variable or multiple variables
- we don’t like these
- they are outside the normal range of the distribution and can violate assumptions such as homogeneity of variance

24
Q

Types of Outliers

A

Univariate outliers: you are an outlier for one variable
Multivariate outliers: you are an outlier for multiple variables
- use a box plot to show outliers
We will see more of this later in the course

25
Q

Outliers in regression analyses (exist in two dimensions)

A

Discrepancy: how far a score is out of line with the rest of the data; a discrepant score may still not influence the regression slope

Leverage: a measure of how much a value can change the model, i.e. the regression slope (measured by leverage values); an outlier in the middle of the data might not change the slope

Influence: the product of leverage and discrepancy (measured by Cook’s values); how much we care about these outliers
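
The three measures can be computed by hand for a simple one-predictor regression. This sketch rests on assumptions not stated in the card: leverage is taken as the hat value, discrepancy as the standardized residual, and influence as Cook's distance in its usual textbook form. The function name `regression_diagnostics` and the data are made up.

```python
import statistics

def regression_diagnostics(x, y):
    """Per-point (leverage, discrepancy, Cook's distance) for a simple
    linear regression of y on x."""
    n, p = len(x), 2  # p = number of coefficients (intercept and slope)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    rows = []
    for xi, e in zip(x, residuals):
        leverage = 1 / n + (xi - mx) ** 2 / sxx          # hat value
        discrepancy = e / (mse * (1 - leverage)) ** 0.5  # standardized residual
        cooks = discrepancy ** 2 / p * leverage / (1 - leverage)
        rows.append((leverage, discrepancy, cooks))
    return rows

# Made-up data: the last point is extreme on x and off the trend.
x = [1, 2, 3, 4, 5, 6, 7, 15]
y = [1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 7.1, 9.0]
diagnostics = regression_diagnostics(x, y)
for i, (lev, disc, cook) in enumerate(diagnostics):
    print(f"point {i}: leverage={lev:.2f} "
          f"discrepancy={disc:+.2f} cook={cook:.2f}")
```

In this sketch Cook's distance grows with both the squared discrepancy and the leverage, which is why a point that is extreme on x and off the trend dominates the output.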

26
Q

What to do with outliers when you find them

A

First, look for a reason for them:
- Did the participant complete the study accurately?
- Was there a technical issue?

Ask yourself: did they do the study correctly? Are they part of the population you wanted?
- if not, eliminate them
- if so, leave them in: this is real data, and this person behaved this way
Don’t break the rule of keeping as much data as possible

27
Q

Assumptions checks

A

Additivity
Normality
Linearity
Homogeneity and homoscedasticity