Working With Data Flashcards
(27 cards)
Descriptive statistics
Create a picture of the data; they describe and summarize the data
Barplot with stars (not effective)
- hides all observed values
- effect size not shown
- precision of effect size not shown
- confidence and likelihood of effect size not shown
- creates false dichotomy with "significance asterisk"
Alternative to barplot: boxplot (sketch below)
Good:
- medians, quartiles, minima, and maxima shown
Bad:
- all observed values still missing
- effect size not shown
- precision of effect size not shown
- confidence and likelihood of effect size not shown
- creates false dichotomy with p-value threshold
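A minimal boxplot sketch with matplotlib, assuming two made-up groups (all names and values here are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
control = rng.normal(loc=50, scale=10, size=30)    # placeholder data
treatment = rng.normal(loc=58, scale=10, size=30)  # placeholder data

fig, ax = plt.subplots()
ax.boxplot([control, treatment])  # shows medians, quartiles, minima, maxima
ax.set_xticks([1, 2])
ax.set_xticklabels(["Control", "Treatment"])
ax.set_ylabel("Score")
plt.show()
```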
Jitter plot (sketch below)
Good:
- all observed values are shown
Bad:
- underlying distribution not accurately depicted
- effect size not shown
- precision of effect size not shown
- confidence and likelihood of effect size not shown
- creates false dichotomy with significance test
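A minimal jitter (strip) plot sketch on the same kind of made-up data; the small random horizontal offsets keep overlapping points visible:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
groups = {"Control": rng.normal(50, 10, 30),
          "Treatment": rng.normal(58, 10, 30)}  # placeholder data

fig, ax = plt.subplots()
for i, (name, values) in enumerate(groups.items(), start=1):
    x = rng.normal(loc=i, scale=0.05, size=len(values))  # horizontal jitter
    ax.plot(x, values, "o", alpha=0.6)
ax.set_xticks([1, 2])
ax.set_xticklabels(list(groups))
ax.set_ylabel("Score")
plt.show()
```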
Alternatives to bar plots
- dot plots
- raincloud or violin plots (show the probability distribution)
- pirate plots
Two-groups estimation plot (beneficial; sketch after this card)
- all observed values are shown
- effect size is shown
- precision of effect size is shown
- confidence and likelihood of effect size are shown
- no false dichotomy arising from significance testing
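The cards don't name a tool, so this is a hand-rolled sketch on made-up groups: raw values in one panel, and a bootstrap distribution of the mean difference (the effect size) with its 95% CI in the other. The dabest package produces polished versions of this kind of plot.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
control = rng.normal(50, 10, 30)    # placeholder data
treatment = rng.normal(58, 10, 30)  # placeholder data

# Bootstrap the mean difference to get its precision and confidence interval
boot = np.array([rng.choice(treatment, 30).mean() - rng.choice(control, 30).mean()
                 for _ in range(5000)])
lo, hi = np.percentile(boot, [2.5, 97.5])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
for i, values in enumerate([control, treatment], start=1):
    ax1.plot(rng.normal(i, 0.05, 30), values, "o", alpha=0.6)  # all observed values
ax1.set_xticks([1, 2])
ax1.set_xticklabels(["Control", "Treatment"])

ax2.hist(boot, bins=50)                                    # likelihood of effect sizes
ax2.axvline(treatment.mean() - control.mean(), color="k")  # observed effect size
ax2.axvline(lo, linestyle="--")
ax2.axvline(hi, linestyle="--")                            # 95% CI: precision
ax2.set_xlabel("Mean difference")
plt.show()
```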
Data screening
The process we go through with all of our data to verify that the data are in fact what we want them to be
- to correct any problems that present themselves
- and to verify that the assumptions we are making are in fact satisfied
Data structure
Data are messy and chaotic; if you designed the study yourself, it hopefully won't be chaotic
- but if the data were collected from online records (e.g., medical records), screening is an important step, because such records often contain mistakes
- online records often need to be rearranged
- data need to be closely screened
Garbage in -> garbage out
What is data screening (steps)
You need to make sure you have met all your assumptions
- check for outliers
- check for errors
- each kind of analysis requires its own type of data screening
You want to keep as much of the collected data as possible
- it is easier to keep data than to collect new data; the more data you have, the more statistical power you have, and the more confident you can be that your data represent the population you are sampling
- it is also important to keep as much as possible because if you delete, you might not be deleting for the right reasons, so we need very strict criteria
Hypothesis testing
Traditionally we use p < .05 (we want values less than .05 because that is what we are trying to find: statistical significance)
For data-screening hypothesis tests:
we want to use p < .001, because we want to make sure things are really extreme before we fix/delete/etc.
We generally need to accept the data we have whether we like it or not
- if p < .001 (less than 0.1%), the data really are skewed or different and our assumptions are violated; this needs to be fixed (example below)
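For example, a screening check for skew might use scipy's skewness test against the stricter cutoff; the data here are deliberately skewed placeholders:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores = rng.exponential(scale=2.0, size=200)  # placeholder, deliberately skewed

stat, p = stats.skewtest(scores)  # tests skewness against a normal distribution
alpha = 0.001                     # screening cutoff, not the usual .05
if p < alpha:
    print(f"p = {p:.2g} < {alpha}: skew is extreme; fix it")
else:
    print(f"p = {p:.2g}: accept the data as they are")
```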
Data screening process (order)
1. Accuracy
2. Missing data
3. Outliers
4. Assumptions:
   - additivity
   - normality
   - linearity
   - homogeneity/homoscedasticity
Accuracy (data screening)
- check for technical problems during data recording
- check for typos and problems with the dataset
- generally, look for values that are out of the expected or logical range
- check the min and max to see if they are within what you would expect
- fix them or delete that data point (sketch below)
DO NOT DELETE THE PERSON, JUST THE WRONG DATA POINT
- if there are too many wrong points for the same person, then you can delete the person
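A minimal pandas sketch of the min/max check, assuming a made-up 1-7 Likert item with two typos; only the bad cells are blanked, not the whole person:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"participant": [1, 2, 3, 4, 5],
                   "likert": [4, 0, 7, 77, 3]})  # 0 and 77 are out of range

print(df["likert"].agg(["min", "max"]))  # quick range check

out_of_range = ~df["likert"].between(1, 7)
df.loc[out_of_range, "likert"] = np.nan  # delete the data point, keep the person
print(df)
```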
Missing data: two types
MCAR - missing completely at random (you want this)
- probably caused by skipping a question or missing a trial
MNAR - missing not at random (this may mean we have a problem with our experimental design)
- not random; maybe people do not want to answer the question or don't understand it
Missing data: what can you do?
MNAR - exclude or eliminate the data
MCAR - replace the data with a special function
What should I replace? (data)
Do not replace categorical variables
Do not replace demographic variables
Do not replace data that is MNAR
Most people replace continuous variables (interval, ratio)
How much can I replace?
Depends on your sample size: in large datasets, <= 5% missing is OK (a quick check is sketched below)
This applies to variables and to participants
Small samples: you may need to collect more data
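A quick pandas sketch of the 5% check, per variable and per participant, on a made-up dataset:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age":   [25, 31, np.nan, 22, 40],
                   "score": [3.2, np.nan, 4.1, np.nan, 2.8]})  # placeholder

pct_by_variable = df.isna().mean() * 100           # % missing per column
pct_by_participant = df.isna().mean(axis=1) * 100  # % missing per row

print(pct_by_variable)
print(pct_by_participant[pct_by_participant > 5])  # rows over the 5% rule of thumb
```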
Old ways of replacing: mean substitution
Used to be popular because it is easy
People no longer recommend it
It artificially lowers the variance (demo below)
- this could lead to artificially low p-values and a kind of artificial power from your missing data
- not ideal
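A small demonstration of the variance problem on simulated scores; the imputed values sit exactly at the mean, so the variance shrinks:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
scores = pd.Series(rng.normal(100, 15, 200))              # placeholder scores
scores[rng.choice(200, size=30, replace=False)] = np.nan  # 15% missing at random

filled = scores.fillna(scores.mean())  # mean substitution

print(f"variance, complete cases only:    {scores.var():.1f}")
print(f"variance after mean substitution: {filled.var():.1f}")  # smaller
```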
Old ways of replacing: regression
Same process, solution, and problems as mean substitution, but now there are two dimensions we are replacing the data in (sketch below)
- not ideal
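A sketch of regression imputation with scikit-learn on made-up x/y data; the closing comment notes why it shares mean substitution's flaw:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
x = rng.normal(0, 1, 100)
y = 2 * x + rng.normal(0, 1, 100)                         # placeholder relationship
df = pd.DataFrame({"x": x, "y": y})
df.loc[rng.choice(100, 15, replace=False), "y"] = np.nan  # MCAR holes in y

# Fit on complete cases, predict the missing values
complete = df.dropna()
model = LinearRegression().fit(complete[["x"]], complete["y"])

missing = df["y"].isna()
df.loc[missing, "y"] = model.predict(df.loc[missing, ["x"]])
# Imputed points fall exactly on the regression line, so the scatter
# around the line is understated - the same flaw as mean substitution.
```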
Problems with the old ways
Conservative: they don't change the mean values used to find significant differences
The unchanged mean could be seen as cherry-picking
They do reduce the variance, which may cause significance tests to change when a lot of data are missing
Expectation maximization (EM) algorithm (considered the best way to replace data)
Uses multiple imputation (sketch below)
Creates a set of expected values for each missing point
Using matrix algebra, the program estimates the probability of each value and picks the most likely one
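The cards don't name software; one common stand-in is scikit-learn's IterativeImputer, a MICE-style chained-equations imputer related in spirit to EM (this tool choice is an assumption, not the course's prescribed method):

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(5)
data = rng.multivariate_normal([0, 0, 0],
                               [[1.0, 0.5, 0.3],
                                [0.5, 1.0, 0.4],
                                [0.3, 0.4, 1.0]], size=200)  # correlated placeholders
df = pd.DataFrame(data, columns=["a", "b", "c"])
df.loc[rng.choice(200, 20, replace=False), "b"] = np.nan

# Each missing value is estimated from the other variables, with posterior sampling
imputer = IterativeImputer(sample_posterior=True, random_state=0)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled["b"].describe())
```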
Outlier issues
An extreme value on one variable or on multiple variables
- we don't like these
- they are outside the normal range of the distribution and can violate assumptions such as homogeneity of variance
Types of outliers
Univariate outliers: an outlier on one variable
Multivariate outliers: an outlier across multiple variables
- use a boxplot to show univariate outliers (sketch below)
We will see more of this later in the course
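A sketch of both checks on made-up data: |z| > 3.29 corresponds to the two-tailed p < .001 screening cutoff for a single variable, and Mahalanobis distance (a common multivariate approach, not named in the card) flags rows that are extreme across variables jointly:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(6)
df = pd.DataFrame(rng.normal(0, 1, size=(100, 2)), columns=["x", "y"])
df.loc[0] = [4.5, -4.2]  # plant one extreme case

# Univariate: z-scores beyond +/-3.29 (two-tailed p < .001)
z = np.abs(stats.zscore(df))
print(df[(z > 3.29).any(axis=1)])

# Multivariate: squared Mahalanobis distance from the centroid
diff = (df - df.mean()).to_numpy()
inv_cov = np.linalg.inv(np.cov(df.T))
d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
cutoff = stats.chi2.ppf(0.999, df=2)  # p < .001 with 2 variables
print(df[d2 > cutoff])
```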