Working With Data Flashcards
(27 cards)
Descriptive statistics
Create a picture of the data; they describe and summarize the data
Barplot with stars (not effective)
- hides all observed values
- effect size not shown
- precision of effect size not shown
- confidence and likelihood of effect size not shown
- creates false dichotomy with "significance asterisk"
Alternative to barplot: boxplot (sketch below)
Good:
- medians, quartiles, minima, and maxima shown
Bad:
- all observed values still missing
- effect size not shown
- precision of effect size not shown
- confidence and likelihood of effect size not shown
- creates false dichotomy with p-value threshold
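A minimal boxplot sketch with matplotlib, assuming two made-up groups (all names and values here are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
control = rng.normal(loc=50, scale=10, size=30)    # placeholder data
treatment = rng.normal(loc=58, scale=10, size=30)  # placeholder data

fig, ax = plt.subplots()
ax.boxplot([control, treatment])  # shows medians, quartiles, minima, maxima
ax.set_xticks([1, 2])
ax.set_xticklabels(["Control", "Treatment"])
ax.set_ylabel("Score")
plt.show()
```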
Jitter plot (sketch below)
Good:
- all observed values are shown
Bad:
- underlying distribution not accurately depicted
- effect size not shown
- precision of effect size not shown
- confidence and likelihood of effect size not shown
- creates false dichotomy with significance test
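A minimal jitter (strip) plot sketch on the same kind of made-up data; the small random horizontal offsets keep overlapping points visible:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
groups = {"Control": rng.normal(50, 10, 30),
          "Treatment": rng.normal(58, 10, 30)}  # placeholder data

fig, ax = plt.subplots()
for i, (name, values) in enumerate(groups.items(), start=1):
    x = rng.normal(loc=i, scale=0.05, size=len(values))  # horizontal jitter
    ax.plot(x, values, "o", alpha=0.6)
ax.set_xticks([1, 2])
ax.set_xticklabels(list(groups))
ax.set_ylabel("Score")
plt.show()
```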
Alternatives to bar plots
- dot plots
- raincloud or violin plots (show the probability distribution)
- pirate plots
Two-groups estimation plot (beneficial; sketch after this card)
- all observed values are shown
- effect size is shown
- precision of effect size is shown
- confidence and likelihood of effect size are shown
- no false dichotomy arising from significance testing
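The cards don't name a tool, so this is a hand-rolled sketch on made-up groups: raw values in one panel, and a bootstrap distribution of the mean difference (the effect size) with its 95% CI in the other. The dabest package produces polished versions of this kind of plot.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
control = rng.normal(50, 10, 30)    # placeholder data
treatment = rng.normal(58, 10, 30)  # placeholder data

# Bootstrap the mean difference to get its precision and confidence interval
boot = np.array([rng.choice(treatment, 30).mean() - rng.choice(control, 30).mean()
                 for _ in range(5000)])
lo, hi = np.percentile(boot, [2.5, 97.5])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
for i, values in enumerate([control, treatment], start=1):
    ax1.plot(rng.normal(i, 0.05, 30), values, "o", alpha=0.6)  # all observed values
ax1.set_xticks([1, 2])
ax1.set_xticklabels(["Control", "Treatment"])

ax2.hist(boot, bins=50)                                    # likelihood of effect sizes
ax2.axvline(treatment.mean() - control.mean(), color="k")  # observed effect size
ax2.axvline(lo, linestyle="--")
ax2.axvline(hi, linestyle="--")                            # 95% CI: precision
ax2.set_xlabel("Mean difference")
plt.show()
```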
Data screening
The process we go through with all of our data to verify that the data are in fact what we want them to be
- to correct any problems that present themselves
- and to verify that the assumptions we are making are in fact satisfied
Data structure
Data are messy and chaotic; if you designed the study yourself, it hopefully won't be chaotic
- but if the data were collected from online records (e.g., medical records), screening is an important step, because such records often contain mistakes
- online records often need to be rearranged
- data need to be closely screened
Garbage in -> garbage out
What is data screening (steps)
You need to make sure you have met all your assumptions
- check for outliers
- check for errors
- each kind of analysis requires its own type of data screening
You want to keep as much of the collected data as possible
- it is easier to keep data than to collect new data; the more data you have, the more statistical power you have, and the more confident you can be that your data represent the population you are sampling
- it is also important to keep as much as possible because if you delete, you might not be deleting for the right reasons, so we need very strict criteria
Hypothesis testing
Traditionally we use p < .05 (we want values less than .05 because that is what we are trying to find: statistical significance)
For data-screening hypothesis tests:
we want to use p < .001, because we want to make sure things are really extreme before we fix/delete/etc.
We generally need to accept the data we have whether we like it or not
- if p < .001 (less than 0.1%), the data really are skewed or different and our assumptions are violated; this needs to be fixed (example below)
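For example, a screening check for skew might use scipy's skewness test against the stricter cutoff; the data here are deliberately skewed placeholders:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores = rng.exponential(scale=2.0, size=200)  # placeholder, deliberately skewed

stat, p = stats.skewtest(scores)  # tests skewness against a normal distribution
alpha = 0.001                     # screening cutoff, not the usual .05
if p < alpha:
    print(f"p = {p:.2g} < {alpha}: skew is extreme; fix it")
else:
    print(f"p = {p:.2g}: accept the data as they are")
```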
Data screening process (order)
1. Accuracy
2. Missing data
3. Outliers
4. Assumptions:
   - additivity
   - normality
   - linearity
   - homogeneity/homoscedasticity
Accuracy (data screening)
- check for technical problems during data recording
- check for typos and problems with the dataset
- generally, look for values that are out of the expected or logical range
- check the min and max to see if they are within what you would expect
- fix them or delete that data point (sketch below)
DO NOT DELETE THE PERSON, JUST THE WRONG DATA POINT
- if there are too many wrong points for the same person, then you can delete the person
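A minimal pandas sketch of the min/max check, assuming a made-up 1-7 Likert item with two typos; only the bad cells are blanked, not the whole person:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"participant": [1, 2, 3, 4, 5],
                   "likert": [4, 0, 7, 77, 3]})  # 0 and 77 are out of range

print(df["likert"].agg(["min", "max"]))  # quick range check

out_of_range = ~df["likert"].between(1, 7)
df.loc[out_of_range, "likert"] = np.nan  # delete the data point, keep the person
print(df)
```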
Missing data: two types
MCAR - missing completely at random (you want this)
- probably caused by skipping a question or missing a trial
MNAR - missing not at random (this may mean we have a problem with our experimental design)
- not random; maybe people do not want to answer the question or don't understand it
Missing data: what can you do?
MNAR - exclude or eliminate the data
MCAR - replace the data with a special function
What should I replace? (data)
Do not replace categorical variables
Do not replace demographic variables
Do not replace data that is MNAR
Most people replace continuous variables (interval, ratio)
How much can I replace?
Depends on your sample size: in large datasets, <= 5% missing is OK (a quick check is sketched below)
This applies to variables and to participants
Small samples: you may need to collect more data
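A quick pandas sketch of the 5% check, per variable and per participant, on a made-up dataset:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age":   [25, 31, np.nan, 22, 40],
                   "score": [3.2, np.nan, 4.1, np.nan, 2.8]})  # placeholder

pct_by_variable = df.isna().mean() * 100           # % missing per column
pct_by_participant = df.isna().mean(axis=1) * 100  # % missing per row

print(pct_by_variable)
print(pct_by_participant[pct_by_participant > 5])  # rows over the 5% rule of thumb
```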
Old ways of replacing: mean substitution
Used to be popular because it is easy
People no longer recommend it
It artificially lowers the variance (demo below)
- this could lead to artificially low p-values and a kind of artificial power from your missing data
- not ideal
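A small demonstration of the variance problem on simulated scores; the imputed values sit exactly at the mean, so the variance shrinks:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
scores = pd.Series(rng.normal(100, 15, 200))              # placeholder scores
scores[rng.choice(200, size=30, replace=False)] = np.nan  # 15% missing at random

filled = scores.fillna(scores.mean())  # mean substitution

print(f"variance, complete cases only:    {scores.var():.1f}")
print(f"variance after mean substitution: {filled.var():.1f}")  # smaller
```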
Old ways of replacing: regression
Same process, solution, and problems as mean substitution, but now there are two dimensions we are replacing the data in (sketch below)
- not ideal
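A sketch of regression imputation with scikit-learn on made-up x/y data; the closing comment notes why it shares mean substitution's flaw:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
x = rng.normal(0, 1, 100)
y = 2 * x + rng.normal(0, 1, 100)                         # placeholder relationship
df = pd.DataFrame({"x": x, "y": y})
df.loc[rng.choice(100, 15, replace=False), "y"] = np.nan  # MCAR holes in y

# Fit on complete cases, predict the missing values
complete = df.dropna()
model = LinearRegression().fit(complete[["x"]], complete["y"])

missing = df["y"].isna()
df.loc[missing, "y"] = model.predict(df.loc[missing, ["x"]])
# Imputed points fall exactly on the regression line, so the scatter
# around the line is understated - the same flaw as mean substitution.
```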
Problems with the old ways
Conservative: they don't change the mean values used to find significant differences
The unchanged mean could be seen as cherry-picking
They do reduce the variance, which may cause significance tests to change when a lot of data are missing
Expectation maximization (EM) algorithm (considered the best way to replace data)
Uses multiple imputation (sketch below)
Creates a set of expected values for each missing point
Using matrix algebra, the program estimates the probability of each value and picks the most likely one
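The cards don't name software; one common stand-in is scikit-learn's IterativeImputer, a MICE-style chained-equations imputer related in spirit to EM (this tool choice is an assumption, not the course's prescribed method):

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(5)
data = rng.multivariate_normal([0, 0, 0],
                               [[1.0, 0.5, 0.3],
                                [0.5, 1.0, 0.4],
                                [0.3, 0.4, 1.0]], size=200)  # correlated placeholders
df = pd.DataFrame(data, columns=["a", "b", "c"])
df.loc[rng.choice(200, 20, replace=False), "b"] = np.nan

# Each missing value is estimated from the other variables, with posterior sampling
imputer = IterativeImputer(sample_posterior=True, random_state=0)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled["b"].describe())
```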
Outlier issues
An extreme value on one variable or on multiple variables
- we don't like these
- they are outside the normal range of the distribution and can violate assumptions such as homogeneity of variance
Types of outliers
Univariate outliers: an outlier on one variable
Multivariate outliers: an outlier across multiple variables
- use a boxplot to show univariate outliers (sketch below)
We will see more of this later in the course
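A sketch of both checks on made-up data: |z| > 3.29 corresponds to the two-tailed p < .001 screening cutoff for a single variable, and Mahalanobis distance (a common multivariate approach, not named in the card) flags rows that are extreme across variables jointly:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(6)
df = pd.DataFrame(rng.normal(0, 1, size=(100, 2)), columns=["x", "y"])
df.loc[0] = [4.5, -4.2]  # plant one extreme case

# Univariate: z-scores beyond +/-3.29 (two-tailed p < .001)
z = np.abs(stats.zscore(df))
print(df[(z > 3.29).any(axis=1)])

# Multivariate: squared Mahalanobis distance from the centroid
diff = (df - df.mean()).to_numpy()
inv_cov = np.linalg.inv(np.cov(df.T))
d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
cutoff = stats.chi2.ppf(0.999, df=2)  # p < .001 with 2 variables
print(df[d2 > cutoff])
```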