Interpreting plots and recommending next steps Flashcards
(51 cards)
How do you assess a continuous variable?
describe the center, spread (dispersion), and shape (symmetry, outliers)
Normal distribution (symmetric): mean/SD
Non-normal (skewed): median/IQR
What plots do we use to assess a continuous variable?
dotplot, histogram, Normal Q-Q plot, and boxplot help with assessment
Numerical summaries can also help but are not as useful as plots
Draw the damn picture
Why plot the data?
- To assess whether the data are compatible with a Normal distribution
- The answer determines how the data should be summarized (mean/SD vs. median/IQR) and which methods to use for inference about populations from samples (t-test vs. other approaches)
How do outliers affect the mean/SD and median/IQR?
Outliers may completely change the mean/SD, but the median and IQR may be unaffected.
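A minimal sketch of this card, using made-up numbers: one value is replaced by an extreme one and the four summaries are compared. The data and the `summarize` helper are illustrative assumptions. Quartiles come from Python's `statistics.quantiles` with its default method, so other software may give slightly different quartile values.

```python
import statistics

def summarize(values):
    """Return mean, SD, median, and IQR for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # default 'exclusive' method
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }

# Hypothetical data: same sample, but the largest value becomes an outlier.
clean = [10, 11, 12, 13, 14, 15, 16, 17, 18]
with_outlier = clean[:-1] + [180]

before = summarize(clean)
after = summarize(with_outlier)

for name in ("mean", "sd", "median", "iqr"):
    print(f"{name:>6}: {before[name]:8.2f} -> {after[name]:8.2f}")
```

In this example the mean and SD jump while the median and IQR stay where they were.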
What is a bootstrap and why is it used?
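The bootstrap resamples the observed data with replacement many times and recomputes the statistic on each resample.
If data are not compatible with a Normal distribution, you can use the bootstrap to estimate the population mean with a 95% CI.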
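One way to sketch the idea is a percentile bootstrap, which is only one of several bootstrap interval methods and not necessarily the one the course uses. The sample, the number of resamples, the seed, and the function name `bootstrap_mean_ci` are all assumptions for illustration.

```python
import random
import statistics

def bootstrap_mean_ci(values, n_resamples=2000, level=0.95, seed=431):
    """Percentile bootstrap CI for the mean (a sketch, not a reference method)."""
    rng = random.Random(seed)          # fixed seed so the result is repeatable
    n = len(values)
    # Resample n values WITH replacement, take the mean, repeat many times.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = (1 - level) / 2
    lower = means[int(alpha * n_resamples)]
    upper = means[int((1 - alpha) * n_resamples) - 1]
    return statistics.fmean(values), lower, upper

# Hypothetical right-skewed sample.
sample = [1, 2, 2, 3, 3, 3, 4, 5, 8, 13, 21, 40]
estimate, lower, upper = bootstrap_mean_ci(sample)
print(f"mean = {estimate:.2f}, 95% CI roughly ({lower:.2f}, {upper:.2f})")
```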
Why might it be useful to view data with a histogram?
- Count (or density) of continuous variable values within equally-sized bins expressed as a bar graph
- Can stratify (subgroup) data by a factor (sex, etc.) and show separate histograms
- Can overlay a Normal density plot to assess compatibility with a Normal distribution
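The counting behind a histogram can be sketched in a few lines. This assumes equal-width bins spanning the minimum to the maximum, with the maximum placed in the last bin. Plotting software may choose bin edges differently, and the helper name `histogram_counts` and the data are made up.

```python
def histogram_counts(values, n_bins):
    """Count values falling in n_bins equal-width bins from min to max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; bins would have zero width")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum would land one past the end, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts

# Hypothetical data.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
edges, counts = histogram_counts(data, n_bins=4)

# Crude text histogram: one '#' per observation in the bin.
for left, right, c in zip(edges, edges[1:], counts):
    print(f"[{left:4.1f}, {right:4.1f}) {'#' * c}")
```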
What is a Q-Q plot?
Scatterplot of the observed data value (y-axis) versus the expected value for that point in a Normal distribution with mean 0 and SD 1 (x-axis)
Reference line drawn for comparison (Y = X when the data are standardized)
Especially helpful to assess skew and behavior of tails (outlier-proneness)
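The coordinates of a Normal Q-Q plot can be computed by hand as a rough sketch. The plotting positions `(i - 0.5) / n` are one common convention and an assumption here, since packages differ on this detail. The function name `normal_qq_points` and the data are hypothetical.

```python
from statistics import NormalDist

def normal_qq_points(values):
    """Return (expected, observed) pairs for a Normal Q-Q plot."""
    n = len(values)
    std_normal = NormalDist(mu=0, sigma=1)
    observed = sorted(values)                      # y-axis: ordered data
    # x-axis: standard Normal quantiles at assumed plotting positions.
    expected = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, observed))

# Hypothetical data.
points = normal_qq_points([5, 1, 3, 2, 4])
for x, y in points:
    print(f"expected z = {x:6.3f}   observed = {y}")
```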
What are some ways to interpret the shape of a Q-Q plot?
- Points close to the line = compatible with a Normal distribution
- Substantial curving away from the line in the same direction at both ends = skew
- S-shaped or reverse S-shaped curve = dearth or abundance of outliers (light- or heavy-tailedness)
What are properties of a box plot?
- Box covers the middle half of the data (from the 25th to the 75th percentile; its length is the IQR)
- Solid line indicates the median
- Whiskers extend from the quartiles to the most extreme values that are not judged by Tukey’s “fences” method to be candidate outliers
- Fences sit at 25th percentile - 1.5 × IQR and 75th percentile + 1.5 × IQR
- Candidate outliers are the points beyond the fences
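The fence rule can be sketched as follows. Quartile conventions vary between packages. This version assumes the "inclusive" method of Python's `statistics.quantiles`, and the helper name `tukey_fences` and the data are made up.

```python
import statistics

def tukey_fences(values):
    """Apply Tukey's 1.5 x IQR rule; return fences, whisker ends, and outliers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "median": median,
        "iqr": iqr,
        "fences": (lower_fence, upper_fence),
        # Whiskers stop at the most extreme data values inside the fences.
        "whiskers": (min(inside), max(inside)),
        "outliers": [v for v in values if v < lower_fence or v > upper_fence],
    }

# Hypothetical data with one large value.
result = tukey_fences([2, 4, 4, 5, 6, 7, 8, 9, 30])
print(result)
```

Note that the whiskers end at observed data values (2 and 9 here), not at the fences themselves.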
How is a violin plot helpful?
- it traces the density of the data (almost like a histogram with small bin sizes)
- Used as a supplement to boxplot to better show the shape of the distribution
How does a normal distribution relate to SD?
Approximately 68% of the data would be within 1 SD of the mean
Approximately 95% of the data would be within 2 SD of the mean
Essentially all (99.7%) of the data would be within 3 SD of the mean
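These percentages can be checked against the standard Normal CDF. A short sketch using Python's `statistics.NormalDist`:

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, SD 1

# Probability of falling within k SDs of the mean: P(-k < Z < k).
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"within {k} SD of the mean: {100 * p:.1f}%")
```

The exact two-SD figure is a little above 95%; the 95% interval proper uses about 1.96 SD.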
What items should be assessed with plots and numerical summaries?
symmetry, skew, kurtosis (heavy- or light-tailedness, outlier-proneness)
Normal distribution in histogram looks how?
symmetric and bell-shaped
Normal distribution in box plot looks how?
box is symmetric around the median, as are the whiskers, without serious outliers
Normal distribution in Q-Q plot looks how?
points essentially fall on a straight line
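Right-skewed data
- Mean > median (if data are unimodal)
- On a histogram, data extend to the right of center
- On a Q-Q plot, the points curve up and away from the line in both tails
- On a boxplot, multiple dots (outliers) above the whiskers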
Left-skewed data
- Mean < median (if data are unimodal)
- On a histogram, data extend to the left of center
- On a Q-Q plot, the points bend down and away from the line in both tails
- On a boxplot, multiple dots (outliers) below the whiskers
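The mean-versus-median rule of thumb for skew can be illustrated with a small sketch. The data are invented, and the left-skewed sample is simply the mirror image of the right-skewed one.

```python
import statistics

def mean_minus_median(values):
    """Positive suggests right skew, negative suggests left skew (a rough guide only)."""
    return statistics.fmean(values) - statistics.median(values)

# Hypothetical unimodal samples.
right_skewed = [1, 2, 2, 3, 3, 4, 5, 9, 25]   # long tail to the right
left_skewed = [-v for v in right_skewed]      # mirror image: long tail to the left

right_gap = mean_minus_median(right_skewed)
left_gap = mean_minus_median(left_skewed)
print(f"right-skewed: mean - median = {right_gap:+.1f}")
print(f"left-skewed:  mean - median = {left_gap:+.1f}")
```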
Outlier-proneness
Kurtosis
- Indicated by S-shaped curves on a Normal Q-Q plot
- Heavy-tailed but symmetric distributions are indicated by reverse “S” shapes
- Light-tailed but symmetric distributions are indicated by “S” shapes
Plot of discrete data
The variable can take only discrete values, so it is not truly continuous
Symmetric but outlier-prone (heavy-tailed)
What can you assess with a scatterplot?
- The relationship between two continuous variables, typically with the response (outcome) on the y-axis
- Look for direction (positive or negative) and strength of a linear relationship (if one exists at all)
- Also look for non-linear patterns in association (a loess smooth curve can help here)
What does Pearson correlation coefficient (r) assess?
strength of a linear association between the two continuous variables
What does the r² value tell you?
proportion of variability in the y-variable explained by a linear model using the x-variable
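A sketch of r and r² computed from their definitions, with made-up data. The second example shows why a scatterplot still matters: a perfectly curved relationship can have r = 0 because r measures only linear association. The function name `pearson_r` is an assumption.

```python
import math

def pearson_r(x, y):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data with a moderate positive linear trend.
r_linear = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
r_squared = r_linear ** 2

# Hypothetical data with a perfect but non-linear (quadratic) pattern.
r_curved = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(f"linear trend: r = {r_linear:.3f}, r^2 = {r_squared:.2f}")
print(f"curved trend: r = {r_curved:.3f}")
```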
What plots can help assess relationship between continuous and factor variables?
- comparison boxplot
- faceted histograms
- overlapping density plot
- ridgeline plot