Interpreting plots and recommending next steps Flashcards

(51 cards)

1
Q

How do you assess a continuous variable?

A

describe the center, spread (dispersion), and shape (symmetry, outliers)

Normal distribution (symmetric): mean/SD
Non-Normal (skewed): median/IQR

2
Q

What plots do we use to assess a continuous variable?

A

dotplot, histogram, Normal Q-Q plot, and boxplot help with assessment

Numerical summaries can also help but are not as useful as plots

Draw the damn picture

3
Q

Why plot the data?

A
  • To assess whether the data are compatible with a Normal distribution
  • The answer determines how the data should be summarized (mean/SD vs. median/IQR) and which methods to use for inference about populations from samples (t-test vs. other approaches)
4
Q

How do outliers affect the mean/SD and median/IQR?

A

Outliers may completely change the mean and SD, but the median and IQR may be unaffected.
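
A minimal sketch of this effect, using made-up numbers and Python's standard library (the function name `summarize` and both data sets are assumptions for illustration only). Replacing the largest value with an extreme one moves the mean and SD a long way while the median and IQR stay put.

```python
import statistics

def summarize(values):
    """Return mean, SD, median and IQR for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }

# Hypothetical data: identical except that the largest value becomes an outlier.
clean = [10, 11, 12, 13, 14, 15, 16, 17, 18]
with_outlier = clean[:-1] + [180]

print(summarize(clean))
print(summarize(with_outlier))
```

Quartile definitions differ between packages, so the exact IQR may not match other software, but the contrast between the two pairs of summaries does not depend on that choice.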

5
Q

What is a bootstrap and why is it used?

A

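The bootstrap resamples the observed data with replacement many times to approximate the sampling distribution of a statistic. If data are not compatible with a Normal distribution, you can use the bootstrap to estimate the population mean with a 95% CI.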
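
A rough sketch of a percentile bootstrap for the mean, in standard-library Python. The function name, the seed, the number of resamples and the data are all assumptions for illustration; this is not the routine behind any particular R function and its interval will not match one exactly.

```python
import random
import statistics

def bootstrap_mean_ci(values, n_resamples=2000, conf=0.95, seed=431):
    """Percentile bootstrap interval for the mean (illustrative sketch)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample n observations with replacement, many times, keeping each mean.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = (1 - conf) / 2
    lower = means[int(alpha * n_resamples)]
    upper = means[int((1 - alpha) * n_resamples) - 1]
    return statistics.fmean(values), lower, upper

# Hypothetical right-skewed data (for example, lengths of stay in days).
stays = [1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 8, 12, 21, 35]
estimate, lower, upper = bootstrap_mean_ci(stays)
print(f"mean = {estimate:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Fixing the seed makes the interval reproducible; a different seed gives a slightly different interval, which is expected with resampling methods.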

6
Q

Why might it be useful to view data with a histogram?

A
  • Count (or density) of continuous variable values within equally-sized bins expressed as a bar graph
  • Can stratify (subgroup) data by a factor (sex, etc.) and show separate histograms
  • Can overlay a Normal density plot to assess compatibility with a Normal distribution
7
Q

What is a Q-Q plot?

A

Scatterplot of the observed data value (y-axis) versus the expected value for that point in a Normal distribution with mean 0 and SD 1 (x-axis)

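A reference line is added; it is Y = X only when the data are standardized (z-scores), otherwise it is typically drawn through the quartiles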

Especially helpful to assess skew and behavior of tails (outlier-proneness)
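
To make the construction concrete, here is a small sketch that computes the points of a Normal Q-Q plot without drawing it. The function name and the plotting positions (i - 0.5)/n are assumptions; statistical packages use slightly different conventions for the expected quantiles.

```python
from statistics import NormalDist

def normal_qq_points(values):
    """Pair each sorted observation with its expected standard Normal quantile."""
    n = len(values)
    std_normal = NormalDist(0, 1)
    # Plotting positions (i - 0.5) / n are one common convention (assumed here).
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    # x = expected Normal quantile, y = observed value
    return list(zip(theoretical, sorted(values)))

for x, y in normal_qq_points([5.1, 4.8, 6.3, 5.0, 9.7]):
    print(f"{x:6.2f}  {y:5.1f}")
```

Plotting these pairs as a scatterplot gives the Q-Q plot; points that bend away from a straight line at the ends are the tail behavior the card describes.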

8
Q

What are some ways to interpret the shape of a Q-Q plot?

A
  • Points close to the line = Normal distribution
  • Substantial curving away from the line in the same direction at both ends = skew
  • S-shaped or reverse S-shaped curve = dearth or abundance of outliers (light- or heavy-tailedness)
9
Q

What are properties of a box plot?

A
  • Box covers the middle half of the data (from the 25th to the 75th percentile; its length is the IQR)
  • Solid line inside the box indicates the median
  • Whiskers extend from the quartiles to the most extreme values that are not judged by Tukey’s “fences” method to be candidate outliers
  • Fences sit at 25th percentile - 1.5 × IQR and 75th percentile + 1.5 × IQR; the whiskers stop at the last data points inside them
  • Outliers are the points beyond the fences
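
The fence rule can be sketched in a few lines of standard-library Python. The function name and data are hypothetical, and the quartile method chosen here ("inclusive") is an assumption; other software may compute quartiles a little differently.

```python
import statistics

def tukey_fences(values):
    """Fences, whisker ends and candidate outliers under Tukey's 1.5 x IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations still inside the fences.
    inside = [v for v in values if lower <= v <= upper]
    return {
        "fences": (lower, upper),
        "whiskers": (min(inside), max(inside)),
        "outliers": [v for v in values if v < lower or v > upper],
    }

result = tukey_fences([2, 4, 4, 5, 6, 7, 8, 9, 30])
print(result)
```

In this made-up sample the upper whisker stops at 9, well short of the upper fence at 14, and only the value 30 is flagged.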
10
Q

How is a violin plot helpful?

A
  • Traces the density of the data (almost like a histogram with very small bins)
  • Used as a supplement to boxplot to better show the shape of the distribution
11
Q

How does a normal distribution relate to SD?

A

Approximately 68% of the data would be within 1 SD of the mean
Approximately 95% of the data would be within 2 SD of the mean
Essentially all (99.7%) of the data would be within 3 SD of the mean
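
These percentages can be checked directly from the Normal cumulative distribution function. A short sketch (the function name is an assumption):

```python
from statistics import NormalDist

def share_within_k_sd(k):
    """Proportion of a Normal distribution lying within k SDs of its mean."""
    z = NormalDist()  # standard Normal: mean 0, SD 1
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within_k_sd(k):.4f}")
```

The values come out near 68.3%, 95.4% and 99.7%, which is why the rule is quoted as approximate.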

12
Q

What items should be assessed with plots and numerical summaries?

A

symmetry, skew, kurtosis (heavy- or light-tailedness, outlier-proneness)
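
As a numerical companion to the plots, skew and kurtosis can be summarized with moment-based statistics. The sketch below uses one common definition of each (population moments, excess kurtosis relative to the Normal); the function name and data are assumptions, and other definitions exist.

```python
import statistics

def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (0 and 0 for a Normal distribution)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

symmetric = [1, 2, 3, 4, 5]          # hypothetical, flat and symmetric
right_skewed = [1, 1, 2, 2, 3, 20]   # hypothetical, one long right tail

print(skewness_and_excess_kurtosis(symmetric))
print(skewness_and_excess_kurtosis(right_skewed))
```

Positive skewness points to a right tail, negative to a left tail; negative excess kurtosis suggests light tails and positive suggests heavy tails. With small samples these numbers are noisy, so they support the picture rather than replace it.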

13
Q

Normal distribution in histogram looks how?

A

symmetric and bell-shaped

14
Q

Normal distribution in box plot looks how?

A

box is symmetric around the median, as are the whiskers, without serious outliers

15
Q

Normal distribution in Q-Q plot looks how?

A

The data fall essentially on a straight line

16
Q

Right-skewed data

A

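Mean > median (if data are unimodal)
On a histogram, the long tail extends to the right of center
On a Q-Q plot, the points curve up and away from the line in both tails
On a boxplot, multiple dots (outliers) above the whiskers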

17
Q

Left-skewed data

A

  • Mean < median (if data are unimodal)
  • On a histogram, the long tail extends to the left of center
  • On a Q-Q plot, the points curve down and away from the line in both tails
  • On a boxplot, multiple dots (outliers) below the whiskers

18
Q

Outlier-proneness

Kurtosis

A

  • Indicated by “S”-shaped curves on a Normal Q-Q plot
  • Heavy-tailed but symmetric distributions are indicated by reverse “S”-shapes, as shown on the left below
  • Light-tailed but symmetric distributions are indicated by “S” shapes in the plot, as shown on the right below

19
Q

Plot of discrete data

A

The variable takes only a limited set of distinct values rather than a continuum, so points stack in steps on a Q-Q plot or dotplot

20
Q

Symmetric but outlier-prone (heavy-tailed)

A
21
Q

What can you assess with a scatterplot?

A
  • two continuous variables, typically response (outcome) on y-axis
  • Look for direction (positive or negative) and strength of a linear relationship (if one exists at all)
  • Also look for non-linear patterns in association (loess smooth curve can help here)
22
Q

What does Pearson correlation coefficient (r) assess?

A

Strength (and, through its sign, direction) of a linear association between the two continuous variables

23
Q

What does the r² value tell you?

A

Proportion of variability in the y-variable explained by a linear model in the x-variable
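
A small sketch of both quantities for a single predictor, written from the textbook formula (the function name and data are assumptions). It also shows why r only speaks to linear association: a perfect curved relationship can have r near 0.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
linear = [2, 4, 6, 8, 10]                          # hypothetical: exactly linear in x
centered = [-2, -1, 0, 1, 2]
curved = [v ** 2 for v in centered]                # hypothetical: perfect but non-linear

r_linear = pearson_r(x, linear)
r_curved = pearson_r(centered, curved)
print(r_linear, r_linear ** 2)   # r and r-squared for the linear case
print(r_curved)                  # close to 0 despite a perfect relationship
```

The function fails with a division by zero if either variable is constant, which matches the fact that correlation is undefined in that case.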

24
Q

What plots can help assess relationship between continuous and factor variables?

A
  • comparison boxplot
  • faceted histograms
  • overlapping density plot
  • ridgeline plot
25
How can we assess for non-linear terms with a plot?
  • A Spearman’s rho-squared plot helps identify which covariates have the most “predictive punch” and so are the best candidates for non-linear terms
  • The usual options to consider are (1) a restricted cubic spline (rcs, with 3-5 knots) for a quantitative variable, and (2) an interaction between categorical variables (or between a categorical and a quantitative variable)
26
Spending degrees of freedom (df)
  • Quantitative main effect = 1
  • Binary main effect = 1
  • Multi-categorical variable with L levels = L - 1
  • Restricted cubic spline (RCS) with n knots = n - 1 in total, so it adds n - 2 if the main effect is already included
27
How do we determine degrees of freedom for interaction terms?
  • If the product term’s predictors have df1 and df2 degrees of freedom, the product term adds df1 × df2 degrees of freedom.
  • An interaction of a binary and a quantitative variable adds 1 × 1 = 1 additional degree of freedom to the main effects model.
  • When we use a quantitative variable in both a spline and an interaction, we’ll do the interaction on the main effect, not the spline.
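
The degrees-of-freedom bookkeeping from the last two cards can be sketched as a pair of helper functions. The function names, the `kind` labels and the example model are all hypothetical.

```python
def main_effect_df(kind, levels=None, knots=None):
    """Degrees of freedom spent on one predictor's main effect."""
    if kind in ("quantitative", "binary"):
        return 1
    if kind == "multicategorical":
        return levels - 1
    if kind == "rcs":
        # Total df for a restricted cubic spline, linear term included.
        return knots - 1
    raise ValueError(f"unknown predictor kind: {kind}")

def interaction_df(df1, df2):
    """A product term adds df1 x df2 degrees of freedom."""
    return df1 * df2

# Hypothetical model: age as a 4-knot spline, binary sex, 4-level insurance,
# plus a sex-by-age interaction built on age's main effect (1 df), not the spline.
total_df = (
    main_effect_df("rcs", knots=4)
    + main_effect_df("binary")
    + main_effect_df("multicategorical", levels=4)
    + interaction_df(1, 1)
)
print(total_df)
```

Counting the total this way before fitting makes it easier to judge whether the sample size can support the planned non-linear terms.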
28
Interpret the comparison boxplots
  • BMIs of the two groups are roughly similar
  • The shape of the distribution of BMI for each group is slightly right skewed (more so in HTN)
  • But this may be OK, since ~5% of the sample being outliers is compatible with a Normal distribution
29
Interpret the histogram
  • Based on the histogram, AGE has a roughly uniform distribution (with a slight left skew)
  • In general, a peak on the left or a long tail to the right may indicate right skew
30
Interpret these comparison boxplots
Based on the plot we can conclude BMI appears higher in females across the two diagnoses
31
Interpret the comparison boxplot
Based on the plot the Baseline temperature values appear higher than those “after 2 hours”
32
Interpret the comparison boxplots
  • The boxplot shows that the baseline temperature is overall fairly similar between the three races.
  • It appears the distribution of baseline temperature for White (more so) and Black (less so) patients may be slightly left skewed, since several outliers exist below the lower whiskers of the boxplot
33
How are skew and outliers related?
Skew is asymmetry of a distribution whereas outliers are extreme individual data points. Outliers can cause skew by pulling the tail of the distribution, especially if very extreme. But not all skewed distributions are because of obvious outliers.
34
Characteristics of right-skewed data
  • AKA positive skew
  • Long tail on the right
  • Mean pulled higher than the median
  • May be caused by high outliers (e.g. income, hospital stays)
35
Characteristics of left-skewed data
  • AKA negative skew
  • Long tail on the left
  • Mean pulled lower than the median
  • May be caused by low outliers (e.g. age of retirement, since some retire very early)
36
Interpret the scatterplot
The relationship looks approximately linear; there is no apparent non-linear pattern between the variables.
37
Do you see any major violations of the regression assumptions?
1. Linear, because the values in the Residuals vs. Fitted plot look fairly linear (no clear curve)
2. Independent, because the values in the Residuals vs. Fitted plot don’t appear to have a “snake-like” pattern
3. Approximately Normally distributed: the Normal Q-Q plot looks OK, with possibly some values falling below the line
4. Equal variances, based on the spread of the data across the line in the Residuals vs. Fitted plot
5. No points with leverage above 0.5
38
Which transformation of the data best approximates a Normal distribution?
The natural log of viral load, log(load), is the best choice here. It’s the only transformation that produces a symmetric histogram and a straight line in the Normal Q-Q plot (orange).
39
Which of the plots would be most helpful in assessing whether taking an inverse of the outcome would improve the assumption of linearity?
Figure B gives us direct evidence as to the impact of choosing an inverse transformation and then fitting a regression model on the linearity assumption. None of the others do so. Plots E and F together at best would tell us something about the assumption of Normality.
40
Do you have any concerns with the regression assumptions being satisfied?
The residuals vs. fitted value plot shows a serious problem with the assumption of linearity.
41
Do you have any concerns with the regression assumptions being satisfied?
The point in row 18 shows high influence, leverage and poor fit (high standardized residual value in Normal Q-Q plot). We have a problem with the assumption of Normality here.
42
Do you have any concerns with the regression assumptions being satisfied?
This model has a serious problem with the assumption of constant variance (homoscedasticity), as seen in the fan shape in the residuals vs. fitted values plot, and the upward trend in the scale-location plot.
43
A trial was designed to assess change in DAS28 by therapy. Which type of test would be best?
Two sample (Student’s) t-test. These are independent samples and the data are reasonably close to a Normal distribution in each sample.
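
For reference, the pooled (Student's) two-sample t statistic can be computed by hand as below. This is a sketch with an assumed function name and made-up data; it returns only the statistic and its degrees of freedom, not a p-value or confidence interval.

```python
import math
import statistics

def pooled_t_statistic(group_a, group_b):
    """Student's two-sample t statistic (equal variances assumed) and its df."""
    n1, n2 = len(group_a), len(group_b)
    v1, v2 = statistics.variance(group_a), statistics.variance(group_b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (statistics.fmean(group_a) - statistics.fmean(group_b)) / se
    return t, n1 + n2 - 2

# Hypothetical independent samples
t_stat, df = pooled_t_statistic([1, 2, 3], [2, 3, 4])
print(t_stat, df)
```

The pooled version is appropriate only when the two samples are independent and have similar spreads; paired designs call for an analysis of the within-pair differences instead.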
44
Consider the Box-Cox plot below. What is the most promising strategy for fitting a linear regression model to describe the relationship?
The plot suggests lambda = 0.5, so fit the model with a square root transformation of the outcome
45
For a Box-Cox transformation, what does a lambda value of 1 suggest?
No transformation
46
For a Box-Cox transformation, what does a lambda value of 0.5 suggest?
Square root transformation: sqrt(y)
47
For a Box-Cox transformation, what does a lambda value of 0 suggest?
Log transformation: log(y)
48
For a Box-Cox transformation, what does a lambda value of -0.5 suggest?
Inverse square root transformation: 1/sqrt(y)
49
For a Box-Cox transformation, what does a lambda value of -1 suggest?
Inverse transformation: 1/y
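
The lambda-to-transformation mapping in the five cards above can be sketched as a single function. The function name is an assumption, and this applies the simple power "ladder" the cards describe rather than the scaled Box-Cox formula.

```python
import math

def power_transform(y, lam):
    """Apply the power transformation suggested by a Box-Cox lambda (y must be > 0)."""
    if y <= 0:
        raise ValueError("power transformations here require a positive outcome")
    if lam == 0:
        return math.log(y)  # lambda = 0 corresponds to the natural log
    return y ** lam         # 1: none, 0.5: sqrt, -0.5: 1/sqrt, -1: inverse

for lam in (1, 0.5, 0, -0.5, -1):
    print(lam, power_transform(4, lam))
```

In practice the lambda read off a Box-Cox plot is usually rounded to the nearest of these familiar values so the transformed outcome stays interpretable.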
50
Which analysis approach should be used to assess weight change in 50 overweight males in a before-after study of diet?
Available options:
  (1) One sample t test on diffs
  (2) Hmisc::smean.cl.boot on diffs
  (3) Two sample t-test for before – after
  (4) Welch Two Sample t-test for before – after
  (5) bootdif from Love-boost.R for before – after
  • The comparison in this design involves paired samples, specifically before-after comparisons of the same individuals. Options 1 and 2 are for paired samples, and options 3-5 are for independent samples comparisons. Thus, Figure 28B is completely irrelevant and should be ignored.
  • The large outliers in the plots of paired differences in Figure 28A suggest that the data do not follow a Normal distribution, which encourages us to use the bootstrap rather than a t test. That leads us to option 2, which shows the bootstrap result for a paired samples comparison.
  • Our estimate of the average weight change is a loss of 6.8 pounds. Since the 99% confidence interval (0.059, 13.402) for the population mean of the paired differences shown in option 2 doesn’t include zero, and specifically because all of the values in that interval are positive, we conclude that the mean weight loss was detectably greater than 0.
51
Based on the plot below, should we only include main effects of treatment and insurance in the model, or an interaction term between these two variables?
There’s a substantial interaction (lots of non-parallelism in the lines joining the group means) shown in the plot. We are thus inclined to favor a model that includes the interaction term between treatment and insurance.