9. Fri. Sept 14th Flashcards

1
Q

Quiz

A
  1. Why are there so many different post hoc tests?
    - They all attempt to balance Type I and Type II error, and none of them is perfect
  2. With a continuous variable x, the meaning of the beta is slope (the amount of change in y for each 1-unit change in x). What is the meaning of the beta when the x variable is categorical?
    - The difference between that group and the reference group
  3. What is the ultimate purpose of conducting post-hoc tests?
    - To adjust the pair-wise p-values to maintain the experiment-wise error rate
  4. If a relationship is known to be linear, how does Cottingham suggest you should distribute your treatments in an experiment?
    - Many treatment levels with few replicates
  5. If a relationship is known to be linear, how does Steury suggest you should distribute your treatments in an experiment?
    - All replicates in two treatment levels at the extremes in natural variation (because that maximizes R squared, which maximizes power)
2
Q

Mini review of what we’ve covered so far

A
  • We basically reviewed all of stats 7000
  • Need to know:
  • – There’s a general linear model: y = B0 + B1x + E
  • —- This model handles a continuous y with either a continuous x or a categorical x
  • —- In R you can use the same function, lm(), for both
  • —— The only thing that’s really different is the interpretation of beta
  • —- If x is continuous, beta is the slope; if x is categorical, beta is the difference between the groups. A categorical x may technically show up as more than one x in the model, but each is just a single category within the one x variable
  • The 5 Assumptions
    1.

Special case: when x could be either categorical or continuous
- His argument is always treat it as continuous unless there’s significant evidence of non-linearity
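
A minimal sketch of the point that lm() handles both kinds of x and only the interpretation of beta changes. The data are simulated and the true values (a slope of 2, a male-minus-female difference of 3) are made up for illustration; they are not from the class data.

```r
set.seed(1)  # simulated data, purely illustrative

n   <- 100
age <- runif(n, 0, 10)                          # continuous x
sex <- factor(rep(c("female", "male"), n / 2))  # categorical x, "female" is the reference

# Hypothetical truth: slope of 2 per unit of age; males 3 units larger than females
size_by_age <- 5 + 2 * age + rnorm(n)
size_by_sex <- 5 + 3 * (sex == "male") + rnorm(n)

fit_cont <- lm(size_by_age ~ age)  # beta for "age" is a slope
fit_cat  <- lm(size_by_sex ~ sex)  # beta "sexmale" is male minus female

coef(fit_cont)
coef(fit_cat)

# With a categorical x, the beta equals the difference between the group means
group_means <- tapply(size_by_sex, sex, mean)
diff_means  <- unname(group_means["male"] - group_means["female"])
```

The "sexmale" coefficient and diff_means should agree to rounding, which is the "difference between that group and the reference group" reading of beta.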

3
Q

Now we expand on this model for the rest of the class

A

We saw with the ANOVA that we don’t have to have just one x.

  • We can put as many x’s in here as we want
  • So the TRUE linear model looks like: y = B0 + B1x1 + B2x2 + ... + Bnxn + E
  • – The x’s can be any combination of categorical and continuous variables
  • The tests with fancy names are just slight variations on the linear model, with various combinations of x’s
  • – We’ll cover WHY you would want to do those fancy things, because you don’t HAVE to

In theory you CAN have an unlimited number of x’s, but in practice, given computational limits, you can’t.

  • These next models have to be fit ITERATIVELY
  • – Going back to the sum of squared error: the fit works by minimizing the sum of squared error (i.e., minimizing the standard deviation)
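
A small sketch of what "minimizing the sum of squared error" means, on simulated data with made-up true betas. For a plain linear model lm() does not need a search; optim() is used here only to make the idea of an iterative fit visible, and the starting values c(0, 0) are an arbitrary choice.

```r
set.seed(2)  # simulated data, purely illustrative

x <- runif(50, 0, 10)
y <- 1 + 0.5 * x + rnorm(50)  # hypothetical truth: B0 = 1, B1 = 0.5

# Sum of squared error for a candidate pair of betas b = c(B0, B1)
sse <- function(b) sum((y - (b[1] + b[2] * x))^2)

# Iterative search: start at (0, 0) and step downhill on the SSE surface
search <- optim(par = c(0, 0), fn = sse, method = "BFGS")

# lm() finds the same minimum directly
fit <- lm(y ~ x)

search$par  # betas found by the iterative search
coef(fit)   # betas from lm()
```

The two sets of betas should match closely. Each additional x adds another dimension to the surface being searched, which is why many x's get expensive.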

Think of each x as one dimension of the curve being fit to y.

  • You are practically limited to somewhere between 8-10 x’s
  • – And that is somewhat determined by your sample size
  • —– You don’t want 8 samples and 8 x’s
  • – You need a MINIMUM of 10 samples PER x variable
  • For the betas to be accurate on average, that’s what you need

After 12 to 14 x’s, your computer will start to smoke and explode. They just can’t do it.

There’s more computational power in your phone than NASA had to send men to the moon.

4
Q

Multi-Variable Model

A

A model with multiple x’s

  • Some people call them multi-variate models WHICH IS INCORRECT
  • – Those are something very specific that we talk about after the midterm

You do NOT have to use a multivariable model
Ex: You have a categorical x and a continuous x
- You can run two separate general linear models
- NOTHING WRONG WITH THAT

BUT
There are a number of advantages to running multivariable models AS OPPOSED to analyzing all your x’s separately

5
Q

6 advantages of multivariable models

A
  1. They are more elegant
    - The best experiments are the ones that don’t need any statistics at all. (You just throw up a bar chart)
    - But IT IS EASIER to get published if you do them
    - – It SHOULDN’T be that way, but people are impressed by high-power statistics
    - —- And that leads to unknowledgeable folks using high-powered statistics improperly

“If I can’t understand what you did, that’s a problem.”

  • And he’s a good statistician
  • Some professors just let it slide to not look dumb
  2. Because of “swamping” (his term)
    - When the effect of one x-variable is masked (or swamped out) by the effect of another x-variable
    - Goes back to the idea that one of the things that influences your p-value is noise/error in the system
    - – More noise makes it harder to detect a significant effect
    - So if there’s an x that exists that ISN’T in your model, it’s going to cause problems
  3. Collinearity
    - We’ll talk about this on Monday and Wednesday
    - When two or more x-variables are correlated with each other
    - It’s a HUGE problem in MANY sciences
    - Unfortunately, most ecologists do not understand how to deal with it
    — Ask them “I have collinearity, what do I do?” and they say “Just take one out,” which is often the worst thing you can do
    — Autocorrelation: when your different SAMPLES are related to each other
    — Collinearity: when your x VARIABLES are related to each other (totally different)
  4. You may have interactions
    - Interactions are THE COOLEST thing in ecology
    - – We’re gonna read a paper that’s in Science ONLY BECAUSE it has an interaction, which is coveted, and the people in the article didn’t even realize it
    - We’ll spend a whole week on that
    - An interaction is when the effect of one x-variable (the beta) depends on the VALUE of another x-variable
    - – Ex: the difference between males and females (the effect of age on size depends on which sex you’re talking about)
  5. We may want to include random blocking variables
    - Goes back to swamping, kind of
    - Happens when you measure something repeatedly (spatially or temporally) and you need to include that variable in the model because it explains some of the noise
    - We’ll spend about 2 weeks on this
    - Generally, doing this (including random blocking variables) is a good thing
  6. We may need to account for pseudo-replication
    - We’ll spend a few days talking about it
    - – Big problem, easy to do without knowing it
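
A rough sketch of what an interaction looks like in lm(), using the age/sex/size idea from the list above. Everything here is simulated: the dummy code (1 male, 0 female) and the true slopes (1 for females, 2 for males) are assumptions chosen for illustration.

```r
set.seed(3)  # simulated data, purely illustrative

n     <- 200
age   <- runif(n, 0, 10)
sex_n <- rep(c(0, 1), n / 2)  # assumed dummy code: 1 male, 0 female

# Hypothetical truth: females grow 1 unit per year, males grow 2 units per year
size <- 10 + 1 * age + 1 * age * sex_n + rnorm(n)

# age * sex_n expands to age + sex_n + age:sex_n
fit <- lm(size ~ age * sex_n)
b   <- coef(fit)

# The effect of age (its beta) depends on the value of sex_n
slope_female <- unname(b["age"])
slope_male   <- unname(b["age"] + b["age:sex_n"])

slope_female
slope_male
```

The "age:sex_n" coefficient is the amount by which the age slope changes when moving from females to males; if it were zero, the two lines would be parallel.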
6
Q

Example: An ANCOVA

A

The simplest multivariable model to understand
- Analysis of Covariance

Other definitions you’ll need to know in the literature:

  • Covariate: a continuous x
  • Factor: a categorical x

Typically has one continuous x and one categorical x
- y usually still has normally distributed error

BIG EXAMPLE
Continuous X: Age
Categorical X: Sex
Y: Size

We might have something that looks like [dot graph]
- ?Dif colors for male/female dots?

Size = B0 + B1Age + B2Sex + E
Dummy code: 1 male, 0 female

WE DO NOT HAVE TO RUN THIS MULTIVARIABLE MODEL

  • We could analyze the effects of age and sex separately
  • We’ll still get a good estimate
  • But the fitted line is odd: it falls between the two groups
  • And sigma is large: it has to absorb all the noise from the x that was left out

But if we run the multivariable model, we essentially get 2 equations
- 1 for females, 1 for males

Size Equation for Females
female size to age
B0 + B1Age + E

B0 = size of females at age 0
B1 = slope of the size/age relationship

Size Equation for Males
(B0 + B2) + B1Age + E
- B2 = the difference between males and females (with the dummy code, it is added only when Sex = 1)
- The slope doesn’t change. The only thing that changes is our intercept
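
A sketch of how the two equations fall out of one fitted model. The data are simulated with made-up values (B0 = 20, B1 = 3, B2 = 5), and SexN follows the dummy code above (1 male, 0 female).

```r
set.seed(4)  # simulated data, purely illustrative

n    <- 100
Age  <- runif(n, 0, 10)
SexN <- rep(c(0, 1), n / 2)                  # 1 male, 0 female
Size <- 20 + 3 * Age + 5 * SexN + rnorm(n)   # hypothetical B0, B1, B2

fit <- lm(Size ~ Age + SexN)
b   <- coef(fit)

female_intercept <- unname(b["(Intercept)"])              # B0
male_intercept   <- unname(b["(Intercept)"] + b["SexN"])  # B0 + B2
shared_slope     <- unname(b["Age"])                      # B1, same for both sexes

cat("Females: Size =", round(female_intercept, 2), "+", round(shared_slope, 2), "* Age\n")
cat("Males:   Size =", round(male_intercept, 2), "+", round(shared_slope, 2), "* Age\n")
```

Both printed lines share one slope and differ only in the intercept, which is the parallel-lines assumption of the ANCOVA.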

IMPORTANT
When you run the multivariable model, the MEANING of the betas is the same as if we had run univariable models
- B1 is still how much size changes for each unit change in age

What is the meaning of B2 in each case?

  • Univariable (GLM): the difference between males and females
  • Multivariable: STILL just the difference

The meaning of the betas hasn’t changed AT ALL

What has changed is SIGMA

  • The multivariable model gives us a MUCH smaller error
  • So p-values will go down and confidence intervals will narrow

That’s the essence of SWAMPING

  • If we ran a univariable model, we might not actually catch the effect of sex because there’s SO much error in the system
  • – Because there’s so much overlap in the data
  • – If we hadn’t added sex to the size-versus-age model, a CRAZY amount of noise would go unexplained

What if these lines aren’t parallel?

  • I don’t want to answer that for at least a week
  • – Because then you have an interaction, and it complicates things so much
  • —- We want to always ASSUME these lines are parallel unless we have significant evidence that they are not
7
Q

Let’s do this in R

A

Data in the syllabus

  • Look at it in Excel first

Col 1: Age
Col 2: Sex
Col 3: SexN
Col 4: Error
Col 5: Size (truth)

And error is 1
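
For anyone without the class file, here is one way a dataset with the same five columns could be simulated. This is not the syllabus data: the sample size, the true betas (10, 2, 1.5), the "F"/"M" labels, and the reading of "error is 1" as an error standard deviation of 1 are all assumptions.

```r
set.seed(5)  # simulated stand-in, NOT the class dataset

n     <- 50
Age   <- round(runif(n, 1, 20), 1)
Sex   <- sample(c("F", "M"), n, replace = TRUE)
SexN  <- ifelse(Sex == "M", 1, 0)       # dummy code: 1 male, 0 female
Error <- rnorm(n, mean = 0, sd = 1)     # assumed: error SD of 1

# Hypothetical truth: Size = B0 + B1*Age + B2*Sex + E
Size  <- 10 + 2 * Age + 1.5 * SexN + Error

datum <- data.frame(Age, Sex, SexN, Error, Size)
head(datum)
```

Because the data are built from known betas, any model fit to them can be checked against the truth, which is the point of making the data yourself.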

8
Q

In R

A

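# Pick the CSV file interactively and read it into a data frame
datum <- read.csv(file.choose())
head(datum)  # check the first few rows

plot(Size ~ Age, data = datum)  # scatterplot of size against age

# Univariable model: size as a function of age only
results <- lm(Size ~ Age, data = datum)
summary(results)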

9
Q

What he goes over in the summary(results)

A

If we don’t care about the sex

Estimate of truth
Standard error
Confidence interval (roughly the estimate ± 2 * standard error)
P-value
The KEY: R-squared: 98% of the variation in size is explained by age
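
A sketch of pulling those same pieces out of a fitted model, including a check of the "2 * standard error" shortcut against confint(). The data are simulated with made-up betas, so the numbers will not match the class output.

```r
set.seed(6)  # simulated data, purely illustrative

Age  <- runif(60, 0, 10)
Size <- 10 + 2 * Age + rnorm(60)  # hypothetical truth

fit  <- lm(Size ~ Age)
coefs <- coef(summary(fit))       # matrix: Estimate, Std. Error, t value, Pr(>|t|)

est <- coefs["Age", "Estimate"]
se  <- coefs["Age", "Std. Error"]
p   <- coefs["Age", "Pr(>|t|)"]

rough_ci <- c(est - 2 * se, est + 2 * se)  # the shortcut
exact_ci <- confint(fit)["Age", ]          # 95% interval using t quantiles

r2 <- summary(fit)$r.squared               # share of variation in Size explained

rough_ci
exact_ci
r2
```

With a few dozen samples the shortcut and the exact interval are nearly identical; the gap grows when the sample size is small.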
10
Q

THEN he does size as a function of sex, SAME data

A

# Univariable model: size as a function of sex only (same data)
results <- lm(Size ~ Sex, data = datum)
summary(results)

Estimate is good, but significance?
Confidence interval
P-value
R-squared

We KNOW there’s a variation in the data depending on sex BECAUSE WE MADE THE DATA

  • Why didn’t we catch it?
  • There’s so much noise
11
Q

So let’s do it as ANCOVA

A

results <- lm(Size ~ Sex + Age, data = datum)  # ANCOVA: sex and age together
summary(results)

Confidence interval (went from 9 to .5!!!)
P-value
Standard error

ALL we did was add age into the model

  • A prototypical example of swamping
  • All that noise caused by age has been explained
  • So consequently now we CAN detect the effect of sex
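
A self-contained sketch of swamping on simulated data, for comparison with the class run. The true betas and the "F"/"M" labels are made up, so the interval widths will differ from the 9 and 0.5 seen in class.

```r
set.seed(7)  # simulated data, purely illustrative

n    <- 100
Age  <- runif(n, 0, 20)
Sex  <- factor(rep(c("F", "M"), n / 2))
Size <- 10 + 2 * Age + 1.5 * (Sex == "M") + rnorm(n)  # hypothetical truth

fit_sex    <- lm(Size ~ Sex)        # age left out: its effect becomes noise
fit_ancova <- lm(Size ~ Sex + Age)  # age included: that noise is explained

# Residual standard error (sigma) of each model
sigma_sex    <- summary(fit_sex)$sigma
sigma_ancova <- summary(fit_ancova)$sigma

# Width of the 95% confidence interval for the sex effect
ci_width <- function(fit) unname(diff(confint(fit)["SexM", ]))
width_sex    <- ci_width(fit_sex)
width_ancova <- ci_width(fit_ancova)

c(sigma_sex = sigma_sex, sigma_ancova = sigma_ancova)
c(width_sex = width_sex, width_ancova = width_ancova)
```

Sigma and the interval for sex both shrink sharply once Age is in the model, even though nothing about the sex effect itself has changed.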
12
Q

Where would you see this

A

As a scientist, you might think there’s one factor that people haven’t studied yet

Including the x’s that are already known to matter is how you eliminate some of that noise

  • If there is some data that you don’t CARE about because other people have already studied it, STILL COLLECT THAT DATA
  • – Because if you don’t, the unexplained noise can swamp the effect you DO care about and leave your results non-significant
13
Q

Something else kind of important, tangential to the base topic but important to understand

A

Anyone use SAS before?

  • When you run a model like this in SAS, it always gives you 2 sets of results
  • – Type III sum of squares
  • – Type I sum of squares

It’s important EVEN IF you never use SAS
- You need to understand the difference between Type I sum of squares and Type III sum of squares

14
Q

Important take away

A

In Type I sums of squares, ORDER MATTERS (the order in which the x’s enter the model)

16
Q

I’m telling you because

A

If you do an ANOVA table of “results” in R, you get Type I sums of squares by default

You CAN get a Type III ANOVA table, but it takes a specialty function
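
A sketch of why order matters, on simulated data where Age and Sex are deliberately correlated (all values are made up). anova() on an lm fit gives sequential (Type I) sums of squares. The drop1() call is an aside that was not part of the lecture: for a model with main effects only, its marginal sums of squares match what a Type III table would report.

```r
set.seed(8)  # simulated data, purely illustrative

n   <- 100
Sex <- factor(sample(c("F", "M"), n, replace = TRUE))
# Make Age correlated with Sex so the two x's overlap in what they explain
Age  <- runif(n, 0, 10) + 3 * (Sex == "M")
Size <- 10 + 2 * Age + 1.5 * (Sex == "M") + rnorm(n)  # hypothetical truth

# Same model, two orders: anova() credits each x in the order it enters
tab_sex_first <- anova(lm(Size ~ Sex + Age))
tab_age_first <- anova(lm(Size ~ Age + Sex))

ss_sex_first <- tab_sex_first["Sex", "Sum Sq"]  # Sex gets credit before Age
ss_sex_last  <- tab_age_first["Sex", "Sum Sq"]  # Sex gets only what Age left over

# Marginal sums of squares: each x tested as if it entered last
marginal <- drop1(lm(Size ~ Sex + Age), test = "F")

c(ss_sex_first = ss_sex_first, ss_sex_last = ss_sex_last)
marginal
```

The sum of squares for Sex is very different between the two orders, while the marginal table does not depend on order at all.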

17
Q

Keys

A
  • The 6 reasons we do multivariable models
  • What swamping is, and why it’s important to run a multivariable model when it’s there