6. Sept. 5 Flashcards

1
Q

Quiz

A
  1. Stats equation for a line
    Y = β0 + β1X1 + ε, where ε ~ N(0, σ)
    Y equals beta zero plus beta one times X one, plus error that's normally distributed with a mean of zero and standard deviation of sigma

Confidence interval correct definition
If you repeated the sampling many times, 95% of such intervals would contain the true value

What are the 3 characteristics of the data and underlying relationship that influence the significance (p-value) of a regression analysis?
- Sample size, slope (effect size), and noise (residual variance)

Definition of R^2?
Proportion of variation in Y explained by variation in X
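The definition above can be checked numerically: R^2 is 1 minus the residual sum of squares over the total sum of squares. A minimal sketch on simulated data (the variable names and numbers here are illustrative, not from the notes):

```r
# Simulate data with a known linear relationship plus noise
set.seed(1)
X = runif(100, 0, 10)
Y = 2 + 3 * X + rnorm(100, mean = 0, sd = 2)
datum = data.frame(X = X, Y = Y)

results = lm(Y ~ X, data = datum)

# R^2 "by hand": proportion of variation in Y explained by variation in X
ss_res = sum(residuals(results)^2)                 # unexplained variation
ss_tot = sum((datum$Y - mean(datum$Y))^2)          # total variation in Y
r2_manual = 1 - ss_res / ss_tot

# Should match the R-squared reported by summary()
r2_summary = summary(results)$r.squared
```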

2
Q

5 Assumptions of General Linear Model

A
  1. Y is continuous
  2. Normal distribution of error
  3. Linear relationship
  4. Homoscedasticity (constant variance)
  5. No autocorrelation (lack of interdependence)
3
Q

How to look for violations of assumptions?

A

Plot it

4
Q

In R

A
datum=read.csv(file.choose())        # errors are non-normal
head(datum)
datumNonlin=read.csv(file.choose())  # non-linear relationship
head(datumNonlin)
datumHetero=read.csv(file.choose())  # heteroscedastic
head(datumHetero)
datumAuto=read.csv(file.choose())    # autocorrelated
head(datumAuto)

plot(Y~X,datum)

5
Q

Residuals Plot

A

Useful for spotting each of these violations
You have to run the analysis first

results=lm(Y~X,data=datum)
plot(residuals(results)~datum$X)
- Pulling from two different objects, so you have to use the dollar sign in datum$X to stipulate which data frame X comes from
- This takes the regression line and "flattens" it so it is now the x axis (horizontal, at 0)
- You can see the errors are non-normally distributed because the points are not balanced above and below 0

6
Q

Nonlinear

A

For the non-linear dataset:

plot(Y~X,data=datumNonlin)
resultsNonlin=lm(Y~X,data=datumNonlin)
abline(resultsNonlin)

plot(residuals(resultsNonlin)~datumNonlin$X)

7
Q

Hetero

A

plot(Y~X,data=datumHetero)
resultsHetero=lm(Y~X,data=datumHetero)
abline(resultsHetero)
plot(residuals(resultsHetero)~datumHetero$X)

8
Q

Autocorrelation

A

plot(Y~X,data=datumAuto)
resultsAuto=lm(Y~X,data=datumAuto)
abline(resultsAuto)
plot(residuals(resultsAuto)~datumAuto$X)

9
Q

Histogram of residuals

A

A way of looking at normality of residuals in a GLOBAL sense.
There are two ways of thinking about how NORMAL the data really is:
- Globally - how normal the residuals are around the entire line
- Locally - how normal each point is in relation to its neighbors

Important for things like:
If we fit a straight line through non-linear data, the residuals come out looking non-normally distributed even though the real problem is non-linearity

10
Q

Also, if you have supposedly TWO violations

A

You usually just have one BIG one, and the other is not so bad once you fix the first one

11
Q

Histogram in R

A

hist(residuals(results))

Best for looking at ____

12
Q

Bell curve "kurtosis"?

A

Where the curve is pinched: most of the distribution sits in a tiny sliver in the middle (high kurtosis)
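Kurtosis isn't in base R, but it can be computed by hand as the standardized fourth moment: roughly 3 for an ordinary bell curve, and larger when the curve is "pinched" in the middle with heavy tails. A sketch (the `kurtosis` helper is something we define here, not a base R function):

```r
set.seed(7)

# Sample kurtosis: average of the standardized values raised to the 4th power
kurtosis = function(x) mean(((x - mean(x)) / sd(x))^4)

normal_x  = rnorm(1e5)        # ordinary bell curve, kurtosis near 3
pinched_x = rt(1e5, df = 3)   # heavy-tailed, "pinched-middle" distribution

kurtosis(normal_x)
kurtosis(pinched_x)   # much larger than the normal's
```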

13
Q

ACF

A

Auto-correlation function

Both a graph AND a statistical test

datumNorm=read.csv(file.choose())
plot(Y~X,data=datumNorm)
resultsNorm=lm(Y~X,data=datumNorm)
abline(resultsNorm)

acf(residuals(resultsNorm)[order(datumNorm$X)])

- Still need things to be in X order, but we can't technically use a tilde because it's not a ____
- So you have to index with the brackets instead
- The data has to be in order spatially or temporally, and that happens with the order() command
- Not necessary if your data was already in order
- The plot will show that a point is perfectly correlated with itself (lag = 0, correlation = 1.0)

You're looking for correlation within the first few lags (1-4). Past lag 10 you really don't care much.

acf(residuals(resultsAuto) [order(datumAuto$X)])

14
Q

Nonlinear ACF

A

acf(residuals(resultsNonlin) [order(datumNonlin$X)])

It LOOKS like it's autocorrelated, but the relationship is LOCAL: neighboring points sit consistently on the same side of the line because the relationship is curved

SO, ACF is only good for autocorrelation.
- BUT remember it will also give a strong signature if you have non-linearity.

15
Q

SHIFT GEARS from assumptions to predictions

A

One of the real values of statistics is you can use them to make PREDICTIONS.
- You can easily use the GLM to make predictions

16
Q

GLM as an example

A
Y = β0 + β1X1 + ε, where ε ~ N(0, σ)
Y - biomass
X - rainfall
We can use the formula to go FORWARD and PREDICT
- Just plug everything in
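Plugging in can be sketched in R. The coefficients below are the ones that show up later for datumNorm (Y = 2.83 + 2.00 * X); the rainfall value is just an example:

```r
# Coefficients from the fitted biomass ~ rainfall line
b0 = 2.83   # intercept
b1 = 2.00   # slope

# Going FORWARD: plug a rainfall value into Y = b0 + b1*X
rainfall = 5
biomass = b0 + b1 * rainfall
biomass   # 12.83
```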
17
Q

Three things to talk about when making predictions

A
1) INTERPOLATION
- Anytime you make predictions, you need to be careful
- Ex: data from 0-10 cm of rainfall at 1 cm intervals; how much biomass did we get?
- Plug a value from inside that range into the equation
- This is INTERPOLATION: making predictions within the range of observed data
- It's GOOD, cool, one of statistics' STRENGTHS

What about if you go PAST what you measured (17 cm when you stopped at 10 cm)?

2) EXTRAPOLATION
Making predictions outside the range of observed values
- It's not BAD
- But you have to be VERY CAREFUL
- At BEST it is a hypothesis

WHY is it such a problem? Why be so careful?
- Because we have NO idea what happens to Y at that point
- Without extra data, we're being very ambitious
- Understand EXACTLY what you're doing if you choose to extrapolate

3) Measures of Uncertainty
Any time you give an estimate of truth, give a measure of uncertainty (confidence interval)
- For predictions, we use a slightly different measure: the PREDICTION interval

Confidence interval: a measure of uncertainty on the AVERAGE value of something
- Ex: the CI of a slope is the range we think the true slope might be in
- Remember, the fitted line gives the AVERAGE Y at a certain X value

Prediction interval
If we want to predict how much Y you get at a certain value of X:
- Predictions have MUCH more uncertainty
- Because they are about individual points, NOT averages (as confidence intervals are)
- They are measures of uncertainty in possible INDIVIDUAL outcomes
- Prediction intervals are much BIGGER (what we might see in ANY given outcome)

Prediction intervals are about individual outcomes (data points), while confidence intervals are about averages (slopes, means of groups)
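The difference can be seen directly by asking predict() for both interval types at the same X. This sketch uses simulated data (names and coefficients are illustrative):

```r
# Simulate data roughly matching the biomass ~ rainfall example
set.seed(42)
datum = data.frame(X = runif(50, 0, 10))
datum$Y = 2.83 + 2.00 * datum$X + rnorm(50, sd = 2)
results = lm(Y ~ X, data = datum)

NewX = data.frame(X = 5)

# Confidence interval: uncertainty in the AVERAGE Y at X = 5
ci = predict(results, NewX, interval = "confidence")
# Prediction interval: uncertainty in an INDIVIDUAL Y at X = 5
pi = predict(results, NewX, interval = "prediction")

ci_width = ci[1, "upr"] - ci[1, "lwr"]
pi_width = pi[1, "upr"] - pi[1, "lwr"]
# The prediction interval is always wider than the confidence interval
```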

18
Q

Predictions in R

A

plot(Y~X,data=datumNorm)
summary(resultsNorm)

Equation is Y = 2.83 + 2.00 * X

When I do predictions, I do them in a big sequence

x=seq(from=0,to=10,by=0.1)
x
NewX=data.frame(X=x)
head(NewX)

You have to create a new data frame that has that X data in it (the column name must match the predictor name used in the model)

19
Q

Create object

A

predictions=predict(resultsNorm, NewX, interval="prediction")
predictions

3 arguments:

  1. What you use to make the predictions (your fitted results)
  2. The inputs you want to make predictions AT (your newly created data frame)
  3. interval="prediction", which asks for prediction intervals
20
Q

Plot it last

A

matplot(NewX,predictions[,1:3],type="l")

Gives three lines: the predicted (fit) line plus the lower and upper prediction interval bounds