Further Theory Flashcards
Suppose we are interested in comparing population means between 4 groups. Compared to multiple pairwise t-tests, the post-hoc comparison tests after ANOVA are
Able to account for the experiment-wise error rate across all the pairwise comparisons.
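A minimal sketch of such a post-hoc test using Tukey's HSD from statsmodels; the four groups and their values are made up for illustration:

# Tukey's HSD adjusts for the experiment-wise error rate across all
# pairwise comparisons, unlike separate unadjusted t-tests.
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(loc, 1.0, 20) for loc in (5.0, 5.2, 5.1, 6.0)])
groups = np.repeat(["A", "B", "C", "D"], 20)

# Prints adjusted p-values and simultaneous confidence intervals.
print(pairwise_tukeyhsd(values, groups, alpha=0.05))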
SIMPLE LINEAR REGRESSION Assumption #1
The mean of error is 0, i.e., E(error)=0.
This is not too restrictive as long as the intercept β0 is included in the equation.
Including an appropriate β0 lets us assume the average value of the error in the population is zero.
Assumption #2:
error is mean independent of x, i.e., E(error|x) = E(error)
The average value of error does not depend on the value of x.
Assumption #1 and #2 are usually combined into one assumption:
the zero conditional mean assumption, i.e., E(error|x) = 0
zero conditional mean assumption is the key to?
Zero conditional mean assumption is the key to obtaining the OLS estimates b0 and b1 (ensuring unbiasedness)
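As a worked sketch (with made-up data), the OLS estimates follow directly from the sample moments: b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)², and b0 = ȳ − b1·x̄.

# Computing the OLS estimates b0 and b1 from sample moments.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)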
Assumption #3:
The value of the error associated with any particular value of y is independent of the error associated with any other value of y.
Often, this assumption is equivalently stated as: the sample at hand is a random sample obtained from the population;
that is, the errors are independent of each other.
Assumption #4 (Homoskedasticity)
The error has the same variance given any value of x
→ When the variance of the error depends on x, the error term exhibits heteroskedasticity (nonconstant variance).
Assumption #5 (Normality)
The error is normally distributed.
What do assumptions 4 and 5 ensure?
Assumptions #4 and #5 ensure the lowest variances of b0 and b1 as estimators of β0 and β1.
What are the assumptions mentioned above called?
The assumptions mentioned above are called the classical linear model assumptions.
What happens under all these assumptions?
OLS estimators are the minimum variance unbiased estimators
Scatter plot and the assumptions
Residuals vs. fitted/predicted values of ŷ
If all assumptions are met, the residuals should be randomly and symmetrically distributed around the horizontal line at zero.
If a clearly non-random pattern emerges from this plot, then one or more assumptions are probably violated.
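A minimal sketch of this diagnostic plot with statsmodels and matplotlib; the data are simulated for illustration:

# Residuals-vs-fitted plot as a visual check of the assumptions.
# A random, symmetric band around zero is what we hope to see.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 100)

model = sm.OLS(y, sm.add_constant(x)).fit()
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color="gray", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()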
What are standardized residuals?
They are the residuals divided by an estimate of their standard deviation. When plotted, their spread should remain constant as ŷ changes; only then is homoskedasticity not violated.
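Continuing the fitted model from the previous sketch, standardized (internally studentized) residuals are available through statsmodels' influence diagnostics:

# Standardized residuals; their spread should stay roughly constant
# across fitted values if homoskedasticity holds.
std_resid = model.get_influence().resid_studentized_internal
plt.scatter(model.fittedvalues, std_resid)
plt.axhline(0, color="gray", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Standardized residuals")
plt.show()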
What if the residual plots indicate the assumptions are violated?
Thinking about improving the model
An important cause of violating the assumptions is mis-specifying the relationship between the dependent variable and the independent variables.
For example, important independent variables may be left out and absorbed into the error term.
Ceteris paribus
Other relevant factors being equal (all else being equal; holding all other relevant factors constant)
Why is multiple linear regression analysis (compared to simple linear regression analysis) more able to make ceteris paribus inference?
By modeling the dependent variable as a function of multiple independent variables, multiple linear regression analysis can explicitly control for many other factors that simultaneously affect the dependent variable when we assess the effect of the focal independent variable on the dependent variable.
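A minimal sketch of this with statsmodels' formula API; the data frame and column names (wage, educ, exper) are hypothetical:

# The coefficient on educ estimates the effect of one more year of
# education on wage, holding exper fixed (ceteris paribus).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "wage": [9.5, 12.0, 15.5, 11.0, 18.0, 14.5],
    "educ": [10, 12, 16, 12, 18, 14],
    "exper": [2, 5, 3, 8, 6, 4],
})
fit = smf.ols("wage ~ educ + exper", data=df).fit()
print(fit.params)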
In assessing the linear relationship between two interval variables, what are common and different between Pearson’s coefficient of correlation and simple linear regression analysis?
Common: Both methods can indicate whether there exists a linear relationship between the two interval variables and, if yes, the direction (positive or negative) of the linear relationship.
Different: Pearson’s coefficient of correlation measures the strength of the linear relationship over the range [−1, 1], while the slope estimate in simple linear regression analysis measures the expected change in y given a one-unit change in x.
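The two are also directly linked: the slope estimate equals r scaled by the ratio of sample standard deviations, b1 = r · (s_y / s_x). A quick numeric check with made-up data:

# Verifying b1 = r * (s_y / s_x).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
print(np.isclose(b1, r * y.std(ddof=1) / x.std(ddof=1)))  # True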
A sampling distribution is a hypothetical distribution of a test statistic from
repeated samples of the same size, each from the same underlying population
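A minimal simulation sketch of this idea, drawing many same-size samples from one made-up population and collecting the sample means:

# The collected sample means approximate the sampling distribution of
# the mean: centered near mu with spread near sigma / sqrt(n).
import numpy as np

rng = np.random.default_rng(2)
population = rng.exponential(scale=2.0, size=100_000)
sample_means = [rng.choice(population, size=30).mean() for _ in range(5_000)]
print(np.mean(sample_means), np.std(sample_means))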
In a matched pair or randomized block design, we usually group observations from different samples together based on one variable mainly because
we want to control for the impact of this variable while investigating the impact of the focal variable on the outcome variable
Based on an estimated linear regression model, is the prediction interval or the confidence interval wider?
The prediction interval for a single value of y at a given x is always wider than the confidence interval for the mean of y at that x.
What is a prediction interval?
A prediction interval is a range that is likely to contain the response value of an individual new observation under specified settings of your predictors.
If Minitab calculates a prediction interval of 1350–1500 hours for a bulb produced under the conditions described above, we can be 95% confident that the lifetime of a new bulb produced with those settings will fall within that range.
You’ll note the prediction interval is wider than the confidence interval of the prediction. This will always be true, because additional uncertainty is involved when we want to predict a single response rather than a mean response.
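A minimal sketch with statsmodels showing both intervals at the same x; the data are simulated, and the obs_ci_* columns (prediction interval) come out wider than the mean_ci_* columns (confidence interval):

# Prediction vs. confidence intervals from a fitted OLS model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 1.5, 50)

model = sm.OLS(y, sm.add_constant(x)).fit()
new_x = sm.add_constant(np.array([5.0]), has_constant="add")
frame = model.get_prediction(new_x).summary_frame(alpha=0.05)
print(frame[["mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])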
What happens to the prediction and confidence intervals when we increase the sample size?
They get narrower.
When is an equation not linear?
- when it includes more than one parameter per predictor variable
- when a parameter is transformed (e.g., appears inside a function such as an exponential)
When is an equation linear?
A model is linear when each term is either a constant or the product of a parameter and a predictor variable. A linear equation is constructed by adding the results for each term. This constrains the equation to just one basic form: Response = constant + parameter × predictor + … + parameter × predictor.
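A minimal sketch contrasting the two, with made-up data: the linear form is fit by ordinary least squares, while the nonlinear form (a parameter inside an exponential) needs nonlinear least squares such as scipy's curve_fit:

# Linear:    y = b0 + b1*x                 (linear in b0 and b1)
# Nonlinear: y = theta0 * exp(theta1 * x)  (theta1 is transformed by exp)
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(4)
x = np.linspace(0, 5, 40)
y = 2.0 * np.exp(0.5 * x) + rng.normal(0, 0.5, 40)

b1, b0 = np.polyfit(x, y, 1)  # linear fit by least squares
(theta0, theta1), _ = curve_fit(
    lambda x, t0, t1: t0 * np.exp(t1 * x), x, y, p0=(1.0, 0.1)  # starting guess
)
print(b0, b1, theta0, theta1)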
If the Pearson coefficient of correlation is 0.64, how much of the variation is explained?
You have to square r: R² = 0.64² = 0.4096, so about 41% of the variation in y is explained.
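A quick numeric check that R² equals the squared Pearson correlation in simple linear regression; the data are made up:

# R-squared from the fit equals r**2.
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.1, 2.8, 4.5, 4.9, 6.3])

r = np.corrcoef(x, y)[0, 1]
fit = sm.OLS(y, sm.add_constant(x)).fit()
print(np.isclose(fit.rsquared, r ** 2))  # True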
Bivariate distribution?
A bivariate (joint) distribution gives the probability of each possible pair of values of two random variables in your scenario.
Marginal probability distribution
- Univariate probability distributions derived from joint probability distributions
- Obtained by summing across rows or down columns of the joint probability table
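A minimal sketch with a made-up 2×2 joint probability table:

# Marginal distributions from a joint probability table.
import numpy as np

joint = np.array([
    [0.10, 0.20],   # rows:    values of X
    [0.30, 0.40],   # columns: values of Y
])
marginal_x = joint.sum(axis=1)  # sum across each row  -> P(X) = [0.3, 0.7]
marginal_y = joint.sum(axis=0)  # sum down each column -> P(Y) = [0.4, 0.6]
print(marginal_x, marginal_y)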