Further Theory Flashcards

1
Q

Suppose we are interested in comparing population means between 4 groups. Compared to multiple pairwise T-tests, the post-hoc comparison tests after ANOVA are

A

Able to account for the experiment-wise error in each pairwise comparison.
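As a quick illustration of why the experiment-wise error matters, with 4 groups there are 6 pairwise comparisons; running each uncorrected t-test at α = 0.05 inflates the probability of at least one false positive well above 0.05 (a minimal sketch; the numbers are illustrative):

```python
# Family-wise (experiment-wise) error rate for uncorrected pairwise t-tests
# among k = 4 groups, each at alpha = 0.05 (assumes independent tests).
from math import comb

k = 4
alpha = 0.05
m = comb(k, 2)                  # number of pairwise comparisons: 6
fwer = 1 - (1 - alpha) ** m     # P(at least one false positive)
print(m, round(fwer, 3))        # 6 comparisons, FWER ≈ 0.265
```

Post-hoc procedures such as Tukey's HSD adjust the comparisons so the whole family of tests stays at the nominal α.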

2
Q

SIMPLE LINEAR REGRESSION Assumption #1

A

The mean of the error is 0, i.e., E(error) = 0.
This is not too restrictive as long as the intercept β0 is included in the equation.
Setting an appropriate β0 lets us assume the average value of the error in the population is zero.

3
Q

Assumption #2:

A

The error is mean-independent of x, i.e., E(error|x) = E(error)

The average value of error does not depend on the value of x.

4
Q

Assumptions #1 and #2 are usually combined into one assumption:

A

zero conditional mean assumption

5
Q

zero conditional mean assumption is the key to?

A

The zero conditional mean assumption is the key to obtaining the OLS estimates b0 and b1, and it ensures their unbiasedness.
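The OLS estimates themselves come from the sample moments: b1 = Sxy/Sxx and b0 = ȳ − b1·x̄. A minimal sketch with made-up data:

```python
# Sketch (made-up data): OLS slope and intercept from sample moments.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# b1 = sum of cross-deviations / sum of squared x-deviations
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()   # the fitted line passes through (x̄, ȳ)
print(round(b1, 2), round(b0, 2))   # 1.96 0.14
```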

6
Q

Assumption #3:

A

The value of the error associated with any particular value of y is independent of the error associated with any other value of y.

Often, this assumption is equivalently stated as: the sample at hand is a random sample obtained from the population.

In short, the errors are independent of each other.

7
Q

Assumption #4 (Homoskedasticity)

A

The error has the same variance given any value of x.

–> When the variance of the error depends on x, the error term exhibits heteroskedasticity (nonconstant variance).

8
Q

Assumption #5 (Normality)

A

The error is normally distributed.

9
Q

What do assumptions 4 and 5 ensure?

A

Assumptions #4 and #5 ensure the lowest variances of b0 and b1 as estimators of β0 and β1.

10
Q

What are the assumptions listed above called?

A

The assumptions mentioned above are called classical linear model assumptions

11
Q

What happens under all these assumptions?

A

OLS estimators are the minimum variance unbiased estimators

12
Q

Scatter plot and the assumptions

A

Residuals vs. fitted/predicted values ŷ.
If all assumptions are met, residuals should be randomly and symmetrically distributed around the horizontal line.
If a clearly non-random pattern emerges from this plot, then one or more assumptions are probably violated.
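A minimal sketch of this check with simulated data (synthetic model and seed are illustrative):

```python
# Sketch (synthetic data): compute residuals vs. fitted values; under the
# assumptions they should scatter randomly around zero.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=x.size)   # true line plus noise

b1, b0 = np.polyfit(x, y, 1)        # fitted slope and intercept
fitted = b0 + b1 * x
residuals = y - fitted

# OLS residuals sum to zero by construction; a residual plot would show
# them scattered with no pattern around the horizontal line.
print(round(residuals.mean(), 6))
```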

13
Q

What are standardized residuals?

A

They are the residuals divided by their estimated standard deviation, so they approximate the variation of the error terms. When plotted against ŷ, their spread should stay constant as ŷ changes; only then is homoskedasticity not violated.
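A minimal sketch (simulated data; the simple standardization shown here divides by one common residual standard deviation, ignoring leverage adjustments that full software packages apply):

```python
# Sketch: standardized residuals are raw residuals divided by their
# estimated standard deviation; for normal errors most fall within about ±2.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)
y = 1.0 + 2.0 * x + rng.normal(0, 1.5, size=x.size)

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)
std_res = residuals / residuals.std(ddof=2)   # ddof=2: two estimated parameters

share_within_2 = np.mean(np.abs(std_res) < 2)
print(round(share_within_2, 2))               # roughly 0.95 for normal errors
```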

14
Q

What if the residual plots indicate the assumptions are violated?

A

Think about improving the model.

An important cause of violated assumptions is mis-specifying the relationship between the dependent variable and the independent variables.
For example, important independent variables are left out and absorbed into the error term.

15
Q

Ceteris paribus

A

Other relevant factors being equal (all else being equal; holding all other relevant factors constant)

16
Q

Why is multiple linear regression analysis (compared to simple linear regression analysis) more able to make ceteris paribus inference?

A

By modeling the dependent variable as a function of multiple independent variables, multiple linear regression analysis can explicitly control for many other factors that simultaneously affect the dependent variable when we assess the effect of the focal independent variable on the dependent variable.

17
Q

In assessing the linear relationship between two interval variables, what are common and different between Pearson’s coefficient of correlation and simple linear regression analysis?

A

Common: Both methods can indicate whether there exists a linear relationship between the two interval variables and, if yes, the direction (positive or negative) of the linear relationship.
Different: Pearson’s coefficient of correlation measures the strength of the linear relationship on the range [−1, 1], while the slope estimate in simple linear regression analysis measures the expected change in y given a one-unit change in x.

18
Q

A sampling distribution is a hypothetical distribution of a test statistic from

A

repeated samples of the same size, each from the same underlying population
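The idea can be sketched by simulation (population, seed, and sample size are made up): draw many samples of the same size from one population and collect the statistic from each.

```python
# Sketch: simulate the sampling distribution of the sample mean by drawing
# repeated samples of the same size from the same population.
import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=2.0, size=100_000)   # a skewed population

n = 50
sample_means = np.array(
    [rng.choice(population, size=n).mean() for _ in range(2000)]
)
# The mean of the sampling distribution approximates the population mean,
# and by the CLT its shape is approximately normal even for a skewed population.
print(round(population.mean(), 2), round(sample_means.mean(), 2))
```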

19
Q

In a matched pair or randomized block design, we usually group observations from different samples together based on one variable mainly because

A

we want to control for the impact of this variable while investigating the impact of the focal variable on the outcome variable

20
Q

Based on an estimated linear regression model, is the prediction interval or the confidence interval wider?

A

The prediction interval for one value of y based on x is always wider than the confidence interval.

21
Q

What is a prediction interval?

A

A prediction interval is a range that is likely to contain the response value of an individual new observation under specified settings of your predictors.

If Minitab calculates a 95% prediction interval of 1350–1500 hours for a bulb produced under specified settings, we can be 95% confident that the lifetime of a new bulb produced with those settings will fall within that range.

You’ll note the prediction interval is wider than the confidence interval of the prediction. This will always be true, because additional uncertainty is involved when we want to predict a single response rather than a mean response.

22
Q

What happens when we increase the sample size to the prediction and confidence intervals?

A

They get smaller

23
Q

When is an equation not linear?

A
  • when it includes more than one parameter per predictor variable
  • when the parameter is transformed
24
Q

When is an equation linear?

A

A model is linear when each term is either a constant or the product of a parameter and a predictor variable. A linear equation is constructed by adding the results for each term. This constrains the equation to just one basic form:

Response = constant + parameter × predictor + … + parameter × predictor

25
Q

If Pearson’s coefficient of correlation is 0.64, how much of the variation is explained?

A

You have to square r: R² = 0.64² = 0.4096, so about 41% of the variation is explained.
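A quick worked check of this arithmetic:

```python
# r = 0.64, so R^2 = r^2 = 0.4096: about 41% of the variation in y
# is explained by the model.
r = 0.64
r_squared = r ** 2
print(round(r_squared, 4))   # 0.4096
```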

26
Q

Bivariate distribution?

A

A bivariate (joint) distribution gives the probabilities of events involving two random variables in your scenario. The distribution tells you the probability of each possible pair of outcomes.

27
Q

Marginal probability distribution

A
  • Univariate probability distributions derived from joint probability distributions
  • Obtained by summing across rows or down columns
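A minimal sketch with a made-up joint probability table:

```python
# Sketch (made-up table): marginal distributions come from summing a joint
# probability table across rows or down columns.
import numpy as np

# joint P(X=i, Y=j); rows index X, columns index Y
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

marginal_x = joint.sum(axis=1)   # sum across each row  -> P(X)
marginal_y = joint.sum(axis=0)   # sum down each column -> P(Y)
print(marginal_x, marginal_y)    # approximately [0.3 0.7] and [0.4 0.6]
```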
28
Q

Does multicollinearity hurt the model’s overall fit or prediction accuracy?

A

No, it does not. It only hinders the interpretation of the regression coefficients.

29
Q

What is there to know about the F-distribution?

A

It depends on two degrees of freedom. In ANOVA it is a right-tail test; in other tests it can also be a two-tail test.

30
Q

What is VIF (Variance Inflation Factor)

A

In statistics, the variance inflation factor is the quotient of the variance in a model with multiple terms by the variance of a model with one term alone. It quantifies the severity of multicollinearity in an ordinary least squares regression analysis.

31
Q

Unbalanced two-factor ANOVA

A

The term “unbalanced” means that the sample sizes n_kj are not all equal. A balanced design is one in which all n_kj = n.

32
Q

How do you interpret the slope of a linear regression?

A

b1 is the slope: for each additional unit of x, y changes by b1 on AVERAGE. Don’t forget “on average” (or “expected”)!

33
Q

What does R squared tell you?

A

R² is the percentage of the variation in the dependent variable that can be explained by the independent variable(s). The rest can’t be explained.

34
Q

Which test is good to test if two categorical variables are dependent of each other?

A

The chi-squared test of independence (contingency table test). You use the contingency table formula from the formula sheet, but you must first construct the table yourself.

35
Q

What is the assumption for the Chi-squared contingency test?

A

That the expected frequency for each cell is at least 5. (so a larger sample is desirable)
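The expected frequency for each cell is (row total × column total) / grand total; a minimal sketch with made-up counts:

```python
# Sketch (made-up counts): expected frequencies for a contingency table.
# The chi-squared test assumes every expected count is at least 5.
import numpy as np

observed = np.array([[20, 30],
                     [30, 20]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

expected = row_totals * col_totals / grand_total
print(expected)                   # every expected count here is 25.0
print(bool((expected >= 5).all()))
```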

36
Q

Which side is the Chi-squared test?

A

Always right tail

37
Q

What is the Chi-squared goodness of fit test good for?

A
  • to test independence between two categorical variables
  • to test for normality (whether a given random sample is drawn from a certain distribution)

38
Q

If some natural relationship exists between each pair of observations, then it is a

A

matched pair

39
Q

What is a paired sample t-test

A

The paired sample t-test, sometimes called the dependent sample t-test, is a statistical procedure used to determine whether the mean difference between two sets of observations is zero. In a paired sample t-test, each subject or entity is measured twice, resulting in pairs of observations.

40
Q

What if an answer option contains “will”

A

Be extremely careful, because it is probably wrong: “will” implies causation.

41
Q

What is a Variance Inflation Factor?

A

A variance inflation factor (VIF) detects multicollinearity in regression analysis. Multicollinearity is when there’s correlation between predictors (i.e. independent variables) in a model; its presence can adversely affect your regression results. The VIF estimates how much the variance of a regression coefficient is inflated due to multicollinearity in the model.

42
Q

How to interpret the VIF

A

Variance inflation factors range from 1 upwards. The numerical value of the VIF tells you (in decimal form) by what percentage the variance (i.e. the standard error squared) is inflated for each coefficient. For example, a VIF of 1.9 tells you that the variance of a particular coefficient is 90% bigger than what you would expect if there were no multicollinearity (i.e. no correlation with other predictors).
A rule of thumb for interpreting the variance inflation factor:

1 = not correlated.
Between 1 and 5 = moderately correlated.
Greater than 5 = highly correlated.
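The VIF for one predictor follows directly from the R² of regressing it on the other predictors, VIF = 1 / (1 − R²). A minimal sketch (the 0.474 input is made up to reproduce the VIF ≈ 1.9 example above):

```python
# VIF from the auxiliary regression's R^2: VIF = 1 / (1 - R^2).
def vif(r_squared: float) -> float:
    return 1.0 / (1.0 - r_squared)

print(round(vif(0.0), 2))    # 1.0  -> not correlated with the others
print(round(vif(0.474), 2))  # ~1.9 -> variance inflated by about 90%
```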

43
Q

After estimating a linear regression model with 3 independent variables, you decide to verify whether or not the degree of multicollinearity is acceptable. You then regress each independent variable against the remaining others, and obtain the following R² when using the first, second, and third variable as the dependent variable: 0.2, 0.3, 0.9. What would be the most appropriate course of action?

A

To remove the third variable, because it has a VIF of 10 and is thus highly explained by the other variables.
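A worked check of the card’s numbers:

```python
# R^2 values of 0.2, 0.3, and 0.9 give VIF = 1 / (1 - R^2) of
# 1.25, ~1.43, and 10 -- only the third exceeds the rule-of-thumb cutoff of 5.
r2_values = [0.2, 0.3, 0.9]
vifs = [1 / (1 - r2) for r2 in r2_values]
print([round(v, 2) for v in vifs])   # [1.25, 1.43, 10.0]
```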

44
Q

What does multicollinearity not affect?

A

the model’s overall fit or predictive capability. It only affects the interpretation of the coefficients

45
Q

What is the purpose of Randomized block design?

A

To reduce SSE and thus the Type II error rate

46
Q

balanced, unbalanced design

A

In ANOVA and Design of Experiments, a balanced design has an equal number of observations for all possible level combinations. This is compared to an unbalanced design, which has an unequal number of observations.

47
Q

What are levels in Two-way ANOVA?

A

Levels (sometimes called groups) are different groups of observations for the same independent variable.

e.g. education as one factor, with levels high school, college, and university

48
Q

What three hypothesis can we test in a two-way ANOVA?

A
  1. test for differences between the levels of factor A
  2. test for differences between the levels of factor B
  3. test for the interaction effect between A and B
49
Q

RBD blocking criteria

A
  • you can identify a characteristic of the experimental subjects that can group them into more homogeneous subgroups (blocks) with respect to the outcome variable
  • the blocking characteristic is something that has a substantial impact on the outcome variable but is not of focal interest
50
Q

What is the difference btw. matched t-tests and RBD

A
  • the matched t-test is used to determine whether the mean difference between two sets of observations is zero; each subject or entity is measured twice, resulting in pairs of observations
  • RBD extends this idea beyond pairs: each block can contain more than two treatment conditions
51
Q

RBD What is it?

A

For randomized block designs, there is one factor or variable that is of primary interest. However, there are also several other nuisance factors.

A randomized block design is an experimental design where the experimental units are in groups called blocks. The treatments are randomly allocated to the experimental units inside each block.

With a randomized block design, the experimenter divides subjects into subgroups called blocks, such that the variability within blocks is less than the variability between blocks.

Then, subjects within each block are randomly assigned to treatment conditions. Compared to a completely randomized design, this design reduces variability within treatment conditions and potential confounding, producing a better estimate of treatment effects.

52
Q

Is there enough evidence at the 1% significance level to infer that the x and the y are linearly related?

A

Conduct a two-tail t-test to see whether β1 = 0 or not. Rejection region: |t| > t0.005,df

53
Q

How to find out if we want to test a proportion or if population means differ?

A

Proportions are tested when we have categorical data and want to know the share of successes. Means are tested when we have many interval-scale observations per sample and want to make an inference about the population mean.

54
Q

what is a prediction interval?

A

In linear regression, a prediction interval is used when we want to predict a one-time occurrence: a particular value of y when the independent variable takes a given value x.

55
Q

The ratio of two independent chi-squared variables, each divided by its degrees of freedom, is

A

F-distributed

56
Q

Is there enough evidence at the 1% significance level to infer that the average number of hours of exercise per week and the age at death are linearly related?

Which test do we use?

A

Note this is a two-tail test, so split 1% into equal halves in looking for rejection regions.

Rejection region: | t | > t0.005,36 = 2.724

Because we are testing if they are linearly related. It is not indicated whether positive or negative.

57
Q

Is there enough evidence at the 5% significance level to infer that the cholesterol level and the age at death are negatively linearly related?

which test do we use?

A

Note that this is a left-tail test. Because we are proving negative relation.

58
Q

The expected value of the difference of two sample means equals the difference of the corresponding population means when

A

a. The populations are normally distributed.
b. The samples are independent.
c. The populations are approximately normal and the sample sizes are large.
D. All of these choices are true. (yes)

59
Q

Is the regression model as a whole valid? Give your assessment.

A

Yes. The F-test for the validity of the model is highly significant (F = 20.577; p-value = .000). In addition, there is no reason to suspect a multicollinearity problem, since all VIF statistics are close to 1 and well below 5. The explanatory power of the model is high, indicating that the model has a good linear fit.

60
Q

The covariance of two variables X and Y

A

Can be any real number

61
Q

H0 chi square

A

The null hypothesis for a chi-square independence test is that two categorical variables are independent in some population

62
Q

In a matched pair or randomized block design, we usually group observations from different samples together based on one variable mainly because

A

we want to control for the impact of this variable while investigating the impact of the focal variable on the outcome variable

63
Q

What happens to the prediction or confidence interval when xg is closer to the sample mean of x?

A

It gets narrower

64
Q

What is perfect multicollinearity?

A

when one of the independent variables is a linear combination of the other independent variables

65
Q

F-Distribution

A

The F-distribution is also called the variance-ratio distribution, as it describes the ratio of the variances of two normally distributed populations.

66
Q

Difference between Chi and F-distribution

A

The F-test is used for testing equality of two variances from different populations and for testing equality of several means with the technique of ANOVA.
Chi-square test is used to test the population variance against a specified value, testing goodness of fit of some probability distribution and testing for independence of two attributes.

67
Q

How is the VIF created?

A

By regressing one independent variable on all other independent variables.

68
Q

What does a very large VIF indicate?

A

A large VIF with one independent variable indicates the information contained in this variable is mostly redundant

69
Q

In the randomized block design, we divide subjects into different blocks to

A

Reduce the variation caused by error.