Further Theory Flashcards
Suppose we are interested in comparing population means between 4 groups. Compared to multiple pairwise t-tests, the post-hoc comparison tests after ANOVA are
Able to account for the experiment-wise error rate across all the pairwise comparisons.
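A minimal sketch of such a post-hoc test using Tukey's HSD from statsmodels; the four groups and their values are made up for illustration:

# Tukey's HSD adjusts for the experiment-wise error rate across all
# pairwise comparisons, unlike separate unadjusted t-tests.
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(loc, 1.0, 20) for loc in (5.0, 5.2, 5.1, 6.0)])
groups = np.repeat(["A", "B", "C", "D"], 20)

# Prints adjusted p-values and simultaneous confidence intervals.
print(pairwise_tukeyhsd(values, groups, alpha=0.05))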
SIMPLE LINEAR REGRESSION Assumption #1
The mean of error is 0, i.e., E(error)=0.
This is not too restrictive as long as the intercept β0 is included in the equation.
Including an appropriate β0 lets us assume the average value of the error in the population is zero.
Assumption #2:
error is mean independent of x, i.e., E(error|x) = E(error)
The average value of error does not depend on the value of x.
Assumption #1 and #2 are usually combined into one assumption:
the zero conditional mean assumption, i.e., E(error|x) = 0
zero conditional mean assumption is the key to?
Zero conditional mean assumption is the key to obtaining the OLS estimates b0 and b1 (ensuring unbiasedness)
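As a worked sketch (with made-up data), the OLS estimates follow directly from the sample moments: b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)², and b0 = ȳ − b1·x̄.

# Computing the OLS estimates b0 and b1 from sample moments.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)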
Assumption #3:
The value of the error associated with any particular value of y is independent of the error associated with any other value of y.
Often, this assumption is equivalently stated as: the sample at hand is a random sample obtained from the population;
that is, the errors are independent of each other.
Assumption #4 (Homoskedasticity)
The error has the same variance given any value of x
→ When the variance of the error depends on x, the error term exhibits heteroskedasticity (nonconstant variance).
Assumption #5 (Normality)
The error is normally distributed.
What do assumptions 4 and 5 ensure?
Assumptions #4 and #5 ensure the lowest variances of b0 and b1 as estimators of β0 and β1.
What are the assumptions mentioned above called?
The assumptions mentioned above are called the classical linear model assumptions.
What happens under all these assumptions?
OLS estimators are the minimum variance unbiased estimators
Scatter plot and the assumptions
Residuals vs. fitted/predicted values of ŷ
If all assumptions are met, the residuals should be randomly and symmetrically distributed around the horizontal line at zero.
If a clearly non-random pattern emerges from this plot, then one or more assumptions are probably violated.
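A minimal sketch of this diagnostic plot with statsmodels and matplotlib; the data are simulated for illustration:

# Residuals-vs-fitted plot as a visual check of the assumptions.
# A random, symmetric band around zero is what we hope to see.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 100)

model = sm.OLS(y, sm.add_constant(x)).fit()
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color="gray", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()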
What are standardized residuals?
They are the residuals divided by an estimate of their standard deviation. When plotted, their spread should remain constant as ŷ changes; only then is homoskedasticity not violated.
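Continuing the fitted model from the previous sketch, standardized (internally studentized) residuals are available through statsmodels' influence diagnostics:

# Standardized residuals; their spread should stay roughly constant
# across fitted values if homoskedasticity holds.
std_resid = model.get_influence().resid_studentized_internal
plt.scatter(model.fittedvalues, std_resid)
plt.axhline(0, color="gray", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Standardized residuals")
plt.show()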
What if the residual plots indicate the assumptions are violated?
Thinking about improving the model
An important cause of violating the assumptions is mis-specifying the relationship between the dependent variable and the independent variables.
For example, important independent variables may be left out and absorbed into the error term.
Ceteris paribus
Other relevant factors being equal (all else being equal; holding all other relevant factors constant)
Why is multiple linear regression analysis (compared to simple linear regression analysis) more able to make ceteris paribus inference?
By modeling the dependent variable as a function of multiple independent variables, multiple linear regression analysis can explicitly control for many other factors that simultaneously affect the dependent variable when we assess the effect of the focal independent variable on the dependent variable.
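A minimal sketch of this with statsmodels' formula API; the data frame and column names (wage, educ, exper) are hypothetical:

# The coefficient on educ estimates the effect of one more year of
# education on wage, holding exper fixed (ceteris paribus).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "wage": [9.5, 12.0, 15.5, 11.0, 18.0, 14.5],
    "educ": [10, 12, 16, 12, 18, 14],
    "exper": [2, 5, 3, 8, 6, 4],
})
fit = smf.ols("wage ~ educ + exper", data=df).fit()
print(fit.params)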
In assessing the linear relationship between two interval variables, what are common and different between Pearson’s coefficient of correlation and simple linear regression analysis?
Common: Both methods can indicate whether there exists a linear relationship between the two interval variables and, if yes, the direction (positive or negative) of the linear relationship.
Different: Pearson’s coefficient of correlation measures the strength of the linear relationship over the range [−1, 1], while the slope estimate in simple linear regression analysis measures the expected change in y given a one-unit change in x.
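The two are also directly linked: the slope estimate equals r scaled by the ratio of sample standard deviations, b1 = r · (s_y / s_x). A quick numeric check with made-up data:

# Verifying b1 = r * (s_y / s_x).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
print(np.isclose(b1, r * y.std(ddof=1) / x.std(ddof=1)))  # True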
A sampling distribution is a hypothetical distribution of a test statistic from
repeated samples of the same size, each from the same underlying population
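A minimal simulation sketch of this idea, drawing many same-size samples from one made-up population and collecting the sample means:

# The collected sample means approximate the sampling distribution of
# the mean: centered near mu with spread near sigma / sqrt(n).
import numpy as np

rng = np.random.default_rng(2)
population = rng.exponential(scale=2.0, size=100_000)
sample_means = [rng.choice(population, size=30).mean() for _ in range(5_000)]
print(np.mean(sample_means), np.std(sample_means))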
In a matched pair or randomized block design, we usually group observations from different samples together based on one variable mainly because
we want to control for the impact of this variable while investigating the impact of the focal variable on the outcome variable
Based on an estimated linear regression model, is the prediction interval or the confidence interval wider?
The prediction interval for a single value of y at a given x is always wider than the confidence interval for the mean of y at that x.
What is a prediction interval?
A prediction interval is a range that is likely to contain the response value of an individual new observation under specified settings of your predictors.
If Minitab calculates a prediction interval of 1350–1500 hours for a bulb produced under the conditions described above, we can be 95% confident that the lifetime of a new bulb produced with those settings will fall within that range.
You’ll note the prediction interval is wider than the confidence interval of the prediction. This will always be true, because additional uncertainty is involved when we want to predict a single response rather than a mean response.
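A minimal sketch with statsmodels showing both intervals at the same x; the data are simulated, and the obs_ci_* columns (prediction interval) come out wider than the mean_ci_* columns (confidence interval):

# Prediction vs. confidence intervals from a fitted OLS model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 1.5, 50)

model = sm.OLS(y, sm.add_constant(x)).fit()
new_x = sm.add_constant(np.array([5.0]), has_constant="add")
frame = model.get_prediction(new_x).summary_frame(alpha=0.05)
print(frame[["mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])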
What happens to the prediction and confidence intervals when we increase the sample size?
They get narrower.
When is an equation not linear?
- when it includes more than one parameter per predictor variable
- when a parameter is transformed (e.g., appears inside a function such as an exponential)
When is an equation linear?
A model is linear when each term is either a constant or the product of a parameter and a predictor variable. A linear equation is constructed by adding the results for each term. This constrains the equation to just one basic form: Response = constant + parameter × predictor + … + parameter × predictor.
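A minimal sketch contrasting the two, with made-up data: the linear form is fit by ordinary least squares, while the nonlinear form (a parameter inside an exponential) needs nonlinear least squares such as scipy's curve_fit:

# Linear:    y = b0 + b1*x                 (linear in b0 and b1)
# Nonlinear: y = theta0 * exp(theta1 * x)  (theta1 is transformed by exp)
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(4)
x = np.linspace(0, 5, 40)
y = 2.0 * np.exp(0.5 * x) + rng.normal(0, 0.5, 40)

b1, b0 = np.polyfit(x, y, 1)  # linear fit by least squares
(theta0, theta1), _ = curve_fit(
    lambda x, t0, t1: t0 * np.exp(t1 * x), x, y, p0=(1.0, 0.1)  # starting guess
)
print(b0, b1, theta0, theta1)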
If the Pearson coefficient of correlation is 0.64, how much of the variation is explained?
You have to square r: R² = 0.64² = 0.4096, so about 41% of the variation in y is explained.
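A quick numeric check that R² equals the squared Pearson correlation in simple linear regression; the data are made up:

# R-squared from the fit equals r**2.
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.1, 2.8, 4.5, 4.9, 6.3])

r = np.corrcoef(x, y)[0, 1]
fit = sm.OLS(y, sm.add_constant(x)).fit()
print(np.isclose(fit.rsquared, r ** 2))  # True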
Bivariate distribution?
A bivariate (joint) distribution gives the probability of each possible pair of values of two random variables in your scenario.
Marginal probability distribution
- Univariate probability distributions derived from joint probability distributions
- Obtained by summing across rows or down columns of the joint probability table
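A minimal sketch with a made-up 2×2 joint probability table:

# Marginal distributions from a joint probability table.
import numpy as np

joint = np.array([
    [0.10, 0.20],   # rows:    values of X
    [0.30, 0.40],   # columns: values of Y
])
marginal_x = joint.sum(axis=1)  # sum across each row  -> P(X) = [0.3, 0.7]
marginal_y = joint.sum(axis=0)  # sum down each column -> P(Y) = [0.4, 0.6]
print(marginal_x, marginal_y)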