Stats Final MCQ Flashcards

1
Q

Covariation

A

an unstandardized statistical measure summarizing the general pattern of association (or lack thereof) between two continuous variables

2
Q

Covariation is a measure of the degree to which

A

two variables change together.

3
Q

Positive covariation occurs when

A

two variables tend to increase or decrease together.

4
Q

Negative covariation occurs when

A

one variable tends to increase while the other decreases

5
Q

Covariation doesn’t necessarily mean

A

causation

6
Q

Covariation can be influenced by

A

other variables

7
Q

Covariation can be used to make

A

predictions!

8
Q

Sometimes, two variables might appear to be positively or negatively correlated, but the relationship is actually

A

being influenced by a third variable.

9
Q

Although covariation is a useful measure for understanding whether a relationship between two variables is positive, negative, or does not exist, it has some drawbacks

A
  • Covariation does not measure the strength of the relationship between two variables: it is not standardized, meaning its value can vary widely based on the units of the variables involved.
10
Q

Correlation coefficient

A

a measure that quantifies the strength and direction of the linear association between two continuous variables

11
Q

the most commonly employed correlation coefficient

A

Pearson’s r

12
Q

Pearson’s r:

A

a statistical measure that quantifies the strength and direction of the linear relationship between two continuous variables, ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation).
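A minimal sketch of computing Pearson's r in Python with SciPy (the data here are hypothetical, invented for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical data: hours studied vs. exam score
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([52, 55, 61, 60, 68, 70, 75, 79])

r, p_value = stats.pearsonr(x, y)
print(f"Pearson's r = {r:.3f}, p = {p_value:.4f}")  # r close to 1 indicates a strong positive linear association
```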

13
Q

Correlation specifically assesses linear relationships; non-linear relationships

A

may not be adequately represented by the correlation coefficient

14
Q

Correlation can be heavily influenced by

A

outliers, which may skew the results

15
Q

Outlier

A

a data point that significantly deviates from the other observations in a dataset, often appearing as an unusually high or low value.

16
Q

We can perform hypothesis testing to determine the

A

statistical significance of our correlation coefficient.

17
Q

In hypothesis testing, reject H0 if

A

the absolute value of the test statistic |t| is greater than the critical value

18
Q

Fail to reject H0 if

A

|t| is less than or equal to the critical value

19
Q

If you rejected H0, conclude that there is

A

a statistically significant correlation between the variables

20
Q

To report the results of your test, include the

A

correlation coefficient r, the test statistic t, the degrees of freedom, and the p-value associated with the test.
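A sketch of that test and report, using the standard formula t = r·√(n−2)/√(1−r²) with df = n − 2 (the r and n here are made up):

```python
import math
from scipy import stats

r, n = 0.62, 30                           # hypothetical correlation and sample size
df = n - 2
t = r * math.sqrt(df) / math.sqrt(1 - r**2)
p = 2 * stats.t.sf(abs(t), df)            # two-tailed p-value
critical = stats.t.ppf(0.975, df)         # critical value at alpha = 0.05

# Report r, t, df, and p, e.g. "r = .62, t(28) = 4.18, p < .001"
print(f"r = {r}, t({df}) = {t:.2f}, critical value = {critical:.2f}, p = {p:.4f}")
```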

21
Q

Choose a difference in means test

A

when testing 2 variables, if
-the independent variable is categorical and the dependent variable is numeric
-the numeric dependent variable is normally distributed, and
-you are interested in the difference in the average values of the dependent variable across the categories of the independent variable

22
Q

Steps for hypothesis testing:

A
  1. State the null hypothesis.
  2. Set a critical value.
  3. Calculate a test statistic.
  4. Compare the test statistic to the critical value.
  5. Find the p-value.
  6. Compare the p-value of your data to the critical value’s significance level.
23
Q

Identify the critical value.

A

To identify the critical value for this test, we need to know the sample size (N) and the number of categories in our categorical variable, which we can then use to calculate the degrees of freedom (df)

24
Q

We calculate the sample size (N) as

A

the number of observations in our dataset for our independent and dependent variables

25
we calculate the degrees of freedom (df) by
summing the number of observations in each category of the independent variable, then subtracting the number of categories
26
Difference in means testing relies on the
Student’s t-distribution
27
The student’s t distribution is not the
distribution of either of your variables. Rather, it is the distribution of the values that the differences in sample means can take
28
when the sample size is small (less than 30 observations),
the Student’s t-distribution shows more variability (i.e., is flatter) than the normal distribution
29
Our difference in means test statistic (t) is greater than our critical value! But what does that mean?
We can reject the null hypothesis that no relationship exists between partisan affiliation and approval ratings of President Biden
30
The difference in means test is known as
an independent sample t-test
31
Independent sample t-test
a statistical test that compares the means of two independent groups to see if there is a significant difference.
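A minimal sketch of an independent samples t-test with SciPy (the approval-rating numbers are invented for illustration):

```python
from scipy import stats

group_a = [62, 55, 70, 58, 65, 61, 59, 67]   # hypothetical approval ratings, group A
group_b = [48, 52, 45, 50, 55, 47, 49, 53]   # hypothetical approval ratings, group B

t_stat, p_value = stats.ttest_ind(group_a, group_b)  # assumes equal variances by default
df = len(group_a) + len(group_b) - 2                 # observations summed, minus the number of categories
print(f"t({df}) = {t_stat:.2f}, p = {p_value:.4f}")
```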
32
Paired Samples t-test (Dependent t-test)
a statistical test that compares the means of two related groups or matched pairs
33
One-Sample t-test
a statistical test that compares the mean of a single sample to a known value (often the population mean)
34
Each of these tests (the independent sample t-test, the paired samples t-test, and the one-sample t-test) can be
one- or two-tailed
35
One-tailed t-test
a statistical test used to determine if there is a significant difference in the means of two groups, with a specific directional hypothesis
36
The one-tailed t-test only looks at
one end (tail) of the distribution
37
Two-tailed t-test
a statistical test used to determine if there is a significant difference in the means of two groups, without specifying a direction.
38
Two-tailed t-test:
The hypothesis does not specify a direction of the effect.
39
The two-tailed t-test looks at
both ends (tails) of the distribution.
40
in a two-tailed t-test, the significance level (alpha) is split between the
two tails of the distribution
41
to determine which t-test you need, you need to know
1) What kind of samples are you comparing: an independent pair, a matched pair, or a sample and a population?
2) Does your hypothesis have a direction?
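As a sketch, recent versions of SciPy let you choose the tails through the `alternative` argument (data invented):

```python
from scipy import stats

a = [5.1, 6.2, 5.8, 6.0, 5.5]
b = [4.2, 4.8, 5.0, 4.5, 4.7]

# Two-tailed: H1 is simply "the means differ"; alpha is split across both tails.
t2, p2 = stats.ttest_ind(a, b, alternative="two-sided")

# One-tailed: H1 specifies a direction, e.g. mean(a) > mean(b).
t1, p1 = stats.ttest_ind(a, b, alternative="greater")

# Because the observed difference is in the hypothesized direction,
# the one-tailed p is half the two-tailed p here.
print(p2, p1)
```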
42
While t-tests are primarily designed for comparing means, they can be useful in contexts related to means or differences in means, as well as in regression analysis to assess the significance of
predictors
43
Assumptions of difference in means (independent, two sample) t- tests
* Normality
* Homogeneity of variances
44
* Normality
The data in each group should be approximately normally distributed.
* This assumption is particularly important when sample sizes are small (typically n < 30).
* For larger sample sizes, the t-test is robust to violations of normality due to the Central Limit Theorem.
45
* Homogeneity of Variances
The variances of the two groups should be equal (or approximately equal).
* Homogeneity can be tested using Levene's test or Bartlett's test.
* If the variances are significantly different, you may need to use Welch's t-test, which does not assume equal variances.
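A sketch of that workflow: test homogeneity with Levene's test, then fall back to Welch's t-test if the variances differ (hypothetical data):

```python
from scipy import stats

a = [10.1, 12.3, 11.8, 13.0, 12.5, 11.1]
b = [9.0, 15.2, 8.4, 16.1, 7.9, 14.8]    # visibly more spread out

lev_stat, lev_p = stats.levene(a, b)      # H0: the groups have equal variances
if lev_p < 0.05:
    t, p = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test
else:
    t, p = stats.ttest_ind(a, b)                   # standard pooled-variance t-test
print(f"Levene p = {lev_p:.3f}, t = {t:.2f}, p = {p:.3f}")
```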
46
slope (m)=
rise/run
47
Rise
The change in y for a given change in x
48
Run
The change in x. It represents how much x increases or decreases, which is typically by one unit when calculating the slope
49
in statistics, a parameter is
The value of an unknown population characteristic
50
population regression line
the value of the dependent variable (y) is conditional on one or more independent variables
51
the intercept of the population regression line is interpreted as
the baseline or starting point of the dependent variable
52
intercept
alpha
53
slope
beta
54
sample regression model
if you know alpha and beta, you can predict the value of y
55
systematic components
explains the predictable relationship between the independent and dependent variables
56
For OLS (Ordinary Least Squares) to produce unbiased and reliable estimates, a certain set of assumptions must be met
* normality of the error terms,
* no bias in the error term,
* homoscedasticity (constant variance of errors),
* no autocorrelation,
* X values are measured without error,
* no omitted or unnecessary variables included in the model,
* parametric linearity,
* X must vary, and
* the sample size must exceed the number of parameters (variables) in the model (n > k).
57
multiple regression model
we are able to control for another variable (Z) as we measure the relationship between our independent variable of interest and our dependent variable
58
two variable regression model
a line is fit to a scatterplot of data
59
the line in a two variable regression model is defined by its __ and serves as __
slope and y-intercept, a statistical model of reality
60
how are two-variable regression and the three bivariate hypothesis tests different
although the hypothesis tests allow hypothesis testing, they don't produce a statistical model
61
the two elements (m and b) are described as the line's
parameters
62
in a two-variable regression, we represent the y-intercept by the Greek letter _ and the slope parameter by the Greek letter _
alpha (α), beta (β)
63
our theory about the underlying population in which we are interested is expressed in the
population regression models
64
in the population regression models there is one additional component which does not correspond with what we are used to seeing in line formulae from math classes
this term is the stochastic or "random" component of our dependent variable
65
why do we have the term stochastic
because we do not expect all of our data points to line up perfectly on a straight line
66
in two variable regression we use information from the ___ to make inferences about the unseen population regression
sample regression model
67
we place _ over the terms in the sample regression model that are estimates of terms from the unseen population regression model
hats
68
our best guesses of the unseen population parameters a and b
parameter estimates
69
another name for the estimated stochastic component
residual, meaning "leftover"
70
another way to refer to u is to call it the
sample error term
71
OLS regression
a method to compute a linear regression model of a sample
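A minimal OLS sketch with statsmodels (simulated data, so the true parameters are known):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 50)  # true alpha = 2.0, beta = 0.5, plus noise

X = sm.add_constant(x)                    # adds the intercept column
model = sm.OLS(y, X).fit()                # fits by minimizing the sum of squared residuals
print(model.params)                       # parameter estimates: alpha-hat and beta-hat
print(model.rsquared)                     # goodness-of-fit
```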
72
covariation
an unstandardized statistical measure summarizing the general pattern of association (or lack thereof) between two continuous variables.
73
to figure out how well the regression line matches the actual data points we estimate the
goodness-of-fit of our model
74
Root mean squared error (RMSE) aka, root MSE or model standard error
a calculation of goodness-of-fit made by squaring each sample error term, summing them up, dividing by the number of observations, and then taking the square root
75
RMSE
a measure of the average magnitude of the prediction errors, reflecting how well the model fits the data
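A sketch of that calculation directly from the residuals (values invented):

```python
import numpy as np

observed  = np.array([3.0, 4.5, 5.1, 6.8, 8.0])
predicted = np.array([3.2, 4.1, 5.5, 6.5, 8.4])  # e.g., fitted values from a regression

residuals = observed - predicted
rmse = np.sqrt(np.mean(residuals ** 2))  # square, average, then take the square root
print(rmse)
```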
76
the r-squared statistic (r²)
a goodness-of-fit measure that varies between 0 and 1, representing the proportion of variation in the dependent variable that is accounted for by a model
77
the residual sum of squares (RSS)
the total of the squared differences between the observed values and the values predicted by a regression model, reflecting the amount of error (unexplained variance) in the model's predictions for the dependent variable
78
the total sum of squares (TSS)
the total of the squared differences between the observed values and the mean of those values, reflecting the overall variability in the dependent variable
79
Model sum of squares (MSS)
the total squared differences between the predicted values from a regression model and the mean of the observed values, reflecting the variability in the dependent variable that is explained by the independent variable(s) in the model
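A sketch tying the three sums of squares together; for an OLS fit with an intercept, TSS = MSS + RSS, so R² = MSS/TSS = 1 − RSS/TSS (data invented):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])

slope, intercept = np.polyfit(x, y, 1)   # OLS fit
y_hat = intercept + slope * x

tss = np.sum((y - y.mean()) ** 2)        # total sum of squares
rss = np.sum((y - y_hat) ** 2)           # residual sum of squares
mss = np.sum((y_hat - y.mean()) ** 2)    # model sum of squares

print(np.isclose(tss, mss + rss))        # True for an OLS fit with an intercept
print(mss / tss, 1 - rss / tss)          # two equivalent ways to compute R-squared
```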
80
confidence intervals
a range of values derived from a sample statistic that estimate an unknown population parameter with a specified level of confidence, typically expressed as a percentage, indicating the range within which we expect the true parameter value to lie across repeated samples
81
Which statistical test is appropriate for hypothesis testing when the independent variable is categorical and the dependent variable is also categorical?
- chi-square
- correlational analysis
- experimental analysis
- t-test
chi-square
82
Which statistical test is appropriate for hypothesis testing when the independent variable is categorical and the dependent variable is numerical?
- t-test
- chi-square test
- correlational analysis
- experimental analysis
t-test
83
In a study of whether support for environmental policies is related to ideology, which of the following could be the null hypothesis?
- There is no difference in support for environmental policies between liberals and conservatives
- There is a significant relationship between ideology and support for environmental policies
- Conservatives have a higher mean support for environmental policies than liberals
- Support for environmental policies is higher among liberals than among conservatives
There is no difference in support for environmental policies between liberals and conservatives
84
What does a standard error measure?
- the average value of a dataset
- uncertainty about a statistical estimate
- the difference between the maximum and minimum values in a dataset
- the total number of observations in a dataset
uncertainty about a statistical estimate
85
What do degrees of freedom represent in statistical analysis?
- they reflect the idea that we will gain confidence in an observed pattern as the amount of data on which that pattern is based increases
- they measure the variability of individual data points in a dataset
- they represent the number of variables in a statistical model
- they indicate the total number of observations in a dataset
they reflect the idea that we will gain confidence in an observed pattern as the amount of data on which that pattern is based increases
86
In a study of whether support for environmental policies is related to ideology, which of the following could be the alternative hypothesis?
- support for environmental policies is the same across all ideological groups
- ideology does not influence support for environmental policies
- there is no difference in support for environmental policies between liberals and conservatives
- support for environmental policies is greater among liberals than among conservatives
support for environmental policies is greater among liberals than among conservatives
87
What distribution is used for conducting a difference in means test when comparing the means of two independent groups?
- normal distribution
- chi-square distribution
- exponential distribution
- Student's t-distribution
Student's t-distribution
88
What does a large standard error indicate about a sample mean?
- the sample size is very large
- there is very little variability in the sample data
- the sample mean is not a reliable estimate of the population mean
- the sample mean is a precise estimate of the population mean
the sample mean is not a reliable estimate of the population mean
89
Can a bivariate hypothesis test effectively control for confounding variables?
- no, it only examines the relationship between two variables without controlling for other factors
- yes, it automatically accounts for all confounding variables
- no, it is designed specifically for univariate analysis
- yes, it can control for confounding variables if they are included in the analysis
no, it only examines the relationship between two variables without controlling for other factors
90
What does a lower p-value indicate about the relationship between two variables in hypothesis testing?
- there is more confidence that there is a systematic relationship between the two variables for which we estimated a particular p-value
- there is no relationship between the two variables
- the sample size is too small to draw any conclusions
- the means of the two groups are equal
there is more confidence that there is a systematic relationship between the two variables for which we estimated a particular p-value
91
Multivariate regression
a regression model with more than 2 variables, which allows researchers to control for the impact of potentially confounding variables on the dependent variable when examining the relationship between an independent variable of interest and a dependent variable
92
population regression model
a theoretical formulation of the proposed linear relationship between at least one independent variable and a dependent variable
93
the population regression model specifies
the relationship we theorize to exist between our variables in the real world
94
The beta coefficients in multiple regression fit a
hyperplane to the data
95
In three dimensions (with two independent variables and one dependent variable), a multivariate regression fits a
plane
96
in higher dimensions (with more than two independent variables and a dependent variable), a multivariate regression fits a
hyperplane that exists in that multi-dimensional space
97
Multiple regression only controls for the
variables that are measured and included in the equation
98
MR uses statistical controls, which are not as effective at
isolating the effects of X on Y as experimental controls
99
you cannot compare beta coefficients from a regression table because they are
unstandardized
100
Unstandardized coefficients:
regression coefficients such that the rise-over-run interpretation is expressed in the original metrics of each variable
101
substantive significance
a judgment call about whether or not statistically significant relationships are “large” or “small” in terms of their real-world impact.
102
multivariate regression requires one more assumption
No perfect multicollinearity – there can be no exact linear relationship between two or more of the independent variables in the model
103
What is the difference between a population regression model and a sample regression model?
- A population regression model is a theoretical formulation of the proposed linear relationship between at least one independent variable and a dependent variable, whereas a sample regression model is a sample-based estimate of the population regression model.
- A population regression model predicts future outcomes, while a sample regression model is solely used for descriptive analysis of past data.
- A population regression model is only applicable to small datasets, whereas a sample regression model is used for large datasets.
- A population regression model includes only categorical variables, while a sample regression model can include both categorical and continuous variables.
A population regression model is a theoretical formulation of the proposed linear relationship between at least one independent variable and a dependent variable, whereas a sample regression model is a sample-based estimate of the population regression model.
104
What component does a population regression model have that a mathematical formula for a line lacks?
- An intercept parameter
- A systematic component
- A stochastic component
- A slope parameter
A stochastic component
105
How do we estimate a residual in a sample regression model?
- Count the total number of observations used in the regression analysis.
- Find the mean of the dependent variable across all observations in a dataset.
- Determine the slope of the regression line representing the change in the dependent variable for a one-unit change in the independent variable in the sample regression model.
- Take the difference between the actual value of the dependent variable and the predicted value of the dependent variable from the sample regression model.
Take the difference between the actual value of the dependent variable and the predicted value of the dependent variable from the sample regression model.
106
Which of the following options is a mathematical property of an OLS regression?
- The line produced by the parameter estimates goes through the sample mean values of X and Y.
- Residuals are the predicted values of the dependent variable in the regression model.
- The slope of the regression line is always equal to the y-intercept of the model.
- The regression coefficients represent the maximum values of the independent variables.
The line produced by the parameter estimates goes through the sample mean values of X and Y.
107
Which of the following statistics does NOT measure goodness-of-fit?
- t-ratio
- Root mean-squared error
- R-squared
t-ratio
108
How does Ordinary Least Squares (OLS) regression fit a line to data?
- By averaging the dependent variable values.
- By fitting a line that passes through all data points.
- By maximizing the sum of the absolute residuals.
- By minimizing the sum of the squared residuals.
By minimizing the sum of the squared residuals.
109
What does DFBETAS measure in a regression analysis?
- The change in each regression coefficient when a specific observation is removed.
- The influence of an observation based on its leverage alone.
- The change in the residuals when an observation is removed from the dataset.
- The change in the overall R² value when an observation is excluded.
The change in each regression coefficient when a specific observation is removed.
110
What is the primary difference between perfect multicollinearity and high multicollinearity in regression analysis?
- Perfect multicollinearity leads to an increase in R², while high multicollinearity decreases R².
- Perfect multicollinearity occurs when two variables are completely correlated, while high multicollinearity indicates a strong but not perfect correlation.
- Perfect multicollinearity is acceptable in regression analysis, while high multicollinearity must always be addressed.
- Perfect multicollinearity affects only categorical variables, whereas high multicollinearity affects continuous variables.
Perfect multicollinearity occurs when two variables are completely correlated, while high multicollinearity indicates a strong but not perfect correlation.
111
What is the primary difference between an outlier and an influential case in regression analysis?
- Outliers always have a large effect on the regression coefficients, while influential cases do not.
- Outliers can be ignored without affecting the model, while influential cases must always be included.
- Outliers are extreme values in the data, whereas influential cases significantly affect the fit of the model and the estimated coefficients.
- Outliers are only found in the dependent variable, while influential cases can only be in the independent variables.
Outliers are extreme values in the data, whereas influential cases significantly affect the fit of the model and the estimated coefficients.
112
What does leverage indicate in regression analysis?
- The difference between the observed and predicted values of the dependent variable.
- The degree to which an individual case is unusual in terms of its value for a single independent variable, or its particular combination of values for two or more independent variables.
- The overall goodness-of-fit of the regression model.
- The strength of the relationship between the independent and dependent variables.
The degree to which an individual case is unusual in terms of its value for a single independent variable, or its particular combination of values for two or more independent variables.
113
What is the role of the reference category when using categorical variables in regression analysis?
- It serves as the baseline against which the effects of other categories are compared.
- It indicates the category with the most observations.
- It is the category with the highest numerical value.
- It is excluded from the model to simplify the analysis.
It serves as the baseline against which the effects of other categories are compared.
114
What does an interaction term in a multivariate regression model represent?
- The independent effect of one variable on the dependent variable.
- A method to eliminate multicollinearity among independent variables.
- The average effect of an independent variable across all observations.
- The combined effect of two independent variables on the dependent variable that varies depending on the level of one or both variables.
The combined effect of two independent variables on the dependent variable that varies depending on the level of one or both variables.
115
What is omitted variable bias in multivariate regression analysis?
- It happens when a relevant variable is left out of the model, causing the estimates of the included variables to be biased.
- It refers to the situation where all variables are perfectly correlated, making it impossible to determine their individual effects.
- It occurs when the regression model includes too many independent variables, which can make the model overly complex.
It happens when a relevant variable is left out of the model, causing the estimates of the included variables to be biased.
116
What is perfect multicollinearity in multivariate regression analysis?
- It arises from including irrelevant variables that do not influence the dependent variable.
- It refers to a situation where all variables are uncorrelated.
- It occurs when the dependent variable is fully explained by one or more independent variables.
- It happens when two or more independent variables have a perfect linear relationship.
It happens when two or more independent variables have a perfect linear relationship.
117
What is the main difference between unstandardized and standardized beta coefficients in a multivariate regression?
- Unstandardized coefficients are always larger than standardized coefficients.
- Unstandardized coefficients indicate the relationship between the dependent variable and independent variables in their original units, while standardized coefficients express the relationship in terms of standard deviations.
- There is no difference; both coefficients represent the same information.
- Standardized coefficients can only be used for categorical independent variables, while unstandardized coefficients are used for continuous variables.
Unstandardized coefficients indicate the relationship between the dependent variable and independent variables in their original units, while standardized coefficients express the relationship in terms of standard deviations.
118
What is the key difference between statistical significance and substantive significance in regression analysis?
- Statistical significance only applies to large sample sizes, while substantive significance applies to small samples.
- Statistical significance assesses whether a result is likely due to chance, while substantive significance evaluates the practical importance or real-world relevance of the result.
- There is no difference; both terms mean the same thing in research.
- Statistical significance refers to the size of the effect, while substantive significance indicates whether the effect is statistically significant.
Statistical significance assesses whether a result is likely due to chance, while substantive significance evaluates the practical importance or real-world relevance of the result.
119
Which of the following statements correctly interprets the coefficient of an independent variable in a multiple regression analysis?
- The coefficient shows the correlation between the independent variable and the dependent variable.
- The coefficient represents the change in the dependent variable for a one-unit increase in the independent variable, holding all other variables constant.
- The coefficient indicates the overall effect of all independent variables on the dependent variable.
- The coefficient indicates whether the independent variable is significant in predicting the dependent variable.
The coefficient represents the change in the dependent variable for a one-unit increase in the independent variable, holding all other variables constant.
120
What do standard errors represent in a multivariate regression output?
- The average value of the dependent variable.
- The total amount of variation explained by the regression.
- The degree of variability or uncertainty in the estimated coefficients.
- The correlation between the independent and dependent variables.
The degree of variability or uncertainty in the estimated coefficients.
121
Which of the following is NOT an assumption made about the population regression model in Ordinary Least Squares (OLS)?
- The variance of the population stochastic component is constant across all observations.
- The population stochastic component is normally distributed.
- The stochastic terms for any two or more cases are correlated.
- The population stochastic component has a mean equal to zero.
The stochastic terms for any two or more cases are correlated.
122
Which of the following is NOT an assumption made about the model specification in regression analysis?
- The model exhibits parametric linearity.
- There are no non-causal variables included in the model.
- There are no omitted causal variables.
- The model parameters remain constant across different populations.
The model parameters remain constant across different populations.
123
Which of the following is NOT a measure of uncertainty about the parameters of the sample regression model?
- Standard error for the slope parameter
- R-squared value
- Variance of the intercept parameter
R-squared value
124
In the context of an OLS regression model, what does a one-tailed hypothesis test evaluate?
- Whether the residuals are normally distributed.
- Whether the slope coefficient is greater than or less than zero.
- Whether the overall fit of the model is statistically significant.
- Whether the slope coefficient is equal to zero.
Whether the slope coefficient is greater than or less than zero.
125
What is the primary purpose of using dummy variables in regression analysis?
- To convert categorical variables into a format that can be included in regression models.
- To represent continuous variables in the model.
- To account for non-linear relationships between variables.
- To reduce multicollinearity among independent variables.
To convert categorical variables into a format that can be included in regression models.
126
What is the "dummy variable trap"?
- A method for simplifying complex categorical variables.
- A situation where including all categories of a categorical variable in a regression model leads to multicollinearity.
- A statistical technique to measure the effect of interaction terms.
- A situation where dummy variables perfectly predict the outcome variable.
A situation where including all categories of a categorical variable in a regression model leads to multicollinearity.
127
What does the Variance Inflation Factor (VIF) indicate in regression analysis?
- The difference in residuals between the predicted and actual values.
- The overall fit of the regression model.
- The strength of the correlation between the dependent variable and the independent variables.
- The degree to which multicollinearity inflates the variance of the estimated regression coefficients.
The degree to which multicollinearity inflates the variance of the estimated regression coefficients.
128
What happens when you control for a collider variable in the presence of collider bias?
- It strengthens the causal relationship between the independent and dependent variables.
- It eliminates the confounding effect of the collider variable.
- It can introduce bias by creating a spurious association between the independent and dependent variables.
- It has no effect on the relationship between the independent and dependent variables.
It can introduce bias by creating a spurious association between the independent and dependent variables.
129
What is a key consideration regarding causality in multivariate regression analysis?
- A significant relationship between variables automatically proves causality in multivariate regression.
- In multivariate regression, the size of the coefficients directly indicates the strength of a causal relationship.
- Multivariate regression can suggest potential causal relationships, but it does not provide definitive proof of causation.
- Causality can be inferred simply by including more independent variables in the multivariate regression model.
Multivariate regression can suggest potential causal relationships, but it does not provide definitive proof of causation.
130
Do the assumptions of bivariate regression also apply to multivariate regression?
- Yes, but multivariate regression has additional assumptions that must also be considered, such as no perfect multicollinearity.
- No, the assumptions for bivariate regression are different and cannot be applied to multivariate regression.
Yes, but multivariate regression has additional assumptions that must also be considered, such as no perfect multicollinearity.
131
What is the main difference between statistical control and experimental control in research?
- Statistical control is less reliable than experimental control because it does not require data collection.
- Statistical control is only applicable in qualitative research, while experimental control is used in quantitative research.
- Statistical control uses statistical techniques to account for confounding variables, whereas experimental control involves random assignment to conditions to eliminate confounding.
- Statistical control involves manipulating variables, while experimental control does not.
Statistical control uses statistical techniques to account for confounding variables, whereas experimental control involves random assignment to conditions to eliminate confounding.
132
Binary variables are usually coded as
0 or 1
133
dummy variable trap
perfect multicollinearity that results from the inclusion of dummy variables representing each possible value of a categorical variable
134
Perfect multicollinearity
when there is an exact linear relationship between any two or more of a regression model’s independent variables
135
The coefficient for the binary X variable indicates the difference in the Y variable between the respective category and the
reference category (the one omitted in the dummy coding)
136
This coefficient provides insights into
how the response varies across different groups
137
Reference category
in a regression model, the value of a categorical independent variable for which we do not include a dummy variable
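A sketch of dummy coding with pandas; `drop_first=True` omits one category, which becomes the reference category and avoids the dummy variable trap (data invented):

```python
import pandas as pd

df = pd.DataFrame({"party": ["Dem", "Rep", "Ind", "Dem", "Rep"]})
dummies = pd.get_dummies(df["party"], drop_first=True)
print(dummies)  # "Dem" (first alphabetically) is dropped and serves as the reference category
```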
138
Categorical independent variables can be used in
interactions
139
Interactive models
multiple regression models that contain at least one independent variable that researchers create by multiplying together two or more independent variables
140
Use an interaction model in multiple regression if
you suspect that the effect of one independent variable on the dependent variable varies depending on the level of another independent variable
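A sketch of an interaction model in statsmodels' formula interface, using invented income-and-voting data; `income * voted` expands to income + voted + income:voted (the interaction term):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "income": rng.uniform(20, 120, 200),
    "voted": rng.integers(0, 2, 200),
})
# Simulate donations that rise with income, and rise faster for voters.
df["donation"] = (5 + 0.3 * df["income"] + 2 * df["voted"]
                  + 0.4 * df["income"] * df["voted"] + rng.normal(0, 5, 200))

fit = smf.ols("donation ~ income * voted", data=df).fit()
print(fit.params)  # income:voted estimates how voting changes the slope of income
```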
141
A significant interaction effect between income and voter status indicates that the increase in donations with income is greater for voters than for nonvoters, suggesting that voter status ---- the relationship between income and donations
moderates
142
Moderation
the alteration of the relationship between two variables by a third variable, indicating that the effect of one variable on an outcome changes depending on the level or category of the modifying variable
143
Interaction effects can be modeled between --- of two categorical variables, two numeric variables, or one of each
any combination
144
Interaction Between a Categorical and a Numeric Variable
The effect of a numeric variable on the dependent variable is modified by a categorical variable
145
Interaction Between Two Categorical Variables
In this case, the interaction term assesses how the effect of one categorical variable on the dependent variable changes based on the levels of another categorical variable
146
Interaction Between Two Numeric Variables
Here, the interaction term assesses how the relationship between one numeric variable and the dependent variable changes at different levels of another numeric variable
147
When an exposure and an outcome independently cause a third variable, that variable is
a ‘collider’.
148
Inappropriately controlling for a collider variable, by study design or statistical analysis,
results in collider bias
149
Influential case
a case in a regression model which has either a combination of large leverage and a large squared residual or a large DFBETA score
150
A case can be influential if it has large
leverage.
151
Leverage
in a regression model, the degree to which an individual case is unusual in terms of its value for a single independent variable, or its particular combination of values for two or more independent variables
152
A case can be influential if it has a large
squared residual value.
153
A large residual value indicates that the observed data point deviates markedly from
the predicted outcome
154
A case can be influential if it has both
large leverage and a large squared residual value
155
DFBETA is a diagnostic measure used in regression analysis to
assess the influence of individual data points on the estimated coefficients of the model. It quantifies the change in each regression coefficient when a specific observation is removed from the dataset.
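A sketch of pulling DFBETAS from a fitted statsmodels model (simulated data with one planted influential case):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 40)
y = 1 + 2 * x + rng.normal(0, 1, 40)
y[0] += 15                               # plant one influential observation

fit = sm.OLS(y, sm.add_constant(x)).fit()
dfbetas = fit.get_influence().dfbetas    # one row per case, one column per coefficient
print(np.abs(dfbetas).max(axis=0))       # large values flag influential cases
```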
156
DFBETA score
A statistical measure for the calculation of the influence of an individual case on the value of a single regression coefficient
157
How to deal with influential cases in a regression model
1. Check for data collection or management problems.
2. Don't do anything.
3. Delete the relevant observations.
4. Dummy out the influential cases.
158
Dummying out
adding a dummy (binary) variable to a regression model to measure and isolate the effect of an influential case
159
High multicollinearity
in a multivariate regression model, when two or more of the independent variables in the model are extremely highly correlated with one another, making it difficult to isolate the distinct effects of each variable
160
Signs of potential multicollinearity
* two or more of your independent variables are theoretically associated,
* two or more of your independent variables are known to correlate,
* the standard errors for your beta coefficients are large, or
* the R² is unexpectedly large
161
Micronumerosity
a situation in statistical analysis where the number of observations or data points is very small relative to the number of variables being analyzed.
* This condition can lead to several issues, including overfitting, unreliable estimates of model parameters, and difficulty in generalizing findings to a larger population.
* When a dataset is micronumerous, there may not be enough data to adequately capture the relationships between variables.
162
If you detect multicollinearity and cannot get more data, then you need to calculate and report the
model variance inflation factor (VIF).
163
Variance inflation factor (VIF)
a statistical measure to detect the contribution of each independent variable in a multiple regression model to overall multicollinearity
164
To calculate VIF, estimate an
auxiliary regression model
165
Auxiliary regression model:
a model in which one of the independent variables, Xj, becomes the dependent variable and all of the other independent variables remain independent variables
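A sketch of computing VIFs with statsmodels, which runs these auxiliary regressions internally (VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing X_j on the other independent variables; data invented):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
X = pd.DataFrame({"x1": rng.normal(size=100)})
X["x2"] = 0.9 * X["x1"] + rng.normal(scale=0.3, size=100)  # highly correlated with x1
X["x3"] = rng.normal(size=100)
X.insert(0, "const", 1.0)  # intercept column; its VIF is not meaningful

for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))  # x1 and x2 should show inflated VIFs
```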