terms Flashcards
(33 cards)
Standard error
A measure of how closely the sample mean estimates the population mean. The standard error is the standard deviation of the sampling distribution of the sample mean.
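A minimal Python sketch with made-up numbers (assuming NumPy is available) showing the usual estimate SE = s / sqrt(n):

```python
import numpy as np

# Hypothetical sample; the standard error estimates how far the sample
# mean is likely to fall from the population mean.
sample = np.array([4.1, 5.3, 6.0, 4.8, 5.5, 5.9, 4.7, 5.2])
se = sample.std(ddof=1) / np.sqrt(len(sample))  # SE = s / sqrt(n)
print(se)
```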
Normal distribution
a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. The normal distribution appears as a “bell curve” when graphed.
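An illustrative check of that symmetry in Python (hypothetical mean 10 and standard deviation 2, using SciPy):

```python
from scipy.stats import norm

# Values equidistant from the mean have equal density under a normal
# distribution, which is what produces the symmetric bell curve.
print(norm.pdf(8, loc=10, scale=2), norm.pdf(12, loc=10, scale=2))
```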
Adjusted R-squared
Adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases when the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected.
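A minimal sketch of the standard adjustment formula, with hypothetical values for R², the number of observations, and the number of predictors:

```python
# Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - k - 1),
# where n = number of observations and k = number of predictors.
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical values: R² = 0.65 with 50 observations and 3 predictors.
print(adjusted_r2(0.65, 50, 3))
```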
p-value
The p-value is the probability, under the assumption of no effect or no difference (the null hypothesis), of obtaining a result equal to or more extreme than what was actually observed. The p stands for probability and measures how likely it is that any observed difference between groups is due to chance alone.
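A minimal sketch of obtaining a p-value from a two-sample t-test (hypothetical group values, assuming SciPy is available):

```python
from scipy.stats import ttest_ind

# Two hypothetical groups; the p-value is the probability, under the
# null hypothesis of no difference, of a result at least this extreme.
group_a = [5.1, 4.9, 5.4, 5.0, 5.2, 4.8]
group_b = [5.6, 5.8, 5.5, 5.9, 5.7, 5.4]
stat, p = ttest_ind(group_a, group_b)
print(p)
```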
heteroskedasticity
A violation of the assumption of homogeneous variance of the residuals (homoskedasticity): the variance of the residuals changes across values of the predictors.
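A minimal simulated illustration (hypothetical data, assuming NumPy) where the residual spread grows with X, which is exactly the pattern heteroskedasticity describes:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 200)
# Hypothetical data where the error standard deviation grows with x,
# so the residuals violate the homoskedasticity assumption.
y = 2 * x + rng.normal(0, x)
resid = y - 2 * x
print(resid[:50].std(), resid[-50:].std())  # spread increases with x
```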
standard normal distribution
The standard normal distribution is a specific type of normal distribution that has a mean of 0 and a standard deviation of 1. It is often referred to as the Z-distribution. This distribution is a special case of the general normal distribution, which can have any mean and standard deviation.
residuals
Residuals in data analysis are the differences between observed values and the values predicted by a statistical or machine learning model. They are a crucial diagnostic measure used to assess the quality and validity of a model.
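A minimal sketch of the definition (hypothetical observed values and model predictions, assuming NumPy):

```python
import numpy as np

# Residual = observed value - value predicted by the model.
observed  = np.array([3.0, 4.5, 5.1, 6.2])
predicted = np.array([2.8, 4.7, 5.0, 6.5])
residuals = observed - predicted
print(residuals)
```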
null hypothesis
representing a default or baseline assumption that there is no effect, no difference, or no relationship between variables being studied. serves as a starting point for hypothesis testing.
statistical significance
Statistical significance is a concept used in hypothesis testing to determine whether the observed data in a study are unlikely to have occurred by chance alone. It helps researchers decide whether to reject the null hypothesis, which typically posits that there is no effect or no difference between groups.
standard deviations
Standard deviation is a statistical measure that quantifies the amount of variation or dispersion in a set of data values. It is a crucial concept in statistics, used to understand how spread out the values in a data set are around the mean (average).
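A minimal sketch with made-up values (assuming NumPy), where ddof=1 gives the sample standard deviation:

```python
import numpy as np

# How spread out the values are around their mean;
# ddof=1 divides by n - 1 (sample standard deviation).
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
print(data.mean(), data.std(ddof=1))
```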
parameter
A parameter in statistics is a numerical value that describes a characteristic of an entire population. It is a fixed value that summarizes or describes an aspect of the population, such as the mean, standard deviation, or proportion.
leverage point
A leverage point is an observation with predictor values that are far from the mean of the predictor values for all observations. This makes the fitted regression model likely to pass close to these points.
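A minimal simulated sketch (hypothetical data, assuming statsmodels) showing that an observation with an extreme predictor value gets a high leverage (hat) value:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = np.append(rng.normal(0, 1, 30), 8.0)   # last point is far from the others
y = 2 * x + rng.normal(0, 1, 31)
model = sm.OLS(y, sm.add_constant(x)).fit()
leverage = model.get_influence().hat_matrix_diag
print(leverage[-1], leverage[:-1].mean())  # extreme x value has high leverage
```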
logistic regression
Logistic regression is a technique that enables us to model a dummy (binary) dependent variable, i.e., a dummy variable on the left-hand side of the regression equation.
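A minimal sketch of fitting a logistic regression on simulated data (hypothetical variables, assuming statsmodels):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 200)
# Hypothetical binary outcome (dummy variable on the left-hand side).
p = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))
y = rng.binomial(1, p)
result = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(result.params)  # estimated intercept and slope on the log-odds scale
```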
interaction effect
An interaction effect describes a situation where the effect of one variable on the outcome variable changes depending on the value of another variable
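A minimal sketch of fitting an interaction term on simulated data (hypothetical variable names, assuming pandas and statsmodels), where the slope of x differs by group:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({"x": rng.normal(size=200),
                   "group": rng.integers(0, 2, 200)})
# Hypothetical data where the effect of x on y depends on group.
df["y"] = 1 + 2 * df["x"] + 3 * df["x"] * df["group"] + rng.normal(size=200)
model = smf.ols("y ~ x * group", data=df).fit()
print(model.params["x:group"])  # the estimated interaction effect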
normal distribution
A continuous probability distribution that is symmetrical around its mean.
Multicollinearity
Multicollinearity is a statistical phenomenon in which two or more independent variables in a multiple regression model are highly correlated. This high correlation means that the variables share similar information, which can lead to problems in estimating the relationship between each independent variable and the dependent variable.
Z-scores
A z-score expresses how many standard deviations a value lies above or below the mean of its distribution. Standardizing values in this way places them on the standard normal (Z) distribution, so scores from different variables can be compared directly.
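A minimal sketch of standardizing hypothetical values into z-scores (assuming NumPy):

```python
import numpy as np

# z = (value - mean) / standard deviation:
# the number of standard deviations each value lies from the mean.
data = np.array([50, 60, 70, 80, 90])
z = (data - data.mean()) / data.std(ddof=1)
print(z)
```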
Chi-squared (χ²) test
The chi-squared test is a statistical hypothesis test used to determine whether there is a significant association between categorical variables.
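A minimal sketch of the test on a hypothetical 2x2 contingency table (assuming SciPy):

```python
from scipy.stats import chi2_contingency

# Hypothetical counts cross-tabulating two categorical variables.
table = [[30, 10],
         [20, 40]]
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)
```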
interquartile range
The interquartile range (IQR) is a measure of statistical dispersion, which indicates the spread of the middle 50% of a dataset. It is particularly useful for understanding the variability of data and is less affected by outliers and extreme values compared to other measures like the range.
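A minimal sketch with made-up values (assuming NumPy), showing that the IQR is barely affected by an extreme value:

```python
import numpy as np

# IQR = Q3 - Q1, the spread of the middle 50% of the data.
data = np.array([1, 3, 5, 7, 9, 11, 13, 15, 100])  # 100 is an outlier
q1, q3 = np.percentile(data, [25, 75])
print(q3 - q1)  # not pulled up by the extreme value
```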
When should you use logistic regression?
* When the dependent variable is binary (categorical)
* When the relationship between Y and X is non-linear
What are some advantages of logistic regression over Ordinary Least Squares regression?
Solves some of the problems of using OLS as a linear probability model (LPM):
i. Nonsensical predictions (fitted probabilities below 0 or above 1)
ii. Heteroskedasticity
iii. Constant effects across all values of X
iv. Residuals that are not normally distributed
v. An inaccurate R²
* The coefficient of X is differential: the impact of X on Y depends on the value of X
* The impact is greatest around the mean of X, but much lower at the extremes of X
What are the problems caused by multicollinearity?
Problems Caused by Multicollinearity:
Inflated Standard Errors:
Multicollinearity increases the standard errors of the coefficients. This makes the estimates less precise and the confidence intervals wider, reducing the statistical power to detect significant effects.
Unreliable Estimates:
Coefficients can become highly sensitive to small changes in the model. Adding or removing a variable can cause large changes in the coefficients of the remaining variables.
Difficulty in Assessing Individual Predictor Effects:
It becomes hard to determine the individual effect of each predictor on the dependent variable because the predictors are not independent of each other.
Reduced Interpretability:
The interpretation of regression coefficients becomes problematic. High multicollinearity makes it unclear which variables are actually influencing the dependent variable and to what extent.
When might you find multicollinearity in your data?
Multicollinearity occurs when two or more predictor variables in a multiple regression model are highly correlated, meaning they contain similar information about the variance in the dependent variable.
What is an appropriate diagnostic test for identifying multicollinearity in your data?
Variance Inflation Factor (VIF):
VIF measures how much the variance of a regression coefficient is inflated due to multicollinearity. A VIF value greater than 10 is often considered indicative of high multicollinearity.
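A minimal sketch of computing VIFs on simulated data with two nearly identical predictors (hypothetical variable names, assuming pandas and statsmodels):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # nearly a copy of x1
X = pd.DataFrame({"const": 1.0, "x1": x1, "x2": x2})
# VIF for each predictor (skipping the constant column at index 0).
vifs = [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])]
print(vifs)  # values well above 10 flag the collinear pair
```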