midterm 1 Flashcards
(32 cards)
Three Criteria for Causality
- Plausibility
- Time Order
- Non-Spuriousness
What are the two types of variables?
Numeric and Categorical
mean, median, mode
- mean: the sum of all values divided by the number of values
- median: the middle value when the data are sorted
- mode: the most common value
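A minimal R sketch of the three measures, using made-up values. Base R has no built-in function for the statistical mode, so the counting step below is just one way to get it.

```r
# Made-up values, for illustration only.
x <- c(2, 3, 3, 5, 7)

mean_x <- mean(x)      # (2 + 3 + 3 + 5 + 7) / 5 = 4
median_x <- median(x)  # middle of the sorted values = 3

# Base R's mode() reports the storage type, not the statistical mode,
# so count each value and take the most frequent one.
counts <- table(x)
mode_x <- as.numeric(names(counts)[which.max(counts)])  # 3 appears twice
```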
Range
maximum – minimum value
IQR
Q3 – Q1 (the spread of the middle 50% of the data)
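A short R sketch of range and IQR on made-up values. Note that R's `quantile()` uses its default interpolation rule, so quartiles worked out by hand with a different rule may differ slightly on other data.

```r
# Made-up values, for illustration only.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

range_x <- max(x) - min(x)         # 9 - 1 = 8

q <- quantile(x, c(0.25, 0.75))    # Q1 = 3, Q3 = 7 for these values
iqr_x <- unname(q[2] - q[1])       # Q3 - Q1 = 4

iqr_builtin <- IQR(x)              # same calculation, built in
```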
Standard deviation is the square root of what?
variance
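A quick check in R, on made-up values, that the standard deviation is the square root of the variance. Both `var()` and `sd()` use the sample formulas (dividing by n - 1).

```r
# Made-up values, for illustration only.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

variance_x <- var(x)   # sample variance: 32 / 7
sd_x <- sd(x)          # sample standard deviation

same <- isTRUE(all.equal(sd_x, sqrt(variance_x)))  # TRUE
```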
Cluster Sample vs Stratified Sample
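Cluster: randomly selects some existing clusters (e.g. schools), then includes the observations in those clusters (sampling again within the selected clusters makes it a multistage sample)
Stratified: divides the population into subgroups (strata) based on a shared characteristic, then randomly samples from within every subgroup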
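The contrast can be sketched in R on a made-up population. The data frame `pop` and its columns `school` and `grade` are hypothetical. In this sketch the cluster sample keeps everyone in the chosen schools; drawing a further random sample inside each chosen school would add a second stage.

```r
set.seed(1)  # only so the sketch is reproducible

# Hypothetical population: 6 schools of 10 students, two grades.
pop <- data.frame(
  id = 1:60,
  school = rep(paste0("school_", 1:6), each = 10),
  grade = rep(c("A", "B"), times = 30)
)

# Cluster sample: pick 2 whole schools at random, skip the other 4.
chosen <- sample(unique(pop$school), 2)
cluster_sample <- pop[pop$school %in% chosen, ]

# Stratified sample: draw 5 students from EVERY grade (stratum).
strat_sample <- do.call(rbind, lapply(split(pop, pop$grade), function(g) {
  g[sample(nrow(g), 5), ]
}))
```

The key difference is that only some clusters end up in the cluster sample, while every stratum is represented in the stratified sample.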
68 – 95 – 99.7 rule for normal distribution
About 68% of the data fall within 1 sd of the mean
About 95% of the data fall within 2 sd of the mean
About 99.7% of the data fall within 3 sd of the mean
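These percentages can be checked in R with `pnorm()`, the cumulative distribution function of the standard normal distribution:

```r
# Area under the standard normal curve within 1, 2 and 3 sd of the mean.
within_1 <- pnorm(1) - pnorm(-1)   # about 0.683
within_2 <- pnorm(2) - pnorm(-2)   # about 0.954
within_3 <- pnorm(3) - pnorm(-3)   # about 0.997
```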
how to calculate ANOVA
Test Statistic:
F = Mean Square between Groups (MSG) / Mean Squared Error (MSE)
What does MSG conceptually mean?
the amount of variation between groups. In other words, how much of the variation you see in the sample is because there are multiple groups.
If this number is high, it means that the groups are different from each other. If this number is low, it means that the group means are all very similar to the overall mean – the groups are NOT different.
What is MSE conceptually?
the amount of variation within groups. If this number is high, it means that there is a lot of variation within the groups. If this number is low, it means that all of the observations are pretty close to average for their group.
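A sketch of the F calculation by hand in R, on made-up scores for three hypothetical groups, compared against `aov()`. The degrees of freedom used (k - 1 for MSG, n - k for MSE) are the standard one-way ANOVA values and are not stated on the cards.

```r
# Made-up scores for three hypothetical groups of three.
scores <- c(4, 5, 6, 7, 8, 9, 10, 11, 12)
group <- factor(rep(c("a", "b", "c"), each = 3))

k <- nlevels(group)   # number of groups
n <- length(scores)   # number of observations

grand_mean <- mean(scores)                    # 8
group_means <- tapply(scores, group, mean)    # 5, 8, 11
group_sizes <- tapply(scores, group, length)  # 3, 3, 3

# MSG: variation of the group means around the overall mean.
ssg <- sum(group_sizes * (group_means - grand_mean)^2)  # 54
msg <- ssg / (k - 1)                                    # 27

# MSE: variation of observations around their own group mean.
sse <- sum((scores - group_means[as.character(group)])^2)  # 6
mse <- sse / (n - k)                                       # 1

f_stat <- msg / mse                                        # 27

# The same F value from R's built-in ANOVA.
f_aov <- summary(aov(scores ~ group))[[1]][["F value"]][1]
```

Here MSG is large relative to MSE because the group means (5, 8, 11) are far apart while the scores inside each group sit close together.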
Factorial ANOVA
A technique for studying the effect of two or more categorical independent variables on a numeric dependent variable, accounting for interaction effects among the independent variables
Main Effect
The overall relationship between an independent and a dependent variable
Interaction Effect
when the relationship between two variables is different depending on the value of a third variable
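A minimal factorial ANOVA sketch in R. The variables `drug`, `sex` and `y` are hypothetical, and the data are constructed so that the drug only has an effect for one sex, which is an interaction. In the formula, `drug * sex` expands to both main effects plus the `drug:sex` interaction term.

```r
# Made-up 2 x 2 design with 3 observations per cell.
d <- expand.grid(rep = 1:3,
                 drug = c("placebo", "treatment"),
                 sex = c("f", "m"))

# Cell means: the treatment raises y only when sex is "m".
cell_mean <- ifelse(d$drug == "treatment" & d$sex == "m", 16, 10)
d$y <- cell_mean + c(-1, 0, 1)[d$rep]   # small spread within each cell

fit <- aov(y ~ drug * sex, data = d)
anova_table <- summary(fit)[[1]]        # rows: drug, sex, drug:sex, Residuals

# The drug effect depends on sex, which is what an interaction means.
cell_means <- tapply(d$y, list(d$drug, d$sex), mean)
drug_effect_f <- cell_means["treatment", "f"] - cell_means["placebo", "f"]  # 0
drug_effect_m <- cell_means["treatment", "m"] - cell_means["placebo", "m"]  # 6
```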
Tukey HSD (factorial anova)
Tests every pair of groups to see whether their means are statistically significantly different
In R: TukeyHSD() function on the aov model object
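A small usage sketch on made-up data with one hypothetical factor, `group`. The result holds one row per pair of groups, with the difference in means, a confidence interval, and an adjusted p-value.

```r
# Made-up scores for three hypothetical groups of three.
scores <- c(4, 5, 6, 7, 8, 9, 10, 11, 12)
group <- factor(rep(c("a", "b", "c"), each = 3))

fit <- aov(scores ~ group)
tukey <- TukeyHSD(fit)

# One row per pair: columns diff, lwr, upr, p adj.
pairs_table <- tukey$group
pair_names <- rownames(pairs_table)          # "b-a", "c-a", "c-b"
mean_diffs <- unname(pairs_table[, "diff"])  # 3, 6, 3
```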
multivariate
studies relationships of independent variables with multiple dependent variables
conditions for regression
- linearity
- nearly normal residuals
- constant variability
- independent observations
coefficient vs y-intercept
coefficient (slope): the B1 in B1x, or the m in mx
y-intercept: B0, the predicted value of y when x = 0
RMSE: root mean square error (three steps)
- square each error (actual minus predicted)
- take the mean of the squared errors
- take the square root of that mean
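The three steps can be sketched in R on made-up actual and predicted values:

```r
# Made-up values, for illustration only.
actual <- c(3, 5, 7, 9)
predicted <- c(2, 5, 8, 11)

errors <- actual - predicted        # 1, 0, -1, -2

squared_errors <- errors^2          # step 1: 1, 0, 1, 4
mean_sq_error <- mean(squared_errors)  # step 2: 6 / 4 = 1.5
rmse <- sqrt(mean_sq_error)         # step 3: about 1.22
```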
what minimizes the RMSE?
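For a single variable, the mean is the one-number guess that minimizes RMSE. For a relationship between two variables, the line that minimizes RMSE is the best-fit line, found by OLS (ordinary least squares).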
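A small R check of both claims on made-up data. The helper `rmse()` is a hypothetical function defined here, not part of base R. Comparing against one or two alternatives illustrates the idea but does not prove it.

```r
# Hypothetical helper: root mean square error of a set of predictions.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Made-up values, for illustration only.
x <- 1:5
y <- c(2, 4, 4, 6, 9)

# One-number guesses: the mean beats another candidate (the median).
rmse_at_mean <- rmse(y, mean(y))       # guess 5 for everyone
rmse_at_median <- rmse(y, median(y))   # guess 4 for everyone, larger RMSE

# Lines: the OLS line from lm() beats a line with a nudged slope.
fit <- lm(y ~ x)
rmse_ols <- rmse(y, fitted(fit))
rmse_other <- rmse(y, coef(fit)[1] + (coef(fit)[2] + 0.5) * x)
```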
R for regression
summary(lm(dependent_variable ~ independent_variable, data = dataset))
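A runnable version of that call on a made-up data frame. The names `d`, `hours` and `score` are hypothetical. The `(Intercept)` row of the output is the y-intercept and the `hours` row is the slope coefficient.

```r
# Made-up data: a hypothetical exam score against hours studied.
d <- data.frame(hours = 1:5,
                score = c(3.1, 4.9, 7.2, 8.8, 11.0))

fit <- lm(score ~ hours, data = d)
fit_summary <- summary(fit)   # coefficients, R-squared, p-values

intercept <- coef(fit)[["(Intercept)"]]  # predicted score at hours = 0
slope <- coef(fit)[["hours"]]            # change in score per extra hour
```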
high leverage vs influential point
high leverage: very high or low value on the independent variable (x)
influential point: a high-leverage point that also falls far from the trend of the other points, so it noticeably changes the slope of the fitted line
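A sketch in R on made-up data. Here leverage is measured with `hatvalues()`, and influence is judged by how much the slope moves when the far-out point is dropped. That is one common way to check influence, assumed here rather than taken from the cards.

```r
# Made-up data: the 6th point sits far out on x (high leverage).
x <- c(1, 2, 3, 4, 5, 20)
y_on_trend <- c(2.1, 3.9, 6.2, 7.8, 10.1, 40)   # 6th point follows the trend
y_off_trend <- c(2.1, 3.9, 6.2, 7.8, 10.1, 5)   # 6th point breaks the trend

# Leverage depends only on x, so it is the same for both versions.
leverage <- hatvalues(lm(y_on_trend ~ x))
most_leverage <- unname(which.max(leverage))    # point 6

# Slope from the first five points only, as the baseline.
slope_without <- coef(lm(y_on_trend[-6] ~ x[-6]))[[2]]

# High leverage but not influential: slope barely moves.
slope_on_trend <- coef(lm(y_on_trend ~ x))[["x"]]

# High leverage and influential: slope is dragged toward the point.
slope_off_trend <- coef(lm(y_off_trend ~ x))[["x"]]
```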
Simpson’s Paradox
where the observed relationship between two variables changes, or even reverses, when the population is divided into different groups
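A made-up illustration in R. The columns `group`, `x` and `y` are hypothetical. Within each group the slope is positive, but the slope for the pooled data is negative.

```r
# Made-up data: two groups, each with a rising trend.
d <- data.frame(
  group = rep(c("a", "b"), each = 3),
  x = c(1, 2, 3, 6, 7, 8),
  y = c(10, 11, 12, 3, 4, 5)
)

# All data pooled together: negative slope.
overall_slope <- coef(lm(y ~ x, data = d))[["x"]]

# Each group on its own: positive slope (1 in both groups).
slope_a <- coef(lm(y ~ x, data = d[d$group == "a", ]))[["x"]]
slope_b <- coef(lm(y ~ x, data = d[d$group == "b", ]))[["x"]]
```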
Adjusted R squared in multiple regression
applies a penalty based on the number of parameters in the model
y = B0 + B1x has two parameters (B0, B1)
y = B0 + B1x1 + B2x2 has three parameters (B0, B1, B2)
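A sketch of the penalty in R, using R's built-in `mtcars` example dataset. The helper `adjusted_r2()` is hypothetical and uses the standard formula 1 - (1 - R^2)(n - 1) / (n - p), where p counts all parameters including the intercept. The formula is not given on the cards.

```r
# Hypothetical helper: adjusted R-squared computed by hand.
adjusted_r2 <- function(fit) {
  r2 <- summary(fit)$r.squared
  n <- nobs(fit)            # number of observations
  p <- length(coef(fit))    # number of parameters, intercept included
  1 - (1 - r2) * (n - 1) / (n - p)
}

fit_1 <- lm(mpg ~ wt, data = mtcars)        # two parameters
fit_2 <- lm(mpg ~ wt + hp, data = mtcars)   # three parameters

adj_manual <- adjusted_r2(fit_2)
adj_builtin <- summary(fit_2)$adj.r.squared  # matches the hand calculation
plain_r2 <- summary(fit_2)$r.squared         # larger, since it has no penalty
```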