General Statistics Flashcards
(28 cards)
What is parallel slopes regression?
A special case of regression with 1 numeric and 1 categorical explanatory variable
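For example, a minimal sketch with statsmodels’ formula API; the dataframe, column names (price, n_convenient, house_age_cat), and values below are made up for illustration:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one numeric predictor (n_convenient) and one categorical predictor (house_age_cat)
df = pd.DataFrame({
    "price":         [10.5, 12.0, 9.8, 15.2, 14.1, 8.9, 11.3, 13.7],
    "n_convenient":  [5, 8, 3, 10, 9, 2, 6, 9],
    "house_age_cat": ["0-15", "15-30", "30-45", "0-15", "15-30", "30-45", "0-15", "15-30"],
})

# No interaction term: each category gets its own intercept, but all categories
# share one slope ("parallel slopes")
model = smf.ols("price ~ n_convenient + C(house_age_cat)", data=df).fit()
print(model.params)
```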
What is Simpson’s Paradox?
Simpson’s Paradox occurs when the trend of a model fit to the whole dataset is very different from the trends shown by models fit to subsets of that dataset. In the most extreme case, you may see a positive slope on the whole dataset and negative slopes on every subset of that dataset (or the other way around).
Interpret this interaction regression model. What does each coefficient mean?
Height = B0 + B1*Bacteria + B2*Sun + B3*Bacteria*Sun
Without the interaction term, we can interpret B1 as the unique effect of Bacteria on Height. With the interaction term we can no longer do so, because the effect of Bacteria on Height now differs for different values of Sun. Thus B1 is now interpreted as the unique effect of Bacteria on Height ONLY WHEN Sun = 0.
B2 is the unique effect of Sun when Bacteria = 0.
The overall effect of Bacteria on Height is now B1 + B3 * Sun. So if we have the following coefficients:
Height = 35 + 4.2*Bacteria + 9*Sun + 3.2*Bacteria*Sun
For Sun = 0, a 1-unit increase in Bacteria results in a 4.2-unit increase in Height. For Sun = 1, the effect of Bacteria is 4.2 + 3.2*1 = 7.4, so for a 1-unit increase in Bacteria when Sun = 1 we would expect an increase of 7.4 units in Height.
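A hedged sketch of fitting an interaction model like this with statsmodels; the Height/Bacteria/Sun values below are invented, so the fitted coefficients will not match the 35/4.2/9/3.2 example above:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical plant-growth data: Height, Bacteria (numeric), Sun (0/1 indicator)
df = pd.DataFrame({
    "Height":   [38, 45, 52, 60, 40, 48, 70, 82],
    "Bacteria": [1, 2, 3, 4, 1, 2, 3, 4],
    "Sun":      [0, 0, 0, 0, 1, 1, 1, 1],
})

# "Bacteria * Sun" expands to Bacteria + Sun + Bacteria:Sun
model = smf.ols("Height ~ Bacteria * Sun", data=df).fit()
print(model.params)  # B0 (Intercept), B1 (Bacteria), B2 (Sun), B3 (Bacteria:Sun)

# The overall effect of a 1-unit increase in Bacteria is B1 + B3*Sun
b = model.params
print(b["Bacteria"] + b["Bacteria:Sun"] * 1)  # effect of Bacteria when Sun = 1
```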
What are the basic assumptions that linear regression makes about the data?
- Linearity of the data - relationship between x & y is linear
- Normality of residuals - the residual errors are assumed to be normally distributed
- Homogeneity of residuals variance - the residuals are assumed to have a constant variance (homoscedasticity)
- Independence of the residual error terms. (A diagnostic sketch for checking these assumptions follows below.)
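A sketch of how these assumptions can be checked in code, assuming statsmodels and SciPy are available; the simulated x/y data are made up:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3 + 2 * x + rng.normal(scale=1.5, size=100)      # a roughly linear relationship

model = sm.OLS(y, sm.add_constant(x)).fit()
resid = model.resid

print(stats.shapiro(resid))                          # normality of residuals
print(het_breuschpagan(resid, model.model.exog))     # constant variance (homoscedasticity)
print(durbin_watson(resid))                          # independence / autocorrelation of residuals
```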
What is the difference between parametric and nonparametric statistics?
Parametric statistics are based on assumptions about the distribution of the population from which the sample was taken. Example: Student’s t-test
Nonparametric statistics are not based on assumptions about the distribution of the population. In many cases the distribution of the population is unknown. These are cases when nonparametric statistics are used. Example: Mann-Whitney-Wilcoxon test
Which statistical test is used to assess the difference in means of two groups? (Parametric and nonparametric)
Parametric - Student’s t-test
Nonparametric - Mann-Whitney-Wilcoxon rank test
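A quick SciPy sketch with two made-up groups:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=40)
group_b = rng.normal(loc=11.0, scale=2.0, size=40)

print(stats.ttest_ind(group_a, group_b))      # parametric: Student's t-test
print(stats.mannwhitneyu(group_a, group_b))   # nonparametric: Mann-Whitney U / Wilcoxon rank test
```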
What statistical test is used to compare the means of more than two groups? (parametric and nonparametric)
Parametric - ANOVA: extension of t-test to compare more than two groups
Nonparametric - Kruskal-Wallis rank sum test (extended version of Wilcoxon rank test)
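The same kind of sketch for more than two groups (again with made-up data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=2.0, size=30) for mu in (10, 11, 13)]

print(stats.f_oneway(*groups))   # parametric: one-way ANOVA
print(stats.kruskal(*groups))    # nonparametric: Kruskal-Wallis rank sum test
```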
What statistical test is used to compare the variances of two groups? (parametric and nonparametric)
Parametric - F-test for 2 groups, Bartlett’s or Levene’s for multiple groups/samples
Nonparametric -
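A sketch of the variance tests in SciPy. The card leaves the nonparametric entry blank; the Fligner-Killeen test shown last is one common nonparametric option, included here as an assumption rather than as the card’s intended answer:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(scale=1.0, size=50)
b = rng.normal(scale=2.0, size=50)

# Parametric F-test for two groups: ratio of the two sample variances
f = np.var(a, ddof=1) / np.var(b, ddof=1)
p = 2 * min(stats.f.cdf(f, len(a) - 1, len(b) - 1),
            stats.f.sf(f, len(a) - 1, len(b) - 1))
print(f, p)

print(stats.bartlett(a, b))   # parametric, works for 2+ groups, sensitive to non-normality
print(stats.levene(a, b))     # more robust to non-normality
print(stats.fligner(a, b))    # Fligner-Killeen: a common nonparametric choice
```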
Interpret the coefficient for X2 as if it were a categorical variable.
Yi = B0 + B1*X1i + B2*X2i + ei.
Y = 42 + 2.3*X1 + 11*X2
B2 is the average difference in Y between the category for which X2 = 0 (the reference group) and the category for which X2 = 1 (the comparison group).
So compared to when X2 = 0, we would expect Y to be 11 units greater when X2 = 1, controlling for X1.
What is a confusion matrix?
A confusion matrix is a tabular summary of actual vs. predicted values. It is used in logistic regression (and classification models generally) to visualize and assess the performance of the model. The term is also used a lot in machine learning.
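A minimal example with scikit-learn’s confusion_matrix; the actual/predicted labels are made up:

```python
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes (scikit-learn's convention):
# [[TN FP]
#  [FN TP]]
print(confusion_matrix(y_actual, y_predicted))
```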
How does linear regression relate to the generalized linear model?
Linear regression is a special case of the GLM: the case where the link function is just the identity function, since Y does not need to be transformed.
What is a “link function” in regression?
The link function makes the distribution of Y compatible with the right-hand side of a regression equation.
When can you use least-squares regression and/or maximum likelihood estimation to solve a GLM equation?
Least-squares and MLE will give the same result for a linear regression problem.
You can only use MLE for other types of regression under the GLM (logistic, Poisson, etc.)
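A sketch of both cases using statsmodels’ GLM interface; the simulated data and the coefficients 0.5 and 1.2 are arbitrary:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 1)))
linear_predictor = X @ np.array([0.5, 1.2])
y_continuous = linear_predictor + rng.normal(size=200)
y_binary = rng.binomial(1, 1 / (1 + np.exp(-linear_predictor)))

# Linear regression = GLM with a Gaussian family and identity link (least squares and MLE agree here)
ols_fit = sm.GLM(y_continuous, X, family=sm.families.Gaussian()).fit()

# Logistic regression = GLM with a Binomial family and logit link, fit by MLE
logit_fit = sm.GLM(y_binary, X, family=sm.families.Binomial()).fit()
print(ols_fit.params, logit_fit.params)
```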
What is a logarithm?
In its simplest form, a logarithm answers the question “how many of one number do we multiply to get another number?”
For example, Log2(8) is asking, “how many 2’s do we multiply to get 8?” Therefore, Log2(8) = 3
Another way of looking at it: solving 2^x = 8 gives x = 3.
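In code (Python’s math.log takes an optional base):

```python
import math

print(math.log(8, 2))   # 3.0, because 2**3 == 8
```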
What is a parameter in statistics?
In statistics, a parameter is any measured quantity of a statistical population that summarizes or describes an aspect of the population, such as a mean or standard deviation.
A parameter is to a population as a statistic is to a sample.
What is the difference between the likelihood function and the probability density function?
- a probability density function expresses the probability of observing our data given the underlying distribution parameters. It assumes that the parameters are known
- The likelihood function expresses the likelihood of parameter values occurring given the observed data. It assumes that the parameters are unknown.
probability and likelihood do not mean the same thing in statistics
Probability attaches to results; likelihood attaches to hypotheses
https://www.psychologicalscience.org/observer/bayes-for-beginners-probability-and-likelihood
What is Maximum Likelihood Estimation?
A statistical method for estimating the parameters of a model. In MLE, the parameters are chosen to maximize the likelihood that the assumed model produced the observed data.
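A toy sketch of MLE by numerically minimizing the negative log-likelihood of a normal model; the simulated data and the true mean 5 / standard deviation 2 are arbitrary:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=500)

def neg_log_likelihood(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)          # keeps sigma positive during optimization
    return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)               # should land close to 5 and 2
```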
What is the central limit theorem?
- The sampling distribution of a statistic becomes closer to the normal distribution as the sample size increases.
- this only applies when samples are random and independent from one another
- very useful for large populations: generating more and more samples gives you better and better estimates of the true mean and standard deviation
- the central limit theorem applies to a wide range of distributions, discrete and continuous (e.g. binomial and Poisson), provided the population variance is finite
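A quick simulation sketch; the exponential population is an arbitrary choice of a skewed distribution:

```python
import numpy as np

rng = np.random.default_rng(42)

# 10,000 samples of size 50 from a skewed (exponential) population
sample_means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)

# The distribution of the sample means is approximately normal,
# centered near the true mean (1.0) with spread near 1/sqrt(50)
print(sample_means.mean(), sample_means.std())
```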
What is the difference between supervised and unsupervised learning in statistics?
Supervised learning: for each observation of a predictor you have a corresponding observation of the response variable
Unsupervised learning: you DO NOT have a corresponding observation of the response variable for each measurement of a predictor variable. In this case you try to understand the relationships between variables or between observations. Ex. cluster analysis
Semi-supervised: when you have some supervised data and some unsupervised data
What is a regression problem vs a classification problem?
Regression problem - problems with a quantitative response
Classification problem - problems with a qualitative response
However, the distinction isn’t necessarily hard and fast. For example, logistic regression is technically regression but involves a qualitative response variable and thus sort of belongs in both categories.
What is the difference between frequentist statistics and Bayesian statistics?
The frequentist believes that probability represents long term frequencies of repeatable events such as flipping a coin. Frequentists do not attach probabilities to hypotheses or unknown values.
Bayesian approach uses probabilities to represent the uncertainty in any event or hypothesis.
Bayesian approaches assign probability to events on the basis of confidence/belief. This confidence is updated in light of new evidence.
In the frequentist sense, probability can only be assigned to repeated events.
Frequentists focus on point estimates while Bayesians focus on probability distributions.
Parameters are assigned a probability distribution by Bayesians, while parameters are treated as fixed (but unknown) quantities by frequentists.
What is the relationship and difference between covariance and correlation?
Covariance refers to the systematic relationship between two random variables, in which a change in one variable is reflected by a change in the other. Values range from -infinity to infinity; the larger the magnitude, the stronger the (scale-dependent) linear relationship.
Cov(X,Y) = ∑(x(i) − mean(x))*(y(i) − mean(y)) / (n − 1)
Correlation is a measure that determines the degree to which two or more random variables move in sequence. Covariance is the numerator of the correlation formula.
Correlation Coefficient = ∑(x(i) − mean(x))*(y(i) − mean(y)) / √(∑(x(i) − mean(x))^2 * ∑(y(i) − mean(y))^2)
Correlation = covariance / (standard deviation of x * standard deviation of y)
So covariance measures how changes in two variables move together, and correlation measures the degree to which those changes cling together on a standardized scale. Holding the covariance fixed, the greater the variance of either x or y, the lower the correlation; holding the variances fixed, the greater the covariance, the greater the correlation.
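A small numpy check of this relationship, using simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)

cov_xy = np.cov(x, y)[0, 1]                                     # sample covariance
corr_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))  # covariance / (sd_x * sd_y)
corr_builtin = np.corrcoef(x, y)[0, 1]
print(cov_xy, corr_manual, corr_builtin)                        # the two correlations agree
```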
What is an estimator?
An estimator is a rule for calculating an estimate of a given quantity based on observed data. For example, the sample mean is an estimator for the population mean.
The formula used to calculate a value from a sample is called the estimator; the value is called the estimate.
So really it’s a formula
What is SMOTE?
Synthetic Minority Oversampling Technique
Suppose you’re working on a health-insurance fraud detection problem. In such problems we generally observe that out of every 100 insurance claims, 99 are non-fraudulent and 1 is fraudulent. A binary classifier therefore doesn’t need to be a complex model: it can predict every outcome as 0 (non-fraudulent) and still achieve a great accuracy of 99%. Clearly, in such cases where the class distribution is skewed, the accuracy metric is biased and not preferable.
This is where SMOTE comes in. SMOTE is an oversampling technique where synthetic samples are generated for the minority class. This algorithm helps to overcome the overfitting problem posed by random oversampling. It works in feature space, generating new instances by interpolating between minority-class (positive) instances that lie close together.
https://www.analyticsvidhya.com/blog/2020/10/overcoming-class-imbalance-using-smote-techniques/
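A minimal sketch using the imbalanced-learn library’s SMOTE; the data come from scikit-learn’s make_classification with a roughly 99:1 split to mirror the example above:

```python
# Requires: pip install imbalanced-learn scikit-learn
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Simulate a heavily imbalanced (~99:1) binary classification dataset
X, y = make_classification(n_samples=10_000, n_features=10,
                           weights=[0.99, 0.01], random_state=0)
print(Counter(y))       # the minority (fraud-like) class is tiny

# SMOTE interpolates between nearby minority-class points to create synthetic samples
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))   # classes are now balanced
```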