Data Science Statistics II Flashcards
(95 cards)
Q: What is the null hypothesis (H0)?
A: The default assumption that there is no effect or no difference. Under H0, any observed effect is attributed to random chance.
Q: What is the alternative hypothesis (H1)?
A: The hypothesis that there is an effect or difference, and it’s not due to chance.
Q: When do we use a t-distribution?
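A: When the population standard deviation is unknown and has to be estimated from the sample. It matters most for small samples (rule of thumb: n <= 30); for larger samples the t-distribution is very close to the normal distribution.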
Q: What is the difference between a one-tailed and two-tailed test?
A: A one-tailed test checks for an effect in one specified direction (greater than, or less than). A two-tailed test checks for a difference in either direction.
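Example: a minimal sketch of how the same test statistic gives different p-values under the two kinds of test. It assumes a z statistic (standard normal under H0); the value 1.8 and the function name z_test_p_values are made up for illustration.

```python
from statistics import NormalDist

def z_test_p_values(z):
    """Return (one_tailed_upper, two_tailed) p-values for a z statistic."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # One-tailed (upper): probability of a value at least this large.
    one_tailed = 1 - std_normal.cdf(z)
    # Two-tailed: probability of a value at least this far from 0 in either direction.
    two_tailed = 2 * (1 - std_normal.cdf(abs(z)))
    return one_tailed, two_tailed

one_tailed, two_tailed = z_test_p_values(1.8)  # hypothetical statistic
print(round(one_tailed, 4), round(two_tailed, 4))
```

For a positive statistic the two-tailed p-value is twice the one-tailed one, so a result can pass a 0.05 threshold one-tailed and fail it two-tailed, as happens here.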
Q: What is a Probability Density Function (PDF)?
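A: A curve that describes the relative likelihood of the values of a continuous variable. The probability of falling within a range is the area under the curve over that range; the probability of any single exact value is 0.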
Q: What is a Cumulative Distribution Function (CDF)?
A: Gives the probability that a variable is less than or equal to a value. It’s the area under the PDF curve up to that point.
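Example: a small sketch of PDF versus CDF, assuming a standard normal variable and using NormalDist from Python's statistics module. The variable names are illustrative only.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# PDF: height of the curve at a point (a density, not a probability).
density_at_zero = std_normal.pdf(0)

# CDF: probability of being less than or equal to a value.
prob_below_zero = std_normal.cdf(0)

# Probability of a range = difference of two CDF values
# (the area under the PDF between the two points).
prob_within_one_sd = std_normal.cdf(1) - std_normal.cdf(-1)

print(round(density_at_zero, 4), prob_below_zero, round(prob_within_one_sd, 4))
```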
Q: What is the Central Limit Theorem (CLT)?
A: For independent samples from a population with finite variance, the distribution of sample means approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution.
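Example: a rough simulation sketch of the CLT, assuming a skewed exponential population with mean 1 and standard deviation 1. The sample sizes, the seed, and the helper name sample_means are arbitrary choices for illustration. It only checks the center and spread of the sample means; a histogram would be needed to see the bell shape.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

def sample_means(sample_size, num_samples=2000):
    """Draw num_samples samples from an exponential population and return their means."""
    return [
        mean([random.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(num_samples)
    ]

means_small = sample_means(5)
means_large = sample_means(50)

# The sample means center on the population mean (1), and their spread
# shrinks roughly like 1 / sqrt(sample_size).
print(round(mean(means_large), 3))
print(round(stdev(means_small), 3), round(stdev(means_large), 3))
```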
Q: Precision formula and meaning
A: Precision = TP / (TP + FP); it measures how many predicted positives are actually correct.
Q: Recall formula and meaning
A: Recall = TP / (TP + FN); it measures how many actual positives were correctly predicted.
Q: What do TP, FP, FN stand for?
A: TP: True Positive, FP: False Positive, FN: False Negative.
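Example: a minimal sketch of the precision and recall formulas applied to raw counts. The counts (TP = 3, FP = 2, FN = 1) are hypothetical, and returning 0.0 when a denominator is zero is one possible convention, not the only one.

```python
def precision_recall(tp, fp, fn):
    """Compute (precision, recall) from true positive, false positive, false negative counts."""
    # Guard against division by zero when there are no predicted or actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

precision, recall = precision_recall(tp=3, fp=2, fn=1)
print(precision, recall)  # 0.6 0.75
```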
Q: What is R-squared (R²)?
A: The proportion of the variance in the dependent variable that is predictable from the independent variable(s).
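Example: a short sketch of R² computed as 1 minus (residual sum of squares / total sum of squares). The actual and predicted values are made-up numbers.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total variation
    return 1 - ss_res / ss_tot

actual = [3, 5, 7, 9]              # hypothetical observations
predicted = [2.8, 5.3, 6.9, 9.2]   # hypothetical model output
r2 = r_squared(actual, predicted)
print(round(r2, 3))  # 0.991
```

Predicting the mean of the actual values for every row gives R² = 0; perfect predictions give R² = 1.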
Q: What does Bayes’ Theorem do?
A: It updates the probability estimate for an event based on new evidence. Posterior = (Likelihood × Prior) / Evidence.
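Example: a worked sketch of Bayes' Theorem for a diagnostic test. All three input numbers (1% prevalence, 95% true positive rate, 5% false positive rate) are hypothetical.

```python
def posterior_probability(prior, likelihood, false_positive_rate):
    """P(condition | positive test) = likelihood * prior / evidence."""
    # Evidence: total probability of a positive test, with or without the condition.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

posterior = posterior_probability(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(posterior, 3))  # 0.161
```

Even with a fairly accurate test, the posterior stays low here because the prior (how common the condition is) is small.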
Q: What is a confusion matrix?
A: A table that shows predicted vs actual classifications: True Positives, True Negatives, False Positives, False Negatives.
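Example: a minimal sketch that tallies a binary confusion matrix from two label lists, assuming 1 means positive and 0 means negative. The labels are invented for illustration.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    counts = Counter(zip(actual, predicted))  # keys are (actual, predicted) pairs
    return {
        "TP": counts[(1, 1)],
        "TN": counts[(0, 0)],
        "FP": counts[(0, 1)],  # predicted positive, actually negative
        "FN": counts[(1, 0)],  # predicted negative, actually positive
    }

actual = [1, 1, 1, 0, 0, 0, 1, 0]     # hypothetical true labels
predicted = [1, 0, 1, 1, 0, 1, 1, 0]  # hypothetical model predictions
matrix = confusion_matrix(actual, predicted)
print(matrix)  # {'TP': 3, 'TN': 2, 'FP': 2, 'FN': 1}
```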
Q: What is logistic regression used for?
A: To model binary outcome variables (yes/no, 0/1) using an S-shaped curve.
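Example: a sketch of the S-shaped (sigmoid) curve behind logistic regression. The intercept and slope below are hypothetical values, not fitted coefficients; in practice they would be estimated from data.

```python
import math

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1 / (1 + math.exp(-z))

def predict_probability(x, intercept, slope):
    """Probability of the positive class for a single feature value x."""
    return sigmoid(intercept + slope * x)

# Hypothetical coefficients for illustration only.
probabilities = [predict_probability(x, intercept=-3.0, slope=1.5) for x in [0, 2, 4]]
print([round(p, 3) for p in probabilities])
```

A common convention is to classify as positive when the probability is at least 0.5, although the threshold can be moved to trade precision against recall.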
Q: What is linear regression used for?
A: To model the relationship between one or more independent variables and a continuous dependent variable.
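Example: a minimal sketch of simple linear regression (one independent variable) fitted by ordinary least squares. The data points are made up and lie exactly on a line so the result is easy to check.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Slope = covariance of x and y divided by variance of x.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4]  # hypothetical inputs
ys = [3, 5, 7, 9]  # hypothetical outputs, exactly y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```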
Q: What is a Confidence Interval (CI)?
A: A range of values, computed from a sample, that is expected to contain the true population parameter (such as the mean) at a stated confidence level. At 95%, about 95% of intervals built this way would contain the true value.
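Example: a sketch of a confidence interval for a mean, assuming a sample large enough for the normal approximation (for small samples a t critical value would replace z). The data are synthetic, generated with a fixed seed, and the function name mean_confidence_interval is illustrative.

```python
import random
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    sample_mean = mean(sample)
    standard_error = stdev(sample) / n ** 0.5
    # Critical value: about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin

rng = random.Random(1)
sample = [rng.gauss(50, 10) for _ in range(100)]  # synthetic data for illustration

low_95, high_95 = mean_confidence_interval(sample, 0.95)
low_99, high_99 = mean_confidence_interval(sample, 0.99)
print(round(low_95, 2), round(high_95, 2))
print(round(low_99, 2), round(high_99, 2))
```

A higher confidence level gives a wider interval for the same data.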
Q: What does a p-value represent?
A: The probability of obtaining the result we observed or one even more extreme, assuming the null hypothesis is true.
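Example: a sketch of a p-value computed exactly for a coin-flip experiment. The scenario (60 heads in 100 flips, H0: the coin is fair, one-tailed test for "too many heads") is hypothetical.

```python
from math import comb

def binomial_upper_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the observed result or anything more extreme."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Probability of 60 or more heads in 100 flips if the coin is fair.
p_value = binomial_upper_tail(n=100, k=60)
print(round(p_value, 4))
```

With these numbers the p-value comes out to roughly 0.03, so at a 0.05 significance level the fair-coin hypothesis would be rejected in this one-tailed test.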
Q: When do we use the T-distribution?
A: When the population standard deviation is unknown and is estimated from the sample, especially when the sample size is small (n <= 30).
Q: What is the difference between linear and logistic regression?
A: Linear regression predicts a continuous outcome; logistic regression predicts a probability between 0 and 1 for classification.
Q: What is R² (Coefficient of Determination)?
A: It measures the proportion of variance in the dependent variable explained by the independent variable(s).
Q: What is Mean Squared Error (MSE)?
A: The average of the squares of the errors between predicted and actual values.
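Example: a short sketch of the MSE calculation. The actual and predicted values are made-up numbers.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between predicted and actual values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [3, 5, 7, 9]              # hypothetical observations
predicted = [2.8, 5.3, 6.9, 9.2]   # hypothetical model output
mse = mean_squared_error(actual, predicted)
print(round(mse, 3))  # 0.045
```

Squaring makes every error positive and penalizes large errors more heavily than small ones.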
Q: What is a train/test split?
A: A method of dividing data into a training set to build the model and a test set to evaluate its performance.
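Example: a minimal standalone sketch of a train/test split using only the standard library. The function name split_train_test, the 20% test fraction, and the seed are illustrative choices; libraries such as scikit-learn provide their own version of this.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows, then hold out a fraction as the test set."""
    shuffled = list(rows)                  # copy so the caller's data is not reordered
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    test_size = int(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]

rows = list(range(10))  # stand-in for real data rows
train_rows, test_rows = split_train_test(rows)
print(len(train_rows), len(test_rows))  # 8 2
```

Shuffling before splitting matters when the rows are ordered (for example by date or by class), since otherwise the test set may not resemble the training set.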
Q: What is K-Fold Cross Validation?
A: A technique that splits the data into K subsets (folds) and trains/tests the model K times, each time using a different fold as the test set and the remaining K-1 folds as the training set. The K scores are usually averaged.
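Example: a sketch of how K-fold splits can be generated as index lists, without any model attached. The helper name k_fold_indices is hypothetical, and the rows are not shuffled here; in practice they often are shuffled first.

```python
def k_fold_indices(num_rows, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(num_rows))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [num_rows // k + (1 if i < num_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(num_rows=10, k=3))
for train_idx, test_idx in folds:
    print(test_idx)
```

Every row appears in the test set exactly once across the K rounds.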