Data Science Statistics II Flashcards

(95 cards)

1
Q: What is the null hypothesis (H0)?
A: The default assumption that there is no effect or no difference. Any observed effect is due to random chance.
2
Q: What is the alternative hypothesis (H1)?
A: The hypothesis that there is an effect or difference, and it's not due to chance.
3
Q: When do we use a t-distribution?
A: When the sample size is 30 or fewer and the population standard deviation is unknown.
4
Q: What is the difference between a one-tailed and two-tailed test?
A: A one-tailed test checks for an effect in one direction. A two-tailed test checks for any difference in both directions.
5
Q: What is a Probability Density Function (PDF)?
A: Describes the likelihood of a continuous variable taking on a specific value range. It's the curve of probabilities.
6
Q: What is a Cumulative Distribution Function (CDF)?
A: Gives the probability that a variable is less than or equal to a value. It's the area under the PDF curve up to that point.
7
Q: What is the Central Limit Theorem (CLT)?
A: Regardless of the population distribution, the distribution of sample means approaches a normal distribution as the sample size increases.
8
Q: Precision formula and meaning
A: Precision = TP / (TP + FP); it measures how many predicted positives are actually correct.
9
Q: Recall formula and meaning
A: Recall = TP / (TP + FN); it measures how many actual positives were correctly predicted.
10
Q: What do TP, FP, FN stand for?
A: TP: True Positive, FP: False Positive, FN: False Negative.
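The precision and recall formulas above can be sketched in Python; the counts here are made-up examples, not from any dataset:

```python
# Precision and recall from raw confusion-matrix counts (hypothetical numbers).
def precision(tp, fp):
    return tp / (tp + fp)  # of everything predicted positive, how much was right

def recall(tp, fn):
    return tp / (tp + fn)  # of everything actually positive, how much was found

tp, fp, fn = 80, 20, 40
print(precision(tp, fp))  # 80 / 100 = 0.8
print(recall(tp, fn))     # 80 / 120 ≈ 0.667
```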
11
Q: What is R-squared (R²)?
A: The proportion of the variance in the dependent variable that is predictable from the independent variable(s).
12
Q: What does Bayes' Theorem do?
A: It updates the probability estimate for an event based on new evidence. Posterior = (Likelihood × Prior) / Evidence.
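A minimal numeric sketch of the update; the disease-test numbers below are invented for illustration:

```python
# Posterior = (Likelihood x Prior) / Evidence, with hypothetical numbers.
prior = 0.01        # P(disease)
sensitivity = 0.90  # P(positive | disease), the likelihood
fpr = 0.05          # P(positive | no disease)

evidence = sensitivity * prior + fpr * (1 - prior)  # P(positive) overall
posterior = sensitivity * prior / evidence          # P(disease | positive)
print(round(posterior, 4))  # 0.1538
```

Even with a positive test, the posterior stays modest because the prior is low.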
13
Q: What is a confusion matrix?
A: A table that shows predicted vs actual classifications: True Positives, True Negatives, False Positives, False Negatives.
14
Q: What is logistic regression used for?
A: To model binary outcome variables (yes/no, 0/1) using an S-shaped curve.
15
Q: What is linear regression used for?
A: To model the relationship between one or more independent variables and a continuous dependent variable.
16
Q: What is a Confidence Interval (CI)?
A: A range of values around a sample mean that is likely to contain the population mean with a certain confidence level (e.g., 95%).
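A sketch of a 95% CI for a mean, using the normal critical value 1.96 and made-up sample statistics:

```python
import math

# Hypothetical sample: mean 50, standard deviation 10, n = 100.
mean, sd, n = 50.0, 10.0, 100
se = sd / math.sqrt(n)   # standard error of the mean = 1.0
lo = mean - 1.96 * se    # lower bound
hi = mean + 1.96 * se    # upper bound
print(lo, hi)            # ≈ 48.04 51.96
```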
17
Q: What does a p-value represent?
A: The probability of obtaining the result we observed or one even more extreme, assuming the null hypothesis is true.
18
Q: What is the difference between a one-tailed and two-tailed test?
A: A one-tailed test checks for an effect in one direction; a two-tailed test checks both directions.
19
Q: When do we use the T-distribution?
A: When sample size is small (n ≤ 30) and population standard deviation is unknown.
20
Q: What is the difference between linear and logistic regression?
A: Linear regression predicts a continuous outcome; logistic regression predicts a probability between 0 and 1 for classification.
21
Q: What is R² (Coefficient of Determination)?
A: It measures the proportion of variance in the dependent variable explained by the independent variable(s).
22
Q: What is Mean Squared Error (MSE)?
A: The average of the squares of the errors between predicted and actual values.
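MSE in plain Python (the values are arbitrary):

```python
# Mean Squared Error: average of squared prediction errors.
def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

print(mse([3, 5, 2], [2, 5, 4]))  # (1 + 0 + 4) / 3 ≈ 1.667
```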
23
Q: What is a train/test split?
A: A method of dividing data into a training set to build the model and a test set to evaluate its performance.
24
Q: What is K-Fold Cross Validation?
A: A technique that splits the data into K subsets and trains/tests the model K times, each time using a different subset as the test set.
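The idea can be sketched without a library by rotating which fold is held out (a manual sketch assuming n divides evenly by K; in practice scikit-learn's KFold handles the bookkeeping):

```python
def k_fold_indices(n, k):
    # Yield (train, test) index lists, one pair per fold.
    indices = list(range(n))
    fold_size = n // k
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = [j for j in indices if j not in test]
        yield train, test

for train, test in k_fold_indices(6, 3):
    print(test)  # [0, 1] then [2, 3] then [4, 5]
```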
25
Q: What is Random-Fold Validation?
A: A method that randomly splits the data multiple times into train/test sets and averages the results for robust performance metrics.
26
Q: What is the difference between correlation and causation?
A: Correlation indicates a relationship between variables; causation indicates that one variable causes a change in another.
27
Q: What is Pearson correlation?
A: A measure of linear correlation between two variables, ranging from -1 to 1.
28
Q: How do you calculate R² from Pearson correlation?
A: Square the Pearson correlation coefficient: R² = r².
29
Q: What is a confusion matrix?
A: A table used to evaluate the performance of a classification model by comparing actual vs. predicted labels, showing TP, FP, FN, and TN.
30
Q: What is the formula for sensitivity (also called recall)?
A: Sensitivity = TP / (TP + FN); it measures the proportion of actual positives correctly identified.
31
Q: What is precision in classification?
A: Precision = TP / (TP + FP); it measures how many predicted positives are actually correct.
32
Q: What is the main difference between one-tailed and two-tailed hypothesis tests?
A: One-tailed tests test for an effect in one direction; two-tailed tests test for an effect in both directions.
33
Q: When should you use the T-distribution instead of the normal distribution?
A: Use the T-distribution when sample size is ≤ 30 and population standard deviation is unknown.
34
Q: What does the p-value represent?
A: It represents the probability of observing a test statistic as extreme as the one obtained, assuming the null hypothesis is true.
35
Q: Why does correlation not imply causation?
A: Because two variables moving together doesn't mean one causes the other—there could be other factors involved or it could be coincidental.
36
Visualize the way to correct synonyms in raw data
edt.loc[~edt['PaycheckMethod'].isin(["Mail Check", "DirectDeposit", "Direct_Deposit", "Direct Deposit"]), 'PaycheckMethod'] = "Mail Check" Here ~edt['PaycheckMethod'].isin([...]) selects rows whose value is not in the allowed list, and .loc[..., 'PaycheckMethod'] = "Mail Check" overwrites only those bad rows.
37
Market basket analysis
A technique to discover items frequently purchased together. Helps make product recommendations, store layout decisions, and improve inventory. Example: If many people buy "biography" + "history", place those sections closer.
38
What is this line of code doing? transactions = df.groupby('OrderID')['ProductName'].apply(list).tolist()
This produces a list of lists, where each inner list is one order's "basket." Algorithms like Apriori or FP-Growth expect transactional data in exactly this format: one list of items per order, so the groupby/apply reshapes the data for market basket analysis.
39
How do you apply TransactionEncoder for Apriori?
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
# Step 3: Transform to one-hot encoded format
encoder = TransactionEncoder()
encoded_array = encoder.fit_transform(transactions)
# Step 4: Turn into a DataFrame
onehot_df = pd.DataFrame(encoded_array, columns=encoder.columns_)
40
Cross-Selling
means recommending or placing products together to encourage customers to buy more. It’s a core retail strategy that relies on buying patterns.
41
Visualize how to use permutations to generate possible 1-to-1 rules
from itertools import permutations
flattened = [i for t in transactions for i in t]
groceries = list(set(flattened))  # unique items
rules = list(permutations(groceries, 2))
42
Calculate the support from the below information: ANTECEDENT SUPPORT = 0.122449 , CONSEQUENCE SUPPORT = 0.113379 , SUPPORT = 0.108844 CONFIDENCE = 0.888889, LIFT = 7.84 CONVICTION = 7.979592 , ZHANGS_METRIC = 0.994186
Support. Definition: percentage of total transactions that contain both A and B. Formula: support = count(A and B) / total number of transactions. Example: support = 108 / 993 = 0.1088.
43
Calculate the Confidence from the below information: ANTECEDENT SUPPORT = 0.122449 , CONSEQUENCE SUPPORT = 0.113379 , SUPPORT = 0.108844 CONFIDENCE = 0.888889, LIFT = 7.84 CONVICTION = 7.979592 , ZHANGS_METRIC = 0.994186
Confidence. Definition: probability of B given A. Formula: confidence = support(A and B) / support(A). Example: confidence = 0.1088 / 0.1224 = 0.8889.
44
Calculate the LIFT from the below information: ANTECEDENT SUPPORT = 0.122449 , CONSEQUENCE SUPPORT = 0.113379 , SUPPORT = 0.108844 CONFIDENCE = 0.888889, LIFT = 7.84 CONVICTION = 7.979592 , ZHANGS_METRIC = 0.994186
Lift. Definition: how much more likely B is given A compared to random chance. Formula: lift = confidence / support(B). Example: lift = 0.8889 / 0.1134 = 7.84.
45
Calculate the CONVICTION from the below information: ANTECEDENT SUPPORT = 0.122449 , CONSEQUENCE SUPPORT = 0.113379 , SUPPORT = 0.108844 CONFIDENCE = 0.888889, LIFT = 7.84 CONVICTION = 7.979592 , ZHANGS_METRIC = 0.994186
Conviction. Definition: likelihood that A occurs without B, normalized. Formula: conviction = (1 − support(B)) / (1 − confidence). Step-by-step: 1 − 0.1134 = 0.8866; 1 − 0.8889 = 0.1111; conviction = 0.8866 / 0.1111 = 7.9796.
46
Calculate the Zhang's Metric from the below information: ANTECEDENT SUPPORT = 0.122449 , CONSEQUENCE SUPPORT = 0.113379 , SUPPORT = 0.108844 CONFIDENCE = 0.888889, LIFT = 7.84 CONVICTION = 7.979592 , ZHANGS_METRIC = 0.994186
Zhang's Metric. Definition: a normalized measure of association strength between A and B, ranging from −1 (dissociation) to 1 (strong association). Formula: zhang = (support(A,B) − support(A) × support(B)) / max(support(A,B) × (1 − support(A)), support(A) × (support(B) − support(A,B))). Example: (0.108844 − 0.122449 × 0.113379) / max(0.108844 × 0.877551, 0.122449 × 0.004535) = 0.094961 / 0.095516 ≈ 0.994186.
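As a check, all of these metrics can be recomputed from just the three supports, using the standard definitions (sup_a, sup_b, sup_ab are the antecedent, consequent, and joint supports from the card):

```python
# Recompute confidence, lift, conviction, and Zhang's metric from supports.
def rule_metrics(sup_a, sup_b, sup_ab):
    confidence = sup_ab / sup_a
    lift = confidence / sup_b
    conviction = (1 - sup_b) / (1 - confidence)
    zhang = (sup_ab - sup_a * sup_b) / max(
        sup_ab * (1 - sup_a), sup_a * (sup_b - sup_ab)
    )
    return confidence, lift, conviction, zhang

conf, lift, conv, zhang = rule_metrics(0.122449, 0.113379, 0.108844)
print(round(conf, 4), round(lift, 2), round(conv, 2), round(zhang, 4))
# 0.8889 7.84 7.98 0.9942
```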
47
In market basket analysis, what answers the question, how often do both items appear together?
Support Ex. Out of all sales, 10.88% included both red spotty paper cups and plates.
48
In market basket analysis, what answers the question, If someone buys item A (cups), how likely are they to also buy item B (plates)?
Confidence: Ex. Nearly 89% of people who bought cups also bought the plates.
49
In market basket analysis, what answers the question, Is this buying pattern better than random?
Lift: A customer is 7.8 times more likely to buy the plates if they already bought the cups. That’s very strong.
50
In market basket analysis, what answers the question, How often does the rule not get violated? It’s like a measure of “certainty.”
Conviction: Conviction of ~8 means this rule fails very rarely — strong support for this rule.
51
In market basket analysis, what answers the question: Is this a genuinely strong relationship? Or just a coincidence?
Zhang's metric: A Zhang score near 1 means this is a very strong, reliable association — not just a fluke.
52
What is the Binomial Probability Formula?
P(X = k) = (n choose k) × p^k × (1 − p)^(n − k), where:
P(X = k): probability of getting exactly k successes
n: total number of trials
k: number of successes you want
p: probability of success on one trial
1 − p: probability of failure
(n choose k): number of ways to pick k successes out of n trials (a combination)
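The formula maps directly onto Python's math.comb:

```python
from math import comb

def binom_pmf(k, n, p):
    # P(X = k) = (n choose k) * p^k * (1 - p)^(n - k)
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Example: probability of exactly 2 heads in 3 fair coin flips.
print(binom_pmf(2, 3, 0.5))  # 3 * 0.25 * 0.5 = 0.375
```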
53
T or F. P-Values are most likely to be calculated using the probability mass function.
False. The cumulative distribution function (CDF) provides the probability that a random variable takes on a value less than or equal to a specific value, making it essential for calculating p-values in hypothesis testing.
54
True or false: Rejecting based on a p-value is equivalent to rejecting based on the critical value
True. Rejecting based on a p-value corresponds to rejecting based on the critical value; both are equivalent ways to assess whether a result is statistically significant at the chosen α.
55
What is the relationship between PDF and CDF
The CDF is the integral of the PDF: CDF(x) = area under PDF curve from -∞ to x The PDF is the derivative of the CDF: PDF(x) = rate of change of CDF at x
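This relationship can be verified numerically for the standard normal: integrating the PDF with the trapezoid rule reproduces the closed-form CDF (written via math.erf):

```python
import math

def std_normal_pdf(x):
    return math.exp(-x ** 2 / 2) / math.sqrt(2 * math.pi)

def std_normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def cdf_by_integration(x, steps=20000):
    # Trapezoid-rule area under the PDF from -8 (effectively -infinity) to x.
    a = -8.0
    h = (x - a) / steps
    total = 0.5 * (std_normal_pdf(a) + std_normal_pdf(x))
    for i in range(1, steps):
        total += std_normal_pdf(a + i * h)
    return total * h

print(round(std_normal_cdf(1.0), 4))      # 0.8413
print(round(cdf_by_integration(1.0), 4))  # 0.8413, the same area
```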
56
What does an integral mean?
"Add up all the area under a curve from one point to another."
57
What do np ≥ # and n(1 − p) ≥ # check?
They form a decision rule: they tell you whether the sample is big enough and not too skewed for the normal approximation to the binomial (commonly np ≥ 10 and n(1 − p) ≥ 10).
58
What are decision rules?
They help you determine which test is appropriate, when to use a specific method, or whether a result is significant.
59
What are common decision rules?
1. np ≥ 10 and n(1 − p) ≥ 10: normal approximation to the binomial
2. n ≥ 30: Central Limit Theorem (CLT)
3. α (alpha) = 0.05: the cutoff for rejecting the null hypothesis
4. t-test vs. z-test: choose based on sample size and whether σ is known
5. Expected counts ≥ 5: chi-square tests
6. Paired vs. independent tests
7. Two-tailed vs. one-tailed tests
8. Confidence interval contains 0 (differences) or 1 (ratios): interpreting intervals
9. R² close to 1: strong fit in linear regression
60
When do we reject the null hypothesis
If the p-value is less than alpha (α), which means “This result is too weird to be just random chance. I reject the null.”
61
For a z- or t-test, when do you reject the null hypothesis?
Reject the null if |z| or |t| > critical value. This means your sample result is far from what the null hypothesis predicts.
62
When you have a confidence interval, when do you reject the null hypothesis?
If the confidence interval does not include the null value. Reject the null if: 0 not in CI for differences, or 1 not in CI for ratios (like risk ratio)
63
What are 3 reasons to reject the null hypothesis?
1. The p-value is less than alpha (α) 2. The test statistic falls in the rejection region for a z- or t-test 3. The confidence interval does not include the null value
64
What is the pooled proportion?
pooled proportion p̂ = (x₁ + x₂) / (n₁ + n₂) Because under the null hypothesis, we assume both groups are the same. So we combine ("pool") the data to estimate a common proportion.
65
What is the standard error (SE)
This tells us how much we expect the sample difference to vary by chance: SE = sqrt[ p̂ * (1 - p̂) * (1/n₁ + 1/n₂) ]
66
What is the z formula or z-observed?
It answers, how far away is the difference between the two sample proportions from what we expected — assuming both groups are equal? z observed = (p̂₁ - p̂₂) / SE
67
If we get a z observed of 1.56, what does that mean?
It means the sample difference was 1.56 standard deviations above what we'd expect if the two groups were really equal.
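The pooled proportion, SE, and z-observed chain together; here is a worked sketch with hypothetical counts (45/200 successes vs 30/200):

```python
import math

# Hypothetical data: 45/200 successes in group 1, 30/200 in group 2.
x1, n1 = 45, 200
x2, n2 = 30, 200

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)  # pooled proportion = 0.1875
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))  # standard error
z_obs = (p1 - p2) / se
print(round(z_obs, 2))  # 1.92, below 1.96, so fail to reject at alpha = 0.05
```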
68
What do we do when our z-observed is less than our critical value?
We fail to reject the null hypothesis.
69
Visualize how to use itertools to generate rules
from itertools import permutations
flattened = [item for trans in transactions for item in trans]
items = list(set(flattened))  # convert to set to find unique items
rules = list(permutations(items, 2))  # ordered pairs of distinct items
print(len(rules))  # 72 if 9 items (9 × 8)
70
What are three approaches to conducting market basket analysis?
1. Apriori 2. FP-Growth 3. ECLAT
71
Describe ECLAT for market basket analysis
Approach: Vertical data format, intersection of itemsets. Strength: Efficient for sparse datasets, fast intersection operations. Weakness: Data must be sparse; can use large memory on dense data.
72
Describe FP-GROWTH for market basket analysis
Approach: Uses FP-trees (Frequent Pattern Trees). Strength: Faster on large datasets, fewer scans of data. Weakness: Slightly more complex data structure, can be harder to interpret at a glance.
73
Describe Apriori for market basket analysis
Approach: Candidate generation method. Strength: Easy to interpret, easy to control with parameters (min_support, etc.). Weakness: Slower on large datasets due to repeated scans.
74
.fit_transform() vs .transform()
.fit_transform(): Learns (fits) from the data and then applies the transformation right away. .transform(): Only applies the learned transformation. It does not learn anything new.
75
Describe the steps to market basket analysis
1. load and explore the data 2. aggregate and prune the data 3. transform and transcode the data 4. Apply a method of Apriori, ECLAT, or FP-Growth 5. Apply the association rules 6. Interpret and analyze the rules
76
np.logical_and()
np.logical_and() is a NumPy function that compares two arrays element by element and returns True only when both elements are True. np.logical_and(array1, array2)
77
What line of code for numpy obeys this command: “Create a new column called jam+bread, and for each transaction, mark True only if both jam and bread were purchased.”
onehot['jam+bread'] = np.logical_and(onehot['jam'], onehot['bread'])
78
What answers “Is this association meaningful, or just a fluke?”
Lift. If lift > 1, the itemset happens more often than expected. If lift = 1, the relationship is what you'd expect by chance. If lift < 1, the itemset is less likely to appear together than random chance would suggest.
79
What association rule in MBA tells us how strongly one item implies another—with a twist? Ex. How surprised would I be if customer placed Potter in their cart and that doesn’t lead to Hunger Games?
Conviction
80
What do the below mean? Conviction = 1 Conviction > 1 Conviction < 1
Conviction = 1 → A and B are independent (no real pattern). Conviction > 1 → Stronger association: A leads to B more often than you'd expect by chance. Conviction < 1 → Negative association: A happening makes B less likely.
81
What is Dissociation?
Dissociation means that two things tend not to happen together. 📕 Example: People who read Twilight rarely read Dostoevsky.
82
What is Zhang's metric?
It compares how likely B is to occur because of A versus on its own. It adjusts for how common B already is.
83
True or False: If B is already super popular, then A tells us a lot and Zhang’s metric stays low.
False. If B is already super popular, then A doesn't tell us much → Zhang's metric stays low. If B is rare but happens a lot after A, then Zhang's metric is high → strong association.
84
Complete the below: Zhang's metric closer to 1 → Zhang's metric closer to 0 →
Zhang's metric closer to 1 → Strong association Zhang's metric closer to 0 → Weak or no association Negative → Dissociation (A happening means B probably won’t)
85
What are association rules?
These show relationships between items. Like: Antecedent (the “if”) → Consequent (the “then”)
86
What are the metrics for measuring strength?
Support, confidence, lift, conviction, Zhang's metric
87
One-hot
One‑hot format (or one‑hot encoding) is a way to turn a list of categories—words, product IDs, colors, etc.—into numbers a computer can work with.
88
Visualize a category transformed into a one-hot vector
Apple  → [1, 0, 0, 0]
Banana → [0, 1, 0, 0]
Orange → [0, 0, 1, 0]
Kiwi   → [0, 0, 0, 1]
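A pure-Python sketch of that mapping:

```python
# One-hot vectors for a fixed category list; position order defines the code.
categories = ["Apple", "Banana", "Orange", "Kiwi"]

def one_hot(category):
    return [1 if c == category else 0 for c in categories]

print(one_hot("Orange"))  # [0, 0, 1, 0]
```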
89
Visualize how to use TransactionEncoder
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
onehot = TransactionEncoder()
data = onehot.fit_transform(transactions)
data = pd.DataFrame(data, columns=onehot.columns_, dtype=int)
90
Why would you convert your columns to Boolean dtype before running apriori?
mlxtend works fastest—and without warnings—when columns are Boolean.
91
What is an OrdinalEncoder()?
An encoder that maps ordered categories to integers. You build an empty encoder and hand it the rank order you want, e.g. OrdinalEncoder(categories=[['Low', 'Medium', 'High']]) maps Low → 0, Medium → 1, High → 2.
92
When do you use Poisson regression?
You are predicting count data — that is, the dependent variable represents non-negative integers (0, 1, 2, 3, ...) and often reflects: Number of events in a fixed period (e.g., calls per hour, accidents per day) Frequency of occurrences (e.g., customer visits, email clicks, etc.)
93
When do you use linear regression?
You are predicting a continuous numeric outcome, such as: house prices, test scores, temperatures, revenue
94
Visualize the poisson regression
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Assume df contains columns: 'visits' (count), 'age', 'region'
model = smf.glm(formula='visits ~ age + region', data=df, family=sm.families.Poisson()).fit()
print(model.summary())
95