Data Science Statistics II Flashcards

(95 cards)

1
Q: What is the null hypothesis (H0)?
A: The default assumption that there is no effect or no difference. Any observed effect is due to random chance.
2
Q: What is the alternative hypothesis (H1)?
A: The hypothesis that there is an effect or difference, and it's not due to chance.
3
Q: When do we use a t-distribution?
A: When the sample size is 30 or fewer and the population standard deviation is unknown.
4
Q: What is the difference between a one-tailed and two-tailed test?
A: A one-tailed test checks for an effect in one direction. A two-tailed test checks for any difference in both directions.
5
Q: What is a Probability Density Function (PDF)?
A: Describes the likelihood of a continuous variable taking on a specific value range. It's the curve of probabilities.
6
Q: What is a Cumulative Distribution Function (CDF)?
A: Gives the probability that a variable is less than or equal to a value. It's the area under the PDF curve up to that point.
7
Q: What is the Central Limit Theorem (CLT)?
A: Regardless of the population distribution, the distribution of sample means approaches a normal distribution as the sample size increases.
8
Q: Precision formula and meaning
A: Precision = TP / (TP + FP); it measures how many predicted positives are actually correct.
9
Q: Recall formula and meaning
A: Recall = TP / (TP + FN); it measures how many actual positives were correctly predicted.
10
Q: What do TP, FP, FN stand for?
A: TP: True Positive, FP: False Positive, FN: False Negative.
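The precision and recall formulas above can be sketched in Python; the counts here are made-up examples, not from any dataset:

```python
# Precision and recall from raw confusion-matrix counts (hypothetical numbers).
def precision(tp, fp):
    return tp / (tp + fp)  # of everything predicted positive, how much was right

def recall(tp, fn):
    return tp / (tp + fn)  # of everything actually positive, how much was found

tp, fp, fn = 80, 20, 40
print(precision(tp, fp))  # 80 / 100 = 0.8
print(recall(tp, fn))     # 80 / 120 ≈ 0.667
```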
11
Q: What is R-squared (R²)?
A: The proportion of the variance in the dependent variable that is predictable from the independent variable(s).
12
Q: What does Bayes' Theorem do?
A: It updates the probability estimate for an event based on new evidence. Posterior = (Likelihood × Prior) / Evidence.
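A minimal numeric sketch of the update; the disease-test numbers below are invented for illustration:

```python
# Posterior = (Likelihood x Prior) / Evidence, with hypothetical numbers.
prior = 0.01        # P(disease)
sensitivity = 0.90  # P(positive | disease), the likelihood
fpr = 0.05          # P(positive | no disease)

evidence = sensitivity * prior + fpr * (1 - prior)  # P(positive) overall
posterior = sensitivity * prior / evidence          # P(disease | positive)
print(round(posterior, 4))  # 0.1538
```

Even with a positive test, the posterior stays modest because the prior is low.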
13
Q: What is a confusion matrix?
A: A table that shows predicted vs actual classifications: True Positives, True Negatives, False Positives, False Negatives.
14
Q: What is logistic regression used for?
A: To model binary outcome variables (yes/no, 0/1) using an S-shaped curve.
15
Q: What is linear regression used for?
A: To model the relationship between one or more independent variables and a continuous dependent variable.
16
Q: What is a Confidence Interval (CI)?
A: A range of values around a sample mean that is likely to contain the population mean with a certain confidence level (e.g., 95%).
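A sketch of a 95% CI for a mean, using the normal critical value 1.96 and made-up sample statistics:

```python
import math

# Hypothetical sample: mean 50, standard deviation 10, n = 100.
mean, sd, n = 50.0, 10.0, 100
se = sd / math.sqrt(n)   # standard error of the mean = 1.0
lo = mean - 1.96 * se    # lower bound
hi = mean + 1.96 * se    # upper bound
print(lo, hi)            # ≈ 48.04 51.96
```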
17
Q: What does a p-value represent?
A: The probability of obtaining the result we observed or one even more extreme, assuming the null hypothesis is true.
18
Q: What is the difference between a one-tailed and two-tailed test?
A: A one-tailed test checks for an effect in one direction; a two-tailed test checks both directions.
19
Q: When do we use the T-distribution?
A: When sample size is small (n ≤ 30) and population standard deviation is unknown.
20
Q: What is the difference between linear and logistic regression?
A: Linear regression predicts a continuous outcome; logistic regression predicts a probability between 0 and 1 for classification.
21
Q: What is R² (Coefficient of Determination)?
A: It measures the proportion of variance in the dependent variable explained by the independent variable(s).
22
Q: What is Mean Squared Error (MSE)?
A: The average of the squares of the errors between predicted and actual values.
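MSE in plain Python (the values are arbitrary):

```python
# Mean Squared Error: average of squared prediction errors.
def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

print(mse([3, 5, 2], [2, 5, 4]))  # (1 + 0 + 4) / 3 ≈ 1.667
```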
23
Q: What is a train/test split?
A: A method of dividing data into a training set to build the model and a test set to evaluate its performance.
24
Q: What is K-Fold Cross Validation?
A: A technique that splits the data into K subsets and trains/tests the model K times, each time using a different subset as the test set.
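The idea can be sketched without a library by rotating which fold is held out (a manual sketch assuming n divides evenly by K; in practice scikit-learn's KFold handles the bookkeeping):

```python
def k_fold_indices(n, k):
    # Yield (train, test) index lists, one pair per fold.
    indices = list(range(n))
    fold_size = n // k
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = [j for j in indices if j not in test]
        yield train, test

for train, test in k_fold_indices(6, 3):
    print(test)  # [0, 1] then [2, 3] then [4, 5]
```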
25
Q: What is Random-Fold Validation?
A: A method that randomly splits the data multiple times into train/test sets and averages the results for robust performance metrics.
26
Q: What is the difference between correlation and causation?
A: Correlation indicates a relationship between variables; causation indicates that one variable causes a change in another.
27
Q: What is Pearson correlation?
A: A measure of linear correlation between two variables, ranging from -1 to 1.
28
Q: How do you calculate R² from Pearson correlation?
A: Square the Pearson correlation coefficient: R² = r².
29
Q: What is a confusion matrix?
A: A table used to evaluate the performance of a classification model by comparing actual vs. predicted labels, showing TP, FP, FN, and TN.
30
Q: What is the formula for sensitivity (also called recall)?
A: Sensitivity = TP / (TP + FN); it measures the proportion of actual positives correctly identified.
31
Q: What is precision in classification?
A: Precision = TP / (TP + FP); it measures how many predicted positives are actually correct.
32
Q: What is the main difference between one-tailed and two-tailed hypothesis tests?
A: One-tailed tests test for an effect in one direction; two-tailed tests test for an effect in both directions.
33
Q: When should you use the T-distribution instead of the normal distribution?
A: Use the T-distribution when sample size is ≤ 30 and population standard deviation is unknown.
34
Q: What does the p-value represent?
A: It represents the probability of observing a test statistic as extreme as the one obtained, assuming the null hypothesis is true.
35
Q: Why does correlation not imply causation?
A: Because two variables moving together doesn't mean one causes the other—there could be other factors involved or it could be coincidental.
36
Visualize the way to correct synonyms in raw data
edt.loc[~edt['PaycheckMethod'].isin(["Mail Check", "DirectDeposit", "Direct_Deposit", "Direct Deposit"]), 'PaycheckMethod'] = "Mail Check" Here ~edt['PaycheckMethod'].isin([...]) selects rows whose value is not in the allowed list, and .loc[..., 'PaycheckMethod'] = "Mail Check" overwrites only those bad rows.
37
Market basket analysis
A technique to discover items frequently purchased together. Helps make product recommendations, store layout decisions, and improve inventory. Example: If many people buy "biography" + "history", place those sections closer.
38
What is this line of code doing? transactions = df.groupby('OrderID')['ProductName'].apply(list).tolist()
This produces a list of lists, where each inner list is one order's "basket." Algorithms like Apriori or FP-Growth expect transactional data in exactly this format: one list of items per order, so the groupby/apply reshapes the data for market basket analysis.
39
How do you apply TransactionEncoder for Apriori?
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
# Step 3: Transform to one-hot encoded format
encoder = TransactionEncoder()
encoded_array = encoder.fit_transform(transactions)
# Step 4: Turn into a DataFrame
onehot_df = pd.DataFrame(encoded_array, columns=encoder.columns_)
40
Cross-Selling
means recommending or placing products together to encourage customers to buy more. It’s a core retail strategy that relies on buying patterns.
41
Visualize how to use permutations to generate possible 1-to-1 rules
from itertools import permutations
flattened = [i for t in transactions for i in t]
groceries = list(set(flattened))  # unique items
rules = list(permutations(groceries, 2))
42
Calculate the support from the below information: ANTECEDENT SUPPORT = 0.122449 , CONSEQUENCE SUPPORT = 0.113379 , SUPPORT = 0.108844 CONFIDENCE = 0.888889, LIFT = 7.84 CONVICTION = 7.979592 , ZHANGS_METRIC = 0.994186
Support. Definition: percentage of total transactions that contain both A and B. Formula: support = count(A and B) / total number of transactions. Example: support = 108 / 993 = 0.1088.
43
Calculate the Confidence from the below information: ANTECEDENT SUPPORT = 0.122449 , CONSEQUENCE SUPPORT = 0.113379 , SUPPORT = 0.108844 CONFIDENCE = 0.888889, LIFT = 7.84 CONVICTION = 7.979592 , ZHANGS_METRIC = 0.994186
Confidence. Definition: probability of B given A. Formula: confidence = support(A and B) / support(A). Example: confidence = 0.1088 / 0.1224 = 0.8889.
44
Calculate the LIFT from the below information: ANTECEDENT SUPPORT = 0.122449 , CONSEQUENCE SUPPORT = 0.113379 , SUPPORT = 0.108844 CONFIDENCE = 0.888889, LIFT = 7.84 CONVICTION = 7.979592 , ZHANGS_METRIC = 0.994186
Lift. Definition: how much more likely B is given A compared to random chance. Formula: lift = confidence / support(B). Example: lift = 0.8889 / 0.1134 = 7.84.
45
Calculate the CONVICTION from the below information: ANTECEDENT SUPPORT = 0.122449 , CONSEQUENCE SUPPORT = 0.113379 , SUPPORT = 0.108844 CONFIDENCE = 0.888889, LIFT = 7.84 CONVICTION = 7.979592 , ZHANGS_METRIC = 0.994186
Conviction. Definition: likelihood that A occurs without B, normalized. Formula: conviction = (1 − support(B)) / (1 − confidence). Step-by-step: 1 − 0.1134 = 0.8866; 1 − 0.8889 = 0.1111; conviction = 0.8866 / 0.1111 = 7.9796.
46
Calculate the Zhang's Metric from the below information: ANTECEDENT SUPPORT = 0.122449 , CONSEQUENCE SUPPORT = 0.113379 , SUPPORT = 0.108844 CONFIDENCE = 0.888889, LIFT = 7.84 CONVICTION = 7.979592 , ZHANGS_METRIC = 0.994186
Zhang's Metric. Definition: a normalized measure of association strength between A and B, ranging from −1 (dissociation) to 1 (strong association). Formula: zhang = (support(A,B) − support(A) × support(B)) / max(support(A,B) × (1 − support(A)), support(A) × (support(B) − support(A,B))). Example: (0.108844 − 0.122449 × 0.113379) / max(0.108844 × 0.877551, 0.122449 × 0.004535) = 0.094961 / 0.095516 ≈ 0.994186.
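As a check, all of these metrics can be recomputed from just the three supports, using the standard definitions (sup_a, sup_b, sup_ab are the antecedent, consequent, and joint supports from the card):

```python
# Recompute confidence, lift, conviction, and Zhang's metric from supports.
def rule_metrics(sup_a, sup_b, sup_ab):
    confidence = sup_ab / sup_a
    lift = confidence / sup_b
    conviction = (1 - sup_b) / (1 - confidence)
    zhang = (sup_ab - sup_a * sup_b) / max(
        sup_ab * (1 - sup_a), sup_a * (sup_b - sup_ab)
    )
    return confidence, lift, conviction, zhang

conf, lift, conv, zhang = rule_metrics(0.122449, 0.113379, 0.108844)
print(round(conf, 4), round(lift, 2), round(conv, 2), round(zhang, 4))
# 0.8889 7.84 7.98 0.9942
```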
47
In market basket analysis, what answers the question, how often do both items appear together?
Support Ex. Out of all sales, 10.88% included both red spotty paper cups and plates.
48
In market basket analysis, what answers the question, If someone buys item A (cups), how likely are they to also buy item B (plates)?
Confidence: Ex. Nearly 89% of people who bought cups also bought the plates.
49
In market basket analysis, what answers the question, Is this buying pattern better than random?
Lift: A customer is 7.8 times more likely to buy the plates if they already bought the cups. That’s very strong.
50
In market basket analysis, what answers the question, How often does the rule not get violated? It’s like a measure of “certainty.”
Conviction: Conviction of ~8 means this rule fails very rarely — strong support for this rule.
51
In market basket analysis, what answers the question: Is this a genuinely strong relationship? Or just a coincidence?
Zhang's metric: A Zhang score near 1 means this is a very strong, reliable association — not just a fluke.
52
What is the Binomial Probability Formula?
P(X = k) = (n choose k) × p^k × (1 − p)^(n − k), where:
P(X = k): probability of getting exactly k successes
n: total number of trials
k: number of successes you want
p: probability of success on one trial
1 − p: probability of failure
(n choose k): number of ways to pick k successes out of n trials (a combination)
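The formula maps directly onto Python's math.comb:

```python
from math import comb

def binom_pmf(k, n, p):
    # P(X = k) = (n choose k) * p^k * (1 - p)^(n - k)
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Example: probability of exactly 2 heads in 3 fair coin flips.
print(binom_pmf(2, 3, 0.5))  # 3 * 0.25 * 0.5 = 0.375
```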
53
T or F. P-Values are most likely to be calculated using the probability mass function.
False. The cumulative distribution function (CDF) provides the probability that a random variable takes on a value less than or equal to a specific value, making it essential for calculating p-values in hypothesis testing.
54
True or false: Rejecting based on a p-value is equivalent to rejecting based on the critical value
True. Rejecting based on a p-value corresponds to rejecting based on the critical value; both are equivalent ways to assess whether a result is statistically significant at the chosen α.
55
What is the relationship between PDF and CDF
The CDF is the integral of the PDF: CDF(x) = area under PDF curve from -∞ to x The PDF is the derivative of the CDF: PDF(x) = rate of change of CDF at x
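This relationship can be verified numerically for the standard normal: integrating the PDF with the trapezoid rule reproduces the closed-form CDF (written via math.erf):

```python
import math

def std_normal_pdf(x):
    return math.exp(-x ** 2 / 2) / math.sqrt(2 * math.pi)

def std_normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def cdf_by_integration(x, steps=20000):
    # Trapezoid-rule area under the PDF from -8 (effectively -infinity) to x.
    a = -8.0
    h = (x - a) / steps
    total = 0.5 * (std_normal_pdf(a) + std_normal_pdf(x))
    for i in range(1, steps):
        total += std_normal_pdf(a + i * h)
    return total * h

print(round(std_normal_cdf(1.0), 4))      # 0.8413
print(round(cdf_by_integration(1.0), 4))  # 0.8413, the same area
```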
56
What does an integral mean?
"Add up all the area under a curve from one point to another."
57
What do np ≥ # and n(1 − p) ≥ # check?
They form a decision rule: they tell you whether the sample is big enough and not too skewed for the normal approximation to the binomial (commonly np ≥ 10 and n(1 − p) ≥ 10).
58
What are decision rules?
They help you determine which test is appropriate, when to use a specific method, or whether a result is significant.
59
What are common decision rules?
1. np ≥ 10 and n(1 − p) ≥ 10: normal approximation to the binomial
2. n ≥ 30: Central Limit Theorem (CLT)
3. α (alpha) = 0.05: the cutoff for rejecting the null hypothesis
4. t-test vs. z-test: choose based on sample size and whether σ is known
5. Expected counts ≥ 5: chi-square tests
6. Paired vs. independent tests
7. Two-tailed vs. one-tailed tests
8. Confidence interval contains 0 (differences) or 1 (ratios): interpreting intervals
9. R² close to 1: strong fit in linear regression
60
When do we reject the null hypothesis
If the p-value is less than alpha (α), which means “This result is too weird to be just random chance. I reject the null.”
61
For a z- or t-test, when do you reject the null hypothesis?
Reject the null if |z| or |t| > critical value. This means your sample result is far from what the null hypothesis predicts.
62
When you have a confidence interval, when do you reject the null hypothesis?
If the confidence interval does not include the null value. Reject the null if: 0 not in CI for differences, or 1 not in CI for ratios (like risk ratio)
63
What are 3 reasons to reject the null hypothesis?
1. The p-value is less than alpha (α) 2. The test statistic falls in the rejection region for a z- or t-test 3. The confidence interval does not include the null value
64
What is the pooled proportion?
pooled proportion p̂ = (x₁ + x₂) / (n₁ + n₂) Because under the null hypothesis, we assume both groups are the same. So we combine ("pool") the data to estimate a common proportion.
65
What is the standard error (SE)
This tells us how much we expect the sample difference to vary by chance: SE = sqrt[ p̂ * (1 - p̂) * (1/n₁ + 1/n₂) ]
66
What is the z formula or z-observed?
It answers, how far away is the difference between the two sample proportions from what we expected — assuming both groups are equal? z observed = (p̂₁ - p̂₂) / SE
67
If we get a z observed of 1.56, what does that mean?
It means the sample difference was 1.56 standard deviations above what we'd expect if the two groups were really equal.
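The pooled proportion, SE, and z-observed chain together; here is a worked sketch with hypothetical counts (45/200 successes vs 30/200):

```python
import math

# Hypothetical data: 45/200 successes in group 1, 30/200 in group 2.
x1, n1 = 45, 200
x2, n2 = 30, 200

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)  # pooled proportion = 0.1875
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))  # standard error
z_obs = (p1 - p2) / se
print(round(z_obs, 2))  # 1.92, below 1.96, so fail to reject at alpha = 0.05
```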
68
What do we do when our z-observed is less than our critical value?
We fail to reject the null hypothesis.
69
Visualize how to use itertools to generate rules
from itertools import permutations
flattened = [item for trans in transactions for item in trans]
items = list(set(flattened))  # convert to set to find unique items
rules = list(permutations(items, 2))  # ordered pairs of distinct items
print(len(rules))  # 72 if 9 items (9 × 8)
70
What are three approaches to conducting market basket analysis?
1. Apriori 2. FP-Growth 3. ECLAT
71
Describe ECLAT for market basket analysis
Approach: Vertical data format, intersection of itemsets. Strength: Efficient for sparse datasets, fast intersection operations. Weakness: Data must be sparse; can use large memory on dense data.
72
Describe FP-GROWTH for market basket analysis
Approach: Uses FP-trees (Frequent Pattern Trees). Strength: Faster on large datasets, fewer scans of data. Weakness: Slightly more complex data structure, can be harder to interpret at a glance.
73
Describe Apriori for market basket analysis
Approach: Candidate generation method. Strength: Easy to interpret, easy to control with parameters (min_support, etc.). Weakness: Slower on large datasets due to repeated scans.
74
.fit_transform() vs .transform()
.fit_transform(): Learns (fits) from the data and then applies the transformation right away. .transform(): Only applies the learned transformation. It does not learn anything new.
75
Describe the steps to market basket analysis
1. load and explore the data 2. aggregate and prune the data 3. transform and transcode the data 4. Apply a method of Apriori, ECLAT, or FP-Growth 5. Apply the association rules 6. Interpret and analyze the rules
76
np.logical_and()
np.logical_and() is a NumPy function that compares two arrays element by element and returns True only when both elements are True. np.logical_and(array1, array2)
77
What line of code for numpy obeys this command: “Create a new column called jam+bread, and for each transaction, mark True only if both jam and bread were purchased.”
onehot['jam+bread'] = np.logical_and(onehot['jam'], onehot['bread'])
78
What answers “Is this association meaningful, or just a fluke?”
Lift. If lift > 1, the itemset happens more often than expected. If lift = 1, the relationship is what you'd expect by chance. If lift < 1, the itemset is less likely to appear together than random chance would suggest.
79
What association rule in MBA tells us how strongly one item implies another—with a twist? Ex. How surprised would I be if customer placed Potter in their cart and that doesn’t lead to Hunger Games?
Conviction
80
What do the below mean? Conviction = 1 Conviction > 1 Conviction < 1
Conviction = 1 → A and B are independent (no real pattern). Conviction > 1 → Stronger association: A leads to B more often than you'd expect by chance. Conviction < 1 → Negative association: A happening makes B less likely.
81
What is Dissociation?
Dissociation means that two things tend not to happen together. 📕 Example: People who read Twilight rarely read Dostoevsky.
82
What is Zhang's metric?
It compares how likely B is to occur because of A versus on its own. It adjusts for how common B already is.
83
True or False: If B is already super popular, then A tells us a lot and Zhang’s metric stays low.
False. If B is already super popular, then A doesn't tell us much → Zhang's metric stays low. If B is rare but happens a lot after A, then Zhang's metric is high → strong association.
84
Complete the below: Zhang's metric closer to 1 → Zhang's metric closer to 0 →
Zhang's metric closer to 1 → Strong association Zhang's metric closer to 0 → Weak or no association Negative → Dissociation (A happening means B probably won’t)
85
What are association rules?
These show relationships between items. Like: Antecedent (the “if”) → Consequent (the “then”)
86
What are the metrics for measuring strength?
Support, confidence, lift, conviction, Zhang's metric
87
One-hot
One‑hot format (or one‑hot encoding) is a way to turn a list of categories—words, product IDs, colors, etc.—into numbers a computer can work with.
88
Visualize a category transformed into a one-hot vector
Apple  → [1, 0, 0, 0]
Banana → [0, 1, 0, 0]
Orange → [0, 0, 1, 0]
Kiwi   → [0, 0, 0, 1]
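A pure-Python sketch of that mapping:

```python
# One-hot vectors for a fixed category list; position order defines the code.
categories = ["Apple", "Banana", "Orange", "Kiwi"]

def one_hot(category):
    return [1 if c == category else 0 for c in categories]

print(one_hot("Orange"))  # [0, 0, 1, 0]
```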
89
Visualize how to use TransactionEncoder
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
onehot = TransactionEncoder()
data = onehot.fit_transform(transactions)
data = pd.DataFrame(data, columns=onehot.columns_, dtype=int)
90
Why would you convert your columns to Boolean dtype before running apriori?
mlxtend works fastest—and without warnings—when columns are Boolean.
91
What is an OrdinalEncoder()?
An encoder that maps ordered categories to integers. You build an empty encoder and hand it the rank order you want, e.g. OrdinalEncoder(categories=[['Low', 'Medium', 'High']]) maps Low → 0, Medium → 1, High → 2.
92
When do you use Poisson regression?
You are predicting count data — that is, the dependent variable represents non-negative integers (0, 1, 2, 3, ...) and often reflects: Number of events in a fixed period (e.g., calls per hour, accidents per day) Frequency of occurrences (e.g., customer visits, email clicks, etc.)
93
When do you use linear regression?
You are predicting a continuous numeric outcome, such as: house prices, test scores, temperatures, revenue
94
Visualize the poisson regression
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Assume df contains columns: 'visits' (count), 'age', 'region'
model = smf.glm(formula='visits ~ age + region', data=df, family=sm.families.Poisson()).fit()
print(model.summary())
95