Data Science Statistics Flashcards
Learned at WGU (181 cards)
Q: What does the Central Limit Theorem state?
A: Regardless of the population, the distribution of the sample means will approximate a normal distribution as the sample size increases.
Take repeated samples, calculate their means → plot those means → the result approaches a bell curve.
Python for Central Limit Theorem
import numpy as np
import matplotlib.pyplot as plt
# 1,000 means of samples of size 50 drawn from a skewed (exponential) distribution
samples = [np.mean(np.random.exponential(size=50)) for _ in range(1000)]
plt.hist(samples, bins=30, edgecolor="black")  # the means form a near-normal bell curve
plt.title("Sample Means - CLT in Action")
plt.show()
Q: What is the Probability Density Function (PDF)?
A: The Probability Density Function gives the relative likelihood of the variable taking a value at an exact point, expressed per unit of x (a density, not a probability).
Python for Probability Density Function (PDF)
from scipy.stats import norm
x = np.linspace(-4, 4, 200)  # points to evaluate (np, plt imported earlier)
pdf = norm.pdf(x, loc=0, scale=1)  # standard normal density at each x
plt.plot(x, pdf)
Q: What does the CDF tell us?
A: The Cumulative Distribution Function gives you the total probability that a variable is less than or equal to a certain value.
CDF(x=0) tells you “What’s the probability that X is less than or equal to 0?”
Area under the PDF curve up to value x.
Python for Cumulative Distribution Function (CDF)
from scipy.stats import norm
norm.cdf(1.96, loc=0, scale=1) # ≈ 0.975
loc is the mean of the normal distribution
scale is the standard deviation
Q: What does the Inverse CDF (or PPF) do?
A: Use the inverse CDF to determine the value of the variable associated with a specific probability. x = InvCDF(P)
Python for Inverse CDF
from scipy.stats import norm
norm.ppf(0.975, loc=0, scale=1) # ≈ 1.96
Q: What is a confidence interval?
A: A range of values within which we expect a population parameter to fall with a certain level of confidence.
Python for Confidence Interval
import numpy as np
import scipy.stats as stats

mean = np.mean(sample)
sem = stats.sem(sample)  # standard error of the mean
# 95% CI for the population mean: df = n - 1 (degrees of freedom),
# loc=mean centers the interval, scale=sem sets its width
ci = stats.t.interval(0.95, df=len(sample)-1, loc=mean, scale=sem)
Q: What does the p-value measure?
A: The probability of observing your data (or more extreme) if the null hypothesis is true.
By Hand:
Use z or t-tables based on the test statistic.
Python for P-Value
from scipy.stats import ttest_1samp
ttest_1samp(sample, popmean=52) # Returns t-statistic and p-value
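The same p-value can also be computed "by hand" from the t statistic using the t-distribution's CDF (a sketch; sample and popmean=52 are carried over from the example above):
from scipy.stats import t, ttest_1samp
t_stat, p = ttest_1samp(sample, popmean=52)
df = len(sample) - 1
p_by_hand = 2 * (1 - t.cdf(abs(t_stat), df=df))  # two-tailed: double the upper-tail area
# p_by_hand matches the p-value returned by ttest_1samp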
Q: What’s the difference between a one-tailed and two-tailed test?
A: One-Tailed: Tests for an effect in one direction (e.g., greater than).
Two-Tailed: Tests for an effect in either direction.
Why Two-Tailed is Harder:
Because the alpha level (e.g., 0.05) is split between two tails (0.025 each), making it harder to reject the null.
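A minimal sketch of why, using the standard normal CDF (the test statistic z = 1.8 is made up for illustration):
from scipy.stats import norm
z = 1.8  # hypothetical test statistic
p_one_tailed = 1 - norm.cdf(z)             # upper tail only, ≈ 0.036
p_two_tailed = 2 * (1 - norm.cdf(abs(z)))  # both tails, ≈ 0.072
# At alpha = 0.05: the one-tailed test rejects the null, the two-tailed test does not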
Q: When is the t-distribution used instead of the normal distribution?
A: When the sample size is small (roughly n < 30)
When population standard deviation is unknown
Python for t-distribution
from scipy.stats import t
# Get the critical t-value
t.ppf(0.975, df=29) # For 95% confidence with df = 29
Q: What does it mean to partition your data?
A: It means splitting your dataset into training data (to teach your model) and test data (to check how well it learned).
Q: What does train_test_split() do in Python?
A: It randomly divides your dataset into training and testing sets so you can build and evaluate your model.
from sklearn.model_selection import train_test_split
# Default split: 75% of rows go to training, 25% to testing
X_train, X_test, y_train, y_test = train_test_split(X, y)
Q: Why should you validate your partitioned data?
A: To make sure both training and testing sets have a similar distribution (e.g., same class balance) — no surprises!
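A sketch of the check (assumes y_train and y_test are pandas Series from train_test_split):
# Class proportions should look similar in both sets
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
# To guarantee matching class balance, pass stratify=y to the split:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)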
Q: What is data imbalance?
A: When one class (like “Yes”) appears way less often than another (like “No”). This can confuse your model into always guessing the majority class.
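A quick way to spot it (assumes y is a pandas Series of class labels):
print(y.value_counts(normalize=True))
# e.g., No: 0.95, Yes: 0.05 → always guessing "No" scores 95% accuracy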
Q: How can you balance imbalanced data?
A: Use techniques like the following (see the sketch after this list):
Oversampling (make more of the rare class)
Undersampling (remove from the common class)
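A minimal sketch of both with sklearn.utils.resample (assumes a DataFrame df with a binary "label" column; imblearn's RandomOverSampler/RandomUnderSampler are common alternatives):
from sklearn.utils import resample
import pandas as pd

majority = df[df["label"] == "No"]
minority = df[df["label"] == "Yes"]

# Oversampling: draw from the rare class with replacement until sizes match
minority_up = resample(minority, replace=True, n_samples=len(majority))
balanced_over = pd.concat([majority, minority_up])

# Undersampling: randomly drop rows from the common class to match the rare class
majority_down = resample(majority, replace=False, n_samples=len(minority))
balanced_under = pd.concat([majority_down, minority])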
Q: What is a baseline model?
A: A simple model that always guesses the most common class. You use it to see if your actual model is better than “just guessing.”
Q: How do you create a baseline model in Python?
A: Use DummyClassifier to always guess the most frequent class.
from sklearn.dummy import DummyClassifier
model = DummyClassifier(strategy="most_frequent")
Q: What’s the goal of a baseline model?
A: To set a performance floor — if your real model doesn’t beat the baseline, it’s not useful.
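Putting the floor into practice (a sketch; assumes the train/test split from earlier and a fitted model named real_model, which is hypothetical):
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print("Baseline accuracy:", baseline.score(X_test, y_test))
print("Model accuracy:", real_model.score(X_test, y_test))
# If the real model can't beat the baseline, it isn't adding value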
Q: What is pd.crosstab() in Python?
A: pd.crosstab() is a pandas function that creates a cross-tabulation table (also called a contingency table). It shows the frequency distribution of two (or more) categorical variables.
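A minimal example (the data is made up):
import pandas as pd

df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F"],
    "purchased": ["Yes", "No", "Yes", "Yes", "No"],
})
print(pd.crosstab(df["gender"], df["purchased"]))
# purchased  No  Yes
# gender
# F           1    2
# M           1    1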