Data Science Statistics Flashcards

Learned at WGU (181 cards)

1
Q

Q: What does the Central Limit Theorem state?

A

A: Regardless of the population, the distribution of the sample means will approximate a normal distribution as the sample size increases.
Take repeated samples, calculate their means → plot those means → the result approaches a bell curve.

2
Q

Python for Central Limit Theorem

A

import numpy as np
import matplotlib.pyplot as plt

samples = [np.mean(np.random.exponential(size=50)) for _ in range(1000)]
plt.hist(samples, bins=30, edgecolor="black")
plt.title("Sample Means - CLT in Action")
plt.show()

3
Q

Q: What is the Probability Density Function (PDF)?

A

The Probability Density Function gives the relative likelihood of a value at an exact point, per unit of x. For a continuous variable, actual probabilities come from areas under the curve, not single points.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-4, 4, 200)  # grid of points to evaluate the density at
pdf = norm.pdf(x, loc=0, scale=1)
plt.plot(x, pdf)

4
Q

Q: What does the CDF tell us?

A

The Cumulative Distribution Function gives you the total probability that a variable is less than or equal to a certain value.
CDF(x=0) tells you “What’s the probability that X is less than or equal to 0?”
Area under the PDF curve up to value x.

5
Q

Python for Cumulative Distribution Function (CDF)

A

from scipy.stats import norm
norm.cdf(1.96, loc=0, scale=1)  # ≈ 0.975
# loc is the mean of the normal distribution; scale is the standard deviation

6
Q

Q: What does the Inverse CDF (or PPF) do?

A

Use the inverse CDF to determine the value of the variable associated with a specific probability. x = InvCDF(P)

7
Q

Python for Inverse CDF

A

from scipy.stats import norm
norm.ppf(0.975, loc=0, scale=1) # ≈ 1.96

8
Q

Q: What is a confidence interval?

A

A: A range of values within which we expect a population parameter to fall with a certain level of confidence.

9
Q

Python for Confidence Interval

A

import numpy as np
import scipy.stats as stats

mean = np.mean(sample)
sem = stats.sem(sample)  # standard error of the mean

# 95% CI for the population mean: df = n - 1, loc=mean centers the interval, scale=sem
ci = stats.t.interval(0.95, df=len(sample)-1, loc=mean, scale=sem)

10
Q

Q: What does the p-value measure?

A

A: The probability of observing your data (or more extreme) if the null hypothesis is true.

By Hand:
Use z or t-tables based on the test statistic.

11
Q

Python for P-Value

A

from scipy.stats import ttest_1samp

ttest_1samp(sample, popmean=52) # Returns t-statistic and p-value

12
Q

Q: What’s the difference between a one-tailed and two-tailed test?

A

One-Tailed: Tests for an effect in one direction (e.g., greater than).
Two-Tailed: Tests for an effect in either direction.

13
Q

Why Two-Tailed is Harder:

A

Because the alpha level (e.g., 0.05) is split between two tails (0.025 each), making it harder to reject the null.
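A quick sketch of the difference using scipy (z = 1.8 is a hypothetical test statistic):

from scipy.stats import norm

z = 1.8
p_one_tailed = norm.sf(z)           # upper tail only ≈ 0.036 → significant at α = 0.05
p_two_tailed = 2 * norm.sf(abs(z))  # both tails ≈ 0.072 → not significant at α = 0.05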

14
Q

Q: When is the t-distribution used instead of the normal distribution?

A

A: When the sample size is small (n < 30)
When the population standard deviation is unknown

15
Q

Python for t-distribution

A

from scipy.stats import t
# Get the critical t-value
t.ppf(0.975, df=29) # For 95% confidence with df = 29

16
Q

Q: What does it mean to partition your data?

A

A: It means splitting your dataset into training data (to teach your model) and test data (to check how well it learned).

17
Q

Q: What does train_test_split() do in Python?

A

A: It randomly divides your dataset into training and testing sets so you can build and evaluate your model.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

18
Q

Q: Why should you validate your partitioned data?

A

A: To make sure both training and testing sets have a similar distribution (e.g., same class balance) — no surprises!

19
Q

Q: What is data imbalance?

A

A: When one class (like “Yes”) appears way less often than another (like “No”). This can confuse your model into always guessing the majority class.

20
Q

Q: How can you balance imbalanced data?

A

A: Use techniques like:
Oversampling (make more of the rare class)
Undersampling (remove from the common class); see the sketch below
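A minimal sketch of oversampling with sklearn's resample, assuming df is a DataFrame with a binary "Response" column (hypothetical names):

import pandas as pd
from sklearn.utils import resample

minority = df[df["Response"] == "Yes"]
majority = df[df["Response"] == "No"]

# Draw from the minority class with replacement until it matches the majority
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])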

21
Q

Q: What is a baseline model?

A

A: A simple model that always guesses the most common class. You use it to see if your actual model is better than “just guessing.”

22
Q

Q: How do you create a baseline model in Python?

A

A: Use DummyClassifier to always guess the most frequent class.
from sklearn.dummy import DummyClassifier
model = DummyClassifier(strategy="most_frequent")

23
Q

Q: What’s the goal of a baseline model?

A

A: To set a performance floor — if your real model doesn’t beat the baseline, it’s not useful.

24
Q

What is pd.crosstab() in Python?

A

pd.crosstab() is a pandas function that creates a cross-tabulation table (aka contingency table). It shows the frequency distribution of two (or more) categorical variables.

25
Visualize .crosstab()
pd.crosstab(index=df["Gender"], columns=df["Turnover"])
26
What does div() do here?
Say you have this crosstab and you want to normalize it so that you see proportions, not raw counts. That’s where .div() comes in:
crosstab_01.div(crosstab_01.sum(axis=1), axis=0)
27
Q: What is univariate analysis?
A: It explores one variable at a time, like analyzing the distribution or summary stats.
28
Q: What is bivariate analysis?
A: It studies the relationship between two variables, like comparing gender to turnover.
29
Visualize calculating mean, median, mode, stdev, and quantiles in python
import statistics

statistics.mean(a)       # Average
statistics.median(a)     # Middle value
statistics.mode(a)       # Most frequent value
statistics.stdev(a)      # Standard deviation (sample)
statistics.quantiles(a)  # Quartile values (by default: Q1, Q2, Q3)
30
visualize calculating the range in python
range_val = max(a) - min(a)
31
Quantiles vs Quartiles
- **Quantiles** divide your data into *n* equally sized intervals, e.g., quintiles (5 groups), deciles (10), percentiles (100).
- **Quartiles** are a type of quantile that divides the data into 4 parts.
32
Rule of Multiplication (AND)
P(A ∩ B) = P(A) * P(B|A)
If A and B are **independent**: `P(A ∩ B) = P(A) * P(B)`
33
Rule of Addition (OR)
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
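A small worked example of both rules with a standard 52-card deck:

# Multiplication (AND): drawing two aces without replacement
p_two_aces = (4/52) * (3/51)           # P(A) * P(B|A) ≈ 0.0045

# Addition (OR): one card that is a king or a heart
p_king_or_heart = 4/52 + 13/52 - 1/52  # subtract P(king AND heart), counted twice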
34
Why is `0! = 1`?
Because it represents the number of ways to arrange zero objects—there is one way: doing nothing.
35
Visualize factorial in python
import math
math.factorial(5)  # 120
36
Visualize permutations python
math.perm(n, r) # or n! / (n-r)!
37
Visualize combinations python
math.comb(n, r) # or n! / [(n-r)! * r!]
38
Visualize using random generation in python
import numpy as np
np.random.seed(333)          # Get reproducible results from the calls below
np.random.rand(5)            # 5 random numbers (0 to 1)
np.random.randint(1, 10, 5)  # 5 integers from 1 to 9 (the high end is exclusive)
39
What are the conditions for a binomial distribution?
- Fixed number of trials
- Only 2 outcomes
- Constant probability of success
- Independent trials
40
Calculate cdf, pmf, mean, std in python
from scipy.stats import binom

binom.cdf(2, 20, 0.12)   # Probability of ≤ 2 defects in 20 items, defect rate = 12%
binom.pmf(2, 20, 0.12)   # Probability of exactly 2 defects
binom.mean(20, 0.12)     # n * p
binom.std(20, 0.12)      # sqrt(n * p * (1 - p))
41
What is a poisson distribution?
Used for **rare events** in a fixed interval of time or space.
Examples:
- Number of calls per minute
- Number of typos per page
- Number of car accidents in a city
42
What are the properties of a poisson distribution?
**Properties**:
- Infinite possible number of events
- Average rate is constant
43
`statistics` Library
Python’s built-in `statistics` library is great for **basic descriptive statistics**. Think of it like a calculator for things you’d typically do in Excel.
44
scipy.stats – Advanced and Powerful
The scipy.stats library is part of the SciPy ecosystem and provides tools for probability distributions, hypothesis testing, fitting models, and more.
45
Survival Function – SF (1 - CDF)
The Survival Function answers: “What is the probability that a value is greater than x?”
✅ Use sf() when you want the "tail" probability (e.g., upper-bound extreme values).
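A quick check that sf() is the complement of cdf() for the standard normal:

from scipy.stats import norm

norm.sf(1.96, loc=0, scale=1)       # ≈ 0.025, P(X > 1.96)
1 - norm.cdf(1.96, loc=0, scale=1)  # same value, computed via the CDF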
46
Steps in Hypothesis Testing
1. State the Alternative Hypothesis (𝐻ₐ)
2. State the Null Hypothesis (𝐻₀)
3. Choose a Significance Level (α)
4. Calculate the Test Statistic
5. Determine the Critical Value(s)
6. Compare the Test Statistic to the Critical Value
7. Make Your Decision and Interpret the Results
47
🔹 Type I Error (False Positive)
Rejecting 𝐻₀ when it’s actually true. Example: Saying the bottle is incorrectly filled when it's actually correct.
48
🔹 Type II Error (False Negative)
Failing to reject 𝐻₀ when it's actually false. Example: Saying the bottle is fine when it actually isn’t.
49
Power of a Test (1 - β)
The ability to correctly detect a real effect when it exists. Higher power = less chance of a Type II Error.
Ways to increase power (see the sketch below):
- Increase sample size (n)
- Increase α
- Reduce variability (σ)
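A sketch of a power calculation with statsmodels (the effect size, n, and α below are hypothetical):

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Power of a two-sample t-test: effect size 0.5, n = 50 per group, alpha = 0.05
power = analysis.solve_power(effect_size=0.5, nobs1=50, alpha=0.05)
# Sample size per group needed to reach 80% power
n_needed = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)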
50
T or F: "Failing to reject" H₀ does means H₀ is proven true.
False. it simply means you don't have strong enough evidence to say it's false.
51
α (alpha)
It is the probability of making a Type I error (rejecting a true null hypothesis). We choose alpha ourselves, before running the test.
52
Two-tailed test
You're testing for differences in both directions (e.g., not equal to).
53
One-tailed test
You're testing in one direction only (e.g., greater than or less than).
54
Z-critical value
The z-score that defines the boundary of the rejection region in a normal distribution. The cutoff value found in the Z-table.
55
Two-Sample Z-Test
- Compare the **means of two populations** (see the sketch below)
- Conditions:
  - Known population variances
  - Large sample sizes (n ≥ 30)
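A minimal sketch using statsmodels, assuming sample1 and sample2 are arrays holding the two samples (hypothetical names):

from statsmodels.stats.weightstats import ztest

z_stat, p_value = ztest(sample1, sample2)  # H₀: the two population means are equal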
56
Paired t-Test
- Compare **before and after** results (e.g., exam improvement)
57
Two-Proportions Z-Test
Compare two **percentages** (e.g., % of smokers in two towns)
58
Two-Variances Test
Compare **spread/variability** between two datasets
59
ANOVA
Use when comparing **more than 2** group means
60
One-Sample Z-Test Conditions
- You're testing a population mean
- The population **standard deviation is known**
- Either:
  - Sample size ≥ 30 (**Central Limit Theorem**)
  - Or the population is **normally distributed**
61
Steps for One-Sample Z-Test
1. **State the Hypotheses**: H₀ (null) vs. H₁ (alternate)
2. **Choose α** (Significance Level): e.g., 0.05
3. **Compute Test Statistic**: Use Z formula
4. **Find Z-Critical Value**: Use `norm.ppf()` or reference Z table
5. **Decision**: Compare Z_cal to Z_critical, OR compare p_value to α (worked sketch below)
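A worked sketch of these steps with made-up numbers (H₀: μ = 50, σ = 4, n = 36, x̄ = 51.5 are all hypothetical):

import numpy as np
from scipy.stats import norm

sigma, n, x_bar, mu0, alpha = 4, 36, 51.5, 50, 0.05

z_cal = (x_bar - mu0) / (sigma / np.sqrt(n))  # test statistic = 2.25
z_crit = norm.ppf(1 - alpha / 2)              # two-tailed critical value ≈ 1.96
p_value = 2 * norm.sf(abs(z_cal))             # ≈ 0.024

# |z_cal| > z_crit and p_value < alpha, so reject H₀ here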
63
What Python library provides basic descriptive statistics like mean, median, mode, and standard deviation?
statistics library (built-in Python)
64
How do you calculate the range of a list in Python?
range_val = max(a) - min(a)
65
What’s the difference between quantiles and quartiles?
Quantiles divide data into n equal parts. Quartiles specifically divide data into 4 parts (Q1, Q2/median, Q3).
66
Classical Probability Formula?
Probability = (favorable outcomes) / (total outcomes)
67
Mutually Exclusive Events
Events that cannot happen at the same time (e.g., heads or tails).
68
Independent Events
Events where the outcome of one does not affect the outcome of the other.
69
Complementary Events Formula
P(A) + P(not A) = 1
70
Union and Intersection Symbols in Set Theory
Union (A or B) = ∪
Intersection (A and B) = ∩
Complement (Not A) = A'
71
Rule of Multiplication for Probability
P(A ∩ B) = P(A) * P(B|A) (If independent: P(A) * P(B))
72
Rule of Addition for Probability
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
73
What is 0 factorial (0!)?
0! = 1 (One way to arrange zero objects.)
74
When do you use permutations vs combinations?
Permutations: order matters. Combinations: order does not matter.
75
Python code for factorial, permutation, combination
import math
math.factorial(n)
math.perm(n, r)
math.comb(n, r)
76
What does np.random.rand(5) do?
Generates 5 random numbers between 0 and 1.
77
Difference between discrete and continuous data?
Discrete: countable (e.g., emails)
Continuous: measurable (e.g., height)
78
Conditions for a Binomial Distribution
- Fixed number of trials
- Two outcomes (success/failure)
- Constant probability
- Independent trials
79
Binomial Distribution: Python Example
from scipy.stats import binom
binom.pmf(2, 20, 0.12)
binom.cdf(2, 20, 0.12)
80
When is Poisson distribution used?
For rare events over fixed intervals (e.g., typos per page).
81
Poisson Distribution: Python Example
from scipy.stats import poisson
poisson.pmf(3, mu=2)
poisson.cdf(3, mu=2)
82
When to use binomial vs Poisson?
Binomial: repeated trials (coin flip). Poisson: rare, random events (calls per minute).
83
When to use statistics vs scipy.stats?
statistics: simple one-dimensional descriptive stats.
scipy.stats: advanced stats like distributions and hypothesis tests.
84
What does norm.cdf(x) calculate?
The probability that a value is ≤ x in a normal distribution.
85
What does norm.sf(x) calculate?
The probability that a value is > x in a normal distribution.
86
What is the p-value?
The probability of getting a test result as extreme or more extreme if the null hypothesis is true.
87
"If p is low, the null must go." What does this mean?
If the p-value < α (e.g., 0.05), reject the null hypothesis.
88
Steps in Hypothesis Testing
1. State H₀ and Hₐ
2. Set α
3. Calculate test statistic
4. Find critical value
5. Compare and conclude
89
What is a Type I error?
Rejecting H₀ when it’s actually true (false positive).
90
What is a Type II error?
Failing to reject H₀ when it’s actually false (false negative).
91
What does "degrees of freedom" mean for t-tests?
df = sample size - 1
92
When do you use a t-test vs a z-test?
t-test: sample size < 30 or unknown population standard deviation.
z-test: large sample or known standard deviation.
93
Conditions for one-sample t-test?
- Random samples
- Approximate normality
- Population standard deviation unknown
94
Python code for one-sample t-test
from scipy import stats
stats.ttest_1samp(data, population_mean)
95
Chi-Square Test Purpose
Tests relationships between categorical variables (e.g., smoking vs gender).
96
Python code for Chi-Square Test (Contingency Table)
import scipy.stats as stats
stats.chi2_contingency(table)
97
When to use ANOVA?
To test if there’s a significant difference between 3 or more group means.
98
Python code for ANOVA
from scipy.stats import f_oneway
f_oneway(group1, group2, group3)
99
In ANOVA, what does a high F-statistic mean?
More between-group variation than within-group → likely reject H₀.
100
What’s a contingency table?
A table showing the frequency distribution of variables for Chi-Square tests.
101
What is the F-Test used for?
To compare the variances of two populations.
102
When do you use a paired t-test?
When comparing two related samples (e.g., before-and-after measurements on the same subjects).
103
Hypotheses for a Paired t-Test
H₀: μ_before = μ_after (no difference)
Hₐ: μ_before ≠ μ_after (difference exists)
104
Paired t-Test Formula
t = d̄ / (s_d / √n), with df = n − 1
where d̄ = mean of the differences, s_d = standard deviation of the differences, n = number of pairs, and df = degrees of freedom.
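A sketch applying the formula by hand and checking it against scipy (the before/after values are hypothetical):

import numpy as np
from scipy import stats

before = np.array([120, 118, 130, 125, 140, 132, 128, 135, 126, 131])
after  = np.array([115, 117, 125, 124, 133, 130, 127, 129, 124, 128])

d = before - after
t_cal = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))  # t = d̄ / (s_d / √n)
df = len(d) - 1                                       # 9

t_lib, p_value = stats.ttest_rel(before, after)       # t_lib matches t_cal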
105
Python code for paired t-test
from scipy import stats
stats.ttest_rel(before, after)
106
How to interpret a paired t-test result?
Compare |t_calculated| with t_critical.
If |t| > t_critical → reject H₀
If |t| ≤ t_critical → fail to reject H₀
107
What is t_critical?
t_critical for a paired t-test is the cutoff t-value you get from a t-distribution table, based on two things:
- degrees of freedom: df = n − 1
- significance level: α = 0.05 or 0.01
108
Calculate t_critical in python
from scipy.stats import t
alpha = 0.05
df = 9
t_critical = t.ppf(1 - alpha/2, df)
print(t_critical)  # ≈ 2.262
109
Calculate t_critical manually
Degrees of freedom = 10 − 1 = 9. For a two-tailed test at α = 0.05, look up t_critical for df = 9 and 0.025 per tail. From the t-table: t_critical = 2.262
110
When do you use a two-proportion z-test?
When comparing two sample proportions (e.g., % of smokers in two towns).
111
Two-Proportion Z-Test Hypotheses
H₀: p₁ = p₂ (proportions are equal)
Hₐ: p₁ ≠ p₂ (proportions are different)
112
Pooled Proportion Formula for Two Proportion Z-Test
Pooled proportion: p̂ = (x₁ + x₂) / (n₁ + n₂)
Test statistic: z = (p̂₁ − p̂₂) / √(p̂(1 − p̂)(1/n₁ + 1/n₂))
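A worked sketch of both formulas with hypothetical counts:

import math

x1, n1 = 45, 250  # successes and sample size, town 1
x2, n2 = 30, 240  # successes and sample size, town 2

p_hat = (x1 + x2) / (n1 + n2)  # pooled proportion
p1, p2 = x1 / n1, x2 / n2
z = (p1 - p2) / math.sqrt(p_hat * (1 - p_hat) * (1/n1 + 1/n2))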
113
python code for two-proportion z-test
from statsmodels.stats.proportion import proportions_ztest
proportions_ztest([successes1, successes2], [n1, n2])
114
When would you use the "unpooled" method in a two-proportion test?
When you are testing for a specific difference (not just zero) or when proportions are assumed unequal.
115
When do you use a Chi-Square test for variance?
When testing if a population variance equals a specified value.
116
Chi-Square Test Statistic Formula for Variance
χ² = (n − 1)s² / σ₀², where s² is the sample variance and σ₀² is the hypothesized population variance.
Degrees of freedom for the Chi-Square test: df = n − 1
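A sketch of the test with hypothetical numbers, using scipy for the critical values:

from scipy.stats import chi2

n, s2, sigma0_sq, alpha = 25, 12.0, 9.0, 0.05  # hypothetical values

chi2_cal = (n - 1) * s2 / sigma0_sq
lower = chi2.ppf(alpha / 2, df=n - 1)      # lower critical value
upper = chi2.ppf(1 - alpha / 2, df=n - 1)  # upper critical value
# Two-tailed decision: reject H₀ if chi2_cal < lower or chi2_cal > upper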
117
When do you use an F-Test?
To compare two population variances.
118
F-Test Statistic Formula
F = s₁² / s₂² (larger variance in the numerator)
119
How do you decide to reject H₀ in an F-Test?
If F_calculated > F_critical (or F_calculated < inverse lower critical), reject H₀.
120
Python code for F-Test
from scipy.stats import f
F = var1 / var2
f.ppf(1 - alpha/2, df1, df2)  # Upper critical
f.ppf(alpha/2, df1, df2)      # Lower critical
121
What is the purpose of ANOVA?
To test whether there are significant differences between three or more group means.
122
ANOVA Hypotheses
H₀: μ₁ = μ₂ = μ₃ (all group means equal)
Hₐ: At least one mean is different
123
ANOVA F-Statistic Formula
F = MS_between / MS_within
124
Degrees of Freedom for ANOVA
df_between = k − 1
df_within = N − k
(k = number of groups, N = total observations; see the sketch below)
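A sketch computing the ANOVA pieces by hand on small hypothetical groups (scipy's f_oneway should agree with the F computed here):

import numpy as np

groups = [np.array([4, 5, 6]), np.array([6, 7, 8]), np.array([9, 10, 11])]  # hypothetical data
k = len(groups)                  # number of groups
N = sum(len(g) for g in groups)  # total observations
grand_mean = np.concatenate(groups).mean()

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

F = (ss_between / (k - 1)) / (ss_within / (N - k))  # MS_between / MS_within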
125
What does a high F-statistic mean in ANOVA?
It suggests that group means are significantly different.
126
Python code for one-way ANOVA
from scipy.stats import f_oneway
f_oneway(group1, group2, group3)
127
When would you fail to reject H₀ in ANOVA?
If the p-value > α (e.g., p > 0.05), fail to reject → no significant difference between means.
128
When do you reject the null hypothesis?
If p-value < α or if test statistic is more extreme than critical value.
129
When do you fail to reject the null hypothesis?
If p-value ≥ α or if test statistic is within the acceptance region.
130
What is the python formula for t_critical Paired test
from scipy.stats import t
alpha = 0.05
df = 9
t_critical = t.ppf(1 - alpha/2, df)
print(t_critical)  # ≈ 2.262
131
What is the t_critical for a Paired test?
The cutoff t-value you get from a t-distribution table, based on two things:
- Degrees of freedom: df = n − 1 (where n = number of pairs you're comparing)
- Significance level (α): common values are 0.05 (95% confidence), 0.01 (99% confidence), etc.
If it's a two-tailed test, split α into two (e.g., 0.025 in each tail).
132
What would the t_critical be for a paired test with 10 pairs of data (like blood pressure before/after)?
Degrees of freedom = 10 − 1 = 9. For a two-tailed test at α = 0.05, look up t_critical for df = 9 and 0.025 per tail: t_critical = 2.262
133
What are the steps to finding t_critical?
1. Find degrees of freedom (n − 1)
2. Choose one-tailed or two-tailed
3. Look up (or calculate) t_critical for that α level and df
134
Parametric Methods
Use when:
- Data is normally distributed or approximately normal
- Sample size is large enough for the central limit theorem
- Comparing means (e.g., t-tests for two groups, ANOVA for more than two groups)
135
Non-Parametric Methods
Use when:
- Data is not normally distributed or is ordinal
- Small sample sizes that do not meet parametric assumptions
- Comparing medians or ranks (e.g., Mann-Whitney U test, Kruskal-Wallis test)
136
What are the main Python libraries for statistics?
- scipy.stats – for t-tests, chi-square, Mann-Whitney, etc.
- statsmodels – for ANOVA, linear models
- sklearn – for predictive models, regressions, classification
137
Mann-Whitney U
Compares two independent groups (like a t-test, but non-parametric).
Example: comparing satisfaction scores from two phone services. If responses are 1–5 on a Likert scale (ordinal), use the Mann-Whitney U test.
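A minimal call, assuming scores_a and scores_b hold the ordinal scores for the two groups (hypothetical names):

from scipy.stats import mannwhitneyu

u_stat, p_value = mannwhitneyu(scores_a, scores_b)  # H₀: both groups come from the same distribution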
138
Kruskal-Wallis
Like ANOVA, but for non-normal data
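A minimal call with hypothetical group names:

from scipy.stats import kruskal

h_stat, p_value = kruskal(group1, group2, group3)  # H₀: all groups come from the same distribution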
139
You want to check if department and turnover are related, what test would you use?
Chi-square test
140
What does the Least Squares Method do, and what does it produce
It finds the "best-fit line" that minimizes the squared errors (distance from each point to the line). , and gives: Slope, Intercept, and R-squared
141
Visualize with scikit-learn to calculate the Linear Regression model
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

model = LinearRegression()
model.fit(X, y)
print("Intercept:", model.intercept_[0])
print("Slope (BMI):", model.coef_[0][0])
print("R² Score:", model.score(X, y))

# Visualize
plt.scatter(X, y, alpha=0.5)
142
What does this tell you about the relationship between Medical Charges (Y) and BMI (X)? Pearson Correlation: 0.1983409688336289 Slope (Coefficient): 393.873 Intercept: 1192.937 R-Squared: 0.039
The slope of 393.873 means that for each 1-unit increase in BMI, charges increase by about $394. The Pearson correlation (≈ 0.198) and R² = 0.039 mean BMI explains only about 3.9% of the variance in charges, so the relationship is weak. If the p-value < 0.05, the relationship is statistically significant even though it is weak.
143
What is the goal of data science?
The goal is to use data to solve real-world problems, make informed decisions, and predict future outcomes.
144
What are the seven stages of data science?
1. Problem Understanding
2. Data Preparation
3. Exploratory Data Analysis
4. Modeling
5. Evaluation
6. Deployment
7. Feedback
145
What is this an example of? Asking "Are we predicting who buys ice cream or how much ice cream they'll buy?"
Clearly defining the problem. This ensures you know exactly what you're trying to solve, which saves time and helps choose the right data and methods.
146
List the most common data science tasks.
- Classification (sorting into groups)
- Estimation (Regression) (predicting numerical values)
- Clustering (grouping similar items)
- Anomaly Detection (finding unusual data)
- Recommendation (suggesting items based on preferences)
147
Visualize constructing a Bar Graph with Overlay Using Python
import pandas as pd
bank_train = pd.read_csv("C:/.../bank_marketing_training")
crosstab_01 = pd.crosstab(bank_train['previous_outcome'], bank_train['response'])
crosstab_01.plot(kind='bar', stacked=True)
148
Explain in your own words why we establish baseline performance for models.
We set baseline performance to know how good our predictions need to be. It's like knowing the average class grade to decide if your score is above or below average. This happens in the Modeling phase.
149
What are cluster profiles?
Cluster profiles are descriptions of groups identified in clustering, detailing what makes each group unique. Example: customer groups, where one group might be young tech enthusiasts and another older customers who prefer traditional products.
150
pd.crosstab()
A pandas function that creates a cross-tabulation table (aka contingency table). It shows the frequency distribution of two (or more) categorical variables.
pd.crosstab(index=df["Gender"], columns=df["Turnover"])
151
div()
You want to normalize a crosstab so that you see proportions, not raw counts. That’s where .div() comes in:
crosstab_01 = pd.crosstab(index=df["Gender"], columns=df["Turnover"])
crosstab_01.div(crosstab_01.sum(axis=1), axis=0)
152
Univariate Analysis
Analyzes one variable at a time. Goal: understand the distribution, central tendency, or spread of that one variable.
153
Bivariate Analysis
Analyzes the relationship between two variables.
Goal: Understand how one variable affects or relates to another.
154
Is this univariate or bivariate analysis? What’s the average age of employees?
univariate
155
What’s the distribution of job roles?
univariate
156
Do males and females have different annual salaries?
bivariate
157
Does commute distance affect job satisfaction?
bivariate
158
What are the best visualizations for bivariate data?
- Scatter plots (both numeric)
- Boxplots (one numeric, one categorical)
- Crosstabs / GroupBy tables (both categorical)
- Correlation coefficients (.corr())
159
How to Perform Binning Based on Predictive Value Using Python
df['column_binned'] = pd.cut(x=df['column'],
                             bins=[0, 27, 60.01, 100],
                             labels=["Under 27", "27 to 60", "Over 60"],
                             right=False)
160
What should be undertaken during the Setup Phase?
- Partitioning the data
- Validating the data partition
- Balancing the data
- Establishing baseline model performance
161
Describe what data dredging is
Because of the lack of a priori hypotheses, data scientists need to beware of data dredging, whereby spurious results are uncovered due merely to random variation rather than real effects.
162
Describe the two baseline models for binary classification.
Let one of the binary target classes represent positive and the other class represent negative, and let p represent the proportion of positive records in the data.
163
What Is Statistical Significance?
Statistical significance means the results you found in your data are unlikely to have happened by random chance. It’s measured using a p-value.
164
A Priori Hypothesis
a guess or theory you come up with before you look at the data.
165
T or F: Big data makes everything look statistically significant
True. When your dataset is massive (like millions of rows), even a tiny difference can come out as statistically significant, even if that difference doesn’t actually matter in the real world.
166
Cross validation
When you:
1. Split your data into a training set and a test set.
2. Build your model using the training set.
3. Test it on the unseen test set to make sure it really works.
167
twofold cross‐validation
In twofold cross-validation, the data are partitioned, using random assignment, into a training data set and a test data set.
168
What is the Python function to partition data into train and test sets?
train_test_split() from sklearn.model_selection
169
Why do we validate data partitions?
A: To ensure the training and test sets are similar and not systematically different.
170
Q: What is data balancing?
A: A technique to ensure the model sees enough examples of each class, especially the rare ones.
171
Q: What is oversampling?
A: Creating additional copies of the minority class to balance the training data set.
172
Q: What is undersampling?
A: Removing samples from the majority class to balance the dataset.
173
Q: Why don't we balance the test dataset?
A: Because real-world data isn't balanced, and test data should reflect real conditions for valid evaluation.
174
Q: What is a baseline model?
A: A simple model like always predicting the majority class, used for comparison against more complex models.
175
Q: How does a dummy classifier work in sklearn?
A: It makes predictions based on simple rules like 'most frequent class' to serve as a performance baseline.
176
Q: What is a decision tree?
A: A model that splits data into branches based on decision rules, ending in leaf nodes with predictions.
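A minimal sketch with scikit-learn, reusing the X_train/X_test names from train_test_split earlier in the deck (max_depth=3 is an arbitrary choice):

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
tree.score(X_test, y_test)  # accuracy on the held-out test set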
177
Q: What is model evaluation?
A: Assessing how well a model performs using metrics like accuracy, precision, and recall.
178
Q: Define precision.
A: Precision = TP / (TP + FP); it measures how many predicted positives are actually correct. (TP = True Positive, FP = False Positive, FN = False Negative)
179
Q: Define recall (sensitivity).
A: Recall = TP / (TP + FN); it measures how many actual positives were correctly predicted. (TP = True Positive, FP = False Positive, FN = False Negative)
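A sketch checking both formulas with scikit-learn on tiny hypothetical labels:

from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

precision_score(y_true, y_pred)  # TP=3, FP=1 → 3/4 = 0.75
recall_score(y_true, y_pred)     # TP=3, FN=1 → 3/4 = 0.75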
180
What answers, “What value of X gives me a cumulative probability of p?”
The Inverse CDF. E.g., if CDF(1.645) ≈ 0.95, then inverse_CDF(0.95) ≈ 1.645.
181
What is the relationship between p̂, Z, and alpha?
To find the Z value that corresponds to α = 0.05, we look it up in a Z-table. Then we use the Z formula to find the corresponding p̂: Z = (p̂ − p₀) / √(p₀(1 − p₀)/n)
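A sketch inverting that formula to find the p̂ cutoff for a one-proportion, upper-tailed test (p₀, n, and α below are hypothetical):

import math
from scipy.stats import norm

p0, n, alpha = 0.5, 100, 0.05

z_crit = norm.ppf(1 - alpha)       # ≈ 1.645 for the upper tail
se = math.sqrt(p0 * (1 - p0) / n)  # standard error under H₀
p_hat_cutoff = p0 + z_crit * se    # ≈ 0.582: sample proportions above this reject H₀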