Module 2_ 3. Probability and Statistics Flashcards
(40 cards)
Eg. Rolling a fair die
Is it an example of a continuous random variable or a discrete random variable?
discrete random variable
Eg. Measuring height of a randomly picked student
Is it an example of a continuous random variable or a discrete random variable?
continuous random variable
What is the difference between population and sample? Explain with example.
Suppose we need to calculate the average height of people in the world.
If we go by population we will consider all the heights of 7 billion people and calculate the mean using the below formula:
μ = (1/7B) Σ hi
If we go by sample, we will consider a subset of the heights of 7 billion people (like take only 1000 heights) and calculate the mean using the below formula:
x̄ = (1/1000) Σ hi
As the sample size increases, the sample mean gets closer to the population mean:
x̄ ≈ μ
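A small simulation can make this concrete. The sketch below is only illustrative: the "population" is a hypothetical list of 100,000 simulated heights (standing in for 7 billion people), and the mean of 165 cm and spread of 10 cm are assumptions, not real data.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 simulated heights in cm (assumed values).
population = [rng.gauss(165, 10) for _ in range(100_000)]
mu = statistics.fmean(population)  # population mean

# Sample means for increasing sample sizes.
for n in (10, 1_000, 50_000):
    sample = rng.sample(population, n)
    x_bar = statistics.fmean(sample)
    print(f"n={n:>6}  x_bar={x_bar:.2f}  |x_bar - mu|={abs(x_bar - mu):.3f}")
```

The gap between x̄ and μ tends to shrink as n grows, although any single small sample can land close to μ by chance.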
Does Gaussian distribution occur in real world? If yes give 2 examples.
YES
- Sepal length (SL) and petal length (PL) of iris flowers.
- Heights and weights of people in real world.
If X follows Gaussian distribution and has mean(μ) and variance(σ^2), then write it in mathematical form.
X ~ N(μ, σ^2)
What is the 68-95-99.7 rule?
For a Gaussian distribution:
- In range [μ - σ, μ + σ], about 68.3% of points lie
- In range [μ - 2σ, μ + 2σ], about 95.4% of points lie
- In range [μ - 3σ, μ + 3σ], about 99.7% of points lie
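These percentages can be checked from the Gaussian CDF. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `coverage` is made up for this card.

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)  # standard normal

def coverage(k):
    """Probability that a Gaussian value lies within k standard deviations of its mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage(k) * 100:.1f}%")
```

Because standardizing removes μ and σ, the same three numbers hold for every Gaussian distribution.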
What is the mathematical formulation of Gaussian distribution?
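The PDF (a density; for a continuous variable the probability of any single exact value is 0) is
f(x) = (1/(σ√(2π))) exp{-(x-μ)^2/(2σ^2)}
Simplifying the above equation,
Let μ=0, σ^2=1
f(x) = (1/√(2π)) exp{(-1/2)x^2}
Ignoring the constants (they rescale the curve but do not change its bell shape),
y = exp{-x^2}
As x moves away from μ, y falls off as the exponential of the squared distance, i.e. even faster than a plain exponential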
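A direct translation of the density formula into code, as a minimal sketch. The function name `gaussian_pdf` is made up here; `statistics.NormalDist.pdf` is used only as a cross-check.

```python
import math
from statistics import NormalDist

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# The density drops quickly as x moves away from the mean (here mu = 0).
for x in (0, 1, 2, 3):
    print(x, gaussian_pdf(x), NormalDist(0, 1).pdf(x))
```

Evaluating at x and -x gives the same value, which is the symmetry of the PDF about μ.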
True or False.
PDF of Gaussian distribution is symmetric.
True
What is Kurtosis? What is the formula for Excess kurtosis?
Kurtosis: a measure of tailedness (how heavy the tails are), not of peakedness
Excess kurtosis = Kurtosis - 3 (a Gaussian distribution has kurtosis 3, so its excess kurtosis is 0)
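One way to compute excess kurtosis from data is the fourth standardized moment minus 3. The sketch below uses the plain population-moment version with no small-sample correction; the function name and the simulated data are assumptions for illustration.

```python
import random
import statistics

def excess_kurtosis(data):
    """Fourth standardized moment minus 3 (population-moment version, no bias correction)."""
    m = statistics.fmean(data)
    var = statistics.fmean([(x - m) ** 2 for x in data])
    fourth = statistics.fmean([(x - m) ** 4 for x in data])
    return fourth / var ** 2 - 3

rng = random.Random(1)
normal_data = [rng.gauss(0, 1) for _ in range(100_000)]
uniform_data = [rng.random() for _ in range(100_000)]

print(excess_kurtosis(normal_data))   # close to 0 for Gaussian data
print(excess_kurtosis(uniform_data))  # negative: a uniform distribution has lighter tails
```

The theoretical excess kurtosis of a uniform distribution is -1.2, so the second value should land near that.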
What is standard normal variate(Z)?
Z ~ N(0,1) where μ=0 and σ^2=1
What is standardization?
Standardization is converting any gaussian distribution with a finite mean(μ) and variance(σ^2) to a standard normal variate.
x’i = (xi - μ)/σ, so that x’i ~ N(0,1)
Now we can say about 68.3% of these converted points lie between -1 and 1
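A minimal sketch of standardization on simulated data. The heights (mean 170, standard deviation 8) are hypothetical, and μ and σ are estimated from the data itself.

```python
import random
import statistics

rng = random.Random(2)
x = [rng.gauss(170, 8) for _ in range(20_000)]  # hypothetical Gaussian data

mu = statistics.fmean(x)
sigma = statistics.pstdev(x)

# Standardize: subtract the mean, divide by the standard deviation.
z = [(xi - mu) / sigma for xi in x]

share_within_one = sum(-1 <= zi <= 1 for zi in z) / len(z)
print(statistics.fmean(z), statistics.pstdev(z), share_within_one)
```

After the transformation the mean is 0 and the standard deviation is 1 (up to floating-point error), and the share of points in [-1, 1] should come out near the one-sigma figure.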
What is Kernel Density Estimation(KDE)?
A gaussian kernel is centred on each point in the sample (the point being the kernel's mean). The kernels are then added up and normalized to give a smooth estimate of the PDF, so regions with a higher density of points get a greater height in the estimated PDF.
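A hand-rolled sketch of Gaussian KDE. The bandwidth of 0.5 and the five data points are arbitrary assumptions; real libraries pick the bandwidth with a rule or let you tune it.

```python
import math

def gaussian_kde(points, bandwidth):
    """Return a function that estimates the density from `points`."""
    n = len(points)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # One Gaussian bump per data point, each centred on that point, then summed.
        return norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)

    return density

data = [1.0, 1.2, 1.1, 0.9, 5.0]  # four points cluster near 1, one lies near 5
kde = gaussian_kde(data, bandwidth=0.5)
print(kde(1.0), kde(5.0), kde(3.0))
```

The estimate is highest near the cluster around 1, lower near the lone point at 5, and close to zero in the empty region around 3.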
What is sampling distribution?
Let's say we take m random samples, each of size n (for example n=30).
S1, S2, S3, …, Sm (m samples)
x̄1, x̄2, x̄3, …, x̄m are the means of the m samples
Then the x̄i follow a distribution called the “sampling distribution of sample means”
What is Central Limit Theorem(CLT)?
If X has finite mean(μ) and variance(σ^2),
-> S1, S2, S3, …, Sm (m samples of size n)
-> x̄1, x̄2, x̄3, …, x̄m (sample means)
-> x̄i ~ N(μ, (σ^2)/n) approximately, and the approximation improves as n->∞
here σ^2 is the variance of original data
CLT is powerful because it works for data from any distribution that has a finite mean(μ) and variance(σ^2)
Note: CLT doesn’t apply to a Pareto distribution whose variance is infinite (shape parameter α <= 2); for α <= 1 the mean is infinite as well
Also, as a rule of thumb, in the real world when n >= 30 things start falling into place and the sampling distribution of sample means is close to a gaussian distribution
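The CLT can be tried out on clearly non-Gaussian data. The sketch below assumes an exponential distribution with rate 1 (mean 1, variance 1) as the original data; n=30 and m=5,000 are illustrative choices.

```python
import random
import statistics

rng = random.Random(3)
n, m = 30, 5_000  # sample size and number of samples

# Exponential(rate=1) is strongly skewed; its mean and variance are both 1.
mu, sigma2 = 1.0, 1.0

sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(m)
]

print(statistics.fmean(sample_means))      # should be close to mu
print(statistics.pvariance(sample_means))  # should be close to sigma2 / n
```

A histogram of `sample_means` would look roughly bell-shaped even though the underlying data is skewed.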
What are Q-Q plots? How to plot them?
Q-Q plots stand for Quantile-Quantile plots.
They can be used for comparing two distributions (X and Y) and finding out whether they have the same distribution.
Eg. Given X: x1, x2, …, x500
Is X gaussian distributed?
Steps:
1. Sort the xi’s and compute percentiles
-> x1, x2, …, x500
-> Sort in ascending order
-> Calculate percentiles
-> x(1), x(2), …, x(100)
2. Take samples from Y ~ N(0,1) and compute percentiles
-> y1, y2, …, y1000
-> Sort in ascending order
-> Calculate percentiles
-> y(1), y(2), …, y(100)
3. Plot the Q-Q plot using x(1), x(2), …, x(100) on one axis and y(1), y(2), …, y(100) on the other
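If all points lie approximately on a straight line then we can say X and Y belong to the same family of distributions (here, X is gaussian).
But we can’t conclude that X also has μ=0 and σ^2=1: the slope and intercept of the line depend on X's own σ and μ.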
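The Q-Q steps can be sketched numerically without drawing the plot. Everything here is illustrative: X is simulated from N(50, 5^2), and the helper `fit_line` is a made-up name. `statistics.quantiles(..., n=100)` returns 99 cut points (percentiles 1 to 99) rather than 100.

```python
import math
import random
import statistics

rng = random.Random(4)  # fixed seed for repeatability

# Hypothetical data: 500 observations of X and 1000 draws of Y ~ N(0, 1).
x = [rng.gauss(50, 5) for _ in range(500)]
y = [rng.gauss(0, 1) for _ in range(1000)]

# quantiles() sorts the data internally and returns the percentile cut points.
x_q = statistics.quantiles(x, n=100)
y_q = statistics.quantiles(y, n=100)

def fit_line(a, b):
    """Least-squares slope, intercept and Pearson r for the points (a[i], b[i])."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    saa = sum((ai - ma) ** 2 for ai in a)
    sbb = sum((bi - mb) ** 2 for bi in b)
    sab = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    slope = sab / saa
    return slope, mb - slope * ma, sab / math.sqrt(saa * sbb)

# Y percentiles on the horizontal axis, X percentiles on the vertical axis.
slope, intercept, r = fit_line(y_q, x_q)
print(f"r={r:.4f}  slope={slope:.2f}  intercept={intercept:.2f}")
```

An r close to 1 means the Q-Q points are nearly on a straight line. The slope and intercept come out near X's σ and μ rather than 1 and 0, which is why a straight line alone does not imply X ~ N(0,1).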
Task: Order t-shirts for all employees (100k)
a. How many XL t-shirts should you order?
Domain knowledge :
height >= 180cm for XL t-shirt
height [160cm,180cm] for L t-shirt
Collect heights of 500 random employees.
Assume heights ~ N(μ, σ^2), with μ and σ estimated from the sample.
Plot the CDF.
Suppose from the CDF we observe P(h >= 180cm) = 1 - CDF(180cm) = 1%
So we order 1000 XL t-shirts, i.e. 1% of 100K
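The same estimate can be made in code instead of reading a plot. In this sketch the 500 heights are simulated (an assumed mean of 165 cm and standard deviation of 6.5 cm), so the resulting count is illustrative and will not be exactly 1000.

```python
import random
from statistics import NormalDist

rng = random.Random(5)
total_employees = 100_000

# Stand-in for the 500 measured heights (hypothetical values).
sampled_heights = [rng.gauss(165, 6.5) for _ in range(500)]

# Fit a Gaussian to the sample: estimates mu and sigma from the data.
fitted = NormalDist.from_samples(sampled_heights)

p_xl = 1 - fitted.cdf(180)            # P(h >= 180 cm)
xl_order = round(p_xl * total_employees)
print(f"P(h >= 180cm) = {p_xl:.4f}, XL t-shirts to order = {xl_order}")
```

The estimate is only as good as the Gaussian assumption and the 500-person sample, so in practice a small safety margin on the order would be reasonable.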
Task: Salaries
If X ~ N(μ, σ^2),
a. Calculate how many employees make a salary >= $100K?
b. Calculate how many employees have salary between [$50K, $70K] ?
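a. Plot the CDF and read off P(X >= $100K) = 1 - CDF($100K); multiply it by the number of employees.
b. Plot the CDF and take the difference CDF($70K) - CDF($50K); multiply it by the number of employees.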
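Both answers are CDF lookups. The card gives no parameters, so the values below (μ = $60K, σ = $20K, 10,000 employees) are purely hypothetical.

```python
from statistics import NormalDist

# Hypothetical parameters; the card does not specify mu, sigma or the headcount.
salaries = NormalDist(mu=60_000, sigma=20_000)
n_employees = 10_000

p_a = 1 - salaries.cdf(100_000)                    # a. P(X >= $100K)
p_b = salaries.cdf(70_000) - salaries.cdf(50_000)  # b. P($50K <= X <= $70K)

print(round(p_a * n_employees), round(p_b * n_employees))
```

With these assumed numbers, $100K is two standard deviations above the mean, so part (a) is the upper tail beyond 2σ.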
What if we don’t know the distribution, but we know that μ is finite and the variance is non-zero and finite?
Task: Salaries
μ=$40K and σ=$10K
a. What % of individuals have salary in range of [$20K, $60K] ?
Chebyshev’s Inequality Formula:
P(|X - μ| >= kσ) <= 1/k^2, which implies P(μ - kσ < X < μ + kσ) >= 1 - 1/k^2
20K = μ - 2σ
40K = μ
60K = μ + 2σ
P($20K < X < $60K) >= 1 - (1/2^2)
P($20K < X < $60K) >= 0.75
At least 75% of individuals have salary in range of [$20K, $60K] (Chebyshev gives a lower bound, not an exact figure)
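The bound from this card as a small function, with the Gaussian case added only for comparison. The function name `chebyshev_lower_bound` is made up here.

```python
from statistics import NormalDist

def chebyshev_lower_bound(k):
    """Chebyshev: P(mu - k*sigma < X < mu + k*sigma) >= 1 - 1/k^2 for any k > 0."""
    if k <= 0:
        raise ValueError("k must be positive")
    return max(0.0, 1 - 1 / k ** 2)

mu, sigma = 40_000, 10_000
low, high = 20_000, 60_000
k = (high - mu) / sigma  # the interval is symmetric about mu, so k = 2

bound = chebyshev_lower_bound(k)
gaussian = NormalDist(mu, sigma)
exact_if_gaussian = gaussian.cdf(high) - gaussian.cdf(low)
print(bound, round(exact_if_gaussian, 4))
```

The bound holds for any distribution with finite mean and variance. If the salaries did happen to be Gaussian, the actual share in the range would be well above this lower bound.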
Explain Bernoulli Distribution.
Bernoulli Distribution:
Eg. X -> r.v. for getting heads in a coin toss
- Discrete distribution which has 2 outcomes
- Probabilities: P for one outcome (e.g. heads) and (1 - P) for the other (e.g. tails)
Explain Binomial Distribution.
Binomial Distribution:
Eg. X -> outcome of a coin toss, repeated n times (n=10)
- Y -> No. of times of getting heads
- Y ∈ {0, 1, 2, …, 10}
Y ~ Binomial(n, P), where n = no. of trials and P = probability of getting heads
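The binomial probabilities for the coin example can be written out directly. This sketch assumes a fair coin (P = 0.5); with n = 1 the same formula reduces to the Bernoulli distribution.

```python
import math

def binomial_pmf(k, n, p):
    """P(Y = k) when Y counts successes in n independent trials with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.5  # 10 tosses of a fair coin (assumed)
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, prob in enumerate(pmf):
    print(k, round(prob, 4))

# Bernoulli is the n = 1 special case: outcomes 0 and 1 with probabilities 1 - p and p.
bernoulli = [binomial_pmf(0, 1, p), binomial_pmf(1, 1, p)]
```

The probabilities over Y = 0, 1, …, 10 sum to 1, and for a fair coin the most likely count is 5 heads.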
What is Log Normal Distribution?
X ~ log-normal(μ, σ^2) if log(X) ~ N(μ, σ^2), i.e. the natural log of X follows a normal distribution (X takes only positive values)
Note: As σ^2 increases, PDF becomes more skewed.
What are the applications of Log Normal Distribution?
- Length of comments posted in internet discussions.
- User’s dwell time on online articles.
- Salaries of people
In general, many quantities driven by human behaviour are approximately log-normal.
How to find whether X ~ log-normal(μ, σ^2) ?
x1, x2, …, xn -> log(x1), log(x2), …, log(xn) -> yi’s
Now we can use a Q-Q plot to determine whether the yi’s follow a normal/gaussian distribution or not.
If they do, then X is log-normally distributed.
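A numeric version of this check, as a sketch. Instead of drawing the Q-Q plot it measures how straight the Q-Q points are with a Pearson correlation, and it compares against the exact percentiles of N(0,1) rather than sampled ones; both are simplifications chosen here. The data is simulated, and `qq_straightness` is a made-up helper name.

```python
import math
import random
import statistics
from statistics import NormalDist

def qq_straightness(data):
    """Pearson correlation between the data's percentiles and those of N(0, 1)."""
    data_q = statistics.quantiles(data, n=100)                        # 99 sample percentiles
    norm_q = [NormalDist().inv_cdf(p / 100) for p in range(1, 100)]  # 99 theoretical ones
    md, mn = statistics.fmean(data_q), statistics.fmean(norm_q)
    sdn = sum((d - md) * (q - mn) for d, q in zip(data_q, norm_q))
    sdd = sum((d - md) ** 2 for d in data_q)
    snn = sum((q - mn) ** 2 for q in norm_q)
    return sdn / math.sqrt(sdd * snn)

rng = random.Random(6)
x = [rng.lognormvariate(0.0, 1.0) for _ in range(2_000)]  # hypothetical positive data
y = [math.log(xi) for xi in x]                            # log-transformed values

print(qq_straightness(x), qq_straightness(y))
```

The raw values give a visibly bent Q-Q relationship (lower correlation), while the log values line up almost perfectly, which is the signature of log-normal data.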
What is Power-law Distribution (a.k.a. Pareto distribution) ? Give some examples.
- Follows 80-20 rule
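- About 80% of the points lie in 20% of the range of values, and the remaining 20% of the points are spread over the other 80% (the long tail)
- Heavy-tailed: the variance is infinite when the shape parameter α <= 2, and the mean is also infinite when α <= 1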
Eg.
1. File size distribution in internet traffic(many small files & few large files).
2. Hard disk drive error rates.
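The 80-20 rule corresponds to one particular Pareto shape parameter. The sketch below uses the classical reading of the rule (the top 20% of items account for 80% of the total) and the standard result that, for a Pareto distribution with shape α > 1, the top fraction p holds a share p^(1 - 1/α) of the total. The function names are made up for this card.

```python
import math

def top_share(p, alpha):
    """Share of the total held by the top fraction p under a Pareto law with shape alpha > 1."""
    return p ** (1 - 1 / alpha)

# Shape parameter for which the top 20% hold exactly 80% of the total.
alpha_80_20 = math.log(5) / math.log(4)  # about 1.16
print(alpha_80_20, top_share(0.2, alpha_80_20))

def has_finite_mean(alpha):
    return alpha > 1

def has_finite_variance(alpha):
    return alpha > 2

print(has_finite_mean(alpha_80_20), has_finite_variance(alpha_80_20))
```

At this shape parameter the mean is finite but the variance is not, so the finite-variance condition of the CLT fails for an 80-20 Pareto distribution.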