Module 2_ 3. Probability and Statistics Flashcards

Question

How do you check whether a distribution is a Pareto distribution?

Answer 1

1. Q-Q plots
2. Log-log plots

Answer 2

Power-law/Pareto ----(Box-Cox transform)----> Gaussian/Normal
1. box-cox(X) ----> returns λ (lambda)
2. Calculate each yi:
   - if λ != 0: yi = (xi^λ - 1)/λ
   - if λ = 0: yi = log(xi)
If λ = 0, log(X) is Gaussian, i.e. X ~ log-normal distribution.
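The transform in step 2 can be sketched in a few lines. This is a minimal illustration that takes λ as given; the function name `box_cox` is made up for this example. In practice a library routine such as `scipy.stats.boxcox` can also estimate λ from the data.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform of strictly positive values x for a given lambda."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox needs strictly positive inputs")
    if lam == 0:
        return np.log(x)              # the lambda = 0 branch
    return (x ** lam - 1.0) / lam     # the lambda != 0 branch

x = np.array([1.0, 2.0, 4.0, 8.0])   # illustrative values only
y_log = box_cox(x, 0.0)               # same as np.log(x)
y_half = box_cox(x, 0.5)              # (sqrt(x) - 1) / 0.5
```

As λ approaches 0 the second branch approaches log(x), which is why the two cases fit together.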

Answer 3

1. Covariance
2. Pearson correlation coefficient
3. Spearman rank correlation coefficient

Answer 4

Covariance(X,Y) = (1/n) Σ (xi - μx) * (yi - μy)
If Covariance(X,Y) is +ve ----> as X increases, Y tends to increase
If Covariance(X,Y) is -ve ----> as X increases, Y tends to decrease
Drawbacks/Limitations:
1. Covariance depends on the units of measurement. If X = height in cm, Y = weight in kg, X' = height in ft, Y' = weight in lbs, then Covariance(X,Y) != Covariance(X',Y'). Changing the scale changes the covariance, so its magnitude cannot be used to judge how strong a relationship is.
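A short sketch of the scale problem, using made-up heights and weights (the numbers and variable names are purely illustrative) and the 1/n formula above:

```python
import numpy as np

def covariance(x, y):
    """Population covariance: (1/n) * sum((xi - mean_x) * (yi - mean_y))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((x - x.mean()) * (y - y.mean()))

# Hypothetical measurements, not taken from the notes.
height_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weight_kg = np.array([50.0, 58.0, 66.0, 77.0, 89.0])

# Same people, different units.
height_ft = height_cm / 30.48
weight_lb = weight_kg * 2.20462

cov_metric = covariance(height_cm, weight_kg)
cov_imperial = covariance(height_ft, weight_lb)
# The two numbers differ by the factor 2.20462 / 30.48 even though
# the underlying relationship is identical.
```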

Answer 5

ρx,y = Covariance(X,Y)/(σx σy), where σx = √variance(X) and σy = √variance(Y)
ρx,y always lies in [-1, +1].
If ρx,y is +ve ----> as X increases, Y tends to increase
If ρx,y is -ve ----> as X increases, Y tends to decrease
Drawbacks/Limitations:
1. ρx,y = +1 only if an exact increasing linear relationship exists between X and Y. So if y = x^2 with x > 0, ρ < 1 even though y is monotonically increasing in x.
2. The slope of the straight line doesn't affect ρx,y (only its sign does), so ρ says nothing about how steep the relationship is.
3. Complex relationships are not captured, e.g. a sine wave.
Fix (for monotonic but non-linear relationships) ----> Spearman rank correlation coefficient
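The three limitations can be checked numerically. The sketch below assumes synthetic data and a helper named `pearson` written for this illustration; it uses the 1/n covariance and population standard deviations so that it matches the formula above.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance divided by the two standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

x = np.linspace(1.0, 10.0, 50)

r_shallow = pearson(x, 0.5 * x + 3.0)    # gentle slope
r_steep = pearson(x, 20.0 * x - 7.0)     # steep slope, same correlation
r_square = pearson(x, x ** 2)            # monotonic but not linear: below 1

t = np.linspace(0.0, 8.0 * np.pi, 400)   # four full periods
r_sine = pearson(t, np.sin(t))           # strong dependence, weak correlation
```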

Answer 6

      X     Y    rx   ry
s1   160    52    4    3
s2   150    66    2    4
s3   170    68    5    5
s4   140    46    1    1
s5   158    51    3    2

Here we sort X and Y separately and give them ranks in ascending order (rx = rank of X, ry = rank of Y).
We saw that ρx,y measures a linear relationship. Spearman's coefficient is
r = ρ(rx, ry)
This means the Spearman rank correlation between two variables is equal to the Pearson correlation between the rank values of those two variables.
- If Y always increases as X increases (linear or not doesn't matter) ----> r = 1
- If Y always decreases as X increases (linear or not doesn't matter) ----> r = -1
Spearman rank correlation is also more robust to outliers than Pearson correlation.
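A minimal sketch of "Pearson on the ranks", using the X column from the table. The helpers `ranks` and `spearman` are written for this example and assume there are no tied values; a library routine such as `scipy.stats.spearmanr` handles ties properly.

```python
import numpy as np

def ranks(values):
    """Ascending ranks starting at 1 (assumes no ties)."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(1, len(values) + 1)
    return r

def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return np.corrcoef(ranks(x), ranks(y))[0, 1]

x = np.array([160.0, 150.0, 170.0, 140.0, 158.0])
rx = ranks(x)                     # [4, 2, 5, 1, 3], the rx column

r_up = spearman(x, x ** 3)        # increasing, non-linear
r_down = spearman(x, -(x ** 3))   # decreasing, non-linear
```

Because only the ordering matters, replacing the largest value with an extreme outlier leaves the ranks, and therefore r, unchanged.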

Answer 7

- "Correlation" does not imply "causation".
- Just because two random variables are correlated (e.g. as X increases, Y increases) doesn't mean X causes Y or vice versa.
E.g. the graph of Nobel laureates vs chocolate consumption.

Answer 8

1. Is salary correlated with the sq. footage of your home?
2. Is the no. of years of education correlated with income?
3. E-commerce (Amazon):
   - Time spent in 24 hrs vs money spent in 24 hrs
   - # unique visitors in a day vs $ sales in a day
4. Medicine:
   - Dosage of a drug vs reduction in blood sugar

Answer 9

- A confidence interval is a range of values, computed from a sample, that is constructed so that it contains a population parameter in a stated proportion (the confidence level) of repeated samples.
E.g. X ~ any distribution, where X = heights of people
{x1, x2, ..., x10} ----> random sample of size 10
Task: estimate the population mean μ of X.
μ ≈ x̄, where x̄ is the sample mean ----> this is a point estimate. Not bad, but we can do better.
If we say μ ∈ [162.1, 174.9] with 95% confidence:
- it is an interval with a confidence value
- it is richer than the point estimate in terms of information
Note: If we repeat the sampling multiple times, each time we get a different value for x̄ and hence a different interval. In about 95% of sampling experiments, μ will lie between the endpoints of the C.I. calculated using x̄, and in about 5% of cases it will not. A 95% C.I. does not mean that μ lies in one particular computed interval with 95% probability: μ is fixed, and it is the interval that varies.
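The repeated-sampling reading in the note can be simulated. The sketch below makes several assumptions that are not in the card: heights are drawn from a normal distribution with μ = 168 and a known σ = 5, and the interval used is x̄ ± 1.96σ/√n.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 168.0, 5.0, 10, 10_000   # assumed population and setup

# One row per sampling experiment, n heights per experiment.
samples = rng.normal(mu, sigma, size=(trials, n))
xbar = samples.mean(axis=1)

# 95% interval around each sample mean (sigma treated as known).
half_width = 1.96 * sigma / np.sqrt(n)
covered = (xbar - half_width <= mu) & (mu <= xbar + half_width)

coverage = covered.mean()   # fraction of experiments whose interval contains mu
```

The fraction should come out close to 0.95: the intervals move from experiment to experiment while μ stays put.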

Answer 10

Say X ~ N(μ, σ^2). Let μ = 168cm and σ = 5cm.
We know from the Gaussian distribution that (μ-2σ, μ+2σ) contains approximately 95% of the observations (the exact multiplier for 95% is 1.96).
So we can say heights of people lie in [158, 178] with 95% confidence.
Similarly, other values like 90%, 80%, etc. can be found using normal distribution tables.
E.g. suppose C = 90%:
(1 - C)/2 = (1 - 0.9)/2 = 5% in each tail
- Heights lie in [x', x"] with 90% confidence
- x' and x" are the 5th and 95th percentiles, which can be found by looking at the normal distribution tables. All this data is tabulated.
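Instead of a printed table, the same lookup can be done with the standard library. This is a small sketch; `central_interval` is a name chosen for this example.

```python
from statistics import NormalDist

def central_interval(mu, sigma, confidence):
    """Interval that holds the central `confidence` mass of N(mu, sigma^2)."""
    tail = (1.0 - confidence) / 2.0        # probability left in each tail
    dist = NormalDist(mu, sigma)
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)

lo95, hi95 = central_interval(168.0, 5.0, 0.95)   # roughly 168 -/+ 1.96 * 5
lo90, hi90 = central_interval(168.0, 5.0, 0.90)   # roughly 168 -/+ 1.645 * 5
```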

Answer 11

Case 1: X ~ some distribution with finite μ and σ^2, and σ is known
Q. What is the 95% C.I. of μ?
Let σ = 5cm
{x1, x2, ..., x10} ----> sample of size n = 10
x̄ = sample mean = (1/10) Σ xi
As we learnt earlier, by the CLT x̄ is approximately N(μ, σ^2/n), i.e. its standard deviation is σ/√n.
Hence we can say that μ ∈ [x̄ - (2σ/√n), x̄ + (2σ/√n)] with 95% confidence.
Case 2: If we don't know σ
Estimate it with the sample standard deviation s and use Student's t-distribution:
(x̄ - μ)/(s/√n) ~ t(n-1)
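Both cases can be sketched as follows. The heights are made-up values, the function names are chosen for this example, and 2.262 is the tabulated two-sided 95% critical value of the t-distribution with 9 degrees of freedom (it only applies to n = 10).

```python
import math
from statistics import stdev

def mean_ci_known_sigma(sample, sigma, z=1.96):
    """Case 1: CI for the mean when sigma is known (normal approximation)."""
    n = len(sample)
    xbar = sum(sample) / n
    half = z * sigma / math.sqrt(n)
    return xbar - half, xbar + half

def mean_ci_unknown_sigma(sample, t_crit):
    """Case 2: CI for the mean using the sample std and a t critical value."""
    n = len(sample)
    xbar = sum(sample) / n
    half = t_crit * stdev(sample) / math.sqrt(n)   # stdev uses n - 1
    return xbar - half, xbar + half

# Hypothetical sample of 10 heights in cm.
heights = [162.0, 171.5, 168.2, 159.9, 174.3, 166.0, 170.1, 165.4, 169.8, 172.8]

low_z, high_z = mean_ci_known_sigma(heights, sigma=5.0)
low_t, high_t = mean_ci_unknown_sigma(heights, t_crit=2.262)
```

The code uses 1.96 where the card rounds to 2; the two give nearly the same interval.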

Answer 12

Task: Estimate the 95% C.I. for the median of X using only the given sample of X (bootstrap)
S = {x1, x2, ..., xn}
Draw each resample from S by sampling with replacement, picking indices with u(1,n), i.e. a uniform random variable between 1 and n.
Let k = 1000 and m <= n
- S1 : x1', x2', ..., xm' ----> m1 ----> median of sample 1
- S2 : x1', x2', ..., xm' ----> m2 ----> median of sample 2
  :
- Sk : x1', x2', ..., xm' ----> mk ----> median of sample k
----> m1, m2, ..., m1000
----> sort ----> m1' <= m2' <= m3' <= ... <= m1000' (increasing order)
----> the middle 950 of the 1000 sorted medians cover 950/1000 = 95%
Therefore the 95% C.I. is [m25', m975']
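The resampling procedure above can be sketched with NumPy. The function name, the default m = n, and the fixed seed are choices made for this illustration.

```python
import numpy as np

def bootstrap_median_ci(sample, k=1000, m=None, confidence=0.95, seed=0):
    """Percentile bootstrap C.I. for the median."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    m = n if m is None else m

    # k resamples of size m: uniform random indices, with replacement.
    idx = rng.integers(0, n, size=(k, m))
    medians = np.sort(np.median(sample[idx], axis=1))

    tail = (1.0 - confidence) / 2.0
    lo_rank = max(int(round(tail * k)), 1)        # 25 when k = 1000
    hi_rank = int(round((1.0 - tail) * k))        # 975 when k = 1000
    return medians[lo_rank - 1], medians[hi_rank - 1]

data = np.arange(1, 102)                # hypothetical sample, median 51
low, high = bootstrap_median_ci(data)
```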

Answer 13

Task: Given a coin, determine if the coin is biased towards heads or not
- Test statistic: flip the coin 5 times and count the no. of heads = X
- Perform experiment ----> H H H H H ----> X = 5 ----> this is our observation
H0 : Coin is not biased towards heads ----> Null hypothesis
H1 : Coin is biased towards heads ----> Alternate hypothesis
P(observation | H0) = P(X=5 | coin is not biased towards heads) = 1/(2^5) = 1/32 ≈ 0.03 = 3%
This probability is called the p-value: the probability, assuming H0 is true, of an observation at least as extreme as ours. (X = 5 is the most extreme outcome possible, so here it equals P(X=5 | H0).)
Typically, a p-value < 5% is said to be small.
Here P(X=5 | H0) = 3%, so there is a 3% chance of getting 5 heads in 5 flips if the coin is not biased. 3% ----> quite low
The observation was actually made, so it is the ground truth. Hence our assumption, i.e. H0, is probably incorrect.
So we reject the null hypothesis (H0), i.e. reject the idea that the coin is not biased, and accept H1: the coin is biased towards heads.
Note: a small p-value means rejecting H0 in favour of H1. A large p-value means we fail to reject H0 (loosely, "accept H0"); it does not prove H0.
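The p-value in this example is a binomial tail probability, which can be computed directly. A small sketch, with a function name chosen for this example:

```python
from math import comb

def p_value_at_least(heads, flips, p_heads=0.5):
    """P(X >= heads) for X ~ Binomial(flips, p_heads), i.e. under H0."""
    return sum(
        comb(flips, k) * p_heads ** k * (1.0 - p_heads) ** (flips - k)
        for k in range(heads, flips + 1)
    )

p_five = p_value_at_least(5, 5)   # 1/32, the case in the card
p_four = p_value_at_least(4, 5)   # 6/32: 4 or 5 heads out of 5
reject_h0 = p_five < 0.05         # True at the usual 5% level
```

With only 4 heads out of 5 the p-value would be 6/32 ≈ 0.19, which is not small, so H0 would not be rejected.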

Answer 14

Task: Determine if the population mean of heights in two cities is the same or not
Experiment: Measure the heights of 50 random people in each city.
Let μ1 and μ2 be the sample means of the two cities, say 162cm and 167cm.
Test statistic: X = μ2 - μ1 = 167 - 162 = 5cm
Null hypothesis (H0): There is no difference in the population means of the two cities
Computing P(X >= 5 | H0):
1. Take all 100 heights from both cities and put them together in a new set (S).
2. Randomly assign 50 points from S to S1 and the remaining 50 to S2. This is resampling. Since S1 and S2 come randomly from the same set (S), this simulates two cities having the same population mean, i.e. it simulates the null hypothesis (H0). Calculate μ1, μ2 and δ = μ2 - μ1.
3. Repeat step 2 k times.
4. Sort the δi's in increasing order: δ1 <= δ2 <= ... <= δk, and locate the observed difference (5cm) among them.
Say k = 1000 and our observed difference falls at δ801. Then 20% of the simulated differences are at least as large as the observed difference.
P(diff >= 5cm | H0) = 0.2 ----> not small (well above 5%) ----> we fail to reject (accept) H0
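The resampling steps can be sketched as below. The function name and the synthetic inputs are assumptions made for this illustration; the test is one-sided, matching P(diff >= observed | H0).

```python
import numpy as np

def permutation_p_value(city1, city2, k=1000, seed=0):
    """One-sided permutation test for mean(city2) - mean(city1)."""
    rng = np.random.default_rng(seed)
    city1 = np.asarray(city1, dtype=float)
    city2 = np.asarray(city2, dtype=float)
    observed = city2.mean() - city1.mean()

    pooled = np.concatenate([city1, city2])   # step 1: the combined set S
    n1 = len(city1)
    deltas = np.empty(k)
    for i in range(k):                         # steps 2 and 3
        shuffled = rng.permutation(pooled)
        deltas[i] = shuffled[n1:].mean() - shuffled[:n1].mean()

    # Step 4 without sorting: fraction of simulated differences
    # at least as large as the observed one.
    return observed, float(np.mean(deltas >= observed))

# Hypothetical heights: the second city is shifted well above the first.
a = np.arange(50.0) + 140.0
obs_far, p_far = permutation_p_value(a, a + 100.0)
obs_same, p_same = permutation_p_value(a, a)
```

Counting the simulated differences at or above the observed one gives the same number as sorting and reading off the position.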

Answer 15

Let X1 and X2 be two samples of size n and m respectively, with empirical CDFs F1,n and F2,m.
Test statistic: Dn,m = sup_x |F1,n(x) - F2,m(x)| ----> maximum difference between their CDFs
Null hypothesis (H0): X1 and X2 have the same distribution.
If Dn,m > c(α) √((n+m)/(n*m))
then we reject the null hypothesis (H0) and conclude that X1 and X2 have different distributions,
else we fail to reject (accept) H0.
Note: α is the significance level we choose, and c(α) is taken from a table. E.g. if we decide α = 0.05, the corresponding value of c(α) is taken from the table.
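This statistic is the one used in the two-sample Kolmogorov-Smirnov test, and the decision rule can be sketched directly. In place of a table lookup, the code assumes the usual large-sample closed form c(α) = √(-½ ln(α/2)), which gives c(0.05) ≈ 1.36. The function name is chosen for this example; `scipy.stats.ks_2samp` is a library alternative.

```python
import numpy as np

def ks_two_sample(x1, x2, alpha=0.05):
    """Return (D, threshold, reject_H0) for the two-sample KS test."""
    x1 = np.sort(np.asarray(x1, dtype=float))
    x2 = np.sort(np.asarray(x2, dtype=float))
    n, m = len(x1), len(x2)

    # Empirical CDFs are step functions, so the largest gap occurs
    # at one of the observed data points.
    grid = np.concatenate([x1, x2])
    f1 = np.searchsorted(x1, grid, side="right") / n
    f2 = np.searchsorted(x2, grid, side="right") / m
    d = float(np.max(np.abs(f1 - f2)))

    c_alpha = np.sqrt(-0.5 * np.log(alpha / 2.0))   # assumed closed form
    threshold = float(c_alpha * np.sqrt((n + m) / (n * m)))
    return d, threshold, bool(d > threshold)

same = np.arange(100.0)
d_same, thr, reject_same = ks_two_sample(same, same)
d_far, _, reject_far = ks_two_sample(same, same + 1000.0)
```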

Answer 16

d = [2.0, 6.0, 1.2, 5.8, 20.0]
Task: Pick one of the n elements s.t. the probability of picking an element is proportional to its di
Step 1:
a. s = Σ di = 35
b. Normalize: di' = di/s
   ----> d1' = 0.0571
   ----> d2' = 0.1714
   ----> d3' = 0.0343
   ----> d4' = 0.1657
   ----> d5' = 0.5714
   Here Σ di' = 1
c. Cumulative normalized sum
   ----> d1" = d1' = 0.0571
   ----> d2" = d1" + d2' = 0.2286
   ----> d3" = d2" + d3' = 0.2629
   ----> d4" = d3" + d4' = 0.4286
   ----> d5" = d4" + d5' = 1
Step 2: Sample one value from unif(0.0, 1.0)
   r = numpy.random.uniform(0.0, 1.0, 1)
   let r = 0.6
Step 3: Proportional sampling
   if r <= d1" return 1
   elif r <= d2" return 2
   elif r <= d3" return 3
   :
With r = 0.6, the first cumulative value >= r is d5" = 1, so element 5 is returned.
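The three steps can be written as a short runnable sketch. The function name is chosen for this example, and `np.searchsorted` replaces the if/elif chain: it finds the first cumulative value that is >= r.

```python
import numpy as np

def proportional_sample(d, r):
    """Return the 1-based index picked for a uniform draw r in [0, 1)."""
    d = np.asarray(d, dtype=float)
    cumulative = np.cumsum(d / d.sum())          # steps 1a to 1c
    index = int(np.searchsorted(cumulative, r, side="left"))
    return min(index, len(d) - 1) + 1            # guard against rounding at 1.0

d = [2.0, 6.0, 1.2, 5.8, 20.0]
pick = proportional_sample(d, 0.6)   # element 5, since 0.4286 < 0.6 <= 1.0

# Empirical check: element 5 should be picked about 20/35 of the time.
rng = np.random.default_rng(0)
draws = [proportional_sample(d, r) for r in rng.uniform(0.0, 1.0, 20_000)]
share_of_5 = draws.count(5) / len(draws)
```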

(40 cards)