Data Science Flashcards
(29 cards)
How are the mean and variance of the sum and of the difference of two 🎃INDEPENDENT, NORMALLY DISTRIBUTED🎃 variables calculated?
X+Y=T: mean(T) = mean(X) + mean(Y), var(T) = var(X) + var(Y) (equivalently std(T)^2 = std(X)^2 + std(Y)^2)
X-Y=Z: mean(Z) = mean(X) - mean(Y), var(Z) = var(X) + var(Y) (the variances still add for a difference)
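A quick way to check the rule for sums and differences of independent normal variables is Python's `statistics.NormalDist`, which supports `+` and `-` between two instances and treats them as independent. This is a minimal sketch; the means and standard deviations are made-up values for illustration.

```python
from statistics import NormalDist

# Hypothetical independent normal variables (mean, standard deviation).
X = NormalDist(mu=10, sigma=3)   # var(X) = 9
Y = NormalDist(mu=4, sigma=4)    # var(Y) = 16

T = X + Y   # sum of two independent normals
Z = X - Y   # difference of two independent normals

print("T:", T.mean, T.variance)
print("Z:", Z.mean, Z.variance)
```

The means come out as 10 + 4 and 10 - 4, while the variance is 9 + 16 = 25 in both cases.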
Write the formulas for the mean and the variance using expected value:
The expected value of a random variable is its mean.
Here’s how it’s calculated:
mean = E(X) = ∑ x*p(x) over all values of x if X is discrete, or ∫ x*f(x) dx if X is continuous
For variance: variance = E((X-E(X))^2) (the mean of the squared deviation (X-mean(X))^2)
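The expected-value definitions can be tried on a small discrete example. This sketch assumes a fair six-sided die as the variable (an illustrative choice, not part of the card) and uses exact fractions so the results are not blurred by rounding.

```python
from fractions import Fraction

def expected_value(pmf):
    """E(X): sum of x * p(x) over all values x."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """var(X) = E((X - E(X))^2)."""
    mu = expected_value(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

# Hypothetical example: a fair six-sided die, each face with probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
die_mean = expected_value(die)
die_var = variance(die)
print(die_mean, die_var)  # 7/2 35/12
```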
What does central limit theorem say?
The central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution as the sample size gets larger,
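🧨 regardless of the shape of the population’s distribution (as long as its variance is finite). 🧨
Sample sizes equal to or greater than 30 are often considered sufficient for the approximation to be good; this is a rule of thumb, not a guarantee.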
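The CLT can be checked with a small simulation. This is a sketch under stated assumptions: the population is exponential with rate 1 (mean 1, standard deviation 1, strongly right-skewed), and the sample size, number of samples, and seed are arbitrary choices. If the sample means are close to normal, roughly 95% of them should fall within 1.96 standard errors of the population mean.

```python
import random
import statistics

rng = random.Random(0)        # fixed seed so the run is repeatable
n, num_samples = 30, 5000     # hypothetical sample size and number of samples

# Draw num_samples samples of size n and keep each sample's mean.
means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

# Standard error of the mean: population std / sqrt(n), with population std = 1.
std_error = 1.0 / n ** 0.5
coverage = sum(abs(m - 1.0) <= 1.96 * std_error for m in means) / num_samples
print(round(coverage, 3))
```

The printed share should land near 0.95 even though the individual observations are far from normal.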
What is a latent variable?
In statistics, latent variables are variables that are not directly observed but are instead inferred, through a mathematical model, from other variables that are observed.
What do we mean when we say the sample mean targets the population mean?
It means the sample mean is an unbiased estimator of the population mean: averaged over many samples, the sample means center on the population mean.
What is the variance of the sampling distribution of means telling us? (Considering CLT)
The variance of the sampling distribution of means is population variance/n, so the larger the sample size, the smaller the variance and the closer the sample means are to the population mean. In the extreme case, each sample is the whole population, so however many samples we take they all have the same mean, and the variance of the sample means shrinks to essentially zero.
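The population variance/n relationship can be seen in a simulation. This sketch assumes a uniform population on [0, 1), whose variance is 1/12; the sample sizes, number of samples, and seed are arbitrary illustrative choices.

```python
import random
import statistics

rng = random.Random(1)        # fixed seed so the run is repeatable
population_var = 1 / 12       # variance of a uniform distribution on [0, 1)
num_samples = 4000            # hypothetical number of samples per sample size

results = {}
for n in (5, 20, 80):
    means = [
        statistics.fmean([rng.random() for _ in range(n)])
        for _ in range(num_samples)
    ]
    # Observed variance of the sample means vs. the predicted value.
    results[n] = statistics.pvariance(means)
    print(n, results[n], population_var / n)
```

Each fourfold increase in n should cut the observed variance of the sample means to roughly a quarter.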
How would the sampling distribution of the mean look if the original distribution is not normal?
It approaches a normal distribution as the sample size grows.
What is the relation between a non-normal original distribution and the suitable sample size?
The further the original distribution is from normal, the larger the sample size must be for the sampling distribution of means to be close to normal (a sample size >= 30 is the usual rule of thumb).
If the original distribution is normal, the sampling distribution of the means will also be normal. True or False?
True
Where do we use the CLT?
When the population is not normally distributed, the CLT lets us still apply tools designed for the normal distribution to the sample mean, provided the sample is large enough.
Summarize the 3 parts of CLT; which parts hold true for any sample size?
1- The mean of the sampling distribution of means equals the mean of the population from which the samples were drawn
2- The variance of the sampling distribution of means equals the variance of the population from which the samples were drawn, divided by n
3- If the original distribution is normal, the sampling distribution of the means will also be normal. Otherwise, if n>=30, we can usually treat it as approximately normal.
Parts 1 and 2 are true for any sample size
What is the sampling distribution of the mean?
The distribution of the means of samples (all of the same size) taken from the original population. (Each sample has its own mean; the sampling distribution of the mean is the distribution of those means across all the samples taken.)
What is the formula for calculating variance for continuous and discrete variables?
f(x) is the probability density function (continuous case); p(x) is the probability of each discrete value the variable can take
🤓 Continuous Variables: ∫ (x-mean)^2*f(x) dx
🤓 Discrete Variables: ∑ (xi-mean)^2*p(xi)
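The continuous formula can be approximated numerically. This is a rough sketch, assuming a uniform density on [0, 1] (so f(x) = 1 on that interval) and a simple midpoint-rule integrator; `integrate` is a helper written for this example, not a library function.

```python
def integrate(g, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

# Hypothetical example: uniform density on [0, 1].
a, b = 0.0, 1.0

def f(x):
    return 1.0

mean = integrate(lambda x: x * f(x), a, b)
var = integrate(lambda x: (x - mean) ** 2 * f(x), a, b)
print(mean, var)
```

For this density the exact answers are mean = 1/2 and variance = 1/12, so the printed values should be very close to 0.5 and 0.0833.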
How does adding/subtracting a constant to/from a variable change its variance?
It doesn’t change its variance
How does multiplying a variable by a constant change its variance?
New variance = old variance * constant^2
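Both effects, adding a constant and multiplying by a constant, can be checked on a small data set. The values below are hypothetical, and `statistics.pvariance` is used for the population variance.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values; population variance is 4

base_var = statistics.pvariance(data)
shifted_var = statistics.pvariance([x + 10 for x in data])   # add a constant
scaled_var = statistics.pvariance([3 * x for x in data])     # multiply by a constant

print(base_var, shifted_var, scaled_var)
```

Shifting leaves the variance at 4, while multiplying by 3 gives 36 = 4 * 3^2.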
How does the variance change when we have a set of INDEPENDENT variables added together?
When they are INDEPENDENT we add the variance of each variable:
var (x±y) =var(x) + var(y)
How does the variance change when we have a set of DEPENDENT variables added/subtracted?
var (x+y) =var(x) + var(y) + 2 cov(x, y) (cov=covariance)
var (x-y) =var(x) + var(y) - 2 cov(x, y) (cov=covariance)
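The covariance formulas can be verified exactly on paired data. This sketch uses hypothetical paired observations and small helper functions (`pvar`, `pcov`) written for this example; they compute population variance and covariance with exact fractions.

```python
from fractions import Fraction

def pvar(values):
    """Population variance: mean of squared deviations from the mean."""
    mu = Fraction(sum(values), len(values))
    return sum((v - mu) ** 2 for v in values) / len(values)

def pcov(xs, ys):
    """Population covariance of paired values."""
    mx = Fraction(sum(xs), len(xs))
    my = Fraction(sum(ys), len(ys))
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical paired observations; y tends to rise with x, so they are dependent.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 6]

var_sum = pvar([x + y for x, y in zip(xs, ys)])
var_diff = pvar([x - y for x, y in zip(xs, ys)])
cov_xy = pcov(xs, ys)

print(var_sum == pvar(xs) + pvar(ys) + 2 * cov_xy)   # True
print(var_diff == pvar(xs) + pvar(ys) - 2 * cov_xy)  # True
```

Because the covariance here is positive, the variance of the sum is larger than var(x) + var(y) and the variance of the difference is smaller.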
How is the STD of x+y calculated (for independent x and y)?
1- Calculate the variance of x+y: var(x) + var(y)
2- Take the square root of that variance: √(var(x) + var(y)). Standard deviations themselves do not add.
What is the most important probability distribution for discrete random variables?
The binomial distribution
What conditions should be met before being certain a distribution is binomial?
1- Probability of success in each separate trial is the same
2- Trials are independent: the result of one trial doesn’t depend on others
3- Fixed number of trials
4- Each trial can be classified as either fail or success
What is the formula for getting x number of successes with n trials in a binomial distribution? And the shorthand of it?
P(x) = [n!/(x!(n-x)!)] p^{x} q^{n-x}, where q = 1-p
X ~ B(N,P)
X is a binomial random variable with N trials and success probability of P
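The binomial formula translates almost directly into code. This is a minimal sketch using `math.comb` for the n!/(x!(n-x)!) term; the coin-flip numbers are a made-up example.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p): C(n, x) * p^x * q^(n - x), with q = 1 - p."""
    q = 1.0 - p
    return comb(n, x) * p**x * q ** (n - x)

# Hypothetical example: exactly 3 heads in 10 fair coin flips.
prob = binomial_pmf(3, 10, 0.5)
print(prob)  # 0.1171875
```

Summing `binomial_pmf(x, n, p)` over x = 0..n should give 1, which is a handy sanity check.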
What’s the equation for the cumulative probability distribution for binomial distributions?
P(X<=x) = ∑_{k=0}^{x} [n!/(k!(n-k)!)] p^{k} q^{n-k}
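The cumulative probability is just the binomial probabilities added up from 0 successes to x successes. A minimal sketch, again with a made-up coin-flip example:

```python
from math import comb

def binomial_cdf(x, n, p):
    """P(X <= x) for X ~ B(n, p): sum of the binomial probabilities for k = 0..x."""
    q = 1.0 - p
    return sum(comb(n, k) * p**k * q ** (n - k) for k in range(x + 1))

# Hypothetical example: at most 3 heads in 10 fair coin flips.
at_most_3 = binomial_cdf(3, 10, 0.5)
print(at_most_3)  # 0.171875
```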
On which parameter does the shape of the binomial probability distribution (probability vs. number of successes) depend? How does the shape change as this parameter changes?
P, the probability of success. The higher P is (closer to 1), the more left-skewed the probability distribution is; the lower P is (closer to 0), the more right-skewed it is. When P is around 0.5, it’s almost symmetric.
Explanation: Suppose P is close to 1 and we have 10 trials. The probability of having just 1 success in 10 trials is really small, since there’s a high chance we succeed more than once. The probability gets bigger as we increase the number of successes, so most of the probability sits near 10 successes with a long tail toward the left. Therefore the distribution is left-skewed.
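The direction of the skew can be checked by computing the skewness (the third standardized moment) straight from the binomial probabilities. This is a sketch with n = 10 and three illustrative values of P; negative skewness means left-skewed, positive means right-skewed.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial variable with n trials and success probability p."""
    return comb(n, x) * p**x * (1.0 - p) ** (n - x)

def skewness(n, p):
    """Third standardized moment, computed directly from the probabilities."""
    probs = [binomial_pmf(x, n, p) for x in range(n + 1)]
    mean = sum(x * pr for x, pr in enumerate(probs))
    var = sum((x - mean) ** 2 * pr for x, pr in enumerate(probs))
    third = sum((x - mean) ** 3 * pr for x, pr in enumerate(probs))
    return third / var**1.5

for p in (0.1, 0.5, 0.9):
    print(p, round(skewness(10, p), 3))
```

The skewness should come out positive for P = 0.1, essentially zero for P = 0.5, and negative for P = 0.9.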
Binomial probability distribution is discrete. True or False?
True