Intro Statistics Flashcards

1
Q

central limit theorem

A

If x_bar is the mean of a random sample X1, X2, … , Xn of size n from a distribution with a finite mean mu and a finite positive variance sigma², then the distribution of W = (x_bar - mu)/(sigma/sqrt(n)) is N(0,1) in the limit as n approaches infinity.

This means that for large n the sample mean x_bar is approximately normal with mean mu and standard deviation sigma/sqrt(n).
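
A minimal simulation sketch of the statement above, assuming a Uniform(0, 1) population (so mu = 0.5 and sigma = sqrt(1/12)). The function name `standardized_means` and the seed are illustrative choices, not part of the card.

```python
import random
import statistics

def standardized_means(n, trials, seed=0):
    """Draw `trials` samples of size n from Uniform(0, 1) and return W for each."""
    rng = random.Random(seed)
    mu = 0.5                   # mean of Uniform(0, 1)
    sigma = (1 / 12) ** 0.5    # standard deviation of Uniform(0, 1)
    out = []
    for _ in range(trials):
        x_bar = statistics.fmean([rng.random() for _ in range(n)])
        out.append((x_bar - mu) / (sigma / n ** 0.5))
    return out

w = standardized_means(n=50, trials=2000)
# If the theorem applies, these should land near 0 and 1 respectively.
print(round(statistics.fmean(w), 2), round(statistics.stdev(w), 2))
```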

2
Q

binomial distribution

A

The binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question and each succeeding with probability p

P(X = k) = C(n,k) * p^k * (1 - p)^(n-k)

C(n,k) = n! / (k! (n - k)!)

```
Mu = n*p
Sigma^2 = n*p*(1-p)
```
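
A short sketch that evaluates the probability mass function directly and checks numerically that the mean is n*p and the variance is n*p*(1-p). The values n = 10 and p = 0.3 are arbitrary illustration values.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
# Compare with n*p = 3 and n*p*(1-p) = 2.1 (equal up to floating-point rounding).
print(mean, variance)
```
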
3
Q

Accuracy

A

the proportion of true results (both true positives and true negatives) among the total number of cases examined.

accuracy = (tp + tn) / (tp + tn + fp + fn)

4
Q

Precision

A

precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances

precision = tp / (tp + fp)

5
Q

Recall

A

recall (also known as sensitivity) is the fraction of relevant instances that have been retrieved over the total amount of relevant instances

recall = tp / (tp + fn)
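
A small sketch computing accuracy, precision and recall from the four confusion-matrix counts. The function name `classification_metrics` and the counts are made up for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(m)
```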

6
Q

type I error

A

a type I error is the rejection of a true null hypothesis (also known as a “false positive” finding)

a type I error is to falsely infer the existence of something that is not there

7
Q

type II error

A

type II error is retaining a false null hypothesis (also known as a “false negative” finding)

a type II error is to falsely infer the absence of something that is there

8
Q

Kullback-Leibler divergence

A

a measure of how one probability distribution diverges from a second, expected probability distribution
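
For discrete distributions the usual formula is KL(P || Q) = Sum( p_i * log(p_i / q_i) ). A minimal sketch, assuming both distributions are given as lists over the same outcomes and that q_i > 0 wherever p_i > 0:

```python
from math import log

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i), in nats."""
    # Terms with p_i == 0 contribute nothing, so they are skipped.
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
# The divergence is not symmetric: the two directions give different values.
print(kl_divergence(p, q), kl_divergence(q, p))
```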

9
Q

Kolmogorov-Smirnov test

A

is a nonparametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution (one-sample K–S test), or to compare two samples (two-sample K–S test)
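
The two-sample statistic is the largest gap between the two empirical distribution functions. The sketch below computes only that statistic, not a P-value; the function name is illustrative.

```python
def ks_two_sample_statistic(a, b):
    """Largest absolute gap between the two empirical CDFs."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    # The empirical CDFs only change at observed values, so checking those suffices.
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in sorted(set(a) | set(b)))

print(ks_two_sample_statistic([1, 2, 3, 4], [3, 4, 5, 6]))   # 0.5
```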

10
Q

Bootstrap

A

statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter like a mean, median, proportion, odds ratio, correlation coefficient or regression coefficient.
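
A minimal sketch of a bootstrap standard error, assuming the estimator is any function of a list of numbers. The data, the resample count and the seed are illustrative.

```python
import random
import statistics

def bootstrap_se(sample, estimator, n_resamples=2000, seed=0):
    """Standard deviation of the estimator over resamples drawn with replacement."""
    rng = random.Random(seed)
    estimates = [
        estimator(rng.choices(sample, k=len(sample)))
        for _ in range(n_resamples)
    ]
    return statistics.stdev(estimates)

data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 3.1]
se_boot = bootstrap_se(data, statistics.fmean)
se_classic = statistics.stdev(data) / len(data) ** 0.5   # s/sqrt(n), for comparison
print(se_boot, se_classic)
```

The same function works for estimators without a simple standard-error formula, for example `statistics.median`.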

11
Q

Jackknife

A

The jackknife estimator of a parameter is found by systematically leaving out each observation from a dataset and calculating the estimate and then finding the average of these calculations.
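
A short sketch of the leave-one-out procedure described above, assuming the sample is a list and the estimator is any function of a list:

```python
import statistics

def jackknife(sample, estimator):
    """Return the leave-one-out estimates and their average."""
    loo = [estimator(sample[:i] + sample[i + 1:]) for i in range(len(sample))]
    return loo, statistics.fmean(loo)

loo, jack_mean = jackknife([1.0, 2.0, 3.0, 4.0, 5.0], statistics.fmean)
print(loo, jack_mean)   # [3.5, 3.25, 3.0, 2.75, 2.5] 3.0
```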

12
Q

Permutation test

A

the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points. In other words, the method by which treatme
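
A minimal sketch of an exact permutation test for two small groups. The test statistic (absolute difference in group means) and the two-sided P-value are choices made for this example, not something the card specifies.

```python
from itertools import combinations
from statistics import fmean

def permutation_p_value(group_a, group_b):
    """Exact two-sided P-value for the difference in group means."""
    pooled = group_a + group_b
    observed = abs(fmean(group_a) - fmean(group_b))
    count = total = 0
    # Every way of relabelling len(group_a) of the pooled points as group A.
    for idx in combinations(range(len(pooled)), len(group_a)):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(fmean(a) - fmean(b)) >= observed - 1e-12:
            count += 1
    return count / total

print(permutation_p_value([1, 2, 3], [4, 5, 6]))   # 2 of 20 relabellings, so 0.1
```

Enumerating all relabellings is only feasible for small samples; larger ones are usually handled by sampling relabellings at random.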

13
Q

Two tailed test

A

appropriate if the estimated value may be more than or less than the reference value, for example, whether a test taker may score above or below the historical average

14
Q

One tailed test

A

appropriate if the estimated value may depart from the reference value in only one direction, for example, whether a machine produces more than one-percent defective products

15
Q

Assessing normality

A

Subtract the mean and divide by the standard deviation, then compare the results to standard normal values (normal scores, nscore)

16
Q

Box plot

A

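Compares a quantitative variable across the levels of a categorical variable; shows shape of distribution, central value and variability

Median: black center line
Box top and bottom: third and first quartiles
Vertical lines (whiskers): extend to the most extreme points within 1.5 times the IQR of the box
Points outside the whiskers are shown individually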

17
Q

IQR

A

Interquartile range

Distance between first and third quartiles
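
A small sketch using the standard library. Quartile conventions differ between textbooks and software; the "inclusive" method is assumed here purely for illustration.

```python
from statistics import quantiles

def iqr(data):
    """Third quartile minus first quartile."""
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    return q3 - q1

print(iqr([1, 2, 3, 4, 5, 6, 7, 8, 9]))   # 4.0 with the "inclusive" convention
```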

18
Q

Two way table

A

A two-way table presents categorical data by counting the number of observations that fall into each combination of categories of two variables, one laid out in rows and the other in columns

19
Q

Correlation coefficient

A

r = 1/(n-1) * Sum( ((x - x_mean)/std_x) * ((y - y_mean)/std_y) )
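
A direct transcription of the formula, assuming std_x and std_y are sample standard deviations (n - 1 denominator):

```python
from statistics import fmean, stdev

def correlation(xs, ys):
    """Average product of standardized x and y values, with n - 1 in the denominator."""
    n = len(xs)
    mx, my = fmean(xs), fmean(ys)
    sx, sy = stdev(xs), stdev(ys)   # sample standard deviations
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)

# A perfectly linear increasing relationship gives r = 1 up to rounding.
print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))
```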

20
Q

ANOVA

A

Analysis of variance is a statistical method used to test differences between two or more means by comparing the variance between groups with the variance within groups

21
Q

Parameter

A

parameter is a number describing a population, such as a percentage or proportion.

true proportion of defective items in the entire population

22
Q

Statistic

A

is a number which may be computed from the data observed in a random sample without requiring the use of any unknown parameters, such as a sample mean.

Example: a sample of 300 items is taken and 15 of these are found to be defective; the computed statistic p_hat = 15/300 = 0.05 is an estimate of the parameter p

23
Q

Biased estimator

A

If a statistic is systematically skewed away from the true parameter p, it is considered to be a biased estimator of the parameter

24
Q

Unbiased estimator

A

unbiased estimator will have a sampling distribution whose mean is equal to the true value of the parameter.

25
Q

Variability of statistic

A

Determined by the spread of its sampling distribution. In general, larger samples will have smaller variability

26
Q

Probability model

A

mathematical representation of a random phenomenon. It is defined by its sample space, events within the sample space, and probabilities associated with each event.

27
Q

Sample space

A

set of all possible outcomes

28
Q

Probability

A

numerical value assigned to a given event A. The probability of an event is written P(A), and describes the long-run relative frequency of the event.

Rule 1: Any probability P(A) is a number between 0 and 1 inclusive (0 <= P(A) <= 1).

Rule 2: The probability of the sample space S is equal to 1 (P(S) = 1).

29
Q

Probability disjoint

A

If two events have no outcomes in common

Rule 3: If two events A and B are disjoint, then the probability of either event is the sum of the probabilities of the two events:
P(A or B) = P(A) + P(B).

30
Q

Probability union

A

chance of any (one or more) of two or more events occurring is called the union of the events. The probability of the union of disjoint events is the sum of their individual probabilities.

If two events A and B are not disjoint, then the probability of their union (the event that A or B occurs) is equal to the sum of their probabilities minus the probability of their intersection: P(A or B) = P(A) + P(B) - P(A and B).
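
A worked check of the union rule on one roll of a fair die. The events A and B are made up for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a fair die
A = {2, 4, 6}                       # roll is even
B = {4, 5, 6}                       # roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

lhs = prob(A | B)                        # P(A or B) counted directly
rhs = prob(A) + prob(B) - prob(A & B)    # sum minus the overlap
print(lhs, rhs)   # 2/3 2/3
```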

31
Q

Probability complement

A

Rule 4: The probability that any event A does not occur is P(A^c) = 1 - P(A).

32
Q

Probability independence

A

Two events are independent if the outcome of the first event has no effect on the probability of the second event.

Rule 5: If two events A and B are independent, then the probability of both events is the product of the probabilities for each event:
P(A and B) = P(A)P(B)

33
Q

Probability intersection

A

chance of all of two or more events occurring

For independent events, the probability of the intersection of two or more events is the product of the probabilities.

34
Q

Conditional probability

A

The conditional probability of an event B is the probability that the event will occur given the knowledge that an event A has already occurred

If events A and B are not independent, then the probability of the intersection of A and B (the probability that both events occur) is defined by
P(A and B) = P(A)P(B|A).

From this definition, the conditional probability P(B|A) is easily obtained by dividing by P(A):

P(B|A)= P(A and B)/P(A)
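
A small worked example on one roll of a fair die; the events are chosen for illustration. Knowing that A occurred changes the probability of B, so the two events are not independent.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a fair die
A = {4, 5, 6}                       # roll is greater than 3
B = {2, 4, 6}                       # roll is even

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

p_b_given_a = prob(A & B) / prob(A)
print(p_b_given_a, prob(B))   # 2/3 1/2
```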

35
Q

Random variable

A

is a variable whose possible values are numerical outcomes of a random phenomenon

36
Q

Law of large numbers

A

The law of large numbers states that the observed sample mean from an increasingly large number of observations of a random variable approaches the distribution mean

37
Q

Properties of random variate means

A

Mu_(a+bx) = a + b*mu_x

Mu_(x+y) = mu_x + mu_y

38
Q

Properties of random variate variance

A

Sigma^2_(a+bx) = b^2*sigma_x^2

Sigma^2_(x+y) = sigma_x^2 + sigma_y^2 (when x and y are independent)

39
Q

Sample mean and variance

A
```
Mu_x_bar = mu
Sigma_x_bar = sigma/sqrt(n)
```

Sample mean sigma gets smaller as n goes up

If distribution of population is normal then distribution of sample mean is normal with mean mu and stdev sigma/sqrt(n)

40
Q

Tests of Significance for Two Unknown Means and Known Standard Deviations

A

two-sample z statistic

For samples of sizes n1 and n2 drawn from two normal populations with unknown means mu1 and mu2 and known standard deviations sigma1 and sigma2, the test statistic comparing the means is known as the two-sample z statistic

z = ((x1 - x2) - (mu1 - mu2))/ sqrt((sigma1^2/n1) + (sigma2^2/n2))
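
A minimal sketch of the statistic, assuming x1 and x2 are the two sample means and that the hypothesized difference mu1 - mu2 is passed as `diff0` (a hypothetical parameter name). The two-sided P-value is an added convenience.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(x1_bar, x2_bar, sigma1, sigma2, n1, n2, diff0=0.0):
    """Two-sample z statistic and its two-sided P-value."""
    z = ((x1_bar - x2_bar) - diff0) / sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided

# Illustrative numbers only.
z, p = two_sample_z(x1_bar=52.0, x2_bar=50.0, sigma1=4.0, sigma2=3.0, n1=64, n2=36)
print(round(z, 3), round(p, 4))
```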

41
Q

Tests of Significance for Two Unknown Means and Unknown Standard Deviations

A

two-sample t-statistic

t = ((x1 - x2) - (mu1 - mu2))/ sqrt((s1^2/n1) + (s2^2/n2))

confidence interval

(x1 - x2) +/- t*(sqrt(s1^2/n1 + s2^2/n2))

conservative P-values may be obtained using the t(k) distribution where k represents the smaller of n1-1 and n2-1
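
A sketch that computes the statistic and the conservative degrees of freedom k from raw samples. The Python standard library has no t distribution, so the P-value lookup is left to a table or another library; `diff0` is a hypothetical name for the hypothesized mu1 - mu2.

```python
from math import sqrt
from statistics import fmean, stdev

def two_sample_t(sample1, sample2, diff0=0.0):
    """Two-sample t statistic and conservative degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    s1, s2 = stdev(sample1), stdev(sample2)   # sample standard deviations
    se = sqrt(s1**2 / n1 + s2**2 / n2)
    t = ((fmean(sample1) - fmean(sample2)) - diff0) / se
    k = min(n1 - 1, n2 - 1)                   # smaller of n1-1 and n2-1
    return t, k

t, k = two_sample_t([5, 7, 9], [1, 2, 3, 4, 5])
print(round(t, 3), k)
```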

42
Q

Pooled t-statistic

A

same variance for both

s_p^2 = ((n1 - 1)s1^2 + (n2 - 1)s2^2) / (n1 + n2 - 2)

t = ((x1 - x2) - (mu1 - mu2)) / (s_p * sqrt((1/n1) + (1/n2)))
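
A sketch of the pooled calculation from raw samples, with the pooled variance divided by n1 + n2 - 2 as a whole. The returned degrees of freedom, n1 + n2 - 2, are the standard choice for the pooled test; `diff0` is a hypothetical name for the hypothesized mu1 - mu2.

```python
from math import sqrt
from statistics import fmean, variance

def pooled_t(sample1, sample2, diff0=0.0):
    """Pooled two-sample t statistic and its degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    sp2 = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)
    t = ((fmean(sample1) - fmean(sample2)) - diff0) / (sqrt(sp2) * sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

t_pooled, df = pooled_t([5, 7, 9], [1, 2, 3, 4, 5])
print(round(t_pooled, 3), df)
```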

43
Q

sample proportion (categorical)

A

given a simple random sample of size n from a population, the number of “successes” X divided by the sample size n gives us p_hat the sample proportion

The count X follows a binomial distribution B(n, p), so p_hat has mean p and variance (p(1-p))/n

An approximate level C confidence interval for p is p_hat +/- z* sqrt(p_hat(1-p_hat)/n), where z* is the upper (1-C)/2 critical value from the standard normal distribution.
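
A minimal sketch of the interval, plugging p_hat into the standard error since p itself is unknown. The example reuses the 15-defective-out-of-300 figures from the Statistic card.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, level=0.95):
    """Approximate level-C confidence interval for a proportion."""
    p_hat = successes / n
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)   # upper (1-C)/2 critical value
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

low, high = proportion_ci(15, 300)
print(round(low, 4), round(high, 4))
```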

44
Q

Confidence Intervals for Unknown Mean and Known Standard Deviation

A

For a population with unknown mean mu and known standard deviation sigma, a confidence interval for the population mean, based on a simple random sample (SRS) of size n, is x_bar +/- z*(sigma/sqrt(n)), where z* is the upper (1-C)/2 critical value for the standard normal distribution.

45
Q

Level C

A

gives the probability that the interval produced by the method employed includes the true value of the parameter theta

46
Q

Confidence Intervals for Unknown Mean and Unknown Standard Deviation

A

For a population with unknown mean mu and unknown standard deviation, a confidence interval for the population mean, based on a simple random sample (SRS) of size n, is x_bar +/- t* (s/sqrt(n)), where t* is the upper (1-C)/2 critical value for the t distribution with n-1 degrees of freedom, t(n-1).

s = sample standard deviation (the estimate of sigma); s/sqrt(n) is the standard error of the mean

47
Q

Significance Tests for Unknown Mean and Known Standard Deviation

A

For claims about a population mean from a population with a normal distribution or for any sample with large sample size n (for which the sample mean will follow a normal distribution by the Central Limit Theorem), if the standard deviation sigma is known, the appropriate significance test is known as the z-test, where the test statistic is defined as

z = (x_bar - mu_0)/(sigma/sqrt(n)), where mu_0 is the mean claimed under H0

48
Q

Power of a test

A

the probability that a fixed level significance test will reject the null hypothesis H0 when a particular alternative value of the parameter is true.

49
Q

Significance Tests for Unknown Mean and Unknown Standard Deviation

A

For claims about a population mean from a population with a normal distribution or for any sample with large sample size n (for which the sample mean will follow a normal distribution by the Central Limit Theorem) with unknown standard deviation, the appropriate significance test is known as the t-test, where the test statistic is defined as

t = (x_bar - mu_0)/(s/sqrt(n)), where mu_0 is the mean claimed under H0

50
Q

Sign test

A

To perform a sign test on matched pairs data, take the difference between the two measurements in each pair and count the number of non-zero differences n. Of these, count the number of positive differences X. Determine the probability of observing X positive differences for a B(n,1/2) distribution, and use this probability as a P-value for the null hypothesis.
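
A sketch of the procedure. It uses tail probabilities (a count at least as extreme as X) and doubles the smaller tail for a two-sided P-value; both are assumptions of this example, since the card does not say which alternative is being tested. The differences are made up.

```python
from math import comb

def sign_test_p_value(differences):
    """Return n, X and a two-sided P-value under B(n, 1/2)."""
    nonzero = [d for d in differences if d != 0]
    n = len(nonzero)
    x = sum(d > 0 for d in nonzero)
    upper = sum(comb(n, k) for k in range(x, n + 1)) / 2**n   # P(X >= x)
    lower = sum(comb(n, k) for k in range(0, x + 1)) / 2**n   # P(X <= x)
    return n, x, min(1.0, 2 * min(upper, lower))

n_pairs, positives, p_value = sign_test_p_value(
    [1.2, 0.5, -0.3, 0.8, 0.0, 2.1, 0.4, 0.9, 1.1]
)
print(n_pairs, positives, p_value)   # 8 7 0.0703125
```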

51
Q

categorical test single proportion

A

To test the null hypothesis H0: p = p0 against a one- or two-sided alternative hypothesis Ha, replace p with p0 in the test statistic

z = (p_hat - p0)/sqrt((p0*(1-p0))/n)
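
A minimal sketch of the test, taking the sample proportion p_hat = successes/n in the numerator and reporting a two-sided P-value (a choice made for this example). The counts are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z(successes, n, p0):
    """z statistic for H0: p = p0, with a two-sided P-value."""
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided

z1, p1 = one_proportion_z(successes=60, n=100, p0=0.5)
print(round(z1, 3), round(p1, 4))
```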

52
Q

categorical sample size single proportion

A

n = (z*/m)² p*(1-p*), where m is the desired margin of error and p* is a guessed value for the proportion.

The margin of error is maximized when p* = 0.5, in which case

n = (z*/(2m))².
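
A short sketch of the sample-size calculation, rounding up to the next whole observation. The function and parameter names are illustrative; p* defaults to the worst case 0.5.

```python
from math import ceil
from statistics import NormalDist

def sample_size(margin, level=0.95, p_star=0.5):
    """Smallest n whose approximate margin of error is at most `margin`."""
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return ceil((z_star / margin) ** 2 * p_star * (1 - p_star))

print(sample_size(0.03))              # 1068
print(sample_size(0.03, p_star=0.2))  # smaller, since p*(1-p*) is below 0.25
```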

53
Q

Comparison of Two Proportions

A

An approximate level C confidence interval for p1 - p2 is (p_hat1 - p_hat2) +/- z* s_D, where z* is the upper (1-C)/2 critical value from the standard normal distribution.

s_D = sqrt( (p_hat1(1-p_hat1)/n1) + (p_hat2(1-p_hat2)/n2) )

54
Q

Test two proportions

A

To test the null hypothesis H0: p1 = p2 against a one- or two-sided alternative hypothesis Ha, first compute a pooled estimate for the parameter, p_hat = (X1 + X2)/(n1 + n2), where X1 and X2 represent the number of “successes” in each population sample

s_p = sqrt(p_hat(1-p_hat)(1/n1 + 1/n2))

z = (p_hat1 - p_hat2)/s_p follows the standard normal distribution (with mean = 0 and standard deviation = 1) under H0. The test statistic z is used to compute the P-value
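
A minimal sketch of the pooled two-proportion test, with the numerator taken as the difference of the two sample proportions X1/n1 - X2/n2. The two-sided P-value and the counts are choices made for this example.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Pooled z statistic for H0: p1 = p2, with a two-sided P-value."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                        # pooled estimate
    s_p = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / s_p
    return z, 2 * (1 - NormalDist().cdf(abs(z)))

z2, p2 = two_proportion_z(x1=60, n1=100, x2=40, n2=100)
print(round(z2, 3), round(p2, 4))
```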

55
Q

chi-squared statistic

A

chi^2 = Sum( (observed - expected)^2/expected )

56
Q

chi-squared distribution

A

A random variable is said to have a chi-square distribution with m degrees of freedom if it is the sum of the squares of m independent standard normal random variables

The distribution of the chi-square test statistic based on k counts is approximately the chi-square distribution with m = k-1 degrees of freedom, denoted chi^2(k-1).

57
Q

chi-squared fitting

A

In general, if we estimate d parameters under the null hypothesis with k possible counts the degrees of freedom for the associated chi-square distribution will be k - 1 - d.

58
Q

chi-squared hypothesis test

A

use the chi-square test to test the validity of a distribution assumed for a random phenomenon. The test evaluates the null hypotheses H0 (that the data are governed by the assumed distribution) against the alternative (that the data are not drawn from the assumed distribution).

Let p1, p2, …, pk denote the probabilities hypothesized for k possible outcomes. In n independent trials, we let Y1, Y2, …, Yk denote the observed counts of each outcome, which are to be compared to the expected counts np1, np2, …, npk.

The chi-square test statistic is

q_(k-1) = (Y1 - np1)²/np1 + (Y2 - np2)²/np2 + … + (Yk - npk)²/npk

Reject H0 if this value exceeds the upper alpha critical value of the chi^2(k-1) distribution, where alpha is the desired level of significance.
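
A worked sketch of the goodness-of-fit statistic for a die assumed fair under H0. The counts are invented for illustration, and the critical value 11.07 is the tabulated upper 5% point of the chi-square distribution with 5 degrees of freedom.

```python
def chi_square_statistic(observed, probs):
    """Sum of (observed - expected)^2 / expected, with expected = n * p."""
    n = sum(observed)
    return sum((y - n * p) ** 2 / (n * p) for y, p in zip(observed, probs))

counts = [8, 12, 9, 11, 6, 14]      # made-up counts from 60 rolls
q = chi_square_statistic(counts, [1 / 6] * 6)
critical_value = 11.07              # from a chi-square table: df = 5, alpha = 0.05
print(round(q, 3), q > critical_value)   # 4.2 False, so H0 is not rejected
```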

59
Q

Permutations and combinations

A

Permutations are for lists (order matters)

Combinations are for groups (order doesn’t matter)

60
Q

Permutation formula

A

P(n,k)=n!/(n-k)!

You have n items and want to find the number of ways k of those items can be ordered

N pick k

61
Q

Combination formula

A

C(n,k) = n!/(k!(n-k)!)
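
A quick check of the permutation and combination formulas against the standard library's `math.perm` and `math.comb`; the helper names are illustrative.

```python
from math import comb, factorial, perm

def permutations_count(n, k):
    """P(n,k) = n!/(n-k)!: ordered selections of k items from n."""
    return factorial(n) // factorial(n - k)

def combinations_count(n, k):
    """C(n,k) = n!/(k!(n-k)!): unordered selections of k items from n."""
    return factorial(n) // (factorial(k) * factorial(n - k))

print(permutations_count(5, 3), perm(5, 3))   # 60 60
print(combinations_count(5, 3), comb(5, 3))   # 10 10
```

Each unordered group of k items can be ordered in k! ways, which is why P(n,k) = C(n,k) * k!.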

62
Q

Multinomial coefficient

A

n!/(k1! k2! k3! … km!)

Interpreted as the number of ways of depositing n distinct objects into m distinct bins, with k1 objects in the first bin, k2 objects in the second bin, and so on.

Equivalently, the number of distinct ways to permute a multiset of n elements, where the ki are the multiplicities of each of the distinct elements
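
A short sketch of the coefficient, using the letters of MISSISSIPPI (one M, four I, four S, two P) as the multiset example. The function name is illustrative.

```python
from math import factorial, prod

def multinomial(*ks):
    """n!/(k1! k2! ... km!) with n = k1 + k2 + ... + km."""
    return factorial(sum(ks)) // prod(factorial(k) for k in ks)

print(multinomial(1, 4, 4, 2))   # 34650 distinct arrangements of MISSISSIPPI
```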