Statistics Flashcards

1
Q

What is the Central Limit Theorem? What is the importance of this? What are the assumptions for the CLT?

A

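The sampling distribution of the mean will approximately follow a normal distribution, regardless of the shape of the population distribution, provided the assumptions below hold. This matters because it lets us apply normal-based methods (confidence intervals, hypothesis tests) to sample means even when the population itself is not normal.

Assumptions:

  1. Data must be sampled randomly
  2. Samples should be independent of each other
  3. The sample size should be sufficiently large (a common rule of thumb is n > 30; some sources use 50 or 100)
  4. When sampling without replacement, the sample size should be no more than 10% of the population

Also:
The standard deviation of the sampling distribution of the mean (the standard error) is equal to the population standard deviation divided by the square root of the sample size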
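The standard-error relationship above can be checked by simulation. The following is a minimal sketch, assuming a hypothetical skewed population (exponential with mean 1 and standard deviation 1) and arbitrary choices of sample size and repetition count. It is an illustration, not a proof.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Assumed population: exponential with rate 1, so mean = 1 and sd = 1.
population_sd = 1.0
sample_size = 50
num_samples = 5000

# Draw many independent samples and record the mean of each one.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(sample_size)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
observed_se = statistics.stdev(sample_means)
theoretical_se = population_sd / sample_size ** 0.5

print(f"mean of sample means: {mean_of_means:.3f}")
print(f"observed sd of sample means: {observed_se:.3f}")
print(f"sigma / sqrt(n): {theoretical_se:.3f}")
```

Even though the population is strongly skewed, the sample means cluster around the population mean with a spread close to sigma / sqrt(n).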
2
Q

What is a normal distribution? What are the properties?

A

A distribution that has three main properties: the mean, median, and mode are equal to each other; it is symmetric around the mean; and 50 percent of the values lie on each side of the mean.

3
Q

What is a simple random sample? What are the advantages of a simple random sample? Disadvantages?

A

A randomly selected subset of a population where each member of the population has an exactly equal chance of being selected. An advantage is that little has to be known about the population in advance to conduct this sampling method. A disadvantage is that it requires a complete list of every element in the population (a sampling frame).
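A minimal sketch of a simple random sample, assuming a hypothetical sampling frame of 100 member IDs. Note that the complete list must exist before any member can be drawn, which is the disadvantage mentioned above.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical sampling frame: a complete list of population member IDs.
population = list(range(1, 101))

# random.sample draws without replacement; every member is equally likely.
sample = random.sample(population, k=10)
print(sample)
```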

4
Q

What is a population?

A

The entire group you want to draw conclusions about.

5
Q

What is a sample?

A

A subset of the population you will collect information from.

6
Q

What is systematic sampling? When should I use this?

A

A sampling method where every nth element is taken.
Ex. Taking every 10th element.
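A minimal sketch of systematic sampling. The random starting offset is an assumption added here (a common convention); the card itself only says that every nth element is taken. The function name `systematic_sample` is hypothetical.

```python
import random

def systematic_sample(items, step):
    """Take every `step`-th element, starting from a random offset in [0, step)."""
    start = random.randrange(step)
    return items[start::step]

random.seed(2)
population = list(range(1, 101))  # hypothetical ordered population
sample = systematic_sample(population, step=10)
print(sample)
```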

7
Q

What is convenience sampling?

A

A sampling method where the data first accessed is used (hence the term convenience).

8
Q

What is stratified sampling?

A

A sampling method that first divides the population into groups called strata. A sample is taken from each of these strata using either random, systematic, or convenience sampling.
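A minimal sketch of stratified sampling, assuming simple random sampling within each stratum and a fixed number of draws per stratum. The records, the stratum labels, and the function name `stratified_sample` are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, per_stratum):
    """Group records into strata, then draw a simple random sample from each."""
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    sample = []
    for members in strata.values():
        # Never ask for more records than the stratum contains.
        sample.extend(random.sample(members, min(per_stratum, len(members))))
    return sample

random.seed(3)
# Hypothetical population: (stratum label, member id) pairs of unequal sizes.
people = (
    [("A", i) for i in range(60)]
    + [("B", i) for i in range(30)]
    + [("C", i) for i in range(10)]
)
sample = stratified_sample(people, stratum_of=lambda p: p[0], per_stratum=5)
print(sample)
```

Every stratum is represented in the result, even the smallest one, which a simple random sample of the same size would not guarantee.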

9
Q

What is cluster sampling?

A

A sampling method that first divides the population into clusters. A subset of the clusters is then randomly selected, and all elements in the selected clusters are used.
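A minimal sketch of cluster sampling, assuming the clusters are already defined as a mapping from cluster name to members. The cluster names and the function name `cluster_sample` are hypothetical.

```python
import random

def cluster_sample(clusters, num_clusters):
    """Randomly pick whole clusters and keep every element inside them."""
    chosen = random.sample(sorted(clusters), num_clusters)
    members = [member for name in chosen for member in clusters[name]]
    return chosen, members

random.seed(4)
# Hypothetical clusters, e.g. classrooms and the students in each.
clusters = {
    "room_1": ["a", "b", "c"],
    "room_2": ["d", "e"],
    "room_3": ["f", "g", "h", "i"],
    "room_4": ["j"],
}
chosen, members = cluster_sample(clusters, num_clusters=2)
print(chosen, members)
```

Unlike stratified sampling, only some groups appear in the result, but those groups are included in full.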

10
Q

How can I test if my data is normally distributed?

A

1. Q-Q plot (quantile-quantile plot), which plots the theoretical quantiles of a normal distribution against the actual quantiles of our variable.
2. Hypothesis testing
a. The Shapiro-Wilk test is generally considered among the most powerful tests for normality.

Note:
If I were explaining to a general audience, I would show it visually with a histogram.
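A minimal sketch of the numbers behind a Q-Q plot, using only the standard library. This is not the Shapiro-Wilk test; it computes the quantile pairs and, as an informal summary, the correlation between theoretical and observed quantiles (close to 1 when the points fall on a straight line). The plotting positions (i + 0.5) / n and the helper names are assumptions made for this illustration.

```python
import random
from statistics import NormalDist, fmean, stdev

def qq_points(data):
    """Return (theoretical, observed) quantile pairs against a fitted normal."""
    ordered = sorted(data)
    n = len(ordered)
    fitted = NormalDist(fmean(ordered), stdev(ordered))
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

def correlation(pairs):
    """Pearson correlation of the paired values."""
    xs, ys = zip(*pairs)
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(5)
normal_data = [random.gauss(10, 2) for _ in range(500)]
skewed_data = [random.expovariate(1.0) for _ in range(500)]

r_normal = correlation(qq_points(normal_data))
r_skewed = correlation(qq_points(skewed_data))
print(f"normal data: r = {r_normal:.4f}")
print(f"skewed data: r = {r_skewed:.4f}")
```

Plotting the pairs from `qq_points` gives the usual Q-Q plot; the skewed sample bends away from the straight line and yields a visibly lower correlation.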

11
Q

What is power?

A

The probability of rejecting the null hypothesis given that it is false. Power = 1 - beta, where beta is the probability of a Type II error.
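As a worked illustration, power can be computed in closed form for a one-sided z-test. This is a minimal sketch under assumptions not stated in the card: a known population standard deviation, a one-sided alternative with a positive effect, and hypothetical values for the effect size and sample size.

```python
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test of H0: mu = mu0 vs H1: mu = mu0 + effect."""
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha)     # rejection threshold under H0
    shift = effect * n ** 0.5 / sigma   # mean of the test statistic under H1
    beta = z.cdf(critical - shift)      # P(fail to reject | H1 is true)
    return 1 - beta

power = z_test_power(effect=0.5, sigma=1.0, n=25)
print(f"power = {power:.3f}")
```

With an effect of zero the function returns alpha, since rejecting a true null is exactly the Type I error rate.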

12
Q

What is significance?

A
13
Q

What is a hypothesis test? Can you take me through each step?

A
14
Q

What is a confidence interval? How do I calculate it?

A
15
Q

What is ANOVA?

A
16
Q

What is a random variable?

A

A function that assigns a numerical value to each possible outcome of an experiment.

17
Q

What are some non-parametric statistical tests?

A
  1. The Kolmogorov–Smirnov test computes the distances between the empirical distribution function of the sample and a theoretical cumulative distribution function, and defines the test statistic as the supremum of those distances.
  2. The Anderson–Darling test is a statistical test of whether a given sample of data is drawn from a given probability distribution.
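A minimal sketch of the one-sample Kolmogorov–Smirnov statistic described in item 1, written with the standard library only. It returns the statistic but not a p-value, and the function name `ks_statistic` and the sample data are hypothetical.

```python
import random
from statistics import NormalDist

def ks_statistic(data, cdf):
    """Largest gap between the empirical CDF of `data` and a theoretical CDF."""
    ordered = sorted(data)
    n = len(ordered)
    largest = 0.0
    for i, x in enumerate(ordered):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        largest = max(largest, (i + 1) / n - f, f - i / n)
    return largest

random.seed(6)
normal_sample = [random.gauss(0, 1) for _ in range(200)]
d_normal = ks_statistic(normal_sample, NormalDist().cdf)
print(f"D = {d_normal:.4f}")
```

Because the empirical CDF is a step function, the supremum is always attained at one of the observed points, which is why checking both sides of each jump is sufficient.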
18
Q

What is a Bernoulli random variable? Binomial?

A

A Bernoulli random variable takes the value 1 with probability of p and the value 0 with probability of 1 − p. It is frequently used to represent binary experiments, such as a coin toss. A binomial random variable is the sum of n independent Bernoulli random variables with probability p.
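A minimal sketch of the relationship above: a binomial draw is built as a sum of independent Bernoulli draws, and the simulated frequency is compared with the exact binomial probability. The parameter values (n = 10, p = 0.3) and the helper names are assumptions chosen for illustration.

```python
import random
from math import comb

def bernoulli(p):
    """Return 1 with probability p, otherwise 0."""
    return 1 if random.random() < p else 0

def binomial(n, p):
    """Sum of n independent Bernoulli(p) draws."""
    return sum(bernoulli(p) for _ in range(n))

def binomial_pmf(k, n, p):
    """Exact probability of k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

random.seed(7)
draws = [binomial(10, 0.3) for _ in range(20000)]
empirical = draws.count(3) / len(draws)
exact = binomial_pmf(3, 10, 0.3)
print(f"simulated P(X = 3): {empirical:.4f}")
print(f"exact P(X = 3):     {exact:.4f}")
```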

19
Q

What is a geometric random variable?

A

A geometric random variable counts the number of trials that are required to observe a single success, where each trial is independent and has success probability p. For example, this distribution can be used to model the number of times a die must be rolled in order for a six to be observed.
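The die example above can be sketched as a short simulation. For success probability p = 1/6 the expected number of rolls is 1/p = 6; the number of repetitions is an arbitrary choice.

```python
import random

def rolls_until_six():
    """Roll a fair die until a six appears; return the number of rolls used."""
    rolls = 1
    while random.randint(1, 6) != 6:
        rolls += 1
    return rolls

random.seed(8)
trials = [rolls_until_six() for _ in range(20000)]
average_rolls = sum(trials) / len(trials)
print(f"average rolls needed: {average_rolls:.2f} (theory: 6)")
```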

20
Q

What is a Poisson random variable?

A

A Poisson random variable counts the number of events occurring in a fixed interval of time or space, given that these events occur independently of one another at a constant average rate λ.
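A minimal sketch of the Poisson probability mass function, P(X = k) = λ^k e^(-λ) / k!. The rate λ = 4 is a hypothetical value, and `poisson_pmf` is a helper name introduced here.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the average rate is lam."""
    return lam ** k * exp(-lam) / factorial(k)

lam = 4.0  # hypothetical average of 4 events per interval
for k in range(8):
    print(f"P(X = {k}) = {poisson_pmf(k, lam):.4f}")

# The probabilities over all k sum to 1 (checked here over a long finite range).
total = sum(poisson_pmf(k, lam) for k in range(60))
```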

21
Q

What is a Negative Binomial random variable?

A

A negative binomial random variable counts the number of successes in a sequence of independent Bernoulli trials with parameter p before r failures occur. For example, this distribution could be used to model the number of heads that are flipped before three tails are observed in a sequence of coin tosses.
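The coin example above can be sketched as a simulation. Under the convention used in this card (successes counted before the r-th failure), the expected count is r * p / (1 - p), which is 3 for a fair coin and r = 3. The function name is hypothetical.

```python
import random

def successes_before_failures(r, p):
    """Count successes (probability p each) observed before the r-th failure."""
    successes = failures = 0
    while failures < r:
        if random.random() < p:
            successes += 1
        else:
            failures += 1
    return successes

random.seed(9)
counts = [successes_before_failures(r=3, p=0.5) for _ in range(20000)]
average_heads = sum(counts) / len(counts)
expected_heads = 3 * 0.5 / (1 - 0.5)  # r * p / (1 - p)
print(f"average heads before 3 tails: {average_heads:.2f} (theory: {expected_heads})")
```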

22
Q

What is a uniform distribution?

A

Uniform: The uniform distribution is a continuous distribution such that all intervals of equal length on the distribution’s support have equal probability. For example, this distribution might be used to model people’s full birth dates, where it is assumed that all times in the calendar year are equally likely.

23
Q

T?

A

Student T: Student’s t-distribution, or simply the t-distribution, arises when estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown.

24
Q

Chi Squared?

A

Chi Squared: A chi-squared random variable with k degrees of freedom is the sum of k independent and identically distributed squared standard normal random variables. It is often used in hypothesis testing and in the construction of confidence intervals.
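A minimal sketch of the construction above: each chi-squared draw is the sum of k squared standard normal draws. The mean of a chi-squared variable with k degrees of freedom is k, which the simulation should approximately reproduce. The choice k = 5 is arbitrary.

```python
import random

def chi_squared_draw(k):
    """Sum of k independent squared standard normal draws."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(k))

random.seed(10)
k = 5
draws = [chi_squared_draw(k) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
print(f"sample mean: {sample_mean:.2f} (theory: {k})")
```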

25
Q

Exponential?

A

Exponential: The exponential distribution is the continuous analogue of the geometric distribution. It is often used to model waiting times.

26
Q

F?

A

F: The F-distribution, also known as the Fisher–Snedecor distribution, arises frequently as the null distribution of a test statistic, most notably in the analysis of variance.

27
Q

Gamma?

A

Gamma: The gamma distribution is a general family of continuous probability distributions. The exponential and chi-squared distributions are special cases of the gamma distribution.

28
Q

Beta?

A

Beta: The beta distribution is a general family of continuous probability distributions bound between 0 and 1. The beta distribution is frequently used as a conjugate prior distribution in Bayesian statistics.
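A minimal sketch of the conjugate-prior property mentioned above, for Bernoulli (success/failure) data: a Beta(alpha, beta) prior updated with s successes and f failures gives a Beta(alpha + s, beta + f) posterior. The prior and the observed counts are hypothetical.

```python
def update_beta_prior(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing Bernoulli data."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Hypothetical data: a uniform Beta(1, 1) prior, then 7 successes and 3 failures.
posterior_alpha, posterior_beta = update_beta_prior(1, 1, successes=7, failures=3)
posterior_mean = beta_mean(posterior_alpha, posterior_beta)
print(posterior_alpha, posterior_beta, round(posterior_mean, 3))
```

The update needs no integration, only addition of counts, which is the practical appeal of a conjugate prior.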

29
Q

What is Bayes' theorem?

A
30
Q

What is an ANOVA test?

A

A test that allows one to make comparisons between the means of three or more groups of data.
One-way: a statistical test that looks for differences between group means while considering only one factor (independent variable)
Two-way: the same idea with two factors (independent variables)
Advantages:
It allows comparisons to be made between three or more groups of data (many tests, such as the t-test, compare only two)
Assumptions:
1. Normality – that each sample is taken from a normally distributed population
2. Sample independence – that each sample has been drawn independently of the other samples
3. Variance equality – that the variance of data in the different groups should be the same
4. Continuous dependent variable – that the dependent variable is measured on a continuous scale
5. (for two-way): the two independent variables should each consist of categorical, independent groups.
Purpose: A one-way ANOVA is primarily designed to test the equality of three or more means. A two-way ANOVA is designed to assess the effect of two independent variables, and their interaction, on a dependent variable.
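A minimal sketch of the one-way ANOVA F statistic, computed by hand as the ratio of between-group to within-group mean squares. It returns the statistic only (no p-value), and the three groups are hypothetical data chosen so the arithmetic is easy to follow.

```python
from statistics import fmean

def one_way_anova_f(groups):
    """F statistic: between-group mean square over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = fmean([x for g in groups for x in g])
    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each observation around its own group mean.
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # hypothetical measurements
f_stat = one_way_anova_f(groups)
print(f"F = {f_stat:.2f}")
```

Here the between-group sum of squares is 26 (2 degrees of freedom) and the within-group sum of squares is 6 (6 degrees of freedom), so F = 13 / 1 = 13. A large F suggests that at least one group mean differs from the others.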