Stats Year 2 Flashcards

Question

what are ways to measure central tendency and their related ways to measure spread?

Answer 1

1. median: uses IQR, range 2. mean: often used in normal distributions, where mean = median = mode. Uses variance or standard deviation (the square root of the variance). Standard error: like standard deviation, but for sample means 3. mode

Answer 2

a way to measure spread: how much individuals in a population deviate from the mean (or another measure of central tendency, but normally the mean) - standard dev, variance, standard error for continuous data - variance = (SD)^2 - SD = root variance - split into sample variance and population variance, depending on whether the data are a sample or the whole population

Answer 3

* sample variance is divided by n-1 (not n) to reduce bias, since dividing by n tends to underestimate the population variance * sample variance is the variance used to calculate spread of a sample distribution, population variance is used to calculate spread of a population distribution * larger variance/SD = more spread out the data is

Answer 4

* standard deviation for a sample divides by n-1, for a population divides by N * SD is used instead of variance because taking the square root undoes the squaring in the variance, putting spread back in the same units as the data so it's more comparable
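A minimal sketch of the n-1 versus N divisors, using only Python's standard library. The five data values are made up for illustration.

```python
import math
import statistics

# Hypothetical measurements, invented for illustration only.
data = [4.0, 6.0, 7.0, 9.0, 14.0]

n = len(data)
mean = sum(data) / n
sum_sq = sum((y - mean) ** 2 for y in data)  # sum of squared deviations

sample_var = sum_sq / (n - 1)   # sample variance: divide by n - 1
population_var = sum_sq / n     # population variance: divide by N

sample_sd = math.sqrt(sample_var)  # SD is the square root of the variance

# The standard library gives the same answers as the hand calculation.
lib_sample_var = statistics.variance(data)
lib_population_var = statistics.pvariance(data)
```

The sample variance comes out larger than the population variance for the same numbers, because the divisor is smaller.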

Answer 5

* about 68.3% of values within 1 SD of mean * about 95% of values within 2 SD of mean (for a normal distribution)

Answer 6

* large value of CV = relatively high variation * CV scales SD to allow comparisons on same scales to be made, regardless of magnitudes
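As a rough illustration, CV can be computed as the SD expressed as a percentage of the mean. The function name and the body-mass numbers below are invented for this sketch.

```python
import statistics

def coefficient_of_variation(values):
    """Return the sample SD as a percentage of the sample mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)

# Hypothetical body masses, invented for illustration.
mouse_g = [18.0, 20.0, 22.0]             # grams
elephant_kg = [4500.0, 5000.0, 5500.0]   # kilograms

cv_mouse = coefficient_of_variation(mouse_g)
cv_elephant = coefficient_of_variation(elephant_kg)
```

Both groups have a CV of 10% here, even though their SDs (2 g and 500 kg) differ by orders of magnitude.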

Answer 7

* mean: affected greatly by outliers, takes magnitude into account * median: not affected greatly by outliers, not affected by magnitude * median better for skewed distributions to find the typical (central) value (or when outliers present) * mean better for symmetrical distributions to find the typical (central) value (or when no outliers present) * symmetrical: mean = median = mode (so it doesn't really matter) * uniform distribution: no mode

Answer 8

* turns frequency distribution into a normal distribution (or a PDF) * central limit theorem = when you sample again and again and create a distribution of the means of multiple samples, with each sample having a large sample size, the distribution of the resampled means is approximately a NORMAL DISTRIBUTION * the MEAN of the resampled distribution (sampling distribution of Y bar) is used to estimate the true μ (mean of population).

Answer 9

The sampling distribution of the sample mean refers to the distribution you get when you consider the means of many different samples taken from the same population. Each sample has its own average value and if you plot the frequency of these averages, you get the sampling distribution of the sample mean.
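A small simulation can make this concrete. The sketch below assumes a skewed (exponential) population with true mean 1, a sample size of 50 and 2000 repeated samples; all of these choices are arbitrary illustration values.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def draw_sample_mean(n):
    """Mean of one random sample of size n from an exponential population (true mean 1)."""
    sample = [rng.expovariate(1.0) for _ in range(n)]
    return statistics.mean(sample)

# Many sample means together form the sampling distribution of the mean.
sample_means = [draw_sample_mean(50) for _ in range(2000)]

mean_of_means = statistics.mean(sample_means)   # should sit close to the true mean
sd_of_means = statistics.stdev(sample_means)    # this spread is the standard error
```

Plotting a histogram of `sample_means` would show a roughly bell-shaped curve even though the population itself is strongly skewed.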

Answer 10

* μ = true mean of population * Y bar = a single sample mean of a frequency distribution of a sample * Y hat: mean of the sampling distribution (normal distribution of multiple sample means) * Y hat should be an estimate of the true value μ, i.e. μ ≈ Y hat

Answer 11

standard deviation of a sampling distribution. affected by precision and sampling error, hence sample size affects the precision/spread/standard error of the mean derived from the sampling distribution

Answer 12

same equation

Answer 13

* increase in sample size = increase in precision = decrease in spread = decrease in SE * small SE = more precise = more n * small SE = large n
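A quick numerical check of this relationship, assuming the usual formula for the standard error of the mean (SD divided by the square root of n) and a made-up SD of 10.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: sample SD divided by the square root of n."""
    return sd / math.sqrt(n)

sd = 10.0  # hypothetical sample standard deviation

# Standard error for increasing sample sizes.
se_by_n = {n: standard_error(sd, n) for n in (25, 100, 400)}
```

Each fourfold increase in n halves the SE (2.0, 1.0, 0.5), so precision improves with sample size but with diminishing returns.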

Answer 14

a range of values which is likely to contain the population parameter (like the mean or proportion) with a certain level of confidence.

Answer 15

It does not mean there's a 95% probability that the specific confidence interval you calculated from your sample data contains the true population parameter. The true population parameter is either in the interval or not; the interval does not tell us about the likelihood of its location within the range. * IT IS ABOUT METHOD RELIABILITY! simply put, if you were to take 100 samples and calculate a 95% confidence interval from each, about 95 of those intervals would contain the true population parameter
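The "method reliability" idea can be checked by simulation. This sketch assumes a normal population with a known SD (so a z-based interval can be used with the standard library alone); the population values, sample size and number of repeats are invented.

```python
import math
import random
from statistics import NormalDist, mean

rng = random.Random(1)            # fixed seed for repeatability
MU, SIGMA, N = 50.0, 10.0, 30     # hypothetical population mean, SD and sample size

z = NormalDist().inv_cdf(0.975)   # about 1.96 for a 95% interval
se = SIGMA / math.sqrt(N)         # standard error, assuming SIGMA is known

def interval_contains_mu():
    """Draw one sample, build its 95% CI, report whether it covers the true mean."""
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    y_bar = mean(sample)
    return y_bar - z * se <= MU <= y_bar + z * se

repeats = 1000
coverage = sum(interval_contains_mu() for _ in range(repeats)) / repeats
```

`coverage` should land near 0.95: the 95% describes how often the procedure captures the true mean across repeated samples, not any single interval.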

Answer 16

process/experiment where 2+ outcomes cannot be predicted with certainty. random sampling is a type of random trial

Answer 17

proportion of times an event occurs when a random trial is repeated under same conditions

Answer 18

* probability distribution: true relative frequency of all possible values of a discrete random variable. All mutually exclusive outcomes of a random trial * probability density function: true relative frequency of all possible values of a continuous random variable

Answer 19

* events that cannot occur at the same time

Answer 20

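add probabilities together if they are OR (this works when the events are mutually exclusive; otherwise subtract the probability of both occurring)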

Answer 21

union = OR intersect = AND

Answer 22

AND = intersect

Answer 23

2 equations that can be used for different reasons: the law of total probability, or conditional probability P(A|B) = P(A ∩ B) / P(B). Draw a probability tree if not sure
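A worked example of conditional probability, P(A|B) = P(A and B) / P(B), using one roll of a fair die. The events A and B are chosen purely for illustration.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}   # one roll of a fair die
A = {2, 4, 6}                   # event A: the roll is even
B = {4, 5, 6}                   # event B: the roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

p_a_and_b = prob(A & B)          # intersection: both A and B happen
p_a_given_b = p_a_and_b / prob(B)
```

Here P(A and B) = 2/6 and P(B) = 3/6, so P(A|B) = 2/3: once we know the roll is above 3, two of the three remaining outcomes are even.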

Answer 24

H0 = testing a claim, by taking the viewpoint that there is 'no change/difference' in parameters being tested

Answer 25

H1 = taking the meaningful viewpoint: something has changed i.e: X is lower, or X is different to what is believed

Answer 26

* 1 sided: is less than/greater than * 2 sided: different to

Answer 27

* type 1: rejecting null hypothesis when null hypothesis is true (false positive) * type 2: failing to reject null hypothesis when null hypothesis is false (false negative)

Answer 28

* critical values are based on the significance level * significance level = 5% * determines how significant our data is; 5% means we accept committing a false positive (type 1 error) less than 5% of the time

Answer 29

The p-value is the probability of observing results at least as extreme as those measured in your study, given that the null hypothesis is true. * p < 0.05 is strong evidence against the null hypothesis, so reject the null hypothesis * p >= 0.05 is NOT enough evidence to reject H0

Answer 30

1) The size of the effect we are measuring is small (decreases power) 2) The sample size is too small (decreases power) 3) The variance (σ²) is large (decreases power)

Answer 31

the probability of a test correctly rejecting the null hypothesis when it is false. beta is the probability of a false negative (type 2 error) occurring, so Power = 1 - beta * power is used to calculate the sample size required in experiments. beta is how often you are willing to have false negatives, i.e. if beta is 0.2, then you are happy having false negatives 20% of the time.

Answer 32

the existence of data for which a statistical association holds for a population but is reversed in a subpopulation. It arises when there are hidden variables which influence correlations

Answer 33

among the theories that fit the data equally well, choose the simplest theory.

Answer 34

* correlation might be due to hidden variables, which seemingly ties factors together, even tho they are not caused by each other

Answer 35

* must fit the binomial distribution: discrete variables * only 2 outcomes: success or failure * the events and trials have to be independent of each other * events/trials (n) are finite * probability of success and failure is constant (fixed)

Answer 36

* unbiased estimate of the population proportion * p-hat becomes more precise as sample size increases

Answer 37

* use the confidence interval method: calculate the 95% confidence interval for the proportion and then see if the proportion stated by H0 lies within it * use the binomial equation to calculate the p value
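A sketch of the binomial-equation route. The counts (14 successes in 18 trials) and the null proportion of 0.5 are hypothetical, and doubling the one-tailed probability is only one common convention for a two-sided P value.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n_trials = 18      # hypothetical number of trials
successes = 14     # hypothetical observed successes
p_null = 0.5       # proportion stated by H0

# One-tailed: probability of a result at least as extreme as the one observed.
p_one_tail = sum(binomial_pmf(k, n_trials, p_null)
                 for k in range(successes, n_trials + 1))

# Two-sided P value by doubling the tail (an assumed convention).
p_two_sided = min(1.0, 2 * p_one_tail)
```

With these numbers the two-sided P value is about 0.031, which is below 0.05, so H0 would be rejected.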

Answer 38

* binomial distributions are not necessarily symmetrical, they can be skewed * shape depends on n and p * the larger n is and the closer p is to 0.5, the more the binomial distribution will look like a normal distribution

Answer 39

* can use chi squared test or poisson distribution (and then tested by chi squared) * we can build a proportional model: a proportional model is a probability model where the frequency of occurrence of events is proportional to the number of opportunities

Answer 40

* simple probability model where the frequency of occurrence of events is proportional to the number of opportunities

Answer 41

H0: frequency of births on each day of the week IS PROPORTIONAL to the number of times each day of the week occurs * H1: frequency of births on each day of the week IS NOT PROPORTIONAL to the number of times each day of the week occurs
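A sketch of how these hypotheses could be tested with a chi squared goodness-of-fit calculation. The birth counts are invented, and the year is assumed to have 365 days with Friday occurring 53 times; the critical value 12.59 is the tabulated chi squared value for df = 6 at the 0.05 level.

```python
# Hypothetical births per weekday (Sunday to Saturday), invented for illustration.
observed = [40, 45, 60, 58, 50, 52, 45]

# Assumed number of times each weekday occurs in the year (Friday occurs 53 times).
day_counts = [52, 52, 52, 52, 52, 53, 52]

total_births = sum(observed)
total_days = sum(day_counts)

# Proportional model: expected births are proportional to the number of opportunities.
expected = [total_births * d / total_days for d in day_counts]

# Chi squared statistic: sum of (observed - expected)^2 / expected.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

df = len(observed) - 1 - 0   # no parameters estimated from the data
critical_value = 12.59       # chi squared table: df = 6, significance 0.05
reject_h0 = chi2 > critical_value
```

For these made-up counts the statistic is well below 12.59, so H0 (births proportional to opportunities) would not be rejected.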

Answer 42

* categorical data * more than binomial variables (not just success or fail) * ONLY to test whether there is a difference between observed and expected values * this will not allow specific comparisons between specific categories using chi squared test * degrees of freedom = categories-1- number of parameters (often 0)

Answer 43

df = categories - 1 - number of parameters (this is often 0)

Answer 44

* according to the sampling distribution of the null distribution, given sig level at 0.05 and df (variable) * look on the stats table and find the corresponding critical value * if the calculated chi squared test statistic is less than the critical value at 0.05, then fail to reject H0.

Answer 45

* when any of the categories have an expected frequency < 1 * when more than 20% of the categories have an expected frequency < 5

Answer 46

* use poisson distribution and then test it using chi squared test

Answer 47

* probability distribution * the number of successes (events) in a certain time/space, where successes happen independently of each other and occur with equal probability * for discrete data * used to test whether events are randomly distributed in time/space * e.g. if a laundromat breaks down 3 times every month on average, what is the probability that it breaks down twice next month (can use poisson distribution to calculate)
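The laundromat question can be worked through with the Poisson probability formula, P(X = k) = e^(-lambda) * lambda^k / k!, taking the mean lambda = 3 breakdowns per month. The helper name is just for this sketch.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean number of events is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

mean_breakdowns = 3.0   # average breakdowns per month, from the example
p_two_breakdowns = poisson_pmf(2, mean_breakdowns)
```

This gives a probability of roughly 0.224 of exactly two breakdowns next month.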

Answer 48

The binomial distribution describes a distribution of two possible outcomes designated as successes and failures from a given number of trials. The Poisson distribution focuses only on the number of discrete occurrences over some interval. A Poisson experiment does not have a given number of trials (n) as a binomial experiment does. For example, whereas a binomial experiment might be used to determine how many black cars are in a random sample of 50 cars, a Poisson experiment might focus on the number of cars randomly arriving at a car wash during a 20-minute interval.

Answer 49

* It is a discrete distribution. * Each occurrence is independent of the other occurrences. * It describes discrete occurrences over an interval. * The occurrences in each interval can range from zero to infinity. * The mean number of occurrences must be constant throughout the experiment.

Answer 50

* H0: The number of extinctions per time interval has a Poisson distribution * H1: The number of extinctions per time interval does not have a Poisson distribution * calculate the mean (μ/lambda) first = 4.21 for all data points * for each category, calculate the probability using the Poisson distribution * multiply the probability for each category by the total number of observations (time intervals) to get the expected frequency for each category * then using the chi squared test, calculate the X2 value, which is 23.93 (test statistic) * df = no. categories - 1 - no. parameters = 8-1-1 = 6; the number of parameters is 1 here because one parameter, the mean number of extinctions (lambda), was estimated from the data * critical value for df = 6 and sig = 0.05 is 12.59 * 23.93 > 12.59, hence reject H0 (test statistic is greater than critical value)

Answer 51

* test statistic is the value you calculate from your data to compare with the critical value * critical value is the table value at the 0.05 significance level

Answer 52

* test for association between two or more categorical variables

Answer 53

* relative risk is the probability of an outcome in the treatment group divided by the probability of the same outcome in a control group

Answer 54

* calculate relative risk RR * value close to 1

Answer 55

* calculate 95% confidence interval * must calculate SE for ln[OR] first and then do e^, to find actual CI
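A sketch of relative risk, odds ratio and the confidence interval built on the log scale. The 2x2 counts are invented, and the sketch assumes the standard large-sample formula SE(ln OR) = sqrt(1/a + 1/b + 1/c + 1/d) with a multiplier of 1.96 for 95%.

```python
import math

# Hypothetical 2x2 table, invented for illustration.
a, b = 10, 90   # treatment group: outcome present, outcome absent
c, d = 20, 80   # control group:   outcome present, outcome absent

# Relative risk: probability of the outcome in treatment / probability in control.
relative_risk = (a / (a + b)) / (c / (c + d))

# Odds ratio: odds of the outcome in treatment / odds in control.
odds_ratio = (a / b) / (c / d)

# Work on the log scale, then back-transform with e^ to get the CI for OR.
ln_or = math.log(odds_ratio)
se_ln_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ci_lower = math.exp(ln_or - 1.96 * se_ln_or)
ci_upper = math.exp(ln_or + 1.96 * se_ln_or)
```

Because the interval is built for ln(OR) and then exponentiated, it is not symmetrical around the odds ratio itself.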

Answer 56

df = (row-1)(column-1)
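A sketch of the contingency-table calculation on a made-up 2x3 table, assuming the usual expected-frequency rule (row total times column total, divided by the grand total).

```python
# Hypothetical 2x3 table of observed counts, invented for illustration.
observed = [
    [10, 20, 30],
    [20, 20, 20],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell under independence.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi squared statistic summed over every cell.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# df = (rows - 1)(columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)
```

Here df = (2-1)(3-1) = 2, and the statistic would be compared with the chi squared critical value for 2 degrees of freedom.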

Answer 57

1) No more than 20% of the cells can have an expected frequency < 5 2) No cell can have an expected frequency < 1

Answer 58

* combine categories to increase frequency (only if the combined categories are still meaningful) * use fisher's exact test: only for tables that are 2x2 where the x2 contingency test cannot be used

Answer 59

* when multiple variables, and when the frequency of each cell in a table is too low to use x2 contingency test (expected frequency of cells are too low) * also tests associations between multiple variables

Answer 60

* test stats > critical value then reject H0 * if p value is less than significance level, then reject H0

Answer 61

* odds ratio and relative risk are used where there is a treatment group and a control group. For clinical trials, and 2x2 tables * contingency test: for more than 2x2 tables, and need to calculate expected values. not necessarily a control vs treatment group situation. Tests independence, and whether association exists

Answer 62

* fisher's test provides p value NOT test stats

Answer 63

a probability density is the true relative frequency of all possible values of a continuous random variable

Answer 64

the mean of a large number of measurements randomly sampled from a non-normal distribution is approximately normally distributed

Answer 65

* calculate standard error from the sampling distribution * then calculate student's t: (sample mean - population mean) / standard error * degrees of freedom is n-1; n-1 because we estimated a parameter to calculate t
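The steps above can be sketched as follows. The eight data values and the hypothesised population mean of 4 are invented; looking up the critical t value for df = n-1 is left to a stats table.

```python
import math
import statistics

# Hypothetical sample, invented for illustration.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu_null = 4.0   # population mean stated by H0 (assumed for this sketch)

n = len(sample)
y_bar = statistics.mean(sample)
s = statistics.stdev(sample)          # sample SD (divides by n - 1)

standard_error = s / math.sqrt(n)     # SE of the mean
t_stat = (y_bar - mu_null) / standard_error
df = n - 1
```

The resulting t (about 1.32 with df = 7) would then be compared with the critical value from the t table.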

Answer 66

* can be a 2 tailed or 1 tailed test * hence important to note whether it is 1 tailed (0.05 in one tail) or 2 tailed (0.025 on each side)

Answer 67

sample mean (Y bar) ± (critical value) × standard error

Answer 68

Assumptions 1. Data are randomly sampled from the population 2. variable is normally distributed in the population

Answer 69

* sample size increases, precision increases

Answer 70

* we can use either paired samples t-test, or independent samples t-test

Answer 71

* paired: testing if the mean of the paired differences is 0 (H0: mean change = 0) * independent: testing if there is a difference in mean between 2 groups (H0: mean 1 = mean 2)

Answer 72

Assumptions for paired samples t-test: * sampling units are randomly sampled from the population * paired differences have a normal distribution in the population. Assumptions for an independent samples t-test: * each of the 2 samples is a random sample from its population * the numerical variable (response, dependent) is normally distributed in each population * the SD and variance are the same in both populations

Answer 73

Use the F test

Answer 74

* determines whether 2 variances are equal * H0: variance 1 = variance 2 * H1: Variance 1 does not equal to Variance 2 * F test = larger variance / smaller variance * 2 tailed test, * df1 = n1-1, df2= n2-1 * check critical value * Reject H0 if F> Critical Value
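A minimal sketch of the F ratio and its degrees of freedom on two made-up samples. The critical value still has to come from an F table.

```python
import statistics

# Hypothetical samples, invented for illustration.
sample1 = [1.0, 2.0, 3.0, 4.0, 5.0]
sample2 = [2.0, 4.0, 6.0, 8.0, 10.0]

var1 = statistics.variance(sample1)   # sample variances (n - 1 divisor)
var2 = statistics.variance(sample2)

# F = larger variance / smaller variance, so F is always >= 1.
if var1 >= var2:
    f_stat = var1 / var2
    df_numerator, df_denominator = len(sample1) - 1, len(sample2) - 1
else:
    f_stat = var2 / var1
    df_numerator, df_denominator = len(sample2) - 1, len(sample1) - 1
```

Here the variances are 2.5 and 10, giving F = 4 with 4 and 4 degrees of freedom.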

Answer 75

Levene's test for homogeneity of variances, but requires stats algorithm, more robust than F test

Answer 76

* Use a Welch's approximate T test * Welchs uses same T test where (y1-y2)/SE * Difference: SE equation is different; df equation is different
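A sketch of the two pieces that differ in Welch's test, assuming the standard Welch formulas: SE = sqrt(s1²/n1 + s2²/n2) and the Welch-Satterthwaite approximation for df. The two samples are invented.

```python
import math
import statistics

# Hypothetical samples with unequal variances, invented for illustration.
sample1 = [1.0, 2.0, 3.0, 4.0, 5.0]
sample2 = [2.0, 4.0, 6.0, 8.0, 10.0]

n1, n2 = len(sample1), len(sample2)
v1 = statistics.variance(sample1) / n1   # s1^2 / n1
v2 = statistics.variance(sample2) / n2   # s2^2 / n2

# Welch's SE does not pool the two variances.
standard_error = math.sqrt(v1 + v2)
t_stat = (statistics.mean(sample1) - statistics.mean(sample2)) / standard_error

# Welch-Satterthwaite approximate degrees of freedom (usually not a whole number).
df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
```

The df comes out at about 5.9 here, lower than the n1 + n2 - 2 = 8 an ordinary two-sample t-test would use.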

Answer 77

1. data are randomly sampled 2. samples are independent 3. differences between observed and predicted values are normally distributed 4. mean and variances of errors are independent of explanatory variables 5. one source of unmeasured random variance 6. variances among groups are equal (or can be adjusted using other tests)

Answer 78

1. Basic questions: source of data? biased? is it independent? 2. graph the data: does it deviate from assumption? 3. quantity vs quality: lots of variability? sample size? 4. alternatives to normal distribution: alternative stats approaches or other distributions/ data transformations

Answer 79

1. Ignore violations of assumptions if sample size is large, due to central limit theorem 2. Transform data: use mathematical transformation methods to alter the distribution, e.g. using natural log, arcsin, square root 3. use a non parametric method: methods to calculate probability that do not require the response variable to be normally distributed. Have less stats power than parametric methods 4. use a permutation test (a computer-intensive resampling method, similar in spirit to bootstrapping): use a computer algorithm to repeatedly and randomly rearrange your data to produce a null distribution

Answer 80

1. large sample size - then yes, can be ignored 2. when all the samples are skewed in the same direction (not one sample distribution skewed towards the left and the other skewed towards the right) - then yes, can be ignored 3. can we use a method to adjust for the differences in variance/SD, i.e. welch's t test - then yes, can ignore 4. if none of the above are met, then must transform

Answer 81

* Shapiro Wilk's test: * test goodness of fit your data is to a normal distribution * DOES NOT: tell you whether your data is/is not normally distributed * DOES: determine deviations from normal distribution, tell u whether ur inference from normal distribution is flawed

Answer 82

1. when data are ratios or products of variables (i.e. in odds ratio) 2. frequency distribution skewed to right 3. group with larger mean also has larger SD 4. data span several orders of magnitude * note: if any data value is 0 the log won't work, so you must add 1 to every value and then do the transformation

Answer 83

* sign test for paired samples (makes data binomial) * Mann Whitney U test: comparisons of 2 groups

Answer 84

* calculate the difference for each pair * assign a + or - depending on whether the difference is greater or smaller than 0 * make null hypothesis H0: number above 0 = number below 0 * use binomial test to calculate P value and test for it
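The steps above can be sketched as follows. The ten paired differences are invented, pairs with a difference of exactly 0 are dropped (an assumed convention), and the two-sided P value is obtained by doubling the smaller tail.

```python
import math

# Hypothetical paired differences (after - before), invented for illustration.
differences = [1.2, 0.5, -0.3, 2.1, 0.8, 1.5, 0.9, -0.1, 1.1, 0.4]

# Assign a sign to each difference; zero differences are ignored.
n_plus = sum(1 for d in differences if d > 0)
n_minus = sum(1 for d in differences if d < 0)
n = n_plus + n_minus

def binomial_pmf(k, n_trials, p):
    """Probability of exactly k successes in n_trials independent trials."""
    return math.comb(n_trials, k) * p ** k * (1 - p) ** (n_trials - k)

# Under H0, + and - are equally likely (p = 0.5).
k_extreme = max(n_plus, n_minus)
p_one_tail = sum(binomial_pmf(k, n, 0.5) for k in range(k_extreme, n + 1))
p_two_sided = min(1.0, 2 * p_one_tail)
```

With 8 plus signs out of 10, the two-sided P value is about 0.109, so H0 would not be rejected at the 0.05 level.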

Answer 85

* put both groups of data into 1 column, and rank values from smallest to largest, assigning ranks starting from 1 * for each group: add up the sum of the ranks * calculate test statistics U1 and U2. The larger one is used as the test statistic * use U distribution to find critical value at 0.05 * df = sample size of larger group * if test stats > critical value, reject H0
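A sketch of the ranking and U calculation, assuming the common formula U1 = n1*n2 + n1(n1+1)/2 - R1 (and likewise for U2) and assuming there are no tied values. The two groups are invented.

```python
# Hypothetical groups with no tied values, invented for illustration.
group1 = [1.1, 2.3, 3.0]
group2 = [4.2, 5.5, 6.1, 7.3]

# Pool both groups into one column, keeping track of group membership,
# and sort from smallest to largest.
pooled = sorted([(v, 1) for v in group1] + [(v, 2) for v in group2])

# Sum the ranks (starting from 1) within each group.
rank_sum = {1: 0, 2: 0}
for rank, (value, label) in enumerate(pooled, start=1):
    rank_sum[label] += rank

n1, n2 = len(group1), len(group2)
u1 = n1 * n2 + n1 * (n1 + 1) // 2 - rank_sum[1]
u2 = n1 * n2 + n2 * (n2 + 1) // 2 - rank_sum[2]

u_stat = max(u1, u2)   # the larger U is used as the test statistic
```

A useful check is that U1 + U2 always equals n1 * n2; here U1 = 12 and U2 = 0 because the groups do not overlap at all.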

Answer 86

1. assumes randomly sampled data 2. tests whether the data have different distributions (not a robust test for diff in mean) 3. MWU test can be used to test diff in mean/median only if both groups have the same shape of distribution 4. MWU has low stats power because it throws out magnitude information. CREATES HUGE TYPE II errors

Answer 87

* bootstrapping * permutation test generates a null distribution for the association between 2 variables, by randomly and repeatedly rearranging values of one of the variables in the data

Answer 88

1. create a permuted set of data where the values of the response variable are randomly reordered 2. calculate the measure of association for the permuted sample (the difference between means, medians...etc) 3. repeat the permutation process 1000 times to create a null distribution 4. from the null distribution you have created, identify the location of your actual data, and compare with the critical value to see if it is actually significant.
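The four steps can be sketched as below. The two groups are invented, the measure of association is the difference between means, and the P value is taken as the proportion of permuted differences at least as extreme as the observed one (one common way of locating the actual data in the null distribution).

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical response values for two groups, invented for illustration.
group_a = [12.0, 15.0, 14.0, 16.0, 13.0, 17.0, 15.0, 14.0]
group_b = [9.0, 10.0, 11.0, 8.0, 12.0, 10.0, 9.0, 11.0]

def mean_difference(a, b):
    """Measure of association: difference between the two group means."""
    return statistics.mean(a) - statistics.mean(b)

observed_diff = mean_difference(group_a, group_b)

pooled = group_a + group_b
n_a = len(group_a)
n_permutations = 1000

# Steps 1 to 3: shuffle the responses, recompute the difference, repeat.
null_distribution = []
for _ in range(n_permutations):
    shuffled = pooled[:]
    rng.shuffle(shuffled)
    null_distribution.append(mean_difference(shuffled[:n_a], shuffled[n_a:]))

# Step 4: locate the observed difference within the null distribution.
n_extreme = sum(1 for d in null_distribution if abs(d) >= abs(observed_diff))
p_value = n_extreme / n_permutations
```

Because the two made-up groups barely overlap, very few random rearrangements produce a difference as large as the observed 4.5, so the P value is small.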

Answer 89

check chapt 13 stats sheet

(140 cards)