parameter inference methods Flashcards
Pearson Chi-squared test
- hypothesis test for sample proportions
- N samples, k categories
- m_i are the expected frequencies from the population in each category (eg m_i = p_i*N, p_i the known population proportion)
- test statistic
- X = sum_{i=1..k} (x_i - m_i)^2 / m_i
- where x_i are the test counts
- X approximately follows a Chi-squared distribution with k-1 degrees of freedom (consider the binomial distribution and CLT on each term in the sum)
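The test above can be sketched with scipy; the counts and population proportions here are invented for illustration:

```python
from scipy import stats

observed = [48, 35, 17]            # x_i: observed counts in k=3 categories
p = [0.5, 0.3, 0.2]                # p_i: known population proportions
N = sum(observed)
expected = [N * pi for pi in p]    # m_i = p_i * N

# X = sum (x_i - m_i)^2 / m_i, approximately Chi-squared with k-1 dof
X = sum((x - m) ** 2 / m for x, m in zip(observed, expected))
stat, pval = stats.chisquare(observed, f_exp=expected)
```

The manual sum and scipy's `chisquare` statistic agree; scipy also returns the p-value against the Chi-squared(k-1) distribution.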
G-test
a goodness of fit test
alternative to Chi-squared (with usually similar results); test statistic G = 2 * sum_{i=1..k} x_i * ln(x_i/m_i)
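A minimal G-test sketch via scipy's `power_divergence` with the log-likelihood statistic; the counts are invented:

```python
from scipy import stats

observed = [48, 35, 17]
expected = [50.0, 30.0, 20.0]

# lambda_="log-likelihood" selects the G statistic, 2 * sum x_i * ln(x_i/m_i)
g_stat, g_pval = stats.power_divergence(observed, f_exp=expected,
                                        lambda_="log-likelihood")
```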
goodness of fit tests
- in general, compare actual data to expected or predicted data (as from a model)
- eg, Chi-Square, Kolmogorov-Smirnov, and Shapiro-Wilk
- goodness of fit tests are commonly used in the context of categorical data and frequencies of occurrence
contingency table
- a tabular representation of categorical data, usually showing frequencies in some categorical phase space
- the dimension of the phase space is the value of the “way” (1-way, 2-way, etc.)
- usually the tables are 2-way (a multinomial proportion would be a 1-way table)
inferences on correlations
Pearson–test whether two variables have a non-trivial linear correlation (closely related to 1-dim linear regression); assumes normal distributions of X and Y
Spearman–transform Spearman’s rank correlation to a value whose point estimate is approximately normal; does not make normality assumptions on X and Y
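Both tests are available in scipy; a minimal sketch on synthetic data (the slope and noise scale are invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(scale=0.5, size=100)   # correlated by construction

r, p_pearson = stats.pearsonr(x, y)       # H0: rho = 0, assumes normality
rho, p_spearman = stats.spearmanr(x, y)   # rank-based, no normality assumption
```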
inference on a population mean
- for small samples, assuming the population is approximately normal, use Student’s t, based on unbiased sample variance
- for large enough samples, can rely on the CLT (regardless of population distribution), and use z-scores
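A small-sample sketch using Student's t; the sample values and hypothesized mean of 5.0 are invented:

```python
from scipy import stats

# small sample, population assumed approximately normal
sample = [5.1, 4.9, 5.3, 5.0, 4.8, 5.2, 5.1, 4.7]

# one-sample t-test against the hypothesized population mean
t_stat, p_val = stats.ttest_1samp(sample, popmean=5.0)
```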
comparing two population means
- comparing means of populations A and B
- paired
- everything is regarded as the same between the populations, other than the experimental variations–pairs (a1,b1),…,(an,bn)
- the statistical model allows removing all but inherent noise in paired-differences, a1-b1, a2-b2,…,an-bn
- an ordinary one-sample t-test can then be applied on this set of differences
- unpaired
- standard error estimate follows from the sample variances:
- SE = s_p * sqrt(1/n + 1/m), if assuming var(A)=var(B)
- s_p^2 = ((n-1)*s_A^2 + (m-1)*s_B^2) / (n+m-2), aka pooled variance
- SE = sqrt(s_A^2/n + s_B^2/m), if assuming var(A)!=var(B)
- this is a Welch test
- if sample sizes are smallish, use the above with a two-sample t-test, ie on the difference between the sample means, mu(A)-mu(B); otherwise use a two-sample z-test
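Both cases above can be sketched with scipy (all data invented): the paired test reduces to a one-sample t-test on the differences, and the pooled-vs-Welch choice maps to `ttest_ind`'s `equal_var` flag:

```python
import math
from scipy import stats

# paired: a one-sample t-test on the differences a_i - b_i
a = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]
b = [11.9, 11.7, 12.2, 11.8, 11.9, 12.0]
diffs = [ai - bi for ai, bi in zip(a, b)]
t_rel, p_rel = stats.ttest_rel(a, b)
t_one, p_one = stats.ttest_1samp(diffs, popmean=0.0)   # same result

# unpaired: pooled vs Welch standard error estimates
A = [20.1, 19.8, 20.5, 20.0, 19.9]
B = [21.0, 20.7, 21.3, 20.9]
n, m = len(A), len(B)
va, vb = stats.tvar(A), stats.tvar(B)   # unbiased sample variances

# pooled variance, if assuming var(A) = var(B)
sp2 = ((n - 1) * va + (m - 1) * vb) / (n + m - 2)
se_pooled = math.sqrt(sp2 * (1 / n + 1 / m))
t_pooled = stats.ttest_ind(A, B, equal_var=True)

# Welch, if assuming var(A) != var(B)
se_welch = math.sqrt(va / n + vb / m)
t_welch = stats.ttest_ind(A, B, equal_var=False)
```

The t statistics returned by `ttest_ind` are the mean difference divided by the respective standard error estimate above.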
population proportions, binary case
- ie the population(s) in question are categorical, with two levels
- single population
- essentially, use Gaussian approximation to binomial distribution
- standard error of the sample count is sqrt(np(1-p)) (equivalently sqrt(p(1-p)/n) for the sample proportion), with p estimated from the sample proportion
- two populations, of sizes m and n
- hypothesis test, p_A=p_B (ie the binary “probabilities” between the populations are equal)
- this results in the simplified standard error estimate sqrt(p(1-p)*(1/n+1/m)), but with a special pooled estimator for the population parameter p (total successes over n+m)
- confidence interval (ie what is the distribution of point estimate p_A-p_B)
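A hand-rolled sketch of both binary-proportion tests via the normal approximation; all counts and the hypothesized p0 are invented:

```python
import math
from scipy import stats

# single population: normal approximation to the binomial
n1, successes = 200, 88
p0 = 0.5                               # hypothesized population proportion
p_hat = successes / n1
se1 = math.sqrt(p0 * (1 - p0) / n1)    # SE of the sample proportion under H0
z1 = (p_hat - p0) / se1
pval1 = 2 * stats.norm.sf(abs(z1))     # two-sided

# two populations of sizes n and m, H0: p_A = p_B
n, x_a = 150, 60
m, x_b = 180, 90
p_a, p_b = x_a / n, x_b / m
p_pool = (x_a + x_b) / (n + m)         # pooled estimator of the common p
se2 = math.sqrt(p_pool * (1 - p_pool) * (1 / n + 1 / m))
z2 = (p_a - p_b) / se2
pval2 = 2 * stats.norm.sf(abs(z2))
```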
population proportions, multinomial case
- ie the population(s) in question are categorical, with more than two levels
- single population
- essentially a goodness of fit test for 1-way contingency table
- Pearson Chi-squared
- likelihood ratio / G-test
- two populations
- amounts to test for independence
- essentially a 2-way contingency table, each population is one axis, so use eg Chi-squared
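A minimal independence-test sketch on an invented 2-way table, using scipy's `chi2_contingency`:

```python
from scipy import stats

# rows = populations, columns = category levels (invented counts)
table = [[30, 50, 20],
         [40, 45, 15]]

# H0: the row variable (population) is independent of the column variable
chi2, p_val, dof, expected = stats.chi2_contingency(table)
```

`expected` holds the products-of-marginals cell estimates; `dof` is (rows-1)*(cols-1).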
population proportions, comparing more than two populations
- ie testing whether the proportions are the same across k>2 populations
- eg each population has the same factor type, with 2 or more levels
- use goodness of fit test on the resulting 2-way contingency table (one row per population)
- the test for independence amounts to each cell being “close” to its product of marginals
- Chi-squared, or G-test
Marascuillo procedure
- in the context of comparing multiple population proportions, allows checking for “problematic” cells in the contingency table
- can simultaneously test the differences of all pairs of proportions when there are several populations under investigation
- note, this may suffer from the multiple comparisons problem
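A hand-rolled sketch of the procedure; the counts, sample sizes, and alpha are invented. Each pairwise difference of proportions is compared against its critical range:

```python
import math
from itertools import combinations
from scipy import stats

counts = [55, 60, 80]        # successes per population (invented)
sizes = [100, 100, 100]      # sample sizes
k = len(counts)
props = [c / n for c, n in zip(counts, sizes)]

alpha = 0.05
chi2_crit = stats.chi2.ppf(1 - alpha, df=k - 1)

results = []
for i, j in combinations(range(k), 2):
    diff = abs(props[i] - props[j])
    # critical range for the (i, j) pair
    r = math.sqrt(chi2_crit) * math.sqrt(
        props[i] * (1 - props[i]) / sizes[i]
        + props[j] * (1 - props[j]) / sizes[j])
    results.append((i, j, diff, diff > r))   # True = significant difference
```

With these numbers only the pairs involving the third population (proportion 0.80) exceed their critical ranges.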
population proportions, fitting to a given distribution
eg we suspect a population follows a Poisson distribution
- so we use sample binning and create a frequency histogram
- we then apply a goodness of fit test on the histogram (Chi-squared)
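A sketch of this recipe on invented count data. One degree of freedom is spent estimating the Poisson mean, hence `ddof=1`; the tail probability is folded into the last bin so observed and expected totals match:

```python
from collections import Counter
from scipy import stats

# invented counts of events per interval
samples = [0, 1, 1, 2, 0, 3, 1, 2, 2, 1, 0, 1, 2, 1, 0, 2, 1, 3, 1, 2]
n = len(samples)
lam = sum(samples) / n                     # estimated Poisson mean

# frequency histogram over the observed values 0..max
counts = Counter(samples)
kmax = max(samples)
observed = [counts.get(k, 0) for k in range(kmax + 1)]
expected = [n * stats.poisson.pmf(k, lam) for k in range(kmax + 1)]
expected[-1] += n - sum(expected)          # fold tail P(X > kmax) into last bin

# ddof=1 because one parameter (lam) was estimated from the data
chi2, p_val = stats.chisquare(observed, f_exp=expected, ddof=1)
```

In practice bins with small expected counts would be merged first; that step is omitted here for brevity.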
ANOVA
- comparing three or more population means (the null hypothesis H_0 is that all the means are equal; rejecting it only says at least one mean differs)
- if the population means are different, the variability within the samples (SSE) (ie within a given factor level) should be small compared to the variability between the samples / populations (SSTr)
- assumes equal-variance normal distributions within all factor levels
- basic test intuition
- large SSE makes intra-factor-level variability high
- large SSTr makes inter-factor-level variability high
- MSTr / MSE follows an F-distribution, if the factor means are all equal
- so eg small SSE and large SSTr makes it likely the means are different
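A minimal one-way ANOVA sketch on three invented samples, using scipy's `f_oneway` (which computes MSTr/MSE and the F-test p-value):

```python
from scipy import stats

# three factor levels; the third group's mean is shifted by construction
g1 = [4.1, 4.3, 4.0, 4.2]
g2 = [4.2, 4.4, 4.1, 4.3]
g3 = [5.0, 5.2, 5.1, 4.9]

# H0: all three population means are equal
f_stat, p_val = stats.f_oneway(g1, g2, g3)
```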
MANOVA
- an extension of ANOVA that can handle more than 1 dependent variable
- we have 3 or more populations, and two or more dependent variables in the populations
- eg 3 grocery store chains, and fat content and sugar content as dependent variables
- we want to compare the means between populations, under each of the dependent variables
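A sketch of the grocery-chain example, assuming statsmodels is available; the data frame, column names, and effect sizes are all invented:

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(2)
chain = np.repeat(["A", "B", "C"], 20)          # 3 grocery store chains
df = pd.DataFrame({
    "chain": chain,
    # two dependent variables; chain C is shifted so the test has signal
    "fat": rng.normal(10, 2, size=60) + (chain == "C") * 3.0,
    "sugar": rng.normal(5, 1, size=60) + (chain == "C") * 2.0,
})

# multiple dependent variables on the left of the formula
fit = MANOVA.from_formula("fat + sugar ~ chain", data=df)
result = fit.mv_test()
```

`mv_test()` reports the standard multivariate statistics (Wilks' lambda, Pillai's trace, etc.) per effect.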
tests against a suspected distribution
we have samples, and want to test whether those samples “fit” a suspected distribution:
* histogram binning, with Chi-squared
* Kolmogorov-Smirnov test
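A minimal Kolmogorov-Smirnov sketch against a fully specified N(0,1), on synthetic samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
samples = rng.normal(loc=0.0, scale=1.0, size=200)

# H0: samples are drawn from the standard normal distribution
ks_stat, p_val = stats.kstest(samples, "norm")
```

Note: if distribution parameters are estimated from the same samples, the plain KS p-value is no longer valid; variants such as the Lilliefors correction apply in that case.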