parameter inference methods Flashcards
Pearson Chi-squared test
aka Chi-square test; non-parametric test in two forms:
- hypothesis test for sample proportions
- hypothesis test for independence between 2 categorical variables
for goodness of fit:
- N samples, k categories; assume the observations falling into the cells are independent
- m_i are the expected frequencies in each category under the null (eg m_i = p_i*N, with p_i the known population proportion)
- null hypothesis is based on sample from multinomial distribution (so, IID multiple Bernoulli)
- test statistic
- X^2 = sum_{i=1..k} (x_i - m_i)^2 / m_i
- where x_i are the observed sample counts
- X^2 asymptotically follows a Chi-square distribution with k-1 degrees of freedom, fewer if parameters are estimated from the data (for intuition, consider binomial and CLT on each term in the sum, ie each x_i as the number of successes of type i -> approximately Gaussian)
for independence:
- form 2-way contingency table w/ say columns as populations and rows as factor levels
- test between sample frequencies and product-of-marginals frequencies using Chi-square
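A minimal sketch of the goodness-of-fit form using scipy.stats.chisquare; the die-roll counts below are invented for illustration:

```python
# hedged sketch: Pearson Chi-squared goodness of fit with scipy
import numpy as np
from scipy.stats import chisquare

# hypothetical data: N=100 rolls of a die we suspect is fair
observed = np.array([12, 18, 16, 20, 14, 20])   # x_i, observed counts
expected = np.full(6, observed.sum() / 6)       # m_i = p_i * N, with p_i = 1/6

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"X^2 = {stat:.3f}, p = {p_value:.3f}")   # small p -> reject fairness
```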
G-test
a goodness of fit test based on the likelihood-ratio statistic, ie a maximum-likelihood significance test
G^2 = 2 sum_i x_i log(x_i / m_i), where
- x_i are the sample cell counts for each category i
- m_i are the null hypothesis cell counts
- note G^2 asymptotically follows a Chi-square distribution; a 2nd-order Taylor expansion of log() recovers Pearson's X^2, which is why the two tests usually agree
alternative to Pearson Chi-squared (with usually similar results)
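A minimal sketch of the G-test via scipy's power_divergence (same invented die counts as above; lambda_="log-likelihood" selects the G statistic):

```python
# hedged sketch: G-test via scipy.stats.power_divergence
import numpy as np
from scipy.stats import power_divergence

observed = np.array([12, 18, 16, 20, 14, 20])
expected = np.full(6, observed.sum() / 6)

# lambda_="log-likelihood" computes G^2 = 2*sum(x_i*log(x_i/m_i))
g_stat, p_value = power_divergence(observed, expected, lambda_="log-likelihood")
print(f"G^2 = {g_stat:.3f}, p = {p_value:.3f}")
```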
goodness of fit tests
- in general, compare actual data to expected or predicted data (as from a model); testing whether a given data sample “likely indicates” the data comes from a given population distribution
- eg, Chi-Square, Kolmogorov-Smirnov, and Shapiro-Wilk
- goodness of fit is most commonly used in the context of categorical data and frequencies of occurrence
contingency table
- a tabular representation of categorical data, usually showing frequencies in some categorical phase space
- the dimension of the phase space is the value of the “way” (1-way, 2-way, etc.)
- usually the tables are 2-way (a multinomial proportion would be a 1-way table)
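A quick sketch of building a 2-way contingency table with pandas.crosstab; the store/rating data are made up for illustration:

```python
# hedged sketch: 2-way contingency table from raw categorical data
import pandas as pd

df = pd.DataFrame({
    "store":  ["A", "A", "B", "B", "B", "A", "B", "A"],
    "rating": ["hi", "lo", "hi", "hi", "lo", "hi", "lo", "lo"],
})
table = pd.crosstab(df["store"], df["rating"])  # rows x columns of frequencies
print(table)
```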
inferences on correlations
Pearson–testing H0: rho = 0 is equivalent to the slope t-test in 1-dim linear regression: t = r*sqrt(n-2)/sqrt(1-r^2) with n-2 degrees of freedom; assumes normal distributions of X and Y
Spearman–apply the Fisher transformation to Spearman’s rank correlation to get a point estimate that is approximately normal; does not make normality assumptions on X and Y
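A minimal sketch of both correlation tests with scipy; x and y are synthetic data generated for illustration:

```python
# hedged sketch: Pearson vs. Spearman correlation inference
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(scale=0.8, size=50)   # correlated by construction

r, p_r = pearsonr(x, y)        # assumes normality of X and Y
rho, p_rho = spearmanr(x, y)   # rank-based, no normality assumption
print(f"Pearson r={r:.2f} (p={p_r:.3f}); Spearman rho={rho:.2f} (p={p_rho:.3f})")
```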
inference on a population mean
- for small samples, assuming the population is approximately normal, use Student’s t, based on unbiased sample variance
- for large enough samples, can rely on the CLT (regardless of population distribution), and use z-scores
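A minimal sketch of the small-sample case with scipy's one-sample t-test; the sample values and null mean are invented:

```python
# hedged sketch: one-sample t-test for a population mean
import numpy as np
from scipy.stats import ttest_1samp

sample = np.array([5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.4])
t_stat, p_value = ttest_1samp(sample, popmean=5.0)  # H0: mu = 5.0
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```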
comparing two population means
- comparing means of populations A and B
- paired
- each pair (a1,b1),…,(an,bn) holds everything the same between the two populations except the experimental treatment
- the statistical model allows removing all but inherent noise in paired-differences, a1-b1, a2-b2,…,an-bn
- an ordinary one-sample t-test can then be applied on this set of differences
- unpaired
standard error estimate follows from the sample variances:
* if assuming var(A) = var(B), combine the two sample variances into a single pooled variance
* if assuming var(A) is not equal to var(B), keep the sample variances separate; this is Welch's test
if sample sizes are smallish, use the above with a two-sample t-test, ie on the difference between the sample means, mu(A)-mu(B); otherwise use a two-sample z-test
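A minimal sketch of the paired and unpaired variants with scipy; the measurements are invented, and equal_var toggles pooled vs. Welch:

```python
# hedged sketch: paired vs. unpaired two-sample t-tests
import numpy as np
from scipy.stats import ttest_rel, ttest_ind

a = np.array([10.1, 9.8, 10.4, 10.0, 9.9, 10.3])
b = np.array([10.5, 10.0, 10.6, 10.2, 10.1, 10.7])

t_paired, p_paired = ttest_rel(a, b)                  # one-sample t on a-b
t_welch, p_welch = ttest_ind(a, b, equal_var=False)   # Welch: var(A) != var(B)
t_pooled, p_pooled = ttest_ind(a, b, equal_var=True)  # pooled variance
print(p_paired, p_welch, p_pooled)
```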
population proportions, binary case
- ie the population(s) in question are categorical, with two levels
- single population
- essentially, use Gaussian approximation to binomial distribution, and z-scores
- (so) the standard error of the sample proportion is sqrt(p(1-p)/n) (equivalently, the count has standard deviation sqrt(np(1-p))), with p estimated from the sample proportion (i.e. we can “get away with” approximating the population parameter p with the sample proportion)
- two populations, of sizes m and n
- hypothesis test, p_A=p_B (ie the binary “probabilities” between the populations are equal)
- this results in
- simplified standard error estimate, sqrt(p(1-p)(1/n+1/m))
- a pooled estimator for the common population parameter p, p=(x+y)/(n+m), with x and y the “success” counts and n and m the sample sizes
- confidence interval (ie what is the distribution of point estimate p_A-p_B)
- take the usual variance for a proportion, p(1-p)/n, for each group
- combine the variances via the usual linear combo of r.v.s, then take the square root for the s.e.
- use z-scores
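A minimal sketch of the two-proportion z-test written out directly (the success counts are invented; the pooled p and simplified s.e. follow the formulas above):

```python
# hedged sketch: two-proportion z-test, H0: p_A = p_B
import numpy as np
from scipy.stats import norm

x, n = 45, 100   # successes / sample size in population A
y, m = 30, 100   # successes / sample size in population B

p_pool = (x + y) / (n + m)                          # pooled estimator of p
se = np.sqrt(p_pool * (1 - p_pool) * (1/n + 1/m))   # simplified s.e. under H0
z = (x/n - y/m) / se
p_value = 2 * norm.sf(abs(z))                       # two-sided
print(f"z = {z:.3f}, p = {p_value:.3f}")
```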
population proportions, multinomial case
- ie the population(s) in question are categorical, with more than two levels
- single population
- essentially a goodness of fit test for 1-way contingency table
- Pearson Chi-squared
- likelihood ratio / G-test
- two populations
- amounts to test for independence
- essentially a 2-way contingency table, each population is one axis, with the independence assumption being each cell is the product of marginals
- use eg Chi-squared
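A minimal sketch of the independence test on a 2-way table with scipy; the counts are invented (columns = the two populations, rows = factor levels):

```python
# hedged sketch: Chi-squared test of independence on a 2-way table
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[30, 45],
                  [50, 35],
                  [20, 20]])
stat, p_value, dof, expected = chi2_contingency(table)
print(f"X^2 = {stat:.3f}, dof = {dof}, p = {p_value:.3f}")
# `expected` holds the product-of-marginals frequencies
```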
population proportions, comparing more than two populations
- ie testing whether the proportions for k>2 populations are the same (mechanically, a test for independence)
- eg each population has the same factor type, with 2 or more levels
- use a goodness of fit test on the resulting 2-way contingency table (k populations by factor levels)
- the test for independence amounts to each cell being “close” to its product of marginals
- Chi-squared, or G-test
Marascuillo procedure
- in the context of comparing multiple population proportions, allows checking for “problematic” cells in the contingency table
- can simultaneously test the differences of all pairs of proportions when there are several populations under investigation
- note, naive all-pairs testing without such an adjustment suffers from the multiple comparisons problem
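A minimal sketch of the procedure, assuming the NIST-style critical range sqrt(chi2_{1-alpha, k-1}) * sqrt(p_i(1-p_i)/n_i + p_j(1-p_j)/n_j); all counts are invented:

```python
# hedged sketch: Marascuillo procedure for pairwise proportion differences
import itertools
import numpy as np
from scipy.stats import chi2

successes = np.array([45, 60, 38])   # per-population success counts
sizes = np.array([100, 110, 95])     # per-population sample sizes
p = successes / sizes
k = len(p)

crit = np.sqrt(chi2.ppf(0.95, df=k - 1))  # simultaneous 95% critical value
for i, j in itertools.combinations(range(k), 2):
    rng_ij = crit * np.sqrt(p[i]*(1-p[i])/sizes[i] + p[j]*(1-p[j])/sizes[j])
    flag = "significant" if abs(p[i] - p[j]) > rng_ij else "ns"
    print(f"|p{i}-p{j}| = {abs(p[i]-p[j]):.3f} vs range {rng_ij:.3f}: {flag}")
```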
population proportions, fitting to a given distribution
eg we suspect a population follows a Poisson distribution
- so we use sample binning and create a frequency histogram
- we then apply a goodness of fit test on the histogram (Chi-squared)
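A minimal sketch for the Poisson example; the binned frequencies are invented, lambda is estimated from the sample (treating the lumped 4+ bin as 4, a rough approximation), and ddof=1 accounts for that estimated parameter:

```python
# hedged sketch: Chi-squared goodness of fit to a Poisson model
import numpy as np
from scipy.stats import poisson, chisquare

counts = np.array([35, 40, 15, 7, 3])    # observed frequencies of 0,1,2,3,4+ events
n = counts.sum()
lam = (np.arange(5) * counts).sum() / n  # estimate lambda (4+ treated as 4, rough)

probs = poisson.pmf(np.arange(4), lam)
probs = np.append(probs, 1 - probs.sum())             # lump the 4+ tail so probs sum to 1
stat, p_value = chisquare(counts, n * probs, ddof=1)  # ddof=1: lambda was estimated
print(f"X^2 = {stat:.3f}, p = {p_value:.3f}")
```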
ANOVA
- comparing three or more population means (can only test H_0 = all the means are equal, against H_A = at least one mean differs; it does not identify which)
- evidence that the population means differ is the variance between the samples / populations (SSTr) being large compared to the variance within the samples (SSE) (ie within a given factor level), where
- x_ij is the jth observation from the ith population, with n_i observations in population i
- mu_t = grand mean over all observations = avg(x_{*,*})
- SSTr = sum_i n_i (avg(x_{i,*}) - mu_t)^2
- SSE = sum_i sum_j (x_{i,j} - avg(x_{i,*}))^2
- assumes equal-variance normal distribution within all the factor levels
- basic test intuition
- large SSE means intra-factor-level variability is high
- large SSTr means inter-factor-level variability is high
- MSTr / MSE follows an F-distribution if the factor means are all equal, where MSTr = SSTr/(k-1) and MSE = SSE/(N-k), with k the number of levels and N the total number of observations
- so eg small SSE and large SSTr makes it likely the means are different
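A minimal sketch of one-way ANOVA with scipy; the per-level samples are invented:

```python
# hedged sketch: one-way ANOVA (F = MSTr/MSE against the F-distribution)
import numpy as np
from scipy.stats import f_oneway

level_a = np.array([4.1, 4.3, 3.9, 4.2, 4.0])
level_b = np.array([4.8, 5.0, 4.7, 4.9, 5.1])
level_c = np.array([4.2, 4.4, 4.1, 4.3, 4.2])

f_stat, p_value = f_oneway(level_a, level_b, level_c)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")  # small p -> means likely differ
```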
MANOVA
- an extension of ANOVA that can handle more than 1 dependent variable
- we have 3 or more populations, and two or more dependent variables in the populations
- eg 3 grocery store chains, and fat content and sugar content as dependent variables
- we want to compare the means between populations, under each of the dependent variables
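A hedged sketch of the grocery-chain example using statsmodels' MANOVA; the fat/sugar numbers are made up, and the formula puts the dependent variables on the left:

```python
# hedged sketch: MANOVA via statsmodels (invented grocery-chain data)
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

df = pd.DataFrame({
    "chain": ["A"]*5 + ["B"]*5 + ["C"]*5,
    "fat":   [3.1, 3.3, 2.9, 3.2, 3.0, 3.8, 4.0, 3.7, 3.9, 4.1, 3.2, 3.4, 3.1, 3.3, 3.2],
    "sugar": [5.1, 5.3, 4.9, 5.2, 5.0, 5.8, 6.0, 5.7, 5.9, 6.1, 5.2, 5.4, 5.1, 5.3, 5.2],
})
fit = MANOVA.from_formula("fat + sugar ~ chain", data=df)
print(fit.mv_test())  # Wilks' lambda, Pillai's trace, etc.
```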
tests against a suspected distribution
we have samples, and want a test for if those samples “fit” a suspected distribution:
- histogram binning, with Chi-squared
- Kolmogorov Smirnov test
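A minimal sketch of the Kolmogorov-Smirnov approach with scipy; the samples are synthetic, and the suspected distribution (normal with given loc/scale) is passed via args:

```python
# hedged sketch: Kolmogorov-Smirnov test against a suspected distribution
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(1)
samples = rng.normal(loc=2.0, scale=1.5, size=200)

# compare the empirical CDF against N(2.0, 1.5); small p -> poor fit
stat, p_value = kstest(samples, "norm", args=(2.0, 1.5))
print(f"D = {stat:.3f}, p = {p_value:.3f}")
```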