parameter inference methods Flashcards
Pearson Chi-squared test
aka Chi-square test; non-parametric test in two forms:
- hypothesis test for sample proportions
- hypothesis test for independence between 2 categorical variables
for goodness of fit:
- N samples, k categories; assume the observations falling into the cells are independent
- m_i are the expected frequencies in each category under the null (eg m_i = p_i*N, with p_i the known population proportion)
- null hypothesis is based on sample from multinomial distribution (so, IID multiple Bernoulli)
- test statistic
- X^2 = sum_{i=1..k} (x_i - m_i)^2 / m_i
- where x_i are the observed sample counts
- X^2 asymptotically follows a Chi-square distribution with k-1 degrees of freedom, fewer if parameters are estimated from the data (for intuition, consider binomial and CLT on each term in the sum, ie each x_i as the number of successes of type i -> approximately Gaussian)
for independence:
- form 2-way contingency table w/ say columns as populations and rows as factor levels
- test between sample frequencies and product-of-marginals frequencies using Chi-square
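A minimal sketch of the goodness-of-fit form using scipy.stats.chisquare; the die-roll counts below are invented for illustration:

```python
# hedged sketch: Pearson Chi-squared goodness of fit with scipy
import numpy as np
from scipy.stats import chisquare

# hypothetical data: N=100 rolls of a die we suspect is fair
observed = np.array([12, 18, 16, 20, 14, 20])   # x_i, observed counts
expected = np.full(6, observed.sum() / 6)       # m_i = p_i * N, with p_i = 1/6

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"X^2 = {stat:.3f}, p = {p_value:.3f}")   # small p -> reject fairness
```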
G-test
a goodness of fit test based on the likelihood-ratio statistic, ie a maximum-likelihood significance test
G^2 = 2 sum_i x_i log(x_i / m_i), where
- x_i are the sample cell counts for each category i
- m_i are the null hypothesis cell counts
- note G^2 asymptotically follows a Chi-square distribution; a 2nd-order Taylor expansion of log() recovers Pearson's X^2, which is why the two tests usually agree
alternative to Pearson Chi-squared (with usually similar results)
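A minimal sketch of the G-test via scipy's power_divergence (same invented die counts as above; lambda_="log-likelihood" selects the G statistic):

```python
# hedged sketch: G-test via scipy.stats.power_divergence
import numpy as np
from scipy.stats import power_divergence

observed = np.array([12, 18, 16, 20, 14, 20])
expected = np.full(6, observed.sum() / 6)

# lambda_="log-likelihood" computes G^2 = 2*sum(x_i*log(x_i/m_i))
g_stat, p_value = power_divergence(observed, expected, lambda_="log-likelihood")
print(f"G^2 = {g_stat:.3f}, p = {p_value:.3f}")
```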
goodness of fit tests
- in general, compare actual data to expected or predicted data (as from a model); testing whether a given data sample “likely indicates” the data comes from a given population distribution
- eg, Chi-Square, Kolmogorov-Smirnov, and Shapiro-Wilk
- goodness of fit is most commonly used in the context of categorical data and frequencies of occurrence
contingency table
- a tabular representation of categorical data, usually showing frequencies in some categorical phase space
- the dimension of the phase space is the value of the “way” (1-way, 2-way, etc.)
- usually the tables are 2-way (a multinomial proportion would be a 1-way table)
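A quick sketch of building a 2-way contingency table with pandas.crosstab; the store/rating data are made up for illustration:

```python
# hedged sketch: 2-way contingency table from raw categorical data
import pandas as pd

df = pd.DataFrame({
    "store":  ["A", "A", "B", "B", "B", "A", "B", "A"],
    "rating": ["hi", "lo", "hi", "hi", "lo", "hi", "lo", "lo"],
})
table = pd.crosstab(df["store"], df["rating"])  # rows x columns of frequencies
print(table)
```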
inferences on correlations
Pearson–testing H0: rho = 0 is equivalent to the slope t-test in 1-dim linear regression: t = r*sqrt(n-2)/sqrt(1-r^2) with n-2 degrees of freedom; assumes normal distributions of X and Y
Spearman–apply the Fisher transformation to Spearman’s rank correlation to get a point estimate that is approximately normal; does not make normality assumptions on X and Y
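A minimal sketch of both correlation tests with scipy; x and y are synthetic data generated for illustration:

```python
# hedged sketch: Pearson vs. Spearman correlation inference
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(scale=0.8, size=50)   # correlated by construction

r, p_r = pearsonr(x, y)        # assumes normality of X and Y
rho, p_rho = spearmanr(x, y)   # rank-based, no normality assumption
print(f"Pearson r={r:.2f} (p={p_r:.3f}); Spearman rho={rho:.2f} (p={p_rho:.3f})")
```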
inference on a population mean
- for small samples, assuming the population is approximately normal, use Student’s t, based on unbiased sample variance
- for large enough samples, can rely on the CLT (regardless of population distribution), and use z-scores
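A minimal sketch of the small-sample case with scipy's one-sample t-test; the sample values and null mean are invented:

```python
# hedged sketch: one-sample t-test for a population mean
import numpy as np
from scipy.stats import ttest_1samp

sample = np.array([5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.4])
t_stat, p_value = ttest_1samp(sample, popmean=5.0)  # H0: mu = 5.0
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```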
comparing two population means
- comparing means of populations A and B
- paired
- each pair (a1,b1),…,(an,bn) holds everything the same between the two populations except the experimental treatment
- the statistical model allows removing all but inherent noise in paired-differences, a1-b1, a2-b2,…,an-bn
- an ordinary one-sample t-test can then be applied on this set of differences
- unpaired
standard error estimate follows from the sample variances:
* if assuming var(A) = var(B), combine the two sample variances into a single pooled variance
* if assuming var(A) is not equal to var(B), keep the sample variances separate; this is Welch's test
if sample sizes are smallish, use the above with a two-sample t-test, ie on the difference between the sample means, mu(A)-mu(B); otherwise use a two-sample z-test
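A minimal sketch of the paired and unpaired variants with scipy; the measurements are invented, and equal_var toggles pooled vs. Welch:

```python
# hedged sketch: paired vs. unpaired two-sample t-tests
import numpy as np
from scipy.stats import ttest_rel, ttest_ind

a = np.array([10.1, 9.8, 10.4, 10.0, 9.9, 10.3])
b = np.array([10.5, 10.0, 10.6, 10.2, 10.1, 10.7])

t_paired, p_paired = ttest_rel(a, b)                  # one-sample t on a-b
t_welch, p_welch = ttest_ind(a, b, equal_var=False)   # Welch: var(A) != var(B)
t_pooled, p_pooled = ttest_ind(a, b, equal_var=True)  # pooled variance
print(p_paired, p_welch, p_pooled)
```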
population proportions, binary case
- ie the population(s) in question are categorical, with two levels
- single population
- essentially, use Gaussian approximation to binomial distribution, and z-scores
- (so) the standard error of the sample proportion is sqrt(p(1-p)/n) (equivalently, the count has standard deviation sqrt(np(1-p))), with p estimated from the sample proportion (i.e. we can “get away with” approximating the population parameter p with the sample proportion)
- two populations, of sizes m and n
- hypothesis test, p_A=p_B (ie the binary “probabilities” between the populations are equal)
- this results in
- simplified standard error estimate, sqrt(p(1-p)(1/n+1/m))
- a pooled estimator for the common population parameter p, p=(x+y)/(n+m), with x and y the “success” counts and n and m the sample sizes
- confidence interval (ie what is the distribution of point estimate p_A-p_B)
- take the usual variance for a proportion, p(1-p)/n, for each group
- combine the variances via the usual linear combo of r.v.s, then take the square root for the s.e.
- use z-scores
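A minimal sketch of the two-proportion z-test written out directly (the success counts are invented; the pooled p and simplified s.e. follow the formulas above):

```python
# hedged sketch: two-proportion z-test, H0: p_A = p_B
import numpy as np
from scipy.stats import norm

x, n = 45, 100   # successes / sample size in population A
y, m = 30, 100   # successes / sample size in population B

p_pool = (x + y) / (n + m)                          # pooled estimator of p
se = np.sqrt(p_pool * (1 - p_pool) * (1/n + 1/m))   # simplified s.e. under H0
z = (x/n - y/m) / se
p_value = 2 * norm.sf(abs(z))                       # two-sided
print(f"z = {z:.3f}, p = {p_value:.3f}")
```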
population proportions, multinomial case
- ie the population(s) in question are categorical, with more than two levels
- single population
- essentially a goodness of fit test for 1-way contingency table
- Pearson Chi-squared
- likelihood ratio / G-test
- two populations
- amounts to test for independence
- essentially a 2-way contingency table, each population is one axis, with the independence assumption being each cell is the product of marginals
- use eg Chi-squared
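A minimal sketch of the independence test on a 2-way table with scipy; the counts are invented (columns = the two populations, rows = factor levels):

```python
# hedged sketch: Chi-squared test of independence on a 2-way table
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[30, 45],
                  [50, 35],
                  [20, 20]])
stat, p_value, dof, expected = chi2_contingency(table)
print(f"X^2 = {stat:.3f}, dof = {dof}, p = {p_value:.3f}")
# `expected` holds the product-of-marginals frequencies
```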
population proportions, comparing more than two populations
- ie testing whether the proportions for k>2 populations are the same (mechanically, a test for independence)
- eg each population has the same factor type, with 2 or more levels
- use a goodness of fit test on the resulting 2-way contingency table (k populations by factor levels)
- the test for independence amounts to each cell being “close” to its product of marginals
- Chi-squared, or G-test
Marascuillo procedure
- in the context of comparing multiple population proportions, allows checking for “problematic” cells in the contingency table
- can simultaneously test the differences of all pairs of proportions when there are several populations under investigation
- note, naive all-pairs testing without such an adjustment suffers from the multiple comparisons problem
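A minimal sketch of the procedure, assuming the NIST-style critical range sqrt(chi2_{1-alpha, k-1}) * sqrt(p_i(1-p_i)/n_i + p_j(1-p_j)/n_j); all counts are invented:

```python
# hedged sketch: Marascuillo procedure for pairwise proportion differences
import itertools
import numpy as np
from scipy.stats import chi2

successes = np.array([45, 60, 38])   # per-population success counts
sizes = np.array([100, 110, 95])     # per-population sample sizes
p = successes / sizes
k = len(p)

crit = np.sqrt(chi2.ppf(0.95, df=k - 1))  # simultaneous 95% critical value
for i, j in itertools.combinations(range(k), 2):
    rng_ij = crit * np.sqrt(p[i]*(1-p[i])/sizes[i] + p[j]*(1-p[j])/sizes[j])
    flag = "significant" if abs(p[i] - p[j]) > rng_ij else "ns"
    print(f"|p{i}-p{j}| = {abs(p[i]-p[j]):.3f} vs range {rng_ij:.3f}: {flag}")
```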
population proportions, fitting to a given distribution
eg we suspect a population follows a Poisson distribution
- so we use sample binning and create a frequency histogram
- we then apply a goodness of fit test on the histogram (Chi-squared)
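A minimal sketch for the Poisson example; the binned frequencies are invented, lambda is estimated from the sample (treating the lumped 4+ bin as 4, a rough approximation), and ddof=1 accounts for that estimated parameter:

```python
# hedged sketch: Chi-squared goodness of fit to a Poisson model
import numpy as np
from scipy.stats import poisson, chisquare

counts = np.array([35, 40, 15, 7, 3])    # observed frequencies of 0,1,2,3,4+ events
n = counts.sum()
lam = (np.arange(5) * counts).sum() / n  # estimate lambda (4+ treated as 4, rough)

probs = poisson.pmf(np.arange(4), lam)
probs = np.append(probs, 1 - probs.sum())             # lump the 4+ tail so probs sum to 1
stat, p_value = chisquare(counts, n * probs, ddof=1)  # ddof=1: lambda was estimated
print(f"X^2 = {stat:.3f}, p = {p_value:.3f}")
```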
ANOVA
- comparing three or more population means (can only test H_0 = all the means are equal, against H_A = at least one mean differs; it does not identify which)
- evidence that the population means differ is the variance between the samples / populations (SSTr) being large compared to the variance within the samples (SSE) (ie within a given factor level), where
- x_ij is the jth observation from the ith population, with n_i observations in population i
- mu_t = grand mean over all observations = avg(x_{*,*})
- SSTr = sum_i n_i (avg(x_{i,*}) - mu_t)^2
- SSE = sum_i sum_j (x_{i,j} - avg(x_{i,*}))^2
- assumes equal-variance normal distribution within all the factor levels
- basic test intuition
- large SSE means intra-factor-level variability is high
- large SSTr means inter-factor-level variability is high
- MSTr / MSE follows an F-distribution if the factor means are all equal, where MSTr = SSTr/(k-1) and MSE = SSE/(N-k), with k the number of levels and N the total number of observations
- so eg small SSE and large SSTr makes it likely the means are different
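A minimal sketch of one-way ANOVA with scipy; the per-level samples are invented:

```python
# hedged sketch: one-way ANOVA (F = MSTr/MSE against the F-distribution)
import numpy as np
from scipy.stats import f_oneway

level_a = np.array([4.1, 4.3, 3.9, 4.2, 4.0])
level_b = np.array([4.8, 5.0, 4.7, 4.9, 5.1])
level_c = np.array([4.2, 4.4, 4.1, 4.3, 4.2])

f_stat, p_value = f_oneway(level_a, level_b, level_c)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")  # small p -> means likely differ
```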
MANOVA
- an extension of ANOVA that can handle more than 1 dependent variable
- we have 3 or more populations, and two or more dependent variables in the populations
- eg 3 grocery store chains, and fat content and sugar content as dependent variables
- we want to compare the means between populations, under each of the dependent variables
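A hedged sketch of the grocery-chain example using statsmodels' MANOVA; the fat/sugar numbers are made up, and the formula puts the dependent variables on the left:

```python
# hedged sketch: MANOVA via statsmodels (invented grocery-chain data)
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

df = pd.DataFrame({
    "chain": ["A"]*5 + ["B"]*5 + ["C"]*5,
    "fat":   [3.1, 3.3, 2.9, 3.2, 3.0, 3.8, 4.0, 3.7, 3.9, 4.1, 3.2, 3.4, 3.1, 3.3, 3.2],
    "sugar": [5.1, 5.3, 4.9, 5.2, 5.0, 5.8, 6.0, 5.7, 5.9, 6.1, 5.2, 5.4, 5.1, 5.3, 5.2],
})
fit = MANOVA.from_formula("fat + sugar ~ chain", data=df)
print(fit.mv_test())  # Wilks' lambda, Pillai's trace, etc.
```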
tests against a suspected distribution
we have samples, and want a test for if those samples “fit” a suspected distribution:
- histogram binning, with Chi-squared
- Kolmogorov Smirnov test
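A minimal sketch of the Kolmogorov-Smirnov approach with scipy; the samples are synthetic, and the suspected distribution (normal with given loc/scale) is passed via args:

```python
# hedged sketch: Kolmogorov-Smirnov test against a suspected distribution
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(1)
samples = rng.normal(loc=2.0, scale=1.5, size=200)

# compare the empirical CDF against N(2.0, 1.5); small p -> poor fit
stat, p_value = kstest(samples, "norm", args=(2.0, 1.5))
print(f"D = {stat:.3f}, p = {p_value:.3f}")
```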