parameter inference methods Flashcards

1
Q

Pearson Chi-squared test

A

aka Chi-square test; non-parametric test in two forms:

  • hypothesis test for sample proportions
  • hypothesis test for independence between 2 categorical variables

for goodness of fit:

  • N samples, k categories; assume the N samples are drawn independently
  • m_i are the expected frequencies from the population in each category (eg m_i = p_i*N, p_i the known population proportion)
  • null hypothesis models the sample as drawn from a multinomial distribution (ie N IID categorical trials)
  • test statistic
    • X = sum_{i=1..k} (x_i - m_i)^2 / m_i
    • where x_i are the observed counts
    • X asymptotically follows a Chi-square distribution with k-1 degrees of freedom (consider the binomial and CLT on each term in the sum–ie each x_i as the number of successes of type i –> Gaussian)

for independence:

  • form 2-way contingency table w/ say columns as populations and rows as factor levels
  • test between sample frequencies and product-of-marginals frequencies using Chi-square
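
Both forms can be sketched in SciPy with hypothetical counts (the die-roll proportions and the 2x2 table below are made-up illustrations):

```python
from scipy.stats import chisquare, chi2_contingency

# goodness of fit: hypothetical die-roll counts vs a fair-die null
observed = [18, 22, 16, 25, 19, 20]      # x_i, N = 120
expected = [120 / 6] * 6                 # m_i = p_i * N, with p_i = 1/6
stat, p = chisquare(f_obs=observed, f_exp=expected)

# independence: hypothetical 2-way contingency table
# (rows = factor levels, columns = populations)
table = [[30, 10],
         [20, 40]]
chi2_stat, p2, dof, expected_counts = chi2_contingency(table)
```

Note chi2_contingency computes the product-of-marginals expected counts itself and returns them alongside the statistic.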
2
Q

G-test

A

a goodness of fit test; a likelihood-ratio (maximum likelihood) statistical significance test

G^2 = 2 sum_i x_i log(x_i / m_i), where

  • x_i are the sample cell counts for each category i
  • m_i are the null hypothesis cell counts
  • note G^2 asymptotically follows a Chi-square distribution; a 2nd-order Taylor expansion of log() recovers the Pearson statistic

alternative to Pearson Chi-squared (with usually similar results)
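
In SciPy the same statistic is available through power_divergence with lambda_="log-likelihood"; a sketch with hypothetical counts, checked against the formula above:

```python
import math
from scipy.stats import power_divergence

observed = [18, 22, 16, 25, 19, 20]   # hypothetical sample cell counts x_i
expected = [20] * 6                   # null hypothesis cell counts m_i

# G^2 via SciPy's power-divergence family
g, p = power_divergence(f_obs=observed, f_exp=expected,
                        lambda_="log-likelihood")

# same statistic directly from G^2 = 2 sum_i x_i log(x_i / m_i)
g_manual = 2 * sum(x * math.log(x / m) for x, m in zip(observed, expected))
```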

3
Q

goodness of fit tests

A
  • in general, compare actual data to expected or predicted data (as from a model); testing whether a given data sample “likely indicates” the data comes from a given population distribution
  • eg, Chi-Square, Kolmogorov-Smirnov, and Shapiro-Wilk
  • goodness of fit is commonly used in the context of categorical data and frequencies of occurrence
4
Q

contingency table

A
  • a tabular representation of categorical data, usually showing frequencies in some categorical phase space
  • the dimension of the phase space is the value of the “way” (1-way, 2-way, etc.)
  • usually the tables are 2-way (a multinomial proportion would be a 1-way table)
5
Q

inferences on correlations

A

Pearson–closely related to 1-dim linear regression; tests whether 2 variables have nonzero correlation via t = r*sqrt(n-2)/sqrt(1-r^2), which follows a Student’s t distribution (n-2 df) under the null; assumes normal distributions of X and Y

Spearman–transform Spearman’s rank correlation (eg via the Fisher transformation) to a value whose point estimate is approximately normal; does not make normality assumptions on X and Y
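
A sketch with hypothetical data, using SciPy's pearsonr and spearmanr (each returns the correlation estimate and a p-value for the null of zero correlation):

```python
from scipy.stats import pearsonr, spearmanr

# hypothetical, roughly linear data
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 8.0, 9.7, 12.1]

r, p_r = pearsonr(x, y)        # normality assumed for the p-value
rho, p_rho = spearmanr(x, y)   # rank-based; no normality assumption
```

Because y here is strictly increasing in x, the rank correlation rho is exactly 1.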

6
Q

inference on a population mean

A
  • for small samples, assuming the population is approximately normal, use Student’s t, based on unbiased sample variance
  • for large enough samples, can rely on the CLT (regardless of population distribution), and use z-scores
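
A sketch on a hypothetical small sample; the z statistic below is the same formula the t-test uses, just referred to a normal rather than a t distribution:

```python
import numpy as np
from scipy.stats import ttest_1samp, norm

sample = np.array([4.8, 5.1, 4.9, 5.3, 5.0, 4.7, 5.4])  # hypothetical data

# small sample: Student's t, based on the unbiased sample variance
t, p_t = ttest_1samp(sample, popmean=5.0)

# large-sample version: same statistic, normal reference distribution
z = (sample.mean() - 5.0) / (sample.std(ddof=1) / np.sqrt(len(sample)))
p_z = 2 * norm.sf(abs(z))
```

The t p-value is always at least as large as the z p-value for the same statistic, since the t distribution has heavier tails.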
7
Q

comparing two population means

A
  • comparing means of populations A and B
  • paired
    • everything is regarded as the same between the populations, other than the experimental variations–pairs (a1,b1),…,(an,bn)
    • the statistical model allows removing all but inherent noise in paired-differences, a1-b1, a2-b2,…,an-bn
    • an ordinary one-sample t-test can then be applied on this set of differences
  • unpaired
    • standard error estimate follows from the sample variances:
      • if assuming var(A) = var(B): pooled variance
      • if assuming var(A) is not equal to var(B): this is Welch’s t-test
  • if sample sizes are smallish, use the above with a two-sample t-test, ie on the difference between the sample means, mu(A)-mu(B); otherwise use a two-sample z-test
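
A sketch with hypothetical paired measurements; the paired test is literally a one-sample t on the differences, and the unpaired case switches between pooled variance and Welch via SciPy's equal_var flag:

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel, ttest_ind

a = np.array([12.1, 11.8, 12.5, 12.0, 11.9])  # hypothetical measurements
b = np.array([11.9, 11.5, 12.1, 11.8, 11.7])

# paired: one-sample t on the differences a_i - b_i
t_diff, p_diff = ttest_1samp(a - b, popmean=0.0)
t_rel, p_rel = ttest_rel(a, b)                # identical result

# unpaired: pooled variance vs Welch
t_pool, p_pool = ttest_ind(a, b, equal_var=True)
t_welch, p_welch = ttest_ind(a, b, equal_var=False)
```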
8
Q

population proportions, binary case

A
  • ie the population(s) in question are categorical, with two levels
  • single population
    • essentially, use Gaussian approximation to binomial distribution, and z-scores
    • (so) standard error estimate is sqrt(np(1-p)), with p estimated from the sample proportion (i.e. we can “get away with” approximating the population parameter p with the sample proportion)
  • two populations, of sizes m and n
    • hypothesis test, p_A=p_B (ie the binary “probabilities” between the populations are equal)
    • this results in
      • simplified standard error estimate, sqrt(p(1-p)(1/n+1/m))
      • a pooled estimator for population parameter p, p=(x+y)/(n+m), with x/y the “success” counts and n/m the sample sizes
    • confidence interval (ie what is the distribution of point estimate p_A-p_B)
      • take the usual variance for a proportion, p(1-p)/n, for each group
      • combine via the usual linear combo of r.v.s (variances add for independent groups)
      • use z-scores
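
A sketch of the two-population test and interval with hypothetical counts, following the pooled-estimator recipe above:

```python
from math import sqrt
from scipy.stats import norm

x, n = 45, 100   # hypothetical successes / trials, population A
y, m = 30, 100   # population B

p_a, p_b = x / n, y / m
p_pool = (x + y) / (n + m)                       # pooled estimate under H0: p_A = p_B
se = sqrt(p_pool * (1 - p_pool) * (1/n + 1/m))   # simplified standard error
z = (p_a - p_b) / se
p_value = 2 * norm.sf(abs(z))

# confidence interval for p_A - p_B uses the unpooled variances
se_ci = sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / m)
ci = (p_a - p_b - 1.96 * se_ci, p_a - p_b + 1.96 * se_ci)
```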
9
Q

population proportions, multinomial case

A
  • ie the population(s) in question are categorical, with more than two levels
  • single population
    • essentially a goodness of fit test for 1-way contingency table
    • Pearson Chi-squared
    • likelihood ratio / G-test
  • two populations
    • amounts to test for independence
    • essentially a 2-way contingency table, each population is one axis, with the independence assumption being each cell is the product of marginals
    • use eg Chi-squared
10
Q

population proportions, comparing more than two populations

A
  • ie testing whether proportions are equal across k>2 populations (a test of homogeneity, computationally the same as the independence test)
  • eg each population has the same factor type, with 2 or more levels
  • use goodness of fit test on the resulting 2-way (populations x levels) contingency table
    • the test for independence amounts to each cell being “close” to its product of marginals
    • Chi-squared, or G-test
11
Q

Marascuillo procedure

A
  • in the context of comparing multiple population proportions, allows checking for “problematic” cells in the contingency table
  • can simultaneously test the differences of all pairs of proportions when there are several populations under investigation
  • note, this may suffer from the multiple comparisons problem
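
A minimal sketch of the procedure with hypothetical counts: each pairwise difference of proportions is compared against its own critical range, built from the square root of the chi-square critical value:

```python
from itertools import combinations
from math import sqrt
from scipy.stats import chi2

successes = [38, 50, 29]     # hypothetical success counts, k = 3 populations
sizes     = [100, 100, 100]
k, alpha = len(successes), 0.05

props = [s / n for s, n in zip(successes, sizes)]
crit = sqrt(chi2.ppf(1 - alpha, df=k - 1))   # sqrt of chi-square critical value

# pair (i, j) is flagged if |p_i - p_j| exceeds its critical range
flagged = {}
for i, j in combinations(range(k), 2):
    rng = crit * sqrt(props[i] * (1 - props[i]) / sizes[i]
                      + props[j] * (1 - props[j]) / sizes[j])
    flagged[(i, j)] = abs(props[i] - props[j]) > rng
```

With these counts only the 0.50 vs 0.29 pair exceeds its critical range.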
12
Q

population proportions, fitting to a given distribution

A

eg we suspect a population follows a Poisson distribution

  • so we bin the samples and create a frequency histogram
  • we then apply a goodness of fit test on the histogram (Chi-squared), reducing the degrees of freedom by one for each parameter estimated from the sample (eg the Poisson rate)
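
A sketch with hypothetical count data, pooling the tail into a single “3 or more” bin and dropping one degree of freedom for the estimated rate:

```python
import numpy as np
from scipy.stats import poisson, chisquare

# hypothetical event counts per interval
data = np.array([0, 1, 1, 2, 0, 3, 1, 2, 1, 0, 2, 1, 0, 1, 2, 3, 1, 0, 1, 2])
lam = data.mean()            # MLE of the Poisson rate
n = len(data)

# bins for counts 0, 1, 2, and "3 or more"
observed = [int(np.sum(data == k)) for k in range(3)] + [int(np.sum(data >= 3))]
probs = [poisson.pmf(k, lam) for k in range(3)] + [float(poisson.sf(2, lam))]
expected = [n * pr for pr in probs]

# ddof=1 accounts for estimating lambda from the sample
stat, p = chisquare(observed, expected, ddof=1)
```

Pooling the tail keeps expected cell counts from getting too small, which the chi-square approximation needs.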
13
Q

ANOVA

A
  • comparing three or more population means (the null hypothesis H_0 is that all the means are equal; rejecting it indicates only that at least one mean differs)
  • if the population means are different, then the variance within the samples (SSE) (ie within a given factor level) must be small compared to the variance between the samples / populations (SSTr), where
    • x_ij is the jth observation from the ith population
    • mu_t = total mean over all samples = avg(x_*,*)
    • SSTr = sum_i n_i( avg(x_i,*) - mu_t) ^2
    • SSE = sum_i sum_j (x_i,j - avg(x_i,*) )^2
  • assumes equal-variance normal distribution within all the factor levels
  • basic test intuition
    • large SSE makes intra-factor-level variability high
    • large SSTr makes inter-factor-level variability high
    • MSTr / MSE follows an F-distribution, if the factor means are all equal, where MSTr = SSTr/(k-1) and MSE = SSE/(N-k)
    • so eg small SSE and large SSTr makes it likely the means are different
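
The SSTr/SSE recipe above can be computed directly and checked against SciPy's f_oneway (hypothetical samples for three factor levels):

```python
import numpy as np
from scipy.stats import f_oneway

# hypothetical samples, one list per factor level
g1 = [5.1, 4.9, 5.3, 5.0, 5.2]
g2 = [5.6, 5.8, 5.5, 5.9, 5.7]
g3 = [4.6, 4.4, 4.7, 4.5, 4.8]

f_stat, p = f_oneway(g1, g2, g3)

# MSTr / MSE by hand, matching the definitions above
groups = [np.array(g) for g in (g1, g2, g3)]
all_x = np.concatenate(groups)
mu_t = all_x.mean()                                            # total mean
sstr = sum(len(g) * (g.mean() - mu_t) ** 2 for g in groups)    # between
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)         # within
mstr = sstr / (len(groups) - 1)            # k - 1 df
mse = sse / (len(all_x) - len(groups))     # N - k df
```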
14
Q

MANOVA

A
  • an extension of ANOVA that can handle more than 1 dependent variable
  • we have 3 or more populations, and two or more dependent variables in the populations
    • eg 3 grocery store chains, and fat content and sugar content as dependent variables
    • we want to compare the means between populations, under each of the dependent variables
15
Q

tests against a suspected distribution

A

we have samples, and want to test whether those samples “fit” a suspected distribution:

  • histogram binning, with Chi-squared
  • Kolmogorov–Smirnov test
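
A sketch of the second option on synthetic data; kstest compares the sample's empirical CDF against a fully specified reference CDF:

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=200)   # synthetic data

# KS test against a fully specified N(0, 1)
d, p = kstest(sample, 'norm', args=(0.0, 1.0))
```

If the reference distribution's parameters are instead estimated from the same sample, the plain KS p-value is no longer valid; variants such as the Lilliefors test correct for this.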
16
Q

parametric vs non-parametric tests

A

a parametric test is used when the populations have a known distributional form; it makes assumptions about the parameters of the population distribution (usually a normal distribution)

non-parametric tests are useful for when the populations in question have unknown distributions (though the test may impose some conditions on the populations or samples)

17
Q

mixed categorical and numeric cases

A
  • i.e. a group of variables (usually 2), where a correlation-like measure is wanted, but one or more variables are factors
  • contingency tables (if all are factors)
  • logistic regression (ie model the factor as the response, with mixed factor and continuous predictors)
  • ANOVA–group by the categorical(s), leaving the continuous separated by the categories (at least for 2-variable case, one a factor and the other continuous)