parameter inference methods Flashcards

1
Q

Pearson Chi-squared test

A
  • hypothesis test for sample proportions
  • N samples, k categories
  • m_i are the expected frequencies from the population in each category (eg m_i = p_i*N, p_i the known population proportion)
  • test statistic
    • X = sum_{i=1..k} (x_i-m_i)^2 / m_i
    • where x_i are the observed counts in the sample
    • under the null hypothesis, X approximately follows a Chi-squared distribution with k-1 degrees of freedom (consider binomial and CLT on each term in the sum)
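
The statistic on this card can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the die-roll counts are made up, and the names `pearson_chi2` and `CRIT_5PCT_DF5` are chosen here for the example.

```python
def pearson_chi2(observed, proportions):
    """Pearson statistic X = sum (x_i - m_i)^2 / m_i, with m_i = p_i * N."""
    n = sum(observed)
    expected = [p * n for p in proportions]
    return sum((x - m) ** 2 / m for x, m in zip(observed, expected))

# Hypothetical data: 120 rolls of a die, tested against a fair die (p_i = 1/6).
counts = [18, 22, 16, 25, 19, 20]
stat = pearson_chi2(counts, [1 / 6] * 6)

# 5% critical value of the Chi-squared distribution with k - 1 = 5 degrees of freedom.
CRIT_5PCT_DF5 = 11.070
reject = stat > CRIT_5PCT_DF5   # True would mean "reject the fair-die hypothesis"
```

If SciPy is available, `scipy.stats.chisquare` computes the same statistic together with a p-value.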
2
Q

G-test

A

a goodness of fit test

an alternative to Pearson's Chi-squared test (usually with similar results), based on the likelihood ratio
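
As a rough sketch, the G statistic is G = 2 * sum x_i * ln(x_i / m_i), using the same observed counts x_i and expected counts m_i as the Pearson test. The data and function names below are hypothetical; the Pearson statistic is computed alongside only to show how close the two usually are.

```python
import math

def g_statistic(observed, expected):
    """Likelihood-ratio statistic G = 2 * sum x_i * ln(x_i / m_i)."""
    # Cells with x_i = 0 contribute nothing (the limit of x * ln(x) as x -> 0 is 0).
    return 2.0 * sum(x * math.log(x / m) for x, m in zip(observed, expected) if x > 0)

def pearson_statistic(observed, expected):
    """Pearson statistic, for comparison."""
    return sum((x - m) ** 2 / m for x, m in zip(observed, expected))

# Hypothetical data: 120 die rolls against a fair die (expected count 20 per face).
observed = [18, 22, 16, 25, 19, 20]
expected = [20.0] * 6
g = g_statistic(observed, expected)
x2 = pearson_statistic(observed, expected)
```

Both statistics are compared against the same Chi-squared reference distribution (here with 5 degrees of freedom).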

3
Q

goodness of fit tests

A
  • in general, compare actual data to expected or predicted data (as from a model)
  • eg, Chi-Square, Kolmogorov-Smirnov, and Shapiro-Wilk
  • goodness of fit tests are most commonly used in the context of categorical data and frequencies of occurrence
4
Q

contingency table

A
  • a tabular representation of categorical data, usually showing frequencies in some categorical phase space
  • the dimension of the phase space is the value of the “way” (1-way, 2-way, etc.)
  • usually the tables are 2-way (a multinomial proportion would be a 1-way table)
5
Q

inferences on correlations

A

Pearson: a test closely related to 1-dim linear regression checks whether 2 variables have non-zero correlation; it assumes normal distributions of X and Y

Spearman: transform Spearman’s rank correlation (eg with the Fisher transformation) to a value that is approximately normal; does not make normality assumptions on X and Y
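
A minimal sketch of both tests, under stated assumptions: the Pearson test uses the usual statistic t = r * sqrt((n-2)/(1-r^2)) with n-2 degrees of freedom, and the Spearman test uses the Fisher transformation atanh(rho) with the approximate standard error sqrt(1.06/(n-3)), which is one common choice rather than the only one. The data are invented and the ranking helper assumes no ties.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n; assumes no ties (a simplification for this sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def pearson_t(r, n):
    """t statistic for H_0: correlation = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))

def spearman_z(rho, n):
    """Approximately standard normal under H_0, via the Fisher transformation."""
    return math.atanh(rho) * math.sqrt((n - 3) / 1.06)

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.0, 1.5, 3.8, 3.1, 5.2, 6.9, 6.1, 8.0]
r = pearson_r(x, y)
rho = pearson_r(ranks(x), ranks(y))   # Spearman = Pearson on the ranks
t = pearson_t(r, len(x))
z = spearman_z(rho, len(x))
```

Note that `math.atanh` is undefined at rho = 1 or -1, so perfectly monotone data needs separate handling.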

6
Q

inference on a population mean

A
  • for small samples, assuming the population is approximately normal, use Student’s t, based on unbiased sample variance
  • for large enough samples, can rely on the CLT (regardless of population distribution), and use z-scores
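
The small-sample case can be sketched as follows, using only the standard library. The sample values and the hypothesised mean `mu0` are made up for illustration.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """t statistic for H_0: population mean = mu0, with n - 1 degrees of freedom."""
    n = len(sample)
    mean = statistics.fmean(sample)
    s = statistics.stdev(sample)   # square root of the unbiased (n - 1) sample variance
    return (mean - mu0) / (s / math.sqrt(n))

# Hypothetical measurements, tested against mu0 = 5.0.
sample = [5.1, 4.9, 5.6, 5.8, 5.2, 5.4]
t = one_sample_t(sample, 5.0)

# Two-sided 5% critical value of Student's t with 5 degrees of freedom.
T_CRIT_5PCT_DF5 = 2.571
reject = abs(t) > T_CRIT_5PCT_DF5
```

For large samples the same statistic is simply compared against the standard normal distribution instead of Student's t.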
7
Q

comparing two population means

A
  • comparing means of populations A and B
  • paired
    • everything is regarded as the same between the populations, other than the experimental variation; the data are pairs (a1,b1),…,(an,bn)
    • the statistical model allows removing all but inherent noise in paired-differences, a1-b1, a2-b2,…,an-bn
    • an ordinary one-sample t-test can then be applied on this set of differences
  • unpaired
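    • the standard error estimate of the difference follows from the sample variances s_A^2, s_B^2 and the sample sizes n_A, n_B:
    • sqrt(s_p^2 (1/n_A + 1/n_B)), with s_p^2 = ((n_A-1) s_A^2 + (n_B-1) s_B^2) / (n_A+n_B-2), if assuming var(A)=var(B)
      • aka pooled variance
    • sqrt(s_A^2/n_A + s_B^2/n_B), if assuming var(A)!=var(B)
      • this is a Welch test
  • if sample sizes are smallish, use the above with a two-sample t-test, ie on the difference between the sample means, mean(A)-mean(B); otherwise use a two-sample z-test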
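
A sketch of the three cases on this card, assuming the standard textbook formulas: the paired t statistic on the differences, the pooled standard error, and the Welch standard error. The data and function names are hypothetical.

```python
import math
import statistics

def paired_t(a, b):
    """One-sample t statistic applied to the paired differences a_i - b_i."""
    d = [x - y for x, y in zip(a, b)]
    return statistics.fmean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

def pooled_se(a, b):
    """Standard error of mean(A) - mean(B), assuming var(A) = var(B)."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * statistics.variance(a)
           + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return math.sqrt(sp2 * (1 / na + 1 / nb))

def welch_se(a, b):
    """Standard error of mean(A) - mean(B), without assuming equal variances."""
    return math.sqrt(statistics.variance(a) / len(a)
                     + statistics.variance(b) / len(b))

# Hypothetical samples.
a = [12.1, 11.8, 12.6, 12.9, 12.3]
b = [11.2, 11.9, 11.5, 12.0, 11.4]
diff = statistics.fmean(a) - statistics.fmean(b)
t_pooled = diff / pooled_se(a, b)
t_welch = diff / welch_se(a, b)
t_paired = paired_t(a, b)   # only meaningful if (a_i, b_i) really are pairs
```

With equal sample sizes the pooled and Welch standard errors coincide; the two tests then differ only in their degrees of freedom.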
8
Q

population proportions, binary case

A
  • ie the population(s) in question are categorical, with two levels
  • single population
    • essentially, use Gaussian approximation to binomial distribution
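    • standard error of the sample proportion is sqrt(p(1-p)/n), with p estimated from the sample proportion (equivalently, the success count has standard deviation sqrt(np(1-p)))
  • two populations, with samples of sizes m (from A) and n (from B)
    • hypothesis test, p_A=p_B (ie the binary “probabilities” between the populations are equal)
    • under this null hypothesis the standard error estimate simplifies to sqrt(p(1-p)(1/n+1/m)), with p estimated by pooling both samples, p = (x_A+x_B)/(m+n), where x_A and x_B are the success counts
    • confidence interval (ie what is the distribution of point estimate p_A-p_B), which uses the unpooled standard error sqrt(p_A(1-p_A)/m + p_B(1-p_B)/n)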
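
The two-population case can be sketched with the standard library's `statistics.NormalDist`. The success counts are invented, and the function names are chosen for this example; the test uses the pooled estimate of p, while the confidence interval uses the unpooled standard error.

```python
import math
from statistics import NormalDist

def two_proportion_z(x_a, m, x_b, n):
    """z statistic and two-sided p-value for H_0: p_A = p_B (pooled estimate of p)."""
    p_a, p_b = x_a / m, x_b / n
    p_pool = (x_a + x_b) / (m + n)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / m + 1 / n))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

def diff_confidence_interval(x_a, m, x_b, n, level=0.95):
    """Normal-approximation confidence interval for p_A - p_B (unpooled standard error)."""
    p_a, p_b = x_a / m, x_b / n
    se = math.sqrt(p_a * (1 - p_a) / m + p_b * (1 - p_b) / n)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    d = p_a - p_b
    return d - z_crit * se, d + z_crit * se

# Hypothetical data: 60 successes out of 200 in A, 45 out of 200 in B.
z, p_value = two_proportion_z(60, 200, 45, 200)
low, high = diff_confidence_interval(60, 200, 45, 200)
```

The Gaussian approximation is only reasonable when the expected numbers of successes and failures in each sample are not too small.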
9
Q

population proportions, multinomial case

A
  • ie the population(s) in question are categorical, with more than two levels
  • single population
    • essentially a goodness of fit test for 1-way contingency table
    • Pearson Chi-squared
    • likelihood ratio / G-test
  • two populations
    • amounts to test for independence
    • essentially a 2-way contingency table, each population is one axis, so use eg Chi-squared
10
Q

population proportions, comparing more than two populations

A
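  • ie testing if the proportions for k>2 populations are equal, which amounts to testing that the factor is independent of the population
  • eg each population has the same factor type, with 2 or more levels
  • use a goodness of fit test on the contingency table with one axis for the k populations and one for the factor levels (a 2-way table)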
    • the test for independence amounts to each cell being “close” to its product of marginals
    • Chi-squared, or G-test
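
A minimal sketch of the Chi-squared test of independence, where the expected count in each cell is the product of its marginals divided by the grand total. The table below is hypothetical, with one row per population and one column per factor level.

```python
def chi2_independence(table):
    """Chi-squared statistic and degrees of freedom for a rows x columns count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: product of marginals / grand total.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical data: 3 populations (rows), binary factor (columns).
table = [[30, 70],
         [45, 55],
         [60, 40]]
stat, dof = chi2_independence(table)

# 5% critical value of the Chi-squared distribution with 2 degrees of freedom.
CRIT_5PCT_DF2 = 5.991
reject = stat > CRIT_5PCT_DF2
```

If SciPy is available, `scipy.stats.chi2_contingency` performs the same computation and also returns a p-value and the expected counts.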
11
Q

Marascuilo procedure

A
  • in the context of comparing multiple population proportions, allows checking for “problematic” cells in the contingency table
  • can simultaneously test the differences of all pairs of proportions when there are several populations under investigation
  • note, this may suffer from the multiple comparisons problem
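
A sketch of the pairwise comparison, assuming the usual form of the procedure: the absolute difference |p_i - p_j| is compared with a critical range sqrt(chi2_crit) * sqrt(p_i(1-p_i)/n_i + p_j(1-p_j)/n_j), where chi2_crit is the Chi-squared critical value with k-1 degrees of freedom. The counts are made up and the function name is chosen for this example.

```python
import math
from itertools import combinations

def marascuilo(successes, sizes, chi2_crit):
    """Flag each pair (i, j) whose proportions differ by more than its critical range."""
    props = [x / n for x, n in zip(successes, sizes)]
    results = {}
    for i, j in combinations(range(len(props)), 2):
        diff = abs(props[i] - props[j])
        critical_range = math.sqrt(chi2_crit) * math.sqrt(
            props[i] * (1 - props[i]) / sizes[i]
            + props[j] * (1 - props[j]) / sizes[j]
        )
        results[(i, j)] = diff > critical_range
    return results

# Hypothetical data: 3 populations of 100 each, with 30, 45 and 60 successes.
# 5% critical value of the Chi-squared distribution with k - 1 = 2 degrees of freedom.
CHI2_CRIT_5PCT_DF2 = 5.991
flags = marascuilo([30, 45, 60], [100, 100, 100], CHI2_CRIT_5PCT_DF2)
```

In this made-up example only the first and third populations are flagged as different.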
12
Q

population proportions, fitting to a given distribution

A

eg we suspect a population follows a Poisson distribution

  • so we use sample binning and create a frequency histogram
  • we then apply a goodness of fit test on the histogram (Chi-squared)
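
The binning-then-testing idea can be sketched as below for the Poisson example. Assumptions made here: the rate is estimated from the sample mean, all values at or above `max_bin` are merged into one final bin, and one extra degree of freedom is subtracted for the estimated parameter. The sample is invented.

```python
import math
from collections import Counter

def poisson_pmf(k, lam):
    """Poisson probability of observing exactly k events."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def poisson_gof(samples, max_bin):
    """Chi-squared statistic and degrees of freedom for a Poisson fit."""
    n = len(samples)
    lam = sum(samples) / n                      # rate estimated from the data
    counts = Counter(min(s, max_bin) for s in samples)
    observed = [counts.get(k, 0) for k in range(max_bin + 1)]
    probs = [poisson_pmf(k, lam) for k in range(max_bin)]
    probs.append(1.0 - sum(probs))              # last bin collects all k >= max_bin
    expected = [p * n for p in probs]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    dof = (max_bin + 1) - 1 - 1                 # bins - 1, minus one estimated parameter
    return stat, dof

# Hypothetical sample of 100 event counts.
samples = [0] * 14 + [1] * 27 + [2] * 27 + [3] * 18 + [4] * 9 + [5] * 5
stat, dof = poisson_gof(samples, max_bin=4)

# 5% critical value of the Chi-squared distribution with 3 degrees of freedom.
CRIT_5PCT_DF3 = 7.815
reject = stat > CRIT_5PCT_DF3
```

Bins should be chosen so that expected counts are not too small; a common rule of thumb is at least 5 per bin.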
13
Q

ANOVA

A
  • comparing three or more population means (can only test H_0 = all the means are equal, against H_A = at least one mean differs)
  • If the population means are different, then the variability within the samples (SSE) (ie within a given factor level) tends to be small compared to the variability between the samples / populations (SSTr).
  • assumes equal-variance normal distributions within all the factor levels
  • basic test intuition
    • large SSE makes intra-factor-level variability high
    • large SSTr makes inter-factor-level variability high
    • MSTr / MSE follows an F-distribution, if the factor means are all equal
    • so eg small SSE and large SSTr makes it likely the means are different
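
The intuition above can be made concrete with a short one-way ANOVA sketch. The three groups are made-up data, and `one_way_anova_f` is a name chosen for this example.

```python
import statistics

def one_way_anova_f(groups):
    """F = MSTr / MSE and its degrees of freedom (k - 1, n - k)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group (treatment) sum of squares.
    sstr = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    # Within-group (error) sum of squares.
    sse = sum(sum((x - statistics.fmean(g)) ** 2 for x in g) for g in groups)
    mstr = sstr / (k - 1)
    mse = sse / (n - k)
    return mstr / mse, (k - 1, n - k)

# Hypothetical data: three factor levels with three observations each.
groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]
f_stat, dof = one_way_anova_f(groups)

# 5% critical value of the F-distribution with (2, 6) degrees of freedom.
F_CRIT_5PCT_2_6 = 5.14
reject = f_stat > F_CRIT_5PCT_2_6
```

Here SSTr = 38 and SSE = 6, so MSTr = 19 and MSE = 1; the large ratio suggests the means are not all equal.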
14
Q

MANOVA

A
  • an extension of ANOVA that can handle more than 1 dependent variable
  • we have 3 or more populations, and two or more dependent variables in the populations
    • eg 3 grocery store chains, and fat content and sugar content as dependent variables
    • we want to compare the means between populations, under each of the dependent variables
15
Q

tests against a suspected distribution

A

we have samples, and want to test whether those samples “fit” a suspected distribution:

  • histogram binning, with Chi-squared
  • Kolmogorov-Smirnov test
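
As a sketch of the second option, the Kolmogorov-Smirnov statistic is the largest gap between the empirical CDF of the sample and the CDF of the suspected distribution. The sample below is invented, and the suspected distribution is assumed to be a standard normal.

```python
from statistics import NormalDist

def ks_statistic(samples, cdf):
    """Largest distance between the empirical CDF and the given CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides of the step.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

# Hypothetical sample, tested against a standard normal distribution.
samples = [-1.2, -0.4, 0.1, 0.5, 1.3]
d = ks_statistic(samples, NormalDist(0.0, 1.0).cdf)
```

Turning the statistic into a p-value needs the Kolmogorov distribution; if SciPy is available, `scipy.stats.kstest` handles that step.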

16
Q

parametric vs non-parametric tests

A

a parametric test is used when the populations are assumed to follow a known family of distributions (eg normal distributions), so that inference is about the parameters of that family

non-parametric tests are useful when the populations in question have unknown distributions

17
Q

mixed categorical and numeric cases

A
  • sometimes, correlations or modeling are needed where the independent variables are of mixed categorical / numeric types
  • linear regression: can handle mixed types straightaway, by one-hot encoding the categorical variables
  • ANOVA: group by the categorical(s), leaving the continuous variable separated by the categories
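
One-hot encoding for regression can be sketched as below. This is an illustration under assumptions: the first level (in sorted order) is dropped as the reference category so the dummy columns are not collinear with the intercept, and the variable names and data are hypothetical.

```python
def one_hot(values, drop_first=True):
    """Encode a categorical variable as 0/1 columns, one per kept level."""
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels   # dropped level acts as the baseline
    rows = [[1.0 if v == level else 0.0 for level in kept] for v in values]
    return rows, kept

def design_matrix(numeric, categorical):
    """Rows of [intercept, numeric value, dummy columns...] for a linear regression."""
    dummies, kept = one_hot(categorical)
    rows = [[1.0, x] + d for x, d in zip(numeric, dummies)]
    return rows, ["intercept", "x"] + kept

# Hypothetical data: one numeric and one categorical independent variable.
sizes = [50.0, 80.0, 65.0, 120.0]
regions = ["north", "south", "east", "south"]
X, columns = design_matrix(sizes, regions)
```

The resulting matrix can be passed to any least-squares solver; each dummy coefficient is then the shift relative to the dropped reference level.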