Lecture 8 Flashcards
(13 cards)
What are Randomization tests?
Permutation tests; exact tests; tests that enumerate ALL of the possible outcomes that could occur in some reference set, besides the outcome that was actually observed.
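As a minimal sketch (in Python, with made-up numbers), an exact randomization test for a difference in group means can enumerate every possible reassignment of the pooled observations:

```python
# Minimal sketch of an exact permutation/randomization test for a difference
# in group means; group_a and group_b hold hypothetical measurements.
from itertools import combinations

group_a = [12, 15, 14]          # hypothetical "treatment" values
group_b = [10, 11, 13, 9]       # hypothetical "control" values
pooled = group_a + group_b
observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)

# Enumerate ALL ways to label len(group_a) of the pooled values as "treatment":
# this is the reference set of the exact test.
count_extreme, total = 0, 0
for idx in combinations(range(len(pooled)), len(group_a)):
    a = [pooled[i] for i in idx]
    b = [pooled[i] for i in range(len(pooled)) if i not in idx]
    stat = sum(a) / len(a) - sum(b) / len(b)
    total += 1
    if stat >= observed:        # one-sided: as extreme or more extreme
        count_extreme += 1

print(count_extreme / total)    # exact p-value over the full reference set
```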
What are 2 characteristics of a Randomization test?
1. Small data IS NOT poor data.
2. If data sets are too large for exact tests, then typically Monte Carlo (MC) methods or a classical parametric test will be used instead (see the Monte Carlo sketch below).
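A minimal Monte Carlo version of the same permutation idea, assuming the full enumeration would be too large; the data are again hypothetical:

```python
# Monte Carlo permutation test sketch: approximate the reference set with
# random relabellings instead of full enumeration. Data are hypothetical.
import random

group_a = [12, 15, 14]
group_b = [10, 11, 13, 9]
pooled = group_a + group_b
observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)

random.seed(0)
n_reps = 10_000
count_extreme = 0
for _ in range(n_reps):
    random.shuffle(pooled)                         # random relabelling
    a, b = pooled[:len(group_a)], pooled[len(group_a):]
    stat = sum(a) / len(a) - sum(b) / len(b)
    if stat >= observed:                           # one-sided
        count_extreme += 1

print(count_extreme / n_reps)                      # approximate p-value
```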
Name 2 characteristics of Fisher’s exact test.
1. It has a one-sided Alternative Hypothesis.
2. It does not require large samples (see the sketch below).
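A hedged sketch using scipy.stats.fisher_exact (SciPy assumed available); the 2x2 counts are invented for illustration:

```python
# Fisher's exact test on a hypothetical 2x2 table with a one-sided alternative.
from scipy.stats import fisher_exact

# rows: treatment / control; columns: success / failure (hypothetical counts)
table = [[8, 2],
         [3, 7]]

odds_ratio, p_value = fisher_exact(table, alternative="greater")  # one-sided H1
print(odds_ratio, p_value)
```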
What is the Confusion Matrix?
A 2×2 table in which one dichotomous variable represents reality/truth and the other dichotomous variable represents a measurement/test/claim.
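A small sketch of how the 2×2 confusion matrix is tallied; the truth and claim labels are made up:

```python
# Build a 2x2 confusion matrix: rows index reality/truth, columns index the
# measurement/test/claim. Labels are hypothetical.
truth = [1, 1, 0, 0, 1, 0, 1, 0]   # 1 = condition truly present
claim = [1, 0, 0, 0, 1, 1, 1, 0]   # 1 = test claims "positive"

matrix = [[0, 0], [0, 0]]          # matrix[truth][claim]
for t, c in zip(truth, claim):
    matrix[t][c] += 1

tn, fp = matrix[0]                 # truth negative: true negatives, false positives
fn, tp = matrix[1]                 # truth positive: false negatives, true positives
print(matrix, tp, fp, fn, tn)
```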
Under which 3 conditions will a Hypergeometric distribution exist?
- Total number of items (population) is fixed.
- Sample size (number of trials) is a portion of the population.
- Probability of success changes after each trial, because sampling is done without replacement (see the sketch below).
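A hedged sketch using scipy.stats.hypergeom (SciPy assumed available); the population size, success count, and sample size below are arbitrary:

```python
# Hypergeometric probabilities: M = fixed population size, n = successes in
# the population, N = sample size drawn without replacement.
from scipy.stats import hypergeom

M, n, N = 50, 10, 5
rv = hypergeom(M, n, N)
print(rv.pmf(2))        # P(exactly 2 successes in the sample of 5)
```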
What is overfitting?
Overfitting occurs when we fit our model “too perfectly” to the data at hand (zero bias); such a model will then perform poorly and unpredictably on “new” data, i.e. across different samples.
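A hedged simulation sketch of overfitting (NumPy assumed available): a degree-9 polynomial fits 10 training points almost perfectly but does much worse on new points from the same process; all numbers are simulated:

```python
# Overfitting sketch: near-zero training error, much larger error on new data.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 10)

coefs = np.polyfit(x_train, y_train, deg=9)       # "too perfect" fit to 10 points

x_new = rng.uniform(0, 1, 10)                     # a "new" sample from the same process
y_new = np.sin(2 * np.pi * x_new) + rng.normal(0, 0.2, 10)

train_err = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
new_err = np.mean((np.polyval(coefs, x_new) - y_new) ** 2)
print(train_err, new_err)                         # new_err is typically much larger
```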
To what does overfitting lead?
To large variance of the parameters when the model is applied to new data/samples: there is zero/low bias but high variance.
What can we introduce when the model fits the data “too well”?
A penalty term, which introduces bias into the traditional estimation (see the sketch below).
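As an illustration (assuming a least-squares regression setting), a penalised ridge-style objective adds a term λ·Σβⱼ² to the usual residual sum of squares:

```latex
\hat{\beta} = \arg\min_{\beta} \; \sum_{i=1}^{n} \bigl(y_i - x_i^{\top}\beta\bigr)^2
              \; + \; \lambda \sum_{j=1}^{p} \beta_j^{2}
```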
Which 2 things can avoid overfitting?
1. Implementing the Bias Variance Trade-off.
2. Implementing train-test paradigm principles.
What is a Bias Variance Trade-off?
To avoid overfitting we should introduce a little bias into the model built on the data at hand (no perfect fit anymore); the model will then perform better on “new” data, with less variance.
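A hedged simulation sketch of the trade-off (NumPy assumed available): across many simulated samples a penalised (ridge-style) slope is slightly biased but varies less than the ordinary least-squares slope; the data-generating process and penalty value are made up:

```python
# Bias-variance trade-off: compare OLS and ridge-style slope estimates
# (no intercept) over many simulated samples.
import numpy as np

rng = np.random.default_rng(1)
true_slope, lam, n = 2.0, 5.0, 20
ols_slopes, ridge_slopes = [], []

for _ in range(2000):
    x = rng.normal(0, 1, n)
    y = true_slope * x + rng.normal(0, 2, n)
    sxx, sxy = np.sum(x * x), np.sum(x * y)
    ols_slopes.append(sxy / sxx)               # unbiased, higher variance
    ridge_slopes.append(sxy / (sxx + lam))     # biased toward 0, lower variance

print(np.mean(ols_slopes), np.var(ols_slopes))
print(np.mean(ridge_slopes), np.var(ridge_slopes))
```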
Which 2 regularisation techniques are based on the Bias Variance Trade-off principle? What do they do?
- Ridge regression.
- Lasso regression.
These techniques shrink the slope(s) of the regression line via a penalty term, which deliberately creates a small amount of bias (as sketched below).
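A hedged sketch with scikit-learn (assumed available): Ridge and Lasso shrink the slopes relative to plain linear regression; the data are simulated and the alpha values are arbitrary:

```python
# Compare unpenalised, ridge, and lasso slope estimates on simulated data.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, 0.0, -2.0]) + rng.normal(0, 1, 50)

print(LinearRegression().fit(X, y).coef_)   # unpenalised slopes
print(Ridge(alpha=10.0).fit(X, y).coef_)    # shrunk toward zero
print(Lasso(alpha=0.5).fit(X, y).coef_)     # shrunk; some may be exactly zero
```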
What is a Train-Test paradigm?
We should use training data to build the model and test data to assess the model. Both the training and the test data are created (from the dataset/sample that we started with) by sophisticated partitioning and reuse of the data.
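A minimal train-test sketch with scikit-learn (assumed available); the data are simulated and the split fraction is arbitrary:

```python
# Build the model on the training part, assess it on the held-out test part.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -0.5]) + rng.normal(0, 1, 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)   # built on training data
print(model.score(X_test, y_test))                 # assessed on test data (R^2)
```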
What is Cross-Validation?
It is the most famous technique for implementing the train-test ideas, more particularly by sophisticated partitioning of the data.
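A minimal cross-validation sketch with scikit-learn (assumed available); the data are simulated and 5 folds are an arbitrary choice:

```python
# 5-fold cross-validation: the sample is repeatedly partitioned into
# training and validation folds, and the scores are averaged.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -0.5]) + rng.normal(0, 1, 100)

scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5)
print(scores, scores.mean())
```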