Finals Flashcards
(37 cards)
What does the Central Limit Theorem prove?
The sampling distribution of the mean is approximately normally distributed, provided n is sufficiently large. To use this result directly (its standard error is σ/√n), the population standard deviation σ must be known.
What is a problem about the Central Limit Theorem? What can we use instead?
The standard deviation of the population, which we need in order to apply the CLT (it appears in the standard error σ/√n), is often not known. Instead we can perform a one-sample t-test, which only requires the sample standard deviation.
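A minimal simulation sketch (not part of the original card, assuming NumPy is available) illustrating the CLT: sample means drawn from a skewed population look approximately normal once n is large, with spread close to σ/√n.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = 1.0           # std. dev. of the exponential population (scale = 1)
n, reps = 50, 10_000  # sample size and number of repeated samples

# draw many samples from a skewed (exponential) population and average each
sample_means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

print("mean of sample means:", sample_means.mean())         # ~ 1.0 (population mean)
print("std. error (empirical):", sample_means.std(ddof=1))  # ~ sigma / sqrt(n)
print("std. error (theory):   ", sigma / np.sqrt(n))
```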
What is a one-sample t-test? When should you use it? +Formula
It is used to compare a result to an expected value.
You should use this test when:
- You do not know the population standard deviation (only an expected value μ to compare against).
- You have a single random sample and want to compare its mean to that expected value.
Formula: t = ( x̄ – μ) / (s / √n)
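A hedged sketch of the formula above in Python (the data and the expected value μ are invented for illustration; assumes NumPy and SciPy):

```python
import numpy as np
from scipy import stats

x = np.array([5.1, 4.9, 5.6, 5.3, 4.8, 5.4, 5.0, 5.2])
mu = 5.0  # expected (hypothesized) population mean

# manual t statistic: (sample mean - mu) / (sample std / sqrt(n))
t_manual = (x.mean() - mu) / (x.std(ddof=1) / np.sqrt(len(x)))

# same test via SciPy, which also returns the p-value
t_scipy, p_value = stats.ttest_1samp(x, popmean=mu)

print(t_manual, t_scipy, p_value)  # the two t values agree
```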
What can you tell me about Exploratory data analysis (EDA)? (2 bulletpoints)
- EDA is for seeing what the data can tell us beyond the formal modelling or hypothesis testing task.
- It is also the best paradigm to make statements about both validity and reliability
What scales of data are there? Explain and give an example. (a lot of text)
- Categorical (Nominal)
○ uses labels to classify cases into classes
○ gender, nationality, residence, car brand
- Ordinal
○ permissible transformation: any monotonic increasing function
○ if X > Y then log(X) > log(Y)
○ PRESERVES ORDER NOT MAGNITUDE
○ ratings and rankings
○ Example: not at all, slightly, fairly, much, very much
- Interval
○ permissible transformation: Y = aX + b
○ i.e. What is the exact temperature in your city?
- Ratio
○ permissible transformation: Y = aX
○ difference to ordinal: it produces not only order but also makes the differences between values known, along with information about a true zero
○ i.e. How many children? 0, less than or equal to 2, more than 2
○ IT HAS A NATURAL ZERO POINT (total absence of the variable of interest, i.e. not having any children)
What are the properties of a reliable research tool?
A reliable research tool is consistent, stable, predictable and accurate.
What does the parallel forms reliability do? When do you use it?
It measures the correlation between two equivalent versions of a test. You use it when you have two different assessment tools or sets of questions designed to measure the same thing.
What does the test - retest reliability do?
It measures the consistency of a test's results when the same test is repeated over time.
What is the split half technique? What would ensure an acceptable level of reliability in the measurements?
It is a method used to check measuring instruments: the items are split into two halves, each half is scored, and one half is then correlated against the other half of the data. A correlation coefficient of 0.9 would ensure an acceptable level of reliability in the measurements.
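A minimal sketch of the split-half technique in Python (the item scores are simulated, using an assumed odd/even split of the items; requires NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
true_score = rng.normal(size=30)                                      # latent trait per respondent
items = true_score[:, None] + rng.normal(scale=0.5, size=(30, 10))   # 10 noisy items

half_a = items[:, 0::2].sum(axis=1)  # odd-numbered items
half_b = items[:, 1::2].sum(axis=1)  # even-numbered items

r = np.corrcoef(half_a, half_b)[0, 1]
print("split-half correlation:", r)  # around 0.9 would indicate acceptable reliability
```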
What is the inter - rater reliability? How is it calculated?
It is the extent to which two or more raters agree. It is calculated with COHEN'S KAPPA. (Formula: κ = (p_o − p_e) / (1 − p_e) )
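A minimal sketch computing Cohen's kappa for two raters from the formula above (the ratings are invented for illustration):

```python
from collections import Counter

rater1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
rater2 = ["yes", "no",  "no", "yes", "no", "yes", "yes", "yes"]

n = len(rater1)

# p_o: observed proportion of agreement
p_o = sum(a == b for a, b in zip(rater1, rater2)) / n

# p_e: agreement expected by chance, from each rater's label frequencies
c1, c2 = Counter(rater1), Counter(rater2)
p_e = sum(c1[label] * c2[label] for label in set(rater1) | set(rater2)) / n**2

kappa = (p_o - p_e) / (1 - p_e)
print("observed:", p_o, "expected:", p_e, "kappa:", kappa)
```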
How is the standard normal distribution curved? Give its 2 parameters and their values.
It is bell curved.
The parameters are the mean ( = 0 ) and the standard deviation ( = 1 ).
What is the difference between a T-distribution and a normal distribution?
A T-distribution has the same bell shape as a normal distribution but heavier tails; as the degrees of freedom increase, it approaches the normal distribution.
What do you know about the Monte Carlo method?
● Any problem that might be deterministic in principle can be solved by MC. It relies on repeated random sampling in order to obtain a good estimate or approximation of the exact p-value.
● MC is used when the data set does not meet the requirements necessary for parametric or asymptotic methods.
● Computing an exact p-value is possible via Exact tests, Randomization tests, but only for small data sets. MC can also work with large data sets.
● The Monte Carlo method tells you:
○ All of the possible events that could or will happen,
○ The probability of each possible outcome.
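A minimal sketch (an assumed example, not from the lecture) of a Monte Carlo approximation of a p-value: instead of enumerating every permutation (exact test), we repeatedly shuffle group labels and count how often the shuffled difference in means is at least as extreme as the observed one.

```python
import numpy as np

rng = np.random.default_rng(1)
group_a = np.array([4.1, 5.0, 4.7, 5.3, 4.9])
group_b = np.array([5.6, 5.9, 5.4, 6.1, 5.8])

observed = abs(group_a.mean() - group_b.mean())
pooled = np.concatenate([group_a, group_b])

reps = 10_000
count = 0
for _ in range(reps):
    perm = rng.permutation(pooled)                    # random reassignment of labels
    diff = abs(perm[:5].mean() - perm[5:].mean())
    count += diff >= observed

print("Monte Carlo p-value estimate:", count / reps)
```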
What are regularisation techniques? + Examples
They are techniques used to manage the Bias Variance Trade-Off: they introduce a small amount of bias by shrinking (slightly changing) the slope of the regression line, which in turn reduces variance. Lasso and Ridge regression are examples.
Explain “Bias Variance Trade-Off”? What is the name of the techniques used here?
● Bias–variance tradeoff is the property of a model that the variance of the parameter estimated across samples can be reduced by increasing the bias in the estimated parameters.
● Regularization techniques are used.
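A hedged sketch of this idea in Python (simulated data, assuming scikit-learn is available): Ridge and Lasso pull the regression slope towards zero relative to ordinary least squares, adding a little bias but reducing variance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=2.0, size=40)  # true slope = 3

for model in (LinearRegression(), Ridge(alpha=10.0), Lasso(alpha=0.5)):
    fitted = model.fit(X, y)
    print(type(model).__name__, "slope:", fitted.coef_[0])

# The regularised slopes are shrunk towards zero (biased) compared to OLS.
```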
What is the standard error?
It is the standard deviation of the sampling distribution.
Explain type 1 and type 2 error.
- Type 1 error (false positive): rejecting the null hypothesis although it is actually true.
- Type 2 error (false negative): failing to reject the null hypothesis although it is actually false.
What is sampling bias? Where does it come from? What Non Random Sample types are there? (important) Explain them. (not crucial I guess, but interesting)
Sampling bias is a type of selection bias and involves systematic error due to a non random sample of a population:
- Convenience sampling is a method where market research data is collected from a conveniently available pool of respondents. (remember WEIRD from the cognitive science lectures).
- Snowball sampling is a technique where existing participants recruit further participants, so the information comes from “somewhere” and cannot be fully traced and verified, e.g. samples of drug addicts or gamblers. Where does the snowball come from?
- Quota sampling is a technique in which researchers choose individuals (for the sample) according to specific traits or qualities.
Explain the different variables there are in an experiment.
- Independent variable: A variable the experimenter changes or controls and is assumed to have a direct effect on the dependent variable.
- Dependent variable: A variable being tested and measured in an experiment; it is “dependent” on the independent variable.
- Extraneous variables: All variables that are not the independent variable but could still affect the results of the experiment.
What is “residual”?
It is the difference between the observed value and the value that a supervised learning model predicts for that observation. In other words, it is a measure of how much a regression line vertically misses a data point.
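A minimal sketch of residuals in Python (the data points are invented; requires NumPy): observed y minus the value predicted by a fitted least-squares line.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares regression line
predicted = slope * x + intercept
residuals = y - predicted                   # vertical misses of the line

print("residuals:", residuals)
print("residuals sum to ~0:", residuals.sum())
```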
Explain prevalence, sensitivity, specificity, Positive Predictive Value, Negative Predictive Value and accuracy and give their formulas.
● Prevalence: Total number of cases of a disease existing in a population divided by the total population. P(Z) = ( TP + FN ) / ( TP + TN + FP + FN )
● Sensitivity: the proportion of people with the disease who will have a positive test result; P(T|Z) = TP / ( TP + FN ) [people with the disease]
● Specificity: the proportion of people without the disease who will have a negative result; P(-T|-Z) = TN / ( TN + FP ) [people without the disease]
● Positive Predictive Value: the probability that patients with a positive test result actually have the disease; P(Z|T) = TP / ( TP + FP ) [people with a positive test]
● Negative Predictive Value: the probability that people with a negative test result truly do not have the disease; P(-Z|-T) = TN / ( TN + FN ) [people with a negative test]
● Accuracy: It measures the correctness of a diagnostic test on a condition. (TP + TN) / total
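A minimal sketch computing these metrics from confusion-matrix counts (the counts themselves are invented for illustration):

```python
TP, FN, FP, TN = 80, 20, 30, 870  # hypothetical screening-test results
total = TP + FN + FP + TN

prevalence  = (TP + FN) / total  # people who actually have the disease
sensitivity = TP / (TP + FN)     # P(T|Z)
specificity = TN / (TN + FP)     # P(-T|-Z)
ppv         = TP / (TP + FP)     # P(Z|T)
npv         = TN / (TN + FN)     # P(-Z|-T)
accuracy    = (TP + TN) / total

print(prevalence, sensitivity, specificity, ppv, npv, accuracy)
```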
What happens if the standard error of the mean gets decreased?
It will also decrease the difference between the lower and the upper bound of the confidence interval and allow for more accurate, precise conclusions. In other words, the smaller the standard error, the narrower the confidence interval and the more precise the conclusions.
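A minimal sketch of this point in Python (simulated data; assumes NumPy and SciPy): a smaller standard error of the mean gives a narrower 95% confidence interval.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
population = rng.normal(loc=100, scale=15, size=100_000)

for n in (10, 100, 1000):
    sample = rng.choice(population, size=n, replace=False)
    se = sample.std(ddof=1) / np.sqrt(n)          # standard error of the mean
    t_crit = stats.t.ppf(0.975, df=n - 1)         # 95% critical value
    low, high = sample.mean() - t_crit * se, sample.mean() + t_crit * se
    print(f"n={n:4d}  SE={se:5.2f}  CI width={high - low:5.2f}")
```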
What are computer intensive techniques? What techniques / concepts are considered as CIT’s?
● Sometimes called resampling
● Involve intensive use of computers to compute thousands of new samples, divergent statistics or other values of interest to do inferential statistics in an “empirical” way, to improve system performance and to validate models
● Examples of CIT:
○ Bootstrapping
○ Monte Carlo Methods
○ Randomisation Test, Permutation Tests, Exact Tests
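A minimal sketch of bootstrapping, one of the CITs listed above (the observed data are invented; requires NumPy): resample the data with replacement many times to obtain an empirical sampling distribution of the mean and a percentile confidence interval.

```python
import numpy as np

rng = np.random.default_rng(4)
data = np.array([12.1, 9.8, 11.4, 10.2, 13.0, 9.5, 10.9, 11.7, 10.4, 12.3])

# resample with replacement and record the mean of each resample
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(10_000)
])

print("bootstrap SE of the mean:", boot_means.std(ddof=1))
print("95% percentile CI:", np.percentile(boot_means, [2.5, 97.5]))
```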
A contingency table allows for …
… exact probability statements.