Finals Flashcards
(37 cards)
What does the Central Limit Theorem prove?
The sampling distribution of the mean is approximately normally distributed, provided n is sufficiently large. To use this result directly (its standard error is σ/√n), the population standard deviation σ must be known.
What is a problem about the Central Limit Theorem? What can we use instead?
The standard deviation of the population, which we need in order to apply the CLT (it appears in the standard error σ/√n), is often not known. Instead we can perform a one-sample t-test, which only requires the sample standard deviation.
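A minimal simulation sketch (not part of the original card, assuming NumPy is available) illustrating the CLT: sample means drawn from a skewed population look approximately normal once n is large, with spread close to σ/√n.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = 1.0           # std. dev. of the exponential population (scale = 1)
n, reps = 50, 10_000  # sample size and number of repeated samples

# draw many samples from a skewed (exponential) population and average each
sample_means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

print("mean of sample means:", sample_means.mean())         # ~ 1.0 (population mean)
print("std. error (empirical):", sample_means.std(ddof=1))  # ~ sigma / sqrt(n)
print("std. error (theory):   ", sigma / np.sqrt(n))
```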
What is a one-sample t-test? When should you use it? +Formula
It is used to compare a result to an expected value.
You should use this test when:
- You do not know the population standard deviation (only an expected value μ to compare against).
- You have a single random sample and want to compare its mean to that expected value.
Formula: t = ( x̄ – μ) / (s / √n)
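A hedged sketch of the formula above in Python (the data and the expected value μ are invented for illustration; assumes NumPy and SciPy):

```python
import numpy as np
from scipy import stats

x = np.array([5.1, 4.9, 5.6, 5.3, 4.8, 5.4, 5.0, 5.2])
mu = 5.0  # expected (hypothesized) population mean

# manual t statistic: (sample mean - mu) / (sample std / sqrt(n))
t_manual = (x.mean() - mu) / (x.std(ddof=1) / np.sqrt(len(x)))

# same test via SciPy, which also returns the p-value
t_scipy, p_value = stats.ttest_1samp(x, popmean=mu)

print(t_manual, t_scipy, p_value)  # the two t values agree
```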
What can you tell me about Exploratory data analysis (EDA)? (2 bulletpoints)
- EDA is for seeing what the data can tell us beyond the formal modelling or hypothesis testing task.
- It is also the best paradigm to make statements about both validity and reliability
What scales of data are there? Explain and give an example. (a lot of text)
- Categorical (Nominal)
○ uses labels to classify cases into classes
○ gender, nationality, residence, car brand
- Ordinal
○ permissible transformation: any monotonic increasing function
○ if X > Y then log(X) > log(Y)
○ PRESERVES ORDER NOT MAGNITUDE
○ ratings and rankings
○ Example: not at all, slightly, fairly, much, very much
- Interval
○ permissible transformation: Y = aX + b
○ i.e. What is the exact temperature in your city?
- Ratio
○ permissible transformation: Y = aX
○ difference to ordinal: it produces not only order but also makes the differences between values known, along with information about a true zero
○ i.e. How many children? 0, less than or equal to 2, more than 2
○ IT HAS A NATURAL ZERO POINT (total absence of the variable of interest, i.e. not having any children)
What are the properties of a reliable research tool?
A reliable research tool is consistent, stable, predictable and accurate.
What does the parallel forms reliability do? When do you use it?
It measures the correlation between two equivalent versions of a test. You use it when you have two different assessment tools or sets of questions designed to measure the same thing.
What does the test - retest reliability do?
It measures the consistency of a test's results when the same test is repeated over time.
What is the split half technique? What would ensure an acceptable level of reliability in the measurements?
It is a method used to check measuring instruments: the items are split into two halves, each half is scored, and one half is then correlated against the other half of the data. A correlation coefficient of 0.9 would ensure an acceptable level of reliability in the measurements.
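A minimal sketch of the split-half technique in Python (the item scores are simulated, using an assumed odd/even split of the items; requires NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
true_score = rng.normal(size=30)                                      # latent trait per respondent
items = true_score[:, None] + rng.normal(scale=0.5, size=(30, 10))   # 10 noisy items

half_a = items[:, 0::2].sum(axis=1)  # odd-numbered items
half_b = items[:, 1::2].sum(axis=1)  # even-numbered items

r = np.corrcoef(half_a, half_b)[0, 1]
print("split-half correlation:", r)  # around 0.9 would indicate acceptable reliability
```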
What is the inter - rater reliability? How is it calculated?
It is the extent to which two or more raters agree. It is calculated with COHEN'S KAPPA. (Formula: κ = (p_o − p_e) / (1 − p_e) )
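A minimal sketch computing Cohen's kappa for two raters from the formula above (the ratings are invented for illustration):

```python
from collections import Counter

rater1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
rater2 = ["yes", "no",  "no", "yes", "no", "yes", "yes", "yes"]

n = len(rater1)

# p_o: observed proportion of agreement
p_o = sum(a == b for a, b in zip(rater1, rater2)) / n

# p_e: agreement expected by chance, from each rater's label frequencies
c1, c2 = Counter(rater1), Counter(rater2)
p_e = sum(c1[label] * c2[label] for label in set(rater1) | set(rater2)) / n**2

kappa = (p_o - p_e) / (1 - p_e)
print("observed:", p_o, "expected:", p_e, "kappa:", kappa)
```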
How is the standard normal distribution curved? Give its 2 parameters and their values.
It is bell curved.
The parameters are the mean ( = 0 ) and the standard deviation ( = 1 ).
What is the difference between a T-distribution and a normal distribution?
A T-distribution has the same bell shape as a normal distribution but heavier tails; as the degrees of freedom increase, it approaches the normal distribution.
What do you know about the Monte Carlo method?
● Any problem that might be deterministic in principle can be solved by MC. It relies on repeated random sampling in order to obtain a good estimate or approximation of the exact p-value.
● MC is used when the data set does not meet the requirements necessary for parametric or asymptotic methods.
● Computing an exact p-value is possible via Exact tests, Randomization tests, but only for small data sets. MC can also work with large data sets.
● The Monte Carlo method tells you:
○ All of the possible events that could or will happen,
○ The probability of each possible outcome.
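A minimal sketch (an assumed example, not from the lecture) of a Monte Carlo approximation of a p-value: instead of enumerating every permutation (exact test), we repeatedly shuffle group labels and count how often the shuffled difference in means is at least as extreme as the observed one.

```python
import numpy as np

rng = np.random.default_rng(1)
group_a = np.array([4.1, 5.0, 4.7, 5.3, 4.9])
group_b = np.array([5.6, 5.9, 5.4, 6.1, 5.8])

observed = abs(group_a.mean() - group_b.mean())
pooled = np.concatenate([group_a, group_b])

reps = 10_000
count = 0
for _ in range(reps):
    perm = rng.permutation(pooled)                    # random reassignment of labels
    diff = abs(perm[:5].mean() - perm[5:].mean())
    count += diff >= observed

print("Monte Carlo p-value estimate:", count / reps)
```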
What are regularisation techniques? + Examples
They are techniques used to manage the Bias Variance Trade-Off: they introduce a small amount of bias by shrinking (slightly changing) the slope of the regression line, which in turn reduces variance. Lasso and Ridge regression are examples.
Explain “Bias Variance Trade-Off”? What is the name of the techniques used here?
● Bias–variance tradeoff is the property of a model that the variance of the parameter estimated across samples can be reduced by increasing the bias in the estimated parameters.
● Regularization techniques are used.
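A hedged sketch of this idea in Python (simulated data, assuming scikit-learn is available): Ridge and Lasso pull the regression slope towards zero relative to ordinary least squares, adding a little bias but reducing variance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=2.0, size=40)  # true slope = 3

for model in (LinearRegression(), Ridge(alpha=10.0), Lasso(alpha=0.5)):
    fitted = model.fit(X, y)
    print(type(model).__name__, "slope:", fitted.coef_[0])

# The regularised slopes are shrunk towards zero (biased) compared to OLS.
```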
What is the standard error?
It is the standard deviation of the sampling distribution.
Explain type 1 and type 2 error.
- Type 1 error (false positive): rejecting the null hypothesis although it is actually true.
- Type 2 error (false negative): failing to reject the null hypothesis although it is actually false.
What is sampling bias? Where does it come from? What Non Random Sample types are there? (important) Explain them. (not crucial I guess, but interesting)
Sampling bias is a type of selection bias and involves systematic error due to a non random sample of a population:
- Convenience sampling is a method where market research data is collected from a conveniently available pool of respondents. (remember WEIRD from the cognitive science lectures).
- Snowball sampling is a technique where existing participants recruit further participants, so the information comes from “somewhere” and cannot be fully traced and verified, e.g. samples of drug addicts or gamblers. Where does the snowball come from?
- Quota sampling is a technique in which researchers choose individuals (for the sample) according to specific traits or qualities.
Explain the different variables there are in an experiment.
- Independent variable: A variable the experimenter changes or controls and is assumed to have a direct effect on the dependent variable.
- Dependent variable: A variable being tested and measured in an experiment; it is “dependent” on the independent variable.
- Extraneous variables: All variables that are not the independent variable but could still affect the results of the experiment.
What is “residual”?
It is the difference between the observed value and the value that a supervised learning model predicts for that observation. In other words, it is a measure of how much a regression line vertically misses a data point.
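A minimal sketch of residuals in Python (the data points are invented; requires NumPy): observed y minus the value predicted by a fitted least-squares line.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares regression line
predicted = slope * x + intercept
residuals = y - predicted                   # vertical misses of the line

print("residuals:", residuals)
print("residuals sum to ~0:", residuals.sum())
```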
Explain prevalence, sensitivity, specificity, Positive Predictive Value, Negative Predictive Value and accuracy and give their formulas.
● Prevalence: Total number of cases of a disease existing in a population divided by the total population. P(Z) = ( TP + FN ) / ( TP + TN + FP + FN )
● Sensitivity: the proportion of people with the disease who will have a positive test result; P(T|Z) = TP / ( TP + FN ) [people with the disease]
● Specificity: the proportion of people without the disease who will have a negative result; P(-T|-Z) = TN / ( TN + FP ) [people without the disease]
● Positive Predictive Value: the probability that patients with a positive test result actually have the disease; P(Z|T) = TP / ( TP + FP ) [people with a positive test]
● Negative Predictive Value: the probability that people with a negative test result truly do not have the disease; P(-Z|-T) = TN / ( TN + FN ) [people with a negative test]
● Accuracy: It measures the correctness of a diagnostic test on a condition. (TP + TN) / total
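A minimal sketch computing these metrics from confusion-matrix counts (the counts themselves are invented for illustration):

```python
TP, FN, FP, TN = 80, 20, 30, 870  # hypothetical screening-test results
total = TP + FN + FP + TN

prevalence  = (TP + FN) / total  # people who actually have the disease
sensitivity = TP / (TP + FN)     # P(T|Z)
specificity = TN / (TN + FP)     # P(-T|-Z)
ppv         = TP / (TP + FP)     # P(Z|T)
npv         = TN / (TN + FN)     # P(-Z|-T)
accuracy    = (TP + TN) / total

print(prevalence, sensitivity, specificity, ppv, npv, accuracy)
```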
What happens if the standard error of the mean gets decreased?
It will also decrease the difference between the lower and the upper bound of the confidence interval and allow for more accurate, precise conclusions. In other words, the smaller the standard error, the narrower the confidence interval and the more precise the conclusions.
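A minimal sketch of this point in Python (simulated data; assumes NumPy and SciPy): a smaller standard error of the mean gives a narrower 95% confidence interval.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
population = rng.normal(loc=100, scale=15, size=100_000)

for n in (10, 100, 1000):
    sample = rng.choice(population, size=n, replace=False)
    se = sample.std(ddof=1) / np.sqrt(n)          # standard error of the mean
    t_crit = stats.t.ppf(0.975, df=n - 1)         # 95% critical value
    low, high = sample.mean() - t_crit * se, sample.mean() + t_crit * se
    print(f"n={n:4d}  SE={se:5.2f}  CI width={high - low:5.2f}")
```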
What are computer intensive techniques? What techniques / concepts are considered as CIT’s?
● Sometimes called resampling
● Involve intensive use of computers to compute thousands of new samples, divergent statistics or other values of interest to do inferential statistics in an “empirical” way, to improve system performance and to validate models
● Examples of CIT:
○ Bootstrapping
○ Monte Carlo Methods
○ Randomisation Test, Permutation Tests, Exact Tests
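A minimal sketch of bootstrapping, one of the CITs listed above (the observed data are invented; requires NumPy): resample the data with replacement many times to obtain an empirical sampling distribution of the mean and a percentile confidence interval.

```python
import numpy as np

rng = np.random.default_rng(4)
data = np.array([12.1, 9.8, 11.4, 10.2, 13.0, 9.5, 10.9, 11.7, 10.4, 12.3])

# resample with replacement and record the mean of each resample
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(10_000)
])

print("bootstrap SE of the mean:", boot_means.std(ddof=1))
print("95% percentile CI:", np.percentile(boot_means, [2.5, 97.5]))
```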
A contingency table allows for …
… exact probability statements.