Inferential statistics Flashcards
(32 cards)
Outlier
An outlying observation, or outlier, is one that appears to deviate
markedly from other members of the sample in which it occurs.
Common criteria:
• more than 3 SD away from the mean
• more than 1.5 times the IQR beyond the quartiles (mild) or more than 3 times
(extreme): the customary boxplot criteria
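The two criteria can be compared on a small sample. The following is a minimal sketch, not a definitive implementation: the data are made up, the function names are illustrative, and the quartiles use the "inclusive" method of Python's statistics module (other quartile conventions give slightly different fences).

```python
import statistics

def sd_outliers(values, k=3.0):
    """Values more than k sample standard deviations away from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * sd]

def iqr_outliers(values, factor=1.5):
    """Values more than factor * IQR below Q1 or above Q3 (boxplot rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < low or v > high]

# Made-up sample with one suspicious value (40).
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 40]

print("mild (1.5 x IQR):  ", iqr_outliers(data, 1.5))
print("extreme (3 x IQR): ", iqr_outliers(data, 3.0))
print("3 SD rule:         ", sd_outliers(data))
```

In this sample the IQR rule flags 40 as an extreme outlier, while the 3 SD rule flags nothing: the outlier inflates the mean and the SD that are used to judge it, which is one reason the boxplot criteria are often preferred for small samples.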
Confidence Interval (CI)
Confidence interval
We need to quantify our uncertainty about the population value
• a confidence interval states our uncertainty
• confidence intervals are available for means,
differences between means, proportions, correlations…
A confidence interval is a range of values surrounding the sample estimate that is likely to contain the population parameter. Example: the prediction is 0.25. With a 95% CI, the interval would cover the true value in 95 out of 100 cases.
95% CI ≈ mean ± 2 SE
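The rule of thumb for the mean can be sketched in a few lines. The sample values are made up and the function name is illustrative; the factor 2 is the card's approximation of 1.96, and for small samples a t-based factor would give a somewhat wider interval.

```python
import math
import statistics

def ci95_mean(values):
    """Approximate 95% confidence interval for the mean: mean +/- 2 SE."""
    mean = statistics.mean(values)
    # Standard error of the mean: sample SD divided by sqrt(n).
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean - 2 * se, mean + 2 * se

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up measurements
low, high = ci95_mean(sample)
print(f"mean = {statistics.mean(sample)}, 95% CI = {low:.2f} to {high:.2f}")
```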
H0
• H0 with one group:
- our sample proportion is not different from the
known population proportion
- our sample mean is not different from the known population mean
• H0 with two or more groups:
- groups are not different from each other regarding their population proportions or means
• alternative hypothesis: the same sentences, but without the word "not"
p-value
The probability of observing a difference as large as, or larger than, the one seen in the sample if H0 were true!
• Can we reject the null hypothesis?
• We will never know if the null hypothesis is true or false!!
• We almost always observe a difference!!
• One group: we need to know the population
mean/proportion! The p-value tells us how often random sampling from that population would give a deviation as large as, or larger than, the one in our sample
• Two groups: the p-value tells us how often we would get the observed or a larger difference by random sampling from two populations if the means/proportions of both populations were equal
• The p-value does not(!!) tell us how sure we are that there really is a difference
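The two-group reading of the p-value can be made concrete with a permutation test. This is a swapped-in illustration, not the test named on the cards: group labels are shuffled many times, and we count how often chance alone gives a difference in means at least as large as the observed one. The data, function name, and number of shuffles are assumptions.

```python
import random

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)

    def mean_diff(values):
        a, b = values[:n_a], values[n_a:]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    pooled = list(group_a) + list(group_b)
    observed = mean_diff(pooled)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # random relabelling, as if H0 were true
        if mean_diff(pooled) >= observed:
            hits += 1
    return hits / n_shuffles

# Made-up measurements for two groups.
control = [1, 2, 3, 4, 5]
treated = [101, 102, 103, 104, 105]
print("p =", permutation_p_value(control, treated))
```

A small p-value here only says that such a difference is rare under random relabelling; it does not say how probable it is that the difference is real.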
CI vs p-value
- p-values only measure how easily chance alone could produce the result
- CIs tell you about your proportions and the probable value in the population
- CIs are often the better measure, but unfortunately they are used less frequently in science
- maybe because reporting one number is easier than reporting two?
Significance
• statistically significant does not necessarily mean that
the observation is "important"
• just a customary threshold (α) for the p-value to claim
significance (mostly < 0.05)
• significant *, highly significant **, extremely
significant ***
Testing Numeric vs Numeric
Numeric vs Categoric
Categoric vs. Categoric
• Numeric vs Numeric
- Correlation, Regression
• Numeric vs Categoric - t-test, anova (today)
• Categoric vs Categoric - chisq-test, fisher-test
Central Limit Theorem
No matter how the population is distributed, the distribution of sample means will approximate a Gaussian distribution if the sample size is large enough.
What is large? It depends:
• a less normal distribution needs larger samples (100 is usually enough)
• a more normal distribution needs fewer (10 or more)
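A small simulation illustrates the theorem. It is a sketch under stated assumptions: the population is exponential with mean 1 and SD 1 (strongly skewed), and the sample sizes, the number of samples, and the function name are chosen for illustration only.

```python
import random
import statistics

def sample_means(sample_size, n_samples, seed=1):
    """Means of n_samples random samples drawn from a skewed population."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_samples):
        # Exponential population: mean 1, SD 1, clearly non-normal.
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(sum(sample) / sample_size)
    return means

for size in (2, 10, 100):
    means = sample_means(size, n_samples=2000)
    print(f"n = {size:3d}: mean of means = {statistics.mean(means):.3f}, "
          f"SD of means = {statistics.stdev(means):.3f}")
```

The mean of the sample means stays close to the population mean of 1, and their SD shrinks roughly like 1/sqrt(n); a histogram of the means for n = 100 would look close to a bell curve even though the population does not.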
Properties of a Normal Distribution
• symmetrical, bell-shaped distribution
• extends to infinity in both directions
• mean and median are close to each other
• 95% of all values lie within 2 SD of the mean
• this assumption gives very wrong results if the distribution is non-normal!!
• normal data --> t.test
• non-normal data, skewed or multi-modal distributions --> wilcox.test
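The "95% within 2 SD" property can be checked directly from the normal cumulative distribution function. A minimal sketch using NormalDist from Python's statistics module; the helper name is illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, SD 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def fraction_within(k):
    """Fraction of a normal distribution lying within k SD of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 1.96, 2, 3):
    print(f"within {k} SD: {fraction_within(k):.4f}")
```

Within 2 SD the fraction is about 95.4%; the exact 95% boundary lies at about 1.96 SD, which is why 2 is used as a convenient rule of thumb.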
t distribution
- Derives from the normal distribution
* t is the difference between the sample mean and the population mean, divided by the SEM (standard error of the mean)
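As a worked example, the t value for one sample against a known population mean can be computed as follows. The sample and the population mean of 4 are made up, and the function name is illustrative.

```python
import math
import statistics

def t_statistic(values, population_mean):
    """t = (sample mean - population mean) / SEM."""
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return (statistics.mean(values) - population_mean) / sem

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample with mean 5
t = t_statistic(sample, population_mean=4)
print(f"t = {t:.3f} with {len(sample) - 1} degrees of freedom")
```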
Effect size: Cohen's d
How large is the difference between the two group means in comparison to the standard deviation?
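A minimal sketch of the calculation, assuming the pooled standard deviation of both groups as the denominator (the card does not say which SD is meant, so this choice is an assumption). The data and the function name are made up.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Difference of group means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    # Pooled SD: variances weighted by their degrees of freedom.
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

group_a = [2, 4, 6]  # made-up values: mean 4, SD 2
group_b = [1, 3, 5]  # made-up values: mean 3, SD 2
print("d =", cohens_d(group_a, group_b))
```

Here the means differ by 1 and the pooled SD is 2, so d is 0.5: the groups are half a standard deviation apart.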
Correlation
• observe the association between two numerical
variables
• if two numerical variables are associated we say they are correlated
• the correlation coefficient is a quantity that describes the strength of the association
4 interpretations of r
Why do the variables correlate so well?
• Lipid content of membranes determines insulin sensitivity?
• Insulin sensitivity affects lipid content?
• Insulin sensitivity and lipid content are controlled by a
third factor?
• There is no correlation, our r is just a random finding (type 1 error)?
• We do not know the truth …
• Correlation does not mean causation!!!
Correlation: r-squared (r^2)
• r^2 is often also called the coefficient of determination
• r is between -1 and 1
• r^2 is between 0 and 1 and never larger than |r|
• r^2 is interpreted as the fraction of variance that is
shared between the variables
• runners: 0.19^2 = 0.036, meaning that only 3.6% of
the variance of time is shared with age
• students: 0.78^2 = 0.6084, meaning that about 60% of the weight variance is shared with size
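The following sketch computes Pearson's r from its definition and squares it. The age and time values are made up and are not the runners data from the card; only the two r values in the last line (0.19 and 0.78) come from the card.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up example data.
age = [25, 32, 41, 47, 53, 60]
run_time = [52, 49, 58, 51, 60, 57]

r = pearson_r(age, run_time)
print(f"r = {r:.3f}, r^2 = {r ** 2:.3f}")

# Squaring the two coefficients quoted on the card.
print(round(0.19 ** 2, 4), round(0.78 ** 2, 4))
```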
Influence of outliers
• Just one point can change everything with Pearson correlation!
Spearman Rank Correlation
- Spearman correlation is more robust against outliers!
- Correlation with one outlier is not significant!!
- Spearman correlation is calculated on the ranks of the values.
- It’s a non-parametric test.
- It does not assume a normal distribution of the data.
- It is more conservative.
- If in doubt, use Spearman correlation.
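The robustness against outliers can be demonstrated by computing Spearman's coefficient as a Pearson correlation on ranks. This is a sketch with made-up data: six unrelated points plus one extreme point at (20, 20). The function names are illustrative, and tied values receive their average rank.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rs(xs, ys):
    """Spearman rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

# Made-up data: the last point is a single extreme outlier.
x = [1, 2, 3, 4, 5, 6, 20]
y = [4, 6, 1, 5, 2, 3, 20]

print(f"Pearson r   = {pearson_r(x, y):.3f}")
print(f"Spearman rs = {spearman_rs(x, y):.3f}")
```

With the outlier included, Pearson's r is about 0.90 while Spearman's rs is about 0.14, because on the rank scale the outlier is only one step above the next value.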
Effect size: Pearson's r and Spearman's rs
• Pearson's r and Spearman's rs are quite similar in their values
• but rs^2 is the proportion of variance shared between the ranks
• Kendall's tau is numerically different:
66-75% of r or rs; don't square it
• r of 0.1 small effect, 1% of variance
• r of 0.3 medium effect, 9% of variance
• r of 0.5 large effect, 25% of variance
partial correlation
In probability theory and statistics, partial correlation measures the degree of association between two random variables, with the effect of a set of controlling random variables removed. If we are interested in finding whether or to what extent there is a numerical relationship between two variables of interest, using their correlation coefficient will give misleading results if there is another, confounding, variable that is numerically related to both variables of interest.
• partial correlation of body height and weight after removing the effect of sex
• correlation of shoe size and writing capabilities after removing the effect of …
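For a single controlling variable z, the partial correlation can be computed from the three pairwise correlations with the standard first-order formula r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)). The sketch below uses hypothetical correlation values for height, weight, and sex; they are not taken from any data set.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator

# Hypothetical pairwise correlations (x = height, y = weight, z = sex).
r_height_weight = 0.70
r_height_sex = 0.60
r_weight_sex = 0.50

print(f"raw r     = {r_height_weight:.2f}")
print(f"partial r = {partial_correlation(r_height_weight, r_height_sex, r_weight_sex):.2f}")
```

If z is unrelated to both variables, the partial correlation equals the raw one; if the raw correlation is entirely explained by z (r_xy = r_xz * r_yz), it drops to zero.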
Mutual Information
- Computes the count of a particular succession of symbols relative to the count of all successions (for example, the number of all AC pairs compared with all pairs)
o Provides correlation information, such as how statistical independence correlates across the sequence for all pairs
o Incidentally also called Boltzmann entropy, …
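One way to turn the pair counts into a number is the standard mutual information between neighbouring symbols, I = sum over pairs of p(a,b) * log2(p(a,b) / (p(a) * p(b))). Applying this definition to adjacent pairs is an assumption about what the card means; the sequences and the function name are made up.

```python
import math
from collections import Counter

def adjacent_pair_mutual_information(sequence):
    """Mutual information (in bits) between neighbouring symbols."""
    pairs = list(zip(sequence, sequence[1:]))
    n = len(pairs)
    pair_counts = Counter(pairs)
    first_counts = Counter(a for a, _ in pairs)
    second_counts = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n           # frequency of this pair among all pairs
        p_a = first_counts[a] / n  # frequency of a in the first position
        p_b = second_counts[b] / n # frequency of b in the second position
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi

print(adjacent_pair_mutual_information("ACACACAC"))  # strictly alternating
print(adjacent_pair_mutual_information("AAAAAAAA"))  # no information
```

A value of zero means that neighbouring symbols are statistically independent; larger values mean that knowing one symbol tells you more about its neighbour.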
Correlation vs Regression
Correlation
• description of an undirected relationship between two or more
variables
• how strong it is
• direction is not known, does not exist, or we are simply not interested in it
• phones in household and baby deaths
Regression
• description of a directed relationship between two or more variables
• one variable influences the other
• smoking and cancer
• weight and height
• model to describe the relationship
• model to predict one variable
Regression: aims, types
- looking for a trend: linear, sigmoid, exponential
- curve fitting: which model is most similar to the data
- prediction: predict the response variable Y from X
- standard curve: assays
Regression Types
• simple linear regression (numerical variables)
• multiple linear regression (numerical variables)
• logistic regression (Y is categorical)
• non-linear regression (numerical variables)
• regression trees (Y is numerical)
• classification trees (Y is categorical)
Simple linear regression
• most common regression type
• method to find the best straight line through a cloud of data points
• one variable (independent) is used to predict a second (dependent)
In statistics, simple linear regression is a linear regression model with a single explanatory variable. That is, it concerns two-dimensional sample points with one independent variable and one dependent variable (conventionally, the x and y coordinates in a Cartesian coordinate system) and finds a linear function (a non-vertical straight line) that, as accurately as possible, predicts the dependent variable values as a function of the independent variable. The adjective simple refers to the fact that the outcome variable is related to a single predictor.
Regression: predicting values
Use the equation to determine Y for certain values of X. Example: what would the insulin response be for C20.22 values of 0, 15, 16 and 17%?
Linear regression: slope and intercept
Intercept (a, alpha): value of Y if X is zero (Y-intercept).
Slope (b, beta): increase in Y per one-unit increase in X.
Example: y = 2x + 1
beta = sum((x_i - mean(x)) * (y_i - mean(y))) / sum((x_i - mean(x))^2)
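The slope formula can be turned into a short least-squares fit. This is a sketch with made-up points that lie exactly on the example line y = 2x + 1; the intercept is obtained from the standard relation alpha = mean(y) - beta * mean(x), which the card does not state, and the function names are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares intercept (alpha) and slope (beta) of y = alpha + beta * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x  # the fitted line passes through (mean_x, mean_y)
    return alpha, beta

def predict(alpha, beta, x):
    """Predicted Y for a given X."""
    return alpha + beta * x

# Made-up points on the line y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]

alpha, beta = fit_line(xs, ys)
print(f"intercept = {alpha}, slope = {beta}")
print("prediction for x = 10:", predict(alpha, beta, 10))
```

The fit recovers an intercept of 1 and a slope of 2; predicting values for new X, as in the insulin example, is then a matter of plugging them into the fitted equation.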