Data Types
Statistical Methods
Measures of Location
Measures of Variability
Measures of Shape
Skewness: tendency of the deviations from the mean to be larger in one direction than the other.
Kurtosis: a measure of the relative peakedness or flatness of the curve defined by the frequency distribution. The excess kurtosis of a normal distribution is 0.
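A minimal sketch of these shape measures using SciPy on simulated data (the samples are made up for illustration). Note that scipy.stats.kurtosis returns excess kurtosis by default, so a normal distribution gives a value near 0, matching the definition above.

```python
# Skewness and (excess) kurtosis of two simulated samples.
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=100_000)       # symmetric, mesokurtic
skewed_sample = rng.exponential(size=100_000)  # right-skewed

print(skew(normal_sample))      # near 0: deviations balanced around the mean
print(kurtosis(normal_sample))  # near 0: excess kurtosis of a normal is 0
print(skew(skewed_sample))      # clearly positive: right tail dominates
```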
Steps of Hypothesis Testing
Tools for bivariate analysis with continuous/categorical variables
Categorical/C -> Contingency Tables
Quantitative/Q -> Linear Correlation
Categorical/Quantitative -> ANOVA
Statistical independence in contingency tables
Two variables are independent if, in the column-profile and row-profile tables, all columns (respectively all rows) are identical, and equal to the overall sample distribution.
Chi Square Index
The reference case of independence is useful to calculate the degree of association between the variables through an association measure.
χ² (Chi-Squared) compares the observed frequencies with the frequencies that would be expected if the null hypothesis of statistical independence were true.
If χ² = 0, X and Y are independent in the sample data.
For the population?
H0: the variables are independent in the population. Under H0, the χ² statistic follows a chi-square distribution with (nrow − 1)(ncol − 1) degrees of freedom.
H1: the variables are dependent in the population.
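This test can be sketched with scipy.stats.chi2_contingency; the contingency table below is made up for illustration.

```python
# Chi-square test of independence on a toy 2x2 contingency table.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 10],
                     [20, 40]])

chi2, p, dof, expected = chi2_contingency(observed)
# expected holds the frequencies implied by H0 (independence);
# dof = (nrow - 1) * (ncol - 1) = 1 here.
# A small p leads to rejecting H0, i.e. the variables are dependent.
print(chi2, p, dof)
```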
Cramer’s V
If we reject H0, and there is dependence, we can assess the strength of the relation with Cramér's V.
V = sqrt(χ² / (N · (min(nrow, ncol) − 1)))
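A sketch of this formula in Python, computing χ² without Yates' continuity correction so the hand formula matches exactly (the table is made up):

```python
# Cramer's V: strength of association between two categorical variables.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 10],
                     [20, 40]])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
n = observed.sum()
v = np.sqrt(chi2 / (n * (min(observed.shape) - 1)))
print(v)  # in [0, 1]: 0 = independence, 1 = perfect association
```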
Covariance and Correlation
Covariance: tendency of two measures to vary in the same direction (positive) or not (negative).
Correlation: standardised covariance; covariance divided by the product of the two standard deviations.
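The relation between the two definitions can be checked numerically with NumPy on simulated data:

```python
# Correlation as standardised covariance.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(size=1000)  # y tends to vary with x -> positive covariance

cov = np.cov(x, y)[0, 1]
corr = cov / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Matches NumPy's built-in Pearson correlation.
print(cov, corr, np.corrcoef(x, y)[0, 1])
```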
ANOVA
Analyses relationship between numerical and categorical variable.
By comparing the within-category means of the numerical variable, one can see how it changes across the categories of the categorical variable.
One-way ANOVA F Test
(one-way = one categorical variable)
Is the difference in the sample means significant at the population level?
H0: the population means are equal across all c categories.
H1: not all the population means are equal (at least two differ).
F statistic, which under H0 follows an F distribution with (c − 1, n − c) degrees of freedom:
F = between group variability/within group variability = [BSS/(c-1)] / [WSS/(n-c)]
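The formula above can be sketched in Python, computing F by hand from BSS and WSS and checking it against scipy.stats.f_oneway (the three groups are simulated, with means 0.0, 0.5, and 1.0 assumed for illustration):

```python
# One-way ANOVA F test: library call vs. manual BSS/WSS computation.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(2)
groups = [rng.normal(loc=mu, scale=1.0, size=50) for mu in (0.0, 0.5, 1.0)]

f_stat, p_value = f_oneway(*groups)

# Manual computation following F = [BSS/(c-1)] / [WSS/(n-c)].
all_data = np.concatenate(groups)
grand_mean = all_data.mean()
c, n = len(groups), len(all_data)
bss = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)  # between
wss = sum(((g - g.mean()) ** 2).sum() for g in groups)            # within
f_manual = (bss / (c - 1)) / (wss / (n - c))

print(f_stat, f_manual, p_value)  # a small p rejects equal population means
```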
Assumptions: