Data Types Flashcards
3 types of data
quantitative, qualitative, textual
Quantitative data (1 word)
numerical
Qualitative data (1 word)
categorical
2 types of quantitative data
continuous and discrete
numerical qualitative data
numbers used as category labels, not as mathematical quantities
2 types of qualitative data
ordinal (ordered), nominal (unordered)
1st step of investigating data
examine the univariate statistics
purposes (3) of starting with univariate stats
1) detect distribution anomalies 2) get an idea of some orders of magnitude 3) see how to discretize the continuous variables, if needed
type of plot for univariate review of discrete or qualitative data
frequency table
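A frequency table can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the `frequency_table` helper and the `colors` sample are made up for the example.

```python
from collections import Counter

def frequency_table(values):
    """Return (value, count, share) rows, most frequent value first."""
    counts = Counter(values)
    total = len(values)
    return [(value, count, count / total) for value, count in counts.most_common()]

# Hypothetical qualitative variable
colors = ["red", "blue", "red", "green", "red", "blue"]

for value, count, share in frequency_table(colors):
    print(f"{value:<6} {count:>3} {share:.0%}")
```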
type of plot for univariate review of continuous data
box plot, taking note of extreme percentiles
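The numbers behind a box plot, plus the extreme percentiles, can be computed as follows. This is a sketch under assumptions: the `box_plot_summary` helper, the choice of the 1st and 99th percentiles as "extreme", and the sample data are all illustrative, not part of the original notes.

```python
import statistics

def box_plot_summary(values):
    """Quartiles plus 1st/99th percentiles for a continuous variable."""
    # method="inclusive" interpolates between observed points only,
    # so every result stays within [min, max].
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    percentiles = statistics.quantiles(values, n=100, method="inclusive")  # 99 cut points
    return {
        "min": min(values),
        "p1": percentiles[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "p99": percentiles[98],
        "max": max(values),
    }

# Hypothetical measurements with one suspiciously large value
measurements = [12, 15, 14, 13, 16, 15, 14, 13, 15, 250]
summary = box_plot_summary(measurements)
for name, value in summary.items():
    print(f"{name:<7}{value}")
```

A large gap between `q3` and `p99` or `max`, as in this sample, is the kind of signal worth checking before going further.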
purposes (3) of bivariate analysis
1) detect incompatible variables 2) find links between the dependent (target) variable and the independent variables 3) find links between independent variables
simple table for bivariate analysis
contingency table
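A contingency table counts how often each pair of categories occurs together. The sketch below uses only the standard library; the `contingency_table` helper and the `gender` / `churn` variables are hypothetical examples.

```python
from collections import Counter

def contingency_table(row_values, col_values):
    """Cross-tabulate two qualitative variables of equal length."""
    if len(row_values) != len(col_values):
        raise ValueError("both variables must have the same number of records")
    pair_counts = Counter(zip(row_values, col_values))
    row_labels = sorted(set(row_values))
    col_labels = sorted(set(col_values))
    # Counter returns 0 for pairs that never occur together
    return {r: {c: pair_counts[(r, c)] for c in col_labels} for r in row_labels}

# Hypothetical data: one entry per record
gender = ["F", "M", "F", "F", "M", "M"]
churn = ["yes", "no", "no", "yes", "no", "yes"]

table = contingency_table(gender, churn)
for row_label, cells in table.items():
    print(row_label, cells)
```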
2nd step of investigating data
Rare or Missing values
problem with rare values
can create bias in factor analysis or skew in measures of center
dealing with rare values
remove or replace with more frequent value
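Replacing rare values with the most frequent one might look like this. The `replace_rare` helper and its `min_share` threshold are assumptions made for the illustration; the notes do not say how rare a value must be.

```python
from collections import Counter

def replace_rare(values, min_share=0.05):
    """Replace values whose share is below min_share with the most frequent value."""
    if not values:
        return []
    counts = Counter(values)
    total = len(values)
    most_frequent = counts.most_common(1)[0][0]
    return [most_frequent if counts[v] / total < min_share else v for v in values]

# Hypothetical category column: "z" appears once in 20 records (5%)
categories = ["a"] * 10 + ["b"] * 9 + ["z"]
cleaned = replace_rare(categories, min_share=0.10)
print(Counter(cleaned))
```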
problem with missing values
1) may not be random, skewing data 2) accumulate across multiple variables, so many records end up incomplete
dealing with missing values (4 options)
1) remove records 2) remove/replace variable 3) replace value 4) treat ‘missing’ qualitative data as its own value
when missing values >= 15-20% of values
cannot replace values or treat missing data as its own value
Statistical replacement of the missing values uses a process called
imputation
simplest method of imputation
replace missing value with most frequent value or mean/median
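The simplest imputation can be sketched with the standard library. The `impute` helper and the convention of marking missing values with `None` are assumptions made for this example.

```python
import statistics

def impute(values, strategy="median"):
    """Fill None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":  # most frequent value; also works for categories
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

print(impute([1, None, 3, 8], "median"))
print(impute(["x", None, "y", "x"], "mode"))
```

The median is usually the safer choice when the variable has extreme values, since the mean is pulled toward them.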
most widespread imputation model (simple imputation)
each missing value is replaced with an assumed value
multiple imputation
missing values are replaced with multiple plausible values, creating several complete data tables
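The idea of producing several completed tables can be illustrated with a deliberately simplified stand-in: each missing value is drawn at random from the observed values, and the draw is repeated `m` times. Real multiple imputation typically draws from a fitted statistical model and then combines the analyses of all tables; the `multiple_imputation_sketch` helper, its `m` and `seed` parameters, and the sample data are hypothetical.

```python
import random

def multiple_imputation_sketch(values, m=3, seed=0):
    """Return m completed copies of values, filling None with random observed values."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to draw from")
    return [
        [rng.choice(observed) if v is None else v for v in values]
        for _ in range(m)
    ]

# Hypothetical column with two missing entries
ages = [23, None, 31, 27, None, 45]
tables = multiple_imputation_sketch(ages, m=3)
for completed in tables:
    print(completed)
```

Each completed table would then be analyzed separately, and the spread between results gives a sense of the uncertainty introduced by the missing values.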
3rd step of investigating data
Aberrant Values
define aberrant value
erroneous value: can be caused by incorrect measurement, calculation error, input error, false declaration.