data exploration Flashcards
(34 cards)
data does not
- speak for itself
- it can be biased and is not objective (based on how selected)
- the people behind it interpret the data
answers/results depend on...
question to solve and perspective
types of data sets
- cross-sectional
- time-series
- panel
cross-sectional
- many subjects/variables, one point in time
- eg sales, expenses, profit
time-series
- one subject/variable, many points in time
- eg sales over time
panel
- many subjects/variables, many points in time
- eg sales, expenses, profit over time
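The three data set types above differ only in which dimension varies. A minimal sketch, assuming made-up store names and figures (everything here is hypothetical and only illustrates the shapes):

```python
# Hypothetical figures, used only to show the shape of each data set type.

# Cross-sectional: many subjects/variables, one point in time.
cross_sectional = {
    "store_a": {"sales": 120, "expenses": 80, "profit": 40},
    "store_b": {"sales": 95, "expenses": 70, "profit": 25},
}

# Time-series: one subject/variable, many points in time.
time_series = {"2023-Q1": 120, "2023-Q2": 130, "2023-Q3": 125}

# Panel: many subjects/variables, many points in time,
# keyed here by (subject, period).
panel = {
    ("store_a", "2023-Q1"): {"sales": 120, "expenses": 80, "profit": 40},
    ("store_a", "2023-Q2"): {"sales": 130, "expenses": 85, "profit": 45},
    ("store_b", "2023-Q1"): {"sales": 95, "expenses": 70, "profit": 25},
    ("store_b", "2023-Q2"): {"sales": 90, "expenses": 72, "profit": 18},
}


def subjects_and_periods(panel_data):
    """Count the distinct subjects and distinct periods in a panel."""
    subjects = {subject for subject, _ in panel_data}
    periods = {period for _, period in panel_data}
    return len(subjects), len(periods)
```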
dimensions of data quality
- completeness
- consistency
- conformity
- accuracy
- integrity
- timeliness
completeness
data is comprehensive and meets expectations
consistency
data across all systems, or sourced from different places, reflects the same information
conformity
data follows a set of standard data definitions, such as data type, size and format
accuracy
data correctly reflects the real-world object or event being described
integrity
all data in a database can be traced and connected to other data
timeliness
information is available when it is expected and needed
first two steps of data cleansing/processing
- sourcing raw data
- technically correct data
sourcing raw data
What do we want and need to achieve?
What data will support this outcome?
How can we source it and ensure it is of a high quality?
technically correct data
- each value can be directly recognised as belonging to a certain variable
- each value is stored in a data type that represents the value domain of the real-world variable
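One way to picture technically correct data is as raw text converted into typed values. A minimal sketch, assuming a hypothetical record with the fields `name`, `age`, `sales` and `joined` (none of these come from the cards):

```python
from datetime import date, datetime


def to_technically_correct(raw):
    """Convert one raw text record into typed values.

    The field names are hypothetical; each value ends up in a type
    that matches the value domain of the variable it belongs to.
    """
    return {
        "name": raw["name"].strip(),          # text
        "age": int(raw["age"]),               # whole number
        "sales": float(raw["sales"]),         # continuous amount
        "joined": datetime.strptime(raw["joined"], "%Y-%m-%d").date(),  # calendar date
    }


raw_record = {"name": " Ann ", "age": "34", "sales": "1250.50", "joined": "2023-01-15"}
record = to_technically_correct(raw_record)
```

A value that cannot be converted (for example `"thirty"` for `age`) raises `ValueError`, which flags the record as not yet technically correct.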
data issues
- formatting/data type
- missing values
- outliers
formatting/data type
- sex: Male, M, Boy (same value, inconsistent labels)
- month: January, 1-Jan, 1 (same value, inconsistent formats)
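Inconsistent labels like these can be mapped onto one canonical value. A minimal sketch, assuming the lookup tables below (the target codes `"M"` and `1` are arbitrary choices for illustration):

```python
# Hypothetical lookup tables: every known variant maps to one canonical value.
SEX_LABELS = {"male": "M", "m": "M", "boy": "M"}
MONTH_LABELS = {"january": 1, "1-jan": 1, "1": 1}


def normalise(value, mapping):
    """Return the canonical value for a raw label, or None if unrecognised."""
    key = str(value).strip().lower()
    return mapping.get(key)


sexes = [normalise(v, SEX_LABELS) for v in ["Male", "M", "Boy"]]
months = [normalise(v, MONTH_LABELS) for v in ["January", "1-Jan", 1]]
```

Returning `None` for unrecognised labels turns a formatting problem into an explicit missing value, which can then be handled separately.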
missing values - listwise deletion
remove records with missing values in any variable
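Listwise deletion can be sketched in a few lines, assuming missing values are stored as `None` and each record is a dictionary (both are assumptions for this example):

```python
def listwise_delete(records):
    """Keep only the records that have no missing (None) value in any variable."""
    return [r for r in records if all(v is not None for v in r.values())]


# Hypothetical records; two of the four have a missing value.
records = [
    {"sales": 100, "expenses": 60},
    {"sales": None, "expenses": 40},
    {"sales": 80, "expenses": None},
    {"sales": 90, "expenses": 50},
]
complete = listwise_delete(records)
```

Here half of the records are lost, which shows the main cost of this approach when missing values are common.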
missing values - mode/median/mean imputation
- mean for continuous variables
- median for skewed continuous variables
- mode for categorical variables
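The three choices above differ only in the statistic used as the fill value. A minimal sketch using Python's `statistics` module, assuming missing values are stored as `None` (the sample values are made up):

```python
from statistics import mean, median, mode


def impute(values, strategy):
    """Replace each None with a statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]


continuous = impute([10, None, 20], mean)          # mean of 10 and 20
skewed = impute([1, 2, None, 100], median)         # median ignores the size of 100
categorical = impute(["red", None, "red", "blue"], mode)  # most frequent label
```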
missing values - model imputation
- interpolate/extrapolate
- use regression model to predict missing value
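Model imputation can be illustrated with the simplest model, a straight line fitted by least squares. This is a sketch under the assumption of one predictor `x` and missing `y` values stored as `None`; the function names are hypothetical:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def regression_impute(xs, ys):
    """Fill each missing y with the value predicted from the observed pairs."""
    known = [(x, y) for x, y in zip(xs, ys) if y is not None]
    slope, intercept = fit_line([x for x, _ in known], [y for _, y in known])
    return [slope * x + intercept if y is None else y for x, y in zip(xs, ys)]


# The gap at x=3 lies inside the observed range (interpolation);
# the gap at x=5 lies beyond it (extrapolation).
filled = regression_impute([1, 2, 3, 4, 5], [2, 4, None, 8, None])
```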
outliers - drop outlier record
remove the record completely to avoid severe skewness
outliers - winsorisation
- cap the outlier values
- limit extreme values in statistical data to reduce the effect of possibly spurious (false) outliers
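Capping can be sketched as follows, assuming a count-based variant in which the `k` smallest and `k` largest values are pulled in to their nearest remaining neighbour (the parameter `k` and the sample values are assumptions for this example):

```python
def winsorise(values, k=1):
    """Cap the k smallest and k largest values at the nearest remaining value."""
    if len(values) <= 2 * k:
        raise ValueError("need more than 2 * k values to winsorise")
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    # Values below `low` are raised to it; values above `high` are lowered to it.
    return [min(max(v, low), high) for v in values]


capped = winsorise([-50, 2, 3, 4, 5, 6, 7, 8, 9, 100], k=1)
```

Unlike dropping the record, every observation is kept; only its extreme value is limited.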
outliers - imputation
- replace the outlier with a new value
- eg the mean or a regression prediction
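Outlier imputation with the mean can be sketched as below. The cards do not say how outliers are detected, so this example assumes the caller supplies hypothetical `lower` and `upper` bounds, and it computes the mean from the in-range values only:

```python
from statistics import mean


def impute_outliers(values, lower, upper):
    """Replace values outside [lower, upper] with the mean of the in-range values."""
    in_range = [v for v in values if lower <= v <= upper]
    fill = mean(in_range)
    return [v if lower <= v <= upper else fill for v in values]


# Hypothetical bounds: anything outside 0..50 is treated as an outlier.
cleaned = impute_outliers([10, 12, 11, 13, 12, 95], lower=0, upper=50)
```

Computing the mean without the outlier matters: including 95 would pull the fill value well above the typical observations.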