Data & Management of Data Flashcards
(62 cards)
What is data integration?
Data integration is the process of combining data from separate sources into a single, coherent data store
What does it mean for a data source to be heterogeneous?
It differs from the other sources in format, structure, or the system that holds it
What does it mean for a data source to be homogeneous?
It shares the same format, structure, and system as the other sources, so it can be combined with them directly
Why may a data store be heterogeneous?
Incompatibility, either of version (both) or of type (file-based systems)
Scientists may perform data integration using a common user interface, which entails…
A data manager (interface) that handles every step of the integration process, from retrieval to presentation
Scientists may perform data integration using middleware data integration, which entails…
Using middleware software to bridge and facilitate communication between different heterogeneous systems
Scientists may perform data integration using application-based integration, which entails…
Using software applications to retrieve and integrate data
Scientists may perform data integration using uniform data access, which entails…
Providing a consistent view of data from diverse sources while leaving the data itself unaltered in its original location, e.g. via the cloud
Scientists may perform data integration using a common data store or data warehouse, which entails…
A system that collects a duplicate of data from other sources and stores it in one place, often for data analysis, presenting that data uniformly, e.g. as a graph or table
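As a rough illustration of the common data store idea, the sketch below copies records from two differently shaped sources into one uniform list. The source names (`lab_a`, `lab_b`), their fields, and the common schema are all hypothetical and chosen only for this example.

```python
# Hypothetical source 1: records with their own field names, in Celsius.
lab_a = [
    {"sample_id": "S1", "temp_c": 21.5},
    {"sample_id": "S2", "temp_c": 19.0},
]

# Hypothetical source 2: a different system that stores Fahrenheit.
lab_b = [
    {"id": "S3", "temperature_f": 68.0},
]

def from_lab_a(record):
    """Map a lab A record onto the common schema."""
    return {"sample": record["sample_id"], "temp_c": record["temp_c"], "source": "lab_a"}

def from_lab_b(record):
    """Map a lab B record onto the common schema, converting to Celsius."""
    celsius = (record["temperature_f"] - 32) * 5 / 9
    return {"sample": record["id"], "temp_c": round(celsius, 1), "source": "lab_b"}

# The "warehouse" holds a duplicate of every record in one uniform shape;
# the original sources are left untouched.
warehouse = [from_lab_a(r) for r in lab_a] + [from_lab_b(r) for r in lab_b]

for row in warehouse:
    print(row)
```

Every row in `warehouse` has the same three fields, which is what makes uniform presentation (a table or graph) straightforward afterwards.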
What is data cleaning?
The process of detecting and correcting (or removing) corrupt/inaccurate records in data
What is missing data?
Data we expect to have, but which is absent from our dataset
What can cause missing data? (pick two)
Human error, data type errors, incompatibility, lost records, failure to fetch
What is data missing completely at random (MCAR)?
Data is missing by pure chance, meaning the probability of a missing value is equal for all units
What is data missing at random (MAR)?
Some data is more likely to be missing than other data: the probability of a missing value is related to other observed values, not to the missing value itself
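What is data missing not at random (MNAR)?
The probability of a missing value is related to the unobserved value itself, so the missingness cannot be explained by the observed data alone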
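The difference between MCAR, MAR, and MNAR can be sketched by simulating each one on the same data. This is only an illustration: the survey fields (`age`, `income`), the probabilities, and the thresholds are made-up assumptions, not values from the cards.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical survey: each person has an age and an income (in thousands).
people = [
    {"age": random.randint(20, 70), "income": random.randint(20, 100)}
    for _ in range(1000)
]

def mcar(person):
    # Same 20% chance of a missing income for every unit.
    return random.random() < 0.2

def mar(person):
    # Chance depends on an observed value (age), not on income itself.
    return random.random() < (0.4 if person["age"] > 50 else 0.1)

def mnar(person):
    # Chance depends on the very value that goes missing (high incomes hide).
    return random.random() < (0.4 if person["income"] > 70 else 0.1)

def make_missing(mechanism):
    """Return a copy of the data with income set to None where it goes missing."""
    return [
        {**p, "income": None if mechanism(p) else p["income"]}
        for p in people
    ]

for name, mechanism in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    data = make_missing(mechanism)
    missing = sum(row["income"] is None for row in data)
    print(name, "missing incomes:", missing)
```

In the MNAR case the remaining incomes are biased towards lower values, and nothing in the observed data reveals that, which is why MNAR is the hardest case to handle.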
The keep as-is approach to handling missing values involves…
Keeping the data as it is given
The remove rows approach to handling missing values involves…
Removing the observations with missing values
The remove columns approach to handling missing values involves…
Removing the features with missing values
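The remove rows and remove columns approaches can be sketched in a few lines. The dataset below is hypothetical, and `None` is assumed to mark a missing value.

```python
# Hypothetical dataset; None marks a missing value.
rows = [
    {"name": "Ann", "age": 34, "city": "Leeds"},
    {"name": "Bob", "age": None, "city": "York"},
    {"name": "Cat", "age": 29, "city": "Hull"},
]

# Remove rows: keep only observations with no missing values.
complete_rows = [r for r in rows if all(v is not None for v in r.values())]

# Remove columns: keep only features that are never missing.
full_columns = [c for c in rows[0] if all(r[c] is not None for r in rows)]
without_gappy_columns = [{c: r[c] for c in full_columns} for r in rows]

print(complete_rows)          # Bob's row is dropped
print(without_gappy_columns)  # the "age" column is dropped
```

Removing rows loses one observation here, while removing columns keeps every observation but loses the whole `age` feature, so the better choice depends on how much is missing and where.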
The impute values approach to handling missing values involves…
Estimating and imputing missing values via models and problem knowledge
How can we generate imputable values? Describe one general solution, and one each for MNAR and MCAR.
Central tendency (mean, median, mode), regression analysis (MNAR), interpolation (MCAR)
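Two of these approaches, central tendency and interpolation, can be sketched as follows. The readings are hypothetical, `None` is assumed to mark a missing value, and the interpolation shown is plain linear interpolation between the nearest known neighbours. Regression imputation is left out to keep the sketch short.

```python
from statistics import mean, median

# Hypothetical evenly spaced readings; None marks a missing value.
readings = [10.0, 12.0, None, 16.0, None, None, 22.0]

observed = [x for x in readings if x is not None]

# Central tendency: replace every gap with one summary value.
mean_filled = [mean(observed) if x is None else x for x in readings]
median_filled = [median(observed) if x is None else x for x in readings]

def interpolate(values):
    """Fill interior gaps on a straight line between the nearest known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled

print(mean_filled)
print(median_filled)
print(interpolate(readings))
```

Mean and median imputation ignore the order of the readings, whereas interpolation uses it, so interpolation only makes sense when neighbouring values are related (for example in a time series).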
What are outliers?
Data points that differ significantly from other observations.
What can cause outliers?
Data errors, legitimate but extreme values, or fraudulent entries
How can we detect outliers?
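Using quartiles: find the lower quartile (Q1), the upper quartile (Q3), and the interquartile range (IQR = Q3 - Q1), then flag points that fall far below Q1 or far above Q3, commonly by more than 1.5 × IQR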
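A minimal sketch of quartile-based outlier detection is shown below. The data is hypothetical, and the 1.5 × IQR cut-off is a common rule of thumb assumed here rather than something the cards specify.

```python
from statistics import quantiles

# Hypothetical measurements with one suspicious value.
data = [12, 13, 14, 15, 15, 16, 17, 18, 95]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Assumed rule of thumb: flag points more than 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)
```

Detection only flags the value; whether 95 is a data error, a legitimate extreme, or a fraudulent entry still has to be decided from knowledge of the problem.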
The do nothing approach to handling outliers involves…
Doing nothing to the data