Chapter 4 - Data Quality Flashcards
Integrating multiple systems and inappropriate database design are two causes of what? What is meant by inappropriate design?
Data redundancy
Inappropriate design means transactional databases that aren't in third normal form (3NF).
What types of data do the below statements describe?
1) Same or similar data elements exist in multiple places
2) Identical copies of the same information exist in multiple places
1) Redundant Data
2) Duplicate Data
Can you list the 8 Data Quality challenges?
1) Duplicate Data
2) Redundant Data
3) Invalid Data
4) Data Type Validation
5) Missing Values
6) Specification Mismatch
7) Nonparametric Data
8) Data Outliers
What must you watch out for with the data quality challenge around nonparametric data?
Whether the rank order of the values is significant
Which data quality issue will you get if you don't validate that inbound data consistently maps to its target data type?
Specification Mismatch
What do you need to do to ensure you don’t get specification mismatch?
you need to ensure that inbound data is correctly mapped to its target data type
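To make this concrete, here is a minimal sketch using pandas (the column names and data are illustrative assumptions, not from the book): values that can't be coerced to the target type surface as NaN instead of slipping through as text.

```python
import pandas as pd

# Hypothetical inbound data: "amount" should be numeric but arrives as strings.
inbound = pd.DataFrame({"order_id": ["A1", "A2", "A3"],
                        "amount": ["19.99", "twenty", "5.00"]})

# Coerce to the target data type; values that don't map become NaN
# instead of silently passing through as text.
inbound["amount"] = pd.to_numeric(inbound["amount"], errors="coerce")

# Flag rows that failed validation so the mismatch can be fixed upstream.
bad_rows = inbound[inbound["amount"].isna()]
print(bad_rows)
```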
What data manipulation technique is helpful when you have numeric data you want to split into subcategories to facilitate analysis?
Recoding
Increasing the ages in an age column by 4 years is also an example of recoding
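A minimal pandas sketch of both kinds of recoding mentioned on this card (the bins and labels are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({"age": [16, 24, 41, 67]})

# Recoding example 1: shift every age up by 4 years.
df["age_plus_4"] = df["age"] + 4

# Recoding example 2: split numeric ages into subcategories with pd.cut.
df["age_group"] = pd.cut(df["age"],
                         bins=[0, 18, 35, 55, 120],
                         labels=["minor", "young adult", "middle-aged", "senior"])
print(df)
```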
Regarding data manipulation, what technique describes creating a new variable from a calculation on an existing table?
Derived variable
Why is it not a good idea to use a column to store derived variables from another column? What should you do instead?
If the nature of the variables changes over time, the stored column would need constant updates. Instead, derived variables should be embedded as code so they are calculated only when needed.
The book misses out that a derived variable stored in a column does not auto-update, which is why the issue arises. It is best to embed the calculation in the query itself or in code, as in the sketch below.
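A minimal sketch of that approach, assuming pandas (the order_totals helper and columns are hypothetical): the derived value is recomputed each time it is needed, so it never goes stale. In SQL the equivalent is calculating the value in the SELECT clause rather than persisting it in a column.

```python
import pandas as pd

df = pd.DataFrame({"price": [100.0, 250.0], "quantity": [3, 2]})

# Derived variable embedded in code: recomputed on every call,
# so it never goes stale when price or quantity changes.
def order_totals(frame: pd.DataFrame) -> pd.Series:
    return frame["price"] * frame["quantity"]

print(order_totals(df))
```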
Which data manipulation technique would you use if you wanted a holistic view of a particular subject?
Data merging
What data manipulation technique helps to ensure data is consistent, complete and accurate?
Data Merging
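As an illustration, a pandas merge joins differently structured sources on a shared key to build that holistic view (the tables and key below are made up):

```python
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2], "name": ["Ada", "Grace"]})
orders = pd.DataFrame({"cust_id": [1, 1, 2], "total": [50, 20, 75]})

# Merging joins differently structured sources on a shared key,
# giving a more complete view of each customer.
merged = customers.merge(orders, on="cust_id", how="left")
print(merged)
```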
What’s the difference between ETL and data blending when it comes to combining data?
ETL combines multiple sources of data into a single data set in a data warehouse database, whereas data blending combines the data only at the reporting layer.
If the IT infrastructure was struggling to do ETL effectively, what other technique can you use to combine datasets that has less impact on IT?
Data blending using a data visualization tool
A data analyst MUST understand what if they’re to use data blending techniques?
they must understand how data maps across systems
If you needed to combine the variables of several columns into a single variable column, what data manipulation technique would you use?
Concatenation
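A minimal pandas sketch, with hypothetical name columns:

```python
import pandas as pd

df = pd.DataFrame({"first_name": ["Ada", "Grace"],
                   "last_name": ["Lovelace", "Hopper"]})

# Concatenation: combine several columns into a single variable column.
df["full_name"] = df["first_name"] + " " + df["last_name"]
print(df["full_name"])
```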
A data append combines data from different sources into a single data set, but how does it differ from data merge?
It differs in that the source data structures must be the SAME, i.e. if you combine two data sources, both contain exactly the same attributes/columns.
With data merge, by contrast, the source data comes from data sets with different structures.
You have the same data being recorded in different locations and you need to combine it into a single data set; what manipulation technique does this?
Data Append
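A sketch of an append in pandas, assuming two sources recording identical columns:

```python
import pandas as pd

# Two sources recording the SAME attributes in different locations.
store_east = pd.DataFrame({"sale_id": [1, 2], "amount": [9.99, 4.50]})
store_west = pd.DataFrame({"sale_id": [3, 4], "amount": [12.00, 7.25]})

# A data append stacks them into a single data set.
all_sales = pd.concat([store_east, store_west], ignore_index=True)
print(all_sales)
```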
Imputation is a data manipulation technique to deal with what problem? [IMPORTANT]
missing values in a data set.
List the 5 data imputation methods for dealing with missing numeric values
1) Removing rows containing missing values
2) Replace with zero
3) Replace with overall average
4) Replace with Most Frequent (mode)
5) Closest Value average
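The five methods map naturally onto pandas one-liners. This sketch assumes a numeric temp column, and uses interpolate() as an approximation of the closest-value average:

```python
import pandas as pd

df = pd.DataFrame({"temp": [21.0, None, 24.0, None, 30.0]})

dropped   = df.dropna()                               # 1) remove rows with missing values
zeroed    = df["temp"].fillna(0)                      # 2) replace with zero
averaged  = df["temp"].fillna(df["temp"].mean())      # 3) replace with overall average
mode_fill = df["temp"].fillna(df["temp"].mode()[0])   # 4) replace with most frequent (mode)
nearest   = df["temp"].interpolate()                  # 5) approximate closest-value average
```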
Reduction is the process of what?
shrinking a dataset WITHOUT negatively impacting its analytical value
Removing attributes to reduce a dataset’s overall size is known as what? Why would this be done?
It’s known as dimensionality reduction. It’s done to make data analysis on big datasets more efficient
Histograms are a method of __________ __________
numerosity reduction
this reduces QUANTITATIVE data
List the 3 methods used to reduce a dataset to make big data analysis more efficient.
numerosity reduction
dimensionality reduction
sampling
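A rough pandas sketch of all three on a synthetic dataset: dropping an attribute (dimensionality reduction), histogram-style binning (numerosity reduction), and random sampling.

```python
import pandas as pd

df = pd.DataFrame({"amount": range(1, 10_001),
                   "note": ["n/a"] * 10_000})

# Dimensionality reduction: drop an attribute that adds no analytical value.
slim = df.drop(columns=["note"])

# Numerosity reduction: summarize raw values as histogram-style bin counts.
bins = pd.cut(slim["amount"], bins=10).value_counts().sort_index()

# Sampling: analyze a representative 1% subset instead of every row.
sample = slim.sample(frac=0.01, random_state=42)
```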
What manipulation technique summarizes data, saving you from having to search through it?
Data aggregation calculations
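A small pandas sketch with made-up sales data, summarizing by group instead of searching individual records:

```python
import pandas as pd

sales = pd.DataFrame({"region": ["East", "East", "West"],
                      "amount": [100, 250, 400]})

# Aggregation calculations summarize the data so you don't have to
# search through individual records.
summary = sales.groupby("region")["amount"].agg(["sum", "mean", "count"])
print(summary)
```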