Chapter 4 - Data Quality Flashcards

Q

integrating multiple systems and inappropriate database design are two causes of what? What is meant by inappropriate design?

A

Data redundancy
Inappropriate design means transactional databases that aren’t in third normal form (3NF).

Q

What types of data do the below statements describe?

1) Same or similar data elements exist in multiple places
2) Identical copies of the same information exist in multiple places

A

1) Redundant Data
2) Duplicate Data

Q

Can you list the 8 Data Quality challenges?

A

1) Duplicate Data
2) Redundant Data
3) Invalid Data
4) Data Type Validation
5) Missing Values
6) Specification Mismatch
7) Nonparametric Data
8) Data Outliers

Q

what must you watch out for with the data quality challenge around nonparametric data?

A

whether the rank order of the values is significant

Q

which data quality issue will you get if you don’t ensure inbound data consistently maps to its target data type?

A

Specification Mismatch

Q

What do you need to do to ensure you don’t get specification mismatch?

A

you need to ensure that inbound data is correctly mapped to its target data type

Q

What data manipulation technique is helpful when you have numeric data you want to split into subcategories to facilitate analysis?

A

Recoding

increasing the ages in an age column by 4 years is also an example of recoding
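
A rough pandas sketch of both kinds of recoding (my own example, not from the book; the column names are invented):

```python
import pandas as pd

df = pd.DataFrame({"age": [3, 11, 27, 45, 68]})

# Recode numeric ages into ordinal subcategories (binning).
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 14, 34, 54, 120],
    labels=["0-14", "15-34", "35-54", "55+"],
)

# Recoding can also be a plain transformation, e.g. increasing every age by 4 years.
df["age_plus_4"] = df["age"] + 4
print(df)
```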

Q

Regarding data manipulation, what technique describes creating a new variable from a calculation on an existing table?

A

Derived variable

Q

Why is it not a good idea to use a column to store derived variables from another column? What should you do instead?

A

if the nature of the variables changes over time, the stored column would need constant updates. Instead, derived variables should always be embedded as code so they are calculated only when needed.

the book misses out that a derived variable stored in a column does not auto-update, hence the issue. It is best to embed the calculation in the query itself or in code.
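
A minimal sketch of the point (my own example; table and column names invented). The derived value is computed at the moment it is needed, so it never goes stale:

```python
import pandas as pd

orders = pd.DataFrame({"quantity": [2, 5], "unit_price": [9.99, 3.50]})

# Anti-pattern: persisting the derived column means it must be manually kept
# in sync whenever quantity or unit_price changes.
# orders["line_total"] = orders["quantity"] * orders["unit_price"]

# Preferred: derive the value in code only when needed.
line_total = orders["quantity"] * orders["unit_price"]
print(line_total)

# The SQL equivalent embeds the calculation in the query itself:
#   SELECT quantity, unit_price, quantity * unit_price AS line_total FROM orders;
```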

Q

Which data manipulation technique would you use if you wanted a holistic view of a particular subject?

A

Data merging

Q

What data manipulation technique helps to ensure data is consistent, complete and accurate?

A

Data Merging

Q

What’s the difference between ETL and data blending when it comes to combining data?

A

ETL combines multiple sources of data into a single data set in a data warehouse database, whereas data blending does this only at the reporting layer.

Q

If the IT infrastructure was struggling to do ETL effectively, what other technique can you use to combine datasets that has less impact on IT?

A

Data blending using a data visualization tool

Q

A data analyst MUST understand what if they’re to use data blending techniques?

A

they must understand how data maps across systems

Q

if you needed to combine the variables of several columns into a single variable column, what data manipulation technique would you use?

A

Concatenation
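
A quick pandas illustration (my own example; the column names are invented):

```python
import pandas as pd

df = pd.DataFrame({"first_name": ["Ada", "Alan"], "last_name": ["Lovelace", "Turing"]})

# Concatenation: combine the variables of several columns into one column.
df["full_name"] = df["first_name"] + " " + df["last_name"]
print(df["full_name"])
```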

Q

a data append combines data from different sources into a single data set, but how does it differ from data merge?

A

it differs in that the source data structures must be the SAME, i.e. if combining two data sources, both contain exactly the same data attributes/columns.
Whereas with data merge, the source data comes from differently structured data sets
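
A sketch of the difference in pandas (my own example): an append stacks identically structured rows, while a merge joins differently structured sources on a key.

```python
import pandas as pd

# Data append: both sources have exactly the same columns, so the rows stack.
store_a = pd.DataFrame({"sale_id": [1, 2], "amount": [10.0, 12.5]})
store_b = pd.DataFrame({"sale_id": [3, 4], "amount": [7.0, 9.9]})
appended = pd.concat([store_a, store_b], ignore_index=True)

# Data merge: different data sets about the same subject, joined on a key.
customers = pd.DataFrame({"cust_id": [1, 2], "name": ["Babs", "Raj"]})
orders = pd.DataFrame({"cust_id": [1, 1, 2], "total": [99.0, 12.0, 45.0]})
merged = customers.merge(orders, on="cust_id")

print(appended)
print(merged)
```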

Q

you have the same data being recorded in different locations and you need to combine it into a single data set; what manipulation technique does this?

A

Data Append

Q

Imputation is a data manipulation technique to deal with what problem? [IMPORTANT]

A

missing values in a data set.

Q

List the 5 data imputation methods for dealing with missing numeric values

A

1) Removing rows containing missing values

2) Replace with zero

3) Replace with overall average

4) Replace with Most Frequent (mode)

5) Closest Value average
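
A pandas sketch of all five (my own example; I use linear interpolation as one way to approximate a closest-value average):

```python
import pandas as pd

s = pd.Series([12.0, None, 15.0, None, 14.0])

dropped = s.dropna()               # 1) remove rows containing missing values
zeroed = s.fillna(0)               # 2) replace with zero
averaged = s.fillna(s.mean())      # 3) replace with overall average
mode_fill = s.fillna(s.mode()[0])  # 4) replace with most frequent value (mode)
closest = s.interpolate()          # 5) average of the closest surrounding values
```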

Q

reduction is the process of what?

A

shrinking a dataset WITHOUT negatively impacting its analytical value

Q

Removing attributes to reduce a dataset’s overall size is known as what? Why would this be done?

A

It’s known as dimensionality reduction. It’s done to make data analysis on big datasets more efficient

Q

Histograms are a method of __________ __________

A

numerosity reduction

this reduces QUANTITATIVE data

Q

list the 3 methods to reduce a dataset to make big data analysis more efficient?

A

numerosity reduction
dimensionality reduction
sampling

Q

what manipulation technique summarizes data, saving you from having to search through it?

A

Data aggregation calculations

Q

if you wanted to present data but maintain privacy, which manipulation technique allows you to do this?

A

Data Aggregation allows you to do this by summarizing the data instead.
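
A small groupby sketch (my own example): the summary can be shared while the row-level records stay private.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "amount": [100, 150, 90, 60],
})

# Aggregation: present counts/sums/means per region instead of individual rows.
summary = sales.groupby("region")["amount"].agg(["count", "sum", "mean"])
print(summary)
```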

Q

What data manipulation technique re-arranges the data for better visibility and prepares it for use in data visualization tools?

A

Transposition

Q

what data manipulation technique helps to reveal patterns hidden in the original format of the data?

A

Transposition

Q

What combination of data manipulation techniques greatly enhances the ability to see trends, calculate dimensions efficiently, and get a more holistic view?

A

Transposition and Aggregation
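
One way to see the combination in pandas (my own example): pivot_table transposes a dimension into columns and aggregates in a single step.

```python
import pandas as pd

sales = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "region": ["East", "West", "East", "West"],
    "amount": [100, 90, 150, 60],
})

# Months become columns (transposition) while amounts are summed (aggregation),
# which makes the trend per region easy to scan.
trend = sales.pivot_table(index="region", columns="month", values="amount", aggfunc="sum")
print(trend)
print(trend.T)  # plain transposition: swap rows and columns
```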

Q

pick the right word for the sentences:
Use ______ to address composite structural issues
Use ______ to address distributed structural issues

Words: Parsing / String Manipulation

A

Use parsing to address composite structural issues
Use string manipulation to address distributed structural issues
(When would you use string over concatenation?)

Q

At what stage do you need to perform string manipulation: before or after you start analysis?

A

before analysis

Q

data quality: you should influence data quality as early as possible in WHAT part of the analytics life-cycle?

A

the Acquisition phase.

Q

Data quality issues can be introduced at all stages of the data life-cycle; these stages are the places to introduce quality control checks. Where are they?

A

Data Acquisition phase

Data Manipulation Phase:
–Data Transformation (ETL)
–Data Conversion (ETL)
–Data Manipulation (ETL)

Data visualization phase:
(Final Product Preparation)

Q

You’re implementing some automated validation; what do you need to understand first?

A

how source data fields map to their corresponding database columns and, specifically, their data-types!

Q

what type of validation automation would you implement to control input form errors?

A

you would use data-type validation automation to ensure the correct data-type is used in the input form.

Q

What two examples of automated validation does the author provide?

A

Data-type validation
Validation of number of data points

Q

How might you prevent missing data (as in, complete rows of data and not just missing values) from flowing into your analytics environment?

e.g. if source day-temp was measured 24 times a day, you should always get 24 rows.

A

Implement automated validation that verifies the number of data points expected for a given attribute

Q

List the 6-dimensions of data quality

A

1) Accuracy
2) Completeness
3) Consistency
4) Timeliness
5) Uniqueness
6) Validity

Q

What data quality dimension denotes how closely a given attribute matches how you intended to use it?

A

data accuracy

Q

Match the data quality dimension to the questions asked:

1) Accuracy
2) Completeness
3) Consistency
4) Timeliness
5) Uniqueness
6) Validity

a) what do I intend to use this attribute for?
b) is my attribute within its expected range?
c) is it relevant when I obtain this data?
d) how reliable is this data, i.e. is it correct, or the same across different systems?
e) Does my data have the minimum number of attributes in order to carry out the purpose the data is collected for?
f) Can we reduce the number of systems this data attribute exists on?

A

1) Accuracy / a) what do I intend to use this attribute for?
2) Completeness / e) Does my data have the minimum number of attributes in order to carry out the purpose the data is collected for?
3) Consistency / d) how reliable is this data, i.e. is it the same across different systems?
4) Timeliness / c) is it relevant when I obtain this data?
5) Uniqueness / f) Can we reduce the number of systems this data attribute exists on?
6) Validity / b) is my attribute within its expected range?

Q

when is the best time to check the validity of data?

A

At the moment of creation.

Q

Taking the count of rows that pass or fail during an ETL load is one method to measure what?

A

data conformity

Q

the author lists 5 methods to validate quality; list them:

these methods are broader than validating data-types, at-creation checks, or manipulation during ETL

A

1) Reasonable Expectations
2) Data Profiling
3) Data Audits
4) Sampling
5) Cross-Validation

Q

Babs expects roughly 30 million records to be loaded monthly into the DWH. What data validation check could she do to detect issues with the ETL load?

A

Implement a Reasonable Expectation check by creating an exception report that will alert if the number of records is below a reasonably expected amount.
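
A bare-bones sketch of such a check (my own code; the 10% tolerance and the function name are invented):

```python
EXPECTED_MONTHLY_ROWS = 30_000_000
TOLERANCE = 0.10  # alert if the load is more than 10% below expectation

def check_monthly_load(row_count: int) -> None:
    floor = EXPECTED_MONTHLY_ROWS * (1 - TOLERANCE)
    if row_count < floor:
        # In practice this would feed an exception report or alerting system.
        print(f"ALERT: only {row_count:,} rows loaded; expected at least {floor:,.0f}")
    else:
        print(f"OK: {row_count:,} rows loaded")

check_monthly_load(22_500_000)  # triggers the alert
```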

Q

whether the data in your analytic environment looks as it should. A.K.A.?

A

Reasonable Expectations

Q

Which data quality validation uses statistical measures to check for discrepancies and is good at identifying irregular patterns within your data?

A

Data Profiling
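
A taste of basic profiling in pandas (my own example; dedicated profiling tools go much further):

```python
import pandas as pd

df = pd.DataFrame({
    "temp_c": [21.0, 22.5, None, 98.0],
    "city": ["Leeds", "Leeds", "York", "York"],
})

print(df.describe())              # statistical measures expose the 98.0 outlier
print(df.isna().sum())            # missing-value counts per column
print(df["city"].value_counts())  # irregular category patterns
```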

Q

Results of data profiling can be used as a Reasonable Expectations test, true or false?

A

TRUE. The results can highlight patterns that can be defined as being within or outside of reasonable expectations

Q

which data quality validation method uses Data Profiling techniques to help validate data integrity, but also helps in identifying security issues?

A

Data Audits

Q

Regular _______ ensures that data quality remains high throughout the data life-cycle

A

data profiling

Q

data profiling is often used at which part of the data life-cycle?

A

At the Acquisition phase

Q

which data quality validation method is a cost-effective way to check the quality of data?

A

Data Sampling

Q

Cross-validation, whilst not directly used for data quality validation, indirectly helps identify issues with another data quality validation method. Which is it?

A

Cross-validation helps detect bias in the method of data SAMPLING

Q

what do you need to be mindful of when it comes to data outliers?

A

Their impact on statistical analysis

Q

what is the difference between the two data quality challenges of duplicate data and redundant data?

A

1) Duplicate data is identical data or values that appear more than once (which may be considered necessary/valid duplication)
2) Whereas redundant data is unnecessary duplication that can be removed safely.

think about redundancy in terms of normalization of tables: instead of recording the same value multiple times, it can be ‘removed’ by creating a reference table.

Q

what data manipulation technique would you use if you wanted to create a master customer profile made from different database subjects about a particular customer?

A

Data Merge

Q

when does data merging usually occur?

A

At the Transformation part of the ETL process

Q

What data manipulation technique is useful when you have numeric data you want to analyse by category?

A

Recoding

Q

regarding recoding data, what is meant by ordinal and nominal values?

A

Ordinal values are those where the categories have an inherent rank, like T-shirt sizes for example.
Nominal values are those where the categories have no particular order, like hair colour for instance.

Q

An analyst ideally needs to automate quality improvements at which part of the data life-cycle?

A

the data acquisition phase (collection and preparation)

Q

What step must you undertake before applying Imputation to deal with null values?

A

You must understand the context of the null values by explicitly checking them.

Q

Bizarrely, the data quality dimension of “data validity” specifically identifies what? The author also says it is interchangeable with another term, which I disagree with. What is it?

A

Whether a given value falls within an expected range
The author also says data validity is AKA data integrity!

Q

Data derived from categorical or ordinal rank

A

nonparametric data

Q

two reasons responsible for creating duplicate data

A

In order:

1) Humans
2) Multiple data sources

Q

Two ways to deal with duplicate data

A

1) Prevent creation in first place (best)
2) Employ duplicate resolution process

Use warning prompts in software to prevent creation
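
A sketch of the resolution side in pandas (my own example; the business key is invented):

```python
import pandas as pd

customers = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "b@y.com"],
    "name": ["Ann", "Ann", "Bob"],
})

# Resolve exact duplicates: keep the first occurrence of each identical row.
deduped = customers.drop_duplicates()

# Or resolve on a business key, e.g. one record per email address.
deduped_by_key = customers.drop_duplicates(subset="email", keep="first")
```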

Q

causes of redundant data

A

Integrating multiple systems that use shared data elements
Tables not in 3NF

Q

Options to treat redundant data

A

Synchronize changes to shared data elements
Tables in 3NF

Q

values outside the valid range for a given attribute

A

Invalid data

Q

Invalid values are easy to check for with what kind of data?

A

Numeric and date data

Q

It’s trickier to check for invalid data when it is what kind of data?

What would you implement to avoid invalid data of this kind?

A

Text data

Implementing good referential integrity resolves this

Q

when the source column has a different datatype to the destination column

A

Specification Mismatch

Q

how to resolve specification mismatch

A

validate data mapping of columns

Q

data-type inconsistency (within the same column) during data load causes what?

A

Database load process to fail/stop

Q

prevention of database load failures caused by inconsistent data-types

A

Validate data-type consistency before data load
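
One hedged way to do this in pandas (my own example): coerce the column and inspect the failures before anything reaches the database.

```python
import pandas as pd

raw = pd.DataFrame({"reading": ["10.5", "11.2", "n/a", "12.0"]})

# Coerce to numeric: inconsistent values become NaN instead of killing the load.
numeric = pd.to_numeric(raw["reading"], errors="coerce")

# Route the offending rows to remediation rather than loading them.
bad_rows = raw[numeric.isna()]
print(bad_rows)
```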

Q

combines multiple sources of data into a single dataset and presents it at the reporting layer

A

data blending

this is done through software/visualizer tools

Q

Aggregating a set of value ranges into a bin and then counting how frequently values within that bin occur

What is this and what is it an example of?

A

A histogram
An example of numerosity reduction

because it summarizes values into bins and counts their frequency instead of keeping the individual data points themselves, thereby reducing the data
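
A tiny pandas illustration (my own example):

```python
import pandas as pd

ages = pd.Series([3, 11, 27, 45, 68, 70, 71])

# Numerosity reduction: keep 4 bin frequencies instead of 7 individual values.
histogram = pd.cut(ages, bins=[0, 20, 40, 60, 80]).value_counts().sort_index()
print(histogram)
```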

Q

a raw data source has multiple, distinct values combined within a single character column

e.g. m014 (sex = m, age = 0-14)

A

Composite column

considered a composite structural issue
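
A parsing sketch for that example (my own code; the column names and the extra codes beyond m014 are invented):

```python
import pandas as pd

raw = pd.DataFrame({"code": ["m014", "f1534", "m65plus"]})

# Parse the composite column into its distinct values:
# sex is the first character, the age band is the remainder.
raw["sex"] = raw["code"].str[0]
raw["age_band"] = raw["code"].str[1:]
print(raw)
```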

Q

These are all part of the data cleaning and manipulation phase. What process typically handles them?

1) Data Validation: Ensuring data consistency and accuracy during extraction.

2) Data Screening: Identifying issues during the extraction phase.

3) De-duplication: Removing duplicate records.

4) Handling Invalid Data: Transforming invalid entries into valid ones.

5) Handling Missing Data: Addressing missing values during transformation.

A

ETL process

Q

verifying the limitations of the data source AND presenting drop-down lists on form entry

are two ways to check for data quality at what phase of the data analysis life-cycle?

A

Data Acquisition

Q

When data comes from more than one source and you’re combining these into one table, what quality issue can arise that you should check for?

What phase of the data analysis life-cycle does this align with and what process handles this?

A

Conversion issues (data consistency)

You would check that the source data doesn’t differ in value types, and if it does, you’d convert the data from one source so the column contains consistent data

This occurs in the Data Manipulation/Cleaning phase and the ETL process handles it

Q

Identifying WHEN data is acquired and HOW it is transformed can act as a quality check in what process?

What phase of the data analysis life-cycle does this happen?

A

At the ETL process

Data Manipulation phase

at any ETL stage! Intermediary or final hop into the Data Warehouse

Q

How can data quality issues be introduced right at the last phase of the data analysis life-cycle (Data Visualization)? How might you mitigate / prevent this?

A

If the data analyst doesn’t have a good knowledge of how the data is mapped, they may end up joining tables wrongly and producing errors.

To avoid this happening, the analyst must clearly know the data and its mapping differences between the source systems.

Q

elements of data accuracy, consistency, uniqueness and validity taken altogether are a measure of what?

A

Data Conformity

Q

Moving failed rows into a staging area for remediation is an efficient way of dealing with what issue? Why?

A

Data Conformity issues
It prevents the entire load process from failing

Q

which of the 5 broader data quality methods can detect data inconsistencies like missing, duplicate or incorrect data types?

A

Data Profiling

Q

If you have to deal with a very large dataset and need to assess the quality of it, which of the 5 broader methods to validate data quality mentioned in the book would you use?

A

Sampling

and thus running descriptive statistics on just the sample

Q

in which part of the ETL process does data conversion take place?

A

Extract process

Q

Which data collection methods are a valuable source of research and insight?

A

Human in the loop (surveys and observation)

Q

Which is the more modern variant - ELT or ETL?

A

ELT

Q

the sampling method where data is broken down into subgroups, like gender, and then randomly sampled from each of the groups

A

Stratified

Q

sampling method where each record of data has an equal chance of being selected for the data set used in the analysis

A

Simple Random
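
Both sampling methods side by side in pandas (my own example; `groupby(...).sample` needs pandas 1.1+):

```python
import pandas as pd

people = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "score": [70, 80, 90, 60, 65, 75, 85, 95],
})

# Simple random: every record has an equal chance of selection.
simple = people.sample(frac=0.5, random_state=42)

# Stratified: break into subgroups, then sample randomly within each group.
stratified = people.groupby("gender", group_keys=False).sample(frac=0.5, random_state=42)
```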

Q

which of the below methods is suited for minute-by-minute updates into the data warehouse?

ELT or ETL

A

ELT

Q

data that is repeated (identical) within the same dataset

A

duplicate data

Q

nonparametric data can have what effect on the normal distribution?

A

it can skew it left or right

Q

leading spaces, trailing spaces, and non-printable characters in a field will mean the data is what?

A

invalid data
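
A quick cleanup sketch (my own example):

```python
import pandas as pd

codes = pd.Series([" AB1 ", "AB2\t", "AB3"])

# Leading/trailing whitespace makes otherwise-valid values invalid;
# strip it before validating against the expected values.
cleaned = codes.str.strip()
print((cleaned != codes).sum(), "values needed cleaning")
```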

Q

it combines multiple fields into a single field
it allows analysts to consolidate data sets into a single dataset

A

DATA MERGE

Q

to combine data from ONE set with another data set

A

DATA APPEND

Q

a ______ variable is one that’s created from existing data through calculations, transformations, or other operations.

A

DERIVED

Q

function to join two datasets with the same columns and types, deleting the original table afterwards?

A

INLINE APPEND

Q

What is a field property setting that tells the database that a field needs to be tabulated?

A

INDEXING

Q

internal processing field property setting that controls how much data must be scanned when processing queries involving commonly sorted or filtered data?

A

INDEX FIELD

Q

data is entered the same way each time, as intended based on the business rules. An example of what quality dimension?

A

Consistency

Q

what is the most common method to prepare data for a DWH?

ETL or ELT?

A

ETL