Data Engineering Part 4 Flashcards
(19 cards)
What is data cleaning?
The process of detecting and correcting errors or inconsistencies in data.
What are common data quality issues?
Missing values, duplicates, inconsistent formats, and outliers.
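A minimal sketch of spotting two of these issues (missing values and duplicate keys) in a hypothetical list of records; the field names are made up for illustration.

```python
# Scan a hypothetical list of records for missing values and duplicate ids.
records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},             # missing value
    {"id": 1, "email": "a@example.com"},  # duplicate id
]

# Ids whose email is missing.
missing = [r["id"] for r in records if r["email"] is None]

# Ids that appear more than once.
seen, dupes = set(), []
for r in records:
    if r["id"] in seen:
        dupes.append(r["id"])
    seen.add(r["id"])

print(missing)  # [2]
print(dupes)    # [1]
```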
What is deduplication?
Identifying and removing duplicate records.
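One common policy, sketched below on made-up rows: treat a hypothetical `id` field as the record key, keep the first occurrence of each key, and drop later repeats.

```python
# Keep the first record for each id; drop subsequent duplicates.
rows = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Bo"},
    {"id": 1, "name": "Ada"},  # duplicate of the first row
]

seen = set()
deduped = []
for row in rows:
    if row["id"] not in seen:
        seen.add(row["id"])
        deduped.append(row)

print(len(deduped))  # 2
```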
What is schema validation?
Ensuring that incoming data matches a predefined schema.
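A hand-rolled sketch of the idea, with a schema expressed as field-name-to-type pairs (real pipelines typically use a library such as jsonschema or Avro; the field names here are assumptions).

```python
# Validate an incoming record against a simple schema: field -> expected type.
schema = {"id": int, "email": str}

def validate(record, schema):
    """Return a list of validation errors; an empty list means the record conforms."""
    errors = []
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"wrong type for {field}")
    return errors

print(validate({"id": 1, "email": "a@example.com"}, schema))  # []
print(validate({"id": "x"}, schema))  # wrong type + missing field
```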
What is outlier detection?
Identifying data points that deviate significantly from others.
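One simple approach is a z-score rule, sketched here on made-up numbers: flag any point more than a chosen number of standard deviations from the mean (the threshold of 2 is an arbitrary illustration).

```python
# Flag values more than 2 population standard deviations from the mean.
import statistics

values = [10, 12, 11, 13, 12, 11, 95]  # 95 is the obvious outlier
mean = statistics.mean(values)
stdev = statistics.pstdev(values)

outliers = [v for v in values if abs(v - mean) > 2 * stdev]
print(outliers)  # [95]
```

Note that a single extreme value inflates both the mean and the standard deviation, which is why robust alternatives (median and IQR) are often preferred in practice.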
What is referential integrity?
Ensuring that foreign keys match primary keys in related tables.
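The same check can be sketched on two in-memory "tables" (hypothetical orders referencing customers by a foreign key): every `customer_id` in orders must exist as an `id` in customers.

```python
# Find orders whose foreign key points at a nonexistent customer.
customers = [{"id": 1}, {"id": 2}]
orders = [
    {"id": 10, "customer_id": 1},
    {"id": 11, "customer_id": 3},  # dangling foreign key
]

customer_ids = {c["id"] for c in customers}
violations = [o["id"] for o in orders if o["customer_id"] not in customer_ids]
print(violations)  # [11]
```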
What is a null constraint?
A rule that disallows null values in a column.
What is a uniqueness constraint?
A rule that ensures all values in a column are unique.
What is a check constraint?
A rule that enforces a specific condition on a column value.
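In a database these three constraints are declared in DDL; the sketch below just mimics them in plain Python on a hypothetical `age` column, so the logic of each rule is visible.

```python
# Apply null, uniqueness, and check (age >= 0) constraints to one column.
column = [34, 28, None, 28, -5]

# Null constraint: no value may be None.
null_violations = [i for i, v in enumerate(column) if v is None]

# Uniqueness constraint: no non-null value may repeat.
seen, unique_violations = set(), []
for i, v in enumerate(column):
    if v is not None and v in seen:
        unique_violations.append(i)
    seen.add(v)

# Check constraint: every non-null value must satisfy the condition.
check_violations = [i for i, v in enumerate(column) if v is not None and v < 0]

print(null_violations, unique_violations, check_violations)  # [2] [3] [4]
```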
Why is data integrity important?
It ensures trustworthiness and usability of data.
What is schema evolution?
The ability to handle changes to a data schema over time.
What is backward compatibility in schema evolution?
Data written with the old schema can still be read by systems using the new schema.
What is forward compatibility?
Data written with the new schema can still be read by systems using the old schema.
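Both properties can be sketched with a single additive change: a new optional field gets a default value, so readers on either schema version keep working (the field names and default below are made up).

```python
# Schema v1 has only "id"; schema v2 adds an optional "country" with a default.
old_record = {"id": 1}                   # written under the old schema
new_record = {"id": 2, "country": "NZ"}  # written under the new schema

def read_v2(record):
    """New-schema reader: a missing optional field falls back to its default."""
    return {"id": record["id"], "country": record.get("country", "unknown")}

def read_v1(record):
    """Old-schema reader: simply ignores fields it does not know about."""
    return {"id": record["id"]}

print(read_v2(old_record))  # backward compatible: new reader, old data
print(read_v1(new_record))  # forward compatible: old reader, new data
```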
Why is schema evolution important?
To support ongoing development and changes in data sources.
Which formats support schema evolution?
Avro, Parquet, Protobuf.
What is data validation?
The process of checking data against rules or constraints.
What is a validation rule?
A rule used to assess the correctness or quality of data.
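One way to express such rules, sketched with illustrative field names: each rule is a named predicate, and validation reports the rules a record fails.

```python
# Validation rules as named predicates over a record.
rules = {
    "id must be positive": lambda r: r["id"] > 0,
    "email must contain @": lambda r: "@" in r["email"],
}

def failed_rules(record):
    """Return the names of every rule the record violates."""
    return [name for name, rule in rules.items() if not rule(record)]

print(failed_rules({"id": 1, "email": "a@example.com"}))  # []
print(failed_rules({"id": -1, "email": "nope"}))          # both rules fail
```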
What is anomaly detection?
Identifying data patterns that do not conform to expected behavior.
What is the difference between validation and cleaning?
Validation checks data quality; cleaning fixes detected issues.