Data Engineering Part 4 Flashcards
(19 cards)
What is data cleaning?
The process of detecting and correcting errors or inconsistencies in data.
What are common data quality issues?
Missing values, duplicates, inconsistent formats, and outliers.
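A minimal sketch of spotting two of these issues (missing values and duplicate keys) in a hypothetical list of records; the field names are made up for illustration.

```python
# Scan a hypothetical list of records for missing values and duplicate ids.
records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},             # missing value
    {"id": 1, "email": "a@example.com"},  # duplicate id
]

# Ids whose email is missing.
missing = [r["id"] for r in records if r["email"] is None]

# Ids that appear more than once.
seen, dupes = set(), []
for r in records:
    if r["id"] in seen:
        dupes.append(r["id"])
    seen.add(r["id"])

print(missing)  # [2]
print(dupes)    # [1]
```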
What is deduplication?
Identifying and removing duplicate records.
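One common policy, sketched below on made-up rows: treat a hypothetical `id` field as the record key, keep the first occurrence of each key, and drop later repeats.

```python
# Keep the first record for each id; drop subsequent duplicates.
rows = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Bo"},
    {"id": 1, "name": "Ada"},  # duplicate of the first row
]

seen = set()
deduped = []
for row in rows:
    if row["id"] not in seen:
        seen.add(row["id"])
        deduped.append(row)

print(len(deduped))  # 2
```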
What is schema validation?
Ensuring that incoming data matches a predefined schema.
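A hand-rolled sketch of the idea, with a schema expressed as field-name-to-type pairs (real pipelines typically use a library such as jsonschema or Avro; the field names here are assumptions).

```python
# Validate an incoming record against a simple schema: field -> expected type.
schema = {"id": int, "email": str}

def validate(record, schema):
    """Return a list of validation errors; an empty list means the record conforms."""
    errors = []
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"wrong type for {field}")
    return errors

print(validate({"id": 1, "email": "a@example.com"}, schema))  # []
print(validate({"id": "x"}, schema))  # wrong type + missing field
```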
What is outlier detection?
Identifying data points that deviate significantly from others.
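One simple approach is a z-score rule, sketched here on made-up numbers: flag any point more than a chosen number of standard deviations from the mean (the threshold of 2 is an arbitrary illustration).

```python
# Flag values more than 2 population standard deviations from the mean.
import statistics

values = [10, 12, 11, 13, 12, 11, 95]  # 95 is the obvious outlier
mean = statistics.mean(values)
stdev = statistics.pstdev(values)

outliers = [v for v in values if abs(v - mean) > 2 * stdev]
print(outliers)  # [95]
```

Note that a single extreme value inflates both the mean and the standard deviation, which is why robust alternatives (median and IQR) are often preferred in practice.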
What is referential integrity?
Ensuring that foreign keys match primary keys in related tables.
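The same check can be sketched on two in-memory "tables" (hypothetical orders referencing customers by a foreign key): every `customer_id` in orders must exist as an `id` in customers.

```python
# Find orders whose foreign key points at a nonexistent customer.
customers = [{"id": 1}, {"id": 2}]
orders = [
    {"id": 10, "customer_id": 1},
    {"id": 11, "customer_id": 3},  # dangling foreign key
]

customer_ids = {c["id"] for c in customers}
violations = [o["id"] for o in orders if o["customer_id"] not in customer_ids]
print(violations)  # [11]
```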
What is a null constraint?
A rule that disallows null values in a column.
What is a uniqueness constraint?
A rule that ensures all values in a column are unique.
What is a check constraint?
A rule that enforces a specific condition on a column value.
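In a database these three constraints are declared in DDL; the sketch below just mimics them in plain Python on a hypothetical `age` column, so the logic of each rule is visible.

```python
# Apply null, uniqueness, and check (age >= 0) constraints to one column.
column = [34, 28, None, 28, -5]

# Null constraint: no value may be None.
null_violations = [i for i, v in enumerate(column) if v is None]

# Uniqueness constraint: no non-null value may repeat.
seen, unique_violations = set(), []
for i, v in enumerate(column):
    if v is not None and v in seen:
        unique_violations.append(i)
    seen.add(v)

# Check constraint: every non-null value must satisfy the condition.
check_violations = [i for i, v in enumerate(column) if v is not None and v < 0]

print(null_violations, unique_violations, check_violations)  # [2] [3] [4]
```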
Why is data integrity important?
It ensures trustworthiness and usability of data.
What is schema evolution?
The ability to handle changes to a data schema over time.
What is backward compatibility in schema evolution?
Data written with the old schema can still be read by systems using the new schema.
What is forward compatibility?
Data written with the new schema can still be read by systems using the old schema.
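Both properties can be sketched with a single additive change: a new optional field gets a default value, so readers on either schema version keep working (the field names and default below are made up).

```python
# Schema v1 has only "id"; schema v2 adds an optional "country" with a default.
old_record = {"id": 1}                   # written under the old schema
new_record = {"id": 2, "country": "NZ"}  # written under the new schema

def read_v2(record):
    """New-schema reader: a missing optional field falls back to its default."""
    return {"id": record["id"], "country": record.get("country", "unknown")}

def read_v1(record):
    """Old-schema reader: simply ignores fields it does not know about."""
    return {"id": record["id"]}

print(read_v2(old_record))  # backward compatible: new reader, old data
print(read_v1(new_record))  # forward compatible: old reader, new data
```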
Why is schema evolution important?
To support ongoing development and changes in data sources.
Which formats support schema evolution?
Avro, Parquet, Protobuf.
What is data validation?
The process of checking data against rules or constraints.
What is a validation rule?
A rule used to assess the correctness or quality of data.
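One way to express such rules, sketched with illustrative field names: each rule is a named predicate, and validation reports the rules a record fails.

```python
# Validation rules as named predicates over a record.
rules = {
    "id must be positive": lambda r: r["id"] > 0,
    "email must contain @": lambda r: "@" in r["email"],
}

def failed_rules(record):
    """Return the names of every rule the record violates."""
    return [name for name, rule in rules.items() if not rule(record)]

print(failed_rules({"id": 1, "email": "a@example.com"}))  # []
print(failed_rules({"id": -1, "email": "nope"}))          # both rules fail
```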
What is anomaly detection?
Identifying data patterns that do not conform to expected behavior.
What is the difference between validation and cleaning?
Validation checks data quality; cleaning fixes detected issues.