Data Engineering Part 5 Flashcards
(20 cards)
What is observability in data pipelines?
The ability to understand system behavior through logs, metrics, and traces.
What is pipeline monitoring?
Tracking execution status, errors, and performance of ETL jobs.
What are metrics commonly tracked in pipelines?
Job success rates, duration, latency, and data volume.
What is alerting?
Automatically notifying users when metrics or jobs exceed thresholds.
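A minimal sketch of threshold-based alerting; the metric names and thresholds are illustrative, not from any particular monitoring tool:

```python
# Hypothetical threshold check: flag metrics that exceed their limits.
def check_alerts(metrics, thresholds):
    """Return the names of metrics whose values exceed their thresholds."""
    return [name for name, value in metrics.items()
            if name in thresholds and value > thresholds[name]]

# Example: a job ran too long, but error count is within bounds.
metrics = {"job_duration_s": 950, "error_count": 2}
thresholds = {"job_duration_s": 600, "error_count": 10}
print(check_alerts(metrics, thresholds))  # ['job_duration_s']
```

In practice a monitoring system would evaluate rules like this on a schedule and send the flagged names to a notifier (email, Slack, PagerDuty).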
What is log aggregation?
Collecting logs from distributed systems into a searchable central repository.
What is horizontal scaling?
Adding more machines or nodes to handle increased workload.
What is vertical scaling?
Increasing the resources (CPU, RAM) of a single machine.
What is partitioning in data storage?
Splitting data into segments based on attributes like date or region.
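A sketch of date-based partitioning using the Hive-style `key=value` path layout that many engines (e.g. Spark, Athena) can prune on; bucket and table names are made up:

```python
from datetime import date

# Build a Hive-style partition path for one day of data.
def partition_path(base, table, dt):
    return f"{base}/{table}/year={dt.year}/month={dt.month:02d}/day={dt.day:02d}"

print(partition_path("s3://bucket", "events", date(2024, 3, 7)))
# s3://bucket/events/year=2024/month=03/day=07
```

Queries filtered on the partition keys can then skip reading every other segment entirely.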
What is parallelism in data processing?
Executing multiple tasks or jobs simultaneously.
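A toy illustration of parallelism over data chunks with the standard library's `concurrent.futures`; the `process` function is a stand-in for real per-chunk work:

```python
from concurrent.futures import ThreadPoolExecutor

def process(chunk):
    # Stand-in for real work on one chunk of data.
    return sum(chunk)

chunks = [[1, 2], [3, 4], [5, 6]]
with ThreadPoolExecutor(max_workers=3) as pool:
    # map() runs process() on the chunks concurrently, preserving order.
    results = list(pool.map(process, chunks))
print(results)  # [3, 7, 11]
```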
What is throughput?
The amount of data processed per unit of time.
What is idempotence?
The property that an operation can be applied multiple times with the same effect as applying it once.
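A sketch of why upserts (insert-or-overwrite by key) are idempotent while blind appends are not; the key and record are illustrative:

```python
# Keyed overwrite: running the same load twice leaves the store
# in the same state as running it once.
def upsert(store, key, value):
    store[key] = value

store = {}
upsert(store, "user_1", {"name": "Ada"})
upsert(store, "user_1", {"name": "Ada"})  # retry of the same load: no duplicate
print(store)  # {'user_1': {'name': 'Ada'}}
```

This is why idempotent writes pair well with retry policies: a re-run after a partial failure cannot corrupt the target.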
What is atomicity in data processing?
Operations are all-or-nothing — they either complete entirely or not at all.
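Atomicity can be demonstrated with a database transaction; this sketch uses an in-memory SQLite database, and the simulated failure shows the whole transaction rolling back:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INT)")
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
conn.commit()

try:
    with conn:  # transaction: commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 'a'")
        raise RuntimeError("simulated crash mid-transfer")
except RuntimeError:
    pass

# The debit never took effect: the transaction was all-or-nothing.
balance = conn.execute("SELECT balance FROM accounts WHERE id = 'a'").fetchone()[0]
print(balance)  # 100
```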
What is fault tolerance?
The ability of a system to continue operating despite failures.
What is a retry policy?
A rule that defines how failed operations should be re-attempted.
What is backoff strategy in retries?
A method to progressively delay retry attempts after failures.
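A common variant is exponential backoff with jitter; the base delay, cap, and jitter fraction below are illustrative parameters, not a standard:

```python
import random

# Delay doubles each attempt, capped, plus random jitter so that
# many failed clients do not all retry at the same instant.
def backoff_delay(attempt, base=1.0, cap=30.0):
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0, delay * 0.1)

for attempt in range(4):
    print(f"attempt {attempt}: wait ~{backoff_delay(attempt):.1f}s")
```

A retry policy would combine this with a maximum attempt count and a list of retryable error types.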
What is data lineage?
The tracking of data’s origin, transformations, and flow through systems.
Why is data lineage important?
It enables transparency, simplifies debugging, and supports auditing and compliance.
What is data governance?
The policies and practices for managing data quality, access, and usage.
What is metadata in data engineering?
Descriptive information about data — structure, origin, type, etc.
What is a data catalog?
A searchable inventory of data assets and their metadata.