Chapter 3 Flashcards
Designing Data Pipelines (10 cards)
Expand the abbreviation
DAG
Directed acyclic graph
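Expand the abbreviation and elaborate on the process
ETL
Extraction, transformation, and load. Data is extracted from the source system,
transformed, and only then loaded into the target data store.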
Expand the abbreviation
ELT
Extraction, load, and transformation
Name the four types of stages in a data pipeline
- Ingestion
- Transformation
- Storage
- Analysis
What is
Ingestion
In terms of data pipeline stages
Ingestion is the process of bringing data into the GCP environment.
What is
Transformation
In terms of data pipeline stages
Transformation is the process of mapping data from the structure used in the source system
to the structure used in the storage and analysis stages of the data pipeline.
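As an illustration of such a mapping, here is a minimal sketch. The source field names (UserID, FirstName, LastName, SignupEpoch) and the target schema are hypothetical and only stand in for whatever the real source and storage systems use.

```python
from datetime import datetime, timezone


def transform(source_row):
    """Map a hypothetical source record to a hypothetical target schema."""
    full_name = f'{source_row["FirstName"]} {source_row["LastName"]}'.strip()
    signup = datetime.fromtimestamp(source_row["SignupEpoch"], tz=timezone.utc)
    return {
        "user_id": int(source_row["UserID"]),        # string -> integer
        "full_name": full_name,                      # two fields -> one
        "signup_date": signup.date().isoformat(),    # epoch seconds -> ISO date
    }


row = {"UserID": "42", "FirstName": "Ada", "LastName": "Lovelace", "SignupEpoch": 0}
print(transform(row))
```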
What is
“Change Data Capture”
In a change data capture approach, each change in a source system is captured and
recorded in a data store. This is helpful in cases where it is important to know all changes
over time and not just the state of the database at the time of data extraction.
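A minimal sketch of the idea, assuming an in-memory list as the change log. The ChangeEvent type, its fields, and the replay helper are hypothetical names chosen for illustration, not part of any particular CDC product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    seq: int                 # position of the change in the log
    op: str                  # "insert", "update", or "delete"
    key: str
    value: Optional[dict]    # None for deletes


def replay(change_log, up_to_seq=None):
    """Rebuild the table state as of a given point in the change log."""
    state = {}
    for event in change_log:
        if up_to_seq is not None and event.seq > up_to_seq:
            break
        if event.op == "delete":
            state.pop(event.key, None)
        else:
            state[event.key] = event.value
    return state


log = [
    ChangeEvent(1, "insert", "user:1", {"email": "a@example.com"}),
    ChangeEvent(2, "update", "user:1", {"email": "b@example.com"}),
    ChangeEvent(3, "delete", "user:1", None),
]

print(replay(log))               # state at extraction time
print(replay(log, up_to_seq=2))  # state before the delete
```

A snapshot taken after the third change would show no trace of the user, while the change log still records that the row existed and how it changed.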
What is it and how does it work
Tumbling Window
- Engine collects all events within a fixed duration (e.g., 5 minutes).
- At the end of the window, it triggers a computation, like SUM, AVG, COUNT, etc.
- Then the window closes and the next one starts immediately, so windows never overlap.
Example: 12:00–12:05 → count all login events → output: COUNT = 200
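The steps above can be sketched in a few lines. This is a simplified batch version that assumes event timestamps are integer seconds and that all events are already available; a real streaming engine would also handle late data and triggers. The function name tumbling_counts is hypothetical.

```python
from collections import Counter

WINDOW_SECONDS = 300  # 5-minute window, as in the card's example


def tumbling_counts(event_times, window=WINDOW_SECONDS):
    """Count events per fixed window, keyed by window start time."""
    counts = Counter()
    for t in event_times:
        # Each event belongs to exactly one window.
        window_start = (t // window) * window
        counts[window_start] += 1
    return dict(sorted(counts.items()))


events = [0, 10, 299, 300, 450, 900]
print(tumbling_counts(events))  # {0: 3, 300: 2, 900: 1}
```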
What is it and how does it work
Sliding Window
- Engine maintains overlapping windows.
- Each window slides forward (e.g., every 1 min) and spans a certain duration (e.g., 5 min).
- Each event may be included in multiple windows.
- Aggregation is triggered for each window, usually periodically.
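A rough sketch of the same idea, assuming integer-second timestamps, windows of the form [start, start + size), and window starts aligned to multiples of the slide interval. The function name sliding_counts is hypothetical, and a streaming engine would compute this incrementally rather than over a finished list.

```python
from collections import Counter


def sliding_counts(event_times, size=300, slide=60):
    """Count events per overlapping window, keyed by window start time."""
    counts = Counter()
    for t in event_times:
        # Latest window start at or before t, aligned to the slide interval.
        start = (t // slide) * slide
        # Walk back over every earlier window that still covers t.
        while start > t - size:
            counts[start] += 1
            start -= slide
    return dict(sorted(counts.items()))


print(sliding_counts([130, 190]))
```

With a 5-minute window sliding every minute, each event is counted in size / slide = 5 windows, which is the overlap the card describes. Windows with a negative start simply began before time 0.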
What are they and what is the difference between
Hot Path and Cold Path Ingestion
Hot Path
- Real-time or near-real-time data ingestion and processing.
- Focused on low latency.
- Used when data needs to be acted on immediately.
Cold Path
- Batch ingestion with higher latency (minutes, hours, or days).
- Optimized for throughput and cost, not speed.
- Data is processed in bulk for historical analysis or reporting.