Chapter 3 Flashcards

Designing Data Pipelines (10 cards)

1
Q

Expand the abbreviation

DAG

A

Directed acyclic graph

2
Q

Expand the abbreviation and elaborate on the process

ETL

A

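Extract, transform, load. Data is extracted from the source system, transformed into the
target structure, and only then loaded into the destination data store.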

3
Q

Expand the abbreviation

ELT

A

Extract, load, transform. Data is loaded into the destination data store first and
transformed there afterward.
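The difference between ETL and ELT is only the order of the last two steps. A minimal sketch of that ordering, assuming a hypothetical in-memory `warehouse` dict and made-up sample rows (none of these names come from the cards):

```python
def extract():
    # Hypothetical raw rows as they might arrive from a source system.
    return [{"name": " ada ", "amount": "10"}, {"name": "grace", "amount": "5"}]


def transform(rows):
    # Clean names and convert amounts to integers.
    return [
        {"name": r["name"].strip().title(), "amount": int(r["amount"])}
        for r in rows
    ]


def etl(warehouse):
    # ETL: transform before loading, so only clean data reaches the target.
    warehouse["sales"] = transform(extract())


def elt(warehouse):
    # ELT: load the raw data first, then transform it inside the target.
    warehouse["raw_sales"] = extract()
    warehouse["sales"] = transform(warehouse["raw_sales"])
```

Both functions end with the same `sales` table; only `elt` also keeps the raw rows in the target.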

4
Q

Name the four types of stages in a data pipeline

A

  1. Ingestion
  2. Transformation
  3. Storage
  4. Analysis

5
Q

What is

Ingestion

In terms of data pipeline stages

A

The process of bringing data into the Google Cloud Platform (GCP) environment.

6
Q

What is

Transformation

In terms of data pipeline stages

A

Transformation is the process of mapping data from the structure used in the source system
to the structure used in the storage and analysis stages of the data pipeline.
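As an illustration of mapping a source structure to a target structure, here is a minimal sketch. The source field names (`CUST_ID`, `FIRST_NM`, `LAST_NM`, `SIGNUP_DT`) and the target layout are hypothetical, chosen only for the example:

```python
from datetime import datetime


def transform(source_row):
    """Map one source-system row to the structure used downstream."""
    return {
        # Source stores the ID as text; the target wants an integer.
        "customer_id": int(source_row["CUST_ID"]),
        # Source splits the name into two padded fields; the target wants one.
        "full_name": f'{source_row["FIRST_NM"].strip()} {source_row["LAST_NM"].strip()}',
        # Source uses DD/MM/YYYY; the target uses ISO 8601 dates.
        "signup_date": datetime.strptime(source_row["SIGNUP_DT"], "%d/%m/%Y")
        .date()
        .isoformat(),
    }
```

Typical transformation work is of this kind: type conversion, merging or splitting fields, and normalizing formats.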

7
Q

What is

“Change Data Capture”

A

In a change data capture approach, each change in a source system is captured and
recorded in a data store. This is helpful in cases where it is important to know all changes
over time and not just the state of the database at the time of data extraction.
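The idea can be sketched as an append-only log of changes that can be replayed to any point in time. This is a toy in-memory model, not how any particular CDC tool works; `ChangeEvent`, `ChangeLog`, and `state_at` are hypothetical names:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    seq: int               # position in the change log
    op: str                # "insert", "update", or "delete"
    key: str
    value: Optional[dict]  # new row contents; None for deletes


class ChangeLog:
    def __init__(self):
        self.events = []

    def record(self, op, key, value=None):
        """Append one captured change; nothing is ever overwritten."""
        event = ChangeEvent(len(self.events) + 1, op, key, value)
        self.events.append(event)
        return event

    def state_at(self, seq):
        """Rebuild the table as it looked after change number `seq`."""
        state = {}
        for event in self.events:
            if event.seq > seq:
                break
            if event.op == "delete":
                state.pop(event.key, None)
            else:
                state[event.key] = event.value
        return state
```

A plain extract taken after the last change would show an empty table; the log still shows that the row existed and how it changed.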

8
Q

What is it and how does it work

Tumbling Window

A

  • The engine collects all events within a fixed duration (e.g., 5 minutes).
  • At the end of the window, it triggers a computation such as SUM, AVG, or COUNT.
  • Then the window closes, and the next one starts.
  • Windows do not overlap, so each event belongs to exactly one window.

    Example: window 12:00–12:05 → count all login events → output: COUNT = 200
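The bucketing logic can be sketched in a few lines. This is a batch simulation over a list of timestamps (in seconds), not a streaming engine; `tumbling_window_counts` is a hypothetical helper name:

```python
from collections import Counter


def tumbling_window_counts(event_times, window_size):
    """Count events per fixed, non-overlapping window.

    Returns {window_start: count}; each window covers
    [window_start, window_start + window_size).
    """
    counts = Counter()
    for t in event_times:
        # Every timestamp maps to exactly one window.
        window_start = (t // window_size) * window_size
        counts[window_start] += 1
    return dict(sorted(counts.items()))
```

With 300-second (5-minute) windows, events at 0, 10, and 299 fall into the window starting at 0, while an event at 300 opens the next window.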
9
Q

What is it and how does it work

Sliding Window

A

  • The engine maintains overlapping windows.
  • Each window spans a fixed duration (e.g., 5 minutes), and a new window starts at a regular interval (e.g., every 1 minute).
  • Each event may be included in multiple windows.
  • Aggregation is triggered for each window, usually periodically.
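A minimal sketch of how one event lands in several overlapping windows. As a simplifying assumption, window starts are aligned to multiples of the slide interval and all times are in seconds; `sliding_window_counts` is a hypothetical helper, and the code is a batch simulation rather than a streaming engine:

```python
def sliding_window_counts(event_times, window_size, slide):
    """Count events in overlapping windows [start, start + window_size).

    Window starts are multiples of `slide`. Returns {window_start: count}.
    """
    counts = {}
    for t in event_times:
        # Latest aligned window start that is not after the event.
        start = (t // slide) * slide
        # Walk back through every earlier window that still covers t.
        while start > t - window_size:
            counts[start] = counts.get(start, 0) + 1
            start -= slide
    return dict(sorted(counts.items()))
```

With a 120-second window sliding every 60 seconds, each event is counted in two windows (window size divided by slide).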
10
Q

What are they and what is the difference between

Hot Path and Cold Path Ingestion

A

Hot Path
* Real-time or near-real-time data ingestion and processing.
* Focused on low latency.
* Used when data needs to be acted on immediately.

Cold Path
* Batch ingestion with higher latency (minutes, hours, or days).
* Optimized for throughput and cost, not speed.
* Data is processed in bulk for historical analysis or reporting.
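One way to picture the two paths side by side is a small router. This is a sketch under assumptions that are not in the card: events carry a hypothetical `urgent` flag, urgent events trigger an immediate alert (hot path), and every event is also buffered for bulk loading (cold path):

```python
class ColdPathBuffer:
    """Collects events and hands them off in bulk (batch ingestion)."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = []
        self.batches = []  # stands in for bulk loads into storage

    def add(self, event):
        self.pending.append(event)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Write out whatever has accumulated as one batch.
        if self.pending:
            self.batches.append(self.pending)
            self.pending = []


def ingest(event, cold_path, alerts):
    """Route one event: act at once if urgent, and keep it for batch analysis."""
    if event.get("urgent"):
        alerts.append(event["id"])  # hot path: handled immediately
    cold_path.add(event)            # cold path: processed later in bulk
```

The hot path reacts per event, while the cold path waits until a batch is full, which trades latency for throughput.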
