Designing data pipelines Flashcards
Chapter 3
What is a graph in the context of DAGs?
A graph is a set of nodes linked by edges.
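A minimal sketch of the idea, assuming a hypothetical four-stage pipeline (the stage names are illustrative, not from the chapter). In a DAG (directed acyclic graph) the edges have a direction and never form a cycle, so the nodes can always be put in a valid run order:

```python
from collections import deque

def topological_order(nodes, edges):
    """Return nodes in dependency order; raise ValueError if the graph has a cycle."""
    indegree = {n: 0 for n in nodes}
    downstream = {n: [] for n in nodes}
    for src, dst in edges:  # each edge points from a stage to the stage that depends on it
        downstream[src].append(dst)
        indegree[dst] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in downstream[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return order

# Hypothetical pipeline stages (nodes) and dependencies (edges)
nodes = ["ingest", "clean", "aggregate", "load"]
edges = [("ingest", "clean"), ("clean", "aggregate"), ("aggregate", "load")]
print(topological_order(nodes, edges))
```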
Three common types of data pipelines are:
- Data warehousing pipelines
- Stream processing pipelines
- Machine learning pipelines
When using GCP Cloud Dataproc for transformations
With Cloud Dataproc, transformations can be written in a Spark- or Hadoop-supported language.
When using GCP Cloud Dataflow for transformations
With Cloud Dataflow, you write transformations using the Apache Beam model, which provides a unified batch and stream processing model.
Apache Beam has explicit support for pipeline constructs including:
- Pipelines
- PCollection
- PTransform
What is a streaming window and what are its types?
A window is a set of consecutive data points in a stream. Windows have a fixed width and a way of advancing. Windows that advance by a number of data points less than the width of the window are called sliding windows; windows that advance by the length of the window are tumbling windows.
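The difference comes down to how far the window advances. A small sketch, assuming count-based windows over an in-memory list (the `windows` helper is hypothetical, not a library function):

```python
def windows(values, width, advance):
    """Yield windows of `width` items, moving forward `advance` items each time."""
    for start in range(0, len(values) - width + 1, advance):
        yield values[start:start + width]

data = [1, 2, 3, 4, 5, 6]

# Advance smaller than the width: windows overlap (sliding)
sliding = list(windows(data, width=3, advance=1))

# Advance equal to the width: windows do not overlap (tumbling)
tumbling = list(windows(data, width=3, advance=3))

print(sliding)   # [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
print(tumbling)  # [[1, 2, 3], [4, 5, 6]]
```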
When are sliding windows used?
Sliding windows are used when you want to show how an aggregate, such as the average of the last three values, changes over time.
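As a rough illustration of the "average of the last three values" case, here is a sketch that assumes the stream is a plain Python iterable (the function name is made up for this example):

```python
from collections import deque

def sliding_average(stream, width=3):
    """Yield the average of the most recent `width` values, once per new value."""
    window = deque(maxlen=width)  # the oldest value drops out automatically
    for value in stream:
        window.append(value)
        if len(window) == width:
            yield sum(window) / width

print(list(sliding_average([10, 20, 30, 40, 50])))  # [20.0, 30.0, 40.0]
```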
When would you use tumbling windows?
Tumbling windows are used when you want to aggregate data over a fixed period of time, for example, for the last minute.
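A sketch of per-minute aggregation, assuming each event is a hypothetical `(timestamp_in_seconds, value)` pair. Every event falls into exactly one fixed, non-overlapping window:

```python
from collections import defaultdict

def tumbling_sums(events, width_seconds=60):
    """Sum values per fixed window, keyed by the window's start time."""
    totals = defaultdict(int)
    for timestamp, value in events:
        window_start = timestamp - (timestamp % width_seconds)
        totals[window_start] += value
    return dict(totals)

# Hypothetical events: (seconds since start, value)
events = [(0, 5), (30, 7), (61, 1), (119, 2), (120, 9)]
print(tumbling_sums(events))  # {0: 12, 60: 3, 120: 9}
```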
GCP has several services that are commonly used components of pipelines, including:
- Cloud Pub/Sub
- Cloud Dataflow
- Cloud Dataproc
- Cloud Composer
What are messaging queues?
Messaging queues are used in distributed systems to decouple services in a pipeline. This allows one service to produce more output than the consuming service can process without adversely affecting the consuming service.
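The decoupling can be sketched with Python's standard `queue.Queue` standing in for a managed service such as Cloud Pub/Sub (an assumption for illustration only; this is not the Pub/Sub API). The producer finishes as fast as it can, and the consumer drains the backlog at its own pace:

```python
import queue
import threading

def producer(q, count):
    for i in range(count):
        q.put(i)   # returns immediately; the queue buffers the backlog
    q.put(None)    # sentinel: no more messages

def consumer(q, results):
    while True:
        item = q.get()
        if item is None:
            break
        results.append(item * 2)  # stand-in for slower processing

messages = queue.Queue()
results = []
worker = threading.Thread(target=consumer, args=(messages, results))
worker.start()
producer(messages, 5)  # the producer never waits on the consumer
worker.join()
print(results)  # [0, 2, 4, 6, 8]
```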