Stream Data Processing Flashcards

1
Q

Why can’t event stream data be stored directly as big data?

A

Storing each event individually would result in one small file per event -> too many files

2
Q

Event hub

A

A buffer that collects incoming events and stores them in batches.
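
The buffering idea can be sketched in a few lines. This is only an illustration, not the API of any real event hub product: the class name `BatchingBuffer`, the `sink` callback, and the batch size of 3 are all assumptions made for the example.

```python
from typing import Any, Callable, List


class BatchingBuffer:
    """Hypothetical buffer: collects events and hands them off in batches."""

    def __init__(self, batch_size: int, sink: Callable[[List[Any]], None]) -> None:
        self.batch_size = batch_size
        self.sink = sink                # called once per full batch
        self._pending: List[Any] = []

    def publish(self, event: Any) -> None:
        """Accept one event; flush automatically when the batch is full."""
        self._pending.append(event)
        if len(self._pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Write out whatever is pending as one batch (one file, not many)."""
        if self._pending:
            self.sink(self._pending)
            self._pending = []


stored_batches: List[List[Any]] = []
event_buffer = BatchingBuffer(batch_size=3, sink=stored_batches.append)
for event_id in range(7):
    event_buffer.publish({"id": event_id})
event_buffer.flush()                    # write the final, partial batch

print([len(batch) for batch in stored_batches])  # [3, 3, 1]
```

Seven events end up in three stored batches instead of seven tiny files.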

3
Q

The different kinds of stream processing

A

1 - stream data integration

2 - stream analytics

4
Q

Stream data integration

A

Focuses on ingesting and processing data from the sources, targeting ETL use cases

5
Q

Stream analytics

A

Targets analytics use cases. Calculates aggregates and detects patterns.

6
Q

Native streaming

A

Events are processed as they arrive -> lowest latency, but fault tolerance is expensive

7
Q

Window

A

A bounded portion of the stream on which computations are performed

8
Q

Three types of windows

A

1 - fixed/tumbling windows
2 - sliding/hopping windows
3 - session windows

9
Q

Fixed/tumbling windows

A

The window closes once it is full, based on either the count of items or the elapsed time; consecutive windows do not overlap
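
A count-based tumbling window can be sketched as follows. The function name `tumbling_windows` and the choice to emit a final partial window are assumptions made for this example.

```python
from typing import Iterable, Iterator, List


def tumbling_windows(events: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive, non-overlapping windows of `size` events."""
    window: List[int] = []
    for event in events:
        window.append(event)
        if len(window) == size:   # window is full: emit it and start a new one
            yield window
            window = []
    if window:                    # assumption: also emit the last partial window
        yield window


tumbling_result = list(tumbling_windows(range(1, 8), size=3))
print(tumbling_result)  # [[1, 2, 3], [4, 5, 6], [7]]
```

Every event lands in exactly one window, which is what distinguishes tumbling from sliding windows.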

10
Q

Sliding/hopping windows

A

Defined by a window length plus a sliding interval: a new window starts every sliding interval, so windows can overlap
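
A time-based sliding (hopping) window can be sketched like this. The function name `hopping_windows`, the integer timestamps, and the convention that window starts are multiples of the slide beginning at 0 are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def hopping_windows(
    events: Iterable[Tuple[int, str]], length: int, slide: int
) -> Dict[int, List[str]]:
    """Map each window start to the values whose timestamp falls in
    [start, start + length). Window starts are multiples of `slide`."""
    windows: Dict[int, List[str]] = defaultdict(list)
    for timestamp, value in events:
        # Latest window start that is <= timestamp.
        start = (timestamp // slide) * slide
        # Walk back through every earlier window that still covers the event.
        while start > timestamp - length and start >= 0:
            windows[start].append(value)
            start -= slide
    return dict(sorted(windows.items()))


hopping_result = hopping_windows(
    [(0, "a"), (1, "b"), (2, "c"), (5, "d")], length=4, slide=2
)
print(hopping_result)  # {0: ['a', 'b', 'c'], 2: ['c', 'd'], 4: ['d']}
```

Because the slide (2) is shorter than the window length (4), events "c" and "d" each appear in two windows.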

11
Q

Session windows

A

Sequences of temporally related events terminated by a gap of inactivity
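
Session windows can be sketched by splitting on inactivity gaps. The function name `session_windows` and the rule that a gap strictly larger than `gap` ends a session are assumptions made for this example.

```python
from typing import Iterable, List


def session_windows(timestamps: Iterable[int], gap: int) -> List[List[int]]:
    """Group timestamps into sessions; a pause longer than `gap` ends a session."""
    sessions: List[List[int]] = []
    for timestamp in sorted(timestamps):
        if sessions and timestamp - sessions[-1][-1] <= gap:
            sessions[-1].append(timestamp)   # still within the current session
        else:
            sessions.append([timestamp])     # inactivity gap: start a new session
    return sessions


session_result = session_windows([1, 2, 3, 10, 11, 30], gap=5)
print(session_result)  # [[1, 2, 3], [10, 11], [30]]
```

Unlike tumbling and sliding windows, the window boundaries here depend on the data itself rather than on a fixed size.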

12
Q

Which two kinds of queries are there?

A

1 - ad-hoc queries

2 - standing queries

13
Q

Standing queries

A

Queries that are stored and executed continuously as new data arrives

14
Q

Ad-hoc queries

A

One-time questions
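
The contrast between standing and ad-hoc queries can be sketched as follows. The `StreamEngine` class and its methods are hypothetical, and keeping the full event history in memory is a simplification made only so the ad-hoc query has something to run against.

```python
from typing import Any, Callable, Dict, List, Tuple

Event = Dict[str, Any]
Predicate = Callable[[Event], bool]


class StreamEngine:
    """Hypothetical engine supporting both kinds of queries."""

    def __init__(self) -> None:
        self._standing: List[Tuple[Predicate, Callable[[Event], None]]] = []
        self.history: List[Event] = []   # assumption: all events are retained

    def register(self, predicate: Predicate, on_match: Callable[[Event], None]) -> None:
        """Standing query: stored once, evaluated for every future event."""
        self._standing.append((predicate, on_match))

    def ingest(self, event: Event) -> None:
        self.history.append(event)
        for predicate, on_match in self._standing:
            if predicate(event):
                on_match(event)

    def ad_hoc(self, predicate: Predicate) -> List[Event]:
        """Ad-hoc query: asked once, answered from the data seen so far."""
        return [event for event in self.history if predicate(event)]


engine = StreamEngine()
alerts: List[Event] = []
engine.register(lambda e: e["temp"] > 30, alerts.append)   # standing query
for temp in [25, 35, 28, 40]:
    engine.ingest({"temp": temp})
cold = engine.ad_hoc(lambda e: e["temp"] < 28)              # ad-hoc query

print(alerts)  # [{'temp': 35}, {'temp': 40}]
print(cold)    # [{'temp': 25}]
```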

15
Q

Main differences between batch processing and stream processing?

A

1 - In stream processing, the input is not controlled by the system

2 - In stream processing, the input timing/rate is often unknown

16
Q

Multi-query optimization

A

Sharing common subqueries across queries so that their results are not recomputed for each query
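
A minimal sketch of the idea, assuming two queries that both need the same filter: the filter is the shared subquery and is evaluated once. The names `is_hot`, `readings`, and the threshold of 30 are made up for the example.

```python
calls = {"is_hot": 0}   # counts how often the shared filter is evaluated


def is_hot(reading: int) -> bool:
    calls["is_hot"] += 1
    return reading > 30


readings = [25, 31, 35, 28, 40]

# Shared subquery: evaluated once, then reused by both queries below.
hot_readings = [r for r in readings if is_hot(r)]

hot_count = len(hot_readings)   # query 1: how many hot readings?
hottest = max(hot_readings)     # query 2: what is the hottest reading?

print(hot_count, hottest, calls["is_hot"])  # 3 40 5
```

Running the two queries independently would evaluate the filter 10 times instead of 5.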

17
Q

Ways to handle overload?

A
  • Back pressure
  • Load shedding
  • Distributed stream processing
18
Q

Back pressure

A

Slows down the sources to avoid data loss (for example by using blocking queues)
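
One way to picture back pressure is a bounded buffer between producer and consumer: when the buffer is full, the producer is blocked until the consumer catches up, so nothing is dropped. This sketch uses Python's standard `queue.Queue`; the function name `run_pipeline`, the `None` end marker, and the capacity of 2 are assumptions made for the example.

```python
import queue
import threading
from typing import List


def run_pipeline(n_events: int, capacity: int) -> List[int]:
    """Send n_events through a bounded buffer and return what was received."""
    buffer: "queue.Queue" = queue.Queue(maxsize=capacity)
    received: List[int] = []

    def producer() -> None:
        for event in range(n_events):
            buffer.put(event)   # blocks while the buffer is full (back pressure)
        buffer.put(None)        # assumption: None marks the end of the stream

    def consumer() -> None:
        while True:
            event = buffer.get()
            if event is None:
                break
            received.append(event)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return received


delivered = run_pipeline(n_events=10, capacity=2)
print(delivered)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The buffer never holds more than 2 events, yet all 10 arrive in order.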

19
Q

Load shedding variants

A

1 - random sampling-based shedding
2 - relevance-based shedding
3 - summary-based shedding

20
Q

Random-sampling-based load shedding

A

Take a random sample and provide an output based on it (an approximation)
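
A rough sketch of random-sampling-based shedding, assuming each event is kept independently with a fixed probability and the result is scaled back up. The function name `shed_randomly`, the keep probability of 0.1, and the seed are assumptions made for the example.

```python
import random
from typing import List


def shed_randomly(events: List[int], keep_probability: float, rng: random.Random) -> List[int]:
    """Keep each event independently with the given probability; drop the rest."""
    return [event for event in events if rng.random() < keep_probability]


rng = random.Random(42)          # fixed seed only to make the sketch repeatable
events = list(range(10_000))
kept = shed_randomly(events, keep_probability=0.1, rng=rng)

# Scale the sampled count back up to approximate the true count.
estimated_count = len(kept) / 0.1
print(len(kept), estimated_count)
```

Only about a tenth of the events are processed, and the estimate is close to, but generally not exactly, the true count of 10,000.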

21
Q

Relevance-based load shedding

A

Use an algorithm that determines which data items are relevant; the ones that are not relevant are removed

22
Q

Summary-based load shedding

A

Instead of passing on the queued items themselves, a summary of the queue is computed and used as input for the next step
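
As an illustration, a queue of numeric readings could be collapsed into a few statistics before being handed on. Which statistics make up the summary is an assumption here; the `Summary` class and `summarize` function are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Summary:
    count: int
    total: float
    minimum: float
    maximum: float

    @property
    def mean(self) -> float:
        return self.total / self.count


def summarize(pending: List[float]) -> Summary:
    """Replace a non-empty queue of readings with one compact summary record."""
    return Summary(
        count=len(pending),
        total=sum(pending),
        minimum=min(pending),
        maximum=max(pending),
    )


pending_queue = [4.0, 8.0, 6.0]
queue_summary = summarize(pending_queue)
print(queue_summary)       # Summary(count=3, total=18.0, minimum=4.0, maximum=8.0)
print(queue_summary.mean)  # 6.0
```

The next step receives one record instead of three, at the cost of losing the individual values.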

23
Q

Distributed stream processing

A

Distribute the stream based on data flow (distributes the query) or on key range (distributes the data stream itself)
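
The key-range variant can be sketched with sorted boundaries and a binary search. The boundaries `"h"` and `"p"`, the three resulting partitions, and the function names are assumptions made for the example.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed key ranges: keys < "h" -> partition 0,
# "h" <= keys < "p" -> partition 1, keys >= "p" -> partition 2.
BOUNDARIES = ["h", "p"]


def partition_for(key: str) -> int:
    """Return the index of the key range that contains `key`."""
    return bisect.bisect_right(BOUNDARIES, key)


def distribute(events: List[Tuple[str, int]]) -> Dict[int, List[Tuple[str, int]]]:
    """Split one stream into per-partition streams based on the event key."""
    partitions: Dict[int, List[Tuple[str, int]]] = defaultdict(list)
    for key, payload in events:
        partitions[partition_for(key)].append((key, payload))
    return dict(partitions)


assignment = distribute([("alice", 1), ("henry", 2), ("zoe", 3), ("bob", 4)])
print(assignment)
# {0: [('alice', 1), ('bob', 4)], 1: [('henry', 2)], 2: [('zoe', 3)]}
```

All events with the same key always go to the same partition, so per-key state can stay local to one worker.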

24
Q

Three options for delivery guarantees

A

1 - at most once -> can cause data loss on failures
2 - at least once -> might create incorrect states because data can be processed multiple times
3 - exactly once -> every event is reflected in the state exactly one time
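
The at-least-once problem can be made concrete with a redelivered event. The second function shows one common way to get exactly-once effects on top of at-least-once delivery, namely deduplicating by event id; this technique is an assumption added for illustration, and the event ids and amounts are made up.

```python
from typing import List, Set, Tuple

Delivery = Tuple[str, int]   # (event id, amount)


def apply_naively(deliveries: List[Delivery]) -> int:
    """Apply every delivery; a redelivered event is counted twice."""
    total = 0
    for _event_id, amount in deliveries:
        total += amount
    return total


def apply_idempotently(deliveries: List[Delivery]) -> int:
    """Skip event ids that were already applied, so duplicates have no effect."""
    seen: Set[str] = set()
    total = 0
    for event_id, amount in deliveries:
        if event_id in seen:
            continue
        seen.add(event_id)
        total += amount
    return total


# "e2" is delivered twice, as can happen after a retry under at-least-once.
deliveries = [("e1", 10), ("e2", 5), ("e2", 5), ("e3", 1)]
naive_total = apply_naively(deliveries)
deduplicated_total = apply_idempotently(deliveries)

print(naive_total)         # 21 (incorrect state)
print(deduplicated_total)  # 16
```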