Batch and Stream Processing Flashcards

(37 cards)

1
Q

What are the three classes of system?

A
  1. Services (online systems)
  2. Batch processing systems (offline systems)
  3. Stream processing systems (near-real-time systems)
2
Q

What are services?

A

A service waits for a request, handles it as quickly as possible, and sends back a response. The primary performance metrics are response time and availability.

3
Q

What is a Batch Processing System?

A

A batch processing system takes a large amount of input data, runs a job to process it, and produces some output data. Jobs can take from minutes to days. Batch jobs are often scheduled to run periodically. The primary performance measure of a batch job is usually throughput.

4
Q

What is a Stream Processing System?

A

Stream processing sits somewhere between online and offline (batch) processing. Like a batch system, a stream processor consumes inputs and produces outputs rather than responding to requests, but it aims to operate on events shortly after they happen.

5
Q

What is the Unix philosophy? (4 ideas)

A
  1. Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new features.
  2. Expect the output of every program to become the input to another.
  3. Design and build software, even operating systems, to be tried early.
  4. Use tools in preference to unskilled help.
6
Q

What is MapReduce?

A

MapReduce is a programming model for batch processing large datasets. A MapReduce job takes one or more inputs and produces one or more outputs.

7
Q

What are the two components to MapReduce?

A

The mapper and the reducer.

8
Q

What is the Mapper’s functionality in MapReduce?

A

The mapper is called once for every input record. Its job is to extract the key and value from that record. Each input record is handled independently of the others.

9
Q

What is the Reducer’s functionality in MapReduce?

A

The MapReduce framework takes the key-value pairs produced by the mappers, collects all the values belonging to the same key, and calls the reducer with an iterator over that collection of values. The reducer then produces the output records.
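
The mapper and reducer roles can be sketched in plain Python. This is a single-process illustration only, not how a real MapReduce framework is implemented; the word-count task and the names `mapper`, `reducer`, and `run_mapreduce` are assumptions chosen for the example.

```python
from collections import defaultdict

def mapper(record):
    # Called once per input record; emits (key, value) pairs.
    for word in record.split():
        yield word.lower(), 1

def reducer(key, values):
    # Called once per key with an iterator over all values for that key.
    return key, sum(values)

def run_mapreduce(records, mapper, reducer):
    # Stand-in for the framework: group mapper output by key,
    # then call the reducer once per key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, iter(values)) for key, values in sorted(groups.items())]

result = run_mapreduce(["the cat", "The dog the end"], mapper, reducer)
print(result)
```

Because each record is mapped independently and each key is reduced independently, both phases could be spread over many machines.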

10
Q

What is the key advantage of MapReduce?

A

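Parallelism: the computation can be spread across many machines without the programmer writing code to handle the parallelism explicitly.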

11
Q

What is a MapReduce workflow?

A

A chain of MapReduce jobs in which the output of one job is used as the input to the next job.
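
As a rough sketch of chaining, the two-job workflow below first counts views per URL and then groups URLs by their view count. The in-memory `run_job` helper, the log format, and all function names are assumptions made for illustration.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    # Minimal in-memory stand-in for one MapReduce job.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]

# Job 1: count page views per URL.
def map_views(line):
    _user, url = line.split()
    yield url, 1

def reduce_views(url, counts):
    return url, sum(counts)

# Job 2: group URLs by their view count, using job 1's output as input.
def map_by_count(pair):
    url, count = pair
    yield count, url

def reduce_by_count(count, urls):
    return count, sorted(urls)

log = ["alice /home", "bob /home", "alice /about", "carol /docs"]
views = run_job(log, map_views, reduce_views)
by_count = run_job(views, map_by_count, reduce_by_count)
print(by_count)
```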

12
Q

What are joins in a batch processing context?

A

A join in a batch job resolves all occurrences of some association within a dataset. Rather than looking up the data for one particular user, the job processes the data for all users at once.
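
A minimal sketch of such a join, assuming two hypothetical datasets (user profiles and activity events) that share a user ID. Records are grouped by the join key, as a reducer would see them, and every event is paired with the matching profile.

```python
from collections import defaultdict

# Hypothetical inputs: (user_id, birth_year) and (user_id, url).
users = [(1, 1990), (2, 1985)]
events = [(1, "/home"), (2, "/cart"), (1, "/docs")]

def join_by_user(users, events):
    # Group both datasets by the join key (user_id).
    groups = defaultdict(lambda: {"birth_year": None, "urls": []})
    for user_id, birth_year in users:
        groups[user_id]["birth_year"] = birth_year
    for user_id, url in events:
        groups[user_id]["urls"].append(url)
    # Resolve the association for every user in one pass.
    joined = []
    for user_id in sorted(groups):
        group = groups[user_id]
        for url in group["urls"]:
            joined.append((url, group["birth_year"]))
    return joined

joined = join_by_user(users, events)
print(joined)
```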

13
Q

What assumption is applied to batch processing?

A

It assumes the input data is bounded, so the job knows when it has finished reading its input.

14
Q

What are data records in stream processing?

A

Data records are called events. Each is a small, self-contained, immutable object containing details of something that happened at some point in time.

15
Q

What is a messaging system?

A

A common approach for notifying consumers when new events appear. A producer sends a message containing the event, which is then pushed to consumers.

16
Q

What can a system do when producers send messages faster than consumers can process them?

A
  1. Drop messages
  2. Buffer messages in a queue
  3. Apply backpressure (flow control)
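
The three options can be illustrated with Python's standard `queue` module. This is only a sketch: the queue sizes, the timeout, and the helper names are assumptions, and no consumer is running, so the bounded queue stays full.

```python
import queue

def send_or_drop(q, message):
    # Option 1: drop the message if the queue is full.
    try:
        q.put_nowait(message)
        return True
    except queue.Full:
        return False

def send_with_backpressure(q, message, timeout):
    # Option 3: block the producer until there is room (or give up
    # after `timeout` seconds so this demo cannot hang).
    try:
        q.put(message, timeout=timeout)
        return True
    except queue.Full:
        return False

bounded = queue.Queue(maxsize=2)
accepted = [send_or_drop(bounded, i) for i in range(4)]

# Option 2: buffer in a queue; maxsize=0 means the queue is unbounded.
unbounded = queue.Queue()
for i in range(4):
    unbounded.put(i)

# The bounded queue is still full and nothing consumes from it,
# so the producer is held back and the send times out.
sent = send_with_backpressure(bounded, 99, timeout=0.01)
print(accepted, unbounded.qsize(), sent)
```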
17
Q

What is a message broker?

A

A kind of database that is optimised for handling message streams.

18
Q

How is a message broker used?

A

Producers and consumers connect to it as clients. Producers write messages to the broker, and consumers receive them by reading them from the broker.

19
Q

What is the advantage of a message broker?

A

It can more easily tolerate clients that come and go (connect, disconnect, and crash).

20
Q

What are the key characteristics of Message Brokers?

A
  1. A message is automatically deleted once it has been successfully delivered.
  2. Queues are assumed to be short.
  3. Clients have some way of subscribing to a subset of topics.
21
Q

What are the two patterns for delivering messages when multiple consumers read from the same topic?

A
  1. Load balancing
  2. Fan-out
22
Q

What is Load Balancing?

A

Each message is delivered to one of the consumers. This allows consumers to share the work of processing the messages.

23
Q

What is Fan-Out?

A

Each message is delivered to all of the consumers. This allows several independent consumers to subscribe to the same broadcast of messages.
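
The difference between the two patterns can be sketched as follows. Round-robin assignment is just one possible load-balancing policy and is an assumption here, as are the function and consumer names.

```python
from itertools import cycle

def load_balance(messages, consumers):
    # Each message goes to exactly one consumer (round-robin here).
    inboxes = {name: [] for name in consumers}
    for message, name in zip(messages, cycle(consumers)):
        inboxes[name].append(message)
    return inboxes

def fan_out(messages, consumers):
    # Every consumer receives its own copy of every message.
    return {name: list(messages) for name in consumers}

messages = ["m1", "m2", "m3"]
balanced = load_balance(messages, ["a", "b"])
broadcast = fan_out(messages, ["a", "b"])
print(balanced)
print(broadcast)
```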

24
Q

How is the receipt of messages checked?

A

A client must explicitly tell the broker when it has finished processing a message (an acknowledgement). If the broker does not receive the acknowledgement, it assumes the message was not processed and delivers it again to another client.
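
A toy model of acknowledgement and redelivery is sketched below. The `Broker` class and its methods are hypothetical and do not correspond to any real broker's API; a real broker would detect a failed consumer through a closed connection or a timeout.

```python
from collections import deque

class Broker:
    def __init__(self, messages):
        self.pending = deque(messages)  # not yet delivered
        self.in_flight = {}             # message -> consumer awaiting ack

    def deliver(self, consumer):
        # Hand the next pending message to a consumer and remember
        # that it is unacknowledged.
        if not self.pending:
            return None
        message = self.pending.popleft()
        self.in_flight[message] = consumer
        return message

    def ack(self, message):
        # The consumer finished processing; the message can be forgotten.
        self.in_flight.pop(message, None)

    def consumer_failed(self, consumer):
        # No ack will arrive: put the consumer's messages back for redelivery.
        for message, owner in list(self.in_flight.items()):
            if owner == consumer:
                del self.in_flight[message]
                self.pending.appendleft(message)

broker = Broker(["m1", "m2"])
first = broker.deliver("consumer-a")       # "m1" goes to consumer-a
broker.consumer_failed("consumer-a")       # consumer-a crashes before acking
redelivered = broker.deliver("consumer-b") # "m1" is delivered again
broker.ack(redelivered)
nxt = broker.deliver("consumer-b")
print(first, redelivered, nxt)
```

One consequence visible even in this sketch is that redelivery can change the order in which messages are processed.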

25
Q

What can be done with the events in a stream?

A
  1. Store the events somewhere.
  2. Send the events to users.
  3. Process them and pipe the results into another stream.
26
Q

What is a stream operator/job?

A

A piece of code that processes streams to produce other, derived streams.

27
Q

Why is fault tolerance harder for stream jobs than for batch jobs?

A

A batch job that fails can simply be restarted from the beginning. Restarting a stream job from the beginning after a crash may not be viable.

28
Q

What are some use cases for stream processing?

A
  1. Fraud detection systems
  2. Trading systems
  3. Manufacturing systems
  4. Military and intelligence systems
29
Q

How does complex event processing work?

A

Complex event processing systems use a declarative query language like SQL to describe the patterns of events that should be detected. These queries are submitted to a processing engine that consumes the input streams and internally maintains a state machine that performs the required matching. When a match is found, the engine emits a complex event with details of the event pattern.

30
Q

What types of stream analytics are commonly used?

A
  1. Measuring the rate of some type of event.
  2. Calculating the rolling average of a value over some time period.
  3. Comparing current statistics to previous time intervals.
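The rolling average can be sketched as below, assuming events arrive in timestamp order and that the window covers the last `window_seconds` seconds. The `RollingAverage` class is hypothetical and written only for illustration.

```python
from collections import deque

class RollingAverage:
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.samples = deque()  # (timestamp, value), oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        # Add the new sample, then evict samples that have fallen
        # out of the time window ending at `timestamp`.
        self.samples.append((timestamp, value))
        self.total += value
        while self.samples[0][0] <= timestamp - self.window:
            _, old_value = self.samples.popleft()
            self.total -= old_value
        return self.total / len(self.samples)

avg = RollingAverage(window_seconds=10)
averages = [avg.add(0, 10), avg.add(5, 20), avg.add(12, 30)]
print(averages)
```

Dividing the number of samples in the window by the window length would give the event rate from the same data structure.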
31
Q

What is the difference between event time and processing time?

A

Event time is the time at which the event actually took place, whereas processing time is the time at which a system observes or processes the event.

32
Q

How many timestamps are stored for an event?

A

Three.

33
Q

What are the three timestamps stored for an event?

A
  1. The time the event occurred, according to the device clock.
  2. The time the event was sent to the server, according to the device clock.
  3. The time the event was received by the server, according to the server clock.
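One way the three timestamps can be used is to estimate the offset between the device clock and the server clock and apply it to the event time. The sketch below assumes network delay is negligible compared with the required accuracy; the function name and sample values are hypothetical.

```python
def estimate_event_time(occurred_device, sent_device, received_server):
    # Offset between server clock and device clock, assuming the
    # message took negligible time to travel over the network.
    clock_offset = received_server - sent_device
    # Shift the device's event timestamp onto the server's clock.
    return occurred_device + clock_offset

# Device clock is 840 seconds behind the server clock in this example.
estimated = estimate_event_time(
    occurred_device=100.0, sent_device=160.0, received_server=1000.0
)
print(estimated)
```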
34
Q

What are the three different types of time windows?

A
  1. Tumbling window
  2. Hopping window
  3. Sliding window
35
Q

What are the characteristics of a tumbling window?

A

It has a fixed length, and each event belongs to exactly one window.

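A minimal sketch of tumbling-window assignment, assuming integer timestamps in seconds and windows aligned to multiples of the window size. The function names are illustrative.

```python
from collections import defaultdict

def tumbling_window_start(timestamp, size):
    # Round the timestamp down to the start of its window.
    return timestamp - timestamp % size

def count_per_window(timestamps, size):
    # Each event is counted in exactly one window.
    counts = defaultdict(int)
    for timestamp in timestamps:
        counts[tumbling_window_start(timestamp, size)] += 1
    return dict(counts)

counts = count_per_window([0, 30, 59, 60, 125], size=60)
print(counts)
```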
36
Q

What are the main characteristics of a hopping window?

A

It has a fixed length, and windows are allowed to overlap in order to provide some smoothing. The amount of overlap is determined by the hop size.

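Because hopping windows overlap, one event can fall into several of them. The sketch below lists the start times of every window containing a given timestamp, assuming integer timestamps, windows of the form [start, start + size), and window starts aligned to multiples of the hop size.

```python
def hopping_window_starts(timestamp, size, hop):
    # The latest window that contains the event starts at the most
    # recent multiple of `hop`; earlier windows start one hop apart.
    start = timestamp - timestamp % hop
    starts = []
    while start > timestamp - size:
        starts.append(start)
        start -= hop
    return sorted(starts)

# 5-minute windows (300 s) that hop every minute (60 s).
starts = hopping_window_starts(430, size=300, hop=60)
print(starts)
```

When the hop size equals the window size, each event falls into a single window and the scheme reduces to a tumbling window.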
37
Q

What are the main characteristics of a sliding window?

A

A sliding window contains all the events that occur within some time interval of each other. A window does not exist if there are no events in that time period.

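A sliding window can be sketched by keeping a buffer of events ordered by time and removing old events once they fall outside the interval. The version below assumes numeric timestamps and yields the window contents as each event arrives; the function name is illustrative.

```python
from collections import deque

def sliding_windows(timestamps, interval):
    # For each event, yield all buffered events that occurred less
    # than `interval` before it (including the event itself).
    window = deque()
    for timestamp in sorted(timestamps):
        window.append(timestamp)
        while window[0] <= timestamp - interval:
            window.popleft()
        yield list(window)

windows = list(sliding_windows([0, 3, 4, 10], interval=5))
print(windows)
```

A window is only produced when an event arrives, which matches the point that no window exists for a period without events.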