Batch and Stream Processing Flashcards
(37 cards)
What are the three classes of system?
- Services (online systems)
- Batch processing systems (offline systems)
- Stream processing systems (near-real-time systems)
What are services?
A service waits for a request, handles it as quickly as possible, and sends back a response. The primary performance metrics are response time and availability.
What is a Batch Processing System?
A batch processing system takes a large amount of input data, runs a job to process it, and produces some output data. Jobs can take from minutes to days. Batch jobs are often scheduled to run periodically. The primary performance measure of a batch job is usually throughput.
What is a Stream Processing System?
Stream processing sits somewhere between online and offline (batch) processing. Like a batch job, a stream processor consumes inputs and produces outputs rather than responding to requests, but it operates on events shortly after they happen instead of on a fixed set of input data.
What is the Unix philosophy? (4 ideas)
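- Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new "features".
- Expect the output of every program to become the input to another, as yet unknown, program.
- Design and build software, even operating systems, to be tried early, ideally within weeks. Don't hesitate to throw away the clumsy parts and rebuild them.
- Use tools in preference to unskilled help to lighten a programming task.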
What is MapReduce?
MapReduce is a programming model for batch processing large datasets across many machines. A single MapReduce job takes one or more inputs and produces one or more outputs.
What are the two components to MapReduce?
The Mapper and the Reducer
What is the Mapper’s functionality in MapReduce?
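The mapper is called once for every input record. Its job is to extract keys and values from the record: for each input it may emit any number of key-value pairs, including none. Each input record is handled independently, and the mapper keeps no state from one record to the next.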
What is the Reducer’s functionality in MapReduce?
The MapReduce framework takes the key-value pairs produced by the mappers, collects all the values belonging to the same key, and calls the reducer once per key with an iterator over that collection of values. The reducer can then produce output records from them.
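The mapper and reducer roles described above can be sketched in a single process. This is only an illustration, not how a real framework is implemented: `run_mapreduce`, `map_word_count`, and `reduce_word_count` are hypothetical names, and the in-memory dictionary stands in for the distributed grouping-by-key step.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_word_count(record: str) -> Iterator[tuple[str, int]]:
    """Mapper: called once per input record, emits a (word, 1) pair per word."""
    for word in record.lower().split():
        yield word, 1


def reduce_word_count(key: str, values: Iterable[int]) -> tuple[str, int]:
    """Reducer: receives one key and an iterator over all values for that key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    # Group every mapper output by key (a real framework does this across machines).
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Call the reducer once per key, in sorted key order.
    return [reducer(key, iter(values)) for key, values in sorted(groups.items())]


word_counts = run_mapreduce(
    ["the cat sat", "the dog sat down"], map_word_count, reduce_word_count
)
print(word_counts)
# [('cat', 1), ('dog', 1), ('down', 1), ('sat', 2), ('the', 2)]
```

Because each record is mapped independently and each key is reduced independently, both phases could be split across many workers.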
What is the key advantage of MapReduce?
Parallelism
What is a MapReduce workflow?
A chain of MapReduce jobs in which the output of one job is used as the input to the next.
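A small sketch of such a chain, assuming a hypothetical single-process helper `run_job` in place of a real framework. Job 1 counts views per URL; job 2 takes that output as its input and groups URLs by view count. The log format and all function names are made up for illustration.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Hypothetical single-process stand-in for one MapReduce job."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Job 1: count page views per URL.
def map_views(log_line):
    _user, url = log_line.split()
    yield url, 1


def reduce_views(url, ones):
    return url, sum(ones)


# Job 2: group URLs by their view count (input is job 1's output).
def map_by_count(url_and_count):
    url, count = url_and_count
    yield count, url


def reduce_by_count(count, urls):
    return count, sorted(urls)


log = ["alice /home", "bob /home", "alice /about", "carol /pricing"]
views_per_url = run_job(log, map_views, reduce_views)
urls_by_count = run_job(views_per_url, map_by_count, reduce_by_count)
print(urls_by_count)
# [(1, ['/about', '/pricing']), (2, ['/home'])]
```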
What are joins in a batch processing context?
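A join in a batch job resolves all occurrences of some association within a dataset, for example matching every activity event with its user's profile. The job processes the data for all users at once, rather than looking up one particular user.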
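One way to picture a batch join is a reduce-side join: two mappers tag their records with the join key, and the reducer sees every record for one key together. The sketch below is single-process and uses made-up data; `map_user`, `map_event`, and `reduce_join` are hypothetical names.

```python
from collections import defaultdict

users = [{"user_id": 1, "birth_year": 1990}, {"user_id": 2, "birth_year": 1985}]
events = [
    {"user_id": 1, "url": "/home"},
    {"user_id": 2, "url": "/about"},
    {"user_id": 1, "url": "/pricing"},
]


def map_user(user):
    # Key by user_id and tag the value so the reducer can tell the sources apart.
    yield user["user_id"], ("user", user["birth_year"])


def map_event(event):
    yield event["user_id"], ("event", event["url"])


def reduce_join(user_id, tagged_values):
    """Called once per user_id with that user's profile and all of their events."""
    birth_year = None
    urls = []
    for tag, value in tagged_values:
        if tag == "user":
            birth_year = value
        else:
            urls.append(value)
    return [(url, birth_year) for url in urls]


# Group both mappers' output by key, then reduce each key.
groups = defaultdict(list)
for record, mapper in [(u, map_user) for u in users] + [(e, map_event) for e in events]:
    for key, value in mapper(record):
        groups[key].append(value)

joined = []
for user_id in sorted(groups):
    joined.extend(reduce_join(user_id, groups[user_id]))
print(joined)
# [('/home', 1990), ('/pricing', 1990), ('/about', 1985)]
```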
What assumption is applied to batch processing?
It assumes the input data is bounded (of a known, finite size), so the job knows when it has finished reading its input.
What are data records in stream processing?
Data records are called events. Each is a small, self-contained, immutable object containing details of something that happened at some point in time.
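As a sketch, an event can be modelled as a frozen dataclass, which makes accidental mutation an error. `PageViewEvent` and its fields are hypothetical and chosen only for illustration.

```python
from dataclasses import FrozenInstanceError, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class PageViewEvent:
    """Small, self-contained record of something that happened at a point in time."""
    user_id: int
    url: str
    occurred_at: datetime


event = PageViewEvent(1, "/home", datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc))

try:
    event.url = "/other"  # attempts to change an event are rejected
    mutated = True
except FrozenInstanceError:
    mutated = False

print(mutated)  # False
```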
What is a messaging system?
A common approach for notifying consumers when new events appear. A producer sends a message containing the event, which is then pushed to consumers.
How is it handled when producers send messages faster than consumers can process them?
- Drop Messages
- Buffer messages in a queue
- Backpressure (flow control)
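The three options can be sketched with Python's standard `queue.Queue`, used here as a stand-in for the channel between producer and consumer. `send_drop` and `send_backpressure` are hypothetical helper names; the timeout exists only so the demonstration does not block forever.

```python
import queue


def send_drop(q: queue.Queue, message) -> bool:
    """Drop: if the queue is full, discard the message and carry on."""
    try:
        q.put_nowait(message)
        return True
    except queue.Full:
        return False


def send_backpressure(q: queue.Queue, message, timeout: float) -> bool:
    """Backpressure: block the producer until there is room (or the timeout expires)."""
    try:
        q.put(message, block=True, timeout=timeout)
        return True
    except queue.Full:
        return False


# Drop messages: a bounded queue of size 2 rejects the third message.
bounded = queue.Queue(maxsize=2)
drop_results = [send_drop(bounded, n) for n in range(3)]

# Buffer in a queue: an unbounded queue accepts everything, at the cost of memory.
unbounded = queue.Queue()
for n in range(1000):
    unbounded.put(n)

# Backpressure: the producer is held up while the queue is full,
# and proceeds once a consumer has taken a message.
blocked = not send_backpressure(bounded, 99, timeout=0.05)
bounded.get()
accepted = send_backpressure(bounded, 99, timeout=0.05)

print(drop_results, unbounded.qsize(), blocked, accepted)
# [True, True, False] 1000 True True
```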
What is a message broker?
A database that is optimised for message streams.
How is a message broker used?
Producers and consumers connect to it as clients. Producers write messages to the broker, and consumers receive them by reading them from the broker.
What is the advantage of a message broker?
Because the data is centralised in the broker, the system can more easily tolerate clients that come and go (connect, disconnect, and crash).
What are the key characteristics of Message Brokers?
- A message is automatically deleted once it has been successfully delivered to its consumers.
- Queues are assumed to be short, so the working set is small.
- Consumers can usually subscribe to a subset of topics.
What are the two patterns for delivering messages when multiple consumers read from the same topic?
- Load Balancing
- Fan-Out
What is Load Balancing?
Each message is delivered to one of the consumers. This allows consumers to share the work of processing the messages.
What is Fan-Out?
Each message is delivered to all of the consumers. This allows several independent consumers to subscribe to the same broadcast of messages.
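The difference between the two patterns can be sketched as follows. Round-robin assignment is an assumption made for the example (a broker may assign messages to consumers in other ways), and `load_balance` and `fan_out` are hypothetical names.

```python
from itertools import cycle


def load_balance(messages, consumers):
    """Each message goes to exactly one consumer (round-robin in this sketch)."""
    inboxes = {name: [] for name in consumers}
    for message, name in zip(messages, cycle(consumers)):
        inboxes[name].append(message)
    return inboxes


def fan_out(messages, consumers):
    """Each message goes to every consumer."""
    return {name: list(messages) for name in consumers}


messages = ["m1", "m2", "m3", "m4"]
balanced = load_balance(messages, ["a", "b"])
broadcast = fan_out(messages, ["a", "b"])

print(balanced)   # {'a': ['m1', 'm3'], 'b': ['m2', 'm4']}
print(broadcast)  # {'a': ['m1', 'm2', 'm3', 'm4'], 'b': ['m1', 'm2', 'm3', 'm4']}
```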
How is the receipt of messages checked?
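Through acknowledgements: a client must explicitly tell the broker when it has finished processing a message, so that the broker can remove it from the queue. If the broker does not receive an acknowledgement (for example, because the connection is closed or times out), it assumes the message was not processed and delivers it again to another consumer.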
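A toy in-memory model of acknowledgement and redelivery is sketched below. `TinyBroker` and its methods are hypothetical and do not correspond to any real broker's API; a real broker would detect a lost consumer through a closed connection or a timeout.

```python
from collections import deque


class TinyBroker:
    """Hypothetical broker that tracks which messages are delivered but unacknowledged."""

    def __init__(self, messages):
        self.pending = deque(messages)
        self.in_flight = {}  # message -> consumer currently processing it

    def deliver(self, consumer):
        """Hand the next pending message to a consumer and remember who has it."""
        if not self.pending:
            return None
        message = self.pending.popleft()
        self.in_flight[message] = consumer
        return message

    def ack(self, message):
        """The consumer has finished processing; the message can be forgotten."""
        del self.in_flight[message]

    def consumer_lost(self, consumer):
        """No acknowledgement will arrive: put the consumer's messages back for redelivery."""
        for message, owner in list(self.in_flight.items()):
            if owner == consumer:
                del self.in_flight[message]
                self.pending.appendleft(message)


broker = TinyBroker(["m1", "m2"])
first = broker.deliver("consumer-a")   # consumer-a receives m1
broker.consumer_lost("consumer-a")     # consumer-a crashes before acknowledging
retry = broker.deliver("consumer-b")   # m1 is delivered again, to consumer-b
broker.ack(retry)
second = broker.deliver("consumer-b")  # only now does m2 go out

print(first, retry, second)  # m1 m1 m2
```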