Kafka Streaming Flashcards
Spark vs Hadoop
Hadoop is good for running MapReduce on data in batches. In Hadoop, batch processing is done on data that has been stored over a period of time, so there is a lag in processing.
Spark, on the other hand, does real-time processing: data is processed as it is received, so there is no such lag. Spark can also run batch processing up to 100 times faster than Hadoop MapReduce. Spark builds on the MapReduce model and extends it to efficiently support more types of computations.
Some of Spark's features:
1) Polyglot - APIs available in multiple languages (Scala, Java, Python, R)
2) Speed
3) Powerful caching
4) Multiple formats - Spark supports multiple data sources such as Parquet, JSON, Hive, and Cassandra, apart from CSV, RDBMS, etc. The Data Source API provides a pluggable mechanism for accessing structured data through Spark SQL.
5) Lazy evaluation - Spark is lazy and computes only when needed, which is one reason for its speed. Transformations are added to a DAG of computation, and only when the driver requests some data does the DAG actually get evaluated (see the sketch below).
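A minimal sketch of lazy evaluation, assuming a local SparkSession named spark (the variable name is an assumption, not from the flashcards):

```scala
// Transformations only build up the DAG; nothing executes yet.
val numbers = spark.sparkContext.parallelize(1 to 1000000)
val squares = numbers.map(x => x.toLong * x)   // transformation, returns instantly
val evens   = squares.filter(_ % 2 == 0)       // transformation, still no work done

// The action is what forces the DAG to be evaluated.
println(evens.count())
```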
What are Spark Data sources?
The Data Source API provides a pluggable mechanism for accessing structured data through Spark SQL. It is used to read and store structured and semi-structured data in Spark SQL. Data sources can be more than just simple pipes that convert data and pull it into Spark.
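A hedged sketch of reading from a few different sources; the file paths and connection details below are placeholders, not real endpoints:

```scala
// Assumes a SparkSession named `spark`; all paths/credentials are hypothetical.
val users  = spark.read.json("/data/users.json")          // JSON
val events = spark.read.parquet("/data/events.parquet")   // Parquet
val csv    = spark.read.option("header", "true")
                       .csv("/data/accounts.csv")         // CSV

// An RDBMS via the JDBC data source (details are placeholders).
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/shop")
  .option("dbtable", "orders")
  .option("user", "reader")
  .load()
```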
What is RDD (Resilient Distributed Dataset)?
The RDD is the fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
It has certain characteristics:
1) A logical abstraction for a dataset distributed across the entire cluster, possibly divided into partitions
2) Resilient and immutable - an RDD can be recreated at any point in the execution cycle; each stage of the DAG is an RDD
3) Compile time safety
4) Unstructured/Structured data
5) Lazy - an action starts the execution, just like Java streams (see the sketch below)
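A small sketch of these characteristics, again assuming a SparkSession named spark:

```scala
// Immutable, partitioned, lazy: `base` is never modified, only new RDDs are derived.
val base = spark.sparkContext.parallelize(Seq("a", "b", "a", "c"), numSlices = 4)
println(base.getNumPartitions)            // logical partitions across the cluster

val pairs  = base.map(word => (word, 1))  // new RDD, `base` unchanged
val counts = pairs.reduceByKey(_ + _)     // another new RDD

counts.collect().foreach(println)         // action: only now does work happen
```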
What are DataFrames?
A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases or existing RDDs.
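A short DataFrame sketch; the column names and values are made up for illustration:

```scala
// Assumes a SparkSession named `spark`.
import spark.implicits._

val people = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")

people.printSchema()                 // named, typed columns
people.filter($"age" > 40).show()

// DataFrames can equally be built from files, Hive tables, JDBC sources,
// or existing RDDs, e.g. spark.read.parquet("/path/to/file.parquet").
```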
Apache Spark Components
1) Spark Core
2) Spark Streaming
3) Spark SQL
4) GraphX
5) MLlib (Machine Learning)
What is Spark Core?
Spark Core is the base engine for large-scale parallel and distributed data processing. The core is the distributed execution engine and the Java, Scala, and Python APIs offer a platform for distributed ETL application development. Further, additional libraries which are built atop the core allow diverse workloads for streaming, SQL, and machine learning. It is responsible for:
1) Memory management and fault recovery
2) Scheduling, distributing and monitoring jobs on a cluster
3) Interacting with storage systems
What is Spark Streaming?
Spark Streaming is the component of Spark used to process real-time streaming data, and is thus a useful addition to the core Spark API. It enables high-throughput and fault-tolerant stream processing of live data streams. The fundamental stream unit is the DStream, which is basically a series of RDDs (Resilient Distributed Datasets) used to process the real-time data.
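A minimal DStream word count, assuming a plain text source on localhost:9999 (for example started with nc -lk 9999); the host, port, and batch interval are assumptions:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumes an existing SparkSession named `spark`.
val ssc   = new StreamingContext(spark.sparkContext, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)

lines.flatMap(_.split(" "))
     .map(word => (word, 1))
     .reduceByKey(_ + _)
     .print()              // each micro-batch is, under the hood, a series of RDDs

ssc.start()
ssc.awaitTermination()
```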
What is Spark SQL?
Spark SQL is a module in Spark that integrates relational processing with Spark’s functional programming API. It supports querying data either via SQL or via the Hive Query Language. For those familiar with RDBMSs, Spark SQL is an easy transition from earlier tools, while extending the boundaries of traditional relational data processing.
Spark SQL integrates relational processing with Spark’s functional programming. Further, it provides support for various data sources and makes it possible to weave SQL queries with code transformations thus resulting in a very powerful tool.
The following are the four libraries of Spark SQL.
Data Source API
DataFrame API
Interpreter & Optimizer
SQL Service
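To illustrate weaving SQL queries with code transformations, a minimal sketch (the table and column names are made up):

```scala
// Assumes a SparkSession named `spark`.
import spark.implicits._

val sales = Seq(("US", 100.0), ("DE", 80.0), ("US", 50.0)).toDF("country", "amount")
sales.createOrReplaceTempView("sales")

// Query via SQL...
val totals = spark.sql(
  "SELECT country, SUM(amount) AS total FROM sales GROUP BY country")

// ...and keep transforming the result with the DataFrame API.
totals.filter($"total" > 90).show()
```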
Stream vs Table
A stream is an unbounded collection of facts; facts are immutable things that keep coming at you.
A table is a collection of evolving facts. It represents the latest fact about something.
Let's say I am using a music streaming service: I create a profile and set my current location to X. That becomes an event in the profile-update event stream. Some time later I move, stream a song from location Y, and change my current location to Y. That is also an event in the profile-update event stream. In the table, however, since it is a collection of evolving facts, the new profile-update event overrides the previous one, so the table holds only the latest location, i.e. location Y (see the sketch below).
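A plain-Scala sketch of the idea, deliberately not tied to any streaming API: the stream is the full sequence of profile-update events, while the table is a fold that keeps only the latest fact per key. The case class and field names are made up:

```scala
case class ProfileUpdate(user: String, location: String)

// The stream: every event, in order (unbounded in reality, finite here).
val stream = Seq(
  ProfileUpdate("me", "X"),
  ProfileUpdate("me", "Y")
)

// The table: later events for the same key override earlier ones.
val table: Map[String, String] =
  stream.foldLeft(Map.empty[String, String]) { (acc, event) =>
    acc + (event.user -> event.location)
  }

println(table("me"))   // "Y" -- only the latest location survives in the table
```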
What is the primary unit for working with data in Spark?
The RDD (Resilient Distributed Dataset). It is immutable: any transformation you apply produces a new RDD in memory, which stays immutable and is used in the next stage of the pipeline. This in-memory design is what makes Spark faster than Hadoop MapReduce.
Types of nodes in Spark
1) Driver - the brain or orchestrator, where the query plan is generated
2) Executor - there can be 100s or 1000s of executors / workers
3) Shuffle service
In what ways do Spark nodes communicate?
1) Broadcast - used when you have information that is small enough, say 1 GB or less. A broadcast takes information you have on the driver and puts it on all nodes (see the sketch after this list).
2) Take - brings information the executors have back to the driver, for example to print it or write it to a file. The problem is that there are many executors and only one driver, so the driver can run out of memory or take too long to process that data.
3) DAG action (Directed Acyclic Graph) - a DAG is a set of instructions describing what should be done and in what order. DAGs are passed from the driver to the executors and are small, so this is a very inexpensive operation.
4) Shuffle - needed for a join, group by, or reduce. As in an RDBMS, if you want to join two large tables you need to partition and order the data so that the two sides can find their matches. In a distributed shuffle, mappers write to multiple shuffle temp locations, broken into partitions, and on the reduce side each reducer reads its partition from every mapper. A lot of communication happens (mappers x reducers), so shuffle is the most expensive operation.
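A small sketch of broadcast and take, assuming a SparkSession named spark; the lookup map is a made-up example:

```scala
// Broadcast: ship a small lookup map (well under a GB) from the driver to every executor.
val countryNames = Map("US" -> "United States", "DE" -> "Germany")
val bcNames      = spark.sparkContext.broadcast(countryNames)

val codes    = spark.sparkContext.parallelize(Seq("US", "DE", "US"))
val resolved = codes.map(code => bcNames.value.getOrElse(code, "unknown"))

// Take: pull only a small sample back to the single driver.
resolved.take(3).foreach(println)
```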
What makes Spark, Spark?
1) RDDs
2) DAGs
3) FlumeJava APIs
4) Long Lived Jobs
What is DAG?
A DAG represents the steps that need to happen to get from point A to point B. For example, you join two tables, group by at the end, and then do some aggregation; that can be a multi-step DAG. The main thing Spark provides is that the DAG lives in the engine.
DAG contains:
1) Source
2) RDDs - which are formed after each transformation
3) Transformations - the edges of the DAG
4) Actions (end nodes) - like foreach, print, count, take
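The source, intermediate RDDs, transformations, and final action can be seen in the lineage Spark records; a small sketch assuming a SparkSession named spark:

```scala
val words  = spark.sparkContext.parallelize(Seq("a b", "b c", "c a"))  // source
val counts = words
  .flatMap(_.split(" "))      // transformation (edge)
  .map(word => (word, 1))     // transformation (edge)
  .reduceByKey(_ + _)         // transformation that introduces a shuffle stage

println(counts.toDebugString) // prints the DAG / lineage of this RDD
counts.collect()              // action (end node) that triggers execution
```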
Examples of Actions
Actions are what start the processing, since Spark is lazy.
1) foreach
2) reduce
3) take
4) count
Examples of transformation
1) Map
2) Reduce By Key
3) Group By Key
4) Join By Key
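A short sketch exercising the listed transformations and actions together (the data is made up, and the SparkSession name spark is an assumption):

```scala
val purchases = spark.sparkContext.parallelize(
  Seq(("bob", 10.0), ("ann", 5.0), ("bob", 2.5)))
val names = spark.sparkContext.parallelize(
  Seq(("bob", "Bob Smith"), ("ann", "Ann Lee")))

val totals  = purchases.reduceByKey(_ + _)   // transformation: reduce by key
val grouped = purchases.groupByKey()         // transformation: group by key
val joined  = totals.join(names)             // transformation: join by key

println(joined.count())                      // action: count
joined.take(2).foreach(println)              // actions: take, foreach
```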
Why is DAG important?
Hint: Think failures
Suppose there is an error while computing a particular stage. Spark can then grab the result of the previous stage, recompute only the part that failed, and continue with the next stages.
What is the FlumeJava API?
The idea of the API was to let us write distributed code the way we write non-distributed code: regular code that runs on a local machine, and the same code, when put on a cluster, runs in distributed mode.
What are some issues with distributed and parallel jobs?
Hint: Work division
1) Too much or too little parallelism - with too much, the cost of dividing the work (before and after) adds up and exceeds the execution cost; with too little, we are not properly utilizing the available resources.
2) Work skew - one process, thread, core, or machine does much more work while the others sit idle, so the next step gets delayed.
3) Cartesian join - happens when you join two tables with a many-to-many relationship. Let's say key A has 1000 rows in table A and 1000 rows in table B; joining them yields 1000 * 1000 = 1M records.
Which is quick way to fix work skew?
A quick way to fix skew is the classic hash-then-mod trick: hash the key and take it modulo the number of partitions to spread the load.
To improve the distribution further, you add salt.
Salt is nothing but a small random component mixed into the key so that a single hot key gets spread across many partitions (see the sketch below).
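A hedged sketch of salting a hot key, assuming a SparkSession named spark and an additive aggregation (the fan-out N and the key format are arbitrary choices):

```scala
import scala.util.Random

val skewedPairs = spark.sparkContext.parallelize(
  Seq.fill(100000)(("hotKey", 1L)) ++ Seq(("rareKey", 1L)))

val N = 8   // salt fan-out: how many sub-keys one hot key is split into

// Add a random salt so one hot key becomes N different keys.
val salted = skewedPairs.map { case (key, value) =>
  (s"${key}_${Random.nextInt(N)}", value)
}

// First aggregate per salted key (work is spread out), then strip the salt
// and aggregate again to get the final per-key totals.
// (Assumes original keys contain no '_'.)
val partial = salted.reduceByKey(_ + _)
val totals  = partial
  .map { case (saltedKey, value) => (saltedKey.split("_")(0), value) }
  .reduceByKey(_ + _)

totals.collect().foreach(println)
```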
How to fix Cartesian Joins?
1) Nested structures - for example, if you have Bob in table A and his three cat purchases in table B, joining the two tables leaves you with three rows repeating Bob. Instead, you can use a nested structure: a single Bob row whose purchases field holds multiple entries. This became popular with Hive and Hadoop when joins became expensive (see the sketch after this list).
2) Windowing
3) Reduce By Key
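A sketch of the nested-structure approach using collect_list; the column names and data are illustrative:

```scala
// Assumes a SparkSession named `spark`.
import org.apache.spark.sql.functions.collect_list
import spark.implicits._

val purchases = Seq(
  ("bob", "cat food"),
  ("bob", "cat toy"),
  ("bob", "cat bed")
).toDF("name", "item")

// One row per person, with all items folded into a single array column,
// instead of repeating the person on every joined row.
val nested = purchases
  .groupBy("name")
  .agg(collect_list("item").as("items"))

nested.show(truncate = false)   // bob | [cat food, cat toy, cat bed]
```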
Why use RDDs?
1) Control and flexibility - low level so offers more control
2) Low level API
3) Compile time safety
4) Encourages “how to” rather than “what to”.
What are the problems with RDDs?
Hint: performance optimization
Because we are telling Spark exactly how to do things rather than what to do, Spark cannot optimize for us. The lambda functions we provide are hidden/opaque to Spark; it cannot see inside them.
The solution is to use Datasets and DataFrames.
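A contrast sketch of the same logic expressed both ways: the RDD lambdas are opaque to Spark, while the equivalent DataFrame expressions can be analyzed and optimized by the engine (the column names and data are made up):

```scala
// Assumes a SparkSession named `spark`.
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 5), ("c", 10)))

// "How to": Spark only sees arbitrary functions and must run them as-is.
val rddResult = rdd.filter { case (_, v) => v > 3 }
                   .map { case (k, v) => (k, v * 2) }

// "What to": declarative expressions Spark can inspect, reorder, and push down.
val df       = rdd.toDF("key", "value")
val dfResult = df.filter($"value" > 3)
                 .select($"key", ($"value" * 2).as("doubled"))

dfResult.explain()   // shows the optimized plan Spark derived for us
```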