Kafka Streaming Flashcards

1
Q

Spark vs Hadoop

A

Hadoop is good for doing MapReduce on data in batches. In Hadoop, batch processing is done on data that has been stored over a period of time, so there is a lag in processing.

Spark, on the other hand, does real-time processing: the data is processed as it is received, so there is little to no lag. Spark can also do batch processing up to 100 times faster than Hadoop MapReduce. Spark builds on the MapReduce model and extends it to efficiently support more types of computations.

2
Q

Some of Spark features

A

1) Polyglot programming - many language implementations available
2) Speed
3) Powerful caching
4) Multiple Formats - Spark provides multiple data sources such as Parquet, JSON, Hive, Cassandra apart from CSV, RDBMS etc. Data source API provides pluggable mechanism to access structured data through Spark SQL.
5) Lazy Evaluation - Spark is lazy and does computation only when it is needed; this is one reason for its speed. Transformations are added to a DAG of computation, and only when the driver requests some data does the DAG actually get evaluated (see the sketch below).
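
A minimal Java sketch of lazy evaluation (variable names and data are illustrative):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("lazy-demo").setMaster("local[*]"));
JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

// Transformations: only recorded in the DAG, nothing is computed yet
JavaRDD<Integer> doubled = numbers.map(x -> x * 2);
JavaRDD<Integer> big = doubled.filter(x -> x > 4);

// Action: the driver asks for a result, so the DAG is evaluated now
System.out.println(big.count());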

3
Q

What are Spark Data sources?

A

The Data Source API provides a pluggable mechanism for accessing structured data through Spark SQL. It is used to read and store structured and semi-structured data into Spark SQL. Data sources can be more than just simple pipes that convert data and pull it into Spark.

4
Q

What is RDD (Resilient Distributed Dataset)?

A

RDD is the fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
It has certain characteristics:
1) Logical abstraction for a dataset distributed across the entire cluster, divided into partitions (see the sketch after this list)
2) Resilient and Immutable - an RDD can be recreated at any point in the execution cycle. Each stage of the DAG is an RDD.
3) Compile-time safety
4) Works with unstructured and structured data
5) Lazy - an action starts the execution, much like Java streams
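
A small illustrative snippet (reusing the sc JavaSparkContext from the lazy-evaluation sketch above): an RDD created with an explicit number of partitions, which Spark can compute on different nodes; transformations return a new immutable RDD.

// Create an RDD with 4 logical partitions; each partition can be processed on a different executor
JavaRDD<String> letters = sc.parallelize(Arrays.asList("a", "b", "c", "d", "e", "f"), 4);
System.out.println(letters.getNumPartitions()); // 4

// Transformations do not modify the RDD; they return a new immutable one
JavaRDD<String> upper = letters.map(String::toUpperCase);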

5
Q

What are DataFrames?

A

A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases or existing RDDs.
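
A short sketch of constructing and querying a DataFrame (the file path people.json and the columns name/age are assumptions for illustration):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("df-demo").master("local[*]").getOrCreate();

// Construct a DataFrame from a structured data file
Dataset<Row> people = spark.read().json("people.json");

people.printSchema();                           // named columns, like a relational table
people.select("name", "age").show();
people.filter(people.col("age").gt(21)).show();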

6
Q

Apache Spark Components

A
Spark Core
Spark Streaming
Spark SQL
GraphX
MLlib (Machine Learning)
7
Q

What is Spark Core?

A

Spark Core is the base engine for large-scale parallel and distributed data processing. The core is the distributed execution engine and the Java, Scala, and Python APIs offer a platform for distributed ETL application development. Further, additional libraries which are built atop the core allow diverse workloads for streaming, SQL, and machine learning. It is responsible for:

1) Memory management and fault recovery
2) Scheduling, distributing and monitoring jobs on a cluster
3) Interacting with storage systems

8
Q

What is Spark Streaming?

A

Spark Streaming is the component of Spark used to process real-time streaming data. Thus, it is a useful addition to the core Spark API. It enables high-throughput and fault-tolerant stream processing of live data streams. The fundamental stream unit is the DStream, which is basically a series of RDDs (Resilient Distributed Datasets) used to process the real-time data.
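
A minimal DStream sketch along the lines of the classic network word count (host, port, and batch interval are illustrative; conf is a SparkConf as in the earlier sketches):

import java.util.Arrays;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.*;

JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

// Each batch interval produces one RDD in the DStream
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
words.countByValue().print();

jssc.start();
jssc.awaitTermination();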

9
Q

What is Spark SQL?

A

Spark SQL is a new module in Spark which integrates relational processing with Spark’s functional programming API. It supports querying data either via SQL or via the Hive Query Language. For those of you familiar with RDBMS, Spark SQL will be an easy transition from your earlier tools where you can extend the boundaries of traditional relational data processing.

Spark SQL integrates relational processing with Spark’s functional programming. Further, it provides support for various data sources and makes it possible to weave SQL queries with code transformations thus resulting in a very powerful tool.

The following are the four libraries of Spark SQL.

Data Source API
DataFrame API
Interpreter & Optimizer
SQL Service
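
A short sketch of weaving SQL queries into code (reuses the illustrative people DataFrame and spark session from the DataFrame card):

// Register the DataFrame as a temporary view, then query it with SQL
people.createOrReplaceTempView("people");

Dataset<Row> adults = spark.sql("SELECT name, age FROM people WHERE age > 21");
adults.show();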

10
Q

Stream vs Table

A

A stream is an unbounded collection of facts; facts are immutable things that keep coming at you.

A table is a collection of evolving facts. It kind of represents the latest fact about something.

Let's say I am using a music streaming service: I create a profile and set my current location to X. That becomes an event in the profile-update event stream. Some time later I move, stream a song from location Y, and change my current location to Y. That is also an event in the profile-update event stream. In the table, however, since it is a collection of evolving facts, the new profile-update event overrides the previous one, so the table holds only the latest location, i.e. location Y.
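
In Kafka Streams terms this duality looks roughly like the sketch below (newer StreamsBuilder API; the topic name is illustrative):

StreamsBuilder builder = new StreamsBuilder();

// Stream: every profile-update event is kept, so location X and location Y both appear
KStream<String, String> updates = builder.stream("profile-updates");

// Table: a later event for the same key overrides the earlier one, so only location Y remains
KTable<String, String> latestProfile = builder.table("profile-updates");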

11
Q

Which is the primary unit of working with data in Spark?

A

RDD (Resilient Distributed Dataset) - it is immutable. Any transformation you apply produces a new RDD in memory, which stays immutable and is then used in the next stage of the pipeline. This in-memory design is what makes Spark faster than Hadoop MapReduce.

12
Q

Types of nodes in Spark

A

1) Driver - the brain or orchestrator, where the query plan is generated
2) Executors - there can be hundreds or thousands of executors / workers
3) Shuffle service

13
Q

In what ways do Spark nodes communicate?

A

1) Broadcast - used when you have information that is small enough, say 1 GB or less. Broadcast takes information that lives on the driver and puts a copy of it on all nodes. (A sketch of broadcast and take follows after this list.)
2) Take - the executors have information and you want to bring it to the driver, for example to print it or write it to a file. The problem is that there are many, many executors and only one driver, so the driver can run out of memory or take too long to process that data.
3) DAG Action (Directed Acyclic Graph) - DAGs are instructions describing what should be done and in what order. They are passed from the driver to the executors and are small, so this is a very inexpensive operation.
4) Shuffle - needed when you want to do a join, a group by, or a reduce. As in an RDBMS, if you join two large tables you need to partition and order the data so that the two sides can find their matches. In a distributed shuffle, mappers write to multiple shuffle temp locations, which are broken into partitions, and on the reduce side each reducer reads its partition from every mapper. A lot of communication happens (mappers * reducers), so shuffle is the most expensive operation.
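
A rough Java sketch of the two cheaper exchanges, broadcast and take (events is an existing JavaRDD<String>; the lookup data is illustrative; uses org.apache.spark.broadcast.Broadcast and java.util.*):

// Broadcast: small reference data is shipped from the driver to every executor once
Broadcast<Map<String, String>> countryNames = sc.broadcast(Collections.singletonMap("IN", "India"));
JavaRDD<String> resolved = events.map(code -> countryNames.value().getOrDefault(code, "unknown"));

// Take: pull a small sample of executor data back to the single driver
List<String> sample = resolved.take(10);
sample.forEach(System.out::println);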

14
Q

What makes Spark, Spark?

A

1) RDDs
2) DAGs
3) FlumeJava APIs
4) Long Lived Jobs

15
Q

What is DAG?

A

A DAG represents the steps that need to happen to get from point A to point B. For example, you need to join two tables, group by at the end, and then do some aggregation; that can be a multi-step DAG. The main thing Spark provides is that the DAG lives in the engine.
A DAG contains:
1) Source
2) RDDs - which are formed after each transformation
3) Transformations - the edges of the DAG
4) Actions (end nodes) - like foreach, print, count, take (see the lineage sketch after this list)
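
A small way to inspect the DAG/lineage Spark has recorded (file name illustrative; sc as before): toDebugString() prints the chain of RDDs that an action would execute.

JavaRDD<String> lines = sc.textFile("events.log");                 // source
JavaRDD<String> errors = lines.filter(l -> l.contains("ERROR"));   // transformation -> new RDD
JavaRDD<Integer> lengths = errors.map(String::length);             // transformation -> new RDD

System.out.println(lengths.toDebugString()); // shows the lineage (the DAG)
System.out.println(lengths.count());         // action (end node) triggers execution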

16
Q

Examples of Actions

A

Actions are the operations that start processing, since Spark is lazy

1) foreach
2) reduce
3) take
4) count

17
Q

Examples of transformation

A

1) Map
2) Reduce By Key
3) Group By Key
4) Join By Key (see the pair-RDD sketch below)
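
A brief pair-RDD sketch of these key-based transformations (words is an illustrative JavaRDD<String> of tokens; uses scala.Tuple2):

JavaPairRDD<String, Integer> pairs = words.mapToPair(w -> new Tuple2<>(w, 1));   // map to (key, value)

JavaPairRDD<String, Integer> counts = pairs.reduceByKey(Integer::sum);           // reduce by key
JavaPairRDD<String, Iterable<Integer>> grouped = pairs.groupByKey();             // group by key
JavaPairRDD<String, Tuple2<Integer, Integer>> joined = counts.join(counts);      // join by key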

18
Q

Why is DAG important?

Hint: Think failures

A

Suppose there is an error while computing a particular stage. Spark will say: OK, there was an error in this stage, let me grab the result of the previous stage, recompute where I failed, and then continue with the next stages.

19
Q

What is the FlumeJava API?

A

The idea of the API was to let us write distributed code the way we write non-distributed code: regular code that runs on a local machine, and the same code, when put on a cluster, runs in distributed mode.

20
Q

What are some issues with distributed and parallel jobs?

Hint: Work division

A

1) Too much or too little parallelism - if work is divided too much, the cost (setup and teardown) of dividing adds up and exceeds the execution cost; if divided too little, we are not properly utilizing the resources.
2) Work skew - one process, thread, core, or machine is doing much more work while the others are idle, so the next step gets delayed because of it.
3) Cartesian join - happens when you join two tables with a many-to-many relation. Say key A has 1,000 rows in table A and 1,000 rows in table B; joining A x B gives 1,000 * 1,000 = 1M records.

21
Q

What is a quick way to fix work skew?

A

A quick way to fix skew is one magic function: hash-mod (# -> %).
You add a salt to the key to improve distribution.
Salt is nothing but a little bit of "dirt" added to the key so that one hot key gets spread over several partitions (see the sketch below).
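
A rough sketch of salting (the salt count of 10, the "#" separator, and the pairs JavaPairRDD are illustrative choices): the hot key is spread over several salted keys, partially aggregated, then combined again after the salt is stripped.

int salts = 10;

// Append a random salt so "hotKey" becomes hotKey#0 .. hotKey#9 and spreads across partitions
JavaPairRDD<String, Integer> salted = pairs.mapToPair(kv ->
    new Tuple2<>(kv._1() + "#" + ThreadLocalRandom.current().nextInt(salts), kv._2()));

// Partial aggregation on the salted keys (the skew is now split up)
JavaPairRDD<String, Integer> partial = salted.reduceByKey(Integer::sum);

// Strip the salt and combine the partial results
JavaPairRDD<String, Integer> total = partial
    .mapToPair(kv -> new Tuple2<>(kv._1().substring(0, kv._1().lastIndexOf('#')), kv._2()))
    .reduceByKey(Integer::sum);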

22
Q

How to fix Cartesian Joins?

A

1) Nested structures - if you have Bob in table A and his purchases in table B, joining both tables leaves you with three rows of Bob and his cat purchases. Instead, you can use a nested structure: a single row for Bob with a field that holds multiple purchase entries. This became popular with Hive and Hadoop when joins became expensive (see the sketch after this list).
2) Windowing
3) Reduce By Key
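
A tiny sketch of the nested-structure idea (data illustrative; sc as before): instead of one row per (user, purchase) pair, the purchases are nested under the user.

// Flat output would repeat "Bob" once per purchase
JavaPairRDD<String, String> purchases = sc.parallelizePairs(Arrays.asList(
    new Tuple2<>("Bob", "cat food"),
    new Tuple2<>("Bob", "cat toy"),
    new Tuple2<>("Bob", "cat litter")));

// Nested structure: a single row for Bob, with all purchases grouped under it
JavaPairRDD<String, Iterable<String>> nested = purchases.groupByKey();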

23
Q

Why use RDDs?

A

1) Control and flexibility - low level so offers more control
2) Low level API
3) Compile time safety
4) Encourages “how to” rather than “what to”.

24
Q

What are the problems with RDDs?

Hint: performance optimization

A

Because we are telling Spark exactly how to do things and not what to do, Spark cannot optimize for us. The lambda functions that we provide are hidden/opaque to Spark; it cannot see inside them.
The solution is to use Datasets and DataFrames (a brief contrast follows below).
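
A brief contrast (reusing the illustrative people DataFrame from the earlier card): the RDD lambda is a black box Spark cannot look into, while the equivalent DataFrame expression is visible to the Catalyst optimizer.

// RDD style: the predicate is opaque, Spark cannot push it down or reorder it
JavaRDD<Row> filteredRdd = people.javaRDD().filter(row -> row.<Long>getAs("age") > 21);

// DataFrame style: the expression is declarative, so Catalyst can optimize the plan
Dataset<Row> filteredDf = people.filter(people.col("age").gt(21));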

25
Q

Structured APIs in Spark and their error analysis matrix

A

                  SQL     | DataFrames   | Datasets
Syntax errors     Runtime | Compile time | Compile time
Analysis errors   Runtime | Runtime      | Compile time

26
Q

After 2016, Spark DataFrames + Datasets = ?

A

Datasets.

Features of both were combined into one.

27
Q

Why use Structured APIs?

A

Firstly, they are declarative, so the code reads better; and secondly, we are telling Spark what to do and letting Spark figure out the optimizations for that particular pipeline.

28
Q

In how many modes can Spark run?

A

Spark can be run in three modes

1) Standalone mode - it runs on top of Hadoop. For computations, Spark and Hadoop run in parallel for the Spark jobs submitted to the cluster.
2) Hadoop YARN/Mesos - Apache Spark runs on Mesos or YARN (Yet Another Resource Negotiator, one of the key features of second-generation Hadoop) without any root access or pre-installation. It integrates Spark on top of an already present Hadoop stack.
3) SIMR (Spark in MapReduce) - an add-on to the standalone deployment where Spark jobs can be launched by the user and the Spark shell can be used without any administrator access.

29
Q

Which are the different programming paradigms?

A

1) Request and Response
2) Batch processing
3) Stream processing

30
Q

Which are some key concepts unique to stream processing?

A

1) Time
2) State
3) Stream-Table duality
4) Time Windows

31
Q

How many different types of times are possible in stream processing?
Hint: Processing time etc

A

The notion of time helps in finding choke points or bottlenecks.

1) Event time - When did the event occur
2) Log append time - This is the time that the event was received by Kafka and was written to log file
3) Processing time - This is the time at which stream-processing application received the event in order to perform some calculation.

32
Q

Concept of state in stream processing?

A

Stream processing becomes interesting when we have operations that involve multiple events like counting the number of events by type, sum, average, joining two streams to create an enriched stream etc.
To be able to do that, we need to preserve some information in the form of "state".

33
Q

Where should the state of a stream-processing application be stored?

A

Storing the state in a local in-memory hash map is usually bad practice, because if the stream-processing application crashes, the state is lost and the results change. So we should store it reliably in some DB.

34
Q

What is the meaning of Stream-Table duality?

A

Stream-Table duality means that the stream can be viewed as a table and vice versa.

A stream can be considered a changelog of a table, where each record in the stream captures a state change in the table.
A stream is thus a table in disguise, and it can easily be turned into a table by replaying the changelog from beginning to end.

A table is a snapshot of the latest value for each key in a stream.
A table is thus a stream in disguise, and it can easily be turned into a stream by iterating over all the records.

35
Q

What are time windows in stream processing?

A

The stream-processing engine needs to divide the data records into time buckets, i.e. to window the stream by time. We may need this for join and aggregation operations.

1) Hopping time windows
2) Tumbling time windows

36
Q

Which windowing types are supported in Kafka Streams?

A

1) Hopping time windows
Windows based on time intervals
They model fixed-size (possibly overlapping) windows
Defined by two properties: the window's size and its advance interval, or "hop"
The advance interval defines how much the window moves relative to the previous one
Hopping windows can overlap, so a data record may belong to more than one window
2) Tumbling time windows
A special case of hopping time windows
They model fixed-size, non-overlapping, gapless windows
Defined by a single property: the window size
A tumbling window is a hopping window whose window size is equal to its advance interval
Tumbling windows never overlap; one record will only ever be present in one window
(A windowed-count sketch follows below.)
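
A hedged sketch with the Kafka Streams DSL (Kafka 2.x-style API; topic name and window sizes are illustrative; builder is a StreamsBuilder):

KStream<String, String> clicks = builder.stream("clicks");

// Hopping window: size 5 minutes, advance (hop) 1 minute -> windows overlap
KTable<Windowed<String>, Long> hoppingCounts = clicks
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(5)).advanceBy(Duration.ofMinutes(1)))
    .count();

// Tumbling window: no advanceBy, so size == hop and windows never overlap
KTable<Windowed<String>, Long> tumblingCounts = clicks
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
    .count();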

37
Q

What pipeline design can be maintained to find the top 10 trades or words?

A

Step 1 - Map and partition the data so that it can be processed in parallel on different processors
Step 2 - Process the data on the processor side to maintain local state
Step 3 - Write the results to a new topic with a single partition so that it is consumed by only one consumer
Step 4 - That single topic partition will then be processed by a single consumer that calculates the top 10
Step 5 - Calculate the top 10 for some window and then write it to a Top 10 topic

38
Q

When do we need stream-table join?

A

Sometimes stream processing will require joining with an external database for various purposes, such as validating the event against business rules stored in the database, or enriching the event with data about the users who clicked it.
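
A minimal stream-table join sketch (topic names and value types are illustrative; builder is a StreamsBuilder with default String serdes):

KStream<String, String> clicks = builder.stream("clicks");
KTable<String, String> profiles = builder.table("user-profiles");

// For each click, look up the latest profile for the same key and enrich the event
KStream<String, String> enriched = clicks.join(profiles, (click, profile) -> click + " by " + profile);

enriched.to("enriched-clicks");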

39
Q

How to join Stream-Stream?

A

When we join two streams we are joining the entire history, trying to match events from one stream to events in the other stream that have the same keys and happened in the same time windows. This is why a streaming join is also called a "windowed join".

For example, if we have a search stream and a click stream, we assume that a result is clicked a few seconds after the query was entered into our search engine. So we keep a few-seconds-long window on each stream and match the results from each window.
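
A rough windowed stream-stream join sketch (Kafka 2.x-style JoinWindows; topics and the 5-second window are illustrative; builder as before):

KStream<String, String> searches = builder.stream("searches");
KStream<String, String> clicks = builder.stream("clicks");

// Match searches and clicks with the same key that happen within 5 seconds of each other
KStream<String, String> searchToClick = searches.join(
    clicks,
    (query, clickedResult) -> query + " -> " + clickedResult,
    JoinWindows.of(Duration.ofSeconds(5)));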

40
Q

How can we handle out-of-sequence events, and when do they occur?

A

Out-of-sequence events can occur in IoT-type scenarios, e.g. a device loses its WiFi connection for some time and sends a few hours' worth of events when it reconnects. This also happened in PBSS CDR systems.

The application has to:
Recognize that the event is out of sequence
Define a time period during which it will attempt to reconcile out-of-sequence events
Have an in-band capability to reconcile the event
Be able to update results

41
Q

What is Kafka Streams scaling model?

A

It allows scaling by allowing multiple threads within one instance of the application and by supporting load balancing between distributed instances of the application.
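
A small illustrative config: several stream threads inside one instance, while starting more instances with the same application.id gets partitions (tasks) rebalanced between them.

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordCount");      // same id on every instance
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);            // 4 threads in this instance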

42
Q

How does Kafka Streams split the topology for parallel execution?

A

It splits the topology into tasks. One partition is processed as one task.

43
Q

Which class is used to create a Kafka Streams topology? Sample API code for a Kafka Streams topology?

A

The KStreamBuilder class is used to create the topology.

1) Create properties
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordCount");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

SERDE stands for Serializer/Deserializer.

2) Create the topology using KStreamBuilder and KStream
Pattern pattern = Pattern.compile("\\W+");

KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> stream = builder.stream("word-input");

KStream<String, String> counts = stream
    .flatMapValues(value -> Arrays.asList(pattern.split(value.toLowerCase())))
    .map((key, value) -> new KeyValue<>(value, value))
    .filter((key, value) -> !value.equals("the"))
    .groupByKey()
    .count("CountStore")
    .mapValues(value -> Long.toString(value))
    .toStream();

counts.to("wordcount-output");

3) Start the stream processing engine
KafkaStreams streams = new KafkaStreams(builder, props);
streams.start();

// The application should run indefinitely, but for the demo's sake we close it after a while
streams.close();
44
Q

What types of stream-processing applications are there, in terms of choosing a streaming framework?

A

1) Ingest - where the goal is to convert data and send it from one system to another
2) Low-millisecond actions - any application that requires immediate action; fraud detection falls into this category
3) Asynchronous microservices - these microservices perform a simple action on behalf of a larger business process, such as updating the inventory of a store
4) Near real-time data analytics - these streaming applications perform complex aggregations and joins in order to slice the data and generate interesting, business-relevant insights.