Kafka Streaming Flashcards
Spark vs Hadoop
Hadoop is good for running MapReduce on data in batches. In Hadoop, batch processing is done on data that has been stored over a period of time, so there is a lag in processing.
Spark, on the other hand, does real-time processing: data is processed as it is received, so there is no such lag. Spark can also run batch processing up to 100 times faster than Hadoop MapReduce. Spark builds on the MapReduce model and extends it to efficiently support more types of computations.
Some of Spark's features:
1) Polyglot - APIs available in multiple languages (Scala, Java, Python, R)
2) Speed
3) Powerful caching
4) Multiple formats - Spark supports multiple data sources such as Parquet, JSON, Hive, and Cassandra, apart from CSV, RDBMS, etc. The Data Source API provides a pluggable mechanism for accessing structured data through Spark SQL.
5) Lazy evaluation - Spark is lazy and computes only when needed, which is one reason for its speed. Transformations are added to a DAG of computation, and only when the driver requests some data does the DAG actually get evaluated (see the sketch below).
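A minimal sketch of lazy evaluation, assuming a local SparkSession named spark (the variable name is an assumption, not from the flashcards):

```scala
// Transformations only build up the DAG; nothing executes yet.
val numbers = spark.sparkContext.parallelize(1 to 1000000)
val squares = numbers.map(x => x.toLong * x)   // transformation, returns instantly
val evens   = squares.filter(_ % 2 == 0)       // transformation, still no work done

// The action is what forces the DAG to be evaluated.
println(evens.count())
```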
What are Spark Data sources?
The Data Source API provides a pluggable mechanism for accessing structured data through Spark SQL. It is used to read and store structured and semi-structured data in Spark SQL. Data sources can be more than just simple pipes that convert data and pull it into Spark.
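A hedged sketch of reading from a few different sources; the file paths and connection details below are placeholders, not real endpoints:

```scala
// Assumes a SparkSession named `spark`; all paths/credentials are hypothetical.
val users  = spark.read.json("/data/users.json")          // JSON
val events = spark.read.parquet("/data/events.parquet")   // Parquet
val csv    = spark.read.option("header", "true")
                       .csv("/data/accounts.csv")         // CSV

// An RDBMS via the JDBC data source (details are placeholders).
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/shop")
  .option("dbtable", "orders")
  .option("user", "reader")
  .load()
```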
What is RDD (Resilient Distributed Dataset)?
The RDD is the fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
It has certain characteristics:
1) A logical abstraction for a dataset distributed across the entire cluster, possibly divided into partitions
2) Resilient and immutable - an RDD can be recreated at any point in the execution cycle; each stage of the DAG is an RDD
3) Compile time safety
4) Unstructured/Structured data
5) Lazy - an action starts the execution, just like Java streams (see the sketch below)
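A small sketch of these characteristics, again assuming a SparkSession named spark:

```scala
// Immutable, partitioned, lazy: `base` is never modified, only new RDDs are derived.
val base = spark.sparkContext.parallelize(Seq("a", "b", "a", "c"), numSlices = 4)
println(base.getNumPartitions)            // logical partitions across the cluster

val pairs  = base.map(word => (word, 1))  // new RDD, `base` unchanged
val counts = pairs.reduceByKey(_ + _)     // another new RDD

counts.collect().foreach(println)         // action: only now does work happen
```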
What are DataFrames?
A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases or existing RDDs.
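A short DataFrame sketch; the column names and values are made up for illustration:

```scala
// Assumes a SparkSession named `spark`.
import spark.implicits._

val people = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")

people.printSchema()                 // named, typed columns
people.filter($"age" > 40).show()

// DataFrames can equally be built from files, Hive tables, JDBC sources,
// or existing RDDs, e.g. spark.read.parquet("/path/to/file.parquet").
```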
Apache Spark Components
1) Spark Core
2) Spark Streaming
3) Spark SQL
4) GraphX
5) MLlib (Machine Learning)
What is Spark Core?
Spark Core is the base engine for large-scale parallel and distributed data processing. The core is the distributed execution engine and the Java, Scala, and Python APIs offer a platform for distributed ETL application development. Further, additional libraries which are built atop the core allow diverse workloads for streaming, SQL, and machine learning. It is responsible for:
1) Memory management and fault recovery
2) Scheduling, distributing and monitoring jobs on a cluster
3) Interacting with storage systems
What is Spark Streaming?
Spark Streaming is the component of Spark used to process real-time streaming data, and is thus a useful addition to the core Spark API. It enables high-throughput and fault-tolerant stream processing of live data streams. The fundamental stream unit is the DStream, which is basically a series of RDDs (Resilient Distributed Datasets) used to process the real-time data.
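A minimal DStream word count, assuming a plain text source on localhost:9999 (for example started with nc -lk 9999); the host, port, and batch interval are assumptions:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumes an existing SparkSession named `spark`.
val ssc   = new StreamingContext(spark.sparkContext, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)

lines.flatMap(_.split(" "))
     .map(word => (word, 1))
     .reduceByKey(_ + _)
     .print()              // each micro-batch is, under the hood, a series of RDDs

ssc.start()
ssc.awaitTermination()
```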
What is Spark SQL?
Spark SQL is a module in Spark that integrates relational processing with Spark’s functional programming API. It supports querying data either via SQL or via the Hive Query Language. For those familiar with RDBMSs, Spark SQL is an easy transition from earlier tools, while extending the boundaries of traditional relational data processing.
Spark SQL integrates relational processing with Spark’s functional programming. Further, it provides support for various data sources and makes it possible to weave SQL queries with code transformations thus resulting in a very powerful tool.
The following are the four libraries of Spark SQL.
Data Source API
DataFrame API
Interpreter & Optimizer
SQL Service
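To illustrate weaving SQL queries with code transformations, a minimal sketch (the table and column names are made up):

```scala
// Assumes a SparkSession named `spark`.
import spark.implicits._

val sales = Seq(("US", 100.0), ("DE", 80.0), ("US", 50.0)).toDF("country", "amount")
sales.createOrReplaceTempView("sales")

// Query via SQL...
val totals = spark.sql(
  "SELECT country, SUM(amount) AS total FROM sales GROUP BY country")

// ...and keep transforming the result with the DataFrame API.
totals.filter($"total" > 90).show()
```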
Stream vs Table
A stream is an unbounded collection of facts; facts are immutable things that keep coming at you.
A table is a collection of evolving facts. It represents the latest fact about something.
Let's say I am using a music streaming service: I create a profile and set my current location to X. That becomes an event in the profile-update event stream. Some time later I move, stream a song from location Y, and change my current location to Y. That is also an event in the profile-update event stream. In the table, however, since it is a collection of evolving facts, the new profile-update event overrides the previous one, so the table holds only the latest location, i.e. location Y (see the sketch below).
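A plain-Scala sketch of the idea, deliberately not tied to any streaming API: the stream is the full sequence of profile-update events, while the table is a fold that keeps only the latest fact per key. The case class and field names are made up:

```scala
case class ProfileUpdate(user: String, location: String)

// The stream: every event, in order (unbounded in reality, finite here).
val stream = Seq(
  ProfileUpdate("me", "X"),
  ProfileUpdate("me", "Y")
)

// The table: later events for the same key override earlier ones.
val table: Map[String, String] =
  stream.foldLeft(Map.empty[String, String]) { (acc, event) =>
    acc + (event.user -> event.location)
  }

println(table("me"))   // "Y" -- only the latest location survives in the table
```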
What is the primary unit for working with data in Spark?
The RDD (Resilient Distributed Dataset). It is immutable: any transformation you apply produces a new RDD in memory, which stays immutable and is used in the next stage of the pipeline. This in-memory design is what makes Spark faster than Hadoop MapReduce.
Types of nodes in Spark
1) Driver - the brain or orchestrator, where the query plan is generated
2) Executor - there can be 100s or 1000s of executors / workers
3) Shuffle service
In what ways do Spark nodes communicate?
1) Broadcast - used when you have information that is small enough, say 1 GB or less. A broadcast takes information you have on the driver and puts it on all nodes (see the sketch after this list).
2) Take - brings information the executors have back to the driver, for example to print it or write it to a file. The problem is that there are many executors and only one driver, so the driver can run out of memory or take too long to process that data.
3) DAG action (Directed Acyclic Graph) - a DAG is a set of instructions describing what should be done and in what order. DAGs are passed from the driver to the executors and are small, so this is a very inexpensive operation.
4) Shuffle - needed for a join, group by, or reduce. As in an RDBMS, if you want to join two large tables you need to partition and order the data so that the two sides can find their matches. In a distributed shuffle, mappers write to multiple shuffle temp locations, broken into partitions, and on the reduce side each reducer reads its partition from every mapper. A lot of communication happens (mappers x reducers), so shuffle is the most expensive operation.
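A small sketch of broadcast and take, assuming a SparkSession named spark; the lookup map is a made-up example:

```scala
// Broadcast: ship a small lookup map (well under a GB) from the driver to every executor.
val countryNames = Map("US" -> "United States", "DE" -> "Germany")
val bcNames      = spark.sparkContext.broadcast(countryNames)

val codes    = spark.sparkContext.parallelize(Seq("US", "DE", "US"))
val resolved = codes.map(code => bcNames.value.getOrElse(code, "unknown"))

// Take: pull only a small sample back to the single driver.
resolved.take(3).foreach(println)
```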
What makes Spark, Spark?
1) RDDs
2) DAGs
3) FlumeJava APIs
4) Long Lived Jobs
What is DAG?
A DAG represents the steps that need to happen to get from point A to point B. For example, you join two tables, group by at the end, and then do some aggregation; that can be a multi-step DAG. The main thing Spark provides is that the DAG lives in the engine.
DAG contains:
1) Source
2) RDDs - which are formed after each transformation
3) Transformations - the edges of the DAG
4) Actions (end nodes) - like foreach, print, count, take
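The source, intermediate RDDs, transformations, and final action can be seen in the lineage Spark records; a small sketch assuming a SparkSession named spark:

```scala
val words  = spark.sparkContext.parallelize(Seq("a b", "b c", "c a"))  // source
val counts = words
  .flatMap(_.split(" "))      // transformation (edge)
  .map(word => (word, 1))     // transformation (edge)
  .reduceByKey(_ + _)         // transformation that introduces a shuffle stage

println(counts.toDebugString) // prints the DAG / lineage of this RDD
counts.collect()              // action (end node) that triggers execution
```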
Examples of Actions
Actions are what start the processing, since Spark is lazy.
1) foreach
2) reduce
3) take
4) count
Examples of transformation
1) Map
2) Reduce By Key
3) Group By Key
4) Join By Key
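A short sketch exercising the listed transformations and actions together (the data is made up, and the SparkSession name spark is an assumption):

```scala
val purchases = spark.sparkContext.parallelize(
  Seq(("bob", 10.0), ("ann", 5.0), ("bob", 2.5)))
val names = spark.sparkContext.parallelize(
  Seq(("bob", "Bob Smith"), ("ann", "Ann Lee")))

val totals  = purchases.reduceByKey(_ + _)   // transformation: reduce by key
val grouped = purchases.groupByKey()         // transformation: group by key
val joined  = totals.join(names)             // transformation: join by key

println(joined.count())                      // action: count
joined.take(2).foreach(println)              // actions: take, foreach
```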
Why is DAG important?
Hint: Think failures
Suppose there is an error while computing a particular stage. Spark can then grab the result of the previous stage, recompute only the part that failed, and continue with the next stages.
What is the FlumeJava API?
The idea of the API was to let us write distributed code the way we write non-distributed code: regular code that runs on a local machine, and the same code, when put on a cluster, runs in distributed mode.
What are some issues with distributed and parallel jobs?
Hint: Work division
1) Too much or too little parallelism - with too much, the cost of dividing the work (before and after) adds up and exceeds the execution cost; with too little, we are not properly utilizing the available resources.
2) Work skew - one process, thread, core, or machine does much more work while the others sit idle, so the next step gets delayed.
3) Cartesian join - happens when you join two tables with a many-to-many relationship. Let's say key A has 1000 rows in table A and 1000 rows in table B; joining them yields 1000 * 1000 = 1M records.
Which is quick way to fix work skew?
A quick way to fix skew is the classic hash-then-mod trick: hash the key and take it modulo the number of partitions to spread the load.
To improve the distribution further, you add salt.
Salt is nothing but a small random component mixed into the key so that a single hot key gets spread across many partitions (see the sketch below).
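A hedged sketch of salting a hot key, assuming a SparkSession named spark and an additive aggregation (the fan-out N and the key format are arbitrary choices):

```scala
import scala.util.Random

val skewedPairs = spark.sparkContext.parallelize(
  Seq.fill(100000)(("hotKey", 1L)) ++ Seq(("rareKey", 1L)))

val N = 8   // salt fan-out: how many sub-keys one hot key is split into

// Add a random salt so one hot key becomes N different keys.
val salted = skewedPairs.map { case (key, value) =>
  (s"${key}_${Random.nextInt(N)}", value)
}

// First aggregate per salted key (work is spread out), then strip the salt
// and aggregate again to get the final per-key totals.
// (Assumes original keys contain no '_'.)
val partial = salted.reduceByKey(_ + _)
val totals  = partial
  .map { case (saltedKey, value) => (saltedKey.split("_")(0), value) }
  .reduceByKey(_ + _)

totals.collect().foreach(println)
```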
How to fix Cartesian Joins?
1) Nested structures - for example, if you have Bob in table A and his three cat purchases in table B, joining the two tables leaves you with three rows repeating Bob. Instead, you can use a nested structure: a single Bob row whose purchases field holds multiple entries. This became popular with Hive and Hadoop when joins became expensive (see the sketch after this list).
2) Windowing
3) Reduce By Key
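A sketch of the nested-structure approach using collect_list; the column names and data are illustrative:

```scala
// Assumes a SparkSession named `spark`.
import org.apache.spark.sql.functions.collect_list
import spark.implicits._

val purchases = Seq(
  ("bob", "cat food"),
  ("bob", "cat toy"),
  ("bob", "cat bed")
).toDF("name", "item")

// One row per person, with all items folded into a single array column,
// instead of repeating the person on every joined row.
val nested = purchases
  .groupBy("name")
  .agg(collect_list("item").as("items"))

nested.show(truncate = false)   // bob | [cat food, cat toy, cat bed]
```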
Why use RDDs?
1) Control and flexibility - low level so offers more control
2) Low level API
3) Compile time safety
4) Encourages “how to” rather than “what to”.
What are the problems with RDDs?
Hint: performance optimization
Because we are telling Spark exactly how to do things rather than what to do, Spark cannot optimize for us. The lambda functions we provide are hidden/opaque to Spark; it cannot see inside them.
The solution is to use Datasets and DataFrames.
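A contrast sketch of the same logic expressed both ways: the RDD lambdas are opaque to Spark, while the equivalent DataFrame expressions can be analyzed and optimized by the engine (the column names and data are made up):

```scala
// Assumes a SparkSession named `spark`.
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 5), ("c", 10)))

// "How to": Spark only sees arbitrary functions and must run them as-is.
val rddResult = rdd.filter { case (_, v) => v > 3 }
                   .map { case (k, v) => (k, v * 2) }

// "What to": declarative expressions Spark can inspect, reorder, and push down.
val df       = rdd.toDF("key", "value")
val dfResult = df.filter($"value" > 3)
                 .select($"key", ($"value" * 2).as("doubled"))

dfResult.explain()   // shows the optimized plan Spark derived for us
```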