Spark Architecture Flashcards

1
Q

Cluster Definition & Components (5)

A

A group of computers used to process big data. Multiple Spark applications can run on a cluster at the same time.

  1. Driver
  2. Worker
  3. Executor
  4. Task
  5. Core
2
Q

Driver

(definition & 3 responsibilities)

(relationship to cluster)

A

The machine on which the Spark application runs; it sits on a node within the cluster.

(1) Maintains information about the application
(2) Responds to the user's program
(3) Analyzes, distributes, and schedules work across executors.

Each Spark application has exactly one driver.

3
Q

Executor

(definition & 2 responsibilities)

A

A process that holds a partition of data: a collection of rows sitting on a single physical machine.

Responsible for carrying out the work assigned by the driver:

(1) Executes code assigned by the driver
(2) Reports the status of the computation back to the driver.

First level of parallelization.

4
Q

Task

(def. & what creates a task)

(where is it assigned)

A

Created by the driver to process a partition of data. Tasks are assigned to a slot/core.

5
Q

Lazy Evaluation

(definition & optimization)

A

A transformation is not executed immediately; it runs only once an action is called. This allows Spark to optimize the pipeline, executing multiple computations at once.
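
Lazy evaluation can be mimicked in plain Python with generators (an analogy only, not Spark's engine): the pipeline is declared up front, but no work happens until a terminal, action-like call consumes it.

```python
# Plain-Python analogy for lazy evaluation: generator pipelines, like
# Spark transformations, do nothing until a terminal ("action"-like)
# call such as sum() consumes them.
calls = []

def numbers():
    for n in range(10):
        calls.append(n)   # record that work actually happened
        yield n

pipeline = (n * 2 for n in numbers() if n % 2 == 0)  # nothing runs yet
assert calls == []        # no work has been done so far

result = sum(pipeline)    # the "action": triggers the whole pipeline
assert calls == list(range(10))
assert result == 40       # doubled evens: 0 + 4 + 8 + 12 + 16
```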

6
Q

Narrow Transformations

(def & 3 examples)

A

One input partition contributes to one output partition.

examples:

.select()

.cast()

.union()

7
Q

Wide Transformations

(def & 3 ex.)

A

One input partition contributes to multiple output partitions. A wide transformation triggers a shuffle.

examples:

.distinct()

.groupBy()

.join()
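
The narrow/wide difference can be sketched in plain Python by modeling partitions as lists (a toy model with hypothetical helpers, not Spark's implementation): a narrow transformation maps each input partition to exactly one output partition, while a wide one must first redistribute (shuffle) rows across partitions.

```python
# Toy model: partitions are plain lists; these helpers are not Spark APIs.
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Narrow: each output partition depends on exactly one input partition.
def narrow_map(parts, fn):
    return [[fn(x) for x in part] for part in parts]

# Wide: every output partition may need rows from every input partition,
# so rows are first redistributed (shuffled) by a key.
def wide_group_by(parts, key_fn, n_out):
    shuffled = [[] for _ in range(n_out)]
    for part in parts:                 # full exchange across partitions
        for x in part:
            shuffled[key_fn(x) % n_out].append(x)
    return shuffled

doubled = narrow_map(partitions, lambda x: x * 2)
by_parity = wide_group_by(partitions, lambda x: x, 2)

assert doubled == [[2, 4, 6], [8, 10, 12], [14, 16, 18]]
assert by_parity == [[2, 4, 6, 8], [1, 3, 5, 7, 9]]
```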

8
Q

Shuffle

A

The redistribution of data across multiple executors, triggered by wide transformations; rows are exchanged between partitions over the network.

9
Q

Catalyst Optimizer / Query Optimization

(what it does & 7 components)

A

The core of Spark's power and speed: it automatically finds the most efficient plan for applying transformations and actions.

  1. user input: a query or DataFrame
  2. unresolved logical plan: a plan for the transformations, waiting to be validated against the catalog
  3. analysis: column names are validated against the catalog (table metadata); the unresolved plan becomes the logical plan
  4. logical optimization: the first stage of optimization, determining the most efficient sequence of commands
  5. physical plans: each represents the query engine's actions after all optimizations have been applied
  6. cost model: each physical plan is evaluated according to its cost model; the best-performing plan becomes the selected physical plan
  7. code generation: the selected physical plan is compiled to Java bytecode and executed.
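
A toy illustration of logical optimization (step 4) in plain Python, using a made-up plan representation (not Catalyst's internals): a rewrite rule collapses two adjacent filters into one, so the data is scanned once instead of twice. Catalyst applies many rules of this kind to the real logical plan.

```python
# Hypothetical plan nodes, not Catalyst's: a plan is a nested tuple,
# either ("filter", predicate, child) or ("scan", rows).

def optimize(plan):
    """Rule: Filter(p, Filter(q, child)) -> Filter(p and q, child)."""
    if plan[0] == "filter" and plan[2][0] == "filter":
        p, (_, q, child) = plan[1], plan[2]
        combined = lambda row, p=p, q=q: p(row) and q(row)
        return optimize(("filter", combined, child))
    return plan

def execute(plan):
    if plan[0] == "scan":
        return plan[1]
    _, pred, child = plan
    return [r for r in execute(child) if pred(r)]

plan = ("filter", lambda r: r > 3,
        ("filter", lambda r: r % 2 == 0,
         ("scan", [1, 2, 3, 4, 5, 6])))

optimized = optimize(plan)
assert optimized[2][0] == "scan"     # the two filters were merged into one
assert execute(optimized) == [4, 6]  # same result, single pass over the data
```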
10
Q

Caching

A

Places a DataFrame into temporary storage across the executors in your cluster to make subsequent reads faster.
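
A rough single-machine analogy using Python's standard library (not Spark's caching mechanism): after the first call, the result is served from a cache instead of being recomputed, just as a cached DataFrame is read from executor storage rather than rebuilt from source.

```python
# Single-machine analogy only: Spark caching stores DataFrame partitions
# in executor memory; here lru_cache memoizes an expensive computation.
from functools import lru_cache

compute_count = 0

@lru_cache(maxsize=None)
def expensive_read(key):
    global compute_count
    compute_count += 1            # simulate a slow recomputation
    return key * 10

first = expensive_read(3)         # first read: computed
second = expensive_read(3)        # subsequent reads: served from cache
assert first == second == 30
assert compute_count == 1         # the work ran only once
```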

11
Q

Spark API

(4 Components)

A
  1. Spark SQL + DataFrames: structured data processing
  2. Streaming: streaming data
  3. MLlib: scalable machine learning library
  4. Core API: the execution engine that all other functionality is built on. Supports Java, Scala, Python, R, and SQL.
12
Q

Worker

(Definition)

A

Hosts the executor process. Has a fixed number of executors allocated at any point in time.

13
Q

Core/Slot

A

Splits the work within an executor; a task is assigned to it. The second level of parallelization.

14
Q

Transformations

(definition & examples)

(what are the two types of transformations?)

A

How Spark expresses business logic. Instructions for modifying a DataFrame.

Transformation Examples:

.select()

.distinct()

.groupBy()

.sum()

.filter()

.limit()

Two types: Narrow and Wide

15
Q

Actions

(definition & 3 types of actions and examples)

A

Statements computed and executed when encountered in the developer’s code. Methods that trigger computation.

Action Types:

  1. view data in console - .show(), .count()
  2. write to output data sources - saveAsTextFile()
  3. collect data to native objects - .collect()
16
Q

UnsafeRow (definition)

A

The Tungsten binary format: the in-memory storage format for Spark SQL and DataFrames, and the format of shuffled data.

17
Q

Adaptive Query Execution (AQE)

(definition and 3 dynamic features)

A

Re-optimizes and adjusts query plans based on runtime statistics.

  1. dynamically coalescing shuffle partitions
  2. dynamically switching join strategies
  3. dynamically optimizing skew joins
18
Q

Partition Guidelines (3)

(best practices for when & how to use)

A
  1. err on the side of too many small partitions rather than too few large partitions
  2. don't allow partition size to grow beyond 200 MB per 8 GB of total core memory
  3. calculate the shuffle partition count by dividing the largest shuffle stage input by the target partition size

4 TB largest shuffle input / 200 MB target size = 20K shuffle partitions
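
Guideline 3 as arithmetic in plain Python (the 200 MB target and 4 TB input are the card's own example figures):

```python
# Shuffle partition count = largest shuffle stage input / target partition size.
largest_shuffle_input_mb = 4 * 1000 * 1000   # 4 TB in MB (decimal units, as in the card)
target_partition_mb = 200

shuffle_partitions = largest_shuffle_input_mb // target_partition_mb
assert shuffle_partitions == 20_000          # the card's "20K shuffle partition count"
```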

19
Q

Streaming Use Cases (6)

A
  1. notifications
  2. real-time reporting
  3. incremental ETL
  4. update servers in real time
  5. real-time decision making
  6. online ML
20
Q

advantages of stream processing (2)

A
  1. lower latency: lag time in response
  2. more efficient updating
21
Q

micro batch processing

continuous processing

(definition and how it works)

A

micro-batch processing: waits to accumulate small batches of input data, then processes each batch in parallel. Higher throughput, but higher latency.

continuous processing: each node continually listens to messages from other nodes and outputs new updates to its child nodes. Provides the lowest possible latency, but lower maximum throughput.

Generally, large-scale streaming applications tend to prioritize throughput, so traditionally Spark focused on micro-batch processing. However, Structured Streaming supports both.
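
Micro-batching can be sketched in plain Python (a toy accumulator, not Structured Streaming's engine): input records are buffered until a batch fills, then the whole batch is processed at once.

```python
# Toy micro-batch accumulator: buffer records, flush in fixed-size batches.
batch_size = 3
buffer, processed_batches = [], []

def on_record(record):
    buffer.append(record)
    if len(buffer) == batch_size:        # batch is full: run the batch "job"
        processed_batches.append([r * 2 for r in buffer])
        buffer.clear()

for r in range(7):
    on_record(r)

assert processed_batches == [[0, 2, 4], [6, 8, 10]]
assert buffer == [6]   # the last record waits for its batch to fill (latency cost)
```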

22
Q

Predicate Pushdown

(component of query optimization)

A

Filter statements are known as predicates. Query performance can be improved by reducing the amount of data read, using a filter.

If you “push down” parts of the query to where the data is stored, you greatly reduce network traffic (with cost implications in the cloud) and increase data reading speed.
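
The effect can be simulated in plain Python (a toy model, not a real data source): compare how many rows cross the "network" when the filter runs at the storage layer versus after a full transfer.

```python
# Toy model of predicate pushdown: filtering at the source transfers
# far fewer rows than shipping a full scan and filtering afterwards.
rows = [{"id": i, "region": "eu" if i % 10 == 0 else "us"} for i in range(1000)]
pred = lambda r: r["region"] == "eu"

# Without pushdown: all rows cross the "network", then get filtered.
transferred_no_pushdown = rows                        # 1000 rows shipped
result_a = [r for r in transferred_no_pushdown if pred(r)]

# With pushdown: the predicate runs where the data lives.
transferred_pushdown = [r for r in rows if pred(r)]   # 100 rows shipped
result_b = transferred_pushdown

assert result_a == result_b                   # same answer either way
assert len(transferred_no_pushdown) == 1000
assert len(transferred_pushdown) == 100       # 10x less network traffic
```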