BigData - Spark Flashcards

1
Q

What is Spark?

A

An “engine” for distributed data processing over a cluster.
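
A minimal Scala sketch of what "using the engine" looks like, assuming Spark is on the classpath; the app name and the "local[*]" master are placeholders for illustration only (a real deployment would point at a cluster master):

```scala
import org.apache.spark.sql.SparkSession

object SparkIntro {
  def main(args: Array[String]): Unit = {
    // Entry point. "local[*]" runs on all local cores; on a real cluster the
    // master would be YARN, Kubernetes, or a standalone master URL.
    val spark = SparkSession.builder()
      .appName("spark-intro")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Distribute a small collection and process it in parallel on the executors.
    val counts = sc.parallelize(Seq("a", "b", "a", "c"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach(println)
    spark.stop()
  }
}
```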

2
Q

What does it mean that RDDs are resilient and immutable?

A

Resilient: an RDD can be recreated from its lineage, the recorded history of transformations that produced it.
Immutable: an RDD cannot be modified once created; transformations always return new RDDs.
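
A small sketch showing both properties, assuming an existing SparkContext `sc` (e.g. from spark-shell): transformations never modify an RDD, and each RDD keeps the history needed to recreate it.

```scala
// Assumes an existing SparkContext `sc` (e.g. in spark-shell).
val base = sc.parallelize(1 to 10)

// Transformations never modify `base`; they return new RDDs that record
// how they were derived from it (their lineage).
val doubled = base.map(_ * 2)
val evens   = doubled.filter(_ % 4 == 0)

// `base` is unchanged, and `evens` can always be recomputed from this history,
// which is what makes it resilient to lost partitions.
println(evens.toDebugString)   // prints the lineage: filter <- map <- parallelize
```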

3
Q

RDD in Spark

A

Spark distributes the data of an RDD across different nodes of the cluster to achieve parallelization: the data is split into partitions, the partitions are placed on different nodes, and they are processed in parallel.
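
A sketch, assuming an existing SparkContext `sc`: the collection is split into partitions and the function passed to map runs on each partition in parallel.

```scala
// Assumes an existing SparkContext `sc`.
// Ask for 4 partitions explicitly; Spark spreads them over the cluster's executors.
val rdd = sc.parallelize(1 to 1000, numSlices = 4)
println(rdd.getNumPartitions)   // 4

// The mapped function runs on every partition in parallel,
// on whichever node holds that partition.
val squares = rdd.map(x => x * x)
println(squares.sum())
```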

4
Q

What is a partition?

A

A batch (chunk) of an RDD's data that is operated on in parallel; each partition is processed by one task.
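
A sketch, assuming an existing SparkContext `sc`, that makes the per-partition batches visible with glom() and processes one batch at a time with mapPartitions:

```scala
// Assumes an existing SparkContext `sc`.
val rdd = sc.parallelize(1 to 8, numSlices = 4)

// glom() turns each partition into an array, making the batches visible:
rdd.glom().collect().foreach(batch => println(batch.mkString(",")))
// prints something like:  1,2   3,4   5,6   7,8

// mapPartitions runs a function once per partition (per batch), not per element;
// each batch is handled by one task, and the tasks run in parallel.
val batchSums = rdd.mapPartitions(iter => Iterator(iter.sum))
println(batchSums.collect().mkString(","))   // e.g. 3,7,11,15
```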

5
Q

Explain - Dependencies between RDDs: Narrow dependencies.

A

Narrow dependencies:
Each partition of the parent RDD(s) is used by at most one partition of the child RDD (e.g., map, filter).
(1) Operations with narrow dependencies are “pipelined” locally on one cluster node;
(2) if one partition is lost, recomputation is also local.
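
A sketch, assuming an existing SparkContext `sc`: map and filter have narrow dependencies, so Spark pipelines them inside a single stage with no shuffle.

```scala
// Assumes an existing SparkContext `sc`.
val nums = sc.parallelize(1 to 100, numSlices = 4)

// map and filter are narrow: each child partition depends on exactly one parent
// partition, so Spark pipelines them inside a single stage with no shuffle,
// and a lost partition is recomputed locally from just its parent partition.
val pipelined = nums.map(_ + 1).filter(_ % 2 == 0)

println(pipelined.toDebugString)   // one stage: filter <- map <- parallelize
println(pipelined.count())
```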

6
Q

Explain - Dependencies between RDDs: Wide dependencies.

A

A parent partition has multiple child partitions depending on it (e.g., groupByKey, reduceByKey, join).
(1) Operations with wide dependencies require data from all parent partitions to be shuffled across the network (MapReduce-like);
(2) if one partition is lost, recomputation needs data from all parent partitions and may force a complete re-execution of the program.
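
A sketch, assuming an existing SparkContext `sc`: reduceByKey has wide dependencies, so the data is shuffled across the network and a new stage begins.

```scala
// Assumes an existing SparkContext `sc`.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)), numSlices = 4)

// reduceByKey is wide: each output partition may need records from every input
// partition, so the data is shuffled across the network and a new stage starts.
val sums = pairs.reduceByKey(_ + _)

println(sums.toDebugString)        // shows a ShuffledRDD, i.e. a stage boundary
println(sums.collect().toMap)      // Map(a -> 4, b -> 2, c -> 4)
```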

7
Q

Spark Streaming: Batching

A

Streaming input data is batched (discretized):
at a small time interval (e.g., every 1 second);
the smaller this interval, the lower the latency;
each batch is processed with Spark.
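
A sketch using the classic DStream API; the local master and the socket source on localhost:9999 are placeholders for illustration only:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Streaming needs at least two local threads: one to receive, one to process.
val conf = new SparkConf().setAppName("streaming-batching").setMaster("local[2]")

// Batch interval = 1 second: incoming records are grouped (discretized) into one
// small batch per second, and each batch is processed as a normal Spark job.
val ssc = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)
lines.count().print()              // number of records in each 1-second batch

ssc.start()
ssc.awaitTermination()
```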

8
Q

D-Stream and RDD

A

Each D-Stream periodically generates an RDD, either from live input data or by transforming the RDDs generated by a parent D-Stream.
Windowed operations then group the records from a sliding window of past batches into a single RDD.
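
A sketch of a windowed D-Stream, again using a placeholder socket source on localhost:9999: every batch interval each D-Stream emits one RDD, and the window groups the RDDs of the last 30 seconds, sliding every 10 seconds.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("dstream-window").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(1))

// Every batch interval this D-Stream generates one RDD from the live socket data.
val lines = ssc.socketTextStream("localhost", 9999)
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

// The windowed transformation groups the records from the last 30 seconds of
// generated RDDs into a single RDD, sliding forward every 10 seconds.
val windowedCounts =
  pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedCounts.print()

ssc.start()
ssc.awaitTermination()
```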

9
Q

Spark Streaming: Guarantees

A

D-Streams (and RDDs) track their lineage (the graph of transformations)
- at the level of partitions within each RDD
- when a node fails, its RDD partitions are rebuilt on other machines from the original input data stored in the cluster.
D-Streams provide consistent, exactly-once processing.
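
A brief sketch, reusing the `ssc` and `lines` names from the batching example above (so purely illustrative, and the checkpoint directory is a placeholder): checkpointing and per-partition lineage are what recovery relies on.

```scala
// Reuses `ssc` and `lines` from the batching example; call this before ssc.start().
// Metadata checkpointing lets a restarted driver recover the D-Stream graph,
// while RDD lineage lets lost partitions be rebuilt on other nodes.
ssc.checkpoint("/tmp/streaming-checkpoints")   // placeholder path

// Inspect the lineage of the RDD behind each batch (tracked per partition).
lines.foreachRDD(rdd => println(rdd.toDebugString))
```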
