RDD (CC, Distributed Processing) Flashcards

Question 1

Q

What is the Background (RDD)?

Answer

A

Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.

RDDs are read-only, partitioned collections of records.

RDDs can only be created through deterministic operations (transformations) on either data in stable storage or other RDDs.
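
The three points above can be sketched with a toy model in plain Python. This is not Spark's API: `ToyRDD`, `from_storage`, and the round-robin partitioning are hypothetical names and choices made for illustration, and the sketch evaluates eagerly on one machine, whereas a real RDD is distributed across a cluster.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)  # frozen: fields cannot be reassigned (read-only)
class ToyRDD:
    """Hypothetical stand-in for an RDD: an immutable, partitioned collection."""
    partitions: Tuple[Tuple[int, ...], ...]

    @staticmethod
    def from_storage(records: Sequence[int], num_partitions: int) -> "ToyRDD":
        # Assumption: split the stored records round-robin across partitions.
        parts = tuple(tuple(records[i::num_partitions])
                      for i in range(num_partitions))
        return ToyRDD(parts)

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # A transformation never modifies this dataset; it returns a new one.
        return ToyRDD(tuple(tuple(fn(x) for x in p) for p in self.partitions))

    def filter(self, pred: Callable[[int], bool]) -> "ToyRDD":
        return ToyRDD(tuple(tuple(x for x in p if pred(x))
                            for p in self.partitions))

    def collect(self) -> list:
        # Gather all records from all partitions into one list.
        return [x for p in self.partitions for x in p]


# Created from data in (simulated) stable storage ...
base = ToyRDD.from_storage(list(range(6)), num_partitions=2)
# ... or from other RDDs through deterministic transformations.
squares = base.map(lambda x: x * x)
even_squares = squares.filter(lambda x: x % 2 == 0)
```

After these calls `base` is unchanged, and `even_squares.collect()` gives `[0, 4, 16]`.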

Question 2

Q

What is the Problem (RDD)?

Answer

A

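RDDs were motivated by two kinds of applications that the cluster computing frameworks of the time (such as MapReduce) handled inefficiently: iterative algorithms and interactive data mining tools. Both reuse intermediate results across multiple computations, and those frameworks could only share data between computations by writing it to stable storage.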

Question 3

Q

What was the Solution to the problem (RDD)? (Data, Persist, Fault Tolerance)

Answer

A

RDDs enable efficient data reuse in a broad range of applications by providing fault-tolerant, parallel data structures.
RDDs allow users to explicitly persist intermediate results in memory and control their partitioning.
Fault tolerance is achieved by logging the transformations used to build a dataset (its lineage) rather than the actual data.
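
Persistence and lineage-based recovery can be illustrated with a small sketch. It is a toy single-process model under stated assumptions, not Spark's implementation: `LineageRDD`, `lose_partition`, `read_partition`, and the `STABLE_STORAGE` dictionary are hypothetical names invented for this example.

```python
class LineageRDD:
    """Hypothetical RDD that records how it was built instead of copying data."""

    def __init__(self, num_partitions, compute, parent=None, op="source"):
        self.num_partitions = num_partitions
        self._compute = compute  # partition index -> list of records
        self.parent = parent     # lineage: the dataset this one derives from
        self.op = op             # lineage: the transformation that was applied
        self._cache = None       # becomes a dict once persist() is called

    @classmethod
    def from_storage(cls, read_partition, num_partitions):
        return cls(num_partitions, read_partition)

    def map(self, fn):
        # Only the transformation is logged; no data is computed yet.
        return LineageRDD(self.num_partitions,
                          lambda i: [fn(x) for x in self.partition(i)],
                          parent=self, op="map")

    def persist(self):
        # The user explicitly asks to keep computed partitions in memory.
        if self._cache is None:
            self._cache = {}
        return self

    def partition(self, i):
        if self._cache is None:
            return self._compute(i)
        if i not in self._cache:
            # Missing or lost: rebuild from lineage, then keep it in memory.
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i):
        # Simulate a node failure that drops one in-memory partition.
        if self._cache is not None:
            self._cache.pop(i, None)

    def lineage(self):
        node, ops = self, []
        while node is not None:
            ops.append(node.op)
            node = node.parent
        return list(reversed(ops))


STABLE_STORAGE = {0: [1, 2], 1: [3, 4]}  # simulated stable storage
reads = []                               # which partitions were read


def read_partition(i):
    reads.append(i)
    return list(STABLE_STORAGE[i])


source = LineageRDD.from_storage(read_partition, num_partitions=2)
doubled = source.map(lambda x: x * 2).persist()

first = doubled.partition(1)      # computed once, then kept in memory
again = doubled.partition(1)      # served from memory, no storage read
doubled.lose_partition(1)         # simulated failure
recovered = doubled.partition(1)  # rebuilt by replaying the lineage
```

In this run only partition 1 is ever read from storage, once for the first computation and once for the recovery. The lost data is rebuilt from the logged transformations rather than restored from a replica, and partition 0 is not touched.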

Question 4

Q

What are the applications (RDD)?

Answer

A

Apache Spark, the system in which RDDs were implemented.

Question 5

Q

What are the Strengths and Weaknesses (RDD)? (Fault Tolerance, Asynchronous Fine-Grained)

Answer

A

Advantages:
- RDDs offer efficient fault tolerance: because the transformations (lineage) are logged, only the lost partitions need to be recomputed after a failure, without costly replication of the data.

Limitations:
- RDDs are best suited for batch applications that apply the same operation to all elements of a dataset, and less suitable for applications with asynchronous fine-grained updates to shared state.
