RDD (CC, Distributed Processing) Flashcards
(5 cards)
What is the Background (RDD)?
Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
RDDs are read-only, partitioned collections of records.
RDDs can only be created through deterministic operations (transformations) on either data in stable storage or other RDDs.
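The three properties above (read-only, partitioned, built only by transformations) can be sketched in a few lines of plain Python. This is a toy, single-process illustration, not Spark's API: the class name MiniRDD and the helper from_records are hypothetical, and the sketch evaluates eagerly, whereas Spark evaluates transformations lazily.

```python
from typing import Callable, Iterable, List, Tuple


class MiniRDD:
    """Toy, single-process stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, partitions: Iterable[Iterable]):
        # Tuples make each partition immutable, mirroring "read-only".
        self._partitions: Tuple[tuple, ...] = tuple(tuple(p) for p in partitions)

    @classmethod
    def from_records(cls, records: List, num_partitions: int) -> "MiniRDD":
        # Stand-in for loading from stable storage: split records round-robin.
        return cls(records[i::num_partitions] for i in range(num_partitions))

    def map(self, fn: Callable) -> "MiniRDD":
        # A transformation returns a NEW dataset; the parent is never modified.
        return MiniRDD(tuple(fn(x) for x in p) for p in self._partitions)

    def filter(self, pred: Callable) -> "MiniRDD":
        # Same idea: keep matching records, partition by partition.
        return MiniRDD(tuple(x for x in p if pred(x)) for p in self._partitions)

    def collect(self) -> List:
        # Gather all partitions into one list (order follows the partitions).
        return [x for p in self._partitions for x in p]


base = MiniRDD.from_records(list(range(10)), num_partitions=3)
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
```

Because every transformation builds a new MiniRDD, base is unchanged after evens_squared is derived from it, which is the property that makes recomputation safe.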
What is the Problem (RDD)?
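RDDs were motivated by two kinds of applications that the computing frameworks of the time (such as MapReduce) handled inefficiently: iterative algorithms and interactive data mining tools. Both reuse intermediate results, and those frameworks could only share data between computations by writing it to stable storage.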
What was the Solution to the problem (RDD)? (Data, Persist, Fault Tolerance)
- RDDs enable efficient data reuse in a broad range of applications by providing fault-tolerant, parallel data structures.
- RDDs allow users to explicitly persist intermediate results in memory and control their partitioning.
- Fault tolerance is achieved by logging the transformations used to build a dataset (its lineage) rather than the actual data.
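The persistence and lineage points above can be sketched as a toy in plain Python. Everything here is an assumption made for illustration: LineageDataset, persist, compute and lose_partition are hypothetical names, not Spark's API, and the sketch only covers one-to-one (narrow) dependencies between partitions.

```python
from typing import Callable, Dict, List, Optional


class LineageDataset:
    """Toy dataset that records how it was built instead of replicating data."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source  # partitions in "stable storage" (root only)
        self.parent = parent  # lineage pointer to the parent dataset
        self.fn = fn          # per-partition transformation applied to the parent
        self.cache: Optional[Dict[int, List]] = None  # None means not persisted

    def map(self, f: Callable) -> "LineageDataset":
        # Record the transformation; nothing is computed yet.
        return LineageDataset(parent=self, fn=lambda part: [f(x) for x in part])

    def persist(self) -> "LineageDataset":
        # Ask for computed partitions to be kept in memory for reuse.
        self.cache = {}
        return self

    def compute(self, i: int) -> List:
        # Serve from memory if the partition is cached.
        if self.cache is not None and i in self.cache:
            return self.cache[i]
        # Otherwise rebuild it by walking the lineage back to stable storage.
        if self.parent is None:
            part = list(self.source[i])
        else:
            part = self.fn(self.parent.compute(i))
        if self.cache is not None:
            self.cache[i] = part
        return part

    def lose_partition(self, i: int) -> None:
        # Simulate a node failure that drops one in-memory partition.
        if self.cache is not None:
            self.cache.pop(i, None)


root = LineageDataset(source=[[1, 2], [3, 4]])
doubled = root.map(lambda x: x * 2).persist()
first = doubled.compute(1)      # computed from lineage, then cached
doubled.lose_partition(1)       # the cached copy is lost
recovered = doubled.compute(1)  # rebuilt from lineage, no replica needed
```

Only the lost partition is recomputed, and only from the recorded transformation and its parent partition, which is why no replica of the data itself is needed.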
What are the applications (RDD)?
Apache Spark, which implements RDDs as its core data abstraction.
What are the Strengths and Weaknesses (RDD)? (Fault Tolerance, Asynchronous Fine-Grained)
Advantages:
- RDDs offer efficient fault tolerance by logging transformations and enabling quick recovery of lost data without costly replication.
Limitations:
- RDDs are best suited for batch applications that apply the same operation to all elements of a dataset, and less suitable for applications with asynchronous fine-grained updates to shared state.