RDD (CC, Distributed Processing) Flashcards

(5 cards)

1
Q

What is the Background (RDD)?

A

RDDs are a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters in a fault-tolerant manner.

RDDs are read-only, partitioned collections of records.

RDDs can only be created through deterministic operations (transformations) on either data in stable storage or other RDDs.
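
The three properties above (read-only, partitioned, built only by transformations) can be sketched with a toy class. This is a minimal illustration in plain Python, not Spark's API: the names `ToyRDD`, `from_stable_storage`, and the round-robin partitioning scheme are assumptions made for this card.

```python
class ToyRDD:
    """Toy model of a read-only, partitioned collection (hypothetical, not Spark's API)."""

    def __init__(self, partitions):
        # Tuples keep each partition immutable once the dataset exists.
        self._partitions = tuple(tuple(p) for p in partitions)

    @classmethod
    def from_stable_storage(cls, records, num_partitions):
        # Assumed scheme: split the input round-robin into partitions.
        parts = [records[i::num_partitions] for i in range(num_partitions)]
        return cls(parts)

    def map(self, fn):
        # A transformation returns a new dataset; the parent is untouched.
        return ToyRDD([[fn(x) for x in p] for p in self._partitions])

    def filter(self, pred):
        return ToyRDD([[x for x in p if pred(x)] for p in self._partitions])

    def collect(self):
        # Gather all partitions into one list.
        return [x for p in self._partitions for x in p]

    def num_partitions(self):
        return len(self._partitions)


base = ToyRDD.from_stable_storage(list(range(10)), num_partitions=3)
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
```

Here `base` is never modified: `filter` and `map` each produce a fresh dataset, which is the sense in which RDDs are read-only.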

2
Q

What is the Problem (RDD)?

A

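RDDs were motivated by the inefficiency of existing cluster computing frameworks (such as MapReduce) in handling iterative algorithms and interactive data mining tools, both of which reuse intermediate results across multiple computations.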

3
Q

What was the Solution to the problem (RDD)? (Data, Persist, Fault Tolerance)

A
  • RDDs enable efficient data reuse in a broad range of applications by providing fault-tolerant, parallel data structures.
  • RDDs allow users to explicitly persist intermediate results in memory and control their partitioning.
  • Fault tolerance is achieved by logging the transformations used to build a dataset (its lineage) rather than the actual data.
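
The persist and lineage ideas in the bullets above can be sketched as follows. This is a toy model under stated assumptions, not Spark's implementation: `LineageRDD`, `lose_cache`, and the `computations` counter are hypothetical names, and a single in-process list stands in for both stable storage and cluster memory.

```python
class LineageRDD:
    """Toy dataset that remembers how it was built (hypothetical, not Spark's API)."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source    # stands in for data in stable storage
        self.parent = parent    # dataset this one was derived from
        self.fn = fn            # transformation applied to the parent (the lineage)
        self.cache = None       # in-memory copy, filled by persist()
        self.computations = 0   # how many times the data was actually computed

    def map(self, fn):
        # Record the transformation instead of materializing data.
        return LineageRDD(parent=self, fn=fn)

    def persist(self):
        # Explicitly keep the result in memory for reuse.
        self.cache = self._compute()
        return self

    def _compute(self):
        self.computations += 1
        if self.parent is None:
            return list(self.source)
        return [self.fn(x) for x in self.parent.collect()]

    def collect(self):
        if self.cache is not None:
            return self.cache
        return self._compute()

    def lose_cache(self):
        # Simulate losing the in-memory copy, e.g. a node failure.
        self.cache = None


base = LineageRDD(source=[1, 2, 3])
doubled = base.map(lambda x: x * 2).persist()
first = doubled.collect()       # served from the persisted copy
doubled.lose_cache()            # the cached data is gone
recovered = doubled.collect()   # rebuilt by replaying the recorded transformation
```

After the simulated failure, no replica is consulted: the dataset is recomputed from its parent using the stored transformation, which is the recovery path the third bullet describes.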
4
Q

What are the applications (RDD)?

A

Apache Spark, which implements RDDs as its core data abstraction.

5
Q

What are the Strengths and Weaknesses (RDD)? (Fault Tolerance, Asynchronous Fine-Grained)

A

Advantages:
- RDDs offer efficient fault tolerance by logging transformations and enabling quick recovery of lost data without costly replication.

Limitations:
- RDDs are best suited for batch applications that apply the same operation to all elements of a dataset, and less suitable for applications with asynchronous fine-grained updates to shared state.
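
One rough way to see the trade-off above is to count how much must be logged for recovery in each style. The sketch below is a toy accounting model, not a benchmark and not Spark code: `coarse_grained_log`, `fine_grained_log`, and the idea of measuring cost as the number of log entries are assumptions made for illustration.

```python
def coarse_grained_log(dataset, fn):
    """Apply one operation to every element; log only the operation itself."""
    log = [("map", fn.__name__)]
    return [fn(x) for x in dataset], log


def fine_grained_log(state, updates):
    """Apply individual updates to shared state; log every update."""
    log = []
    new_state = dict(state)
    for key, value in updates:
        new_state[key] = value
        log.append(("set", key, value))
    return new_state, log


def increment(x):
    return x + 1


data = list(range(1000))
bulk_result, coarse = coarse_grained_log(data, increment)
shared_state, fine = fine_grained_log({}, [(i, i + 1) for i in data])
```

In this model the bulk operation needs a single log entry regardless of dataset size, while the fine-grained version needs one entry per update, which hints at why lineage suits batch workloads better than asynchronous updates to shared state.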
