MapReduce (CC, Distributed Processing) Flashcards

(5 cards)

1
Q

What is the Background (MapReduce)?

A
  • MapReduce is a programming model used for generating and
    processing large datasets.
  • It was developed at Google: the first version of the library was written in February 2003, and the design was published in 2004.
  • On a large cluster it can process terabytes of data in minutes, a significant improvement over previous approaches.
2
Q

What is the problem (MapReduce)?

A
  • Processing large volumes of data was a significant challenge before MapReduce.
  • Handling terabytes of data required complex programming, and execution could take hours or even days.
3
Q

What was the Solution to the Problem (MapReduce)?

A
  • MapReduce addresses this by allowing users to specify map and reduce functions.
  • Map handles filtering and sorting, emitting intermediate key/value pairs.
  • Reduce handles summary operations, combining all values that share a key.
  • The system then automatically parallelizes the computation across large-scale clusters of machines, handling machine failures and scheduling inter-machine communication.
  • The input file is divided into chunks, and map tasks run in parallel on these chunks. Reduce tasks then process the output from map tasks to create the final output. Any machine failure or server downtime is handled automatically.
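A minimal single-machine sketch of the flow described above, using word counting as the task. The names `map_fn`, `shuffle`, `reduce_fn`, and `run_job` are illustrative assumptions, not part of any MapReduce library, and a thread pool stands in for a cluster; failure handling and scheduling are not modelled.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_fn(chunk):
    """Map: emit an intermediate (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group intermediate values by key so each key reaches one reduce call."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_fn(key, values):
    """Reduce: summarize all values collected for one key."""
    return key, sum(values)


def run_job(chunks, workers=4):
    # Map tasks run in parallel, one per input chunk (threads stand in for machines).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_fn, chunks))
    # Reduce tasks then process the grouped map output to create the final output.
    grouped = shuffle(mapped)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


# Hypothetical input: a "file" already divided into three chunks.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
counts = run_job(chunks)
print(counts)
```

Here the user only writes `map_fn` and `reduce_fn`; everything in `run_job` plays the role of the system. With the sample chunks, "the" is counted 3 times and "fox" twice.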
4
Q

What are the applications (MapReduce)? (ELD)

A
  • MapReduce is effective for a wide range of tasks, from simple to complex, performed on large data volumes.
  • Initial uses included large-scale indexing for Google web search, extracting information from data, and various machine learning tasks.
  • It’s suitable for tasks that can be divided into map and reduce functions and need to be performed on terabytes of data.
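As an illustration of how an indexing task divides into map and reduce functions, here is a small sketch of building an inverted index (word to list of documents). The function names and the sample documents are assumptions made for this example; it runs sequentially and is not how Google's indexing system is implemented.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: emit a (word, doc_id) pair for each distinct word in a document."""
    return [(word, doc_id) for word in set(text.lower().split())]


def reduce_fn(word, doc_ids):
    """Reduce: collapse all document IDs for a word into one sorted list."""
    return word, sorted(set(doc_ids))


def build_index(documents):
    # Group the intermediate pairs by word, then reduce each group.
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for word, emitted_id in map_fn(doc_id, text):
            grouped[word].append(emitted_id)
    return dict(reduce_fn(word, ids) for word, ids in grouped.items())


# Hypothetical documents keyed by ID.
docs = {"d1": "map reduce model", "d2": "reduce tasks", "d3": "map tasks"}
index = build_index(docs)
print(index["map"])
```

Looking up `index["map"]` gives the documents that contain the word "map", here `d1` and `d3`.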
5
Q

What are the strengths and limitations (MapReduce)?

A

Advantages:
- (Abstraction/Transparency) It hides the details of parallelization, fault tolerance, locality optimization, and load balancing.
- (Resilience) Jobs keep running when individual machines fail or run slowly.
- (Fault tolerance) Failed tasks are detected and automatically re-executed on other machines.
- MapReduce conserves network bandwidth by processing data locally whenever possible.

Disadvantages:
- Due to its batch-only design (optimised for large, static datasets), it is not suited to real-time processing.
- MapReduce can’t handle large data on a single CPU because the data exceeds primary memory; falling back on secondary storage increases I/O time.
