MapReduce - Week 7 Flashcards
Batch Processing
jobs that can run without end user interaction, or can be scheduled to run as resources permit
Examples of batch processing for data sets that build over time
Web crawling
Transaction logs, for analysing trends
Equipment logs, for predicting faults
Huge data sets that may need to be processed on parallel architectures
Who originally developed map reduce
What two functions make up map reduce?
map and reduce
MapReduce - map function definition
map(key1, value1) -> [(key2, value2)]
Given a key and a value, generates a collection of key value pairs
MapReduce - reduce function definition
reduce(key2, [value2]) -> [(key3,value3)]
given a key key2 output by map, and a collection of all the values value2 associated with that key, return a new collection of key-value pairs
Word count with map reduce - what do the two functions do?
Map takes a document, and returns a set of word counts for that document.
e.g.
“the map operation given…” -> {“the”: 1, “map”:1, …}
Reduce takes outputs from map and collates them into one thing
{“the”: [1,1], “map”:[1,1} -> {“the”: 2, “map”: 2}
MapReduce provider, extensions and competitors
Hadoop, …
Extensions: Cloudera
Competitors: Apache Spark
AWS EC2
Purchase of virtual machines of different capabilities, with different operating systems and for different periods
IaaS
AWS S3
Purchase of storage that is accessed through a simple file system style interface
IaaS
EMR (Elastic Map Reduce)
The ability to run scalable applications written using the map reduce programming model over EC2 and S3 infrastructure
PaaS
How is S3 used for AWS MapReduce?
The input to the map/reduce problem
The Jar that contains the program
The output from the execution of the program
Logging information
Use Map reduce or RDB for single batch tasks?
MapReduce, perhaps the effort of loading the data into a relational database isn’t worth it
Use Map Reduce or RDB for data using online transactional processing and analytical tasks?
Map reduce won’t help with the OLTP tasks.
A relational database is more flexible and may be able to handle both, though often different systems are used for OLTP and analytics to avoid contention for resources.
Use Map Reduce or RDB for data that needs fine-grained access control
MapReduce itself doesn’t provide much in the way of security - the hosting environment does that.
Certain relational databases will provide fine-grained access control
MapReduce - what is a job?
The unit of work to be performed (the data and the program)
Can consist of several map and reduce tasks
MapReduce - what is a split?
A part of the input (e.g. a 64mb filesystem block)
MapReduce - what is a task?
Map or reduce functions created and run for each split/partition
MapReduce - what is a task tracker?
Tracks the progress of each of the map or reduce tasks on a node
Keeps the job tracker informed of progress
MapReduce - what is a job tracker?
Coordinates the different tasks comprising a job
MapReduce - How many map tasks are created?
One for each split (each part of the input, e.g. a 64mb filesystem block)
MapReduce - What are the steps of the map function?
- The map task runs on the split, creating key-value pairs
- The output of the map is partitioned into groups, for sending to reduce functions, typically by hashing.
- The partitions are sorted by key
- The outputs are written to the local file system
- The task manager notifies the job manager that the task is completed
MapReduce - What are the steps of the reduce function?
- Relevant map partitions are copied to the associated reduce nodes
- Data from different maps is merged to produce the inputs for individual reduce operations
- The reduce task is run
- Outputs are written to the distributed filesystem
What are the MapReduce performance issues to look out for?
Memory usage - The space required by the code within map and reduce
Skew - The likelihood that data is not distributed evenly across reduce nodes
Intermediate result size - The amount of data that is produced by map or reduce compared to the size of their inputs