MapReduce - Week 7 Flashcards by Ollie Ursell

Batch Processing

jobs that can run without end user interaction, or can be scheduled to run as resources permit

How well did you know this?

Not at all

Perfectly

Examples of batch processing for data sets that build over time

Web crawling
Transaction logs, for analysing trends
Equipment logs, for predicting faults

Huge data sets that may need to be processed on parallel architectures

How well did you know this?

Not at all

Perfectly

Who originally developed map reduce

Google

How well did you know this?

Not at all

Perfectly

What two functions make up map reduce?

map and reduce

How well did you know this?

Not at all

Perfectly

MapReduce - map function definition

map(key1, value1) -> [(key2, value2)]

Given a key and a value, generates a collection of key value pairs

How well did you know this?

Not at all

Perfectly

MapReduce - reduce function definition

reduce(key2, [value2]) -> [(key3,value3)]

given a key key2 output by map, and a collection of all the values value2 associated with that key, return a new collection of key-value pairs

How well did you know this?

Not at all

Perfectly

Word count with map reduce - what do the two functions do?

Map takes a document, and returns a set of word counts for that document.
e.g.
“the map operation given…” -> {“the”: 1, “map”:1, …}

Reduce takes outputs from map and collates them into one thing
{“the”: [1,1], “map”:[1,1} -> {“the”: 2, “map”: 2}

How well did you know this?

Not at all

Perfectly

MapReduce provider, extensions and competitors

Hadoop, …

Extensions: Cloudera

Competitors: Apache Spark

How well did you know this?

Not at all

Perfectly

AWS EC2

Purchase of virtual machines of different capabilities, with different operating systems and for different periods

IaaS

How well did you know this?

Not at all

Perfectly

AWS S3

Purchase of storage that is accessed through a simple file system style interface

IaaS

How well did you know this?

Not at all

Perfectly

EMR (Elastic Map Reduce)

The ability to run scalable applications written using the map reduce programming model over EC2 and S3 infrastructure

PaaS

How well did you know this?

Not at all

Perfectly

How is S3 used for AWS MapReduce?

The input to the map/reduce problem

The Jar that contains the program

The output from the execution of the program

Logging information

How well did you know this?

Not at all

Perfectly

Use Map reduce or RDB for single batch tasks?

MapReduce, perhaps the effort of loading the data into a relational database isn’t worth it

How well did you know this?

Not at all

Perfectly

Use Map Reduce or RDB for data using online transactional processing and analytical tasks?

Map reduce won’t help with the OLTP tasks.

A relational database is more flexible and may be able to handle both, though often different systems are used for OLTP and analytics to avoid contention for resources.

How well did you know this?

Not at all

Perfectly

Use Map Reduce or RDB for data that needs fine-grained access control

MapReduce itself doesn’t provide much in the way of security - the hosting environment does that.

Certain relational databases will provide fine-grained access control

How well did you know this?

Not at all

Perfectly

MapReduce - what is a job?

Study These Flashcards

The unit of work to be performed (the data and the program)

Can consist of several map and reduce tasks

MapReduce - what is a split?

Study These Flashcards

A part of the input (e.g. a 64mb filesystem block)

MapReduce - what is a task?

Study These Flashcards

Map or reduce functions created and run for each split/partition

MapReduce - what is a task tracker?

Study These Flashcards

Tracks the progress of each of the map or reduce tasks on a node

Keeps the job tracker informed of progress

MapReduce - what is a job tracker?

Study These Flashcards

Coordinates the different tasks comprising a job

MapReduce - How many map tasks are created?

Study These Flashcards

One for each split (each part of the input, e.g. a 64mb filesystem block)

MapReduce - What are the steps of the map function?

Study These Flashcards

The map task runs on the split, creating key-value pairs
The output of the map is partitioned into groups, for sending to reduce functions, typically by hashing.
The partitions are sorted by key
The outputs are written to the local file system
The task manager notifies the job manager that the task is completed

MapReduce - What are the steps of the reduce function?

Study These Flashcards

Relevant map partitions are copied to the associated reduce nodes
Data from different maps is merged to produce the inputs for individual reduce operations
The reduce task is run
Outputs are written to the distributed filesystem

What are the MapReduce performance issues to look out for?

Study These Flashcards

Memory usage - The space required by the code within map and reduce

Skew - The likelihood that data is not distributed evenly across reduce nodes

Intermediate result size - The amount of data that is produced by map or reduce compared to the size of their inputs

Write the Basket analysis average problem as MapReduce pseudocode

A list of the prices of the baskets a user has purchased, e.g. 001 - £26 002 - £30 001 - £40 002 - £35 A simple analysis involves getting the average a user has spent Map receives a key (identifying the location of an input split) and a value (a section of memory, a list of the users and values) It emits those user and value pairs The reduce function gets a customer id and a list of basket values for that customer from the map function, it emits the average of these values with the key of the customer id it received

MapReduce - summarisation patterns

Summarization aims to send as little information to reducers as possible, e.g. "hello hello hello" -> {"hello", 3} instead of "hello hello hello" -> {"hello" 1, "hello" 1, "hello" 1} To do this numerically the operation must be both: associative - (axb)xc = ax(bxc) commutative - axb = bxa

MapReduce - combiners

An optional reducer that is local to a map, and that summarises its results in some way Must have inputs and outputs that are semantically compatible with the output of a map In word count, the normal reducer can be used as a combiner, in this case it has the same effect as summarisation

MapReduce - inverted index pattern

supports the construction of an index from the contents of a document to the document Map returns emits a key for each token, and a value of the document identifier Reduce is application specific, might compute tf-idf or write to an index structure

MapReduce - Filtering pattern

Discard some of the information in the input Sampling - producing representative examples Top-k - choosing the best examples Distinct - removing duplicates For word count - may only be interested in counting occurrences of dictionary words

MapReduce - Join Pattern

Implements a relational join In Hadoop a job can be associated with two mappers that read from different inputs; then the infrastructure does most of the join Need to label tuples with their table table, so we can distinguish them in the reduce. For full explanation see "Join Pattern" slide in the "Map Reduce Programming - Week 7" notes

MapReduce - Week 7 Flashcards

(30 cards)