Module 7(a+b) - Hadoop MapReduce Flashcards
What is lambda calculus?
A formal system in mathematical logic for expressing computation based on function abstraction and application, using variable binding and substitution
What does treating functions "anonymously" mean?
Not binding the function to an identifier, or a name
MapReduce performs _______ computation on _______ volumes of data
parallel
large
In Hadoop MapReduce, are components allowed to share data arbitrarily? Why is this, in terms of scalability?
Components are not allowed to share data arbitrarily. The overhead required to keep data synchronized across components would hurt the system's scalability.
Are data elements in MapReduce immutable or mutable?
Data elements in MapReduce are immutable
How does communication occur in MapReduce (with the assistance of the Hadoop system)?
By generating new outputs, which are then forwarded by the Hadoop system to the next phase of execution
How many times does a SINGLE MapReduce program transform lists of input data to lists of output data? Explain.
Twice. MapReduce uses two different list-processing idioms: "map" and "reduce". They're inspired by functional programming paradigms.
If MapReduce were a black box, what would be the input and output of this box?
Input: lists of input data elements, as a file (which is loaded using HDFS)
Output: lists of output data elements, as a file (which is written using HDFS)
The first phase of a MapReduce program is "mapping". How does it work?
A list of data elements is provided (loaded from a file using HDFS), one at a time, to a function called the "mapper", which transforms each element individually into one output data element (or sometimes zero or more outputs). It does this by applying a function to each element in the list and storing the results in an output list.
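A minimal Python sketch of this element-wise idiom (illustrative only; to_upper_mapper is a made-up name, not part of any Hadoop API):

def to_upper_mapper(line):
    # Transform one input element into exactly one output element.
    return line.upper()

input_list = ["hello hadoop", "map reduce"]
output_list = [to_upper_mapper(line) for line in input_list]
print(output_list)  # ['HELLO HADOOP', 'MAP REDUCE']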

What is the primitive purpose of the "reducer" in MapReduce? How does the reducing process work?
The "reducer" lets the system aggregate values together.
- The reducer function receives an iterator of input values from an input list
- It combines these values together, returning a single output value
Example: compute the sum of the elements in a list
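A minimal Python sketch of this aggregation (sum_reducer is a hypothetical name; real Hadoop reducers implement a Reducer interface instead):

def sum_reducer(key, values):
    # Combine an iterator of input values into a single output value.
    total = 0
    for value in values:
        total += value
    return (key, total)

print(sum_reducer("dog", iter([1, 1, 1])))  # ('dog', 3)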

Describe the overall data flow of MapReduce from the mapper to the reducer
- Each mapper pre-loads its local input data
- Mapping produces intermediate (key, value) data from the inputs
- Values are exchanged and shuffled (in between the mapper and reducer)
- The reducing process generates the outputs from the data it receives
- Outputs from the reducers are stored
Overall, the workflow is map -> shuffle -> reduce
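A small Python illustration of the shuffle step that sits between map and reduce (simulated here with a dictionary; Hadoop does this across the network):

from collections import defaultdict

# Intermediate (key, value) pairs emitted by the mappers:
intermediate = [("a", 1), ("dog", 1), ("a", 1), ("cat", 1)]
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)  # shuffle: group every value under its key
print(dict(groups))  # {'a': [1, 1], 'dog': [1], 'cat': [1]}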

What is the interface used for inputs to be loaded from the file system in Hadoop?
Hadoop Distributed File System (HDFS)
Suppose we have 2 nodes, both running a MapReduce program which reads inputs from a file. These 2 programs run in parallel.
1. Describe the detailed workflow of the program in terms of method calls for each node (around 8 steps)
2. The point at which they would typically interact
- For both nodes, files are loaded as input using HDFS
- Inputs are split up using split(), which converts them into an array
- The resulting array is passed into a RecordReader(), which breaks the data up into (key, value) pairs
- The (key, value) pairs are passed into the map() function, which applies the lambda function to all the inputs (this is the mapper)
- The output of the map function is passed into the partitioner, which shuffles the outputs across the 2 nodes (this is where they interact)
- The output is passed into the sort() method to organize the data
- The sorted output is passed into the reducer to combine the cluster of outputs
- Outputs are written to the filesystem using HDFS
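A compressed Python simulation of these steps for two nodes (function names here are made up; in Hadoop the framework performs each step):

def record_reader(split):
    return list(enumerate(split))  # (key, value) = (position, word)

def map_fn(key, value):
    return (value, 1)  # the mapper's lambda: count word occurrences

def partition(key, num_nodes=2):
    return hash(key) % num_nodes  # shuffle step: where the nodes interact

split_a, split_b = ["a", "dog"], ["a", "cat"]  # inputs after split()
node_buffers = {0: [], 1: []}
for split in (split_a, split_b):  # each node maps its own split
    for key, value in record_reader(split):
        out_key, out_value = map_fn(key, value)
        node_buffers[partition(out_key)].append((out_key, out_value))
for node, pairs in node_buffers.items():  # each node sorts, then reduces
    for key in sorted({k for k, _ in pairs}):
        print(node, key, sum(v for k, v in pairs if k == key))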

What is an "InputSplit" in Hadoop MapReduce?
What does it correspond to with respect to the input file?
What does a record in a file correspond to?
- An InputSplit is a unit of work that is assigned to one map task. It is simply an element in the list of items that is passed in as input
- It usually corresponds to a chunk of an input file
- Each record in a file corresponds to exactly one InputSplit; the framework takes care of dealing with record boundaries
What is meant by "InputFormat" in MapReduce? What is it a factory for?
The InputFormat determines how the input files are parsed and defines the input splits (how the records are separated).
It is a factory for RecordReader objects.
Ex: TextInputFormat, SequenceFileInputFormat
What is the RecordReader in MapReduce?
The RecordReader loads data from an InputSplit and creates key-value pairs for the mapper (it breaks the data into key-value pairs).
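A rough Python analogue of a line-oriented RecordReader (mirroring TextInputFormat's convention of byte offset as key and line as value; this is a sketch, not the Hadoop interface):

def line_records(text):
    # Yield (key, value) pairs: key = byte offset of the line, value = the line.
    offset = 0
    for line in text.splitlines(True):  # keepends=True so offsets stay correct
        yield (offset, line.rstrip("\n"))
        offset += len(line)

for key, value in line_records("a dog\na cat\n"):
    print(key, value)  # 0 'a dog', then 6 'a cat'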
What is the Partitioner in MapReduce? Where is it placed in the architecture?
- The Partitioner determines which partition a given key-value pair should go to.
- The Partitioner sits in between the mapper and the reducer.
- The default partitioner simply hashes the key emitted by the mapper.
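A one-line Python sketch of that default behavior (Hadoop's HashPartitioner does roughly key.hashCode() % numReduceTasks):

def default_partition(key, num_reducers):
    # Hash the mapper's key to pick the reducer partition that receives it.
    return hash(key) % num_reducers

print(default_partition("dog", 2))  # 0 or 1: which reducer gets "dog"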
What is the OutputFormat in MapReduce? What is it a factory of?
The OutputFormat determines how the output files are formatted. It is a factory for RecordWriter objects.
Ex: TextOutputFormat, SequenceFileOutputFormat
<p>What is the "RecordWriter" in MapReduce?</p>
Writes records (such as key-value pairs) into output files
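A rough analogue of the default text output (TextOutputFormat's writer emits one key<TAB>value line per record; write_records is a made-up name):

def write_records(pairs, path):
    # Write each (key, value) record as a tab-separated line.
    with open(path, "w") as out:
        for key, value in pairs:
            out.write(f"{key}\t{value}\n")

write_records([("a", 2), ("cat", 1), ("dog", 1)], "part-00000")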
How does Hadoop primarily achieve fault tolerance?
- Restarting tasks
- Creating replicas
What is a TaskTracker and a JobTracker?
TaskTracker - an individual task node
JobTracker - the head node of the system
How does Hadoop know to restart tasks and maintain synchronization?
The TaskTrackers (individual task nodes) are in constant communication with the JobTracker (the head node of the system).
The JobTracker knows when a task needs to be restarted and can reassign it accordingly.
What happens if a TaskTracker fails to communicate with the JobTracker for a period of time (let's say 1 minute)?
If there is no communication for 1 minute, the JobTracker will assume that the TaskTracker has crashed.
Suppose that MapReduce is in the mapping phase and a TaskTracker fails. What happens to the map tasks of the failed task node?
The TaskTrackers which are still running will be asked to re-execute all the map tasks that were run by the failed TaskTracker (their intermediate outputs lived on that node's local disk, so they are lost with it).
Suppose that MapReduce is in the reducing phase and a TaskTracker fails. What happens to the reduce tasks of the failed reducer?
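Typically, the JobTracker reschedules the failed node's incomplete reduce tasks on other TaskTrackers. Reduce outputs that were already written to HDFS are replicated there and do not need to be recomputed.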
What is a typical performance problem with the Hadoop system when it comes to dividing tasks across many nodes?
It is possible for a few slow nodes to rate-limit (bottleneck) the rest of the program. These nodes are known as stragglers.
What is the purpose of the "mapper" in MapReduce?
The mapper processes the input data. It transforms each element individually to one output data element, or sometimes zero or more output elements.
Ex: convert each line of input text to uppercase
Sometimes Hadoop executes the same task numerous times in parallel when there are more compute resources than required. Why does this happen, and when?
This ensures that if one node fails or is slow, there are other copies that can finish the job. It happens when there are idle compute resources, typically toward the end of a job. It is known as speculative execution.
What does the pseudocode for the primitive mapper (without a hashmap) look like?
mapper(position, line):
    for each word in line:
        emit (word, 1)
What exactly does the term "emit" mean in the mapper or reducer?
Produce an output record associating the key with the value; emitted records are handed to the framework for the next phase.
What does the pseudocode for the primitive reducer (without a hashmap) look like?
reducer(word, values):
    sum := 0
    for each value in values:
        sum := sum + value
    emit (word, sum)
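For reference, a runnable Python translation of these two cards (emit is simulated by appending to a list; this is an illustration, not Hadoop code):

def mapper(position, line, emit):
    for word in line.split():
        emit((word, 1))  # one (word, 1) pair per occurrence

def reducer(word, values, emit):
    total = 0
    for value in values:
        total += value
    emit((word, total))

intermediate, results = [], []
mapper(0, "a dog a cat", intermediate.append)
for word in sorted({w for w, _ in intermediate}):
    reducer(word, [v for w, v in intermediate if w == word], results.append)
print(results)  # [('a', 2), ('cat', 1), ('dog', 1)]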
The mapper can aggregate the frequency for each word in a document using a hashmap (associative array). Write the pseudocode for the mapper which emits a word with its frequency in the document.
What is the tradeoff with using this approach?
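One standard answer (this is the "in-mapper combining" pattern), in the same pseudocode style as the earlier cards:

mapper(document_id, document):
    counts := new hashmap
    for each word in document:
        counts[word] := counts[word] + 1
    for each (word, count) in counts:
        emit (word, count)

Tradeoff: far fewer (word, 1) pairs are shuffled across the network, but the hashmap must fit in the mapper's memory.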

In MapReduce, the combiner is used to aggregate counters across words in the document (similar to a frequency map). Write the pseudocode for the combiner, assuming that it interfaces with the primitive mapper (both the Combiner and Mapper classes are included in the job).
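One standard answer: the combiner acts as a mini-reducer over each mapper's local output, so for word counting it looks just like the reducer:

combiner(word, partial_counts):
    sum := 0
    for each value in partial_counts:
        sum := sum + value
    emit (word, sum)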

What does "selection" do with regard to the mapper function in MapReduce?
Is a reducer required? How does it work?
What does the pseudocode look like?
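One standard answer, treating selection as the relational filter operation: the mapper emits only the records that satisfy a predicate, so a reducer is not required (the job can run with zero reduce tasks, or an identity reducer that passes pairs through).

mapper(key, record):
    if predicate(record):
        emit (key, record)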


The cross-correlation problem has 2 approaches. One of them is faster and one of them is slower. What are they?
Pairs Approach: Slower, and uses less memory
Stripes Approach: Faster and uses more memory
Describe the "Stripes" approach to solving the cross-correlation problem
The faster approach. Uses a frequency hashmap to store the count of an item in the input array - therefore it requiress more memory. Is more complicated than the Pairs approach
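A pseudocode sketch, assuming cross-correlation here means counting items that co-occur within each record:

mapper(record_id, items):
    for each item a in items:
        stripe := new hashmap
        for each item b in items, b != a:
            stripe[b] := stripe[b] + 1
        emit (a, stripe)

reducer(item, stripes):
    merged := new hashmap
    for each stripe in stripes:
        for each (b, count) in stripe:
            merged[b] := merged[b] + count
    emit (item, merged)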

Describe the Pairs approach to solving the cross-correlation problem
A simple approach that does not use any in-memory data structure. It is slow.
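A pseudocode sketch under the same co-occurrence assumption: emit one pair per co-occurring combination and let the reducer sum them:

mapper(record_id, items):
    for each item a in items:
        for each item b in items, b != a:
            emit ((a, b), 1)

reducer(pair, counts):
    sum := 0
    for each value in counts:
        sum := sum + value
    emit (pair, sum)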

What is the inputs and outputs of the mapper and reducer steps while executing a cross-correlation problem to the following 2 tuples (as files) using the Pairs Approach:
R1 = "a dog" R2 = "a cat"
What are the inputs and outputs of the mapper and reducer steps while executing a cross-correlation problem on the following 2 tuples (as files) using the Stripes approach:
R1 = "a dog"
R2 = "a cat"

What are the inputs and outputs of the mapper and reducer steps while executing a cross-correlation problem on the following 2 tuples (as files) using the Stripes approach:
R1 = "a big dog" R2 = "a small cat"
What are the inputs and outputs of the mapper and reducer steps while executing a cross-correlation problem on the following 2 tuples (as files) using the Pairs approach:
R1 = "a big dog"
R2 = "a small cat"
