Module 7(a+b) - Hadoop MapReduce Flashcards
What is lambda calculus?
A formal system in mathematical logic for expressing computation based on function abstraction and application, using variable binding and substitution
What does treating functions "anonymously" mean?
Not binding the function to an identifier, or a name
MapReduce performs _______ computation on _______ volumes of data
parallel
large
In Hadoop MapReduce, are components allowed to share data arbitrarily? Why is this, in terms of scalability?
Components are not allowed to share data arbitrarily. The overhead required to keep data synchronized across components would hurt the system's scalability.
Are data elements in MapReduce immutable or mutable?
Data elements in MapReduce are immutable
How does communication occur in MapReduce (with the assistance of the Hadoop system)?
By generating new outputs, which are then forwarded by the Hadoop system to the next phase of execution
How many times does a SINGLE MapReduce program transform lists of input data to lists of output data? Explain.
Twice. MapReduce uses two different list-processing idioms: "map" and "reduce". They're inspired by functional programming paradigms.
If MapReduce were a black box, what would be the input and output of this box?
Input: lists of input data elements, as a file (which is loaded using HDFS)
Output: lists of output data elements, as a file (which is written using HDFS)
The first phase of a MapReduce program is "mapping". How does it work?
A list of data elements is provided (loaded from a file using HDFS), one at a time, to a function called the "mapper", which transforms each element individually into one output data element (or sometimes zero or more outputs). It does this by applying a function to each element in the list and storing the results in an output list.
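A minimal Python sketch of this element-wise idiom (illustrative only; to_upper_mapper is a made-up name, not part of any Hadoop API):

def to_upper_mapper(line):
    # Transform one input element into exactly one output element.
    return line.upper()

input_list = ["hello hadoop", "map reduce"]
output_list = [to_upper_mapper(line) for line in input_list]
print(output_list)  # ['HELLO HADOOP', 'MAP REDUCE']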

What is the primitive purpose of the "reducer" in MapReduce? How does the reducing process work?
The "reducer" lets the system aggregate values together.
- The reducer function receives an iterator of input values from an input list
- It combines these values together, returning a single output value
Example: compute the sum of the elements in a list
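A minimal Python sketch of this aggregation (sum_reducer is a hypothetical name; real Hadoop reducers implement a Reducer interface instead):

def sum_reducer(key, values):
    # Combine an iterator of input values into a single output value.
    total = 0
    for value in values:
        total += value
    return (key, total)

print(sum_reducer("dog", iter([1, 1, 1])))  # ('dog', 3)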

Describe the overall data flow of MapReduce from the mapper to the reducer
- Each mapper pre-loads its local input data
- Mapping produces intermediate (key, value) data from the inputs
- Values are exchanged and shuffled (in between the mapper and reducer)
- The reducing process generates the outputs from the data it receives
- Outputs from the reducers are stored
Overall, the workflow is map -> shuffle -> reduce
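A small Python illustration of the shuffle step that sits between map and reduce (simulated here with a dictionary; Hadoop does this across the network):

from collections import defaultdict

# Intermediate (key, value) pairs emitted by the mappers:
intermediate = [("a", 1), ("dog", 1), ("a", 1), ("cat", 1)]
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)  # shuffle: group every value under its key
print(dict(groups))  # {'a': [1, 1], 'dog': [1], 'cat': [1]}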

What is the interface used for inputs to be loaded from the file system in Hadoop?
Hadoop Distributed File System (HDFS)
Suppose we have 2 nodes, both running a MapReduce program which reads inputs from a file. These 2 programs run in parallel.
1. Describe the detailed workflow of the program in terms of method calls for each node (around 8 steps)
2. The point at which they would typically interact
- For both nodes, files are loaded as input using HDFS
- Inputs are split up using split(), which converts them into an array
- The resulting array is passed into a RecordReader(), which breaks the data up into (key, value) pairs
- The (key, value) pairs are passed into the map() function, which applies the lambda function to all the inputs (this is the mapper)
- The output of the map function is passed into the partitioner, which shuffles the outputs across the 2 nodes (this is where they interact)
- The output is passed into the sort() method to organize the data
- The sorted output is passed into the reducer to combine the cluster of outputs
- Outputs are written to the filesystem using HDFS
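A compressed Python simulation of these steps for two nodes (function names here are made up; in Hadoop the framework performs each step):

def record_reader(split):
    return list(enumerate(split))  # (key, value) = (position, word)

def map_fn(key, value):
    return (value, 1)  # the mapper's lambda: count word occurrences

def partition(key, num_nodes=2):
    return hash(key) % num_nodes  # shuffle step: where the nodes interact

split_a, split_b = ["a", "dog"], ["a", "cat"]  # inputs after split()
node_buffers = {0: [], 1: []}
for split in (split_a, split_b):  # each node maps its own split
    for key, value in record_reader(split):
        out_key, out_value = map_fn(key, value)
        node_buffers[partition(out_key)].append((out_key, out_value))
for node, pairs in node_buffers.items():  # each node sorts, then reduces
    for key in sorted({k for k, _ in pairs}):
        print(node, key, sum(v for k, v in pairs if k == key))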

What is an "InputSplit" in Hadoop MapReduce?
What does it correspond to with respect to the input file?
What does a record in a file correspond to?
- An InputSplit is a unit of work that is assigned to one map task. It is simply an element in the list of items that is passed in as input
- It usually corresponds to a chunk of an input file
- Each record in a file corresponds to exactly one InputSplit; the framework takes care of dealing with record boundaries
What is meant by "InputFormat" in MapReduce? What is it a factory for?
The InputFormat determines how the input files are parsed and defines the input splits (how the records are separated).
It is a factory for RecordReader objects.
Ex: TextInputFormat, SequenceFileInputFormat
What is the RecordReader in MapReduce?
The RecordReader loads data from an InputSplit and creates key-value pairs for the mapper (it breaks the data into key-value pairs).
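A rough Python analogue of a line-oriented RecordReader (mirroring TextInputFormat's convention of byte offset as key and line as value; this is a sketch, not the Hadoop interface):

def line_records(text):
    # Yield (key, value) pairs: key = byte offset of the line, value = the line.
    offset = 0
    for line in text.splitlines(True):  # keepends=True so offsets stay correct
        yield (offset, line.rstrip("\n"))
        offset += len(line)

for key, value in line_records("a dog\na cat\n"):
    print(key, value)  # 0 'a dog', then 6 'a cat'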
What is the Partitioner in MapReduce? Where is it placed in the architecture?
- The Partitioner determines which partition a given key-value pair should go to.
- The Partitioner sits in between the mapper and the reducer.
- The default partitioner simply hashes the key emitted by the mapper.
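A one-line Python sketch of that default behavior (Hadoop's HashPartitioner does roughly key.hashCode() % numReduceTasks):

def default_partition(key, num_reducers):
    # Hash the mapper's key to pick the reducer partition that receives it.
    return hash(key) % num_reducers

print(default_partition("dog", 2))  # 0 or 1: which reducer gets "dog"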
What is the OutputFormat in MapReduce? What is it a factory of?
The OutputFormat determines how the output files are formatted. It is a factory for RecordWriter objects.
Ex: TextOutputFormat, SequenceFileOutputFormat
<p>What is the "RecordWriter" in MapReduce?</p>
Writes records (such as key-value pairs) into output files
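A rough analogue of the default text output (TextOutputFormat's writer emits one key<TAB>value line per record; write_records is a made-up name):

def write_records(pairs, path):
    # Write each (key, value) record as a tab-separated line.
    with open(path, "w") as out:
        for key, value in pairs:
            out.write(f"{key}\t{value}\n")

write_records([("a", 2), ("cat", 1), ("dog", 1)], "part-00000")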
How does Hadoop primarily achieve fault tolerance?
- Restarting tasks
- Creating replicas
What is a TaskTracker and a JobTracker?
TaskTracker - an individual task node
JobTracker - the head node of the system
How does Hadoop know to restart tasks and maintain synchronization?
The TaskTrackers (individual task nodes) are in constant communication with the JobTracker (the head node of the system).
The JobTracker knows when a task needs to be restarted and can reassign it accordingly.
What happens if a TaskTracker fails to communicate with the JobTracker for a period of time (let's say 1 minute)?
If there is no communication for 1 minute, the JobTracker will assume that the TaskTracker has crashed.
Suppose that MapReduce is in the mapping phase and a TaskTracker fails. What happens to the map tasks of the failed task node?
The TaskTrackers which are still running will be asked to re-execute all the map tasks that were run by the failed TaskTracker (their intermediate outputs lived on that node's local disk, so they are lost with it).
Suppose that MapReduce is in the reducing phase and a TaskTracker fails. What happens to the reduce tasks of the failed reducer?
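Typically, the JobTracker reschedules the failed node's incomplete reduce tasks on other TaskTrackers. Reduce outputs that were already written to HDFS are replicated there and do not need to be recomputed.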
What is a typical performance problem with the Hadoop system when it comes to dividing tasks across many nodes?
It is possible for a few slow nodes to rate-limit (bottleneck) the rest of the program. These nodes are known as stragglers.
What is the purpose of the "mapper" in MapReduce?
The mapper processes the input data. It transforms each element individually to one output data element, or sometimes zero or more output elements.
Ex: convert each line of input text to uppercase
Sometimes Hadoop executes the same task numerous times in parallel when there are more compute resources than required. Why does this happen, and when?
This ensures that if one node fails or is slow, there are other copies that can finish the job. It happens when there are idle compute resources, typically toward the end of a job. It is known as speculative execution.
What does the pseudocode for the primitive mapper (without a hashmap) look like?
mapper(position, line):
    for each word in line:
        emit (word, 1)
What exactly does the term "emit" mean in the mapper or reducer?
Produce an output record associating the key with the value; emitted records are handed to the framework for the next phase.
What does the pseudocode for the primitive reducer (without a hashmap) look like?
reducer(word, values):
    sum := 0
    for each value in values:
        sum := sum + value
    emit (word, sum)
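For reference, a runnable Python translation of these two cards (emit is simulated by appending to a list; this is an illustration, not Hadoop code):

def mapper(position, line, emit):
    for word in line.split():
        emit((word, 1))  # one (word, 1) pair per occurrence

def reducer(word, values, emit):
    total = 0
    for value in values:
        total += value
    emit((word, total))

intermediate, results = [], []
mapper(0, "a dog a cat", intermediate.append)
for word in sorted({w for w, _ in intermediate}):
    reducer(word, [v for w, v in intermediate if w == word], results.append)
print(results)  # [('a', 2), ('cat', 1), ('dog', 1)]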
The mapper can aggregate the frequency for each word in a document using a hashmap (associative array). Write the pseudocode for the mapper which emits a word with its frequency in the document.
What is the tradeoff with using this approach?
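One standard answer (this is the "in-mapper combining" pattern), in the same pseudocode style as the earlier cards:

mapper(document_id, document):
    counts := new hashmap
    for each word in document:
        counts[word] := counts[word] + 1
    for each (word, count) in counts:
        emit (word, count)

Tradeoff: far fewer (word, 1) pairs are shuffled across the network, but the hashmap must fit in the mapper's memory.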

In MapReduce, the combiner is used to aggregate counters across words in the document (similar to a frequency map). Write the pseudocode for the combiner, assuming that it interfaces with the primitive mapper (both the Combiner and Mapper classes are included in the job).
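One standard answer: the combiner acts as a mini-reducer over each mapper's local output, so for word counting it looks just like the reducer:

combiner(word, partial_counts):
    sum := 0
    for each value in partial_counts:
        sum := sum + value
    emit (word, sum)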

What does "selection" do with regard to the mapper function in MapReduce?
Is a reducer required? How does it work?
What does the pseudocode look like?
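One standard answer, treating selection as the relational filter operation: the mapper emits only the records that satisfy a predicate, so a reducer is not required (the job can run with zero reduce tasks, or an identity reducer that passes pairs through).

mapper(key, record):
    if predicate(record):
        emit (key, record)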


The cross-correlation problem has 2 approaches. One of them is faster and one of them is slower. What are they?
Pairs Approach: Slower, and uses less memory
Stripes Approach: Faster and uses more memory
Describe the "Stripes" approach to solving the cross-correlation problem
The faster approach. Uses a frequency hashmap to store the count of an item in the input array - therefore it requiress more memory. Is more complicated than the Pairs approach
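A pseudocode sketch, assuming cross-correlation here means counting items that co-occur within each record:

mapper(record_id, items):
    for each item a in items:
        stripe := new hashmap
        for each item b in items, b != a:
            stripe[b] := stripe[b] + 1
        emit (a, stripe)

reducer(item, stripes):
    merged := new hashmap
    for each stripe in stripes:
        for each (b, count) in stripe:
            merged[b] := merged[b] + count
    emit (item, merged)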

Describe the Pairs approach to solving the cross-correlation problem
A simple approach that does not use any in-memory data structure. It is slow.
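A pseudocode sketch under the same co-occurrence assumption: emit one pair per co-occurring combination and let the reducer sum them:

mapper(record_id, items):
    for each item a in items:
        for each item b in items, b != a:
            emit ((a, b), 1)

reducer(pair, counts):
    sum := 0
    for each value in counts:
        sum := sum + value
    emit (pair, sum)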

What is the inputs and outputs of the mapper and reducer steps while executing a cross-correlation problem to the following 2 tuples (as files) using the Pairs Approach:
R1 = "a dog" R2 = "a cat"
What are the inputs and outputs of the mapper and reducer steps while executing a cross-correlation problem on the following 2 tuples (as files) using the Stripes approach:
R1 = "a dog"
R2 = "a cat"

What are the inputs and outputs of the mapper and reducer steps while executing a cross-correlation problem on the following 2 tuples (as files) using the Stripes approach:
R1 = "a big dog" R2 = "a small cat"
What are the inputs and outputs of the mapper and reducer steps while executing a cross-correlation problem on the following 2 tuples (as files) using the Pairs approach:
R1 = "a big dog"
R2 = "a small cat"
