Week 6 - Data Management in MapReduce Systems Flashcards

1
Q

What is MapReduce? (3 items)

A

1) A distributed computing model
2) A Map function
3) A Reduce function

Both functions operate on key/value pairs.

2
Q

MapReduce - Map

A

Map takes a set of data and converts it into another set of data, in which individual elements are broken down into key/value pairs (tuples).

3
Q

MapReduce - Reduce

A

Reduce takes the output of a map as its input and combines those data tuples into a smaller set of tuples.

4
Q

What is Hadoop?

A

1) An open-source framework
2) Processes big data
3) Runs on clusters of computers
4) Uses simple programming models

5
Q

What are the two main phases of MapReduce?

A

1) Map Phase

2) Reduce Phase

6
Q

What does the master node / master process do?

A

1) Keeps track of the state of the cluster
2) Addresses the individual worker machines
3) Decides which machines run what for each phase

7
Q

Where does each worker store its results?

A

On its local disk

8
Q

What is the hidden phase between the Map phase and Reduce phase?

A

The data shuffle phase, also called the data transfer phase (it involves a lot of data transfer between machines)
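A rough illustration of what the shuffle does: the sketch below groups mapper output by key and routes each key to one reducer. The names `shuffle` and `num_reducers` and the CRC32-based partitioning are assumptions made for this example, not Hadoop's API.

```python
import zlib
from collections import defaultdict

def shuffle(map_outputs, num_reducers):
    """Group (key, value) pairs by key and route each key to one reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_outputs:
        # Deterministic hash, so a given key always lands on the same reducer.
        index = zlib.crc32(key.encode("utf-8")) % num_reducers
        partitions[index][key].append(value)
    # Each reducer receives its keys in sorted order.
    return [dict(sorted(p.items())) for p in partitions]

pairs = [("b", 1), ("a", 1), ("b", 1), ("c", 1)]
parts = shuffle(pairs, num_reducers=2)
print(parts)
```

Every value for a given key ends up in exactly one partition, which is why this phase moves so much data between machines.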

9
Q

Can the Reduce phase happen while the Map phase is still going on?

A

No, every mapper has to finish first, because a reducer needs all the values for its keys before it can run

10
Q

Hadoop is what?

A

An open-source implementation of the MapReduce paradigm

11
Q

What is the Hadoop file system called?

A

HDFS

Hadoop Distributed File System

12
Q

What does HDFS provide?

A

1) Single name space for entire cluster

2) Replicated data (3x) for fault tolerance

13
Q

Two things the MapReduce framework does

A

1) Executes user jobs as “map” and “reduce” functions

2) Manages work distribution and fault tolerance

14
Q

HDFS is made up of these two elements

A

1) NameNode

2) DataNode

15
Q

HDFS: what size are the blocks?

A

128 MB (by default)
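The block size makes it easy to work out how many blocks a file occupies. A minimal sketch, where `num_blocks` is a helper name invented for this example:

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size from this card

def num_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold a file; the last block may be partial."""
    # Integer ceiling division avoids floating-point rounding.
    return (file_size_bytes + block_size - 1) // block_size

one_gb = 1024 ** 3
print(num_blocks(one_gb))      # 8
print(num_blocks(one_gb + 1))  # 9
```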

16
Q

HDFS: are blocks replicated?

A

Yes, over several DataNodes
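To picture replication across DataNodes, here is a small sketch that places each block on three distinct nodes. The round-robin placement and the name `place_replicas` are assumptions made to keep the example short; HDFS's real placement policy is rack-aware and more involved.

```python
def place_replicas(block_ids, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes, round-robin."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    placement = {}
    for i, block_id in enumerate(block_ids):
        # Consecutive nodes (wrapping around) are distinct by construction.
        placement[block_id] = [
            datanodes[(i + r) % len(datanodes)] for r in range(replication)
        ]
    return placement

placement = place_replicas(["blk_0", "blk_1"], ["dn1", "dn2", "dn3", "dn4"])
print(placement)
# {'blk_0': ['dn1', 'dn2', 'dn3'], 'blk_1': ['dn2', 'dn3', 'dn4']}
```

Losing any single DataNode still leaves at least two copies of every block.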

17
Q

HDFS: optimized for large or small files?

A

Large files (large sequential reads)

18
Q

HDFS: are files read and write?

A

No, files are append-only

19
Q

HDFS DataNodes are what?

A

Each DataNode is a machine in the cluster that stores blocks

20
Q

HDFS NameNode

A

Stores the metadata about machines and block locations

21
Q

The centralized NameNode contains

A

1) Filename
2) Number of replicas
3) Block IDs
4) More
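One way to picture the NameNode's per-file record is as a small data structure. This is only a sketch: `FileMetadata`, `namespace`, and `register_file` are hypothetical names, and a real NameNode tracks more than these three fields.

```python
from dataclasses import dataclass, field

@dataclass
class FileMetadata:
    filename: str
    replicas: int
    block_ids: list = field(default_factory=list)

# Hypothetical in-memory namespace: one entry per file in the cluster.
namespace = {}

def register_file(filename, block_ids, replicas=3):
    """Record which blocks make up a file and how many copies to keep."""
    namespace[filename] = FileMetadata(filename, replicas, list(block_ids))
    return namespace[filename]

meta = register_file("/logs/day1.txt", ["blk_0", "blk_1"])
print(meta)
```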

22
Q

What do you need to write a program in MapReduce?

A

1) Data type (key-value records)
2) Map Function
3) Reduce Function

23
Q

MapReduce - Data type

A

Key-value records

24
Q

MapReduce - Map Function

A

(Key1, VALUE1) ->

list(Key2, VALUE2)
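As a concrete instance of this signature, a word-count mapper might look like the sketch below. The name `map_fn` and the choice of a line offset as Key1 are assumptions for illustration.

```python
def map_fn(key1, value1):
    """(Key1, VALUE1) -> list(Key2, VALUE2): emit (word, 1) for each word.

    key1 is the line's offset in the file (unused here); value1 is the line text.
    """
    return [(word.lower(), 1) for word in value1.split()]

print(map_fn(0, "the cat saw the dog"))
# [('the', 1), ('cat', 1), ('saw', 1), ('the', 1), ('dog', 1)]
```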

25
Q

MapReduce - Reduce Function

A

(Key2, list(VALUE2)) ->

list(Key3, VALUE3)

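As a concrete instance of this signature, a word-count reducer could be sketched as follows. The name `reduce_fn` is an assumption for illustration.

```python
def reduce_fn(key2, values2):
    """(Key2, list(VALUE2)) -> list(Key3, VALUE3): sum the counts for one word."""
    return [(key2, sum(values2))]

print(reduce_fn("the", [1, 1]))  # [('the', 2)]
```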
26
Q

MapReduce - Can Reduce run in parallel and independently?

A

Yes, but without sharing data between reduce calls. Map functions work the same way.

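A small sketch of why this works: each reduce call only touches its own key and values, so the calls can run side by side. Threads stand in for separate machines here, and `reduce_fn` and `grouped` are names made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def reduce_fn(key, values):
    # Uses only its own key and values: no state shared with other calls.
    return key, sum(values)

grouped = {"a": [1, 1, 1], "b": [1], "c": [1, 1]}

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(reduce_fn, k, v) for k, v in grouped.items()]
    results = dict(f.result() for f in futures)

print(results)  # {'a': 3, 'b': 1, 'c': 2}
```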
27
Q

Should mappers be placed on the same node or rack as their input block?

A

Yes, it minimizes network use

28
Q

Where do the mappers save their output?

A

Local disk (mainly for fault tolerance and recovery)

29
Q

Advantages of storing mapper output on local disks

A

1) Allows having more reducers than nodes
2) Allows recovery if a reducer crashes

30
Q

In Hadoop, what happens if a task crashes?

A

1) Retry on another node
2) OK for a map because it has no dependencies
3) OK for a reduce because map outputs are on disk

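The retry idea can be sketched in a few lines. This is a toy model under stated assumptions: `run_with_retry` and `flaky_map_task` are hypothetical names, and a crash is simulated with `RuntimeError`.

```python
def run_with_retry(task, nodes):
    """Try the task on each node in turn until one attempt succeeds."""
    for node in nodes:
        try:
            return node, task(node)
        except RuntimeError:
            continue  # this attempt crashed; retry on the next node
    raise RuntimeError("task failed on every node")

def flaky_map_task(node):
    # Hypothetical map task: crashes on node1, succeeds elsewhere.
    if node == "node1":
        raise RuntimeError("task crashed")
    return [("word", 1)]

print(run_with_retry(flaky_map_task, ["node1", "node2"]))
# ('node2', [('word', 1)])
```

Re-running is safe because the map task depends only on its input block, so a second attempt produces the same output.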
31
Q

What if a machine (node) crashes?

A

1) Re-launch its current tasks on other nodes (machines)
2) Re-launch any maps the node ran previously (necessary because their output files were lost when the node went down)

32
Q

Do you always need the reduce phase?

A

No. Sometimes the task is simple enough that a map-only job suffices, and it still takes advantage of parallelism (reading files in parallel).

33
Q

Can you use projection in Hadoop?

A

Yes, you can select only a few columns if that is what you want (the reduce phase might not be necessary)

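Projection fits naturally into a map-only job: each record is handled on its own, so no reduce is needed. A minimal sketch, where `project_map`, the column names, and the sample rows are all invented for illustration:

```python
def project_map(key, record, columns=("name", "city")):
    """Map-only projection: keep just the requested columns of a record."""
    return [(key, {c: record[c] for c in columns})]

rows = [
    {"name": "Ada", "age": 36, "city": "London"},
    {"name": "Alan", "age": 41, "city": "Manchester"},
]
projected = [out for i, row in enumerate(rows) for out in project_map(i, row)]
print(projected)
# [(0, {'name': 'Ada', 'city': 'London'}), (1, {'name': 'Alan', 'city': 'Manchester'})]
```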
34
Q

Two of the best operators in MapReduce

A

1) Sorting
2) Group by

35
Q

How do you calculate statistics in Hadoop?

A

1) Reduce phase
2) Group by
3) Count

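Put together, a per-group statistic looks roughly like this: map emits the grouping column as the key, the shuffle groups by it, and reduce counts and averages. The record fields (`dept`, `salary`) and function names are assumptions for the example, and the `defaultdict` stands in for the shuffle.

```python
from collections import defaultdict

def map_fn(record):
    # Emit the grouping column as the key and the measured value as the value.
    return [(record["dept"], record["salary"])]

def reduce_fn(key, values):
    # One reduce call per group: compute the count and mean for that group.
    return key, {"count": len(values), "mean": sum(values) / len(values)}

records = [
    {"dept": "eng", "salary": 100},
    {"dept": "eng", "salary": 120},
    {"dept": "ops", "salary": 90},
]

groups = defaultdict(list)  # stands in for the shuffle (group by key)
for record in records:
    for key, value in map_fn(record):
        groups[key].append(value)

stats = dict(reduce_fn(k, v) for k, v in groups.items())
print(stats)
# {'eng': {'count': 2, 'mean': 110.0}, 'ops': {'count': 1, 'mean': 90.0}}
```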
36
Q

What is an equi-join?

A

An equijoin returns only the rows that have equivalent values for the specified columns.
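One common way to express an equi-join in MapReduce is a reduce-side join: tag each record with its source table, group by the join column, and pair matching records in the reducer. The sketch below assumes two small in-memory tables; all names and data are made up for illustration.

```python
from collections import defaultdict
from itertools import product

users = [(1, "Ada"), (2, "Alan"), (3, "Grace")]                 # (user_id, name)
orders = [(1, "book"), (1, "pen"), (3, "laptop"), (4, "desk")]  # (user_id, item)

# Map: tag each record with its source table, keyed by the join column.
mapped = [(uid, ("U", name)) for uid, name in users]
mapped += [(uid, ("O", item)) for uid, item in orders]

# Shuffle: group the tagged records by join key.
groups = defaultdict(list)
for key, tagged in mapped:
    groups[key].append(tagged)

def reduce_join(key, tagged_values):
    """Pair every user record with every order record sharing the same key."""
    left = [v for tag, v in tagged_values if tag == "U"]
    right = [v for tag, v in tagged_values if tag == "O"]
    return [(key, name, item) for name, item in product(left, right)]

joined = [row for k in sorted(groups) for row in reduce_join(k, groups[k])]
print(joined)
# [(1, 'Ada', 'book'), (1, 'Ada', 'pen'), (3, 'Grace', 'laptop')]
```

Keys 2 and 4 produce no rows because they appear in only one of the two tables.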