Week 6 - Data Management in MapReduce Systems Flashcards
What is Map Reduce (3 items)
1) distributed computing
2) Map
3) Reduce
(operates on key/value pairs)
Map Reduce - Map
Map takes a set of data and converts it into another set of data
Map Reduce - Reduce
Reduce takes the output from a Map as input and combines those data tuples into a smaller set of tuples
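The two cards above can be sketched in plain Python (this is an illustration of the paradigm, not the Hadoop API): Map breaks each record into (key, value) tuples, and Reduce combines all values for one key into a smaller result.

```python
def map_fn(line):
    # Map: convert one input record into a list of (key, value) tuples.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Reduce: combine all values for one key into a smaller set of tuples.
    return (key, sum(values))

lines = ["the cat", "the dog"]

# Run Map over every input record and collect the intermediate pairs.
pairs = [pair for line in lines for pair in map_fn(line)]

# Group values by key (conceptually, this is the shuffle between phases).
groups = {}
for key, value in pairs:
    groups.setdefault(key, []).append(value)

results = dict(reduce_fn(k, v) for k, v in groups.items())
print(results)  # {'the': 2, 'cat': 1, 'dog': 1}
```

This is the classic word-count example: the same structure applies whenever a job can be phrased as per-record emission followed by per-key aggregation.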
What is Hadoop
1) Open-source framework
2) Processes big data
3) On clusters of computers
4) Using simple programming models
What are the two main phases of Map Reduce
1) Map Phase
2) Reduce Phase
What does the master node / master process do?
1) Keeps track of the state of the cluster
2) Addresses the local machines
3) Decides which machines run what for each phase
Where does each worker store its results?
On its local disk
What is the hidden phase between the Map phase and Reduce phase?
Data shuffle phase or data transfer phase (involves a lot of data transfer)
Can the Reduce phase happen while the Map phase is still going on?
No, all Mappers have to finish first
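The two cards above can be illustrated with a small sketch (plain Python, hypothetical mapper outputs): a key's values may come from any mapper, so the shuffle must gather and group the intermediate pairs from all Map tasks before any reducer can safely run.

```python
from itertools import groupby

# Hypothetical intermediate outputs from two separate Map tasks.
mapper1_out = [("a", 1), ("b", 1)]
mapper2_out = [("a", 1), ("c", 1)]

# Shuffle: collect pairs from ALL mappers, then sort/group by key.
# Note key "a" appears in both mappers' output -- grouping would be
# incomplete if Reduce started before every Map task finished.
all_pairs = sorted(mapper1_out + mapper2_out)
grouped = {k: [v for _, v in g]
           for k, g in groupby(all_pairs, key=lambda kv: kv[0])}
print(grouped)  # {'a': [1, 1], 'b': [1], 'c': [1]}
```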
Hadoop is what
An open-source implementation of the MapReduce paradigm
Hadoop file system is called
HDFS
Hadoop Distributed File System
What does HDFS provide
1) Single name space for entire cluster
2) Replicates data 3x for fault tolerance
Two items for MapReduce Framework
1) Executes user jobs as “map” and “reduce” functions
2) Manages work distribution & fault tolerance
HDFS is made up of these two elements
1) NameNode
2) DataNode
HDFS uses what size of blocks?
128MB
HDFS - are blocks replicated?
Yes, over several DataNodes
HDFS - optimized for large or small files?
Large files, sequential reads
HDFS - are files read and write?
No, append-only
HDFS DataNodes are what?
Each DataNode is a machine
HDFS NameNode
Stores the metadata about files and block locations
The centralized NameNode contains
1) Filename
2) number of Replicas
3) Block-ids
4) More
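The NameNode metadata listed above can be pictured as a small lookup table. This is a hypothetical illustration only (the file path and block ids are invented; HDFS stores this in its own internal structures, not a Python dict):

```python
# Hypothetical view of NameNode metadata for one file:
# filename -> replication factor, block size, and the block ids
# that make up the file (DataNodes hold the actual block bytes).
metadata = {
    "/logs/app.log": {
        "replication": 3,                  # default: 3 copies per block
        "block_size": 128 * 1024 * 1024,   # 128 MB blocks
        "block_ids": ["blk_1001", "blk_1002"],
    }
}

# A client asks the NameNode which blocks make up a file,
# then reads those blocks directly from DataNodes.
print(metadata["/logs/app.log"]["block_ids"])  # ['blk_1001', 'blk_1002']
```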
What do you need to write a program in MapReduce
1) Data type
2) Map Function
3) Reduce Function
MapReduce - Data type
Key-value Records
MapReduce - Map Function
(key1, value1) ->
list(key2, value2)
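The signature on this card can be written out with type hints (a sketch, with an assumed word-count mapper where key1 is a line offset and value1 is the line text):

```python
from typing import List, Tuple

# Map signature from the card: (key1, value1) -> list(key2, value2).
def map_fn(key1: int, value1: str) -> List[Tuple[str, int]]:
    # key1: byte offset of the line (ignored here); value1: line text.
    # Emit one (word, 1) pair per word, i.e. a list of (key2, value2).
    return [(word, 1) for word in value1.split()]

print(map_fn(0, "hello world hello"))
# [('hello', 1), ('world', 1), ('hello', 1)]
```

Note that the input and output key/value types may differ, which is why the signature uses key1/value1 and key2/value2.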