10. MapReduce & Hadoop Flashcards

1
Q

What is MapReduce

A

The MapReduce paradigm offers the means to break a large task into smaller tasks, run tasks in parallel, and consolidate the outputs of the individual tasks into the final output.

This makes it very scalable.

2
Q

What is Hadoop

A
Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.
Hadoop is very good at handling unstructured data.
Hadoop stores data in a distributed system. It is written in Java; jobs are written as Java classes (a class is roughly a "function" or "program").
It is cost-effective, scalable, more efficient, and offers higher throughput.

3
Q

What happens in the map phase

A

Applies an operation to a piece of data

Provides some intermediate output
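
As an illustrative sketch only (the classic word count example, written against Hadoop's Java MapReduce API; the class and variable names are made up here), a mapper that applies its operation to one line of input and emits an intermediate (word, 1) pair for every word:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Apply an operation to one piece of data (here, a single line of input)...
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                word.set(token);
                // ...and emit an intermediate key/value pair: (word, 1)
                context.write(word, ONE);
            }
        }
    }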

4
Q

What happens in the reduce phase

A

Consolidates the intermediate outputs from the map steps

Provides the final output

5
Q

What is a key value pair

A

Each step uses key/value pairs, denoted as <key, value>, as input and output. It is useful to think of the key/value pairs as a simple ordered pair. However, the pairs can take fairly complex forms. For example, the key could be a filename, and the value could be the entire contents of the file.

6
Q

What is the HDFS

A

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
Breaks data into 64 MB chunks (with some remainder chunks < 64 MB).
Generates 3 redundant copies of each chunk to guard against failure.
It tries to distribute the chunks across multiple computers/servers
Rack aware – knows what servers are physically next to each other
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
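
As a sketch only (dfs.blocksize and dfs.replication are standard HDFS client settings; the values here simply mirror the numbers on this card), the chunk size and replication factor can be set per client through the Java Configuration API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class HdfsDefaults {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();            // loads core-site.xml / hdfs-site.xml
            conf.setLong("dfs.blocksize", 64L * 1024 * 1024);    // 64 MB chunks for files this client writes
            conf.setInt("dfs.replication", 3);                   // 3 redundant copies of each chunk
            FileSystem fs = FileSystem.get(conf);                // connect to the cluster's default file system
            System.out.println("Connected to: " + fs.getUri());
            fs.close();
        }
    }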

7
Q

What are NameNodes and DataNodes

A

An HDFS cluster consists of a single NameNode - a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on.

8
Q

What happens to a file as it is processed through the NameNode and DataNodes

A

Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
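
A minimal sketch of a client read, assuming the standard HDFS Java API (the path is purely illustrative): the NameNode supplies the block-to-DataNode mapping behind the scenes, and the DataNodes serve the actual bytes:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadFromHdfs {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Opening the file consults the NameNode for the block locations;
            // the returned stream then reads each block from a DataNode.
            try (FSDataInputStream in = fs.open(new Path("/data/example.txt"));
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
            fs.close();
        }
    }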

9
Q

What does a NameNode do

A

Identifies, provides the locations of, and tracks where the various data chunks are stored.
If a data chunk is damaged or inaccessible, NameNode can replicate a redundant chunk on another server.
“The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.
Usually, the replication factor is 3.”
The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.
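
For example (a sketch using the standard FileSystem API; the path is illustrative), an application can set the replication factor of a single file, and the NameNode records it and schedules the extra copies:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplicationFactor {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Ask HDFS to maintain 3 replicas of this particular file.
            fs.setReplication(new Path("/data/example.txt"), (short) 3);
            fs.close();
        }
    }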

10
Q

What does a DataNode do

A

Manages the data chunks
Can identify corrupted or inaccessible data chunks and make reports to send to NameNode.
DataNodes manage storage attached to the nodes that they run on.
The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

11
Q

What does YARN stand for, and what does it comprise

A

Yet Another Resource Negotiator (YARN)
NameNode - knows the plan
ResourceManager - allocates resources
Scheduler (Applications Manager) - handles communications

NodeManager - coordinates the resources on the datanode
AppMaster - runs tasks and requests resources on the datanode; deals with starting and ending the reduce part of the job

This is a more distributed way of operating

12
Q

What does the application manager do

A

The application manager is responsible for maintaining a list of submitted applications. After an application is submitted by the client, the application manager first validates whether the resource requirement for its application master can be satisfied. If enough resources are available, it forwards the application to the scheduler; otherwise the application is rejected. It also makes sure that no other application is submitted with the same application ID.

13
Q

What does the application master do

A

The Application Master is responsible for the execution of a single application. It asks for containers from the Resource Scheduler (Resource Manager) and executes specific programs (e.g., the main of a Java class) on the obtained containers. The Application Master knows the application logic and thus it is framework-specific. The MapReduce framework provides its own implementation of an Application Master.

14
Q

What is pig

A

Pig: Provides a high-level data-flow programming language (Pig Latin) - very close to the data. A replacement for hand-written MapReduce Java code.

15
Q

What is Hive

A

Hive: Provides SQL-like access - further away from the data (HiveQL is Hive's SQL-like query language, similar to MySQL's dialect of SQL)

16
Q

What is Mahout

A

Mahout: Provides analytical tools - a machine-learning library

17
Q

What is HBase

A

HBase: Provides real-time reads and writes - an entire database system overlaid on HDFS
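
A minimal sketch of such a real-time write, assuming the standard HBase Java client and a pre-existing table "metrics" with column family "cf" (both names are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBasePutExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("metrics"))) {
                // Write one cell: row key "row1", column family "cf", qualifier "count", value "42".
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes("42"));
                table.put(put);
            }
        }
    }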

18
Q

What is Sqoop

A

Sqoop is a command-line interface application for transferring data between relational databases and Hadoop

19
Q

What is Apache Spark

A

Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley’s AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

20
Q

What is Apache Flume

A

Apache Flume is a distributed, reliable, and available software for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.

21
Q

What is NoSQL

A

NoSQL (Not only Structured Query Language) is a term used to describe those data stores that are applied to unstructured data.
As described earlier, HBase is such a tool that is ideal for storing key/values in column families.
In general, the power of NoSQL data stores is that as the size of the data grows, the implemented solution can scale by simply adding additional machines to the distributed system.

22
Q

What are the four NoSQL database types

A

Document Databases - JSON
Graph Databases - nodes
Key-Value Databases - pairs
Wide Column Stores - table style

23
Q

What is SFUNC

A

SFUNC = State transition function

24
Q

What is PREFUNC

A

PREFUNC = User-defined preliminary aggregate function

25
Q

For which type of tasks is MapReduce best suited

A

Embarrassingly parallel

26
Q

MapReduce is good for

A

Text analysis

Where data is streaming in

27
Q

What are the stages of the MapReduce process

A
Input
Input Splits (64MB)
Mapping (determining the key value pairs)
Shuffling
Reducer
Final Output
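
As a worked illustration only (a made-up one-line input, following the word count sketch from the earlier cards):

    Input:          "the cat sat on the mat"
    Input split:    the line above is handed to a single mapper
    Mapping:        (the,1) (cat,1) (sat,1) (on,1) (the,1) (mat,1)
    Shuffling:      (cat,[1]) (mat,[1]) (on,[1]) (sat,[1]) (the,[1,1])   <- grouped and sorted by key
    Reducer:        (cat,1) (mat,1) (on,1) (sat,1) (the,2)
    Final Output:   the reducer's key/value pairs written to HDFS
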
28
Q

What are the two key types of programs in Hadoop

A

Storage (HDFS)

Processing (MapReduce)

29
Q

What are some example Java daemons (basic set-up)

A

NameNode - master plan (called the namenode)
DataNode - slave workers (these are the datanodes)
JobTracker - communication (on the namenode)
TaskTracker - communication (on the datanodes)

30
Q

What are the three types of Java classes (programs)

A

Driver - configures and submits the job (sets the mapper, reducer, and the input/output paths)
Mapper - the logic applied to each input key/value pair to produce intermediate pairs
Reducer - the logic that consolidates the intermediate key/value pairs into the final output
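
A sketch of a driver for the hypothetical word count job from the earlier cards (the class names are the illustrative ones used there; the input and output paths come from the command line):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            // Wire up the three classes and the output key/value types.
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Input and output locations in HDFS, taken from the command line.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }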

31
Q

What does a combiner do in MapReduce

A

Steps in before the shuffle and sort and combines each mapper's key/value pairs locally (like a mini-reduce) to hopefully speed up the later stages
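
In the word count sketch from the earlier cards, the reducer's logic is associative and commutative, so it can double as the combiner; it would be registered with one extra line in the driver (WordCountReducer is the illustrative class from that sketch):

    // Run a local mini-reduce on each mapper's output before the shuffle and sort,
    // e.g. turning (the,1) (the,1) (the,1) into (the,3) on the map side.
    job.setCombinerClass(WordCountReducer.class);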

32
Q

What does a partitioner do in MapReduce

A

Splits the intermediate data into separate streams, determining which reducer each key/value pair is sent to (by default, by hashing the key), so the streams can be acted on in different ways
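
As a sketch only (the class name and routing rule are made up for illustration), a custom partitioner that splits the word count output into two streams by the first letter of the key; it would be registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // Route words beginning with a-m to one reducer and everything else to another,
            // wrapping around if the job runs with fewer reducers.
            char first = Character.toLowerCase(key.toString().charAt(0));
            int stream = (first >= 'a' && first <= 'm') ? 0 : 1;
            return numReduceTasks > 0 ? stream % numReduceTasks : 0;
        }
    }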

33
Q

What is ETL

A

Extract, Transform, Load
A batch process
Automated

34
Q

Hadoop - Workflow Management & Scheduling

A

Oozie, Ambari, Zookeeper, Azkaban

35
Q

Hadoop - Streaming / Migration

A

Flume, Sqoop, Storm

36
Q

Hadoop - Library

A

Mahout

37
Q

Hadoop - Resource Management

A

YARN

38
Q

Hadoop - Data Management & Storage

A

HDFS, HBase, Cassandra, Voldemort

39
Q

Hadoop - Data Flow / Data Access

A

Pig, Hive