10. MapReduce & Hadoop Flashcards

1
Q

What is MapReduce

A

The MapReduce paradigm offers the means to break a large task into smaller tasks, run tasks in parallel, and consolidate the outputs of the individual tasks into the final output.

This makes it very scalable.

2
Q

What is Hadoop

A
Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.
Hadoop is very good at handling unstructured data.
Hadoop stores data in a distributed system. It is written in Java; jobs are written as Java classes (a class is roughly a "function" or "program").
It is cost-effective, scalable, more efficient, and offers higher throughput.

3
Q

What happens in the map phase

A

Applies an operation to a piece of data

Provides some intermediate output
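
As an illustrative sketch only (the classic word count example, written against Hadoop's Java MapReduce API; the class and variable names are made up here), a mapper that applies its operation to one line of input and emits an intermediate (word, 1) pair for every word:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Apply an operation to one piece of data (here, a single line of input)...
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                word.set(token);
                // ...and emit an intermediate key/value pair: (word, 1)
                context.write(word, ONE);
            }
        }
    }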

4
Q

What happens in the reduce phase

A

Consolidates the intermediate outputs from the map steps

Provides the final output

5
Q

What is a key value pair

A

Each step uses key/value pairs, denoted as <key, value>, as input and output. It is useful to think of the key/value pairs as a simple ordered pair. However, the pairs can take fairly complex forms. For example, the key could be a filename, and the value could be the entire contents of the file.

6
Q

What is the HDFS

A

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
Breaks data into 64 MB chunks (with some remainder chunks < 64 MB).
Generates 3 redundant copies of each chunk to guard against failure.
It tries to distribute the chunks across multiple computers/servers
Rack aware – knows what servers are physically next to each other
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
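
As a sketch only (dfs.blocksize and dfs.replication are standard HDFS client settings; the values here simply mirror the numbers on this card), the chunk size and replication factor can be set per client through the Java Configuration API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class HdfsDefaults {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();            // loads core-site.xml / hdfs-site.xml
            conf.setLong("dfs.blocksize", 64L * 1024 * 1024);    // 64 MB chunks for files this client writes
            conf.setInt("dfs.replication", 3);                   // 3 redundant copies of each chunk
            FileSystem fs = FileSystem.get(conf);                // connect to the cluster's default file system
            System.out.println("Connected to: " + fs.getUri());
            fs.close();
        }
    }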

7
Q

What are NameNodes and DataNodes

A

An HDFS cluster consists of a single NameNode - a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on.

8
Q

What happens to a file as it is processed through the NameNode and DataNodes

A

Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
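
A minimal sketch of a client read, assuming the standard HDFS Java API (the path is purely illustrative): the NameNode supplies the block-to-DataNode mapping behind the scenes, and the DataNodes serve the actual bytes:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadFromHdfs {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Opening the file consults the NameNode for the block locations;
            // the returned stream then reads each block from a DataNode.
            try (FSDataInputStream in = fs.open(new Path("/data/example.txt"));
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
            fs.close();
        }
    }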

9
Q

What does a NameNode do

A

Identifies, provides the locations of, and tracks where the various data chunks are stored.
If a data chunk is damaged or inaccessible, NameNode can replicate a redundant chunk on another server.
“The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.
Usually, the replication factor is 3.”
The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.
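
For example (a sketch using the standard FileSystem API; the path is illustrative), an application can set the replication factor of a single file, and the NameNode records it and schedules the extra copies:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplicationFactor {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Ask HDFS to maintain 3 replicas of this particular file.
            fs.setReplication(new Path("/data/example.txt"), (short) 3);
            fs.close();
        }
    }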

10
Q

What does a DataNode do

A

Manages the data chunks
Can identify corrupted or inaccessible data chunks and make reports to send to NameNode.
DataNodes manage storage attached to the nodes that they run on.
The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

11
Q

What does YARN stand for, and what does it comprise

A

Yet Another Resource Negotiator (YARN)
NameNode - knows the plan
ResourceManager - allocates resources
Scheduler (Applications Manager) - handles communications

NodeManager - coordinates the resources on the datanode
AppMaster - runs tasks and requests resources on the datanode; deals with starting and ending the reduce part of the job

This is a more distributed way of operating

12
Q

What does the application manager do

A

The application manager is responsible for maintaining a list of submitted applications. After an application is submitted by the client, the application manager first validates whether the resource requirement for its application master can be satisfied. If enough resources are available, it forwards the application to the scheduler; otherwise the application is rejected. It also makes sure that no other application is submitted with the same application ID.

13
Q

What does the application master do

A

The Application Master is responsible for the execution of a single application. It asks for containers from the Resource Scheduler (Resource Manager) and executes specific programs (e.g., the main of a Java class) on the obtained containers. The Application Master knows the application logic and thus it is framework-specific. The MapReduce framework provides its own implementation of an Application Master.

14
Q

What is pig

A

Pig: Provides a high-level data-flow programming language (Pig Latin) - very close to the data. A replacement for hand-written MapReduce Java code.

15
Q

What is Hive

A

Hive: Provides SQL-like access - further away from the data (HiveQL is Hive's SQL-like query language, similar to MySQL's dialect of SQL)

16
Q

What is Mahout

A

Mahout: Provides analytical tools - a machine-learning library

17
Q

What is HBase

A

HBase: Provides real-time reads and writes - an entire database system overlaid on HDFS
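
A minimal sketch of such a real-time write, assuming the standard HBase Java client and a pre-existing table "metrics" with column family "cf" (both names are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBasePutExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("metrics"))) {
                // Write one cell: row key "row1", column family "cf", qualifier "count", value "42".
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes("42"));
                table.put(put);
            }
        }
    }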

18
Q

What is Sqoop

A

Sqoop is a command-line interface application for transferring data between relational databases and Hadoop

19
Q

What is Apache Spark

A

Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley’s AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

20
Q

What is Apache Flume

A

Apache Flume is a distributed, reliable, and available software for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.

21
Q

What is NoSQL

A

NoSQL (Not only Structured Query Language) is a term used to describe those data stores that are applied to unstructured data.
As described earlier, HBase is such a tool that is ideal for storing key/values in column families.
In general, the power of NoSQL data stores is that as the size of the data grows, the implemented solution can scale by simply adding additional machines to the distributed system.

22
Q

What are the four NoSQL database types

A

Document Databases - JSON
Graph Databases - nodes
Key-Value Databases - pairs
Wide Column Stores - table style

23
Q

What is SFUNC

A

SFUNC = State transition function

24
Q

What is PREFUNC

A

PREFUNC = User-defined preliminary aggregate function

25
Q

For which type of tasks is MapReduce best suited

A

Embarrassingly parallel

26
Q

MapReduce is good for

A

Text analysis

Where data is streaming in

27
Q

What are the stages of the MapReduce process

A
Input
Input Splits (64MB)
Mapping (determining the key value pairs)
Shuffling
Reducer
Final Output
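
As a worked illustration only (a made-up one-line input, following the word count sketch from the earlier cards):

    Input:          "the cat sat on the mat"
    Input split:    the line above is handed to a single mapper
    Mapping:        (the,1) (cat,1) (sat,1) (on,1) (the,1) (mat,1)
    Shuffling:      (cat,[1]) (mat,[1]) (on,[1]) (sat,[1]) (the,[1,1])   <- grouped and sorted by key
    Reducer:        (cat,1) (mat,1) (on,1) (sat,1) (the,2)
    Final Output:   the reducer's key/value pairs written to HDFS
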
28
Q

What are the two key types of programs in Hadoop

A

Storage (HDFS)

Processing (MapReduce)

29
Q

What are some example Java daemons (basic set-up)

A

NameNode - master plan (called the namenode)
DataNode - slave workers (these are the datanodes)
JobTracker - communication (on the namenode)
TaskTracker - communication (on the datanodes)

30
Q

What are the three types of Java classes (programs)

A

Driver - configures and submits the job (sets the mapper, reducer, and the input/output paths)
Mapper - the logic applied to each input key/value pair to produce intermediate pairs
Reducer - the logic that consolidates the intermediate key/value pairs into the final output
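
A sketch of a driver for the hypothetical word count job from the earlier cards (the class names are the illustrative ones used there; the input and output paths come from the command line):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            // Wire up the three classes and the output key/value types.
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Input and output locations in HDFS, taken from the command line.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }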

31
Q

What does a combiner do in MapReduce

A

Steps in before the shuffle and sort and combines each mapper's key/value pairs locally (like a mini-reduce) to hopefully speed up the later stages
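
In the word count sketch from the earlier cards, the reducer's logic is associative and commutative, so it can double as the combiner; it would be registered with one extra line in the driver (WordCountReducer is the illustrative class from that sketch):

    // Run a local mini-reduce on each mapper's output before the shuffle and sort,
    // e.g. turning (the,1) (the,1) (the,1) into (the,3) on the map side.
    job.setCombinerClass(WordCountReducer.class);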

32
Q

What does a partitioner do in MapReduce

A

Splits the intermediate data into separate streams, determining which reducer each key/value pair is sent to (by default, by hashing the key), so the streams can be acted on in different ways
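
As a sketch only (the class name and routing rule are made up for illustration), a custom partitioner that splits the word count output into two streams by the first letter of the key; it would be registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // Route words beginning with a-m to one reducer and everything else to another,
            // wrapping around if the job runs with fewer reducers.
            char first = Character.toLowerCase(key.toString().charAt(0));
            int stream = (first >= 'a' && first <= 'm') ? 0 : 1;
            return numReduceTasks > 0 ? stream % numReduceTasks : 0;
        }
    }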

33
Q

What is ETL

A

Extract, Transform, Load
A batch process
Automated

34
Q

Hadoop - Workflow Management & Scheduling

A

Oozie, Ambari, Zookeeper, Azkaban

35
Q

Hadoop - Streaming / Migration

A

Flume, Sqoop, Storm

36
Q

Hadoop - Library

A

Mahout

37
Q

Hadoop - Resource Management

A

YARN

38
Q

Hadoop - Data Management & Storage

A

HDFS, HBase, Cassandra, Voldemort

39
Q

Hadoop - Data Flow / Data Access

A

Pig, Hive