Unit 5 Flashcards
Cluster Computing
Big data clustering software combines the resources of many
smaller machines to provide a number of benefits.
Benefits of Clustered Computing
Resource pooling: processing large datasets requires large amounts of CPU, primary storage (memory), and secondary storage; a cluster pools all three.
High availability: the cluster tolerates individual machine failures, so data and services remain available.
Easy scalability: the system scales horizontally, reacting to changes in resource requirements
by adding machines rather than expanding the physical resources of a single machine.
Hadoop
Open-source framework based on technical papers published by Google
Main modules:
Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data.
YARN (Yet Another Resource Negotiator): a framework for job scheduling and cluster resource management.
MapReduce: a YARN-based framework for parallel processing of large data sets.
Common: the shared utilities that support the other Hadoop modules.
Ecosystem tools:
Spark: in-memory data processing
Pig, Hive: query-based processing of data services
HBase: NoSQL database
Mahout, Spark MLlib: machine learning algorithm libraries
Solr, Lucene: searching and indexing
ZooKeeper: cluster management
Oozie: job scheduling
Key characteristics of Hadoop
Economical
Reliable
Scalable
Flexible: can store any amount and any type of data
Ecosystem
A network of interconnected components, small or large, interacting as a system
Apache Hadoop
A collection of open-source software frameworks that use cluster computing to solve big data problems
HDFS
Stores any form of data (and its metadata) across nodes (name nodes and data nodes)
and maintains the coordination between the cluster and the underlying
hardware
Name node
The master node; controls and manages HDFS and how clients access files
Contains all of the metadata
Executes file system namespace operations such as opening, closing, and renaming files and directories
Maps which data node stores which block
Data Node
Slave nodes that are responsible for storing the actual data and providing read/write services to the file system client
Do the nodes run on commodity hardware?
Yes, which makes them cost-effective
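To make the name node / data node split concrete, here is a minimal sketch of a client round trip against HDFS using the third-party `hdfs` Python package (a WebHDFS client). The hostname, port, user, and paths are placeholders, and it assumes WebHDFS is enabled on the name node.

```python
# Minimal HDFS round trip via WebHDFS, using the third-party `hdfs` package.
# Hostname, port, user, and paths below are placeholders.
from hdfs import InsecureClient

# The client talks to the name node for namespace operations (create, open,
# list, rename); the block contents themselves stream to/from data nodes.
client = InsecureClient('http://namenode-host:9870', user='hadoop')

# Write a small file; the name node decides which data nodes get the blocks.
client.write('/user/hadoop/demo.txt', data=b'hello hdfs', overwrite=True)

# Read it back and list the directory (both metadata lookups go to the name node).
with client.read('/user/hadoop/demo.txt') as reader:
    print(reader.read())
print(client.list('/user/hadoop'))
```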
YARN
job scheduling/monitoring and resource management
YARN – Major Components
Resource manager: allocates resources across the applications in the cluster
Node manager: manages the allocation of resources such as
CPU, memory, and bandwidth on each machine and reports back
to the resource manager
Application manager: the interface between the resource manager and the node managers; it negotiates the resources a task requires, drawing on what the resource manager provisions and the node managers provide
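As a rough illustration of what the resource manager tracks, the sketch below polls the YARN ResourceManager REST API with the `requests` library. The host is a placeholder (8088 is the default web UI port), and the field names shown are the ones the API typically returns.

```python
# Inspecting YARN through the ResourceManager REST API (host is a placeholder).
import requests

RM = 'http://resourcemanager-host:8088'

# Cluster-wide resource metrics, aggregated from what each node manager
# registers and reports back via heartbeats.
metrics = requests.get(f'{RM}/ws/v1/cluster/metrics').json()['clusterMetrics']
print(metrics['totalMB'], metrics['totalVirtualCores'], metrics['activeNodes'])

# Applications currently known to the resource manager, with the resources
# that have been negotiated for each one.
apps = requests.get(f'{RM}/ws/v1/cluster/apps', params={'states': 'RUNNING'}).json()
for app in (apps.get('apps') or {}).get('app', []):
    print(app['id'], app['name'], app['allocatedMB'], app['allocatedVCores'])
```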
MapReduce
A framework for writing applications that process
big data in parallel on a distributed compute cluster
Functions of MapReduce
Map(): performs sorting and filtering of the data and generates key-value pairs, which are later processed by the reduce() method
Reduce(): performs summarization by aggregating the mapped data; it takes the output generated by map() as input and combines
those tuples into a smaller set of tuples (see the word-count sketch below)
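A minimal word-count sketch of the two phases, written as Hadoop Streaming scripts in Python (the names mapper.py and reducer.py are illustrative): the mapper emits (word, 1) pairs, and the reducer, which receives its input sorted by key, sums the counts for each word.

```python
# mapper.py -- Map(): emit a (word, 1) key-value pair for every word seen.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f'{word}\t1')
```

```python
# reducer.py -- Reduce(): input arrives sorted by key, so counts for the same
# word are adjacent and can be summed into one (word, total) tuple.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip('\n').split('\t')
    if word != current_word:
        if current_word is not None:
            print(f'{current_word}\t{count}')
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f'{current_word}\t{count}')
```

Locally the pair can be tested with something like `cat input.txt | python mapper.py | sort | python reducer.py`; on a cluster they would be wired together through the Hadoop Streaming jar.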
Pig
A platform for structuring data flows and for processing and analyzing large data sets
Developed by Yahoo; programs are written in the Pig Latin language
Big Data Life Cycle with Hadoop
Ingesting Data
Persisting Data in Storage
Computing and Analyzing
Visualizing
Data Ingestion
Process of adding raw data into the system
The aim is to keep the data as raw as possible for flexibility in later processing.
Tools such as Apache Sqoop, Flume, Chukwa, and Kafka help import and manage data, while frameworks such as Gobblin normalize and aggregate it.
During ingestion, ETL (Extract, Transform, Load) methods are often used for formatting, categorizing, filtering, and validating data, as in the sketch below.
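As a toy illustration of the ETL step (not any specific ingestion tool), the sketch below extracts records from a CSV file, validates and lightly transforms them, and loads them as JSON lines; the file names and field names are made up.

```python
# Toy ETL: extract raw records, validate/filter them, load into storage.
import csv
import json

def extract(path):
    # Extract: read raw rows from a CSV source.
    with open(path, newline='') as f:
        yield from csv.DictReader(f)

def transform(rows):
    # Transform: drop rows with a missing user id, normalise the amount field.
    for row in rows:
        if not row.get('user_id'):
            continue
        row['amount'] = float(row.get('amount') or 0)
        yield row

def load(rows, out_path):
    # Load: in a real pipeline this would target HDFS or Kafka;
    # here it writes a local JSON-lines file.
    with open(out_path, 'w') as f:
        for row in rows:
            f.write(json.dumps(row) + '\n')

load(transform(extract('events.csv')), 'events.jsonl')
```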
Data Storage:
Uses distributed file systems like Hadoop’s HDFS, Ceph, or GlusterFS for storage.
Distributed databases (NoSQL) provide structured access and fault tolerance.
Computing and Analyzing Data:
Data processing begins once it is available, using different methods depending on the insights needed:
Batch Processing: Breaks data into smaller tasks (e.g., splitting, mapping, reducing) using frameworks like Apache Hadoop MapReduce for large datasets requiring extensive computation.
Real-Time Processing: Handles continuous streams of data with minimal delay using frameworks like Apache Storm, Flink, or Spark. It relies on in-memory computing for speed.
Querying and Analysis: tools such as Hive, Pig, Drill, Impala, and Spark SQL provide SQL-like interaction with the data (a PySpark sketch follows this list).
Machine Learning: frameworks such as Apache SystemML, Mahout, and Spark MLlib provide distributed machine learning algorithms.
Programming for Analysis: Popular choices include R and Python for flexibility and ecosystem support.
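To illustrate the querying-and-analysis step, here is a small PySpark sketch that loads a dataset and answers an SQL-like question. It assumes pyspark is installed; the file path and column names are hypothetical (they match the toy ETL output above).

```python
# Querying ingested data with Spark SQL; path and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('analysis-sketch').getOrCreate()

# Read ingested data (e.g. the JSON-lines output of the ETL step above)
# and register it as a temporary SQL view.
events = spark.read.json('events.jsonl')
events.createOrReplaceTempView('events')

# Spark SQL aggregates in parallel across the cluster (or locally when testing).
top_users = spark.sql("""
    SELECT user_id, SUM(amount) AS total
    FROM events
    GROUP BY user_id
    ORDER BY total DESC
    LIMIT 10
""")
top_users.show()
spark.stop()
```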
Visualizing the Results
Visualization helps identify trends and changes over time, which are often more important than the raw values themselves, especially for real-time metrics. Key tools include:
Real-Time Visualization: Tools like Prometheus process time-series data for monitoring system health.
Elastic Stack (ELK): Combines Logstash (data collection), Elasticsearch (indexing), and Kibana (visualization) for interfacing with big data.
Silk Stack: Uses Apache Solr and Banana for a similar visualization workflow.
Interactive Notebooks: Tools like Jupyter Notebook and Apache Zeppelin support exploration, sharing, and collaboration in data science.