Unit 5 Flashcards

So close~ (22 cards)

1
Q

Cluster Computer

A

Big data clustering software combines the resources of many
smaller machines to provide a number of benefits.

2
Q

Benefits of Clustered Computing

A

Resource pooling: combines the CPU, primary storage (memory), and secondary storage of many machines; processing large datasets requires large amounts of all three.

High availability: fault tolerance and availability guarantees keep hardware or software failures from cutting off access to data and processing.

Easy scalability: scales horizontally by adding machines, so the system can react to changes in resource requirements
without expanding the physical resources on any one machine.

3
Q

Hadoop

A

Open-source framework based on technical papers published by Google (the Google File System and MapReduce papers)

4
Q

Main modules of Hadoop

A

Hadoop Distributed File System (HDFS): distributed file system that provides high-throughput access to application data.

YARN (Yet Another Resource Negotiator): framework for job scheduling and cluster resource management.

MapReduce: YARN-based system for parallel processing of large data sets.

5
Q

Common

A

Spark: in-memory data processing
Pig, Hive: query-based processing of data services
HBase: NoSQL database
Mahout, Spark MLlib: machine learning algorithm libraries
Solr, Lucene: searching and indexing
ZooKeeper: managing the cluster
Oozie: job scheduling

6
Q

Key characteristics of Hadoop

A

Economic

Reliable

Scalable

Flexible: store as much data as you want, without preprocessing it first, and decide how to use it later

7
Q

Ecosystem

A

A small or big complex network interacting as a system

8
Q

Apache Hadoop

A

Collection of open-source software frameworks that use cluster computing to solve big data problems

9
Q

HDFS

A

Stores any form of data (and its metadata) across two kinds of nodes: the name node and the data nodes

Maintains all the coordination between the clusters and
hardware

10
Q

Name node

A

Master node: controls and manages HDFS and how clients access files

Contains all metadata

Executes file system namespace operations such as opening, closing, and renaming files and directories

Maps which data node holds which block
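
The block-to-node mapping can be pictured with a toy model. This is only a sketch under loud assumptions: `assign_blocks` is a hypothetical helper, placement is simple round-robin (real HDFS placement is rack-aware), and the block size and replication factor are illustrative parameters, not values read from any cluster.

```python
# Toy model of a name node's block map; NOT Hadoop's real implementation.

def assign_blocks(file_size, block_size, data_nodes, replication=3):
    """Return {block_id: [data nodes holding a replica]} for one file."""
    num_blocks = -(-file_size // block_size)  # ceiling division
    replicas = min(replication, len(data_nodes))
    block_map = {}
    for block_id in range(num_blocks):
        # Assumed round-robin placement: replica r of block b goes to
        # node (b + r) mod n, so replicas land on distinct nodes.
        block_map[block_id] = [
            data_nodes[(block_id + r) % len(data_nodes)]
            for r in range(replicas)
        ]
    return block_map

# A 300-unit file with 128-unit blocks needs 3 blocks.
block_map = assign_blocks(
    file_size=300, block_size=128, data_nodes=["dn1", "dn2", "dn3", "dn4"]
)
print(block_map)
```

The name node keeps only this kind of metadata; the block contents themselves live on the data nodes.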

11
Q

Data Node

A

Slave (worker) nodes that are responsible for storing the actual data and serving read/write requests from file system clients

12
Q

Do the nodes run on commodity hardware?

A

Yes, which makes them cost-effective

13
Q

YARN

A

job scheduling/monitoring and resource management

14
Q

YARN – Major Components

A

Resource manager: allocates resources to the applications in the system

Node manager: allocates resources such as
CPU, memory, and bandwidth on each machine and then
reports back to the resource manager

Application manager: acts as the interface between the resource manager and the node manager, negotiating resources as the task requires

15
Q

Map reduce

A

Programming model for writing applications that process
big data in parallel on a distributed computing cluster

16
Q

Functions of MapReduce

A

Map(): sorts and filters the data and generates key-value pairs, which are later processed by the reduce() method

Reduce(): summarizes by aggregating the mapped data; takes the output generated by map() as input and combines
those tuples into a smaller set of tuples
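
The two functions can be illustrated with the classic word count. This is a minimal single-machine sketch in plain Python, not Hadoop's API: `map_phase`, `shuffle`, and `reduce_phase` are hypothetical names, and the `shuffle` step stands in for the grouping a framework would do between the map and reduce stages.

```python
from collections import defaultdict

def map_phase(line):
    """map(): emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """reduce(): combine all values for one key into a single tuple."""
    return key, sum(values)

lines = ["big data big cluster", "data node"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in sorted(shuffle(mapped).items()))
print(counts)
```

Six mapped pairs go in and four (word, count) tuples come out, which is the "smaller set of tuples" the card describes.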

17
Q

PIG

A

Structuring data flow, processing and analyzing big data sets

Developed by Yahoo; scripts are written in the Pig Latin language

18
Q

Big Data Life Cycle with Hadoop

A

Ingesting Data

Processing

Computing and Analyzing

Visualizing

19
Q

Data Ingestion

A

Process of adding raw data into the system

Aim to keep data raw for flexibility

Tools like Apache Sqoop, Flume, Chukwa, and Kafka help import and manage data, while frameworks like Gobblin normalize and aggregate it.

During ingestion, ETL (Extract, Transform, Load) methods are often used for formatting, categorizing, filtering, and validating data.
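
The ETL steps can be sketched in a few lines. This is an illustrative example only, using the Python standard library rather than any of the tools named above; the `RAW` sample data, the column names, and the `extract`/`transform`/`load` helpers are all assumptions made for the sketch.

```python
import csv
import io

# Hypothetical raw input with one non-numeric amount and one missing user_id.
RAW = "user_id,amount\n1,10.5\n2,not_a_number\n,3.0\n3,7\n"

def extract(text):
    """Extract: parse raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: validate, filter out bad rows, and normalize types."""
    clean = []
    for row in rows:
        if not row["user_id"]:
            continue  # filter: missing key field
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # validate: amount must be numeric
        clean.append({"user_id": int(row["user_id"]), "amount": amount})
    return clean

def load(rows, store):
    """Load: append cleaned rows to the target store (a plain list here)."""
    store.extend(rows)
    return store

store = load(transform(extract(RAW)), [])
print(store)
```

Only the two valid rows reach the store. A system that aims to keep data raw would typically also retain the original input alongside the cleaned copy.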

20
Q

Data Storage:

A

Uses distributed file systems like Hadoop’s HDFS, Ceph, or GlusterFS for storage.

Distributed databases (NoSQL) provide structured access and fault tolerance.

21
Q

Computing and Analyzing Data Summary:

A

Processing begins once the data is available, using different methods depending on the insights needed:

Batch Processing: Breaks data into smaller tasks (e.g., splitting, mapping, reducing) using frameworks like Apache Hadoop MapReduce for large datasets requiring extensive computation.
Real-Time Processing: Handles continuous streams of data with minimal delay using frameworks like Apache Storm, Flink, or Spark. It relies on in-memory computing for speed.

Querying and Analysis: Tools like Hive, Pig, Drill, Impala, and Spark SQL for SQL-like interactions.
Machine Learning: Frameworks like Apache SystemML, Mahout, and Spark MLlib.

Programming for Analysis: Popular choices include R and Python for flexibility and ecosystem support.

22
Q

Visualizing the Results

A

Identify trends and changes, often more important than raw values, especially in real-time metrics. Key tools include:

Real-Time Visualization: Tools like Prometheus process time-series data for monitoring system health.
Elastic Stack (ELK): Combines Logstash (data collection), Elasticsearch (indexing), and Kibana (visualization) for interfacing with big data.
Silk Stack: Uses Apache Solr and Banana for a similar visualization workflow.
Interactive Notebooks: Tools like Jupyter Notebook and Apache Zeppelin support exploration, sharing, and collaboration in data science.