Unit 5 Flashcards
Cluster Computing
Big data clustering software combines the resources of many
smaller machines to provide a number of benefits.
Benefits of Clustered Computing
Resource pooling: processing large datasets requires large amounts of CPU, primary storage (memory), and secondary storage; a cluster pools all three.
High availability: the cluster tolerates individual machine failures, so data and services remain available.
Easy scalability: the system scales horizontally, reacting to changes in resource requirements
by adding machines rather than expanding the physical resources of a single machine.
Hadoop
Open-source framework based on technical papers published by Google
Main modules:
Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data.
YARN (Yet Another Resource Negotiator): a framework for job scheduling and cluster resource management.
MapReduce: a YARN-based framework for parallel processing of large data sets.
Common: the shared utilities that support the other Hadoop modules.
Ecosystem tools:
Spark: in-memory data processing
Pig, Hive: query-based processing of data services
HBase: NoSQL database
Mahout, Spark MLlib: machine learning algorithm libraries
Solr, Lucene: searching and indexing
ZooKeeper: cluster management
Oozie: job scheduling
Key characteristics of Hadoop
Economical
Reliable
Scalable
Flexible: can store any amount and any type of data
Ecosystem
A network of interconnected components, small or large, interacting as a system
Apache Hadoop
A collection of open-source software frameworks that use cluster computing to solve big data problems
HDFS
Stores any form of data (and its metadata) across nodes (name nodes and data nodes)
and maintains the coordination between the cluster and the underlying
hardware
Name node
The master node; controls and manages HDFS and how clients access files
Contains all of the metadata
Executes file system namespace operations such as opening, closing, and renaming files and directories
Maps which data node stores which block
Data Node
Slave nodes that are responsible for storing the actual data and providing read/write services to the file system client
Do the nodes run on commodity hardware?
Yes, which makes them cost-effective
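To make the name node / data node split concrete, here is a minimal sketch of a client round trip against HDFS using the third-party `hdfs` Python package (a WebHDFS client). The hostname, port, user, and paths are placeholders, and it assumes WebHDFS is enabled on the name node.

```python
# Minimal HDFS round trip via WebHDFS, using the third-party `hdfs` package.
# Hostname, port, user, and paths below are placeholders.
from hdfs import InsecureClient

# The client talks to the name node for namespace operations (create, open,
# list, rename); the block contents themselves stream to/from data nodes.
client = InsecureClient('http://namenode-host:9870', user='hadoop')

# Write a small file; the name node decides which data nodes get the blocks.
client.write('/user/hadoop/demo.txt', data=b'hello hdfs', overwrite=True)

# Read it back and list the directory (both metadata lookups go to the name node).
with client.read('/user/hadoop/demo.txt') as reader:
    print(reader.read())
print(client.list('/user/hadoop'))
```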
YARN
job scheduling/monitoring and resource management
YARN – Major Components
Resource manager: allocates resources across the applications in the cluster
Node manager: manages the allocation of resources such as
CPU, memory, and bandwidth on each machine and reports back
to the resource manager
Application manager: the interface between the resource manager and the node managers; it negotiates the resources a task requires, drawing on what the resource manager provisions and the node managers provide
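As a rough illustration of what the resource manager tracks, the sketch below polls the YARN ResourceManager REST API with the `requests` library. The host is a placeholder (8088 is the default web UI port), and the field names shown are the ones the API typically returns.

```python
# Inspecting YARN through the ResourceManager REST API (host is a placeholder).
import requests

RM = 'http://resourcemanager-host:8088'

# Cluster-wide resource metrics, aggregated from what each node manager
# registers and reports back via heartbeats.
metrics = requests.get(f'{RM}/ws/v1/cluster/metrics').json()['clusterMetrics']
print(metrics['totalMB'], metrics['totalVirtualCores'], metrics['activeNodes'])

# Applications currently known to the resource manager, with the resources
# that have been negotiated for each one.
apps = requests.get(f'{RM}/ws/v1/cluster/apps', params={'states': 'RUNNING'}).json()
for app in (apps.get('apps') or {}).get('app', []):
    print(app['id'], app['name'], app['allocatedMB'], app['allocatedVCores'])
```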
MapReduce
A framework for writing applications that process
big data in parallel on a distributed compute cluster
Functions of MapReduce
Map(): performs sorting and filtering of the data and generates key-value pairs, which are later processed by the reduce() method
Reduce(): performs summarization by aggregating the mapped data; it takes the output generated by map() as input and combines
those tuples into a smaller set of tuples (see the word-count sketch below)
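A minimal word-count sketch of the two phases, written as Hadoop Streaming scripts in Python (the names mapper.py and reducer.py are illustrative): the mapper emits (word, 1) pairs, and the reducer, which receives its input sorted by key, sums the counts for each word.

```python
# mapper.py -- Map(): emit a (word, 1) key-value pair for every word seen.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f'{word}\t1')
```

```python
# reducer.py -- Reduce(): input arrives sorted by key, so counts for the same
# word are adjacent and can be summed into one (word, total) tuple.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip('\n').split('\t')
    if word != current_word:
        if current_word is not None:
            print(f'{current_word}\t{count}')
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f'{current_word}\t{count}')
```

Locally the pair can be tested with something like `cat input.txt | python mapper.py | sort | python reducer.py`; on a cluster they would be wired together through the Hadoop Streaming jar.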
Pig
A platform for structuring data flows and for processing and analyzing large data sets
Developed by Yahoo; programs are written in the Pig Latin language
Big Data Life Cycle with Hadoop
Ingesting Data
Persisting Data in Storage
Computing and Analyzing
Visualizing
Data Ingestion
Process of adding raw data into the system
The aim is to keep the data as raw as possible for flexibility in later processing.
Tools such as Apache Sqoop, Flume, Chukwa, and Kafka help import and manage data, while frameworks such as Gobblin normalize and aggregate it.
During ingestion, ETL (Extract, Transform, Load) methods are often used for formatting, categorizing, filtering, and validating data, as in the sketch below.
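As a toy illustration of the ETL step (not any specific ingestion tool), the sketch below extracts records from a CSV file, validates and lightly transforms them, and loads them as JSON lines; the file names and field names are made up.

```python
# Toy ETL: extract raw records, validate/filter them, load into storage.
import csv
import json

def extract(path):
    # Extract: read raw rows from a CSV source.
    with open(path, newline='') as f:
        yield from csv.DictReader(f)

def transform(rows):
    # Transform: drop rows with a missing user id, normalise the amount field.
    for row in rows:
        if not row.get('user_id'):
            continue
        row['amount'] = float(row.get('amount') or 0)
        yield row

def load(rows, out_path):
    # Load: in a real pipeline this would target HDFS or Kafka;
    # here it writes a local JSON-lines file.
    with open(out_path, 'w') as f:
        for row in rows:
            f.write(json.dumps(row) + '\n')

load(transform(extract('events.csv')), 'events.jsonl')
```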
Data Storage:
Uses distributed file systems like Hadoop’s HDFS, Ceph, or GlusterFS for storage.
Distributed databases (NoSQL) provide structured access and fault tolerance.
Computing and Analyzing Data:
Data processing begins once it is available, using different methods depending on the insights needed:
Batch Processing: Breaks data into smaller tasks (e.g., splitting, mapping, reducing) using frameworks like Apache Hadoop MapReduce for large datasets requiring extensive computation.
Real-Time Processing: Handles continuous streams of data with minimal delay using frameworks like Apache Storm, Flink, or Spark. It relies on in-memory computing for speed.
Querying and Analysis: tools such as Hive, Pig, Drill, Impala, and Spark SQL provide SQL-like interaction with the data (a PySpark sketch follows this list).
Machine Learning: frameworks such as Apache SystemML, Mahout, and Spark MLlib provide distributed machine learning algorithms.
Programming for Analysis: Popular choices include R and Python for flexibility and ecosystem support.
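To illustrate the querying-and-analysis step, here is a small PySpark sketch that loads a dataset and answers an SQL-like question. It assumes pyspark is installed; the file path and column names are hypothetical (they match the toy ETL output above).

```python
# Querying ingested data with Spark SQL; path and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('analysis-sketch').getOrCreate()

# Read ingested data (e.g. the JSON-lines output of the ETL step above)
# and register it as a temporary SQL view.
events = spark.read.json('events.jsonl')
events.createOrReplaceTempView('events')

# Spark SQL aggregates in parallel across the cluster (or locally when testing).
top_users = spark.sql("""
    SELECT user_id, SUM(amount) AS total
    FROM events
    GROUP BY user_id
    ORDER BY total DESC
    LIMIT 10
""")
top_users.show()
spark.stop()
```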
Visualizing the Results
Visualization helps identify trends and changes over time, which are often more important than the raw values themselves, especially for real-time metrics. Key tools include:
Real-Time Visualization: Tools like Prometheus process time-series data for monitoring system health.
Elastic Stack (ELK): Combines Logstash (data collection), Elasticsearch (indexing), and Kibana (visualization) for interfacing with big data.
Silk Stack: Uses Apache Solr and Banana for a similar visualization workflow.
Interactive Notebooks: Tools like Jupyter Notebook and Apache Zeppelin support exploration, sharing, and collaboration in data science.