L8 Flashcards
(29 cards)
def BD processing
a set of techniques or programming models for accessing large-scale data to extract useful information that supports decision-making
-> in Big Data, processing happens before storing
2 Types of data processing
- centralized data processing
- distributed data processing: distributed across different physical locations
Batch processing
- the computer processes a number of tasks that have been collected in a group, often simultaneously, in non-stop, sequential order
Pros: - good when response time is not important
- suitable for large data volume
- fast, inexpensive and accurate
- offline
- query-driven as static and more about historical fact finding
eg) monthly payroll system, credit card billing system
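The payroll example above can be sketched as a toy batch job: all records collected over the month are processed together in one offline, sequential run. The data and the flat pay rate are hypothetical.

```python
# Toy batch job: records collected over the month are processed
# together in one offline, sequential run (hypothetical data).
from collections import defaultdict

def run_payroll_batch(timesheets):
    """Aggregate a month's collected timesheet records in one pass."""
    hours = defaultdict(float)
    for record in timesheets:          # process the whole collected group
        hours[record["employee"]] += record["hours"]
    # pay = hours * rate (flat hypothetical rate of 20/hour)
    return {emp: h * 20 for emp, h in hours.items()}

month = [
    {"employee": "ana", "hours": 80},
    {"employee": "bo",  "hours": 100},
    {"employee": "ana", "hours": 60},
]
print(run_payroll_batch(month))   # {'ana': 2800.0, 'bo': 2000.0}
```

Note the batch characteristics: no response-time constraint, the whole volume is handled at once, and the job runs offline.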
Real-time processing
- streams of data
- processing is done as data is input: filtering, aggregating and preparing data
- often optimized for analytics and visualization and directly ingested in tools for it -> data-driven
eg) bank ATMs, control systems, social media
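The filter/aggregate steps above can be sketched with a Python generator standing in for a live event source; the event values and threshold are made up for illustration.

```python
# Minimal sketch of stream processing: each event is filtered and
# aggregated as it arrives, instead of waiting for a full batch.
def stream_processor(events, threshold=0):
    """Emit a running aggregate over a stream, one event at a time."""
    total = 0
    for event in events:            # processing is done as data is input
        if event < threshold:       # filtering step
            continue
        total += event              # aggregation step
        yield total                 # running result, ready for dashboards

clicks = iter([3, -1, 5, 2])        # stands in for a live event source
print(list(stream_processor(clicks)))  # [3, 8, 10]
```

Because results are emitted per event, they can be ingested directly by analytics or visualization tools, which is the data-driven aspect noted above.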
Parallel computing
- splitting up larger tasks into multiple subtasks and execute at the same time
- reduces execution time
- multiple processing within single machine
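A minimal sketch of the idea, assuming a single machine with multiple cores: one larger task (summing squares) is split into subtasks executed at the same time by a process pool.

```python
# Sketch of parallel computing: one machine, multiple processes,
# a larger task split into subtasks executed at the same time.
from multiprocessing import Pool

def subtask(chunk):
    """One subtask: sum of squares over its slice of the data."""
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1000))
    chunks = [data[i::4] for i in range(4)]   # split into 4 subtasks
    with Pool(processes=4) as pool:
        partials = pool.map(subtask, chunks)  # subtasks run in parallel
    print(sum(partials))                       # same result, less wall time
```

Distributed computing follows the same split-and-combine pattern, but the subtasks run on separate networked machines rather than processes on one machine.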
Distributed computing
- splits up larger tasks into subtasks and executes in separate machines networked together as a cluster
Hadoop Ecosystem
system comprised of several components that cover several aspects of data ingestion, processing, analysis, exploration and storage
Sqoop
- interface application for transferring structured data between relational databases and Hadoop -> data ingestion
- can import and export
Flume
- collects large amounts of semi-structured and unstructured streaming data from multiple sources
Kafka
- streaming platform that handles constant influx of data and processes it incrementally and sequentially
- used to build real-time streaming pipelines
- combines messaging, storage and stream processing -> storage and analysis of both historical and real-time data
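The combination of messaging and storage above rests on an append-only log that consumers read by offset. The class below is a hypothetical in-process toy, not the real Kafka API, meant only to illustrate why one log can serve both historical and real-time reads.

```python
# Toy in-process model of Kafka's core idea (NOT the real Kafka API):
# an append-only log stores every record in order; consumers track an
# offset, so the same data serves historical replay and live reads.
class ToyLog:
    def __init__(self):
        self.records = []            # durable, ordered message log

    def produce(self, record):
        self.records.append(record)  # constant influx, appended in order

    def consume(self, offset):
        """Read incrementally and sequentially from a given offset."""
        return self.records[offset:], len(self.records)

log = ToyLog()
for event in ["click", "view", "click"]:
    log.produce(event)

batch, offset = log.consume(0)       # historical replay from the start
print(batch)                          # ['click', 'view', 'click']
new, offset = log.consume(offset)    # real-time: only records not yet seen
print(new)                            # []
```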
Storm
- real-time big data processing system
- handles the influx of data and easily processes unbounded streams of data
Storm vs MapReduce?
Hadoop MapReduce
- software framework implementing a programming model that can process data in parallel on clusters of commodity hardware
- fault-tolerant and reliable
- divide and conquer -> only for batch workloads
- disk-based processing
2 phases: - Map: splitting and mapping
- Reduce: shuffling and reducing
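The two phases can be sketched with the classic word-count example in plain Python; the real framework distributes exactly these steps across a cluster.

```python
# Minimal word count sketching MapReduce's phases in plain Python
# (the real framework runs these steps on many cluster nodes).
from itertools import groupby

def map_phase(lines):
    """Map: split the input and emit (key, 1) pairs."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle: sort/group pairs by key; Reduce: sum each group."""
    shuffled = sorted(pairs)                       # shuffle/sort by key
    return {key: sum(v for _, v in group)
            for key, group in groupby(shuffled, key=lambda kv: kv[0])}

lines = ["big data big", "data lake"]
print(reduce_phase(map_phase(lines)))  # {'big': 2, 'data': 2, 'lake': 1}
```

Because each phase reads and writes its intermediate results to disk in the real framework, the model is fault-tolerant but slower than in-memory alternatives like Spark.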
Spark
- for large-scale data processing
- good for batch and stream data
- in-memory processing -> fast
- supports large-scale data science projects, SQL analytics and ML
- many languages: Python, SQL, Java, R, etc.
MapReduce vs Spark
- Spark is up to 100x faster due to in-memory processing
- MR is batch-only, while Spark handles both real-time and batch
- Spark is well suited for ML
- MR is low cost, Spark is higher cost
- Spark combines easily with many databases
- MR uses linear, disk-based processing -> slow
Pig and Hive
data-analysis platforms on top of MapReduce
YARN, Oozie, Zookeeper
- YARN (Yet Another Resource Negotiator): takes over resource management and job scheduling from MapReduce
- Apache Oozie: manages workflow in Hadoop environment at desired order
- Apache ZooKeeper: open-source server that enables reliable distributed coordination
Explain Hadoop Ecosystem
system of components on top of distributed storage, covering ingestion (Sqoop, Flume, Kafka), processing (MapReduce, Spark, Storm), analysis (Pig, Hive) and coordination/management (YARN, Oozie, ZooKeeper)
BD with AWS
AWS has an ecosystem of analytical solutions specifically for growing amounts of data
Application: Clickstream analysis through AWS
- clickstream data sent to Kinesis Stream
- data is stored and exposed for processing
- custom application programmed on Kinesis makes real-time recommendations
- output to user who sees personalized content suggestions
Application: Data Warehousing through AWS
- data is uploaded to S3
- EMR is used to transform and clean data
- is loaded back into S3
- loaded into Redshift where it is parallelized for fast analytics
- analysed and visualized with Quicksight
Smart Applications through AWS
- Amazon Kinesis receives data
- AWS Lambda is used to write code that coordinates the data flow
- Amazon Machine Learning model provides real-time predictions
- Amazon SNS is used to notify customer support agents
Amazon S3
Amazon Simple Storage Service
- object storage service offering industry-leading scalability, data availability, security and performance
Amazon EMR
highly distributed computing framework for processing and storing big data
- EMR's Apache Hadoop allows Hive, Pig and Spark to run on top