IDT lecture 4 Flashcards
(30 cards)
Drawbacks of older file-system-based databases (before the 1970s)
- redundancy
- inconsistencies
- data isolation
- integrity
- atomicity of updates
- concurrent access by multiple users
- security problems
The solution to these problems was the creation of the RDBMS, i.e., the RELATIONAL DBMS.
BIG DATA
Information assets that require NEW forms of processing.
The Vs of BIG DATA
Volume: amount of generated and stored data
Velocity: the speed/rate at which the data is generated, collected, processed
Variety: different types of data available (unstructured, semi-structured)
Veracity: quality of captured data. Truthful/reliable data
Value: inherent wealth embedded in the data.
Visualization: display the data
Volatility: everything changes, data changes
Vulnerability: new security concerns
BIG DATA analytics: compromises when analyzing big data collections.
You need to compromise because the data cannot be processed like in an RDBMS.
People look for patterns in the data, look for top answers, etc.
Interactive Processing
Algorithms that pause the process, wait for user input, and then continue.
System users are asked to help during the processing, and their answers are used as part of the algorithm.
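A minimal Python sketch of the pause-and-ask idea (not from the lecture): the loop decides the easy cases itself and stops to ask the user only when it is uncertain; the similarity() helper and the example profile pairs are hypothetical.

```python
def similarity(a, b):
    # Toy similarity: word overlap (Jaccard) between the two names.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

pairs = [("Albert Einstein", "A. Einstein"), ("Albert Einstein", "Isaac Newton")]

matches = []
for left, right in pairs:
    score = similarity(left, right)
    if score > 0.8:                       # confident: decide automatically
        matches.append((left, right))
    elif score > 0.2:                     # uncertain: stop and wait for the user
        answer = input(f"Are '{left}' and '{right}' the same person? (y/n) ")
        if answer.strip().lower() == "y":
            matches.append((left, right))

print(matches)
```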
Approximate processing
use a representative sample instead of the whole population
- gives an approximate output, not an exact answer
- Einstein photos example
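A minimal sketch of the sampling idea, assuming a numeric dataset; the population and sample size are made-up values.

```python
import random

population = list(range(1_000_000))        # stand-in for data too big to scan fully
sample = random.sample(population, 1_000)  # representative random sample

approx_avg = sum(sample) / len(sample)     # approximate output, not the exact answer
print(f"approximate average: {approx_avg:.1f}")
```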
Crowdsourcing processing
Difficult or opinion-based tasks are given to a group of people.
Humans are asked about the relation between profiles for a small compensation per reply. Example: Amazon Mechanical Turk.
Progressive processing
You have limited time/resources to give an answer.
Results are shown as soon as they are available (as opposed to SQL, where you have to wait for the query to finish).
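A minimal Python sketch of the idea (not from the lecture): a generator reports the current count every few records instead of waiting for the full scan; the data stream and the predicate being counted are hypothetical.

```python
def progressive_count(stream, report_every=1000):
    count = 0
    for i, record in enumerate(stream, start=1):
        if record > 0:                  # hypothetical predicate being counted
            count += 1
        if i % report_every == 0:
            yield count                 # report the partial result so far

for partial in progressive_count(range(-5000, 5000)):
    print("result so far:", partial)    # the answer refines as time passes
```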
Incremental processing
Data updates are frequent and make previous results obsolete.
Update the existing results instead of recomputing from scratch.
This method improves the answer as it gets more information.
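A minimal sketch, assuming the maintained result is a running average: keep a small summary (count and sum) and fold each new data point into it rather than reprocessing everything.

```python
count, total = 0, 0.0

def update(value):
    """Incorporate one new data point and return the refreshed average."""
    global count, total
    count += 1
    total += value
    return total / count

for v in [4.0, 6.0, 5.0]:               # updates arriving over time
    print("current average:", update(v))
```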
Scalability in Data Management for traditional dbs
Traditional dbs:
- SQL only (a constraint)
- efficiency limited by server capacity
Scaling can be done by:
- adding more hardware
- creating better algorithms
Solution for scalability of relational data (distributed DBs):
Distributed DBs (servers in different locations):
- add more DBMSs & partition the data (see the partitioning sketch below)
- efficiency limited by the servers and the network
- scaling: add more/better servers, a faster network
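A minimal sketch of the partitioning idea (not a real database driver): rows are assigned to one of several DBMS servers by hashing the key; the server names are made up.

```python
import hashlib

SERVERS = ["db-eu", "db-us", "db-asia"]           # hypothetical DBMS instances

def server_for(key: str) -> str:
    digest = hashlib.md5(key.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

for user in ["alice", "bob", "carol"]:
    print(user, "->", server_for(user))
```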
Massively parallel processing platforms
Move everything to the same place (as opposed to distributed DBs)
- connect computers over a LAN and make development, parallelization and robustness easy
- functionality: generic data-intensive computing
Scaling: buy more or better computers
Cloud
Massively parallel processing platforms running over rented hardware.
Innovation: Elasticity, standardization
Based on the elasticity of demand (fluctuations), cloud resources are adjusted.
Elasticity can be adjusted automatically.
Scaling: it’s magic!
BIG DATA models
Store, Manage and Process by harnessing large clusters of commodity nodes
- MapReduce family: simpler, more constrained
  ex: Hadoop
- 2nd gen: enables more complex processing and data, optimization opportunities
  ex: pySpark
Aspects of data-intensive systems
- data storage
- needle in the haystack
- scalability (most important)
Architectural choices to consider when working with big data
- storage layer
- programming model and execution engine
- scheduling
- optimizations
- fault tolerance
- load balancing
The Hadoop Ecosystem
Hadoop is a family of systems.
Most important (for this course):
Object storage: HDFS -> stores the data (bottom layer)
-> Table storage: HCatalog, HBase
-> Computation: MapReduce
-> Programming languages: Pig (dataflow); Hive (SQL)
HDFS: requirements of Hadoop's storage layer
Scalability: just add more data nodes
Efficiency: everything read from HD
Simplicity: no need to know where each block is stored
Fault tolerance: failures do not lead to loss of data
HDFS how it works:
Files partitioned into blocks.
The blocks are then distributed and replicated across nodes.
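A minimal sketch of this idea, using toy node names and the 64 MB default block size mentioned below; real HDFS placement is more involved.

```python
import itertools

BLOCK_SIZE = 64 * 1024 * 1024            # 64 MB default block size
REPLICATION = 3
DATA_NODES = ["node1", "node2", "node3", "node4"]

def place_blocks(file_size: int):
    """Return a mapping block_id -> list of data nodes holding a replica."""
    n_blocks = -(-file_size // BLOCK_SIZE)           # ceiling division
    nodes = itertools.cycle(DATA_NODES)
    return {b: [next(nodes) for _ in range(REPLICATION)] for b in range(n_blocks)}

print(place_blocks(200 * 1024 * 1024))   # a 200 MB file -> 4 blocks, 3 replicas each
```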
Types of nodes in HDFS (each with a single functionality)
Name nodes: keep the location of blocks
Secondary name nodes: backup nodes
Data nodes: keep the actual blocks
Default size (in MB) of blocks in Hadoop
64 MB
Failed data nodes
Name nodes and data nodes communicate using a "heartbeat" (like a ping) that tells whether a node is still available. Data nodes send heartbeats to the name node at regular intervals to show that everything is fine.
On failure, the name node removes the failed data node from its index.
Lost block replicas are re-replicated to the remaining data nodes.
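A minimal sketch of the heartbeat bookkeeping on the name node side; the timeout, node names and block index are toy values, not the real HDFS protocol.

```python
import time

HEARTBEAT_TIMEOUT = 10.0                 # seconds of silence before a node counts as failed

last_heartbeat = {"node1": time.time(), "node2": time.time(), "node3": time.time()}
block_index = {"block-0": ["node1", "node2"], "block-1": ["node2", "node3"]}

def on_heartbeat(node):
    last_heartbeat[node] = time.time()   # data node pings at regular intervals

def check_failures():
    now = time.time()
    failed = [n for n, seen in last_heartbeat.items() if now - seen > HEARTBEAT_TIMEOUT]
    for node in failed:
        del last_heartbeat[node]                          # drop the failed node from the index
        for block, holders in block_index.items():
            if node in holders:
                holders.remove(node)
                spare = [n for n in last_heartbeat if n not in holders]
                if spare:
                    holders.append(spare[0])              # re-replicate the lost copy
```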
Big Data Analytics (IBM)
Driven by AI, IoT, social media, mobile devices.
- data sources are becoming more complex than those for traditional data
we want:
- deliver deeper insights
- predict future outcomes
- better and faster decision making
- power innovative apps
Analytics: MapReduce
- a programming paradigm for writing code that supports the following:
- easy scale-out
- fault tolerance: roughly 1 in 1000 off-the-shelf computers will fail
It is built into Hadoop using HDFS.
Code your analytics logic within:
- MAP FUNCTION: local processing
- REDUCE FUNCTION: aggregation
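To make the two functions concrete, here is a minimal word-count sketch in plain Python (not actual Hadoop code); the shuffle step that a real framework performs automatically is written out by hand.

```python
from collections import defaultdict

def map_fn(line):
    for word in line.split():
        yield word.lower(), 1            # MAP: local processing, emit key/value pairs

def reduce_fn(word, counts):
    return word, sum(counts)             # REDUCE: aggregate all values for one key

lines = ["big data needs new tools", "big clusters process big data"]

groups = defaultdict(list)               # shuffle: group mapped values by key
for line in lines:
    for word, one in map_fn(line):
        groups[word].append(one)

print([reduce_fn(w, c) for w, c in groups.items()])
```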