IDT lecture 4 Flashcards
(30 cards)
Drawbacks of older file-system-based databases (before the 1970s)
- redundancy
- inconsistencies
- data isolation
- integrity
- atomicity of updates
- concurrent access by multiple users
- security problems
The solution to these problems was the creation of the RDBMS, i.e., the RELATIONAL DBMS.
BIG DATA
Information assets that require NEW forms of processing.
The Vs of BIG DATA
Volume: amount of generated and stored data
Velocity: the speed/rate at which the data is generated, collected, processed
Variety: different types of data available (unstructured, semi-structured)
Veracity: quality of captured data. Truthful/reliable data
Value: inherent wealth embedded in the data.
Visualization: display the data
Volatility: everything changes, data changes
Vulnerability: new security concerns
BIG DATA analytics: compromises when analyzing big data collections.
You need to compromise because the data cannot be processed like in an RDBMS.
People look for patterns in the data, look for top answers, etc.
Interactive Processing
Algorithms that pause the process, wait for user input, and then continue.
System users are asked to help during the processing, and their answers are used as part of the algorithm.
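A minimal Python sketch of the pause-and-ask idea (not from the lecture): the loop decides the easy cases itself and stops to ask the user only when it is uncertain; the similarity() helper and the example profile pairs are hypothetical.

```python
def similarity(a, b):
    # Toy similarity: word overlap (Jaccard) between the two names.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

pairs = [("Albert Einstein", "A. Einstein"), ("Albert Einstein", "Isaac Newton")]

matches = []
for left, right in pairs:
    score = similarity(left, right)
    if score > 0.8:                       # confident: decide automatically
        matches.append((left, right))
    elif score > 0.2:                     # uncertain: stop and wait for the user
        answer = input(f"Are '{left}' and '{right}' the same person? (y/n) ")
        if answer.strip().lower() == "y":
            matches.append((left, right))

print(matches)
```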
Approximate processing
use a representative sample instead of the whole population
- gives an approximate output, not an exact answer
- Einstein photos example
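A minimal sketch of the sampling idea, assuming a numeric dataset; the population and sample size are made-up values.

```python
import random

population = list(range(1_000_000))        # stand-in for data too big to scan fully
sample = random.sample(population, 1_000)  # representative random sample

approx_avg = sum(sample) / len(sample)     # approximate output, not the exact answer
print(f"approximate average: {approx_avg:.1f}")
```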
Crowdsourcing processing
Difficult or opinion-based tasks are given to a group of people.
Humans are asked about the relation between profiles for a small compensation per reply. Example: Amazon Mechanical Turk.
Progressive processing
You have limited time/resources to give an answer.
Results are shown as soon as they are available (as opposed to SQL, where you have to wait for the query to finish).
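A minimal Python sketch of the idea (not from the lecture): a generator reports the current count every few records instead of waiting for the full scan; the data stream and the predicate being counted are hypothetical.

```python
def progressive_count(stream, report_every=1000):
    count = 0
    for i, record in enumerate(stream, start=1):
        if record > 0:                  # hypothetical predicate being counted
            count += 1
        if i % report_every == 0:
            yield count                 # report the partial result so far

for partial in progressive_count(range(-5000, 5000)):
    print("result so far:", partial)    # the answer refines as time passes
```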
Incremental processing
Data updates are frequent and make previous results obsolete.
Update the existing results instead of recomputing from scratch.
This method improves the answer as it gets more information.
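A minimal sketch, assuming the maintained result is a running average: keep a small summary (count and sum) and fold each new data point into it rather than reprocessing everything.

```python
count, total = 0, 0.0

def update(value):
    """Incorporate one new data point and return the refreshed average."""
    global count, total
    count += 1
    total += value
    return total / count

for v in [4.0, 6.0, 5.0]:               # updates arriving over time
    print("current average:", update(v))
```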
Scalability in Data Management for traditional dbs
Traditional dbs:
- SQL only (a constraint)
- efficiency limited by server capacity
Scaling can be done by:
- adding more hardware
- creating better algorithms
Solution for scalability of relational data (distributed DBs):
Distributed DBs (servers in different locations):
- add more DBMSs & partition the data (see the partitioning sketch below)
- efficiency limited by the servers and the network
- scaling: add more/better servers, a faster network
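A minimal sketch of the partitioning idea (not a real database driver): rows are assigned to one of several DBMS servers by hashing the key; the server names are made up.

```python
import hashlib

SERVERS = ["db-eu", "db-us", "db-asia"]           # hypothetical DBMS instances

def server_for(key: str) -> str:
    digest = hashlib.md5(key.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

for user in ["alice", "bob", "carol"]:
    print(user, "->", server_for(user))
```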
Massively parallel processing platforms
Move everything to the same place (as opposed to distributed DBs)
- connect computers over a LAN and make development, parallelization and robustness easy
- functionality: generic data-intensive computing
Scaling: buy more or better computers
Cloud
Massively parallel processing platforms running over rented hardware.
Innovation: Elasticity, standardization
Based on the elasticity of demand (fluctuations), cloud resources are adjusted.
Elasticity can be adjusted automatically.
Scaling: it’s magic!
BIG DATA models
Store, Manage and Process by harnessing large clusters of commodity nodes
- MapReduce family: simpler, more constrained
  ex: Hadoop
- 2nd gen: enables more complex processing and data, optimization opportunities
  ex: pySpark
Aspects of data-intensive systems
- data storage
- needle in the haystack
- scalability (most important)
Architectural choices to consider when working with big data
- storage layer
- programming model and execution engine
- scheduling
- optimizations
- fault tolerance
- load balancing
The Hadoop Ecosystem
Hadoop is a family of systems.
Most important (for this course):
Object storage: HDFS -> stores the data (bottom layer)
-> Table storage: HCatalog, HBase
-> Computation: MapReduce
-> Programming languages: Pig (dataflow); Hive (SQL)
HDFS: requirements of Hadoop's storage layer
Scalability: just add more data nodes
Efficiency: everything read from HD
Simplicity: no need to know where each block is stored
Fault tolerance: failures do not lead to loss of data
HDFS how it works:
Files partitioned into blocks.
The blocks are then distributed and replicated across nodes.
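A minimal sketch of this idea, using toy node names and the 64 MB default block size mentioned below; real HDFS placement is more involved.

```python
import itertools

BLOCK_SIZE = 64 * 1024 * 1024            # 64 MB default block size
REPLICATION = 3
DATA_NODES = ["node1", "node2", "node3", "node4"]

def place_blocks(file_size: int):
    """Return a mapping block_id -> list of data nodes holding a replica."""
    n_blocks = -(-file_size // BLOCK_SIZE)           # ceiling division
    nodes = itertools.cycle(DATA_NODES)
    return {b: [next(nodes) for _ in range(REPLICATION)] for b in range(n_blocks)}

print(place_blocks(200 * 1024 * 1024))   # a 200 MB file -> 4 blocks, 3 replicas each
```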
Types of nodes in HDFS (each with a single functionality)
Name nodes: keep the location of blocks
Secondary name nodes: backup nodes
Data nodes: keep the actual blocks
Default size (in MB) of blocks in Hadoop
64 MB
Failed data nodes
Name nodes and data nodes communicate using a "heartbeat" (like a ping) that tells whether a node is still available. Data nodes send heartbeats to the name node at regular intervals to show that everything is fine.
On failure, the name node removes the failed data node from its index.
Lost block replicas are re-replicated to the remaining data nodes.
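A minimal sketch of the heartbeat bookkeeping on the name node side; the timeout, node names and block index are toy values, not the real HDFS protocol.

```python
import time

HEARTBEAT_TIMEOUT = 10.0                 # seconds of silence before a node counts as failed

last_heartbeat = {"node1": time.time(), "node2": time.time(), "node3": time.time()}
block_index = {"block-0": ["node1", "node2"], "block-1": ["node2", "node3"]}

def on_heartbeat(node):
    last_heartbeat[node] = time.time()   # data node pings at regular intervals

def check_failures():
    now = time.time()
    failed = [n for n, seen in last_heartbeat.items() if now - seen > HEARTBEAT_TIMEOUT]
    for node in failed:
        del last_heartbeat[node]                          # drop the failed node from the index
        for block, holders in block_index.items():
            if node in holders:
                holders.remove(node)
                spare = [n for n in last_heartbeat if n not in holders]
                if spare:
                    holders.append(spare[0])              # re-replicate the lost copy
```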
Big Data Analytics (IBM)
Driven by AI, IoT, social media, mobile devices.
- data sources are becoming more complex than those for traditional data
we want:
- deliver deeper insights
- predict future outcomes
- better and faster decision making
- power innovative apps
Analytics: MapReduce
- a programming paradigm for writing code that supports the following:
- easy scale-out
- fault tolerance: roughly 1 in 1000 off-the-shelf computers will fail
It is built into Hadoop using HDFS.
Code your analytics logic within:
- MAP FUNCTION: local processing
- REDUCE FUNCTION: aggregation
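To make the two functions concrete, here is a minimal word-count sketch in plain Python (not actual Hadoop code); the shuffle step that a real framework performs automatically is written out by hand.

```python
from collections import defaultdict

def map_fn(line):
    for word in line.split():
        yield word.lower(), 1            # MAP: local processing, emit key/value pairs

def reduce_fn(word, counts):
    return word, sum(counts)             # REDUCE: aggregate all values for one key

lines = ["big data needs new tools", "big clusters process big data"]

groups = defaultdict(list)               # shuffle: group mapped values by key
for line in lines:
    for word, one in map_fn(line):
        groups[word].append(one)

print([reduce_fn(w, c) for w, c in groups.items()])
```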