Lecture 4, 5 and 6 Flashcards
(36 cards)
What is data normalization?
Validates and improves a logical design so that it satisfies certain constraints
- Decomposes relations with anomalies to produce smaller, well-structured relations
What is the goal of data normalization?
- Goal is to avoid anomalies (illustrated in the sketch after this list):
1. Insertion anomaly
a. Adding new rows forces user to create duplicate data
2. Deletion anomaly
a. Deleting rows may cause a loss of data that is still needed elsewhere
3. Modification anomaly
a. Changing data forces changes to other rows because of duplication
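A minimal sketch (the EMPLOYEE_DEPT relation and its attribute names are invented for illustration) of how duplicated department facts cause all three anomalies:

    # Hypothetical denormalized relation: every employee row repeats its
    # department's name, so department facts are duplicated.
    employee_dept = [
        {"emp_id": 1, "emp_name": "Alice", "dept_id": "D1", "dept_name": "Sales"},
        {"emp_id": 2, "emp_name": "Bob", "dept_id": "D1", "dept_name": "Sales"},
        {"emp_id": 3, "emp_name": "Carol", "dept_id": "D2", "dept_name": "HR"},
    ]

    # Insertion anomaly: a new department cannot be stored without inventing an employee row.
    # Deletion anomaly: deleting Carol (the only D2 employee) also loses the fact that D2 is "HR".
    # Modification anomaly: renaming "Sales" forces a change to every D1 row.
    for row in employee_dept:
        if row["dept_id"] == "D1":
            row["dept_name"] = "Marketing"  # must touch all duplicates consistently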
What are well-structured relations?
- relations that contain minimal data redundancy and allow users to insert, delete, and
update rows without causing data inconsistencies
What is the first normal form?
No multivalued attributes
- Steps:
- Ensure that every attribute value is atomic
- But, in the relational world one only works with relations that are already in 1NF
- So, there is usually nothing to actually do (flattening a multivalued attribute is sketched below)
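A minimal sketch, assuming an invented STUDENT record with a multivalued phones attribute, of the "make every attribute value atomic" step:

    # Non-1NF record: the multivalued attribute "phones" is not atomic.
    student = {"student_id": 1, "name": "Alice", "phones": ["111-1111", "222-2222"]}

    # 1NF: one row per atomic phone value (student_id + phone becomes the key).
    student_phone_1nf = [
        {"student_id": student["student_id"], "name": student["name"], "phone": p}
        for p in student["phones"]
    ]
    print(student_phone_1nf)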
What is 2nd normal form?
- 1NF + remove partial functional dependencies
- Create a new relation for each primary key attribute found in the old relation
- Move the nonkey attributes that are only dependent on this primary key attribute from
the old relation to the new relation (see the sketch below)
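A minimal sketch, assuming an invented ORDER_LINE relation with composite key (order_id, product_id), of removing a partial dependency (product_name depends only on product_id):

    order_line = [
        {"order_id": 10, "product_id": "P1", "quantity": 2, "product_name": "Mouse"},
        {"order_id": 10, "product_id": "P2", "quantity": 1, "product_name": "Keyboard"},
        {"order_id": 11, "product_id": "P1", "quantity": 5, "product_name": "Mouse"},
    ]

    # 2NF: a new relation keyed by product_id holds the partially dependent attribute ...
    product = {r["product_id"]: {"product_name": r["product_name"]} for r in order_line}
    # ... and the old relation keeps only attributes that depend on the full key.
    order_line_2nf = [
        {"order_id": r["order_id"], "product_id": r["product_id"], "quantity": r["quantity"]}
        for r in order_line
    ]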
What is 3rd normal form?
2NF + remove transitive dependencies
- Steps:
- Create a new relation for each nonkey attribute that is a determinant in a
relation:
- Make that attribute the key of the new relation
- Move all dependent attributes to the new relation
- Keep the determinant attribute in the old relation to serve as a foreign key (see the sketch below)
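A minimal sketch of removing the transitive dependency in the (invented) employee/department example used earlier (emp_id → dept_id → dept_name):

    employee = [
        {"emp_id": 1, "name": "Alice", "dept_id": "D1", "dept_name": "Sales"},
        {"emp_id": 2, "name": "Bob", "dept_id": "D1", "dept_name": "Sales"},
        {"emp_id": 3, "name": "Carol", "dept_id": "D2", "dept_name": "HR"},
    ]

    # 3NF: dept_id (a nonkey determinant) becomes the key of a new DEPARTMENT relation,
    # the dependent attribute dept_name moves with it ...
    department = {r["dept_id"]: {"dept_name": r["dept_name"]} for r in employee}
    # ... and dept_id stays in EMPLOYEE as a foreign key.
    employee_3nf = [
        {"emp_id": r["emp_id"], "name": r["name"], "dept_id": r["dept_id"]}
        for r in employee
    ]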
What challenges arise from the application setting?
o Data characteristics
o System and resources
o Time restrictions
What are the challenges of data management?
• Veracity
o Traditional data is structured, with known semantics and quality
o Big data instead requires dealing with high levels of noise in the profiles
• Volume
o Very large number of profiles
• Variety
o Large volumes of semi-structured, unstructured or highly heterogeneous structured data
• Velocity
o Ever-increasing rate at which new data becomes available
Properties of traditional databases
Constrained functionality: SQL only
Efficiency limited by server capacity
- Memory
- CPU
- HDD
- Network
Scaling can be done by
- Adding more hardware
- Creating better algorithms
- But there are still limits
Properties of distributed databases
Innovation
- Add more DBMS and partition the data (see the partitioning sketch below)
Constrained functionality
- Answer SQL queries
Efficiency limited by #servers, network
API offers location transparency
- User/application always sees a single machine
- User/application does not need to care about where the data is located
Scaling: add more/better servers, faster network
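A minimal sketch of the partitioning idea (the server names and hash-based routing are invented for illustration; real systems may use range or hash partitioning managed by the DBMS):

    import hashlib

    # Rows are spread over several DBMS servers; the client library hides this.
    servers = ["db-node-0", "db-node-1", "db-node-2"]

    def route(key: str) -> str:
        # Hash partitioning: the same key always maps to the same server.
        bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(servers)
        return servers[bucket]

    # Location transparency: the application only supplies the key and never
    # needs to know which server actually stores the row.
    print(route("customer42"))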
Properties of massively parallel processing platforms
Innovation
- Connect computers (nodes) over LAN
- make development, parallelization and robustness easy
Functionality
- Generic data-intensive computing
Efficiency relies on network, #computers & algorithms
API offers location & parallelism transparency
- Developers do not need to know where data is stored or how the code will be parallelized
Scaling: add more and better computers
Properties of the cloud
Massively parallel processing platforms running on rented hardware
- Innovation
- Elasticity, standardization
- e.g., a university requires few resources during holidays, while Amazon
requires a lot of resources → elasticity
Elasticity can be automatically adjusted
API offers location and parallelism transparency
Scaling: It’s magic!
Five characteristics of big data
Volume
- quantity of generated and stored data
Velocity
- speed at which the data is processed and stored
Variety
- type and nature of the data
Variability
- inconsistency of the data set
Veracity
- quality of captured data
Architectural choices to consider:
- Storage layer
- Programming model & execution engine
- Scheduling
- Optimizations
- Fault tolerance
- Load balancing
Requirements of storage layer
- Scalability: handle the ever-increasing data sizes
- Efficiency: fast accesses to data
- Simplicity: hide complexity from the developers
- Fault-tolerance: failures do not lead to loss of data
• Developers are NOT reading from or writing to the files explicitly
• Distributed File System handles IO transparently
o Several DFS already available
o Hadoop Distributed File System
o Google File System
o Cosmos File system
What is HDFS?
• Files are partitioned into blocks
• Blocks are distributed and replicated across nodes (see the sketch below)
• Three types of nodes in HDFS, each with one functionality:
o Name nodes: Keep the locations of blocks
o Secondary name nodes: backup nodes
o Data nodes: keep the actual blocks
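A minimal sketch (block size and node names are invented for illustration) of how a file is split into blocks and how the name node only keeps the block-to-data-node mapping:

    # Split a file's bytes into fixed-size blocks and replicate each block on
    # several (invented) data nodes; the name node only records the mapping.
    BLOCK_SIZE = 4        # tiny block size just for illustration (real HDFS uses e.g. 128 MB)
    REPLICATION = 2
    data_nodes = ["dn1", "dn2", "dn3"]

    file_bytes = b"abcdefghij"
    blocks = [file_bytes[i:i + BLOCK_SIZE] for i in range(0, len(file_bytes), BLOCK_SIZE)]

    # Name node index: block id -> data nodes holding a replica of that block.
    name_node_index = {
        block_id: [data_nodes[(block_id + r) % len(data_nodes)] for r in range(REPLICATION)]
        for block_id in range(len(blocks))
    }
    print(name_node_index)  # {0: ['dn1', 'dn2'], 1: ['dn2', 'dn3'], 2: ['dn3', 'dn1']}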
What happens when a data node fails?
- The name node and data nodes communicate using heartbeats
- A heartbeat is the signal sent by a data node to the name node at a regular interval to indicate that it is still present and working
- On failure, the name node removes the failed data node from its index
- The lost partitions are re-replicated to the remaining data nodes (sketched below)
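A minimal sketch of this failure handling, continuing the toy name node index from the previous card:

    # Continuing the toy index above: suppose dn2 stops sending heartbeats.
    name_node_index = {0: ["dn1", "dn2"], 1: ["dn2", "dn3"], 2: ["dn3", "dn1"]}
    live_nodes = ["dn1", "dn3"]   # dn2 missed its heartbeats and is dropped
    REPLICATION = 2

    for block_id, holders in name_node_index.items():
        # Remove the failed node from the index ...
        holders[:] = [n for n in holders if n in live_nodes]
        # ... and re-replicate under-replicated blocks onto the remaining nodes.
        candidates = [n for n in live_nodes if n not in holders]
        while len(holders) < REPLICATION and candidates:
            holders.append(candidates.pop(0))

    print(name_node_index)  # {0: ['dn1', 'dn3'], 1: ['dn3', 'dn1'], 2: ['dn3', 'dn1']}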
Properties of HDFS
• Scalability: Handle the ever-increasing data sizes
o Just add more data nodes
• Efficiency: Fast accesses to data
o Everything read from hard disk (requires I/O)
• Simplicity: Hide complexity from the developers
o No need to know where each block is stored
• Fault-tolerance: Failures do not lead to loss of data
o Administrator can control replication
o If failures are not widespread, no data is lost
What is big data analytics?
• Driven by artificial intelligence, mobile devices, social media and the Internet of Things (IoT)
• Data sources are becoming more complex than those for traditional data
o e.g., Web applications allow user generated data
In order to:
• Deliver deeper insights
• Power innovative data applications
• Enable better and faster decision-making
• Predict future outcomes
• Enhance business intelligence
Types of analytics:
Traditional computation
- exact and all answers over the whole data collection
Approximate
- Use a representative sample instead of the entire input data collection (sketched below)
- Give approximate output and not exact answers
- Answers are given with guarantees (e.g., error bounds)
Progressive
- Efficiently process the data given the limited time and/or computational resources that
are currently available
Incremental
- The rate of data updates is often high, which quickly makes previous results obsolete
- Update existing processing information
- Allow leveraging new evidence from the updates to fix previous
inconsistencies or complete the information
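A minimal sketch of the approximate flavour (the data and sample size are invented): estimating an average from a random sample instead of scanning the whole collection:

    import random

    # Toy data collection; in a real setting this would be too large to scan cheaply.
    data = list(range(1_000_000))

    # Traditional / exact analytics: process the whole collection.
    exact_mean = sum(data) / len(data)

    # Approximate analytics: a representative random sample gives an estimate,
    # typically reported together with guarantees such as a confidence interval.
    sample = random.sample(data, 10_000)
    approx_mean = sum(sample) / len(sample)

    print(exact_mean, approx_mean)  # the estimate is close to, but not exactly, 499999.5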
What is MapReduce?
A programming paradigm (~language) for the creation of code that supports the
following:
- Easy scale-out
- Parallelism & location transparency
- Simple to code and learn
- Fault tolerance
- Among 1000s of off-the-shelf computers, one WILL fail
- Constrain the user to simple constructs!
What is the data model of MapReduce?
- Basic unit of information
- key-value pair
- Translate data to key-value pairs (see the sketch below)
- Thus it can work on various data types (structured, unstructured, etc.)
- Then, the pairs are passed through MapReduce
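A minimal sketch of the "translate data to key-value pairs" step, using a line of unstructured text as the value and its offset as the key (a common convention, assumed here):

    # Input split: raw lines of unstructured text.
    split = ["the quick brown fox", "the lazy dog"]

    # Translate to key-value pairs: here key = line offset, value = line content.
    pairs = list(enumerate(split))
    print(pairs)  # [(0, 'the quick brown fox'), (1, 'the lazy dog')]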
What is the programming model of MapReduce?
Model based on different functions
- Primary ones: Map function and Reduce function
- Map (key, value):
- Invoked for every split of the input data
- Value corresponds to the records (lines) in the split
- Reduce(key , list(values))
- Invoked for every unique key emitted by Map
- List(values) corresponds to all values emitted from ALL mappers for this key
- Combine (key, list(values))
- Locally merges the values for each key at each node to reduce the number of
cross-node messages
- No guarantee that it will actually be executed!
- Typically invoked after a fixed-memory buffer is full (see the word-count sketch below)
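A minimal word-count sketch of these functions in plain Python (no actual Hadoop API; the names and the in-process grouping are simplified for illustration):

    from collections import defaultdict

    def map_fn(key, value):
        # Invoked for every record (line) of a split; emits (word, 1) pairs.
        for word in value.split():
            yield word, 1

    def reduce_fn(key, values):
        # Invoked once per unique key with ALL values emitted for that key.
        yield key, sum(values)

    # The framework normally performs the shuffle/grouping (and may run a combiner
    # that applies reduce_fn locally per node before data crosses the network);
    # here the grouping is simulated in-process.
    splits = {0: "the quick brown fox", 1: "the lazy dog the fox"}
    grouped = defaultdict(list)
    for offset, line in splits.items():
        for word, count in map_fn(offset, line):
            grouped[word].append(count)

    result = dict(kv for word, counts in grouped.items() for kv in reduce_fn(word, counts))
    print(result)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}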
Downsides of MapReduce
MapReduce is not a panacea; it is simple but weak for some requirements:
• Cannot define complex processes
• Batch mode, acyclic, not iterative
• Everything file-based, no distributed memory
• Difficult to optimize
Not good for:
• Iterative processes, e.g., clustering
• Real-time answers, e.g., streams
• Graph queries, e.g., shortest path