Lecture 4, 5 and 6 Flashcards

(36 cards)

1
Q

What is data normalization?

A

- Validates and improves a logical design so that it satisfies certain constraints
- Decomposes relations with anomalies to produce smaller, well-structured relations

2
Q

What is the goal of data normalization?

A
  • Goal is to avoid anomalies
    1. Insertion anomaly
    a. Adding new rows forces the user to create duplicate data
    2. Deletion anomaly
    a. Deleting rows may cause a loss of data that is still needed elsewhere
    3. Modification anomaly
    a. Changing data forces changes to other rows because of duplication
3
Q

What are well-structured relations?

A
  • relations that contain minimal data redundancy and allow users to insert, delete, and
    update rows without causing data inconsistencies
4
Q

What is the first normal form (1NF)?

A

No multivalued attributes

  • Steps:
  • Ensure that every attribute value is atomic
  • But in the relational world one only works with relations that are already in 1NF
  • So, in practice there is nothing to actually do
5
Q

What is 2nd normal form?

A
  • 1NF + remove partial functional dependencies
  • Create a new relation for each primary key attribute found in the old relation
  • Move the nonkey attributes that depend only on this primary key attribute from the old relation to the new relation
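A minimal sketch in plain Python of removing a partial dependency (the relation, attribute names and data are invented for illustration):

```python
# Hypothetical OrderLine relation with composite key (order_id, product_id).
# product_name depends only on product_id -> partial dependency, violates 2NF.
order_lines = [
    {"order_id": 1, "product_id": "P1", "product_name": "Desk", "quantity": 2},
    {"order_id": 1, "product_id": "P2", "product_name": "Chair", "quantity": 4},
    {"order_id": 2, "product_id": "P1", "product_name": "Desk", "quantity": 1},
]

# New relation keyed by product_id holds the attribute that depended only on it.
product = {row["product_id"]: {"product_name": row["product_name"]}
           for row in order_lines}

# Old relation keeps only the attributes that depend on the full key.
order_line_2nf = [
    {"order_id": r["order_id"], "product_id": r["product_id"], "quantity": r["quantity"]}
    for r in order_lines
]

print(product)
print(order_line_2nf)
```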
6
Q

What is 3rd normal form?

A

2NF + remove transitive dependencies
- Steps:
  - Create a new relation for each nonkey attribute that is a determinant in a relation:
    - Make that attribute the key
    - Move all dependent attributes to the new relation
  - Keep the determinant attribute in the old relation to serve as a foreign key
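A minimal sketch of removing a transitive dependency, again with invented names and data:

```python
# Hypothetical Employee relation (key: emp_id).
# dept_name depends on dept_id, which depends on emp_id -> transitive dependency.
employee = [
    {"emp_id": 1, "dept_id": "D1", "dept_name": "Sales"},
    {"emp_id": 2, "dept_id": "D1", "dept_name": "Sales"},
    {"emp_id": 3, "dept_id": "D2", "dept_name": "IT"},
]

# New relation: the determinant dept_id becomes the key; dependent attributes move with it.
department = {e["dept_id"]: {"dept_name": e["dept_name"]} for e in employee}

# Old relation keeps dept_id, which now serves as a foreign key to department.
employee_3nf = [{"emp_id": e["emp_id"], "dept_id": e["dept_id"]} for e in employee]

print(department)
print(employee_3nf)
```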

7
Q

What challenges arise from the application setting?

A

o Data characteristics
o System and resources
o Time restrictions

8
Q

What are the challenges of data management?

A

• Veracity
o No longer only structured data with known semantics and quality
o Dealing with high levels of profile noise
• Volume
o Very large number of profiles
• Variety
o Large volumes of semi-structured, unstructured or highly heterogeneous structured data
• Velocity
o Increasing speed at which new data becomes available

9
Q

Properties of traditional databases

A
Constrained functionality: SQL only
Efficiency limited by server capacity
- Memory
- CPU
- HDD
- Network
Scaling can be done by
- Adding more hardware
- Creating better algorithms
- But there are still limits
10
Q

Properties of distributed databases

A

Innovation
- Add more DBMSs and partition the data
Constrained functionality
- Answer SQL queries
Efficiency limited by #servers and the network
API offers location transparency
- User/application always sees a single machine
- User/application does not care about data location
Scaling: add more/better servers, faster network

11
Q

Properties of massively parallel processing platforms

A

Innovation
- Connect computers (nodes) over a LAN
- Make development, parallelization and robustness easy
Functionality
- Generic data-intensive computing
Efficiency relies on the network, #computers & algorithms
API offers location & parallelism transparency
- Developers don’t know where data is stored or how the code will be parallelized
Scaling: add more and better computers

12
Q

Properties of the cloud

A

Massively parallel processing platforms running on rented hardware
Innovation
- Elasticity, standardization
- e.g., a university needs few resources during the holidays, while Amazon needs a lot → elasticity
Elasticity can be adjusted automatically
API offers location and parallelism transparency
Scaling: It’s magic!

13
Q

Five characteristics of big data

A
Volume
- quantity of generated and stored data
Velocity
- speed at which the data is processed and stored
Variety
- Type and nature of the data

Variability
- inconsistency of the data set
Veracity
- quality of captured data

14
Q

Architectural choices to consider:

A
  • Storage layer
  • Programming model & execution engine
  • Scheduling
  • Optimizations
  • Fault tolerance
  • Load balancing
15
Q

Requirements of storage layer

A
  • Scalability: handle the ever-increasing data sizes
  • Efficiency: fast accesses to data
  • Simplicity: hide complexity from the developers
  • Fault-tolerance: failures do not lead to loss of data

• Developers do NOT read from or write to the files explicitly
• The Distributed File System (DFS) handles I/O transparently
o Several DFSs are already available:
- Hadoop Distributed File System
- Google File System
- Cosmos File System

16
Q

What is HDFS?

A

• Files partitioned into blocks
• Blocks distributed and replicated across nodes
• Three types of nodes in HDFS, each with one functionality:
o Name nodes: Keep the locations of blocks
o Secondary name nodes: backup nodes
o Data nodes: keep the actual blocks

17
Q

What happens when a data node fails?

A
  • The name node and data nodes communicate using heartbeats
  • A heartbeat is a signal sent by the data node to the name node at regular intervals to indicate that it is still present and working
  • On failure, the name node removes the failed data node from the index
  • Lost partitions are re-replicated to the remaining data nodes
18
Q

Properties of HDFS

A

• Scalability: Handle the ever-increasing data sizes
o Just add more data nodes
• Efficiency: Fast accesses to data
o Everything read from hard disk (requires I/O)
• Simplicity: Hide complexity from the developers
o No need to know where each block is stored
• Fault-tolerance: Failures do not lead to loss of data
o Administrator can control replication
o If failures are not widespread, no data is lost

19
Q

What is big data analytics?

A

• Driven by artificial intelligence, mobile devices, social media and the Internet of Things (IoT)
• Data sources are becoming more complex than those for traditional data
o e.g., Web applications allow user generated data

In order to:
•	Deliver deeper insights
•	Power innovative data applications
•	Enable better and faster decision-making
•	Predict future outcomes
•	Enhance business intelligence
20
Q

Types of analytics:

A

Traditional computation
- Exact and complete answers over the whole data collection
Approximate
- Use a representative sample instead of the entire input data collection
- Give approximate output rather than exact answers
- Answers are given with guarantees
Progressive
- Efficiently process the data given the limited time and/or computational resources that are currently available
Incremental
- The rate of data updates is often high, which quickly makes previous results obsolete
- Update existing processing information
- Allow leveraging new evidence from the updates to fix previous inconsistencies or complete the information
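A tiny Python sketch contrasting exact and approximate (sample-based) computation; the data and sample size are made up for illustration:

```python
import random

data = list(range(1_000_000))            # stand-in for a large data collection

exact = sum(data) / len(data)            # traditional: exact answer over all data

sample = random.sample(data, 10_000)     # approximate: use a representative sample
approx = sum(sample) / len(sample)       # approximate answer, computed on far less data

print(exact, approx)                     # approx is close to exact, within some error bound
```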

21
Q

What is mapreduce?

A

A programming paradigm (~language) for the creation of code that supports the following:
- Easy scale-out
- Parallelism & location transparency
- Simple to code and learn
- Fault tolerance
  - Among 1000s of off-the-shelf computers, one WILL fail
- Constrains the user to simple constructs!

22
Q

What is the data model of MapReduce?

A
  • Basic unit of information: the key-value pair
  • Translate data to key-value pairs
  • Thus it can work on various data types (structured, unstructured, etc.)
  • Then, pass the pairs through the MapReduce pipeline
23
Q

What is the programming model of MapReduce?

A

Model based on different functions
- Primary ones: the Map function and the Reduce function
- Map(key, value):
  - Invoked for every split of the input data
  - Value corresponds to the records (lines) in the split
- Reduce(key, list(values)):
  - Invoked for every unique key emitted by Map
  - list(values) corresponds to all values emitted from ALL mappers for this key
- Combine(key, list(values)):
  - Locally merges the values per key at each node to reduce the number of cross-node messages
  - No guarantee that it will actually be executed!
  - Typically invoked after a fixed-memory buffer is full
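A minimal, purely illustrative word-count sketch of Map and Reduce in plain Python; the shuffle is simulated locally (a real job would run on Hadoop, and a combiner could pre-aggregate per node before the shuffle):

```python
from collections import defaultdict

def map_fn(key, value):
    # Invoked for every line (record) of a split; emits (word, 1) pairs
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Invoked once per unique key with all values emitted by all mappers
    yield (key, sum(values))

lines = ["big data is big", "data moves fast"]

# Local stand-in for the shuffle phase: group mapper output by key
groups = defaultdict(list)
for offset, line in enumerate(lines):
    for k, v in map_fn(offset, line):
        groups[k].append(v)

result = dict(pair for k, vs in groups.items() for pair in reduce_fn(k, vs))
print(result)  # {'big': 2, 'data': 2, 'is': 1, 'moves': 1, 'fast': 1}
```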

24
Q

Downsides of MapReduce

A
MapReduce is not a panacea
MapReduce is simple but weak for some requirements:
•	Cannot define complex processes
•	Batch mode, acyclic, not iterative
•	Everything file-based, no distributed memory 
•	Difficult to optimize
Not good in:
•	Iterative processes, e.g., clustering
•	Real-time answers, e.g., streams 
•	Graph queries, e.g., shortest path
25
Q

More problems with MapReduce

A

Problems with MapReduce:
- MapReduce is a major step backwards (DeWitt, 2008)
Performance: extensive I/O (input/output)
- Everything is a file stored in the HDFS
- Data access is too slow
- RAM is not used sufficiently
Programming model: limited expressiveness
- e.g., iterations, cyclic processes
- Code is difficult to optimize
- SQL: several optimization methodologies
26
Q

What is Spark's dataflow paradigm?

A

- Models an algorithm as a directed graph with the data flowing between operations
- Construction goals:
  - Improve expressiveness and extensibility
  - Make coding easier: strive for high-level code
  - Enable additional optimizations
  - Improve performance by utilizing the hardware better (RAM)
- Representative examples:
  - Spark, Apache, Dryad, Pregel
27
Q

What are Spark's architectural choices?

A

• Storage layer
o Resilient Distributed Datasets (RDDs)
o Datasets and data sources
o Input files still stored in HDFS
• Programming model and execution engine
28
Q

RDD storage layer requirements

A

• Scalability: handle the ever-increasing data sizes
• Efficiency: fast accesses to data
• Simplicity: hide complexity from the developers
• Fault tolerance: failures do not lead to loss of data
• Fast RAM for hot data: recent data stored in RAM
29
Q

What is an RDD?

A

• R – resilient
o Recovers from failures
• D – distributed
o Parts are placed on different computers
• D – dataset
o A collection of data
o Array, table, data frame, etc.

Distributed, fault-tolerant collections of elements that can be processed in parallel

Resilient Distributed Datasets
• Created by
o Loading data from stable storage, e.g., from HDFS
o Manipulation of existing RDDs
• Core properties
o Immutable, i.e., read-only, cannot be changed
o Distributed
o Lazily evaluated
- Joining multiple orders together (restaurant example)
- Intuition → optimization
o Cacheable → by default, stored in memory!
o Replicated
30
Q

What does an RDD contain?

A

• Details about the data
o i.e., the data location or the actual data
• Lineage information (equal to its “history”)
o Dependencies on other RDDs
o Functions/transformations for recreating a lost split of an RDD from a previous RDD!
• Examples:
o RDD2 = RDD1.filter(...)
o RDD3 = RDD2.transform(...)
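A minimal PySpark sketch of lineage; the HDFS path and predicates are invented, and the generic `.transform(...)` from the card is replaced here by a concrete `map`:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "lineage-example")

# Created by loading data from stable storage (hypothetical HDFS path)
rdd1 = sc.textFile("hdfs:///data/orders.txt")

# Created by manipulating existing RDDs; Spark only records the lineage here
rdd2 = rdd1.filter(lambda line: "pizza" in line)   # RDD2 = RDD1.filter(...)
rdd3 = rdd2.map(lambda line: line.upper())         # RDD3 = RDD2.map(...)

# If a split of rdd3 is lost, Spark re-applies filter and map to the
# corresponding split of rdd1 instead of recomputing everything from scratch.
print(rdd3.toDebugString())  # prints the recorded lineage
```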
31
Q

Why the Spark programming model?

A

- MapReduce is simple but weak
  - Cannot define complex processes
  - Batch mode, acyclic, not iterative
  - Everything is file-based, no distributed memory
  - Procedural → difficult to optimize
- Spark
  - Processing expressed as a directed acyclic graph (DAG)
32
Q

Spark: dataflow and RDDs

A

Spark development is RDD-centric
- In the future, dataset-centric
- RDDs enable operations:
  - Transformations (lazy operations)
    - e.g., map, filter, flatMap, joins
  - Actions
    - e.g., count, collect
- Chain RDD transformations to implement the required functionality
33
Q

What are transformations in Spark?

A

Most used ones:
- Map transformation
  - Returns a new RDD formed by passing each element of the source through the given function
- Filter transformation
  - Returns a new RDD formed by keeping those elements of the source for which the function returns TRUE
- FlatMap transformation
  - Similar to map
  - But each input can be mapped to 0 or more output elements
  - e.g., a name FlatMap: first name and last name can become different output elements
- ReduceByKey transformation
  - Processes elements that are (K, V) pairs (key, value)
  - Creates another set of (K, V) pairs where the values for each key are aggregated using the given function
- GroupByKey transformation
  - Processes elements that are (K, V) pairs
  - For each key K it creates an iterable containing all values for that key
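A minimal PySpark sketch of these transformations; the dataset and the keying by first letter are invented for illustration:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "transformations-example")

names = sc.parallelize(["Ada Lovelace", "Alan Turing", "Grace Hopper"])

upper   = names.map(lambda n: n.upper())              # map: exactly one output per input
a_only  = names.filter(lambda n: n.startswith("A"))   # filter: keep elements where the function is TRUE
tokens  = names.flatMap(lambda n: n.split(" "))       # flatMap: 0..n outputs per input (first/last name)

pairs   = tokens.map(lambda t: (t[0], 1))             # (K, V) pairs keyed by first letter
counts  = pairs.reduceByKey(lambda a, b: a + b)       # aggregate the values per key
grouped = pairs.groupByKey()                          # iterable of all values per key

print(counts.collect())  # e.g. [('A', 2), ('L', 1), ('T', 1), ('G', 1), ('H', 1)]
```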
34
Q

Transformations in Spark

A

Create a new RDD from an existing one
- All transformations in Spark are lazy
  - They do not compute their results right away
  - Results are computed only when an action requires them
35
Q

Actions and transformations on RDDs are fully parallelizable

A

- Synchronization is required only on shuffling
36
Q

Lazy evaluation in Spark

A

- Spark = static rule-based optimizations
- Exploits the lazy evaluation of transformations
- The actual computation starts only when an action is called, e.g., collect()
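A minimal PySpark sketch of lazy evaluation (the data is made up): nothing is computed until the action `collect()` is called.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy-eval-example")

nums    = sc.parallelize(range(1, 6))
doubled = nums.map(lambda x: x * 2)        # lazy: nothing is computed yet
big     = doubled.filter(lambda x: x > 4)  # lazy: only the lineage is extended

result = big.collect()                     # action: triggers the actual computation
print(result)                              # [6, 8, 10]
```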