Data Integration & Data Quality Flashcards

1
Q

Which three data models are there?

A

1 - hierarchical model
2 - network model
3 - relational database model

2
Q

Hierarchical data model

A

Data is structured as a tree of records

3
Q

Network model

A

Allows nodes to have multiple parents, allowing for more flexibility
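The difference between the two models can be sketched with plain parent links. This is only an illustration: the record names (`ann`, `sales`, `project_x`, `company`) and both helper functions are made up for this card.

```python
# Parent links per record; all names are hypothetical examples.
hierarchical = {"ann": ["sales"], "sales": ["company"], "company": []}
network = {
    "ann": ["sales", "project_x"],  # two parents: only possible in the network model
    "sales": ["company"],
    "project_x": ["company"],
    "company": [],
}


def is_hierarchical(parents):
    """True if every record has at most one parent, i.e. the structure is a tree."""
    return all(len(p) <= 1 for p in parents.values())


def ancestors(node, parents):
    """Collect every record reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen
```

In the network version, `ann` is reachable from both `sales` and `project_x`, which a strict tree cannot express without duplicating the record.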

4
Q

What does the relational database model provide?

A

Data independence: a way to store data that is independent of the applications that use it. Of the three models, this is the only one in which data can be restructured without affecting the applications.

5
Q

Approaches for dealing with physical location challenges (data stored in different places)

A

1 - data federation

2 - data warehousing

6
Q

Data federation

A

Leaving the data in its place and moving the query to the data. The database engine becomes the federation engine or the mediator, which maintains a global target schema to provide a virtual view of the integrated data. The query optimizer then uses this global target schema to decompose a query into partial queries that can be executed by each of the sources.

7
Q

What are wrappers used for in data federation?

A

One wrapper is used per source. It maps the global schema to the schema of that source and negotiates how much of a partial query the source can execute and at what cost. The wrapper translates the query and returns the result to the federation engine.

8
Q

What to do if the data is described differently in different places?

A

Map the existing schemas to one common schema. This is a virtual schema in the case of federation and a materialised schema in the case of data warehousing.

9
Q

What do declarative schema mappings describe?

A

The relationships between the schemas of heterogeneous data sources.
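A declarative mapping can be as simple as a lookup table that states which source attribute corresponds to which global attribute, without prescribing how the translation is executed. The sketch below assumes two made-up sources (`crm`, `billing`) and made-up attribute names; `to_global` is a hypothetical helper.

```python
# Hypothetical mappings: global attribute -> source attribute, one table per source.
SCHEMA_MAPPINGS = {
    "crm": {"name": "full_name", "street": "addr_street"},
    "billing": {"name": "customer", "street": "str"},
}


def to_global(source, record):
    """Rename the fields of a source record to the names of the global schema."""
    mapping = SCHEMA_MAPPINGS[source]
    return {
        global_attr: record[source_attr]
        for global_attr, source_attr in mapping.items()
    }
```

Because the mapping is data rather than code, adding a new source only requires adding one more entry to the table.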

10
Q

Name the measurements of data quality

A

1 - accuracy
2 - timeliness (update frequency)
3 - completeness
4 - consistency
5 - duplication
6 - referential integrity (references must point to existing records, e.g. first insert, then update)
7 - domain integrity (e.g. ages should fall within a certain domain)
8 - adherence to business rules (e.g. if there can be only one manager, the database must not contain multiple managers)
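Several of these measurements can be computed directly from the data. The following is a minimal sketch on a toy employee table; the rows, the age range of 0 to 120, and all function names are assumptions chosen for illustration.

```python
# Toy data: one duplicate row, missing ages, an impossible age, an unknown department.
employees = [
    {"id": 1, "name": "Ann", "age": 34, "dept_id": 10},
    {"id": 2, "name": "Bob", "age": None, "dept_id": 10},
    {"id": 2, "name": "Bob", "age": None, "dept_id": 10},
    {"id": 3, "name": "Cy", "age": 210, "dept_id": 99},
]
departments = {10, 20}  # valid department ids


def completeness(rows, field):
    """Fraction of rows in which the field is filled in."""
    return sum(r[field] is not None for r in rows) / len(rows)


def duplicate_ids(rows):
    """Ids that occur more than once."""
    seen, dups = set(), set()
    for r in rows:
        if r["id"] in seen:
            dups.add(r["id"])
        seen.add(r["id"])
    return dups


def domain_violations(rows, low=0, high=120):
    """Ids of rows whose age lies outside the assumed domain [low, high]."""
    return [
        r["id"] for r in rows
        if r["age"] is not None and not low <= r["age"] <= high
    ]


def referential_violations(rows, valid_departments):
    """Ids of rows that reference a department that does not exist."""
    return [r["id"] for r in rows if r["dept_id"] not in valid_departments]
```

Accuracy and timeliness are not covered here, since they need an external reference (the real-world value, or the time of the last update) that the table itself does not contain.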

11
Q

What are the key aspects of data distribution?

A

1 - availability
2 - scalability
3 - transparency (of access)
4 - reliability/fault-tolerance

12
Q

Homogeneous distributed databases

A

All sites have identical software, are aware of each other, and agree to cooperate in processing user requests.
The system appears as a single system to the user.
It relies on a closed-world assumption: everything in the network is known.

13
Q

Heterogeneous distributed databases

A

Different sites may use different schemas and software. They might not be aware of each other and may provide only limited facilities for cooperation in transaction processing (queries).

14
Q

Horizontal fragmentation

A

Each tuple (row) of the relation is assigned to one or more fragments.
Example: storing the employees of each department in a separate database.
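The department example can be sketched as follows. This simplified version assigns each row to exactly one fragment (the definition also allows more than one); the sample rows and function names are made up.

```python
from collections import defaultdict

employees = [
    {"id": 1, "name": "Ann", "dept": "sales"},
    {"id": 2, "name": "Bob", "dept": "hr"},
    {"id": 3, "name": "Cy", "dept": "sales"},
]


def fragment_horizontally(rows, key):
    """Split the relation row-wise: one fragment per distinct value of `key`."""
    fragments = defaultdict(list)
    for row in rows:
        fragments[row[key]].append(row)
    return dict(fragments)


def reconstruct(fragments):
    """The original relation is the union of all fragments."""
    return [row for rows in fragments.values() for row in rows]
```

Every fragment keeps the full schema; only the set of rows differs, so the original relation is recovered with a union.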

15
Q

Vertical fragmentation

A

The schema for the relation is split into several smaller schemas

16
Q

What does vertical fragmentation require?

A

All fragment schemas must share a common candidate key / super key (a unique id) to ensure lossless joins.
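Why the shared key matters can be seen in a small sketch: each fragment keeps the key plus a subset of the columns, and the original rows are rebuilt by joining on that key. The sample data and function names are assumptions for illustration.

```python
employees = [
    {"id": 1, "name": "Ann", "salary": 50000},
    {"id": 2, "name": "Bob", "salary": 42000},
]


def fragment_vertically(rows, key, column_groups):
    """Split the relation column-wise; every fragment also keeps the key column."""
    return [
        [{col: row[col] for col in (key, *cols)} for row in rows]
        for cols in column_groups
    ]


def join_on_key(fragments, key):
    """Rebuild the original rows by merging fragment rows that share a key value."""
    merged = {}
    for fragment in fragments:
        for row in fragment:
            merged.setdefault(row[key], {}).update(row)
    return list(merged.values())
```

Without the `id` column in both fragments there would be no way to tell which name belongs to which salary, so the join would no longer be lossless.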

17
Q

Communication heterogeneity

A

Some databases allow direct access through a query language, while others only offer APIs.

18
Q

Schema heterogeneity

A

The structure of the tables is different.

Example: one source uses a single student_info table, while another splits the same data into student_classes and student_contact_info.

19
Q

Data type heterogeneity

A

The same data is stored as different data types in different sources.

20
Q

Value heterogeneity

A

The same logical value is stored in different ways.

Example: str, st, street
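A common fix is to normalise all variants to one canonical form. The sketch below is deliberately naive and assumes that `str`, `st` and `street` always mean "street" (so it would also rewrite "St. John"); the lookup table and function name are made up.

```python
import re

# Assumed variant table: every spelling maps to one canonical value.
STREET_VARIANTS = {"str": "street", "st": "street", "street": "street"}

# Match a whole word that is one of the variants, with an optional trailing dot.
_PATTERN = re.compile(r"\b(?:street|str|st)\b\.?", flags=re.IGNORECASE)


def normalize_street(address):
    """Replace every known street abbreviation with the canonical 'street'."""
    def replace(match):
        variant = match.group(0).lower().rstrip(".")
        return STREET_VARIANTS[variant]

    return _PATTERN.sub(replace, address)
```

After normalisation, values from different sources can be compared with a plain equality check.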

21
Q

Semantic heterogeneity

A

The same name or value means different things in different sources.
Example: title -> job title or title of a person

22
Q

When are federated databases useful?

A

When there are many sources and only a few are communicating.

23
Q

Name the two approaches to update data warehouses

A

1 - complete rebuild

2 - incremental updates
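The two approaches can be contrasted in a few lines. This sketch assumes every source row carries an `updated_at` value (here a plain integer) and does not handle deletions; the field and function names are hypothetical.

```python
def complete_rebuild(source_rows):
    """Throw the old warehouse away and reload every row from the source."""
    return {row["id"]: dict(row) for row in source_rows}


def incremental_update(warehouse, source_rows, last_load):
    """Apply only the rows that changed after the previous load."""
    for row in source_rows:
        if row["updated_at"] > last_load:
            warehouse[row["id"]] = dict(row)
    return warehouse
```

A rebuild is simple but touches every row on each load, while the incremental version touches only the changed rows at the price of having to track what changed.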

24
Q

Name two important things about data warehouses

A

1 - A data warehouse is not up to date at all times; it is only as current as its last update

2 - They cannot handle streaming data

25
Q

Data lake

A

A data lake stores all types of structured and unstructured data as-is.
It can be combined with data warehouses.

26
Q

Mediator

A

A virtual view over the data. It has a virtual schema that combines all schemas from the sources. The mapping takes place at query time. The mediator sends queries to the wrappers, which connect to the sources and send the results back to the mediator. The mediator combines all the results into one final result.
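The mediator and wrapper roles can be sketched with two small classes. This is a simplified illustration, not a real federation engine: sources are in-memory lists, queries are single equality filters, and the class names, attribute names and mappings are all assumptions.

```python
class Wrapper:
    """Translates between the global schema and one source's own schema."""

    def __init__(self, rows, mapping):
        self.rows = rows        # the source's data, in its own schema
        self.mapping = mapping  # global attribute -> source attribute

    def query(self, global_attr, value):
        # Translate the global attribute to the source's name, run the filter,
        # then translate the matching rows back to the global schema.
        source_attr = self.mapping[global_attr]
        matches = [r for r in self.rows if r[source_attr] == value]
        return [{g: r[s] for g, s in self.mapping.items()} for r in matches]


class Mediator:
    """Holds no data itself; forwards queries and combines the results."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, global_attr, value):
        results = []
        for wrapper in self.wrappers:
            results.extend(wrapper.query(global_attr, value))
        return results


# Usage: two sources with different schemas behind one virtual schema.
crm = Wrapper([{"full_name": "Ann", "town": "Delft"}],
              {"name": "full_name", "city": "town"})
billing = Wrapper([{"customer": "Bob", "city": "Delft"}],
                  {"name": "customer", "city": "city"})
mediator = Mediator([crm, billing])
delft_people = mediator.query("city", "Delft")
```

The data never leaves its source until a query arrives, which is what distinguishes this virtual view from the materialised schema of a data warehouse.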