Lecture 3 Flashcards

1
Q

Data lake

A

A repository that holds raw data in its native format. The data lake approach copies all the data that might be relevant. Whereas the data warehouse approach cleans the data, the data lake approach does not. Mostly, the two exist next to each other because different employees have different needs. This approach is also called schema-on-read.

2
Q

ETL

A

Data warehousing process that consists of:

  • Extraction
  • Transformation
  • Load
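
A minimal sketch of the three steps in Python, assuming a hypothetical sales.csv source export with date, store and amount columns, and an in-memory SQLite database standing in for the warehouse:

```python
import csv
import sqlite3

# Extract: read raw rows from a production export (hypothetical sales.csv).
def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# Transform: convert the source form into the target form (types, cleaning).
def transform(rows):
    clean = []
    for row in rows:
        if not row["amount"]:          # filter incomplete records (GIGO)
            continue
        clean.append((row["date"], row["store"], float(row["amount"])))
    return clean

# Load: put the transformed records into the warehouse.
def load(records, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS sales_fact "
                 "(date TEXT, store TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract("sales.csv")), conn)
```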
3
Q

Extraction

A

Extracting data from a production database.

4
Q

Transformation

A

Converting the extracted data from its source form into its target form, so that it can be placed into the data warehouse. Transformation activities: transfer, cleaning, integration and extraction.

5
Q

Load

A

Putting the data into the warehouse.

6
Q

ETL data conversion

A
  • Simple data type conversion.
  • Complex data type conversion.
  • Currency conversion.
  • Language conversion.
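
For illustration, a sketch of what each kind of conversion might look like in Python; the exchange rate and translation table are made-up values, not from the lecture:

```python
from datetime import datetime

# Simple data type conversion: string to number.
amount = float("1234.50")

# Complex data type conversion: source date string to a date object.
order_date = datetime.strptime("31-12-2024", "%d-%m-%Y").date()

# Currency conversion: assumed fixed EUR->USD rate (illustrative only).
EUR_TO_USD = 1.08
amount_usd = amount * EUR_TO_USD

# Language conversion: map source-language names to the warehouse language.
PRODUCT_NAMES_EN = {"fiets": "bicycle", "stoel": "chair"}
product_en = PRODUCT_NAMES_EN.get("fiets", "unknown")
```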
7
Q

ETL cleaning

A
  1. Data must be validated and filtered -> garbage in = garbage out (GIGO).
  2. Data in the real world is dirty.
    - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data.
    - noisy: containing errors or outliers.
    - inconsistent: containing discrepancies in code or names.
  3. It is often neglected or casually handled.
  4. Problems are exposed when the data is summarized.
  5. Prevention is better, but who does the cleaning?
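
A sketch of the three kinds of dirt being handled with pandas; the column names and values are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["Ann", "Bob", None, "Dana"],      # incomplete: missing value
    "age":      [34, 29, 41, 530],                 # noisy: 530 is an outlier
    "country":  ["NL", "NLD", "BE", "NL"],         # inconsistent codes
})

df = df.dropna(subset=["customer"])                   # filter incomplete records
df = df[df["age"].between(0, 120)]                    # filter noisy outliers
df["country"] = df["country"].replace({"NLD": "NL"})  # harmonize codes
```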
8
Q

ETL load

A

Integral load:
- All the records are taken and processed by comparing them with the records in the data warehouse.
- Extreme case: the load deletes the DW tables and replaces them with completely new versions (= full load).
- When no timestamps are present, it is difficult to do an incremental load.
Delta/incremental load:
- Only the changed records are taken and processed systematically in the DW.
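
A sketch of both load styles with SQLite, assuming a hypothetical customer table whose source rows carry a last-updated timestamp (the delta load is exactly what becomes impossible without it):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dw_customers "
             "(id INTEGER PRIMARY KEY, name TEXT, updated TEXT)")
source_rows = [(1, "Ann", "2024-01-05"), (2, "Bob", "2024-02-01")]

# Integral (full) load: drop the DW contents and rebuild them completely.
def full_load(rows):
    conn.execute("DELETE FROM dw_customers")
    conn.executemany("INSERT INTO dw_customers VALUES (?, ?, ?)", rows)

# Delta/incremental load: only records changed since the last run.
def incremental_load(rows, last_run):
    changed = [r for r in rows if r[2] > last_run]
    conn.executemany("INSERT OR REPLACE INTO dw_customers VALUES (?, ?, ?)",
                     changed)

full_load(source_rows)
incremental_load([(2, "Bob jr.", "2024-03-01")], last_run="2024-02-15")
```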

9
Q

ETL tools

A
  1. Generation 0: do-it-yourself (general-purpose programming).
  2. Generation 1: code generators.
  3. Generation 2: ETL engines -> highly optimized, metadata driven.
10
Q

ETL tool vendors

A
  1. Informatica PowerCenter
  2. Bluesky Integration Studio
  3. SAP Data Integrator
  4. IBM WebSphere Data Integration Suite
  5. SAS Enterprise Data Integration
  6. Oracle Warehouse Builder
  7. MS SQL Server Integration Services
11
Q

BI front-end applications

A
  1. Query and reporting
  2. OLAP
  3. Data mining models
12
Q

Query and reporting

A
  • Static reports generated with SQL or SQL-generators via the meta-layer.
  • Answers typical business questions.
  • When a business question changes a little, the SQL query needs to be changed.
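
A sketch of such a static report over a hypothetical sales table; note how the GROUP BY query answers exactly one business question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("North", 100.0), ("South", 250.0), ("North", 75.0)])

# Static report: total revenue per store. If the business question shifts
# (say, revenue per store per month), the SQL itself must be rewritten.
for store, total in conn.execute(
        "SELECT store, SUM(amount) FROM sales GROUP BY store"):
    print(store, total)
```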
13
Q

Methods of reporting

A
  • A very popular method is reporting via a meta-layer query and reporting tool.
    + used when a company does not have SQL knowledge.
    + creates a middle layer between the database and the business.
    + can be done with Microsoft Access, for example.
  • Reporting via SQL
14
Q

OLAP

A
  • Enables users to interactively analyze multi-dimensional data from multiple perspectives.
  • Answers simple and complex business questions.
15
Q

Multi-dimensional data

A

Data can be classified into measures and dimensions. Aggregating measures up to certain dimensions creates a multi-dimensional view on the data (= data cube).
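
A sketch of building such a cube with pandas: amount is the measure, store and quarter are the dimensions (hypothetical data):

```python
import pandas as pd

sales = pd.DataFrame({
    "store":   ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "amount":  [100, 150, 200, 250],
})

# Aggregating the measure over the dimensions yields the data cube.
cube = sales.pivot_table(values="amount", index="store",
                         columns="quarter", aggfunc="sum")
print(cube)
```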

16
Q

Measures

A

Summable information concerning a business process.

17
Q

Dimensions

A

Represent different perspectives for viewing measures. Dimensions are organized hierarchically; a hierarchy represents different levels of aggregation.

18
Q

Fact

A

A cell in a data cube that actually holds a value.

19
Q

OLAP operators

A
  • Roll up: aggregating measures to a higher dimension level -> roll up on time from quarters to years.
  • Drill down: reverse of roll up -> drill down on time from quarters to months.
  • Slice (& dice): selecting a subset of cells that satisfies a certain selection condition -> slice for time = Q1. The reverse is unslice/undice.
  • Pivot (or rotate): rotates the data axes.
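
The same four operators sketched with pandas on a hypothetical sales cube:

```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2024, 2024, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "store":   ["North", "North", "South", "South"],
    "amount":  [100, 150, 200, 250],
})

# Roll up: aggregate from the quarter level up to the year level.
rollup = sales.groupby("year")["amount"].sum()

# Drill down: back from year to the finer quarter level.
drilldown = sales.groupby(["year", "quarter"])["amount"].sum()

# Slice: select the subset of cells where quarter == "Q1".
q1_slice = sales[sales["quarter"] == "Q1"]

# Pivot (rotate): swap the axes of the cube view.
pivoted = sales.pivot_table(values="amount", index="quarter",
                            columns="store", aggfunc="sum")
```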
20
Q

Data mining models

A
  • Objective is to discover new valuable knowledge in databases.
  • Answers questions like:
    + Which customers are likely to respond to my promotion?
    + Which customers are at risk of leaving?
21
Q

Modeling OLAP databases, define

A
  • Measures
  • Dimensions
  • Dimension hierarchies
22
Q

Star schema

A

A fact table in the middle connected to a set of dimension tables. It is a simple model that limits the number of joins, giving better query performance.
In a star schema the fact table's primary key is composed of the primary keys of the dimension tables.
There is data redundancy due to denormalization (a lot of data is put in one table with many rows). This improves query performance because fewer joins are needed.
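
A minimal star schema as SQLite DDL, with hypothetical store and time dimensions around a sales fact table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dim_time  (time_id  INTEGER PRIMARY KEY, quarter TEXT);

-- Fact table in the middle: its primary key is composed of the
-- dimension tables' primary keys; one join per dimension suffices.
CREATE TABLE fact_sales (
    store_id INTEGER REFERENCES dim_store(store_id),
    time_id  INTEGER REFERENCES dim_time(time_id),
    amount   REAL,
    PRIMARY KEY (store_id, time_id)
);
""")
```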

23
Q

Snowflake schema (normalized)

A

A refinement of the star schema where some dimension hierarchies are normalized into a set of smaller dimension tables, forming a shape similar to a snowflake. A snowflake schema is used when the data volume is getting too large.

  • Why? It tackles disadvantages of the star schema: it is less restricted, has smaller data volumes, and offers the possibility to store aggregations.
  • However, you lose advantages of the star schema: it is not easily understood by end-users, it needs a high number of joins, and there is no predictable/standard framework.
24
Q

Sparsity

A

Arises when not every cell in the cube is filled with data.

25
Q

Density

A

# of actual rows / # of potential rows
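
A worked example of the formula, with made-up counts:

```python
# A cube of 2 stores x 4 quarters has 8 potential cells; suppose only
# 6 of them actually contain a fact.
actual_rows = 6
potential_rows = 2 * 4
density = actual_rows / potential_rows   # 0.75; the rest is sparsity
```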

26
Q

How to deal with changes in dimension attributes (e.g., a store's type changes)

A

Solutions by Kimball:

  1. Overwrite old values.
  2. Add ‘version number’ field as part of the key.
  3. Add new column(s) to dimension table.
27
Q

How to deal with infrequent changes

A

Overwrite old value = restating.

  • Kills information
  • Use for correction of errors
  • Sometimes history is not important…
28
Q

How to deal with small individual changes

A

Add ‘version number’ field as part of the key.

  • Insert new record with new version number.
  • Optional: use additional fields like ‘date from’ - ‘date to’
  • Time-sliced version of customer/store/…
  • Most popular solution
  • No referential integrity anymore, unless you refer with the complete key.
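
A sketch of this version-number approach on a hypothetical customer dimension, using the optional 'date from'/'date to' fields:

```python
import pandas as pd

# Customer dimension with a version number as part of the key
# (time-sliced: date_from/date_to are the optional extra fields).
dim_customer = pd.DataFrame([
    {"customer_id": 7, "version": 1, "city": "Utrecht",
     "date_from": "2023-01-01", "date_to": "2024-06-30"},
])

# The customer moves: insert a new record with a new version number
# instead of overwriting, so the history stays queryable.
new_row = {"customer_id": 7, "version": 2, "city": "Amsterdam",
           "date_from": "2024-07-01", "date_to": None}
dim_customer = pd.concat([dim_customer, pd.DataFrame([new_row])],
                         ignore_index=True)
```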
29
Q

How to deal with suddenly big changes

A

Add new column(s) to dimension table.

  • Save old values in new columns
  • New values and old values in table
  • Analysis with old columns is still possible
  • Eventually, remove added column(s)