02 Data Warehouse Architecture, DWH Tool overview Flashcards

1
Q

What is the definition of a Data Warehouse?

A
  • A DWH is a collection of subject-oriented, integrated, nonvolatile and time-varying data to support management decisions.
    • Subject-oriented: analytical requirements, contrasting analytical databases.
    • Integrated: joining data from multiple sources, solving differences in data format.
    • Nonvolatile: durability of data, disabling modification, expanding scope of data.
    • Time-varying: older data can be retrieved, information paired with time.
  • A DWH is a copy of transaction data specifically structured for query and analysis.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Graphically explain the (7) main DWH Architecture components:

A
  • Analysis and visualization: dashboards, standard reports, spreadsheets, ad-hoc queries, data mining, planning.
  • Data Marts
  • Web-based Access
  • DWH Extension: Map-reduce, Hadoop, HDFS (Hadoop Distributed File System)
  • DWH Core: metadata, data basis, OLAP Server
  • Data Migration (ETL): selection, extraction, cleaning, transformation, load/update
  • Operational Data Sources: internal data sources, external data sources (Streams, HDFS, Logs).
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

From the DWH Architecture, clasify and mention relevant examples for the Operational Data Sources:

A

Operational Data Sources can be divided in:

  • Internal Data Sources:
    • ERP systems
    • CRM systems
    • BPM systems
    • ODS (operational data stores)
    • Master data management
    • Other DWHs
    • Flat files (CSVs, text files, others)
  • External Data Sources:
    • Streams
    • HDFS (Hadoop Distributed File System)
    • Web feeds
    • Flat files (CSVs, text files, others)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

From the DWH Architecture, describe the steps and mention some examples for the Data Migration (ETL):

A
  • ETL steps:
    • Selection (of sources)
    • Extraction (of actual data for further processing)
    • Cleansing (data unification rules, standardized values, validate)
    • Transformation (convert data using same units, joining data, agreggates, sortings, calcultations…)
    • Loading/Updating (correct fulll loads)
  • About 70% of DWH dev hours go to the ETL process
  • Scenarios: data aggregation, currency conversion, errors, deduplication, batch processing
  • Metadata generation/repository:
    • Technical (system, source data, source frequency)
    • Business (display name, content descriptions)
    • Process (task logs, disk reads, query time, rows processed)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What are some motives for real time ETL? Also explain some of its tradeoffs.

A
  • Users demand higher levels of freshness (relevant for decision making, BI)
  • Globalization of economy (operating in several time-zones, nigh-time update windows disappear)
  • New types of data sources (live web feeds, stock market prices, more real time info available)
  • Affordable technical realizations (better loading performance thanks to faster technologies => column storing, in-memory computing)
  • Also has Tradeoffs:
    • Possible incomplete records (replacing batch files with reads from message queues, transaction logs)
    • Restricting data quality screening to some columns
    • Allowing new/early facts to be posted with old dimensions (outdated master data records, danger of missing referential integrity)
    • Data staging is eliminated
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Mention some reasons to separate OLTP (transactions) and OLAP (analytical, DWH):

A
  • Response times:
    • Bad query performance on operational DBs
    • Read-oriented data storage in DWHs
  • Long-term data storage:
    • Analyzing data over a period of time
  • Independent data access:
    • No interference of operational systems
  • Standardization of data formats, data harmonization:
  • Enhancing data quality
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Mention some key characteristics of OLTP:

A
  • OLTP = On Line Transaction Processing
  • Application: Operational, ERP, CRM, legacy Apps.
  • Typical users: Staff
  • Time horizon: Weeks, Months
  • Refresh: Inmediate
  • Data model: Entity-Relationship
  • Schema: Normalized
  • Emphasis: Update (constant transactions)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Mention some key characteristics of OLAP:

A
  • OLAP = On Line Analytic Processsing
  • Application: MIS, DSS, DWH
  • Typical users: Managers, Executives
  • Time horizon: Years
  • Refresh: Periodic
  • Data model: Multi-dimensional
  • Schema: Star
  • Emphasis: Retrieval (read)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Mention the (3) elements of the DWH Core Layer:

A
  • MetaData
  • DWH Data Basis
  • OLAP Server
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Mention the elements of the multidimensional model (data cubes):

A
  • Dimensions (general axis names: products, time, location)
  • Reference objects (names on the dimensions axis: México, USA, Canada)
  • Facts (stored content on each individual X,Y,Z coordinate)
  • Ratios
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What are Data Marts?

A
  • Data marts are specialized DWHs targeted to a particular functional area or user group.
  • Can be derived from an enterprise DWH or collected directly from data sources.
  • Easier to build than an enterprise DWH.
  • Why a data mart approach?
    • Semantic partitioning
    • Archiving
    • Load balancing
    • Data protection
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What are the (2) main data mart design approaches?

A
  • Top-down design
    • The requirements of users are merged before the design process begins. A DWH is created.
    • Separate data marts exist for each business area or process.
    • Data marts are closer to reporting.
  • Bottom-up design
    • Separate schemas for each data mart based on the requirements of business area.
    • Later, these schemas form a global schema for the DWH.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Mention some key (points) differences between Traditional BI and Data Discovery:

A
  • Traditional BI:
    • IT department
    • Mega vendors, large independent sellers
    • Top-down/Central approach
    • Interface based on Dashboards/KPIs
    • Reporting as main use
    • Query on existing repositories (data storage)
    • Deployed by Consultants
  • Data Discovery:
    • Business-users oriented
    • Small, fast growing independent sellers
    • Bottom-up/Decentralized approach
    • Interface based on Visualizations
    • Analysis as main use
    • Dedicated repository (data storage)
    • Deployed by Users
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What are some drawbacks of data mart approaches?

A
  • No reconcilability of data (single point of truth is rlost)
  • Reconcile data inequalities can be difficult
  • Increasing extracts from the DWH needed
  • Change propagation (changing one data marte may affect others) -> Chances of error quickly grow
  • Non-extensibility (might be needed to start from scratch after big organizational changes)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

MapReduce can be used to extend a DWH, why?

A
  • Traditional and NoSQL DB’s are unfit to deal with the analysis of large quantities of data.
  • Having a MapReduce component allows for Big Data analytics tasks.
  • MapReduce can run complex analytical tasks in parallel.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Mention the (3) typical recepients of reports and what kind of report receive each of them:

A
  • CEO/Management: dashboards
  • Data Scientists: dashboards, standard reports, spreadsheets, ad-hoc queries, data mining and planning.
  • Business Users: standard reports, spreadsheets, ad-hoc queries, planning
17
Q

What are a list of the things to consider when designing a report?

A
  • Purpose: analytics, reporting
  • Update frequency: monthly, daily, hourly, right-time, real-time
  • Audience/Recipients: executives, business analysts
  • Output format: screen (size), paper (size), mobile devices, web
  • Data type: quantitative (booking, revenues, orders) or non-quantitative (top 10 customers, issues for investigation)
18
Q

Why data visualization is becoming increasingly important/valuable?

Also mention some benefits of it.

A
  • Tools and practices for data visualization, data discovery, visual analysis enable non-tech users to make effective user of data and reduce time to insight.
  • In BigData times it is essential to provide graphical representations of data and analytical concepts.
  • Visual analytics enable interactions with data and engagement in analytical processes.
  • Moreover, data aggregations and summaries remain critical to support visual reporting and analytics
  • Overall top benefits seeked by companies:
    • Improved operational efficiency
    • Faster response to business change
    • Ability to identify new business opportunities
    • Higher productivity
19
Q

Mention some Information Dashboard Design principles, elaborate on the Gestalt principles:

A

Information Dashboard Design general considerations:

  • Encoding data for rapid perception
  • Using attributes of colours
  • Using attributes and form (quantitative vs qualitative/categorical data)

Gestalt principles:

  • Proximity (objects that are closer together are perceived as more related than objects that are further apart)
  • Similarity (objects look similar to one another, perceive them as a coherent group or pattern, a single unit)
  • Enclosure (creating groups, elements are perceived as part of a group if they are located within the same closed region, everything outside the enclosure is seen as separated)
  • Closure (occurs when an object is incomplete or a space is not completely enclosed, a shape indicates a filling in the missing information)
  • Continuity (occurs when the eye is compelled to move through one object and continue to another one smoothly)
  • Connection (elements that are visually connected are perceived as more related than elements with no connections at all)
20
Q

Define and elaborate on the Tufte’s design principle of data-ink ratio:

A

Tufte’s design principle of data-ink ratio: every bit of ink on a graphic required a reason, almost always that reason should be that ink actually presents new information. Applying this to data pixels:

  • Reduce non-data pixels: eliminate all unnecessary non-data pixels, de-emphasize and regularize the non-data pixels that remain. Examples: getting rid of frames and using simple axis, avoid variations in color, use blank space instead of borders or lines, usin light colors in grid lines.
  • Enhance data pixels: eliminate all unnecessary data pixels, highlight the most important data pixels that remain. Examples: using visual emphasis through colors, varying color intensities to highlight data dynamically.