Data Warehousing Flashcards
(101 cards)
Data warehouse
Decision support database that is maintained separately from the organization’s operational database. It supports information processing by providing a solid platform of consolidated, historical data for analysis.
Subject-Oriented Data Warehouses
Organized around major subjects like customers, products and sales. It focuses on modelling and analysis of data for decision makers which is not on daily operations or transaction processing. Provides a simple and concise view around particular subject issues by excluding data that isn’t useful in the decision support process.
Integrated Data Warehouses
constructed by integrating multiple heterogeneous data sources such as relational DBs, flat files, Excel file and transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures etc. among different data sources. When data is moved to the warehouse it is converted.
Time Variant in Data Warehouses
the time horizon for data warehouses is much longer than of operational systems. As they provide historical data from the past 5-10 years. Every structure in the data warehouse contains an element of time, explicitly or implicitly, but the key of operational data may or may not contain a time element.
Non-Volatile Data Warehouse
Physically separate store of data transformed from the operational environment. Operational update of data doesn’t occur in the data warehouse environment. It doesn’t require transaction processing, recovery and concurrency control mechanisms. It requires only two operations in accessing data: initial loading of data and access of data.
heterogeneous DBMS
integrates disparate databases to provide a unified query interface for users. This system relies on a query-driven approach
A data warehouse is an _____ driven system. What does it focus on
update driven system that integrates and stores information from heterogeneous sources in advance. It focuses on high performance for querying and analysis
Operational DBMS
focuses on Online Transaction Processing (OLTP) while data warehouse focuses on Online Analytical Processing (OLAP)
Fact Table
Contains measurable data or metrics about business processes ex sales
Dimension Table
Contains descriptive attributes that label and filter data ex. Product Names
Why do we need a separate warehouse?
- Performance: OLTP systems are tuned for transaction processing whilst OLAP is optimized for complex analytical queries, multidimensional views and data aggregation
- Historical Data: Operation databases typically lack historical data, which is critical for decision making
- Data Consolidation: Decision-making often requires aggregated and summarized data from multiple heterogeneous sources.
- Data Quality: Ensures consistency in data representation, codes and formats from diverse sources
Data Warehouse Design Overview
Key Design Perspectives
1 Top-Down View: Identifies relevant information needed for the data warehouse.
2 Data Source View: Exposes data being captured and managed by operational systems
3 Data Warehouse View: Defines fact tables and dimension tables for analytical use.
4 Business Query View: Represents how end users will query and interact with the data
Design Process
Top Down: Focuses on overall planning and design
Bottom Up: Starts with prototypes and experiments
Waterfall: Structured, step by step process
Spiral: Iterative development with quick functional prototypes.
Design Steps
1 Choose a business process: Identify processes to model. Ex orders.
2 Define the grain: decide the atomic level of data granularity (ex. Individual orders or daily summaries)
3 Select Dimensions: Identify descriptive attributes (ex. Time or location)
4 Choose Measures: Define numerical metrics (ex. Sales amount) for the fact table.
Enterprise Warehouse
Collects all subject info spanning the organization. It features a unified approach for organizing and representing data, the ability to classify data according to subject and give access to those divisions (sales etc.). Normalized design.
DataMart
A DataMart is the access layer of the data warehouse environment. Used to get data out to the users. The data mart is a subset of the data warehouse that is usually oriented to a specific business line or team. Data marts represent small slices of the data warehouse.
Virtual Warehouse
A set of SQL views over operational databases. Only some of the possible summary views may be persisted.
Data Warehouse Tools and Utilities
- Data Extraction: Get data from multiple heterogeneous and external sources
- Data Cleaning: Detect errors in the data and rectify where possible
- Data Transformation: Convert data from legacy or host format to warehouse format.
- Load: Sort, summarize, consolidate, compute views, check integrity, and build indices and partitions.
- Refresh: Propagate the updates from the data source to the warehouse.
Metadata is the data defining warehouse objects. It stores:
- Description of the structure of the data warehouse (schemas, views…)
- Operational Metadata (Data lineage, currency of data…)
- Technical Metadata (DB system names, tables etc…)
- Algorithms used for summarization
- The mapping from operational environment to the data warehouse
- Data related to system performance (Data Warehouse Schema, View…)
- Business Data (Business terms and definitions, ownership of data…)
OLAP Server Architecture
- ROLAP
- MOLAP
- HOLAP
- Specialized SQL Servers
Relational OLAP (ROLAP)
Use relational or extended relational DBMS to store and manage warehouse data and OLAP middleware. Include optimization of DBMS backend, implementation of aggregation based navigation logic and more tools. Scalable.
Multidimensional OLAP (MOLAP)
Sparse array-based multidimensional storage engine. Fast indexing to pre-computed summarized data.
Hybrid OLAP (HOLAP)
Flexibility ex. Low level: relational, high level: array
Specialized SQL servers
ex Redbricks, specialized support for SQL queries over star/snowflake schemas.