CH1 Flashcards
(83 cards)
What is a data warehouse
subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making
subject oriented
Focused on business related subjects and organizational activity
Ex: Customers, employees, products, suppliers
Integrated
Data from different sources are stored in a consistent format
Requires addressing naming conflicts, unit discrepancies
. same naming format, concept, etc.
TIme-Varient
Maintains historical data
Each source may contain data at different time points –day, week, month etc. Used to analyze trends.
nonvolatile
Users cannot change data after it is entered. It is periodically updated.
Structured for query and analysis
Database vs. Data Warehouse
database for querying, for transaction processing.
Data warehouse takes the data from all different databases and creates a layer optimized for and dedicated to analytics.
A database is designed to handle transactions (OLTP). It is not structured to do analytics well.
A data warehouse, on the other hand, is structured to make analytics fast and easy.
Database you don’t do analysis
Need for Data Warehousing
Integrated, company-wide view of high-quality information (from disparate databases)
Separation of operational and informational systems and data (for improved performance)
Need for a Company-Wide View
Data in operational systems are fragmented and inconsistent.
Data is generally distributed on a variety of incompatible hardware and software.
Data must be consolidated to provide a single corporate view
Issues to resolve
Inconsistent key structures! Synonyms Free-form vs. structured fields Inconsistent data values Missing data
Organizational Trends Motivating Data Warehouses
No single system of records Multiple systems not synchronized Organizational need to analyze activities in a balanced way Customer relationship management Supplier relationship management
Data Warehouse Versus Data Mart
DataMart is subset of Data Warehouse. single subject.
DataMart is specific, decentralized, organic, some history, hightly denormalized.. Datamart could be a table in a database. segmented data(partitioned).
Logical Data Mart and Real-Time Data Warehouse
capture real time information
Independent Data Mart
Dependent Data Mart and Operational Data Store
Dependent: Dependent data marts are created by drawing data directly from operational, external or both sources.
Independent: Independent data mart is created without the use of a central data warehouse.
Hybrid: This type of data marts can take data from data warehouses or operational systems.
all datamarts
All involve some form of extraction, transformation and loading (ETL)
why use a datamart?
Data Mart helps to enhance user’s response time due to reduction in volume of data
It provides easy access to frequently requested data.
Data mart are simpler to implement when compared to corporate Datawarehouse. At the same time, the cost of implementing Data Mart is certainly lower compared with implementing a full data warehouse.
Compared to Data Warehouse, a datamart is agile. In case of change in model, datamart can be built quicker due to a smaller size.
A Datamart is defined by a single Subject Matter Expert. On the contrary data warehouse is defined by interdisciplinary SME from a variety of domains. Hence, Data mart is more open to change compared to Datawarehouse.
Data is partitioned and allows very granular access control privileges.
Data can be segmented and stored on different hardware/software platforms.
Dependent data mart with operational data store: a three-level architecture
ETL . see slide 16
near real time
can only be done if data is coming in clean
Extraction
connect via gatewarys and interfaces, odbc, oractle connect, etc.
data cleaning and transforming
Field length, anomalies, missing data etc.
Data migration -Include simple transforms – (field names etc.)
Data scrubbing - Use domain specific knowledge (zip codes)
Data auditing – discover rules, relationships, suspicious patterns
Load
Check integrity constrains (foreign key, primary key, null etc.)
Sorting, summarization, partition, index
Refresh
When and how
When - Depends on user needs, traffic, database server capabilities
How -Triggers (Oracle), Sniff transaction logs (MS)
Event
Event = a database action (create/update/delete) that results from a transaction
Status
after event status updated
transitory data
snapshot, whatever update before is lost.