CH1 Flashcards

(83 cards)

1
Q

What is a data warehouse

A
subject-oriented, 
integrated,
time-varying, 
non-volatile 
collection of data that is used primarily in organizational decision making
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

subject oriented

A

Focused on business related subjects and organizational activity
Ex: Customers, employees, products, suppliers

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Integrated

A

Data from different sources are stored in a consistent format
Requires addressing naming conflicts, unit discrepancies
. same naming format, concept, etc.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

TIme-Varient

A

Maintains historical data

Each source may contain data at different time points –day, week, month etc. Used to analyze trends.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

nonvolatile

A

Users cannot change data after it is entered. It is periodically updated.
Structured for query and analysis

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Database vs. Data Warehouse

A

database for querying, for transaction processing.
Data warehouse takes the data from all different databases and creates a layer optimized for and dedicated to analytics.

A database is designed to handle transactions (OLTP). It is not structured to do analytics well.

A data warehouse, on the other hand, is structured to make analytics fast and easy.
Database you don’t do analysis

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Need for Data Warehousing

A

Integrated, company-wide view of high-quality information (from disparate databases)
Separation of operational and informational systems and data (for improved performance)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Need for a Company-Wide View

A

Data in operational systems are fragmented and inconsistent.
Data is generally distributed on a variety of incompatible hardware and software.
Data must be consolidated to provide a single corporate view

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Issues to resolve

A
Inconsistent key structures!
Synonyms
Free-form vs. structured fields
Inconsistent data values
Missing data
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Organizational Trends Motivating Data Warehouses

A
No single system of records
Multiple systems not synchronized
Organizational need to analyze activities in a balanced way
Customer relationship management
Supplier relationship management
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Data Warehouse Versus Data Mart

A

DataMart is subset of Data Warehouse. single subject.
DataMart is specific, decentralized, organic, some history, hightly denormalized.. Datamart could be a table in a database. segmented data(partitioned).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Logical Data Mart and Real-Time Data Warehouse

A

capture real time information

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Independent Data Mart

Dependent Data Mart and Operational Data Store

A

Dependent: Dependent data marts are created by drawing data directly from operational, external or both sources.
Independent: Independent data mart is created without the use of a central data warehouse.
Hybrid: This type of data marts can take data from data warehouses or operational systems.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

all datamarts

A

All involve some form of extraction, transformation and loading (ETL)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

why use a datamart?

A

Data Mart helps to enhance user’s response time due to reduction in volume of data
It provides easy access to frequently requested data.
Data mart are simpler to implement when compared to corporate Datawarehouse. At the same time, the cost of implementing Data Mart is certainly lower compared with implementing a full data warehouse.
Compared to Data Warehouse, a datamart is agile. In case of change in model, datamart can be built quicker due to a smaller size.
A Datamart is defined by a single Subject Matter Expert. On the contrary data warehouse is defined by interdisciplinary SME from a variety of domains. Hence, Data mart is more open to change compared to Datawarehouse.
Data is partitioned and allows very granular access control privileges.
Data can be segmented and stored on different hardware/software platforms.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Dependent data mart with operational data store: a three-level architecture

A

ETL . see slide 16

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

near real time

A

can only be done if data is coming in clean

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

Extraction

A

connect via gatewarys and interfaces, odbc, oractle connect, etc.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

data cleaning and transforming

A

Field length, anomalies, missing data etc.
Data migration -Include simple transforms – (field names etc.)
Data scrubbing - Use domain specific knowledge (zip codes)
Data auditing – discover rules, relationships, suspicious patterns

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

Load

A

Check integrity constrains (foreign key, primary key, null etc.)
Sorting, summarization, partition, index

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

Refresh

A

When and how
When - Depends on user needs, traffic, database server capabilities
How -Triggers (Oracle), Sniff transaction logs (MS)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

Event

A

Event = a database action (create/update/delete) that results from a transaction

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

Status

A

after event status updated

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

transitory data

A

snapshot, whatever update before is lost.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
transient data
With transient data, changes to existing records are written over previous records, thus destroying the previous data content
26
Periodic Data
Periodic data are never physically altered or deleted once they have been added to the store creaet second colum called Action column. C=current. U= update, you are updating values. so your storing the older rows(historical). D=Deleted data.
27
Database Limitation
``` Relational databases consists of tables Tables are flat one-dimensional Cross-tabs can be 2-dimensional How to represent multi-dimensional tables? The job of data warehousing/OLAP ```
28
Dimensional Modeling
Is a preferred technique to present analytic data Contains same information as a normalized model Deliver data that’s understandable to the user Deliver fast query performance
29
Dimensional Modeling types
Star Schemas: Dimensional modeling implemented in relational DBMS (ROLAP) OLAP (online analytical processing): Dimensional modeling in a multi-dimensional DB environment OLAP cubes deliver superior query performance because of pre-calculations, indexing strategies, and other optimizations
30
ROLAP
Relational Olap model. Star Schemaa Dimensional modeling implemented in relational dbms. similar to rdms.
31
Star Schema
multiple tables that point to 1 fact table in relational dbms
32
Dimensional table
Dimensions are the subjects, who what when why
33
Fact table
collect data about subjects(dimensions)
34
Dimensions - context
who what where when hos of measurement, example total amount of sales, qty, etc.
35
dimensional model
characterize in fact or dimension tables.
36
same grain?
measurements rows in a fact table must be at the same grain. When creating table, at what detail of specification do you want to save the data. weekly, monthly, or individual transactions. they have to match.
37
Dimension Tables
Dimension tables often have many columns (attribute) Each dimension table contains data for one dimension Dimension table often represent hierarchical relationships Product roll up into brands and then into categories Each dimension is defined by a single primary key (PK) PK serves as the basis of referential integrity with the given fact table to which it is joined.
38
Online Analytical Processing (OLAP) Tools
The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques OLAP provides advanced query capabilities to the warehouse that standard SQL cannot Complex queries that need to aggregate data can take hours to run End users cannot be expected to issue SQL statements
39
Cube
A multidimensional structure consisting of “Data Cubes”
40
Dimension
Sides of a cube
41
Measure
Facts in a fact table
42
Aggregation
Projection of the cube
43
Cube shape
A Cube need not be in the shape of a cube at all. It can have as many dimensions as necessary. Each dimension can have as many “members” as necessary. A cube can have as many dimensions as necessary.
44
what goes inside the cube
measures from fact table
45
What is a fact table?
The Fact Table is the table that provides the data for the elements of the Cube. There can be only one Fact Table per Cube!
46
What is an Aggregation
Is a Projection of the cube. An Aggregation is a Projection of the Cube An Aggregation Collapses Dimensions. Summation. similar to SQL "Group BY" Clause
47
OLAP Provides 4 types of Aggregation
sum, count, min, max
48
cube slicing
come up with 2-D view of data by filtering (fixing) a dimension. like a where clause, slice based on a condition. example slice between the males and femails to compare.
49
dicing
come up with a small cube (sub-cube) by selecting a subset of all dimensions
50
drill down
going from summary to more detailed views.
51
roll up
going from detailed views to a summary view.
52
pivot(rotate)
to rotate the cube across a dimension to see various faces.
53
add slide
61
54
operational systems vs. bi systems
information. This asset is almost always used for two purposes: operational record keeping and analytical decision making. Simply speaking, the operational systems are where you put the data in, and the DW/BI system is where you get the data out
55
What are operational Systems?
The operational systems are optimized to process transactions quickly. These systems almost always deal with one transaction record at a time. They predictably perform the same operational tasks over and over, executing the organization’s business processes. Given this execution focus, operational systems typically do not maintain history, but rather update data to refl ect the most current state.
56
BI Users
Users of a DW/BI system, on the other hand, watch the wheels of the organization turn to evaluate performance. They count the new orders and compare them with last week’s orders, and ask why the new customers signed up, and what the customers complained about. They worry about whether operational processes are working correctly. Although they need detailed data to support their constantly changing questions, DW/BI users almost never deal with one transaction at a time. These systems are optimized for high-performance queries as users’ questions often require that hundreds or hundreds of thousands of transactions be searched and compressed into an answer set. To further complicate matters, users of a DW/BI system typically demand that historical context be preserved to accurately evaluate the organization’s performance over time
57
Business management requirements for BI/DataWarehouse
The DW/BI system must make information easily accessible. The contents of the DW/BI system must be understandable The DW/BI system must present information consistently. The data in the DW/BI system must be credible. Data must be carefully assembled from a variety of sources, cleansed, quality assured, and released only when it is fi t for user consumption. The DW/BI system must adapt to change. User needs, business conditions, data, and technology are all subject to change The DW/BI system must present information in a timely way The DW/BI system must be a secure bastion that protects the information assets The DW/BI system must serve as the authoritative and trustworthy foundation for improved decision making The business community must accept the DW/BI system to deem it successful
58
Dimensional modeling is widely accepted as the preferred technique for presenting analytic data because it addresses two simultaneous requirements
■ Deliver data that’s understandable to the business users. | ■ Deliver fast query performance.
59
difference between 3NF and dimensional model
Both 3NF and dimensional models can be represented in ERDs because both consist of joined relational tables; the key diff erence between 3NF and dimensional models is the degree of normalization. Because both model types can be presented as ERDs, we refrain from referring to 3NF models as ER models; instead, we call them normalized models to minimize confusion.
60
same info as normalized model?
A dimensional model contains the same information as a normalized model, but packages the data in a format that delivers user understandability, query performance, and resilience to change.
61
Star Schemas Versus OLAP Cubes
Dimensional models implemented in relational database management systems are referred to as star schemas because of their resemblance to a star-like structure. Dimensional models implemented in multidimensional database environments are referred to as online analytical processing (OLAP) cubes, as illustrated in Figure 1-1.
62
loading data in OLAP Cube
When data is loaded into an OLAP cube, it is stored and indexed using formats and techniques that are designed for dimensional data. Performance aggregations or precalculated summary tables are often created and managed by the OLAP cube engine.
63
analytics with OLAP
OLAP cubes also provide more analytically robust functions that exceed those available with SQL. The downside is that you pay a load performance price for these capabilities, especially with large data sets.
64
what does a fact represent?
a business measure
65
how and were store measurement data?
You should strive to store the low-level measurement data resulting from a business process in a single dimensional model. Because measurement data is overwhelmingly the largest set of data, it should not be replicated in multiple places for multiple organizational functions around the enterprise. Allowing business users from multiple organizations to access a single centralized repository for each set of measurement data ensures the use of consistent data throughout the enterprise
66
what does each row in a fact table represent?
Each row in a fact table corresponds to a measurement event. The data on each row is at a specifi c level of detail, referred to as the grain, such as one row per product on a sales transaction
67
one to one measurement
The idea that a measurement event in the physical world has a one-to-one relationship to a single row in the corresponding fact table is a bedrock principle for dimensional modeling. Everything else builds from this foundation.
68
most useful facts
The most useful facts are numeric and additive, such as dollar sales amount. Additivity is crucial because BI applications rarely retrieve a single fact table row. Rather, they bring back hundreds, thousands, or even millions of fact rows at a time, and the most useful thing to do with so many rows is to add them up.
69
semi additive and non additive facts
Semi-additive facts, such as account balances, cannot be summed across the time dimension. Non-additive facts, such as unit prices, can never be added.
70
fact table | grains fall into one of three categories
ransaction, periodic snapshot, and accumulating | snapshot. Transaction grain fact tables are the most common
71
fact tables and foreign keys
All fact tables have two or more foreign keys (refer to the FK notation in Figure 1-2) that connect to the dimension tables’ primary keys.
72
what is a fact table primary key made of?
The fact table generally has its own primary key composed of a subset of the foreign keys. This key is often called a composite key. Every table that has a composite key is a fact table. Fact tables express many-to-many relationships. All others are dimension tables.
73
what does dimension table describe?
They describe the “who, what, where, when, how, and why” associated with the event. Dimension tables tend to have fewer rows than fact tables, but can be wide with many large text columns. Each dimension is defi ned by a single primary key (refer to the PK notation in Figure 1-3), which serves as the basis for referential integrity with any given fact table to which it is joined
74
What are dimension tables primary source of?
Dimension attributes serve as the primary source of query constraints, groupings, and report labels. In a query or report request, attributes are identifi ed as the by words. For example, when a user wants to see dollar sales by brand, brand must be available as a dimension attribute
75
what should attributes consist of in dimension tables?
Attributes should consist of real words rather than cryptic abbreviations. You should strive to minimize the use of codes in dimension tables by replacing them with more verbose 14 Chapter 1 textual attributes.In many ways, the data warehouse is only as good as the dimension attributes; the analytic power of the DW/BI environment is directly proportional to the quality and depth of the dimension attributes
76
how to tell if it's a dimension or should be in fact table
You often make the decision by asking whether the column is a measurement that takes on lots of values and participates in calculations (making it a fact) or is a discretely valued description that is more or less constant and participates in constraints and row labels (making it a dimensional attribute). For example, the standard cost for a product seems like a constant attribute of the product but may be changed so often that you decide it is more like a measured fact.
77
do fact tables or dimension tables represent fierarchical relationships?
dimension tables often represent hierarchical relationships. For example, products roll up into brands and then into categories. For each row in the product dimension, you should store the associated brand and category description. The hierarchical descriptive information is stored redundantly in the spirit of ease of use and query performance. You should resist the perhaps habitual urge to normalize data by storing only the brand code in the product dimension and creating a separate brand lookup table, and likewise for the category description in a separate category lookup table
78
snowflaking
This normalization is called snowfl aking. Instead of third normal form, dimension tables typically are highly denormalized with fl attened many-to-one relationships within a single dimension table. Because dimension tables typically are geometrically smaller than fact tables, improving storage effi ciency by normalizing or snowfl aking has virtually no impact on the overall database size. You should almost always trade off dimension table space for simplicity and accessibility
79
what data has the most dimensionality
This book illustrates repeatedly that the most granular or atomic data has the most dimensionality. Atomic data that has not been aggregated is the most expressive data; this atomic data should be the foundation for every fact table design to withstand business users’ ad hoc attacks in which they pose unexpected queries. With dimensional models, you can add completely new dimensions to the schema as long as a single value of that dimension is defi ned for each existing fact row. Likewise, you can add new facts to the fact table, assuming that the level of detail is consistent with the existing fact table. You can supplement preexisting dimension tables with new, unanticipated attributes. In each case, existing tables can be changed in place either by simply adding new data rows in the table or by executing an SQL ALTER TABLE command.
80
``` SELECT store.district_name, product.brand, sum(sales_facts.sales_dollars) AS "Sales Dollars" FROM store, product, date, sales_facts WHERE date.month_name="January" AND date.year=2013 AND store.store_key = sales_facts.store_key AND product.product_key = sales_facts.product_key AND date.date_key = sales_facts.date_key GROUP BY store.district_name, product.bran ```
If you study this code snippet line-by-line, the fi rst two lines under the SELECT statement identify the dimension attributes in the report, followed by the aggregated metric from the fact table. The FROM clause identifi es all the tables involved in the query. The fi rst two lines in the WHERE clause declare the report’s fi lter, and the remainder declare the joins between the dimension and fact tables. Finally, the GROUP BY clause establishes the aggregation within the report.
81
Operational Source Systems
These are the operational systems of record that capture the business’s transactions. Think of the source systems as outside the data warehouse because presumably you have little or no control over the content and format of the data in these operational systems. The main priorities of the source systems are processing performance and availability. Operational queries against source systems are narrow, one-record-at-a-time Data Warehousing, Business Intelligence, and Dimensional Modeling Primer 19 queries that are part of the normal transaction fl ow and severely restricted in their demands on the operational system
82
The extract, transformation, and load (ETL)
environment consists of a work area, instantiated data structures, and a set of processes. The ETL system is everything between the operational source systems and the DW/BI presentation area
83
3 step process of ETL
1.Extraction is the fi rst step in the process of getting data into the data warehouse environment. Extracting means reading and understanding the source data and copying the data needed into the ETL system for further manipulation. At this point, the data belongs to the data warehouse. 2.After the data is extracted to the ETL system, there are numerous potential transformations, such as cleansing the data (correcting misspellings, resolving domain conflicts, dealing with missing elements, or parsing into standard formats), combining data from multiple sources, and de-duplicating data. The ETL system addsvalue to the data with these cleansing and conforming tasks by changing the data and enhancing. 3.fi nal step of the ETL process is the physical structuring and loading of data into the presentation area’s target dimensional models. Because the primary mission of the ETL system is to hand off the dimension and fact tables in the delivery step, these subsystems are critical