tmlsn Flashcards by hanna mcadam

What is a data warehouse?

A centralized system designed to store, process, and analyze large volumes of data from multiple sources.
Optimized for queries and reporting rather than transactions.

OLTP

Fast reads / writes
Day to day transactions
Real-time, current data

OLAP

Complex queries
Analytical queries and reporting
Historical, aggregated data
Read optimized for querying

Database

OLTP (online transaction processing)
-Structured system for storing and managing real-time transactions; handles the reads and writes of day-to-day operations
-Example system: SQL Server
-Example use: maintaining inventory records in a store

Data warehouse

OLAP (online analytical processing)
-Central repository optimized for storing historical data and performing complex analytical queries. Holds structured data
-Example systems: Redshift, Snowflake
-Example uses: trends over the last 5 years, an HR report across departments

Data lake

-Massive storage holding raw unstructured, semi-structured, and structured data
-Used for data processing and ML
-Example systems: S3, Azure Data Lake, Hadoop
-Example use: storing raw sensor data from IoT devices
-An unfiltered ocean of information

Data mart

Subset of a data warehouse focused on a specific department or business unit (sales, HR, marketing)
-Provides faster access to relevant insights
-Holds structured data
-Example use: HR analyzing employee retention

Star schema

o One fact table connected to multiple dimension tables
o Optimized for fast querying
o Example: a sales fact table surrounded by its dimensions
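
As a rough illustration of the star schema card, the sketch below builds one fact table and two dimension tables in an in-memory SQLite database. The table and column names (fact_sales, dim_product, dim_store) and the sample rows are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive attributes (hypothetical names)
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
    -- Fact table: measurable events, with one foreign key per dimension
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        amount     REAL
    );
    INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
    INSERT INTO dim_store   VALUES (1, 'Boston'), (2, 'Denver');
    INSERT INTO fact_sales  VALUES (1, 1, 1, 1200.0), (2, 2, 1, 300.0), (3, 1, 2, 1100.0);
""")

# A single join from the fact table reaches any dimension attribute.
sales_by_category = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(sales_by_category)
```

Because the category lives directly on dim_product (denormalized), the query needs only one join to group sales by category.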

Snowflake schema

o Normalizes dimensions into sub-dimensions
o Reduces data redundancy but increases query complexity
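
The trade-off on this card can be sketched with SQLite: the category is moved out of the product dimension into its own sub-dimension, so it is stored once, but the query needs an extra join. All table names, column names, and rows here are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Sub-dimension: each category is stored once instead of once per product
    CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE dim_product (
        product_id  INTEGER PRIMARY KEY,
        name        TEXT,
        category_id INTEGER REFERENCES dim_category(category_id)
    );
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        amount     REAL
    );
    INSERT INTO dim_category VALUES (1, 'Electronics'), (2, 'Furniture');
    INSERT INTO dim_product  VALUES (1, 'Laptop', 1), (2, 'Phone', 1), (3, 'Desk', 2);
    INSERT INTO fact_sales   VALUES (1, 1, 1200.0), (2, 2, 800.0), (3, 3, 300.0);
""")

# Two joins are now needed to reach the category name.
snowflake_totals = conn.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product  p ON p.product_id  = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
print(snowflake_totals)
```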

Galaxy schema – fact constellation

o Multiple fact tables share common dimension tables
o Complex business scenarios (sales, returns data)

Slowly changing dimensions (SCD) type 1

o Overwrites old data with the new value, so no history is kept

Slowly changing dimensions (SCD) type 2

o A new row is added for every change (for example, an address change), with a start and end date, so the full history is kept
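
A minimal sketch of the Type 2 idea in plain Python, assuming the dimension is a list of dicts and that an end date of None marks the current row. The function name apply_scd2 and the field names are hypothetical.

```python
from datetime import date

def apply_scd2(rows, customer_id, new_address, change_date):
    """Expire the current row for customer_id and append a new current row."""
    for row in rows:
        if row["customer_id"] == customer_id and row["end_date"] is None:
            if row["address"] == new_address:
                return rows  # nothing changed, keep history as is
            row["end_date"] = change_date  # close the old version
    rows.append({
        "customer_id": customer_id,
        "address": new_address,
        "start_date": change_date,
        "end_date": None,  # None marks the current version
    })
    return rows

dim_customer = [
    {"customer_id": 1, "address": "12 Oak St",
     "start_date": date(2020, 1, 1), "end_date": None},
]
apply_scd2(dim_customer, 1, "98 Pine Ave", date(2023, 6, 1))
for row in dim_customer:
    print(row)
```

Both the old and the new address remain queryable, each with the date range in which it was valid.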

Slowly changing dimensions (SCD) type 3

o Limited history: a column such as prev_addr is added to hold only the previous value, so only the last change is tracked
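
A small sketch of the Type 3 behavior, assuming a single dict per customer with the prev_addr column from the card. The helper name apply_scd3 is hypothetical.

```python
def apply_scd3(row, new_address):
    """Shift the current address into prev_addr, then overwrite the current value."""
    if row["address"] != new_address:
        row["prev_addr"] = row["address"]
        row["address"] = new_address
    return row

customer = {"customer_id": 1, "address": "12 Oak St", "prev_addr": None}
apply_scd3(customer, "98 Pine Ave")
apply_scd3(customer, "5 Elm Rd")
print(customer)
```

After the second change the original address is gone: only the current value and the one before it survive.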

FACT table

Stores measurable business events

DIM table

Stores descriptive attributes

Landing vs staging

Landing holds data as is. Staging holds data that has been modified, possibly across multiple layers, and is structured.

How do you approach ETL design?

* Gathering stakeholder needs.
* Analyzing data sources.
* Researching architecture and processes.
* Proposing a solution.
* Fine-tuning the solution based on feedback.
* Launching the solution and user onboarding.

Typical ETL sources?

-Databases
-Cloud storage such as S3 or Blob Storage
-Flat files (CSV, JSON)
-APIs and web scraping

What data validation techniques would you use?

Schema validation
o PySpark schema enforcement and SQL constraints to detect missing or incorrect fields
Data type & format checks
o Date formats, numerical precision, text standardization
Business rule validation
o Custom business logic (a mortgage balance can't be negative)
Duplicate detection
o Window functions and deduplication rules
Range and anomaly detection
o Flag outliers using aggregate and statistical checks
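
The checks above can be sketched in plain Python without PySpark. The schema, field names (loan_id, balance, opened), and the z-score threshold are assumptions made for this example, not part of any particular pipeline.

```python
from collections import Counter
from datetime import date
from statistics import mean, pstdev

EXPECTED_SCHEMA = {"loan_id": int, "balance": float, "opened": str}  # hypothetical schema

def validate(records):
    """Return a list of (record_index, problem) tuples."""
    problems = []
    id_counts = Counter(r.get("loan_id") for r in records)
    for i, rec in enumerate(records):
        # Schema and type checks: every expected field is present with the right type
        for field, expected_type in EXPECTED_SCHEMA.items():
            if field not in rec:
                problems.append((i, f"missing field: {field}"))
            elif not isinstance(rec[field], expected_type):
                problems.append((i, f"wrong type for {field}"))
        # Format check: dates must be ISO formatted (YYYY-MM-DD)
        if isinstance(rec.get("opened"), str):
            try:
                date.fromisoformat(rec["opened"])
            except ValueError:
                problems.append((i, "bad date format"))
        # Business rule: a mortgage balance can't be negative
        if isinstance(rec.get("balance"), float) and rec["balance"] < 0:
            problems.append((i, "negative balance"))
        # Duplicate detection on the key column
        if id_counts[rec.get("loan_id")] > 1:
            problems.append((i, "duplicate loan_id"))
    return problems

def flag_outliers(values, z_threshold=2.0):
    """Flag values whose z-score exceeds the threshold (simple statistical check)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > z_threshold]

records = [
    {"loan_id": 1, "balance": 250000.0, "opened": "2021-03-15"},
    {"loan_id": 2, "balance": -50.0, "opened": "2022-01-10"},
    {"loan_id": 2, "balance": 1000.0, "opened": "10/01/2022"},
    {"loan_id": 4, "opened": "2020-05-05"},
]
problems = validate(records)
outliers = flag_outliers([100, 102, 98, 101, 99, 100, 500])
print(problems)
print(outliers)
```

In a Spark or SQL setting the same rules would typically be expressed as schema definitions, constraints, and window functions; this version only shows the logic.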

How have you optimized ETL workflows in the past?

Refactored SQL queries for performance
o Explain plan
o Optimized joins and aggregations
o Removed unnecessary subqueries and used CTEs instead
Partitioning and indexing (faster data retrieval)
o Table partitioning by date, location to improve query speeds
o Created indexes on frequently queried columns reducing scan time
Incremental loading
o Change Data Capture (CDC) – only process new or updated records
Parallel processing and workflow automation
o Automated error handling and retry mechanisms to prevent workflow failure.
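
The incremental loading point above can be sketched with a timestamp watermark. This is a simplified stand-in for CDC (real CDC usually reads a database change log); the function name incremental_load and the updated_at field are assumptions for illustration.

```python
def incremental_load(source_rows, target, last_watermark):
    """Upsert only rows changed after last_watermark; return the new watermark."""
    new_watermark = last_watermark
    for row in source_rows:
        if row["updated_at"] > last_watermark:
            target[row["id"]] = row  # insert or overwrite by key
            new_watermark = max(new_watermark, row["updated_at"])
    return new_watermark

source = [
    {"id": 1, "value": "a", "updated_at": 10},
    {"id": 2, "value": "b", "updated_at": 20},
]
target = {}
watermark = incremental_load(source, target, last_watermark=0)  # first run loads both rows

source[0] = {"id": 1, "value": "a2", "updated_at": 30}    # row 1 is updated
source.append({"id": 3, "value": "c", "updated_at": 35})  # a new row arrives
watermark = incremental_load(source, target, last_watermark=watermark)
print(watermark, sorted(target))
```

On the second run row 2 is skipped entirely because its timestamp is not newer than the stored watermark.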

Normal view

Dynamically fetches data each time it is queried; only the query definition is stored

Materialized view

Precomputed, stored result of a query; it must be refreshed to reflect new data
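
The difference between the two kinds of view can be shown with SQLite. SQLite has no materialized views, so this sketch uses a table created from a query as a stand-in; the table names and data are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
    INSERT INTO orders VALUES (1, 10.0), (2, 20.0);
    -- Normal view: only the query text is stored
    CREATE VIEW v_total AS SELECT SUM(amount) AS total FROM orders;
    -- Stand-in for a materialized view: the result itself is stored
    CREATE TABLE mv_total AS SELECT SUM(amount) AS total FROM orders;
""")

conn.execute("INSERT INTO orders VALUES (3, 30.0)")

# The view sees the new row; the stored result is stale until refreshed.
view_total = conn.execute("SELECT total FROM v_total").fetchone()[0]
stored_total = conn.execute("SELECT total FROM mv_total").fetchone()[0]
print(view_total, stored_total)

# "Refreshing" the stand-in means recomputing and replacing the stored rows.
conn.executescript("""
    DELETE FROM mv_total;
    INSERT INTO mv_total SELECT SUM(amount) FROM orders;
""")
refreshed_total = conn.execute("SELECT total FROM mv_total").fetchone()[0]
print(refreshed_total)
```

Databases with native materialized views provide their own refresh command; the manual delete and insert here only imitates that step.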

OLAP cubes and structures

An OLAP cube is a multi-dimensional data structure that enables fast analytical queries by pre-aggregating data along different dimensions.
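
A toy version of the pre-aggregation idea: the sketch below computes totals for every combination of dimensions up front, so later queries are lookups. The function build_cube and the sample sales rows are hypothetical, and real OLAP engines use far more compact storage.

```python
from collections import defaultdict
from itertools import combinations

def build_cube(rows, dimensions, measure):
    """Pre-aggregate `measure` for every subset of `dimensions`."""
    cube = defaultdict(float)
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            for row in rows:
                key = (dims, tuple(row[d] for d in dims))
                cube[key] += row[measure]
    return dict(cube)

sales = [
    {"region": "East", "year": 2023, "amount": 100.0},
    {"region": "East", "year": 2024, "amount": 150.0},
    {"region": "West", "year": 2024, "amount": 200.0},
]
cube = build_cube(sales, ("region", "year"), "amount")

# Queries become dictionary lookups instead of scans over the raw rows.
print(cube[(("region",), ("East",))])               # total for East across all years
print(cube[(("year",), (2024,))])                   # total for 2024 across all regions
print(cube[(("region", "year"), ("East", 2024))])   # one cell of the cube
print(cube[((), ())])                               # grand total
```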

Types of OLAP

MOLAP (Multidimensional OLAP)
ROLAP (Relational OLAP)
HOLAP (Hybrid OLAP)

MOLAP (Multidimensional OLAP)

* Stores precomputed OLAP cubes in a specialized database

ROLAP (Relational OLAP)

* Stores data in relational databases (e.g., SQL, Snowflake, Redshift)

HOLAP (Hybrid OLAP)

* Combines MOLAP & ROLAP – Some data is precomputed, and some is queried dynamically

What is a cluster?

A pool of computers working together but viewed as a single system

What is a container?

A container is an isolated virtual runtime environment that comes with its own CPU and memory allocation.

First normal form (1NF)

Each row in the table should have a unique identifier, and each value in the table should be indivisible (atomic).
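
A short sketch of the atomic-value rule, assuming a table where several phone numbers were packed into one column. The helper to_first_normal_form and the sample data are made up for illustration.

```python
def to_first_normal_form(rows, key, multi_valued_column, sep=","):
    """Split a delimiter-packed column into one atomic value per row."""
    normalized = []
    for row in rows:
        for value in row[multi_valued_column].split(sep):
            normalized.append({key: row[key], multi_valued_column: value.strip()})
    return normalized

unnormalized = [
    {"customer_id": 1, "phone": "555-0100, 555-0101"},  # two values in one cell
    {"customer_id": 2, "phone": "555-0200"},
]
customer_phones = to_first_normal_form(unnormalized, "customer_id", "phone")
for row in customer_phones:
    print(row)
```

In the result each cell holds a single phone number, and the pair (customer_id, phone) identifies a row uniquely.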

Second Normal Form (2NF)

Requires 1NF, plus each non-key column must depend on the whole primary key, not on just part of a composite key. In other words, there should be no partial dependencies in the table.

Third Normal Form (3NF)

Requires 2NF, plus no transitive dependencies: each non-key column should depend directly on the primary key, and not on any other non-key column in the same table.
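
As a small illustration, the employee table below stores dept_name next to dept_id, so dept_name depends on another non-key column rather than directly on emp_id. Splitting it into two tables removes that dependency. The data and names are invented for the example.

```python
employees = [
    {"emp_id": 1, "name": "Ana", "dept_id": 10, "dept_name": "Sales"},
    {"emp_id": 2, "name": "Ben", "dept_id": 10, "dept_name": "Sales"},
    {"emp_id": 3, "name": "Cy", "dept_id": 20, "dept_name": "HR"},
]

# dept_name is determined by dept_id, so it moves to its own table keyed by dept_id.
departments = {e["dept_id"]: e["dept_name"] for e in employees}

# The employee table keeps only columns that depend directly on emp_id.
employees_3nf = [{k: e[k] for k in ("emp_id", "name", "dept_id")} for e in employees]

print(departments)
print(employees_3nf)
```

Renaming a department now means changing one entry in departments instead of every matching employee row.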

BCNF — Boyce-Codd Normal Form

Ensures that every determinant in a table is a candidate key. In other words, whenever one set of columns determines another column, that set must be a candidate key (more precisely, a superkey).

Fourth Normal Form (4NF)

Used to eliminate multi-valued dependencies in a table. A multi-valued dependency occurs when one key determines two or more independent sets of values (for example, an employee's skills and an employee's languages), which forces the table to list every combination of them.

Fifth normal form (5NF)

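Also called Project-Join Normal Form (PJNF). It is used to handle complex many-to-many relationships in a database: a table is in 5NF when it cannot be split into smaller tables and rejoined without losing or inventing rows, except in ways already implied by its candidate keys.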

Star Schema: Pros:

1. Simplicity: With denormalized dimension tables, the structure is easier to understand and navigate.
2. Query Performance: Fewer joins are required, which can improve the speed of querying large datasets.
3. Adaptability: It’s relatively easy to add new dimensions without altering existing structure.

Star Schema: Cons:

1. Redundancy: Denormalization can lead to data redundancy in dimension tables.
2. Storage: The redundant data might require more storage space.
3. Maintenance: Redundancy can make updates and inserts more complex, potentially leading to inconsistencies.

Snowflake Schema: Pros:

1. Normalization: Reduced data redundancy due to the normalization of dimension tables.
2. Storage Efficiency: Uses less storage space compared to the star schema because of reduced redundancy.
3. Clear Structure: The normalized structure can be clearer for those familiar with relational database design.

Snowflake Schema: Cons:

tmlsn Flashcards

(39 cards)