tmlsn Flashcards
(39 cards)
What is a data warehouse
Centralized system designed to store process and analyze large volumes of data from multiple sources
Optimized for queries and reporting rather than transactions.
OLTP
Fast reads / writes
Day to day transactions
Real-time, current data
OLAP
Complex queries
Analytical queries and reporting
Historical, aggregated data
Read optimized for querying
Database
OLTP online transaction processing
-structured system for storing and managing real time transactions handling read/writes for day to day
-sql server
-maintaining inventory records in a store
Data warehouse
OLAP online analytical processing
-central repo optimized for storing historical data and performing complex analytical queries. Holds structured data
-red shift, snowflake
-trends on last 5 years, hr report across departments
Data lake
-massive storage holding raw unstructed semi struct and structured data.
-used for data processing, ml
-s3, azure data lake, Hadoop
-storing raw sensor data from IoT devices
-unfiltered ocean of information
Data mart
Subset of warehouse focused on specific department or business unit (sales, hr, marketing)
-provides faster access to relevant insights
-structured data
-hr analyzing employee retention
- Star schema
o One fact connected to multiple dimension tables
o Optimized for fast querying
o Sales fact table
- Snowflake schema
o Normalizes dimensions into sub-dimensions
o Reduces data redundancy but increases query complexity
- Galaxy schema – fact constellation
o Multiple fact tables share common dimension tables
o Complex business scenarios (sales, returns data)
Slowly changing dimensions (SCD) type 1
o Overwrites old data
Slowly changing dimensions (SCD) type 2
o New row is added for every address change with start and end date
Slowly changing dimensions (SCD) type 3
o Limited history (prev_addr is added to track only the last change)
FACT table
- Stores measurable business events
DIM table
- Stores descriptive attributes
Landing vs staging
Landing is data as is staging is modified, multiple layers, structured
How do you approach ETL design
Gathering stakeholder needs.
* Analyzing data sources.
* Researching architecture and processes.
* Proposing a solution.
* Fine-tuning the solution based on feedback.
* Launching the solution and user onboarding.
TYPICAL ETL SOURCES?
-databases
-cloud storage like s3, blob storage
-flat files (csv, Json)
-APIs, like for web scraping
What data validation techniques would you use?
- Schema validation
o Pyspark schema enforcement and SQL constraints to detect missing or incorrect fields - Data type & format checks
o Date formats, numerical precision, text standardization - Business rule validation
o Custom business logic (mortgage balance cant be negative) - Duplicate detection
o Windows functions and deduplication rules - Range and anomaly detection
o Flagged outliers using aggregate and statistical checks
How have you optimized etl workflows in the past?
- Refactored SQL queries for performance
o Explain plan
o Optimized joins and aggregations
o Removed unnecessary sub queries and used CTEs instead - Partitioning and indexing (faster data retrieval)
o Table partitioning by date, location to improve query speeds
o Created indexes on frequently queried columns reducing scan time - Incremental loading
o Change Data Capture (CDC) – only process new or updated records - Parallel processing and workflow automation
o Automated error handling and retry mechanisms to prevent workflow failure.
Normal view
dynamically fetches data on each query
Materialized view
precomputed stored result of a query
OLAP cubes and structures
An OLAP cube is a multi-dimensional data structure that enables fast analytical queries by pre-aggregating data into different dimensions.
types of OLAP
MOLAP (Multidimensional OLAP)
ROLAP (Relational OLAP)
HOLAP (Hybrid OLAP)