What is a window function pattern used for ranking rows?
A ranking window function like ROW_NUMBER or RANK lets you order rows inside groups without collapsing them. It is like sorting students within each class and numbering them.
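As a small sketch of the idea, the example below numbers students inside each class using SQLite through Python's sqlite3 module. The table name scores and its columns are made up for illustration, and window functions assume SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical table: one row per student with a class and a score.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (class TEXT, student TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("A", "Ann", 90), ("A", "Bob", 90), ("A", "Cy", 70),
     ("B", "Dee", 85), ("B", "Eli", 60)],
)

# ROW_NUMBER gives every row a unique position inside its class.
# RANK gives tied scores the same rank and then skips ahead.
ranked = conn.execute("""
    SELECT class, student, score,
           ROW_NUMBER() OVER (PARTITION BY class ORDER BY score DESC, student) AS row_num,
           RANK()       OVER (PARTITION BY class ORDER BY score DESC)          AS rnk
    FROM scores
    ORDER BY class, row_num
""").fetchall()

for row in ranked:
    print(row)
```

All five input rows are still present in the result. Ann and Bob tie on score, so they share rank 1 but get row numbers 1 and 2.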
How do data engineers use ROW_NUMBER() to remove duplicates?
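You partition the rows by the columns that define a duplicate and number them with ROW_NUMBER, ordered so the row you want to keep comes first (for example, most recent first). Then you keep only row number 1 in each group. This removes duplicates while keeping the best or most recent row.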
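Here is a minimal sketch of that pattern, assuming a made-up table customers_raw where customer_id identifies a duplicate and updated_at decides which copy is the most recent. It runs on SQLite 3.25 or newer through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers_raw (customer_id INTEGER, email TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO customers_raw VALUES (?, ?, ?)",
    [
        (1, "old@example.com", "2024-01-01"),
        (1, "new@example.com", "2024-03-01"),
        (2, "solo@example.com", "2024-02-15"),
        (3, "a@example.com", "2024-01-10"),
        (3, "b@example.com", "2024-01-20"),
    ],
)

# Number the rows per customer, newest first, then keep only row number 1.
deduped = conn.execute("""
    SELECT customer_id, email, updated_at
    FROM (
        SELECT customer_id, email, updated_at,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id
                   ORDER BY updated_at DESC
               ) AS rn
        FROM customers_raw
    ) AS numbered
    WHERE rn = 1
    ORDER BY customer_id
""").fetchall()

print(deduped)
```

The ORDER BY inside the window decides which duplicate wins, so it is worth choosing that column deliberately.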
What is a deduplication query?
A query used to remove duplicate rows from a dataset while keeping the most relevant record. It usually uses ROW_NUMBER with PARTITION BY.
What does PARTITION BY help achieve in analytics queries?
It divides data into groups so calculations run separately inside each group. Unlike GROUP BY, it does not collapse the rows, so every input row still appears in the result.
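To make the difference from GROUP BY concrete, this sketch computes a regional total both ways. The sales table and its columns are hypothetical, and the window query assumes SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100), ("east", 300), ("west", 50)],
)

# PARTITION BY: every row is kept and carries its group's total.
with_totals = conn.execute("""
    SELECT region, amount,
           SUM(amount) OVER (PARTITION BY region) AS region_total
    FROM sales
    ORDER BY region, amount
""").fetchall()

# GROUP BY: rows are collapsed to one row per group.
grouped = conn.execute("""
    SELECT region, SUM(amount) AS region_total
    FROM sales
    GROUP BY region
    ORDER BY region
""").fetchall()

print(with_totals)
print(grouped)
```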
What is a Slowly Changing Dimension (SCD)?
A dimension whose attribute values change occasionally over time, such as a customer's address. SCD techniques define how to handle those changes: either overwrite the old value or keep historical records so you know what the data looked like in the past.
What is SCD Type 1?
Type 1 simply overwrites old data. It keeps only the most recent value and does not track history.
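A Type 1 change can be sketched as a plain UPDATE. The dim_customer table and the city values below are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, city TEXT)")
conn.execute("INSERT INTO dim_customer VALUES (1, 'Lisbon')")

# The customer moves: Type 1 overwrites the value in place.
conn.execute("UPDATE dim_customer SET city = 'Porto' WHERE customer_id = 1")

rows = conn.execute("SELECT customer_id, city FROM dim_customer").fetchall()
print(rows)  # only the latest city remains; 'Lisbon' is gone
```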
What is SCD Type 2?
Type 2 keeps historical versions of rows. When a value changes, the old row is closed out and a new row is created, typically with start and end timestamps (and often a current-row flag), so you can see past states of the data.
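One possible way to apply a Type 2 change is sketched below. The column names valid_from, valid_to, and is_current, and the helper apply_change, are assumptions for illustration and not a standard API. Real warehouses often do this with a MERGE statement instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_sk INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        customer_id INTEGER,                            -- natural key
        city        TEXT,
        valid_from  TEXT,
        valid_to    TEXT,     -- NULL means the row is still current
        is_current  INTEGER
    )
""")

def apply_change(conn, customer_id, new_city, change_date):
    """Close the current row (if the value changed) and insert a new version."""
    current = conn.execute(
        "SELECT customer_sk, city FROM dim_customer "
        "WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()
    if current is not None and current[1] == new_city:
        return  # nothing changed, so no new version is needed
    if current is not None:
        conn.execute(
            "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
            "WHERE customer_sk = ?",
            (change_date, current[0]),
        )
    conn.execute(
        "INSERT INTO dim_customer (customer_id, city, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, new_city, change_date),
    )

apply_change(conn, 1, "Lisbon", "2023-01-01")
apply_change(conn, 1, "Porto", "2024-06-01")
apply_change(conn, 1, "Porto", "2024-07-01")  # same value, so no new row

history = conn.execute(
    "SELECT city, valid_from, valid_to, is_current "
    "FROM dim_customer ORDER BY customer_sk"
).fetchall()
print(history)
```

The old Lisbon row is kept with an end date, and the Porto row becomes the current version.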
Why do data warehouses use SCD Type 2?
Because analysts often need to know what the data looked like at a specific time in history.
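A point-in-time lookup against a Type 2 table might look like the sketch below. It assumes a hypothetical dim_customer table with valid_from and valid_to columns stored as ISO date strings, where a NULL valid_to marks the current row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_id INTEGER, city TEXT, valid_from TEXT, valid_to TEXT
    )
""")
conn.executemany(
    "INSERT INTO dim_customer VALUES (?, ?, ?, ?)",
    [
        (1, "Lisbon", "2023-01-01", "2024-06-01"),
        (1, "Porto", "2024-06-01", None),  # current version
    ],
)

def city_as_of(conn, customer_id, as_of_date):
    """Return the city that was valid on the given date, or None."""
    row = conn.execute(
        "SELECT city FROM dim_customer "
        "WHERE customer_id = ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR valid_to > ?)",
        (customer_id, as_of_date, as_of_date),
    ).fetchone()
    return row[0] if row else None

print(city_as_of(conn, 1, "2023-09-15"))  # Lisbon
print(city_as_of(conn, 1, "2024-08-01"))  # Porto
```

With Type 1 this question could not be answered, because the Lisbon value would have been overwritten.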
What is an incremental data pipeline?
A pipeline that processes only the records that are new or changed since the last run, instead of reprocessing all data every time.
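One common way to find "new or changed since the last run" is a watermark: the pipeline remembers the highest updated_at value it has seen. The sketch below assumes that approach. The table names and the run_incremental helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_orders (order_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
    CREATE TABLE target_orders (order_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
    CREATE TABLE pipeline_state (name TEXT PRIMARY KEY, watermark TEXT);
    INSERT INTO pipeline_state VALUES ('orders', '');
""")

def run_incremental(conn):
    """Copy only rows changed after the stored watermark; return how many."""
    (watermark,) = conn.execute(
        "SELECT watermark FROM pipeline_state WHERE name = 'orders'"
    ).fetchone()
    changed = conn.execute(
        "SELECT order_id, amount, updated_at FROM source_orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    if changed:
        # Upsert: new orders are inserted, changed orders are replaced.
        conn.executemany("INSERT OR REPLACE INTO target_orders VALUES (?, ?, ?)", changed)
        conn.execute(
            "UPDATE pipeline_state SET watermark = ? WHERE name = 'orders'",
            (changed[-1][2],),
        )
    return len(changed)

conn.executemany(
    "INSERT INTO source_orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01T10:00"), (2, 20.0, "2024-01-01T11:00")],
)
first_run = run_incremental(conn)   # both rows are new

conn.execute(
    "UPDATE source_orders SET amount = 25.0, updated_at = '2024-01-02T09:00' "
    "WHERE order_id = 2"
)
conn.execute("INSERT INTO source_orders VALUES (3, 30.0, '2024-01-02T09:30')")
second_run = run_incremental(conn)  # one changed row plus one new row
third_run = run_incremental(conn)   # nothing changed since the last run

print(first_run, second_run, third_run)
```

Order 1 is read only once, even though the pipeline ran three times. This sketch relies on the source reliably updating updated_at; a source without such a column would need a different change-detection method.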
Why are incremental pipelines important?
They make large systems scalable because processing millions of old records repeatedly would waste time and resources.
What is a time-series query?
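A query that analyzes data ordered by time, for example by grouping events into daily or monthly buckets or by comparing each value with earlier ones.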
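As a simple illustration, the sketch below groups timestamped events into daily buckets. The events table is made up, and date() is SQLite's built-in date function; other databases use different functions for truncating timestamps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT, value REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("2024-01-01 09:15:00", 5.0),
        ("2024-01-01 17:40:00", 7.0),
        ("2024-01-02 08:05:00", 3.0),
    ],
)

# Bucket events by calendar day and summarize each bucket.
daily = conn.execute("""
    SELECT date(ts) AS day, COUNT(*) AS events, SUM(value) AS total
    FROM events
    GROUP BY date(ts)
    ORDER BY day
""").fetchall()

print(daily)
```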
What is a rolling average query?
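A rolling average calculates an average over a moving window of rows, for example the current row and the two rows before it. The window slides forward one row at a time.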
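A three-row rolling average can be sketched with a window frame. The daily_sales table is hypothetical, and the query assumes SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [
        ("2024-01-01", 10), ("2024-01-02", 20), ("2024-01-03", 30),
        ("2024-01-04", 40), ("2024-01-05", 50),
    ],
)

# The frame covers the current row plus the two rows before it.
rolling = conn.execute("""
    SELECT day, amount,
           AVG(amount) OVER (
               ORDER BY day
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS avg_3
    FROM daily_sales
    ORDER BY day
""").fetchall()

for row in rolling:
    print(row)
```

The first two rows average over fewer than three values because the window has not filled up yet.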
What problem does a rolling window solve?
It smooths out fluctuations so trends become easier to see.
What is a star schema?
A data warehouse design where a central fact table connects to multiple dimension tables, forming a star shape.
What is a fact table?
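A table that stores measurable events such as sales. Each row typically holds numeric measures (quantity, amount) plus keys that point to dimension tables.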
What is a dimension table?
A table that stores descriptive information such as customers, products, or dates. Its columns are used to filter and group the facts.
Why do data warehouses separate fact and dimension tables?
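Because it organizes data for faster analytical queries and clearer relationships. Descriptive attributes are stored once in a dimension table instead of being repeated on every fact row.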
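A tiny star schema can be sketched as one fact table joined to two dimension tables. All table names, column names, and rows below are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimensions: descriptive attributes, one row per customer or product.
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, name TEXT, category TEXT);

    -- Fact: one row per sale, holding measures plus keys into the dimensions.
    CREATE TABLE fact_sales (
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        product_key  INTEGER REFERENCES dim_product(product_key),
        quantity     INTEGER,
        amount       REAL
    );

    INSERT INTO dim_customer VALUES (1, 'Ann', 'PT'), (2, 'Bob', 'ES');
    INSERT INTO dim_product  VALUES (10, 'Pen', 'stationery'), (11, 'Lamp', 'home');
    INSERT INTO fact_sales   VALUES (1, 10, 3, 6.0), (1, 11, 1, 25.0), (2, 10, 5, 10.0);
""")

# Typical analytical query: aggregate the facts, grouped by dimension attributes.
revenue = conn.execute("""
    SELECT c.country, p.category, SUM(f.amount) AS revenue
    FROM fact_sales AS f
    JOIN dim_customer AS c ON c.customer_key = f.customer_key
    JOIN dim_product  AS p ON p.product_key  = f.product_key
    GROUP BY c.country, p.category
    ORDER BY c.country, p.category
""").fetchall()

print(revenue)
```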
What is a surrogate key?
A generated ID used as the primary key in warehouse tables instead of natural identifiers.
Why are surrogate keys useful?
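Because natural identifiers such as emails or product codes can change, be reused, or clash across source systems. A surrogate key stays stable, so joins between fact and dimension tables keep working.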
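The sketch below shows the idea with an auto-generated key. It assumes email is the natural identifier, which is only an example; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        email        TEXT,                               -- natural identifier
        name         TEXT
    );
    CREATE TABLE fact_sales (customer_key INTEGER, amount REAL);
""")

cur = conn.execute(
    "INSERT INTO dim_customer (email, name) VALUES ('ann@old.example', 'Ann')"
)
ann_key = cur.lastrowid  # the generated surrogate key
conn.execute("INSERT INTO fact_sales VALUES (?, ?)", (ann_key, 40.0))

# The natural identifier changes, but the surrogate key does not.
conn.execute(
    "UPDATE dim_customer SET email = 'ann@new.example' WHERE customer_key = ?",
    (ann_key,),
)

joined = conn.execute("""
    SELECT c.email, f.amount
    FROM fact_sales AS f
    JOIN dim_customer AS c ON c.customer_key = f.customer_key
""").fetchall()

print(joined)  # the old sale still joins to the customer
```

If the fact table had stored the email instead, the existing sale would no longer match after the change.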
What is a staging table?
An intermediate table that holds raw data as it arrives from the source, before it is cleaned or transformed.
Why do pipelines use staging tables?
They allow data engineers to validate and transform raw data safely before loading it into production tables.
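As a rough sketch, raw values can land in a staging table as text, get validated, and only then be loaded into the production table. The tables stg_orders and orders and the helper parse_row are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_orders (order_id TEXT, amount TEXT);  -- raw, untyped
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL NOT NULL);
""")

# Raw input lands in staging exactly as received, bad rows included.
raw_rows = [("1", "19.99"), ("2", "not-a-number"), ("", "5.00"), ("3", " 7.5 ")]
conn.executemany("INSERT INTO stg_orders VALUES (?, ?)", raw_rows)

def parse_row(order_id, amount):
    """Return a typed (order_id, amount) tuple, or None if the row is invalid."""
    try:
        return int(order_id), float(amount)
    except ValueError:
        return None

staged = conn.execute("SELECT order_id, amount FROM stg_orders").fetchall()
clean = [parsed for parsed in (parse_row(*row) for row in staged) if parsed is not None]
rejected = len(staged) - len(clean)

# Only validated rows reach the production table.
conn.executemany("INSERT INTO orders VALUES (?, ?)", clean)
loaded = conn.execute("SELECT order_id, amount FROM orders ORDER BY order_id").fetchall()

print(loaded, "rejected:", rejected)
```

The bad rows stay in staging where they can be inspected, and the production table never sees them.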
What is a data pipeline?
A system that automatically moves and transforms data from source systems to storage or analytics systems.
What is batch processing?
Processing large groups of data at scheduled intervals rather than in real time.
What is streaming processing?
Processing data continuously as it arrives rather than waiting for batches.
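The contrast between batch and streaming can be sketched in plain Python. The function names and the sample events are invented; a real system would read from files, queues, or a streaming platform instead of a list.

```python
from typing import Iterable, Iterator, List

events: List[int] = [4, 8, 15, 16, 23, 42]

def process_batch(batch: List[int]) -> int:
    """Batch: the whole group is available, so it is processed in one go."""
    return sum(batch)

def process_stream(stream: Iterable[int]) -> Iterator[int]:
    """Streaming: update the result each time a single event arrives."""
    total = 0
    for event in stream:
        total += event
        yield total  # an up-to-date result is available after every event

batch_total = process_batch(events)
running_totals = list(process_stream(iter(events)))

print(batch_total)
print(running_totals)
```

Both approaches reach the same final total. The streaming version also produces an answer after every event, while the batch version produces one answer per scheduled run.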