Snapshots Flashcards
(61 cards)
What problem do snapshots in dbt solve?
Snapshots help analysts track changes over time in mutable source tables, enabling historical analysis where the original data system does not maintain history.
What type of Slowly Changing Dimensions do dbt snapshots implement?
dbt snapshots implement Type-2 Slowly Changing Dimensions, which create new records for changes to capture historical states.
What is a real-world example of a field that would benefit from snapshotting?
A field like status
in an orders
table that changes over time, such as going from ‘pending’ to ‘shipped’, benefits from snapshotting.
What happens to data in a table when a field like ‘status’ is overwritten without snapshotting?
The original state is lost, making it impossible to analyze how the field changed over time, such as how long an order was ‘pending’.
How does dbt store historical data using snapshots?
It creates a new record in the snapshot table for each change, with dbt_valid_from
and dbt_valid_to
fields indicating the time range of validity.
What are the key columns added by dbt to a snapshot table?
dbt_valid_from
and dbt_valid_to
are added to indicate when a particular version of a row was valid.
What does a snapshot record look like for a status that changed from ‘pending’ to ‘shipped’?
Two rows would exist: one for ‘pending’ with a set dbt_valid_to
, and another for ‘shipped’ with dbt_valid_to
set to null.
What command is used to generate snapshots in a dbt project?
The dbt snapshot
command is used to execute and materialize snapshot logic.
Where are snapshot configurations defined in dbt?
Snapshot configurations are defined in YAML files located in either the models
or snapshots
directory.
What are the required configuration fields for a snapshot in dbt?
You must define strategy
and unique_key
. Depending on the strategy, you also need check_cols
(for check) or updated_at
(for timestamp).
What is the purpose of the unique_key
in snapshot configuration?
It specifies the primary key (or keys) used to uniquely identify a record for change detection.
What are the two snapshot strategies supported in dbt?
The two strategies are timestamp
and check
. timestamp
uses a column to track updates, while check
compares column values.
What does the check_cols
config do?
Used with the check
strategy, it defines which columns to compare between runs to detect changes.
What is the dbt_valid_to_current
config used for?
It allows you to set a custom value (e.g., ‘9999-12-31’) for the dbt_valid_to
column on current records instead of using NULL.
How does the hard_deletes
configuration affect snapshots?
It controls how deleted rows in the source are handled. Options include ignore
, invalidate
, or new_record
.
What is the default behavior of the hard_deletes
config?
The default value is ignore
, meaning deletions in the source are not tracked in the snapshot table.
What should be included in your snapshot select statement?
Typically all columns from the source, though transformations should be done in a separate ephemeral model.
What is the benefit of using an ephemeral model in conjunction with a snapshot?
It allows for pre-processing, such as filtering or deduplication, without persisting intermediate tables.
What macro is used to determine schema placement when schema
is not explicitly defined?
The generate_schema_name
macro determines where to build snapshots if schema
is not set.
What happens during the first run of the dbt snapshot
command?
A snapshot table is created containing the current result set plus dbt_valid_from
and dbt_valid_to
columns.
What happens during subsequent runs of dbt snapshot
?
dbt detects changes by strategy, updates dbt_valid_to
for old records, and inserts updated/new records.
How can snapshot meta field names be customized?
By using the snapshot_meta_column_names
configuration, you can override default column names.
How are snapshots referenced in downstream models?
Use the ref()
function, just like with referencing models or sources.