Snapshots Flashcards

(61 cards)

1
Q
A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What problem do snapshots in dbt solve?

A

Snapshots help analysts track changes over time in mutable source tables, enabling historical analysis where the original data system does not maintain history.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What type of Slowly Changing Dimensions do dbt snapshots implement?

A

dbt snapshots implement Type-2 Slowly Changing Dimensions, which create new records for changes to capture historical states.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is a real-world example of a field that would benefit from snapshotting?

A

A field like status in an orders table that changes over time, such as going from ‘pending’ to ‘shipped’, benefits from snapshotting.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What happens to data in a table when a field like ‘status’ is overwritten without snapshotting?

A

The original state is lost, making it impossible to analyze how the field changed over time, such as how long an order was ‘pending’.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

How does dbt store historical data using snapshots?

A

It creates a new record in the snapshot table for each change, with dbt_valid_from and dbt_valid_to fields indicating the time range of validity.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What are the key columns added by dbt to a snapshot table?

A

dbt_valid_from and dbt_valid_to are added to indicate when a particular version of a row was valid.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What does a snapshot record look like for a status that changed from ‘pending’ to ‘shipped’?

A

Two rows would exist: one for ‘pending’ with a set dbt_valid_to, and another for ‘shipped’ with dbt_valid_to set to null.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What command is used to generate snapshots in a dbt project?

A

The dbt snapshot command is used to execute and materialize snapshot logic.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Where are snapshot configurations defined in dbt?

A

Snapshot configurations are defined in YAML files located in either the models or snapshots directory.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What are the required configuration fields for a snapshot in dbt?

A

You must define strategy and unique_key. Depending on the strategy, you also need check_cols (for check) or updated_at (for timestamp).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What is the purpose of the unique_key in snapshot configuration?

A

It specifies the primary key (or keys) used to uniquely identify a record for change detection.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What are the two snapshot strategies supported in dbt?

A

The two strategies are timestamp and check. timestamp uses a column to track updates, while check compares column values.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What does the check_cols config do?

A

Used with the check strategy, it defines which columns to compare between runs to detect changes.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is the dbt_valid_to_current config used for?

A

It allows you to set a custom value (e.g., ‘9999-12-31’) for the dbt_valid_to column on current records instead of using NULL.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

How does the hard_deletes configuration affect snapshots?

A

It controls how deleted rows in the source are handled. Options include ignore, invalidate, or new_record.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

What is the default behavior of the hard_deletes config?

A

The default value is ignore, meaning deletions in the source are not tracked in the snapshot table.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

What should be included in your snapshot select statement?

A

Typically all columns from the source, though transformations should be done in a separate ephemeral model.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

What is the benefit of using an ephemeral model in conjunction with a snapshot?

A

It allows for pre-processing, such as filtering or deduplication, without persisting intermediate tables.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

What macro is used to determine schema placement when schema is not explicitly defined?

A

The generate_schema_name macro determines where to build snapshots if schema is not set.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

What happens during the first run of the dbt snapshot command?

A

A snapshot table is created containing the current result set plus dbt_valid_from and dbt_valid_to columns.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

What happens during subsequent runs of dbt snapshot?

A

dbt detects changes by strategy, updates dbt_valid_to for old records, and inserts updated/new records.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

How can snapshot meta field names be customized?

A

By using the snapshot_meta_column_names configuration, you can override default column names.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

How are snapshots referenced in downstream models?

A

Use the ref() function, just like with referencing models or sources.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
What best practice should you follow when choosing a schema for snapshots?
Use a schema separate from your models’ schema to organize your data assets clearly.
26
Why should snapshots be run frequently?
Frequent runs ensure changes are captured in a timely manner, maintaining accurate historical records.
27
What happens during the first run of the `dbt snapshot` command?
dbt creates an initial snapshot table based on the select statement, including additional columns like `dbt_valid_from` and `dbt_valid_to`. All records will have `dbt_valid_to = NULL` or a configured value if `dbt_valid_to_current` is set.
28
What does dbt do on subsequent runs of `dbt snapshot`?
It checks for changed or new records, updates the `dbt_valid_to` of changed existing records, and inserts updated or new records with `dbt_valid_to = NULL` or the configured `dbt_valid_to_current` value.
29
What is the purpose of the `dbt_valid_to` column in snapshots?
`dbt_valid_to` marks the end of a record's validity period. If it's NULL or set to a configured current value, the record is considered the latest version.
30
How can you customize the names of snapshot metadata columns in dbt?
You can use the `snapshot_meta_column_names` configuration to align column names with your team's conventions.
31
What does the `dbt_valid_to_current` config do in snapshots?
It allows setting a custom value (e.g., '9999-12-31') for `dbt_valid_to` on current records, instead of using NULL.
32
What are the three supported options for the `hard_deletes` config in dbt snapshots?
The options are `ignore`, `invalidate`, and `new_record`, each handling deletes differently in the source data.
33
How does the `ignore` option in `hard_deletes` affect snapshotting?
It ignores deleted rows in the source data, meaning no changes are made to the snapshot table for deleted records.
34
What happens with the `invalidate` option in `hard_deletes`?
A deleted source row causes the current snapshot record’s `dbt_valid_to` to be set, ending its validity without inserting a new row.
35
What does the `new_record` option in `hard_deletes` do?
It inserts a new snapshot record to indicate that a row has been deleted in the source, tracking deletion explicitly.
36
Can you reference snapshot tables in downstream dbt models?
Yes, snapshot tables can be referenced just like models using the `ref` function.
37
Why are `dbt_valid_from` and `dbt_valid_to` important in a snapshot table?
They define the time range during which a specific record version was valid, enabling historical analysis of data changes.
38
What is the default behavior for `dbt_valid_to` on current snapshot records?
By default, it is NULL unless overridden with `dbt_valid_to_current`.
39
How does dbt detect if a record has changed for snapshots?
It compares the current source record with the previous snapshot version. If differences are detected, it updates the snapshot.
40
How does the snapshot feature support data auditing and versioning?
It preserves historical versions of records, enabling you to trace data changes over time and recreate past states.
41
What is required to set up a snapshot in dbt?
You must define a snapshot block with a `select` statement and configure required fields like `unique_key`, `strategy`, and optional metadata fields.
42
What are the two built-in snapshot strategies in dbt?
Timestamp and Check. The Timestamp strategy uses a single updated_at field to track changes, while the Check strategy compares specified columns' values.
43
How does the Timestamp snapshot strategy detect changes?
It compares the source row's updated_at timestamp with the last snapshot run. If it's more recent, the old record is invalidated and a new one is recorded.
44
Why is the Timestamp strategy generally recommended over the Check strategy?
It requires tracking only one column, adapts better to schema changes, and is less error-prone over time.
45
What config is required for the Timestamp strategy?
You must specify the `updated_at` column that indicates when the source row was last updated.
46
How does the Check strategy detect changes in dbt snapshots?
It compares specified `check_cols` between current and historical versions. If values differ, a new record is created.
47
What configuration is needed for the Check strategy?
You must define `check_cols`, which is a list of columns or the keyword 'all' to check all columns.
48
What is the effect of using `check_cols = 'all'`?
dbt will compare all columns to detect changes. This may introduce noise and performance issues if the table has many or volatile columns.
49
How can the Check strategy incorporate an updated_at field?
By configuring `updated_at`, dbt uses it to populate timestamp fields. If `updated_at` is null, dbt defaults to the current timestamp.
50
What happens if you don't configure an `updated_at` field in the Check strategy?
dbt uses the current timestamp when a change is detected to populate `dbt_valid_from`, `dbt_valid_to`, and `dbt_updated_at`.
51
What is the `hard_deletes` config used for in dbt snapshots?
It controls how deleted rows from the source are handled. Options include `ignore`, `invalidate`, and `new_record`.
52
What does the `ignore` option in `hard_deletes` do?
It takes no action for deleted rows — they remain unchanged in the snapshot table.
53
What does the `invalidate` option do in `hard_deletes`?
It sets `dbt_valid_to` to end the row's validity period when the source row is deleted, without inserting a new row.
54
What is the purpose of the `new_record` option in `hard_deletes`?
It inserts a new row for a deleted record with `dbt_is_deleted = True`, and new rows for restored records with `dbt_is_deleted = False`.
55
What additional column is used when `hard_deletes = 'new_record'`?
`dbt_is_deleted` indicates whether the record is deleted (`True`) or active (`False`).
56
What is the role of `dbt_valid_from` in snapshots?
It marks when a record version became valid, used to track record history over time.
57
What does the `dbt_valid_to` field represent in snapshots?
It marks when a record version became invalid. For current records, it's NULL or a configured value like '9999-12-31'.
58
What is `dbt_updated_at` used for in snapshots?
It records the source row’s update timestamp at the time of snapshot insertion, reflecting the original change time.
59
What does `dbt_scd_id` represent in a snapshot table?
It’s a unique identifier for each version of a record, used internally for tracking.
60
How can you customize metadata column names in snapshot tables?
By using the `snapshot_meta_column_names` config to map internal fields like `dbt_valid_from` to custom names like `start_date`.
61
How are timestamps populated differently in Timestamp vs. Check strategies?
Timestamp strategy uses the configured `updated_at`, while Check uses the current timestamp or `updated_at` if specified.