What are some of the benefits of using Delta Live Tables?
What are some of the components necessary to create a new DLT pipeline?
Scenarios for using DLT pipelines
Difference between triggered and continuous pipelines
Triggered pipelines run once and then shut down until the next manual or scheduled update
Continuous pipelines run continuously, ingesting new data as it arrives
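In the pipeline settings JSON, the mode is controlled by the continuous flag (a minimal sketch; the pipeline name is made up and most other settings fields are omitted):
{
  "name": "orders-pipeline",
  "continuous": false
}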
Explain the Notebooks, Storage Location and Target parameters for a DLT pipeline
What is Auto Loader?
What are the use cases of Auto Loader?
Other features of Auto Loader
Sample Auto Loader syntax (SQL)
CREATE OR REFRESH STREAMING TABLE orders_bronze
AS SELECT current_timestamp() processing_time, input_file_name() source_file, *
FROM cloud_files("${source}/orders", "json", map("cloudFiles.inferColumnTypes", "true"))
What are the two types of persistent tables that can be created with DLT?
Materialised Views/Live Tables: materialise the current result of any query with each refresh
Streaming Tables: designed for incremental, near-real-time data processing. Support reading from "append-only" streaming sources. Each input record is read exactly once
CREATE OR REFRESH LIVE TABLE orders_by_date
AS SELECT date(order_timestamp) AS order_date, count(*) AS total_daily_orders
FROM LIVE.orders_silver
GROUP BY date(order_timestamp)
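A streaming table counterpart reads from an upstream table incrementally by wrapping it in STREAM() (a sketch; the column names are assumed, not taken from a real schema):
CREATE OR REFRESH STREAMING TABLE orders_silver
AS SELECT order_id, order_timestamp, customer_id
FROM STREAM(LIVE.orders_bronze)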
What are the three violation actions for expectation constraints on Delta Live Tables?
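The three behaviours can be sketched as CONSTRAINT clauses (retain-and-warn is the default; DROP ROW and FAIL UPDATE must be stated explicitly; the constraint names and conditions here are illustrative):
-- Default: retain violating rows, record the violation in the pipeline metrics
CONSTRAINT valid_timestamp EXPECT (order_timestamp > "2021-01-01")
-- Drop rows that violate the expectation
CONSTRAINT valid_id EXPECT (customer_id IS NOT NULL) ON VIOLATION DROP ROW
-- Fail the update if any row violates the expectation
CONSTRAINT valid_operation EXPECT (operation IS NOT NULL) ON VIOLATION FAIL UPDATE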
What is CDC (Change Data Capture) data?
How can APPLY CHANGES INTO be used for CDC?
CREATE OR REFRESH STREAMING TABLE customers_silver;

APPLY CHANGES INTO LIVE.customers_silver
FROM STREAM(LIVE.customers_bronze_clean)
KEYS (customer_id)
APPLY AS DELETE WHEN operation = "DELETE"
SEQUENCE BY timestamp
COLUMNS * EXCEPT (operation, source_file, _rescued_data)
What is the events log and how can it be queried?
A managed Delta Lake table, updated on each pipeline run, whose important fields are stored as nested JSON
event_log_path = f"{DA.paths.storage_location}/system/events"
event_log = spark.read.format("delta").load(event_log_path)
event_log.createOrReplaceTempView("event_log_raw")
display(event_log)
How can audit logging be performed on the event log?
The JSON for each run can be queried; the action and user name are available under details:user_action:action and details:user_action:user_name
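A sketch of an audit query, assuming the event_log_raw temp view registered above (the exact field paths under details may vary by runtime version):
SELECT timestamp, details:user_action:action, details:user_action:user_name
FROM event_log_raw
WHERE event_type = "user_action"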
How can data lineage be examined from the event log?
The JSON for each run can be queried for events of type "flow_definition"; the "output_dataset" and "input_datasets" fields give the lineage for each flow
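A sketch of a lineage query, again assuming the event_log_raw temp view (field paths under details are as I believe they appear in the event log schema, so treat them as an assumption):
SELECT details:flow_definition.output_dataset, details:flow_definition.input_datasets
FROM event_log_raw
WHERE event_type = "flow_definition"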
How are existing Delta Live/Streaming Tables queried?
Within a pipeline, reference other tables through the LIVE schema; wrap the source in STREAM() to read it incrementally as a stream:
AS SELECT *
FROM LIVE.customers_bronze
or
AS SELECT *
FROM STREAM(LIVE.customers_bronze)