Databricks Data Engineer Professional Certification Flashcards
https://www.databricks.com/sites/default/files/2025-02/databricks-certified-data-engineer-professional-exam-guide-1-mar-2025.pdf (59 cards)
Explain how Delta Lake uses the transaction log and cloud object storage to guarantee atomicity and durability
- The transaction log only records transactions that execute fully and completely.
- Databricks inherits the durability guarantees of the cloud object storage on which the data is stored.
https://docs.databricks.com/aws/en/lakehouse/acid
Describe how Delta Lake’s Optimistic Concurrency Control provides isolation, and which transactions might conflict
By checking for conflicts only at the time of commit, rather than throughout the entire transaction.
https://docs.databricks.com/aws/en/optimizations/isolation-level
Describe basic functionality of Delta clone.
- A deep clone copies metadata and data, including stream and COPY INTO metadata.
- A shallow clone copies metadata only (excluding stream and COPY INTO metadata) and retains a reference to the original data files.
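A minimal sketch of the clone syntax (the table names here are hypothetical):
spark.sql("CREATE OR REPLACE TABLE sales_deep DEEP CLONE sales")        # copies data files and metadata
spark.sql("CREATE OR REPLACE TABLE sales_shallow SHALLOW CLONE sales")  # copies metadata only, references the original data files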
https://docs.databricks.com/aws/en/delta/clone
What is a Spark watermark?
- In stream processing, a watermark is an Apache Spark feature that can define a time-based threshold for processing data when performing stateful operations such as aggregations.
- Spark waits to close and output the windowed aggregation until the max event time seen, minus the specified watermark, is greater than the upper bound of the window.
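A minimal PySpark sketch (the streaming DataFrame events and its event_time and key columns are assumed):
from pyspark.sql import functions as F

windowed_counts = (
    events
        .withWatermark("event_time", "10 minutes")            # tolerate events up to 10 minutes late
        .groupBy(F.window("event_time", "5 minutes"), "key")   # 5-minute tumbling windows
        .count()
)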
What is Bloom Filtering?
A Bloom filter index is a space-efficient data structure that enables data skipping on chosen columns, particularly for fields containing arbitrary text.
CREATE BLOOMFILTER INDEX ON TABLE table_name FOR COLUMNS(column_name OPTIONS (fpp=0.1, numItems=5000))
Databricks Bloom filter indexes consist of a data skipping index for each data file. The Bloom filter index can be used to determine that a column value is definitively not in the file, or that it is probably in the file. Before reading a file Databricks checks the index file, and the file is read only if the index indicates that the file might match a data filter.
Predictive I/O outperforms bloom filters.
What is predictive optimization?
Predictive optimization removes the need to manually manage maintenance operations for Unity Catalog managed tables on Databricks.
With predictive optimization enabled, Databricks automatically does the following:
- Identifies tables that would benefit from maintenance operations and queues these operations to run.
- Collects statistics when data is written to a managed table.
- Runs maintenance operations as necessary, eliminating both unnecessary maintenance runs and the burden of tracking and troubleshooting performance.
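It can be enabled or disabled per catalog or schema; a sketch of the SQL (catalog name hypothetical):
spark.sql("ALTER CATALOG main ENABLE PREDICTIVE OPTIMIZATION")   # or DISABLE / INHERIT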
https://docs.databricks.com/aws/en/optimizations/predictive-optimization
What is ZOrdering?
Z-ordering is a technique to colocate related information in the same set of files. This co-locality is automatically used by Delta Lake on Databricks data-skipping algorithms. This behavior dramatically reduces the amount of data that Delta Lake on Databricks needs to read. To Z-order data, you specify the columns to order on in the ZORDER BY clause:
OPTIMIZE events WHERE date >= current_timestamp() - INTERVAL 1 day ZORDER BY (eventType)
https://docs.databricks.com/aws/en/delta/data-skipping#what-is-z-ordering
What is predictive I/O?
Predictive I/O improves scanning performance by applying deep learning techniques to do the following:
- Determine the most efficient access pattern to read the data, and scan only the data that is actually needed.
- Eliminate the decoding of columns and rows that are not required to generate query results.
- Calculate the probabilities of the search criteria in selective queries matching a row. As queries run, we use these probabilities to anticipate where the next matching row would occur and only read that data from cloud storage.
What is Liquid Clustering?
Liquid clustering replaces table partitioning and ZORDER to simplify data layout decisions and optimize query performance. It provides the flexibility to redefine clustering keys without rewriting existing data, allowing data layout to evolve alongside analytic needs over time.
-- Create a new table with liquid clustering
CREATE TABLE table1(col0 INT, col1 STRING) CLUSTER BY (col0);

-- Alter an existing table
ALTER TABLE <table_name> CLUSTER BY (<clustering_columns>);
https://docs.databricks.com/aws/en/delta/clustering
Implement Delta tables optimized for Databricks SQL service
When should I partition a Delta Lake table?
Partitioning can speed up your queries if you filter, join, aggregate, or merge on the partition column(s), as it helps Spark skip a lot of unnecessary data partitions (i.e., subfolders) at scan time.
- Databricks recommends not partitioning tables under 1TB in size and instead letting ingestion time clustering take effect automatically. This feature clusters data based on the order it was ingested, by default, for all tables.
- You can partition by a column if you expect data in each partition to be at least 1GB
- Always choose a low cardinality column — for example, year, date — as a partition column
- You can also take advantage of Delta’s generated columns feature while choosing the partition column. Generated columns are a special type of column whose values are automatically generated based on a user-specified function over other columns in the Delta table.
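A minimal sketch of partitioning on a generated date column (table and column names are hypothetical):
spark.sql("""
CREATE TABLE events (
  id BIGINT,
  event_time TIMESTAMP,
  event_date DATE GENERATED ALWAYS AS (CAST(event_time AS DATE))
)
USING DELTA
PARTITIONED BY (event_date)
""")
Delta can then generate partition filters for queries that filter on event_time, even though the table is partitioned by event_date.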
https://www.databricks.com/discover/pages/optimize-data-workloads-guide#intro
How do I set a table’s target file size?
In cases where the default file size targeted by Auto-optimize (128MB) or Optimize (1GB) isn't working for you, you can fine-tune it to your requirements. You can set the target file size with the delta.targetFileSize table property, and Auto-optimize and Optimize will then bin-pack to achieve the specified size instead.
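For example (table name hypothetical):
spark.sql("ALTER TABLE my_table SET TBLPROPERTIES ('delta.targetFileSize' = '33554432')")  # target ~32MB files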
How can I avoid data shuffling using a broadcast hash join?
To entirely avoid data shuffling, broadcast one of the two tables or DataFrames (the smaller one) that are being joined together. The table is broadcast by the driver, which copies it to all worker nodes.
If you’re running a driver with a lot of memory (32GB+), you can safely raise the broadcast thresholds to something like 200MB
set spark.sql.autoBroadcastJoinThreshold = 209715200;
set spark.databricks.adaptive.autoBroadcastJoinThreshold = 209715200;
Using hints:
SELECT /*+ BROADCAST(t) */ * FROM <table-name> t
https://www.databricks.com/discover/pages/optimize-data-workloads-guide#data-shuffling
How can I control data shuffling using a Shuffle hash join over sort-merge join?
set spark.sql.join.preferSortMergeJoin = false
In most cases Spark chooses sort-merge join (SMJ) when it can't broadcast tables. Sort-merge joins are the most expensive joins. Shuffle-hash join (SHJ) has been found to be faster in some circumstances (but not all) than sort-merge, since it does not require the extra sorting step that SMJ does. The setting above advises Spark that you prefer SHJ over SMJ, and Spark will then try to use SHJ instead of SMJ wherever possible. Note that this does not mean Spark will always choose SHJ over SMJ; it simply registers your preference.
https://www.databricks.com/discover/pages/optimize-data-workloads-guide#data-shuffling
What is Databricks Cost Based Optimizer (CBO)?
Spark SQL can use a cost-based optimizer (CBO) to improve query plans. This is especially useful for queries with multiple joins. For this to work it is critical to collect table and column statistics and keep them up to date.
To get the full benefit of the CBO it is important to collect both column statistics and table statistics. You can use the ANALYZE TABLE command to manually collect statistics.
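For example (table name hypothetical):
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR ALL COLUMNS")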
https://docs.databricks.com/aws/en/optimizations/cbo
What is Data Spilling, why does it happen, and how do I get rid of it?
When the memory available to a Spark task's CPU core is insufficient to process the data it has been given, some of that data is spilled to disk, which is inefficient.
AQE auto-tuning:
Spark AQE has a feature called autoOptimizeShuffle (AOS), which can automatically find the right number of shuffle partitions (so long as compression is not excessive):
set spark.sql.shuffle.partitions=auto
Manually fine-tune
-- in SQL
set spark.sql.shuffle.partitions = 2*<number of total worker cores in cluster>

# in PySpark
spark.conf.set("spark.sql.shuffle.partitions", 2*<number of total worker cores in cluster>)
# or
spark.conf.set("spark.sql.shuffle.partitions", 2*sc.defaultParallelism)
https://www.databricks.com/discover/pages/optimize-data-workloads-guide#data-spilling
Identify and remediate data skewness
If nearly all of the Spark tasks in a shuffle stage have finished and just one or two hang for a long time, that's an indication of skew.
You can remediate by:
* Filter skewed values, if possible (such as nulls)
* In the case where you are able to identify the table, the column, and preferably also the values that are causing data skew, then you can explicitly tell Spark about it using skew hints so that Spark can try to resolve it for you:
SELECT /*+ SKEW('table', 'column_name', (value1, value2)) */ * FROM table
* AQE skew optimization (enabled by default for spark 3+)
* Salting: Append random integers to skewed column values
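A minimal salting sketch (the DataFrames large_df/small_df and the key join_key are hypothetical):
from pyspark.sql import functions as F

NUM_SALTS = 8  # assumed fan-out for the salt

# Add a random salt to the skewed (large) side
salted_large = large_df.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# Replicate the other side once per salt value so every salted key finds a match
salted_small = small_df.withColumn(
    "salt", F.explode(F.array(*[F.lit(i) for i in range(NUM_SALTS)]))
)

# The join key is now (join_key, salt), spreading the hot key across NUM_SALTS partitions
joined = salted_large.join(salted_small, ["join_key", "salt"]).drop("salt")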
https://www.databricks.com/discover/pages/optimize-data-workloads-guide#data-skewness
What operations can cause “data explosion”?
- The EXPLODE function
- Joins
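A tiny illustration of how explode multiplies rows (hypothetical data):
from pyspark.sql import functions as F

df = spark.createDataFrame([(1, ["a", "b", "c"])], ["id", "items"])
df.select("id", F.explode("items").alias("item")).show()  # 1 input row becomes 3 output rows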
Contrast different strategies for partitioning data (e.g. identify proper partitioning columns to use)
- Partitioning works well only for low or known cardinality fields (for example, date fields or physical locations), but not for fields with high cardinality such as timestamps
- Z-order works for all fields, including high cardinality fields and fields that may grow infinitely (for example, timestamps or the customer ID in a transactions or orders table)
- Most tables can leverage ingestion time clustering to avoid needing to worry about Z-order and partition tuning.
- Databricks recommends all partitions contain at least a gigabyte of data. Tables with fewer, larger partitions tend to outperform tables with many smaller partitions.
- Partitions can be beneficial for very large tables. Many performance enhancements around partitioning focus on very large tables (hundreds of terabytes or greater).
https://docs.databricks.com/aws/en/tables/partitions
Describe and distinguish partition hints: coalesce, repartition, repartition by range, and rebalance
- COALESCE ( part_num ): Reduce the number of partitions to the specified number of partitions. It takes a partition number as a parameter.
- REPARTITION ( { part_num | [ part_num , ] column_name [ , ...] } ): Repartition to the specified number of partitions using the specified partitioning expressions. It takes a partition number, column names, or both as parameters.
- REPARTITION_BY_RANGE ( part_num [, column_name [, ...] ] | column_name [, ...] ): Repartition to the specified number of partitions using the specified partitioning expressions. It takes column names and an optional partition number as parameters.
- REBALANCE [ ( column_name [, ...] ) ]: Rebalance the query result output partitions so that every partition is of a reasonable size (not too small and not too big). It can take column names as parameters and tries its best to partition the query result by these columns. This is best-effort: if there are skews, Spark will split the skewed partitions so they are not too big. This hint is useful when you need to write the result of the query to a table and want to avoid files that are too small or too big. It is ignored if AQE is not enabled.
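Illustrative usage (table and column names hypothetical):
spark.sql("SELECT /*+ REPARTITION(8, region) */ * FROM sales")
spark.sql("SELECT /*+ REBALANCE(region) */ * FROM sales")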
https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-qry-select-hints
Articulate how to write Pyspark dataframes to disk while manually controlling the size of individual part-files
df.write.option("maxRecordsPerFile", 50).save("/tmp/foo")
Articulate multiple strategies for updating 1+ records in a spark table (Type 1)
MERGE, JOIN+OVERWRITE, UPDATE
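A minimal MERGE-based sketch of a Type 1 (overwrite in place) update (table and column names hypothetical):
spark.sql("""
MERGE INTO customers AS t
USING customer_updates AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")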
Implement common design patterns unlocked by Structured Streaming and Delta Lake.
Explore and tune state information using stream-static joins and Delta Lake
In stream-static joins, state information refers to the data that needs to be maintained across micro-batches to perform the join. Delta Lake simplifies state management by:
* Storing State in Delta Tables: The static dataset is stored in a Delta table, which is efficiently managed and updated.
* Handling Updates: Delta Lake supports upserts and deletes, making it easy to update the static dataset without disrupting the streaming job.
Example: If the product catalog (static dataset) is updated, Delta Lake ensures that the changes are reflected in the next micro-batch of the streaming job.
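A minimal stream-static join sketch (table names and checkpoint path are hypothetical):
# Static Delta table: each micro-batch joins against its latest version,
# so upserts to the catalog are picked up by the stream
products = spark.read.table("product_catalog")

# Streaming source of orders
orders = spark.readStream.table("orders_bronze")

# Enrich each micro-batch of orders with the current product catalog
enriched = orders.join(products, on="product_id", how="left")

query = (
    enriched.writeStream
        .option("checkpointLocation", "/tmp/checkpoints/orders_enriched")
        .trigger(availableNow=True)
        .toTable("orders_enriched")
)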
https://medium.com/@sujathamudadla1213/section-2-data-processing-batch-processing-incremental-processing-and-optimization-subtopic-is-12fe3e77da2b