Databricks Data Engineer Professional Certification Flashcards
https://www.databricks.com/sites/default/files/2025-02/databricks-certified-data-engineer-professional-exam-guide-1-mar-2025.pdf (59 cards)
Explain how Delta Lake uses the transaction log and cloud object storage to guarantee atomicity and durability
- The transaction log only records transactions that execute fully and completely.
- Databricks inherits the durability guarantees of the cloud object storage on which the data is stored.
https://docs.databricks.com/aws/en/lakehouse/acid
Describe how Delta Lake’s Optimistic Concurrency Control provides isolation, and which transactions might conflict
By checking for conflicts only at the time of commit, rather than throughout the entire transaction.
https://docs.databricks.com/aws/en/optimizations/isolation-level
Describe basic functionality of Delta clone.
- A deep clone copies metadata and data, including stream and COPY INTO metadata.
- A shallow clone copies metadata only (excluding stream and COPY INTO metadata) and retains a reference to the original data files.
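A minimal sketch of the clone syntax (the table names here are hypothetical):
spark.sql("CREATE OR REPLACE TABLE sales_deep DEEP CLONE sales")        # copies data files and metadata
spark.sql("CREATE OR REPLACE TABLE sales_shallow SHALLOW CLONE sales")  # copies metadata only, references the original data files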
https://docs.databricks.com/aws/en/delta/clone
What is a Spark watermark?
- In stream processing, a watermark is an Apache Spark feature that can define a time-based threshold for processing data when performing stateful operations such as aggregations.
- Spark waits to close and output the windowed aggregation until the max event time seen, minus the specified watermark, is greater than the upper bound of the window.
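A minimal PySpark sketch (the streaming DataFrame events and its event_time and key columns are assumed):
from pyspark.sql import functions as F

windowed_counts = (
    events
        .withWatermark("event_time", "10 minutes")            # tolerate events up to 10 minutes late
        .groupBy(F.window("event_time", "5 minutes"), "key")   # 5-minute tumbling windows
        .count()
)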
What is Bloom Filtering?
A Bloom filter index is a space-efficient data structure that enables data skipping on chosen columns, particularly for fields containing arbitrary text.
CREATE BLOOMFILTER INDEX ON TABLE table_name FOR COLUMNS(column_name OPTIONS (fpp=0.1, numItems=5000))
Databricks Bloom filter indexes consist of a data skipping index for each data file. The Bloom filter index can be used to determine that a column value is definitively not in the file, or that it is probably in the file. Before reading a file Databricks checks the index file, and the file is read only if the index indicates that the file might match a data filter.
Predictive I/O outperforms bloom filters.
What is predictive optimization?
Predictive optimization removes the need to manually manage maintenance operations for Unity Catalog managed tables on Databricks.
With predictive optimization enabled, Databricks automatically does the following:
- Identifies tables that would benefit from maintenance operations and queues these operations to run.
- Collects statistics when data is written to a managed table.
- Runs maintenance operations as necessary, eliminating both unnecessary maintenance runs and the burden of tracking and troubleshooting performance.
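It can be enabled or disabled per catalog or schema; a sketch of the SQL (catalog name hypothetical):
spark.sql("ALTER CATALOG main ENABLE PREDICTIVE OPTIMIZATION")   # or DISABLE / INHERIT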
https://docs.databricks.com/aws/en/optimizations/predictive-optimization
What is ZOrdering?
Z-ordering is a technique to colocate related information in the same set of files. This co-locality is automatically used by Delta Lake on Databricks data-skipping algorithms. This behavior dramatically reduces the amount of data that Delta Lake on Databricks needs to read. To Z-order data, you specify the columns to order on in the ZORDER BY clause:
OPTIMIZE events WHERE date >= current_timestamp() - INTERVAL 1 day ZORDER BY (eventType)
https://docs.databricks.com/aws/en/delta/data-skipping#what-is-z-ordering
What is predictive I/O?
Predictive I/O improves scanning performance by applying deep learning techniques to do the following:
- Determine the most efficient access pattern to read the data, and scan only the data that is actually needed.
- Eliminate the decoding of columns and rows that are not required to generate query results.
- Calculate the probabilities of the search criteria in selective queries matching a row. As queries run, we use these probabilities to anticipate where the next matching row would occur and only read that data from cloud storage.
What is Liquid Clustering?
Liquid clustering replaces table partitioning and ZORDER to simplify data layout decisions and optimize query performance. It provides the flexibility to redefine clustering keys without rewriting existing data, allowing data layout to evolve alongside analytic needs over time.
-- Create a new table with liquid clustering
CREATE TABLE table1(col0 INT, col1 STRING) CLUSTER BY (col0);

-- Alter an existing table
ALTER TABLE <table_name> CLUSTER BY (<clustering_columns>);
https://docs.databricks.com/aws/en/delta/clustering
Implement Delta tables optimized for Databricks SQL service
When should I partition a Delta Lake table?
Partitioning can speed up your queries if you filter, join, aggregate, or merge on the partition column(s), as it helps Spark skip a lot of unnecessary data partitions (i.e., subfolders) at scan time.
- Databricks recommends not partitioning tables under 1TB in size and instead letting ingestion time clustering take effect automatically. This feature clusters data based on the order it was ingested, by default, for all tables.
- You can partition by a column if you expect data in each partition to be at least 1GB
- Always choose a low cardinality column — for example, year, date — as a partition column
- You can also take advantage of Delta’s generated columns feature while choosing the partition column. Generated columns are a special type of column whose values are automatically generated based on a user-specified function over other columns in the Delta table.
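A minimal sketch of partitioning on a generated date column (table and column names are hypothetical):
spark.sql("""
CREATE TABLE events (
  id BIGINT,
  event_time TIMESTAMP,
  event_date DATE GENERATED ALWAYS AS (CAST(event_time AS DATE))
)
USING DELTA
PARTITIONED BY (event_date)
""")
Delta can then generate partition filters for queries that filter on event_time, even though the table is partitioned by event_date.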
https://www.databricks.com/discover/pages/optimize-data-workloads-guide#intro
How do I set a table’s target file size?
In cases where the default file size targeted by Auto-optimize (128MB) or Optimize (1GB) isn't working for you, you can fine-tune it to your requirements. You can set the target file size with the delta.targetFileSize table property, and Auto-optimize and Optimize will then bin-pack to achieve the specified size instead.
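For example (table name hypothetical):
spark.sql("ALTER TABLE my_table SET TBLPROPERTIES ('delta.targetFileSize' = '33554432')")  # target ~32MB files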
How can I avoid data shuffling using a broadcast hash join?
To entirely avoid data shuffling, broadcast one of the two tables or DataFrames (the smaller one) that are being joined together. The table is broadcast by the driver, which copies it to all worker nodes.
If you’re running a driver with a lot of memory (32GB+), you can safely raise the broadcast thresholds to something like 200MB
set spark.sql.autoBroadcastJoinThreshold = 209715200;
set spark.databricks.adaptive.autoBroadcastJoinThreshold = 209715200;
Using hints:
SELECT /*+ BROADCAST(t) */ * FROM <table-name> t
https://www.databricks.com/discover/pages/optimize-data-workloads-guide#data-shuffling
How can I control data shuffling using a Shuffle hash join over sort-merge join?
set spark.sql.join.preferSortMergeJoin = false
In most cases Spark chooses sort-merge join (SMJ) when it can't broadcast tables. Sort-merge joins are the most expensive joins. Shuffle-hash join (SHJ) has been found to be faster in some circumstances (but not all) than sort-merge, since it does not require the extra sorting step that SMJ does. The setting above advises Spark that you prefer SHJ over SMJ, and Spark will then try to use SHJ instead of SMJ wherever possible. Note that this does not mean Spark will always choose SHJ over SMJ; it simply registers your preference.
https://www.databricks.com/discover/pages/optimize-data-workloads-guide#data-shuffling
What is Databricks Cost Based Optimizer (CBO)?
Spark SQL can use a cost-based optimizer (CBO) to improve query plans. This is especially useful for queries with multiple joins. For this to work it is critical to collect table and column statistics and keep them up to date.
To get the full benefit of the CBO it is important to collect both column statistics and table statistics. You can use the ANALYZE TABLE command to manually collect statistics.
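For example (table name hypothetical):
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR ALL COLUMNS")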
https://docs.databricks.com/aws/en/optimizations/cbo
What is Data Spilling, why does it happen, and how do I get rid of it?
When the memory available to a Spark task's CPU core is insufficient to process the data it has been given, some of that data is spilled to disk, which is inefficient.
AQE auto-tuning:
Spark AQE has a feature called autoOptimizeShuffle (AOS), which can automatically find the right number of shuffle partitions (so long as compression is not excessive):
set spark.sql.shuffle.partitions=auto
Manually fine-tune
-- in SQL
set spark.sql.shuffle.partitions = 2*<number of total worker cores in cluster>

# in PySpark
spark.conf.set("spark.sql.shuffle.partitions", 2*<number of total worker cores in cluster>)
# or
spark.conf.set("spark.sql.shuffle.partitions", 2*sc.defaultParallelism)
https://www.databricks.com/discover/pages/optimize-data-workloads-guide#data-spilling
Identify and remediate data skewness
If nearly all of the Spark tasks in a shuffle stage have finished and just one or two hang for a long time, that's an indication of skew.
You can remediate by:
* Filter skewed values, if possible (such as nulls)
* In the case where you are able to identify the table, the column, and preferably also the values that are causing data skew, then you can explicitly tell Spark about it using skew hints so that Spark can try to resolve it for you:
SELECT /*+ SKEW('table', 'column_name', (value1, value2)) */ * FROM table
* AQE skew optimization (enabled by default for spark 3+)
* Salting: Append random integers to skewed column values
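A minimal salting sketch (the DataFrames large_df/small_df and the key join_key are hypothetical):
from pyspark.sql import functions as F

NUM_SALTS = 8  # assumed fan-out for the salt

# Add a random salt to the skewed (large) side
salted_large = large_df.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# Replicate the other side once per salt value so every salted key finds a match
salted_small = small_df.withColumn(
    "salt", F.explode(F.array(*[F.lit(i) for i in range(NUM_SALTS)]))
)

# The join key is now (join_key, salt), spreading the hot key across NUM_SALTS partitions
joined = salted_large.join(salted_small, ["join_key", "salt"]).drop("salt")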
https://www.databricks.com/discover/pages/optimize-data-workloads-guide#data-skewness
What operations can cause “data explosion”?
- The EXPLODE function
- Joins
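A tiny illustration of how explode multiplies rows (hypothetical data):
from pyspark.sql import functions as F

df = spark.createDataFrame([(1, ["a", "b", "c"])], ["id", "items"])
df.select("id", F.explode("items").alias("item")).show()  # 1 input row becomes 3 output rows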
Contrast different strategies for partitioning data (e.g. identify proper partitioning columns to use)
- Partitioning works well only for low or known cardinality fields (for example, date fields or physical locations), but not for fields with high cardinality such as timestamps
- Z-order works for all fields, including high cardinality fields and fields that may grow infinitely (for example, timestamps or the customer ID in a transactions or orders table)
- Most tables can leverage ingestion time clustering to avoid needing to worry about Z-order and partition tuning.
- Databricks recommends all partitions contain at least a gigabyte of data. Tables with fewer, larger partitions tend to outperform tables with many smaller partitions.
- Partitions can be beneficial for very large tables. Many performance enhancements around partitioning focus on very large tables (hundreds of terabytes or greater).
https://docs.databricks.com/aws/en/tables/partitions
Describe and distinguish partition hints: coalesce, repartition, repartition by range, and rebalance
- COALESCE ( part_num ): Reduce the number of partitions to the specified number of partitions. It takes a partition number as a parameter.
- REPARTITION ( { part_num | [ part_num , ] column_name [ , ...] } ): Repartition to the specified number of partitions using the specified partitioning expressions. It takes a partition number, column names, or both as parameters.
- REPARTITION_BY_RANGE ( part_num [, column_name [, ...] ] | column_name [, ...] ): Repartition to the specified number of partitions using the specified partitioning expressions. It takes column names and an optional partition number as parameters.
- REBALANCE [ ( column_name [, ...] ) ]: Rebalance the query result output partitions so that every partition is of a reasonable size (not too small and not too big). It can take column names as parameters and tries its best to partition the query result by these columns. This is best-effort: if there are skews, Spark will split the skewed partitions so they are not too big. This hint is useful when you need to write the result of the query to a table and want to avoid files that are too small or too big. It is ignored if AQE is not enabled.
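Illustrative usage (table and column names hypothetical):
spark.sql("SELECT /*+ REPARTITION(8, region) */ * FROM sales")
spark.sql("SELECT /*+ REBALANCE(region) */ * FROM sales")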
https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-qry-select-hints
Articulate how to write Pyspark dataframes to disk while manually controlling the size of individual part-files
df.write.option("maxRecordsPerFile", 50).save("/tmp/foo")
Articulate multiple strategies for updating 1+ records in a spark table (Type 1)
MERGE, JOIN+OVERWRITE, UPDATE
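A minimal MERGE-based sketch of a Type 1 (overwrite in place) update (table and column names hypothetical):
spark.sql("""
MERGE INTO customers AS t
USING customer_updates AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")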
Implement common design patterns unlocked by Structured Streaming and Delta Lake.
Explore and tune state information using stream-static joins and Delta Lake
In stream-static joins, state information refers to the data that needs to be maintained across micro-batches to perform the join. Delta Lake simplifies state management by:
* Storing State in Delta Tables: The static dataset is stored in a Delta table, which is efficiently managed and updated.
* Handling Updates: Delta Lake supports upserts and deletes, making it easy to update the static dataset without disrupting the streaming job.
Example: If the product catalog (static dataset) is updated, Delta Lake ensures that the changes are reflected in the next micro-batch of the streaming job.
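A minimal stream-static join sketch (table names and checkpoint path are hypothetical):
# Static Delta table: each micro-batch joins against its latest version,
# so upserts to the catalog are picked up by the stream
products = spark.read.table("product_catalog")

# Streaming source of orders
orders = spark.readStream.table("orders_bronze")

# Enrich each micro-batch of orders with the current product catalog
enriched = orders.join(products, on="product_id", how="left")

query = (
    enriched.writeStream
        .option("checkpointLocation", "/tmp/checkpoints/orders_enriched")
        .trigger(availableNow=True)
        .toTable("orders_enriched")
)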
https://medium.com/@sujathamudadla1213/section-2-data-processing-batch-processing-incremental-processing-and-optimization-subtopic-is-12fe3e77da2b