Data Engineer Professional Flashcards
Which utility can a data engineer use to read passed parameters inside a notebook?
- dbutils.secrets
- dbutils.library
- dbutils.fs
- dbutils.notebook
- dbutils.widgets
- dbutils.widgets
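As a quick illustration (a minimal sketch; the parameter name "env" and its default value are made up, not part of the question), a widget is declared and then read inside the notebook:
dbutils.widgets.text("env", "dev")            # declare a text widget with a default value
env = dbutils.widgets.get("env")              # read the parameter passed to the notebook
print(f"Running against environment: {env}")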
Which of the following describes the minimal permissions a data engineer needs to view the metrics, driver logs and Spark UI of an existing cluster?
- Can Attach To
- Can Restart
- Can Manage
- Cluster creation allowed + Can Attach To privileges
- Cluster creation allowed + Can Restart privileges
- Can Attach To
For production Databricks jobs, which of the following cluster types is recommended?
- All-purpose clusters
- Production clusters
- Job clusters
- On-premises clusters
- Serverless clusters
- Job clusters
(Job clusters are dedicated clusters for a job or task run. A job cluster auto-terminates once the job completes, which saves cost compared to all-purpose clusters.)
If a Delta Lake table is created with the following query:
CREATE TABLE target AS SELECT * FROM source
What will be the result of running DROP TABLE source?
- An error will occur indicating that other tables are based on this source table
- Both the target and source tables will be dropped
- No table will be dropped until CASCADE keyword is added to the command
- Only the source table will be dropped, but the target table will no longer be queryable
- Only the source table will be dropped, while the target table will not be affected
- Only the source table will be dropped, while the target table will not be affected
Which of the following describes the minimal permissions a data engineer needs to start and terminate an existing cluster?
- Can Attach To
- Can Restart
- Can Manage
- Cluster creation allowed + Can Attach To privileges
- Cluster creation allowed + Can Restart privileges
- Can Restart
The data engineering team has a Delta Lake table created with the following query:
CREATE TABLE customers_clone
LOCATION 'dbfs:/mnt/backup'
AS SELECT * FROM customers
A data engineer wants to drop the table with the following query:
DROP TABLE customers_clone
Which statement describes the result of running this drop command?
- An error will occur as the table is deep cloned from the customers table
- An error will occur as the table is shallowly cloned from the customers table
- Only the table’s metadata will be deleted from the catalog, while the data files will be kept in the storage
- Both the table’s metadata and the data files will be deleted
- The table will not be dropped until VACUUM command is run
- Only the table’s metadata will be deleted from the catalog, while the data files will be kept in the storage.
(External (unmanaged) tables are tables whose data is stored in an external storage path by using a LOCATION clause.
When you run DROP TABLE on an external table, only the table’s metadata is deleted, while the underlying data files are kept.)
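A minimal sketch of this behavior, reusing the table and path from the question (the final listing call just illustrates that the files survive the drop):
spark.sql("""
    CREATE TABLE customers_clone
    LOCATION 'dbfs:/mnt/backup'
    AS SELECT * FROM customers
""")
spark.sql("DROP TABLE customers_clone")        # removes only the catalog entry
display(dbutils.fs.ls("dbfs:/mnt/backup"))     # the underlying data files are still there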
Which of the following describes the minimal permissions a data engineer needs to edit the configurations of an existing cluster?
- Can Restart
- Can Manage
- Cluster creation allowed + Can Restart
- Cluster creation allowed + Can Manage
- Only administrators can edit the configurations on existing clusters
- Can Manage
Given the following code block in a notebook:
db_password = dbutils.secrets.get(scope="dev", key="database_password")
print(db_password)
Which statement describes what will happen when the above code is executed?
- An interactive input box will appear in the notebook
- The string “REDACTED” will be printed
- The error message “Secrets can not be printed” will be shown
- The string value of the password will be printed in plain text
- If the user has “Can Read” permission, the string value of the password will be printed in plain text, otherwise “REDACTED”
- The string “REDACTED” will be printed.
A junior data engineer is using the %sh magic command to run some legacy code. A senior data engineer has recommended refactoring the code instead.
Which of the following could explain why a data engineer may need to avoid using the %sh magic command?
- %sh restarts the Python interpreter. This clears all the variables declared in the notebook
- %sh executes shell code only on the local driver machine which leads to significant performance overhead
- %sh can not access storage to persist the output
- All the above reasons
- None of these reasons
- %sh executes shell code only on the local driver machine which leads to significant performance overhead
Given a Delta table ‘products’ with the following schema:
name STRING, category STRING, expiration_date DATE, price FLOAT
When executing the below query:
SELECT * FROM products
WHERE price > 30.5
Which of the following will be leveraged by the query optimizer to identify the data files to load?
- Column statistics in the Hive metastore
- Column statistics in the metadata of Parquet files
- File statistics in the Delta transaction log
- File statistics in the Hive metastore
- None of the above
- File statistics in the Delta transaction log
In the Transaction log, Delta Lake captures statistics for each data file of the table. These statistics indicate per file:
Total number of records
Minimum value in each column of the first 32 columns of the table
Maximum value in each column of the first 32 columns of the table
Null value counts in each column of the first 32 columns of the table
When a query with a selective filter is executed against the table, the query optimizer uses these statistics to identify the data files that may contain records matching the conditional filter, so only those files need to be loaded.
For the SELECT query in the question, the transaction log is scanned for min and max statistics for the price column.
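For the curious, here is a hedged sketch of how those per-file statistics can be inspected directly in the transaction log; the table path is an assumption and depends on where 'products' is actually stored:
import json

log_path = "dbfs:/user/hive/warehouse/products/_delta_log/"     # assumed table location
log_files = [f.path for f in dbutils.fs.ls(log_path) if f.path.endswith(".json")]

for line in dbutils.fs.head(log_files[0], 65536).splitlines():
    action = json.loads(line)
    if "add" in action and "stats" in action["add"]:
        # 'stats' holds numRecords plus min/max/null-count values for the first 32 columns
        print(json.dumps(json.loads(action["add"]["stats"]), indent=2))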
The data engineering team has a table ‘orders_backup’ that was created using Delta Lake’s SHALLOW CLONE functionality from the table ‘orders’. Recently, the team started getting an error when querying the ‘orders_backup’ table indicating that some data files are no longer present.
Which of the following correctly explains this error?
- The VACUUM command was run on the orders table
- The VACUUM command was run on the orders_backup table
- The OPTIMIZE command was run on the orders table
- The OPTIMIZE command was run on the orders_backup table
- The REFRESH command was run on the orders_backup table
- The VACUUM command was run on the orders table
With Shallow Clone, you create a copy of a table by just copying the Delta transaction logs.
That means that no data is moved during shallow cloning.
Running the VACUUM command on the source table may purge data files referenced in the transaction log of the clone. In this case, you will get an error when querying the clone indicating that some data files are no longer present.
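A minimal sketch of the failure scenario, using the table names from the question (the aggressive retention setting is shown only to make the effect reproducible; it also requires disabling the retention duration check):
spark.sql("CREATE TABLE orders_backup SHALLOW CLONE orders")   # copies only the transaction log

# VACUUM on the source can physically delete files the clone's log still references
spark.sql("VACUUM orders RETAIN 0 HOURS")

spark.sql("SELECT * FROM orders_backup").show()                # may now fail: data files no longer present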
A data engineer has a Delta Lake table named ‘orders_archive’ created using the following command:
CREATE TABLE orders_archive
DEEP CLONE orders
They want to sync up the new changes in the orders table to the clone.
Which of the following commands can be run to achieve this task?
- REFRESH orders_archive
- SYNC orders_archive
- INSERT OVERWRITE orders_archive SELECT * FROM orders
- CREATE OR REPLACE TABLE orders_archive DEEP CLONE orders
- DROP TABLE orders_archive
CREATE TABLE orders_archive
DEEP CLONE orders
- CREATE OR REPLACE TABLE orders_archive DEEP CLONE orders
Cloning can occur incrementally. Executing the CREATE OR REPLACE TABLE command can sync changes from the source to the target location.
Now, if you run DESCRIBE HISTORY orders_archive, you will see a new table version corresponding to a CLONE operation.
The data engineering team has a Delta Lake table named ‘daily_activities’ that is completely overwritten each night with new data received from the source system.
For auditing purposes, the team wants to set up a post-processing task that uses Delta Lake Time Travel functionality to determine the difference between the new version and the previous version of the table. They start by getting the current table version via this code:
current_version = spark.sql("SELECT max(version) FROM (DESCRIBE HISTORY daily_activities)").collect()[0][0]
Which of the following queries can be used by the team to complete this task?
- SELECT * FROM daily_activities
  UNION
  SELECT * FROM daily_activities AS VERSION = {current_version-1}
- SELECT * FROM daily_activities
  UNION ALL
  SELECT * FROM daily_activities@v{current_version-1}
- SELECT * FROM daily_activities
  INTERSECT
  SELECT * FROM daily_activities AS VERSION = {current_version-1}
- SELECT * FROM daily_activities
  EXCEPT
  SELECT * FROM daily_activities@v{current_version-1}
- SELECT * FROM daily_activities
  MINUS
  SELECT * FROM daily_activities AS VERSION = {current_version-1}
- SELECT * FROM daily_activities
  EXCEPT
  SELECT * FROM daily_activities@v{current_version-1}
Each operation that modifies a Delta Lake table creates a new table version. You can use history information to audit operations or query a table at a specific point in time using:
Version number
SELECT * FROM my_table@v36
SELECT * FROM my_table VERSION AS OF 36
Timestamp
SELECT * FROM my_table TIMESTAMP AS OF "2019-01-01"
Using the EXCEPT set operator, you can get the difference between the new version and the previous version of the table.
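Putting the pieces together in a notebook cell (a sketch that simply reuses the table name and version lookup from the question):
current_version = spark.sql(
    "SELECT max(version) FROM (DESCRIBE HISTORY daily_activities)"
).collect()[0][0]

diff_df = spark.sql(f"""
    SELECT * FROM daily_activities
    EXCEPT
    SELECT * FROM daily_activities@v{current_version - 1}
""")
diff_df.show()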
The data engineering team wants to build a pipeline that receives customers data as a change data capture (CDC) feed from a source system. The CDC events logged at the source contain the data of the records along with metadata information. This metadata indicates whether the specified record was inserted, updated, or deleted, in addition to a timestamp column, identified by the field update_time, indicating the order in which the changes happened. Each record has a primary key identified by the field customer_id.
In the same batch, multiple changes for the same customer could be received with different update_time. The team wants to store only the most recent information for each customer in the target Delta Lake table.
Which of the following solutions meets these requirements?
- Enable Delta Lake’s Change Data Feed (CDF) on the target table to automatically merge the received CDC feed
- Use MERGE INTO to upsert the most recent entry for each customer_id into the table
- Use MERGE INTO with a SEQUENCE BY clause on the update_time for ordering how operations should be applied
- Use DropDuplicates function to remove duplicates by customer_id, then merge the duplicate records into the table.
- Use the option mergeSchema when writing the CDC data into the table to automatically merge the changed data with its most recent schema
- Use MERGE INTO to upsert the most recent entry for each customer_id into the table
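One way this can look in practice (a hedged sketch: the incoming batch DataFrame cdc_df, the target table customers, the metadata column operation, and the payload columns name and email are all illustrative assumptions, not given in the question):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# keep only the most recent change per customer within the batch
latest = Window.partitionBy("customer_id").orderBy(F.col("update_time").desc())
latest_changes = (cdc_df
                  .withColumn("rn", F.row_number().over(latest))
                  .filter("rn = 1")
                  .drop("rn"))
latest_changes.createOrReplaceTempView("latest_changes")

spark.sql("""
    MERGE INTO customers c
    USING latest_changes l
    ON c.customer_id = l.customer_id
    WHEN MATCHED AND l.operation = 'DELETE' THEN DELETE
    WHEN MATCHED THEN UPDATE SET c.name = l.name, c.email = l.email, c.update_time = l.update_time
    WHEN NOT MATCHED AND l.operation != 'DELETE' THEN
        INSERT (customer_id, name, email, update_time)
        VALUES (l.customer_id, l.name, l.email, l.update_time)
""")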
A data engineer is using foreachBatch logic to upsert data in a target Delta table.
The function to be called for each new micro-batch is displayed below with a blank:
def upsert_data(microBatchDF, batch_id):
    microBatchDF.createOrReplaceTempView("sales_microbatch")
    sql_query = """
        MERGE INTO sales_silver a
        USING sales_microbatch b
        ON a.item_id = b.item_id AND a.item_timestamp = b.item_timestamp
        WHEN NOT MATCHED THEN INSERT *
    """
    ____________________
Which option correctly fills in the blank to execute the sql query in the function on a cluster with Databricks Runtime below 10.5?
- spark.sql(sql_query)
- batch_id.sql(sql_query)
- microBatchDF.sql(sql_query)
- microBatchDF.sparkSession.sql(sql_query)
- microBatchDF._jdf.sparkSession.sql(sql_query)
- microBatchDF._jdf.sparkSession.sql(sql_query)
Usually, we use the spark.sql() function to run SQL queries. However, in this particular case, the Spark session cannot be accessed from within the micro-batch process. Instead, we can access the local Spark session from the micro-batch DataFrame.
For clusters with Databricks Runtime version below 10.5, the syntax to access the local spark session is:
microBatchDF._jdf.sparkSession().sql(sql_query)
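For context, here is a sketch of how this function would be wired into the streaming write (the source table name and checkpoint path are assumptions):
(spark.readStream
      .table("sales_bronze")                                    # assumed source table
      .writeStream
      .foreachBatch(upsert_data)                                # calls the function above for every micro-batch
      .option("checkpointLocation", "dbfs:/path/sales_checkpoint")
      .start())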
The data engineering team has a singleplex bronze table called ‘orders_raw’ where new orders data is appended every night. They created a new Silver table called ‘orders_cleaned’ in order to provide a more refined view of the orders data.
The team wants to create a batch processing pipeline to process all new records inserted in the orders_raw table and propagate them to the orders_cleaned table.
Which solution minimizes the compute costs to propagate this batch of data?
- Use time travel capabilities in Delta Lake to compare the latest version of orders_raw with one version prior, then write the difference to the orders_cleaned table
- Use Spark Structured Streaming to process the new records from orders_raw in batch mode using the trigger availableNow option
- Use Spark Structured Streaming foreachBatch logic to process the new records from orders_raw using trigger(processingTime="24 hours")
- Use batch overwrite logic to reprocess all records in orders_raw and overwrite the orders_cleaned table
- Use insert-only merge into the orders_cleaned table using orders_raw data based on a composite key
- Use Spark Structured Streaming to process the new records from orders_raw in batch mode using the trigger availableNow option
Databricks supports trigger(availableNow=True) for Delta Lake and Auto Loader sources. This functionality consumes all available records in an incremental batch.
There is also the trigger(once=True) option for incremental batch processing. However, this setting is now deprecated in the newer Databricks Runtime versions.
NOTE: You may still see this option in the current certification exam version. However, Databricks recommends you use trigger(availableNow=True) for all future incremental batch processing workloads.
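A sketch of the recommended pattern for this question's tables (the checkpoint path is illustrative):
(spark.readStream
      .table("orders_raw")
      .writeStream
      .option("checkpointLocation", "dbfs:/path/orders_checkpoint")
      .trigger(availableNow=True)      # consume all new records as one incremental batch, then stop
      .toTable("orders_cleaned"))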
The data engineering team has a Silver table called ‘sales_cleaned’ where new sales data is appended in near real-time.
They want to create a new Gold-layer entity against the ‘sales_cleaned’ table to calculate the year-to-date (YTD) of the sales amount. The new entity will have the following schema:
country_code STRING, category STRING, ytd_total_sales FLOAT, updated TIMESTAMP
It’s enough for these metrics to be recalculated once daily. But since they will be queried very frequently by several business teams, the data engineering team wants to cut down the potential costs and latency associated with materializing the results.
Which of the following solutions meets these requirements?
- Define the new entity as a view to avoid persisting the results each time the metrics are recalculated
- Define the new entity as a global temporary view since it can be shared between the notebooks or jobs that share computing resources
- Configure a nightly batch job to recalculate the metrics and store them as a table overwritten with each update
- Create multiple tables, one per business team so the metrics can be queried quickly and efficiently
- All the above solutions meet the requirements since Databricks uses the Delta Caching feature
- Configure a nightly batch job to recalculate the metrics and store them as a table overwritten with each update
Data engineers must understand how materializing results is different between views and tables on Databricks, and how to reduce total compute and storage cost associated with each materialization depending on the scenario.
Consider using a view when:
Your query is not complex. Because views are computed on demand, the view is re-computed every time it is queried, so frequently running complex queries with joins and subqueries increases compute costs.
You want to reduce storage costs. Views do not require additional storage resources.
Consider using a gold table when:
Multiple downstream queries consume the table, so you want to avoid re-computing complex ad-hoc queries every time.
Query results should be computed incrementally from a data source that is continuously or incrementally growing.
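A sketch of the nightly job's core statement; the Gold table name and the source columns used for the aggregation (sales_amount, sales_date) are assumptions, since the question only gives the output schema:
spark.sql("""
    CREATE OR REPLACE TABLE sales_ytd_gold AS
    SELECT country_code,
           category,
           SUM(sales_amount) AS ytd_total_sales,
           current_timestamp() AS updated
    FROM sales_cleaned
    WHERE YEAR(sales_date) = YEAR(current_date())
    GROUP BY country_code, category
""")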
A data engineer wants to calculate predictions using an MLflow model logged at a given "model_url". They want to register the model as a Spark UDF in order to apply it to a test dataset.
Which code block allows the data engineer to register the MLflow model as a Spark UDF?
- predict_udf = mlflow.pyfunc.spark_udf(spark, "model_url")
- predict_udf = mlflow.spark_udf(spark, "model_url")
- predict_udf = mlflow.udf(spark, "model_url")
- predict_udf = pyfunc.spark_udf(spark, "model_url")
- predict_udf = mlflow.pyfunc(spark, "model_url")
- predict_udf = mlflow.pyfunc.spark_udf(spark, "model_url")
The mlflow.pyfunc.spark_udf function allows you to register an MLflow model as an Apache Spark UDF. It needs at least two parameters:
spark: a SparkSession object
model_uri: the location, in URI format, of the MLflow model
Once the Spark UDF is created, it can be applied to a dataset to calculate the predictions:
predict_udf = mlflow.pyfunc.spark_udf(spark, "model_url")
pred_df = data_df.withColumn("prediction", predict_udf(*column_list))
“A Delta Lake functionality that automatically compacts small files during individual writes to a table by performing two complementary operations on the table”
Which of the following is being described in the above statement?
- Optimized writes
- Auto compaction
- Auto Optimize
- OPTIMIZE command
- REORG TABLE command
- Auto Optimize
Auto Optimize is a functionality that allows Delta Lake to automatically compact small data files of Delta tables. This can be achieved during individual writes to the Delta table.
Auto optimize consists of 2 complementary operations:
- Optimized writes: with this feature enabled, Databricks attempts to write out 128 MB files for each table partition.
- Auto compaction: this will check after an individual write, if files can further be compacted. If yes, it runs an OPTIMIZE job with 128 MB file sizes (instead of the 1 GB file size used in the standard OPTIMIZE)
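For reference, both behaviors can be enabled per table via table properties (the table name is illustrative; the property names are the documented Delta ones):
spark.sql("""
    ALTER TABLE sales_silver
    SET TBLPROPERTIES (
        delta.autoOptimize.optimizeWrite = true,
        delta.autoOptimize.autoCompact = true
    )
""")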
The data engineering team has a large external Delta table where new changes are merged very frequently. They enabled Optimized writes and Auto Compaction on the table in order to automatically compact small data files to target files of size 128 MB. However, when they look at the table directory, they see that most data files are smaller than 128 MB.
Which of the following likely explains these smaller file sizes?
- Optimized Writes and Auto Compaction have no effect on large Delta tables. The table needs to be partitioned so Auto Compaction can be applied at partition level.
- Optimized Writes and Auto Compaction have no effect on external tables. The table needs to be managed in order to store the information of file sizes in the Hive metastore.
- Optimized Writes and Auto Compaction automatically generate smaller data files to reduce the duration of future MERGE operations.
- Auto compaction supports Auto Z-Ordering which is more expensive than just compaction
- The team needs to look at the table’s auto_optimize directory, where all new compacted files are written
- Optimized Writes and Auto Compaction automatically generate smaller data files to reduce the duration of future MERGE operations.
Having many small files can help minimize rewrites during some operations like merges and deletes. For such operations, Databricks can automatically tune the file size of Delta tables. As a result, it can generate data files smaller than the default 128 MB, which helps reduce the duration of future MERGE operations.
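Related to this, a merge-heavy table can also request this tuning explicitly (a hedged sketch; the table name is illustrative and the property is the documented delta.tuneFileSizesForRewrites flag):
spark.sql("""
    ALTER TABLE orders_external
    SET TBLPROPERTIES (delta.tuneFileSizesForRewrites = true)
""")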
Which statement regarding streaming state in Stream-Stream Joins is correct?
- Stream-Stream Joins are not stateful. Spark does not buffer past inputs as a streaming state for the input streams.
- Spark buffers past inputs as a streaming state only for the left input stream, so that it can match future right inputs with past left inputs.
- Spark buffers past inputs as a streaming state only for the right input stream, so that it can match future left inputs with past right inputs.
- Spark buffers past inputs as a streaming state for both input streams, so that it can match every future input with past inputs.
- Stream-Stream Joins do not support limiting the state information using watermarks.
- Spark buffers past inputs as a streaming state for both input streams, so that it can match every future input with past inputs.
When performing a stream-stream join, Spark buffers past inputs as a streaming state for both input streams, so that it can match every future input with past inputs. This state can be limited by using watermarks.
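A sketch of a stream-stream join with watermarks bounding that state (tables, columns, and time bounds are all illustrative):
from pyspark.sql import functions as F

impressions = (spark.readStream.table("impressions")
               .withWatermark("impression_time", "2 hours"))    # bound state on the left stream

clicks = (spark.readStream.table("clicks")
          .withWatermark("click_time", "3 hours"))              # bound state on the right stream

joined = impressions.join(
    clicks,
    F.expr("""
        click_ad_id = impression_ad_id AND
        click_time BETWEEN impression_time AND impression_time + interval 1 hour
    """))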
Which statement regarding static Delta tables in Stream-Static joins is correct?
- Static Delta tables must be small enough to be broadcasted to all worker nodes in the cluster
- Static Delta tables need to be partitioned in order to be used in stream-static join
- Static Delta tables need to be refreshed with REFRESH TABLE command for each microbatch of a stream-static join
- The latest version of the static Delta table is returned each time it is queried by a microbatch of the stream-static join
- The latest version of the static Delta table is returned only for the first microbatch of the stream-static join. Then, it will be cached to be used by any upcoming microbatch.
- The latest version of the static Delta table is returned each time it is queried by a microbatch of the stream-static join
Stream-static joins take advantage of the Delta Lake guarantee that the latest version of the static Delta table is returned each time it is queried in a join operation with a data stream.
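A sketch of a stream-static join (table names, join key, and checkpoint path are illustrative): the streaming side drives each micro-batch, while the static Delta table is read at its latest version every time:
orders_stream = spark.readStream.table("orders_cleaned")    # streaming side
customers_static = spark.read.table("customers")            # static Delta table, re-queried per micro-batch

enriched = orders_stream.join(customers_static, on="customer_id", how="inner")

(enriched.writeStream
         .option("checkpointLocation", "dbfs:/path/enriched_checkpoint")
         .trigger(availableNow=True)
         .toTable("orders_enriched"))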
A data engineer has the following streaming query with a blank:
spark.readStream
    .table("orders_cleaned")
    ____________________________
    .groupBy(
        "order_timestamp",
        "author")
    .agg(
        count("order_id").alias("orders_count"),
        avg("quantity").alias("avg_quantity"))
    .writeStream
    .option("checkpointLocation", "dbfs:/path/checkpoint")
    .table("orders_stats")
For handling late-arriving data, they want to maintain the streaming state information for 30 minutes.
Which option correctly fills in the blank to meet this requirement?
- trigger(processingTime="30 minutes")
- .awaitTermination("order_timestamp", "30 minutes")
- .awaitWatermark("order_timestamp", "30 minutes")
- .withWatermark("order_timestamp", "30 minutes")
- .window("order_timestamp", "30 minutes")
- .withWatermark("order_timestamp", "30 minutes")
Given the following streaming query:
spark.readStream
    .table("orders_cleaned")
    .withWatermark("order_timestamp", "10 minutes")
    .groupBy(
        window("order_timestamp", "5 minutes").alias("time"),
        "author")
    .agg(
        count("order_id").alias("orders_count"),
        avg("quantity").alias("avg_quantity"))
    .writeStream
    .option("checkpointLocation", "dbfs:/path/checkpoint")
    .table("orders_stats")
Which of the following statements best describes this query?
- It calculates business-level aggregates for each non-overlapping ten-minute interval. Incremental state information is maintained for 5 minutes for late-arriving data.
- It calculates business-level aggregates for each non-overlapping five-minute interval. Incremental state information is maintained for 10 minutes for late-arriving data.
- It calculates business-level aggregates for each overlapping five-minute interval. Incremental state information is maintained for 10 minutes for late-arriving data.
- It calculates business-level aggregates for each overlapping ten-minute interval. Incremental state information is maintained for 5 minutes for late-arriving data.
- None of the above
- It calculates business-level aggregates for each non-overlapping five-minute interval. Incremental state information is maintained for 10 minutes for late-arriving data.
The pyspark.sql.functions.window function bucketizes rows into one or more time windows, given a timestamp column. In this query, we are performing aggregations per order_timestamp for each non-overlapping five-minute interval.
The pyspark.sql.DataFrame.withWatermark function allows you to track state information only for a window of time in which we expect records could be delayed. Here, we define a watermark of 10 minutes.