SET - 3 Flashcards

(50 cards)

1
Q

A data engineer has realized that the data files associated with a Delta table are incredibly small. They want to compact the small files to form larger files to improve performance.

Which keyword can be used to compact the small files?

A. OPTIMIZE
B. VACUUM
C. COMPACTION
D. REPARTITION

A

A. OPTIMIZE
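
For example, compaction could be run from a notebook like this (a minimal sketch; the table name and ZORDER column are hypothetical):

spark.sql("OPTIMIZE transactions")                          # compact small files into larger ones
spark.sql("OPTIMIZE transactions ZORDER BY (customer_id)")  # optionally co-locate related data while compacting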

2
Q

….

A
3
Q

A data engineer runs a statement every day to copy the previous day’s sales into the table transactions. Each day’s sales are in their own file in the location “/transactions/raw”.

Today, the data engineer runs the following command to complete this task:

After running the command today, the data engineer notices that the number of records in table transactions has not changed.

What explains why the statement might not have copied any new records into the table?

A. The format of the files to be copied were not included with the FORMAT_OPTIONS keyword.
B. The COPY INTO statement requires the table to be refreshed to view the copied rows.
C. The previous day’s file has already been copied into the table.
D. The PARQUET file format does not support COPY INTO.

A

C. The previous day’s file has already been copied into the table.
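
The command from the question is not reproduced above; a typical COPY INTO for this scenario might look like the sketch below (the FILEFORMAT is an assumption). Because COPY INTO is idempotent, files it has already loaded are skipped on later runs unless the copy option 'force' is set to 'true':

spark.sql("""
    COPY INTO transactions
    FROM '/transactions/raw'
    FILEFORMAT = PARQUET
""")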

4
Q

Which command can be used to write data into a Delta table while avoiding the writing of duplicate records?

A. DROP
B. INSERT
C. MERGE
D. APPEND

A

C. MERGE
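
A minimal sketch of an upsert that avoids writing duplicates by matching on a key (table and column names are hypothetical):

spark.sql("""
    MERGE INTO customers AS target
    USING customer_updates AS source
    ON target.customer_id = source.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")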

5
Q

A data analyst has created a Delta table sales that is used by the entire data analysis team. They want help from the data engineering team to implement a series of tests to ensure the data is clean. However, the data engineering team uses Python for its tests rather than SQL.

Which command could the data engineering team use to access sales in PySpark?

A. SELECT * FROM sales
B. spark.table(“sales”)
C. spark.sql(“sales”)
D. spark.delta.table(“sales”)

A

B. spark.table(“sales”)
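
For example, the data engineering team could load the table into a DataFrame and write a simple PySpark check (the column name is hypothetical):

df = spark.table("sales")
assert df.filter("customer_id IS NULL").count() == 0, "sales contains NULL customer_id values"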

6
Q

A data engineer has created a new database using the following command:

CREATE DATABASE IF NOT EXISTS customer360;

In which location will the customer360 database be located?

A. dbfs:/user/hive/database/customer360
B. dbfs:/user/hive/warehouse
C. dbfs:/user/hive/customer360
D. dbfs:/user/hive/database

A

B. dbfs:/user/hive/warehouse
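
The location can be confirmed with DESCRIBE DATABASE; by default the database is created as a customer360.db directory under the warehouse path:

spark.sql("DESCRIBE DATABASE customer360").show(truncate=False)
# The Location row resolves under dbfs:/user/hive/warehouse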

7
Q

A data engineer is attempting to drop a Spark SQL table my_table and runs the following command:

DROP TABLE IF EXISTS my_table;

After running this command, the engineer notices that the data files and metadata files have been deleted from the file system.

What is the reason behind the deletion of all these files?

A. The table was managed
B. The table’s data was smaller than 10 GB
C. The table did not have a location
D. The table was external

A

A. The table was managed
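
A managed table (created without a LOCATION clause) has both its metadata and its data files removed on DROP TABLE, whereas an external table keeps its data files. A minimal sketch (table names and the path are hypothetical):

spark.sql("CREATE TABLE managed_example (id INT)")                                # managed table
spark.sql("CREATE TABLE external_example (id INT) LOCATION '/mnt/raw/example'")   # external table
spark.sql("DROP TABLE managed_example")     # metadata and data files are deleted
spark.sql("DROP TABLE external_example")    # only metadata is deleted; files under /mnt/raw/example remain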

8
Q

A data engineer needs to create a table in Databricks using data from a CSV file at location /path/to/csv.

They run the following command:

Which of the following lines of code fills in the above blank to successfully complete the task?

A. FROM “path/to/csv”
B. USING CSV
C. FROM CSV
D. USING DELTA

A

B. USING CSV
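
A fuller version of the statement might look like the sketch below; only USING CSV and the path come from the question, while the table name and OPTIONS are assumptions:

spark.sql("""
    CREATE TABLE sales_csv
    USING CSV
    OPTIONS (header = 'true', inferSchema = 'true')
    LOCATION '/path/to/csv'
""")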

9
Q

….

A
10
Q

Which SQL keyword can be used to convert a table from a long format to a wide format?

A. TRANSFORM
B. PIVOT
C. SUM
D. CONVERT

A

B. PIVOT
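
A minimal sketch of converting a long table to a wide one with PIVOT (the table, columns, and month values are hypothetical):

spark.sql("""
    SELECT *
    FROM (SELECT customer_id, month, spend FROM sales)
    PIVOT (SUM(spend) FOR month IN ('Jan', 'Feb', 'Mar'))
""")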

11
Q

A data engineer has a Python variable table_name that they would like to use in a SQL query. They want to construct a Python code block that will run the query using table_name.

They have the following incomplete code block:

____(f”SELECT customer_id, spend FROM {table_name}”)

What can be used to fill in the blank to successfully complete the task?

A. spark.delta.sql
B. spark.sql
C. spark.table
D. dbutils.sql

A

B. spark.sql
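
Filled in, the code block from the question would read as follows (assigning a sample value to table_name purely for illustration):

table_name = "sales"
df = spark.sql(f"SELECT customer_id, spend FROM {table_name}")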

12
Q

Image

A
13
Q

Image

A
14
Q

Image

A
15
Q

A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.

The code block used by the data engineer is below:

Which line of code should the data engineer use to fill in the blank if the data engineer only wants the query to execute a micro-batch to process data every 5 seconds?

A. trigger(“5 seconds”)
B. trigger(continuous=”5 seconds”)
C. trigger(once=”5 seconds”)
D. trigger(processingTime=”5 seconds”)

A

D. trigger(processingTime=”5 seconds”)
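
The original code block is not shown above; a minimal sketch of a streaming write using this trigger might look like the following (table names and checkpoint path are hypothetical):

(spark.readStream
    .table("events_bronze")
    .writeStream
    .trigger(processingTime="5 seconds")                       # run one micro-batch every 5 seconds
    .option("checkpointLocation", "/tmp/checkpoints/events")
    .toTable("events_silver"))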

16
Q

A data engineer is maintaining a data pipeline. Upon data ingestion, the data engineer notices that the source data is starting to have a lower level of quality. The data engineer would like to automate the process of monitoring the quality level.

Which of the following tools can the data engineer use to solve this problem?

A. Auto Loader
B. Unity Catalog
C. Delta Lake
D. Delta Live Tables

A

D. Delta Live Tables

17
Q

A data engineer has three tables in a Delta Live Tables (DLT) pipeline. They have configured the pipeline to drop invalid records at each table. They notice that some data is being dropped due to quality concerns at some point in the DLT pipeline. They would like to determine at which table in their pipeline the data is being dropped.

Which approach can the data engineer take to identify the table that is dropping the records?

A. They can set up separate expectations for each table when developing their DLT pipeline.
B. They can navigate to the DLT pipeline page, click on the “Error” button, and review the present errors.
C. They can set up DLT to notify them via email when records are dropped.
D. They can navigate to the DLT pipeline page, click on each table, and view the data quality statistics.

A

D. They can navigate to the DLT pipeline page, click on each table, and view the data quality statistics.

18
Q

What is used by Spark to record the offset range of the data being processed in each trigger in order for Structured Streaming to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing?

A. Checkpointing and Write-ahead Logs
B. Replayable Sources and Idempotent Sinks
C. Write-ahead Logs and Idempotent Sinks
D. Checkpointing and Idempotent Sinks

A

A. Checkpointing and Write-ahead Logs

19
Q

What describes the relationship between Gold tables and Silver tables?

A. Gold tables are more likely to contain aggregations than Silver tables.
B. Gold tables are more likely to contain valuable data than Silver tables.
C. Gold tables are more likely to contain a less refined view of data than Silver tables.
D. Gold tables are more likely to contain truthful data than Silver tables.

A

A. Gold tables are more likely to contain aggregations than Silver tables.

20
Q

What describes when to use the CREATE STREAMING LIVE TABLE (formerly CREATE INCREMENTAL LIVE TABLE) syntax over the CREATE LIVE TABLE syntax when creating Delta Live Tables (DLT) tables using SQL?

A. CREATE STREAMING LIVE TABLE should be used when the subsequent step in the DLT pipeline is static.
B. CREATE STREAMING LIVE TABLE should be used when data needs to be processed incrementally.
C. CREATE STREAMING LIVE TABLE should be used when data needs to be processed through complicated aggregations.
D. CREATE STREAMING LIVE TABLE should be used when the previous step in the DLT pipeline is static.

A

B. CREATE STREAMING LIVE TABLE should be used when data needs to be processed incrementally.
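
The question concerns the SQL syntax; a rough Python sketch of the same distinction in a DLT pipeline contrasts an incremental (streaming) read with a full read (dataset names, path, and columns are hypothetical):

import dlt

@dlt.table
def orders_bronze():
    # Incremental ingestion of new files -- the SQL counterpart is CREATE STREAMING LIVE TABLE
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/path/to/raw"))

@dlt.table
def orders_summary():
    # Full recomputation over the upstream dataset -- the SQL counterpart is CREATE LIVE TABLE
    return dlt.read("orders_bronze").groupBy("order_date").count()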

21
Q

A Delta Live Table pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE.

The pipeline is configured to run in Production mode using the Continuous Pipeline Mode.

What is the expected outcome after clicking Start to update the pipeline assuming previously unprocessed data exists and all definitions are valid?

A. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.
B. All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.
C. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped.
D. All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated.

A

C. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped.

22
Q

Which types of workloads are compatible with Auto Loader?

A. Streaming workloads
B. Machine learning workloads
C. Serverless workloads
D. Batch workloads

A

A. Streaming workloads

23
Q

A data engineer has developed a data pipeline to ingest data from a JSON source using Auto Loader, but the engineer has not provided any type inference or schema hints in their pipeline. Upon reviewing the data, the data engineer has noticed that all of the columns in the target table are of the string type despite some of the fields only including float or boolean values.

Why has Auto Loader inferred all of the columns to be of the string type?

A. Auto Loader cannot infer the schema of ingested data
B. JSON data is a text-based format
C. Auto Loader only works with string data
D. All of the fields had at least one null value

A

B. JSON data is a text-based format

24
Q

Which statement regarding the relationship between Silver tables and Bronze tables is always true?

A. Silver tables contain a less refined, less clean view of data than Bronze data.
B. Silver tables contain aggregates while Bronze data is unaggregated.
C. Silver tables contain more data than Bronze tables.
D. Silver tables contain less data than Bronze tables

A

D. Silver tables contain less data than Bronze tables.

25
Image
26
Q

A dataset has been defined using Delta Live Tables and includes an expectations clause:

CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION DROP ROW

What is the expected behavior when a batch of data containing data that violates these constraints is processed?

A. Records that violate the expectation cause the job to fail.
B. Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset.
C. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.
D. Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.

A

C. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.
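
In the Python DLT API the same behavior could be sketched with expect_or_drop (dataset names are hypothetical):

import dlt

@dlt.table
@dlt.expect_or_drop("valid_timestamp", "timestamp > '2020-01-01'")
def events_clean():
    # Rows failing the expectation are dropped; drop counts are recorded in the pipeline event log
    return dlt.read_stream("events_raw")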
27
Q

A data engineer has a Job with multiple tasks that runs nightly. Each of the tasks runs slowly because the clusters take a long time to start.

Which action can the data engineer perform to improve the start up time for the clusters used for the Job?

A. They can use endpoints available in Databricks SQL
B. They can use jobs clusters instead of all-purpose clusters
C. They can configure the clusters to autoscale for larger data sizes
D. They can use clusters that are from a cluster pool

A

D. They can use clusters that are from a cluster pool
28
Q

A data engineer has a single-task Job that runs each morning before they begin working. After identifying an upstream data issue, they need to set up another task to run a new notebook prior to the original task.

Which approach can the data engineer use to set up the new task?

A. They can clone the existing task in the existing Job and update it to run the new notebook.
B. They can create a new task in the existing Job and then add it as a dependency of the original task.
C. They can create a new task in the existing Job and then add the original task as a dependency of the new task.
D. They can create a new job from scratch and add both tasks to run concurrently.

A

B. They can create a new task in the existing Job and then add it as a dependency of the original task.
29
Q

A single Job runs two notebooks as two separate tasks. A data engineer has noticed that one of the notebooks is running slowly in the Job’s current run. The data engineer asks a tech lead for help in identifying why this might be the case.

Which approach can the tech lead use to identify why the notebook is running slowly as part of the Job?

A. They can navigate to the Runs tab in the Jobs UI to immediately review the processing notebook.
B. They can navigate to the Tasks tab in the Jobs UI and click on the active run to review the processing notebook.
C. They can navigate to the Runs tab in the Jobs UI and click on the active run to review the processing notebook.
D. They can navigate to the Tasks tab in the Jobs UI to immediately review the processing notebook.

A

C. They can navigate to the Runs tab in the Jobs UI and click on the active run to review the processing notebook.
30
Q

A data analysis team has noticed that their Databricks SQL queries are running too slowly when connected to their always-on SQL endpoint. They claim that this issue is present when many members of the team are running small queries simultaneously. They ask the data engineering team for help. The data engineering team notices that each of the team’s queries uses the same SQL endpoint.

Which approach can the data engineering team use to improve the latency of the team’s queries?

A. They can increase the cluster size of the SQL endpoint.
B. They can increase the maximum bound of the SQL endpoint’s scaling range.
C. They can turn on the Auto Stop feature for the SQL endpoint.
D. They can turn on the Serverless feature for the SQL endpoint.

A

B. They can increase the maximum bound of the SQL endpoint’s scaling range.
31
Q

A data engineer has been using a Databricks SQL dashboard to monitor the cleanliness of the input data to an ELT job. The ELT job has its own Databricks SQL query that returns the number of input records containing unexpected NULL values. The data engineer wants their entire team to be notified via a messaging webhook whenever this value reaches 100.

Which approach can the data engineer use to notify their entire team via a messaging webhook whenever the number of NULL values reaches 100?

A. They can set up an Alert with a custom template.
B. They can set up an Alert with a new email alert destination.
C. They can set up an Alert with a new webhook alert destination.
D. They can set up an Alert with one-time notifications.

A

C. They can set up an Alert with a new webhook alert destination.
32
Q

A data engineer wants to schedule their Databricks SQL dashboard to refresh once per day, but they only want the associated SQL endpoint to be running when it is necessary.

Which approach can the data engineer use to minimize the total running time of the SQL endpoint used in the refresh schedule of their dashboard?

A. They can ensure the dashboard’s SQL endpoint matches each of the queries’ SQL endpoints.
B. They can set up the dashboard’s SQL endpoint to be serverless.
C. They can turn on the Auto Stop feature for the SQL endpoint.
D. They can ensure the dashboard’s SQL endpoint is not one of the included query’s SQL endpoint.

A

C. They can turn on the Auto Stop feature for the SQL endpoint.
33
....
34
Q

A new data engineering team has been assigned to work on a project. The team will need access to database customers in order to see what tables already exist. The team has its own group, team.

Which command can be used to grant the necessary permission on the entire database to the new team?

A. GRANT VIEW ON CATALOG customers TO team;
B. GRANT CREATE ON DATABASE customers TO team;
C. GRANT USAGE ON CATALOG team TO customers;
D. GRANT USAGE ON DATABASE customers TO team;

A

D. GRANT USAGE ON DATABASE customers TO team;
35
Q

A new data engineering team, team, has been assigned to an ELT project. The new data engineering team will need full privileges on the table sales to fully manage the project.

Which command can be used to grant full permissions on the table to the new data engineering team?

A. GRANT ALL PRIVILEGES ON TABLE sales TO team;
B. GRANT SELECT CREATE MODIFY ON TABLE sales TO team;
C. GRANT SELECT ON TABLE sales TO team;
D. GRANT ALL PRIVILEGES ON TABLE team TO sales;

A

A. GRANT ALL PRIVILEGES ON TABLE sales TO team;
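
As a sketch, the grants from this card and the previous one could be issued from a notebook as follows (assuming the group is literally named team):

spark.sql("GRANT USAGE ON DATABASE customers TO `team`")     # lets the group access objects in the database
spark.sql("GRANT ALL PRIVILEGES ON TABLE sales TO `team`")   # full privileges on the sales table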
36
Q

Differentiate between all-purpose clusters and jobs clusters.

A data engineering team has created a Python notebook to load data from cloud storage. This job has been tested and now needs to be scheduled in production.

Which would be the best cluster to be used in this case?

A. All purpose cluster
B. Any Unity Catalog-enabled cluster
C. Jobs Cluster
D. Serverless SQL warehouse

A

C. Jobs Cluster
37
Q

Identify how the count_if function and the count where x is null can be used.

Consider a table random_values with the below data. What would be the output of the below query?

select count_if(col1 > 1) as count_a, count(*) as count_b, count(col1) as count_c from random_values

col1
0
1
2
NULL
2
3

A. 3 6 5
B. 4 6 5
C. 3 6 6
D. 4 6 6

A

A. 3 6 5
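
A small check of the result, assuming the column values listed above (a minimal sketch):

from pyspark.sql import Row

df = spark.createDataFrame([Row(col1=v) for v in [0, 1, 2, None, 2, 3]])
df.createOrReplaceTempView("random_values")

spark.sql("""
    SELECT count_if(col1 > 1) AS count_a,
           count(*)           AS count_b,
           count(col1)        AS count_c
    FROM random_values
""").show()
# count_a = 3 (rows with col1 > 1), count_b = 6 (all rows), count_c = 5 (non-NULL col1 values)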
38
Q

Which two components function in the Databricks platform architecture’s control plane? (Choose two.)

A. Virtual Machines
B. Compute Orchestration
C. Serverless Compute
D. Compute
E. Unity Catalog

A

B. Compute Orchestration
E. Unity Catalog
39
Q

In a healthcare provider organization using Delta Lake to store electronic health records (EHRs), a data analyst needs to analyze a snapshot of the patient_records table from two weeks ago before some recent data corrections were applied.

What approach should the Data Engineer take to allow the analyst to query that specific prior version?

A. Truncate the table to remove all data, then reload the data from two weeks ago into the truncated table for the analyst to query.
B. Identify the version number corresponding to two weeks ago from the Delta transaction log, share that version number with the analyst to query using VERSION AS OF syntax, or export that version to a new Delta table for the analyst to query.
C. Restore the table to the version from two weeks ago using the RESTORE command, and have the analyst query the restored table.
D. Use the VACUUM command to remove all versions of the table older than two weeks, then the analyst can query the remaining version.

A

B. Identify the version number corresponding to two weeks ago from the Delta transaction log, share that version number with the analyst to query using VERSION AS OF syntax, or export that version to a new Delta table for the analyst to query.
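
A sketch of the workflow described in answer B (the version number 42 is hypothetical; it would come from the DESCRIBE HISTORY output):

spark.sql("DESCRIBE HISTORY patient_records").show(truncate=False)    # find the version from ~two weeks ago

spark.sql("SELECT * FROM patient_records VERSION AS OF 42").show()    # query that snapshot directly

spark.sql("""
    CREATE TABLE patient_records_snapshot AS
    SELECT * FROM patient_records VERSION AS OF 42
""")                                                                   # or materialize it for the analyst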
40
Q

What can be used to simplify and unify siloed data architectures that are specialized for specific use cases?

A. Delta Lake
B. Data lake
C. Data warehouse
D. Data lakehouse

A

D. Data lakehouse
41
Q

A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.

The code block used by the data engineer is below:

The data engineer only wants the query to process all of the available data in as many batches as required.

Which line of code should the data engineer use to fill in the blank?

A. trigger(availableNow=True)
B. trigger(processingTime= “once”)
C. trigger(continuous= “once”)
D. trigger(once=True)

A

A. trigger(availableNow=True)
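
In contrast to the fixed-interval trigger in card 15, a sketch of a run that drains all available data in as many micro-batches as needed and then stops (names and path are hypothetical):

(spark.readStream
    .table("events_bronze")
    .writeStream
    .trigger(availableNow=True)                                  # process everything available, then stop
    .option("checkpointLocation", "/tmp/checkpoints/backfill")
    .toTable("events_silver"))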
42
Q

A data engineer and a data analyst are working together on a data pipeline. The data engineer is working on the raw, bronze, and silver layers of the pipeline using Python, and the data analyst is working on the gold layer of the pipeline using SQL. The raw source of the pipeline is a streaming input. They now want to migrate their pipeline to use Delta Live Tables.

Which of the following changes will need to be made to the pipeline when migrating to Delta Live Tables?

A. The pipeline can have different notebook sources in SQL & Python
B. The pipeline will need to be written entirely in SQL
C. The pipeline will need to use a batch source in place of a streaming source
D. The pipeline will need to be written entirely in Python

A

A. The pipeline can have different notebook sources in SQL & Python
43
Q

Identify a scenario to use an external table.

A Data Engineer needs to create a parquet bronze table and wants to ensure that it gets stored in a specific path in an external location.

Which table can be created in this scenario?

A. An external table where the location is pointing to specific path in external location.
B. An external table where the schema has managed location pointing to specific path in external location.
C. A managed table where the catalog has managed location pointing to specific path in external location.
D. A managed table where the location is pointing to specific path in external location.

A

A. An external table where the location is pointing to specific path in external location.
44
Q

Identify the impact of ON VIOLATION DROP ROW and ON VIOLATION FAIL UPDATE for a constraint violation.

A data engineer has created an ETL pipeline using Delta Live Tables to manage their company’s travel reimbursement detail. They want to ensure that if the location details have not been provided by the employee, the pipeline is terminated.

How can the scenario be implemented?

A. CONSTRAINT valid_location EXPECT (location = NULL)
B. CONSTRAINT valid_location EXPECT (location != NULL) ON VIOLATION FAIL UPDATE
C. CONSTRAINT valid_location EXPECT (location != NULL) ON DROP ROW
D. CONSTRAINT valid_location EXPECT (location != NULL) ON VIOLATION FAIL

A

B. CONSTRAINT valid_location EXPECT (location != NULL) ON VIOLATION FAIL UPDATE
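
A rough Python DLT sketch of the FAIL UPDATE behavior (dataset names are hypothetical; in the Python API the predicate is typically written with IS NOT NULL):

import dlt

@dlt.table
@dlt.expect_or_fail("valid_location", "location IS NOT NULL")
def reimbursements():
    # Any row violating the expectation causes the pipeline update to fail
    return dlt.read_stream("reimbursements_raw")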
45
Q

Which two conditions are applicable for governance in Databricks Unity Catalog? (Choose two.)

A. You can have more than 1 metastore within a databricks account console but only 1 per region.
B. Both catalog and schema must have a managed location in Unity Catalog provided metastore is not associated with a location
C. You can have multiple catalogs within metastore and 1 catalog can be associated with multiple metastore
D. If catalog is not associated with location, it’s mandatory to associate schema with managed locations
E. If metastore is not associated with location, it’s mandatory to associate catalog with managed locations

A

A. You can have more than 1 metastore within a databricks account console but only 1 per region.
E. If metastore is not associated with location, it’s mandatory to associate catalog with managed locations
46
Q

A data engineer needs to access the view created by the sales team, using a shared cluster. The data engineer has been provided usage permissions on the catalog and schema.

In order to access the view created by the sales team, what are the minimum permissions the data engineer would require in addition?

A. Needs SELECT permission on the VIEW and the underlying TABLE.
B. Needs SELECT permission only on the VIEW
C. Needs ALL PRIVILEGES on the VIEW
D. Needs ALL PRIVILEGES at the SCHEMA level

A

A. Needs SELECT permission on the VIEW and the underlying TABLE.
47
Q

Which method should a Data Engineer apply to ensure Workflows are being triggered on schedule?

A. Scheduled Workflows require an always-running cluster, which is more expensive but reduces processing latency.
B. Scheduled Workflows process data as it arrives at configured sources.
C. Scheduled Workflows can reduce resource consumption and expense since the cluster runs only long enough to execute the pipeline.
D. Scheduled Workflows run continuously until manually stopped.

A

C. Scheduled Workflows can reduce resource consumption and expense since the cluster runs only long enough to execute the pipeline.
48
Q

The Delta transaction log for the ‘students’ table is shown using the ‘DESCRIBE HISTORY students’ command. A Data Engineer needs to query the table as it existed before the UPDATE operation listed in the log.

Which commands should the Data Engineer use to achieve this? (Choose two.)

A. SELECT * FROM students@v4
B. SELECT * FROM students TIMESTAMP AS OF ‘2024-04-22T14:32:47.000+00:00’
C. SELECT * FROM students FROM HISTORY VERSION AS OF 3
D. SELECT * FROM students VERSION AS OF 5
E. SELECT * FROM students TIMESTAMP AS OF ‘2024-04-22T14:32:58.000+00:00’

A

A. SELECT * FROM students@v4
B. SELECT * FROM students TIMESTAMP AS OF ‘2024-04-22T14:32:47.000+00:00’
49
Q

An engineering manager uses a Databricks SQL query to monitor ingestion latency for each data source. The manager checks the results of the query every day, but they are manually rerunning the query each day and waiting for the results.

Which of the following approaches can the manager use to ensure the results of the query are updated each day?

A. They can schedule the query to refresh every 1 day from the SQL endpoint's page in Databricks SQL.
B. They can schedule the query to refresh every 12 hours from the SQL endpoint's page in Databricks SQL.
C. They can schedule the query to refresh every 1 day from the query's page in Databricks SQL.
D. They can schedule the query to run every 12 hours from the Jobs UI.

A

C. They can schedule the query to refresh every 1 day from the query's page in Databricks SQL.
50
Image