SET - 4 Flashcards
(23 cards)
A data engineer has realized that they made a mistake when making a daily update to a table. They need to use Delta time travel to restore the table to a version that is 3 days old. However, when the data engineer attempts to time travel to the older version, they are unable to restore the data because the data files have been deleted.
Which of the following explains why the data files are no longer present?
A. The VACUUM command was run on the table
B. The TIME TRAVEL command was run on the table
C. The DELETE HISTORY command was run on the table
D. The OPTIMIZE command was run on the table
A. The VACUUM command was run on the table
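A quick illustration of why VACUUM breaks time travel, assuming a hypothetical Delta table named events:

```sql
-- Time travel works only while the old version's data files still exist:
SELECT * FROM events VERSION AS OF 10;     -- or: TIMESTAMP AS OF '2024-01-01'

-- VACUUM permanently deletes files no longer referenced by the current
-- version and older than the retention threshold (default 7 days):
VACUUM events RETAIN 168 HOURS;

-- Once those files are gone, restoring an older version fails:
RESTORE TABLE events TO VERSION AS OF 10;
```

A 3-day-old version falls inside the default 7-day retention window, so the files only disappear if VACUUM was run with a shorter retention (or the table's retention settings were lowered).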
Which of the following describes the relationship between Bronze tables and raw data?
A. Bronze tables contain less data than raw data files.
B. Bronze tables contain more truthful data than raw data.
C. Bronze tables contain raw data with a schema applied.
D. Bronze tables contain a less refined view of data than raw data.
C. Bronze tables contain raw data with a schema applied.
A data engineer only wants to execute the final block of a Python program if the Python variable day_of_week is equal to 1 and the Python variable review_period is True.
Which of the following control flow statements should the data engineer use to begin this conditionally executed code block?
A. if day_of_week = 1 and review_period:
B. if day_of_week = 1 and review_period = "True":
C. if day_of_week = 1 & review_period: = "True":
D. if day_of_week == 1 and review_period:
D. if day_of_week == 1 and review_period:
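A minimal sketch of the correct option (the variable values are illustrative): equality is tested with ==, and a boolean like review_period is used directly as the condition rather than compared to the string "True".

```python
# Assumed example values: day 1 of the week, review period active.
day_of_week = 1
review_period = True

executed = False
if day_of_week == 1 and review_period:
    # The final block of the program would go here.
    executed = True

print(executed)
```

Options A–C fail because a single = is assignment (a SyntaxError inside an if condition) and because comparing a bool to the string "True" is always False.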
Which of the following must be specified when creating a new Delta Live Tables pipeline?
A. A key-value pair configuration
B. At least one notebook library to be executed
C. A path to cloud storage location for the written data
D. A location of a target database for the written data
B. At least one notebook library to be executed
In which of the following scenarios should a data engineer select a Task in the Depends On field of a new Databricks Job Task?
A. When another task needs to be replaced by the new task
B. When another task needs to successfully complete before the new task begins
C. When another task has the same dependency libraries as the new task
D. When another task needs to use as little compute resources as possible
B. When another task needs to successfully complete before the new task begins
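In the Jobs API (2.1-style JSON; the task names here are hypothetical), the Depends On field becomes a depends_on list, and the dependent task starts only after the listed task completes successfully:

```json
{
  "tasks": [
    { "task_key": "ingest" },
    {
      "task_key": "transform",
      "depends_on": [ { "task_key": "ingest" } ]
    }
  ]
}
```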
A data engineering team has two tables. The first table march_transactions is a collection of all retail transactions in the month of March. The second table april_transactions is a collection of all retail transactions in the month of April. There are no duplicate records between the tables.
Which of the following commands should be run to create a new table all_transactions that contains all records from march_transactions and april_transactions without duplicate records?
A. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
INNER JOIN SELECT * FROM april_transactions;
B. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
UNION SELECT * FROM april_transactions;
C. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
OUTER JOIN SELECT * FROM april_transactions;
D. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
INTERSECT SELECT * from april_transactions;
B. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
UNION SELECT * FROM april_transactions;
Which of the following Git operations must be performed outside of Databricks Repos?
A. Commit
B. Pull
C. Merge
D. Clone
C. Merge
A data engineer has joined an existing project and they see the following query in the project repository:
CREATE STREAMING LIVE TABLE loyal_customers AS
SELECT customer_id
FROM STREAM(LIVE.customers)
WHERE loyalty_level = 'high';
Which of the following describes why the STREAM function is included in the query?
A. The STREAM function is not needed and will cause an error.
B. The data in the customers table has been updated since its last run.
C. The customers table is a streaming live table.
D. The customers table is a reference to a Structured Streaming query on a PySpark DataFrame
C. The customers table is a streaming live table.
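For context, a sketch of what the upstream definition might look like (the source path and format are assumptions): STREAM(LIVE.customers) is only valid when customers is itself declared as a streaming live table.

```sql
-- Upstream: an incrementally ingested streaming live table.
CREATE OR REFRESH STREAMING LIVE TABLE customers
AS SELECT * FROM STREAM read_files('/source/customers/', format => 'json');

-- Downstream: reads the upstream table as a stream.
CREATE OR REFRESH STREAMING LIVE TABLE loyal_customers
AS SELECT customer_id
FROM STREAM(LIVE.customers)
WHERE loyalty_level = 'high';
```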
A data organization leader is upset about the data analysis team’s reports being different from the data engineering team’s reports. The leader believes the siloed nature of their organization’s data engineering and data analysis architectures is to blame.
Which of the following describes how a data lakehouse could alleviate this issue?
A. Both teams would respond more quickly to ad-hoc requests
B. Both teams would use the same source of truth for their work
C. Both teams would reorganize to report to the same department
D. Both teams would be able to collaborate on projects in real-time
B. Both teams would use the same source of truth for their work
A data analyst has developed a query that runs against a Delta table. They want help from the data engineering team to implement a series of tests to ensure the data returned by the query is clean. However, the data engineering team uses Python for its tests rather than SQL.
Which of the following operations could the data engineering team use to run the query and operate with the results in PySpark?
A. SELECT * FROM sales
B. spark.delta.table
C. spark.sql
D. spark.table
C. spark.sql
A data engineer has a Job that has a complex run schedule, and they want to transfer that schedule to other Jobs.
Rather than manually selecting each value in the scheduling form in Databricks, which of the following tools can the data engineer use to represent and submit the schedule programmatically?
A. pyspark.sql.types.DateType
B. datetime
C. pyspark.sql.types.TimestampType
D. Cron syntax
D. Cron syntax
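Databricks Jobs accept Quartz cron expressions (seconds, minutes, hours, day-of-month, month, day-of-week), which can be submitted through the Jobs API. A sketch of the schedule block (the expression below, 7:30 AM on weekdays, is illustrative):

```json
{
  "schedule": {
    "quartz_cron_expression": "0 30 7 ? * MON-FRI",
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED"
  }
}
```

The distractors are Python/PySpark date and timestamp types, which represent data values, not schedules.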
A data engineer and data analyst are working together on a data pipeline. The data engineer is working on the raw, bronze, and silver layers of the pipeline using Python, and the data analyst is working on the gold layer of the pipeline using SQL. The raw source of the pipeline is a streaming input. They now want to migrate their pipeline to use Delta Live Tables.
Which of the following changes will need to be made to the pipeline when migrating to Delta Live Tables?
A. The pipeline will need to be written entirely in Python
B. The pipeline will need to stop using the medallion-based multi-hop architecture
C. The pipeline will need to be written entirely in SQL
D. The pipeline will need to use a batch source in place of a streaming source
D. The pipeline will need to use a batch source in place of a streaming source
A data engineer needs access to a table new_table, but they do not have the correct permissions. They can ask the table owner for permission, but they do not know who the table owner is.
Which of the following approaches can be used to identify the owner of new_table?
A. Review the Permissions tab in the table’s page in Data Explorer
B. There is no way to identify the owner of the table
C. Review the Owner field in the table’s page in Data Explorer
D. Review the Owner field in the table’s page in the cloud storage solution
C. Review the Owner field in the table’s page in Data Explorer
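The same information is also available from SQL, assuming the table is reachable in the current schema:

```sql
-- The detailed output includes an Owner row in the table metadata.
DESCRIBE TABLE EXTENDED new_table;
```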
In which of the following scenarios should a data engineer use the MERGE INTO command instead of the INSERT INTO command?
A. When the location of the data needs to be changed
B. When the target table is an external table
C. When the source is not a Delta table
D. When the target table cannot contain duplicate records
D. When the target table cannot contain duplicate records
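A sketch of the upsert pattern MERGE INTO enables (the table and column names are hypothetical): matched rows are updated in place and only unmatched rows are inserted, so the target never accumulates duplicate keys the way repeated INSERT INTO runs would.

```sql
MERGE INTO customers AS t
USING updates AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```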
A data engineer is designing a data pipeline. The source system generates files in a shared directory that is also used by other processes. As a result, the files should be kept as is and will accumulate in the directory. The data engineer needs to identify which files are new since the previous run in the pipeline, and set up the pipeline to only ingest those new files with each run.
Which of the following tools can the data engineer use to solve this problem?
A. Unity Catalog
B. Delta Lake
C. Databricks SQL
D. Auto Loader
D. Auto Loader
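Auto Loader tracks which files in the directory have already been ingested and picks up only new arrivals on each run. A SQL sketch using a streaming table (the landing path and format are assumptions):

```sql
CREATE OR REFRESH STREAMING TABLE ingested_files
AS SELECT *
FROM STREAM read_files('/shared/landing/', format => 'json');
```

The source files are left in place, which matches the requirement that the shared directory not be modified.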
Which of the following commands will return the number of null values in the member_id column?
A. SELECT count(member_id) FROM my_table;
B. SELECT count(member_id) - count_null(member_id) FROM my_table;
C. SELECT count_if(member_id IS NULL) FROM my_table;
D. SELECT null(member_id) FROM my_table;
C. SELECT count_if(member_id IS NULL) FROM my_table;
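It can help to see why option A fails: count(member_id) skips NULLs, so it counts the non-null values. Assuming a table my_table, the two expressions below should agree:

```sql
SELECT
  count_if(member_id IS NULL)   AS nulls_via_count_if,
  count(*) - count(member_id)   AS nulls_via_count_diff
FROM my_table;
```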
Which tool is used by Auto Loader to process data incrementally?
A. Checkpointing
B. Spark Structured Streaming
C. Databricks SQL
D. Unity Catalog
B. Spark Structured Streaming
Which of the following benefits is provided by the array functions from Spark SQL?
A. An ability to work with data in a variety of types at once
B. An ability to work with data within certain partitions and windows
C. An ability to work with time-related data in specified intervals
D. An ability to work with complex, nested data ingested from JSON files
D. An ability to work with complex, nested data ingested from JSON files
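A sketch of a few Spark SQL array functions on nested JSON data (the orders table and its items array column are hypothetical):

```sql
-- e.g. rows ingested from JSON like {"order_id": 1, "items": ["a", "b"]}
SELECT
  order_id,
  explode(items)             AS item,        -- one output row per array element
  array_contains(items, 'a') AS has_item_a,  -- membership test
  size(items)                AS item_count   -- array length
FROM orders;
```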