SET - 2 Flashcards
(50 cards)
A data engineer has realized that the data files associated with a Delta table are incredibly small. They want to compact the small files to form larger files to improve performance.
Which of the following keywords can be used to compact the small files?
A. REDUCE
B. OPTIMIZE
C. COMPACTION
D. REPARTITION
E. VACUUM
B. OPTIMIZE
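A hedged sketch of the fix, run from Python via Spark SQL (my_table is a hypothetical table name):
# Compact the small files backing a Delta table into larger ones.
spark.sql("OPTIMIZE my_table")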
In which of the following file formats is data from Delta Lake tables primarily stored?
A. Delta
B. CSV
C. Parquet
D. JSON
E. A proprietary, optimized format specific to Databricks
C. Parquet
Which of the following is stored in the Databricks customer’s cloud account?
A. Databricks web application
B. Cluster management metadata
C. Repos
D. Data
E. Notebooks
D. Data
Which of the following can be used to simplify and unify siloed data architectures that are specialized for specific use cases?
A. None of these
B. Data lake
C. Data warehouse
D. All of these
E. Data lakehouse
E. Data lakehouse
A data engineer has a Python notebook in Databricks, but they need to use SQL to accomplish a specific task within a cell. They still want all of the other cells to use Python without making any changes to those cells.
Which of the following describes how the data engineer can use SQL within a cell of their Python notebook?
A. It is not possible to use SQL in a Python notebook
B. They can attach the cell to a SQL endpoint rather than a Databricks cluster
C. They can simply write SQL syntax in the cell
D. They can add %sql to the first line of the cell
E. They can change the default language of the notebook to SQL
D. They can add %sql to the first line of the cell
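For illustration, a single cell of a Python notebook can be switched to SQL like this (the table and column names are hypothetical):
%sql
SELECT customer_id, spend FROM my_table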
Which of the following SQL keywords can be used to convert a table from a long format to a wide format?
A. TRANSFORM
B. PIVOT
C. SUM
D. CONVERT
E. WHERE
B. PIVOT
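A minimal PIVOT sketch, assuming a hypothetical long-format sales table with year, product, and revenue columns:
# One output column per product value converts long format to wide format.
spark.sql("""
    SELECT *
    FROM (SELECT year, product, revenue FROM sales)
    PIVOT (SUM(revenue) FOR product IN ('A', 'B'))
""")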
Which of the following describes a benefit of creating an external table from Parquet rather than CSV when using a CREATE TABLE AS SELECT statement?
A. Parquet files can be partitioned
B. CREATE TABLE AS SELECT statements cannot be used on files
C. Parquet files have a well-defined schema
D. Parquet files have the ability to be optimized
E. Parquet files will become Delta tables
C. Parquet files have a well-defined schema
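A minimal CTAS sketch, assuming a hypothetical Parquet directory; because Parquet embeds its schema, no schema declaration or parsing options are needed:
spark.sql("""
    CREATE TABLE my_table AS
    SELECT * FROM parquet.`/path/to/parquet`
""")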
A data engineer wants to create a relational object by pulling data from two tables. The relational object does not need to be used by other data engineers in other sessions. In order to save on storage costs, the data engineer wants to avoid copying and storing physical data.
Which of the following relational objects should the data engineer create?
A. Spark SQL Table
B. View
C. Database
D. Temporary view
E. Delta Table
D. Temporary view
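A sketch, assuming hypothetical tables table_a and table_b; a temporary view stores only the query definition, keeps no physical copy of the data, and disappears when the session ends:
spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW combined AS
    SELECT a.id, b.spend
    FROM table_a a
    JOIN table_b b ON a.id = b.id
""")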
A data analyst has developed a query that runs against a Delta table. They want help from the data engineering team to implement a series of tests to ensure the data returned by the query is clean. However, the data engineering team uses Python for its tests rather than SQL.
Which of the following operations could the data engineering team use to run the query and operate with the results in PySpark?
A. SELECT * FROM sales
B. spark.delta.table
C. spark.sql
D. There is no way to share data between PySpark and SQL.
E. spark.table
C. spark.sql
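A minimal sketch of wrapping the analyst's query in PySpark (the sales table name comes from the card; the check itself is hypothetical):
# spark.sql returns a DataFrame, so the results can be tested in Python.
df = spark.sql("SELECT * FROM sales")
assert df.count() > 0  # hypothetical data-quality test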
Which of the following commands will return the number of null values in the member_id column?
A. SELECT count(member_id) FROM my_table;
B. SELECT count(member_id) - count_null(member_id) FROM my_table;
C. SELECT count_if(member_id IS NULL) FROM my_table;
D. SELECT null(member_id) FROM my_table;
E. SELECT count_null(member_id) FROM my_table;
C. SELECT count_if(member_id IS NULL) FROM my_table;
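For cross-checking: count(col) skips nulls in standard SQL, so the same figure can be computed two ways (my_table and member_id come from the card):
spark.sql("""
    SELECT count_if(member_id IS NULL) AS null_count,
           count(*) - count(member_id) AS null_count_check
    FROM my_table
""")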
A data engineer has a Python variable table_name that they would like to use in a SQL query. They want to construct a Python code block that will run the query using table_name.
They have the following incomplete code block:
____(f"SELECT customer_id, spend FROM {table_name}")
Which of the following can be used to fill in the blank to successfully complete the task?
A. spark.delta.sql
B. spark.delta.table
C. spark.table
D. dbutils.sql
E. spark.sql
E. spark.sql
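The completed block, with a hypothetical value for table_name:
table_name = "customers"  # hypothetical table name
df = spark.sql(f"SELECT customer_id, spend FROM {table_name}")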
A data engineer has created a new database using the following command:
CREATE DATABASE IF NOT EXISTS customer360;
In which of the following locations will the customer360 database be located?
A. dbfs:/user/hive/database/customer360
B. dbfs:/user/hive/warehouse
C. dbfs:/user/hive/customer360
D. More information is needed to determine the correct response
E. dbfs:/user/hive/database
B. dbfs:/user/hive/warehouse
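The location can be verified with DESCRIBE DATABASE; without a LOCATION clause, the database directory (customer360.db) is created under the default warehouse path:
spark.sql("DESCRIBE DATABASE customer360").show(truncate=False)
# The Location row reads dbfs:/user/hive/warehouse/customer360.db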
A data engineer is attempting to drop a Spark SQL table my_table and runs the following command:
DROP TABLE IF EXISTS my_table;
After running this command, the engineer notices that the data files and metadata files have been deleted from the file system.
Which of the following describes why all of these files were deleted?
A. The table was managed
B. The table’s data was smaller than 10 GB
C. The table’s data was larger than 10 GB
D. The table was external
E. The table did not have a location
A. The table was managed
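Whether a table is managed can be confirmed before dropping it (my_table comes from the card):
# The Type row reads MANAGED or EXTERNAL; dropping a managed table
# removes both its metadata and its underlying data files.
spark.sql("DESCRIBE EXTENDED my_table").show(truncate=False)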
In which of the following scenarios should a data engineer use the MERGE INTO command instead of the INSERT INTO command?
A. When the location of the data needs to be changed
B. When the target table is an external table
C. When the source table can be deleted
D. When the target table cannot contain duplicate records
E. When the source is not a Delta table
D. When the target table cannot contain duplicate records
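A minimal MERGE sketch, assuming hypothetical target and updates tables keyed on id; matched rows are updated in place rather than inserted again, which is what prevents duplicates:
spark.sql("""
    MERGE INTO target t
    USING updates s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")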
A data engineer needs to create a table in Databricks using data from a CSV file at location /path/to/csv.
They run the following command:
[Image: CREATE TABLE statement with a blank clause]
Which of the following lines of code fills in the above blank to successfully complete the task?
A. None of these lines of code are needed to successfully complete the task
B. USING CSV
C. FROM CSV
D. USING DELTA
E. FROM "path/to/csv"
B. USING CSV
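A sketch of the completed command; the path comes from the card, while the column list and option are hypothetical:
spark.sql("""
    CREATE TABLE my_csv_table
    (id INT, name STRING)       -- hypothetical schema
    USING CSV
    OPTIONS (header = 'true')   -- hypothetical option
    LOCATION '/path/to/csv'
""")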
A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.
The code block used by the data engineer is below:
[Image: Structured Streaming read/write code with a blank in the trigger clause]
If the data engineer only wants the query to process all of the available data in as many batches as required, which of the following lines of code should the data engineer use to fill in the blank?
A. processingTime(1)
B. trigger(availableNow=True)
C. trigger(parallelBatch=True)
D. trigger(processingTime="once")
E. trigger(continuous="once")
B. trigger(availableNow=True)
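A hedged sketch of the streaming job, with hypothetical table and checkpoint names:
(spark.readStream.table("source_table")
    .writeStream
    .trigger(availableNow=True)  # process all available data in as many batches as needed, then stop
    .option("checkpointLocation", "/path/to/checkpoint")
    .toTable("target_table"))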
A data engineer has developed a data pipeline to ingest data from a JSON source using Auto Loader, but the engineer has not provided any type inference or schema hints in their pipeline. Upon reviewing the data, the data engineer has noticed that all of the columns in the target table are of the string type despite some of the fields only including float or boolean values.
Which of the following describes why Auto Loader inferred all of the columns to be of the string type?
A. There was a type mismatch between the specific schema and the inferred schema
B. JSON data is a text-based format
C. Auto Loader only works with string data
D. All of the fields had at least one null value
E. Auto Loader cannot infer the schema of ingested data
B. JSON data is a text-based format
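If typed columns are wanted, Auto Loader supports column-type inference and explicit schema hints; a sketch with hypothetical paths and field names:
(spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.inferColumnTypes", "true")                    # infer types instead of defaulting to string
    .option("cloudFiles.schemaHints", "price FLOAT, active BOOLEAN")  # or declare them explicitly
    .option("cloudFiles.schemaLocation", "/path/to/schema")
    .load("/path/to/source"))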
A Delta Live Tables pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE.
The pipeline is configured to run in Development mode using Continuous Pipeline Mode.
Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after clicking Start to update the pipeline?
A. All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated.
B. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist until the pipeline is shut down.
C. All datasets will be updated once and the pipeline will persist without any processing. The compute resources will persist but go unused.
D. All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.
E. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.
E. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.
Which of the following data workloads will utilize a Gold table as its source?
A. A job that enriches data by parsing its timestamps into a human-readable format
B. A job that aggregates uncleaned data to create standard summary statistics
C. A job that cleans data by removing malformatted records
D. A job that queries aggregated data designed to feed into a dashboard
E. A job that ingests raw data from a streaming source into the Lakehouse
D. A job that queries aggregated data designed to feed into a dashboard
Which of the following must be specified when creating a new Delta Live Tables pipeline?
A. A key-value pair configuration
B. The preferred DBU/hour cost
C. A path to cloud storage location for the written data
D. A location of a target database for the written data
E. At least one notebook library to be executed
E. At least one notebook library to be executed