SET - 2 Flashcards
(50 cards)
A data engineer has realized that the data files associated with a Delta table are incredibly small. They want to compact the small files to form larger files to improve performance.
Which of the following keywords can be used to compact the small files?
A. REDUCE
B. OPTIMIZE
C. COMPACTION
D. REPARTITION
E. VACUUM
B. OPTIMIZE
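A hedged sketch of the fix, run from Python via Spark SQL (my_table is a hypothetical table name):
# Compact the small files backing a Delta table into larger ones.
spark.sql("OPTIMIZE my_table")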
In which of the following file formats is data from Delta Lake tables primarily stored?
A. Delta
B. CSV
C. Parquet
D. JSON
E. A proprietary, optimized format specific to Databricks
C. Parquet
Which of the following is stored in the Databricks customer’s cloud account?
A. Databricks web application
B. Cluster management metadata
C. Repos
D. Data
E. Notebooks
D. Data
Which of the following can be used to simplify and unify siloed data architectures that are specialized for specific use cases?
A. None of these
B. Data lake
C. Data warehouse
D. All of these
E. Data lakehouse
E. Data lakehouse
A data engineer has a Python notebook in Databricks, but they need to use SQL to accomplish a specific task within a cell. They still want all of the other cells to use Python without making any changes to those cells.
Which of the following describes how the data engineer can use SQL within a cell of their Python notebook?
A. It is not possible to use SQL in a Python notebook
B. They can attach the cell to a SQL endpoint rather than a Databricks cluster
C. They can simply write SQL syntax in the cell
D. They can add %sql to the first line of the cell
E. They can change the default language of the notebook to SQL
D. They can add %sql to the first line of the cell
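For illustration, a single cell of a Python notebook can be switched to SQL like this (the table and column names are hypothetical):
%sql
SELECT customer_id, spend FROM my_table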
Which of the following SQL keywords can be used to convert a table from a long format to a wide format?
A. TRANSFORM
B. PIVOT
C. SUM
D. CONVERT
E. WHERE
B. PIVOT
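A minimal PIVOT sketch, assuming a hypothetical long-format sales table with year, product, and revenue columns:
# One output column per product value converts long format to wide format.
spark.sql("""
    SELECT *
    FROM (SELECT year, product, revenue FROM sales)
    PIVOT (SUM(revenue) FOR product IN ('A', 'B'))
""")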
Which of the following describes a benefit of creating an external table from Parquet rather than CSV when using a CREATE TABLE AS SELECT statement?
A. Parquet files can be partitioned
B. CREATE TABLE AS SELECT statements cannot be used on files
C. Parquet files have a well-defined schema
D. Parquet files have the ability to be optimized
E. Parquet files will become Delta tables
C. Parquet files have a well-defined schema
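A minimal CTAS sketch, assuming a hypothetical Parquet directory; because Parquet embeds its schema, no schema declaration or parsing options are needed:
spark.sql("""
    CREATE TABLE my_table AS
    SELECT * FROM parquet.`/path/to/parquet`
""")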
A data engineer wants to create a relational object by pulling data from two tables. The relational object does not need to be used by other data engineers in other sessions. In order to save on storage costs, the data engineer wants to avoid copying and storing physical data.
Which of the following relational objects should the data engineer create?
A. Spark SQL Table
B. View
C. Database
D. Temporary view
E. Delta Table
D. Temporary view
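A sketch, assuming hypothetical tables table_a and table_b; a temporary view stores only the query definition, keeps no physical copy of the data, and disappears when the session ends:
spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW combined AS
    SELECT a.id, b.spend
    FROM table_a a
    JOIN table_b b ON a.id = b.id
""")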
A data analyst has developed a query that runs against a Delta table. They want help from the data engineering team to implement a series of tests to ensure the data returned by the query is clean. However, the data engineering team uses Python for its tests rather than SQL.
Which of the following operations could the data engineering team use to run the query and operate with the results in PySpark?
A. SELECT * FROM sales
B. spark.delta.table
C. spark.sql
D. There is no way to share data between PySpark and SQL.
E. spark.table
C. spark.sql
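A minimal sketch of wrapping the analyst's query in PySpark (the sales table name comes from the card; the check itself is hypothetical):
# spark.sql returns a DataFrame, so the results can be tested in Python.
df = spark.sql("SELECT * FROM sales")
assert df.count() > 0  # hypothetical data-quality test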
Which of the following commands will return the number of null values in the member_id column?
A. SELECT count(member_id) FROM my_table;
B. SELECT count(member_id) - count_null(member_id) FROM my_table;
C. SELECT count_if(member_id IS NULL) FROM my_table;
D. SELECT null(member_id) FROM my_table;
E. SELECT count_null(member_id) FROM my_table;
C. SELECT count_if(member_id IS NULL) FROM my_table;
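For cross-checking: count(col) skips nulls in standard SQL, so the same figure can be computed two ways (my_table and member_id come from the card):
spark.sql("""
    SELECT count_if(member_id IS NULL) AS null_count,
           count(*) - count(member_id) AS null_count_check
    FROM my_table
""")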
A data engineer has a Python variable table_name that they would like to use in a SQL query. They want to construct a Python code block that will run the query using table_name.
They have the following incomplete code block:
____(f"SELECT customer_id, spend FROM {table_name}")
Which of the following can be used to fill in the blank to successfully complete the task?
A. spark.delta.sql
B. spark.delta.table
C. spark.table
D. dbutils.sql
E. spark.sql
E. spark.sql
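The completed block, with a hypothetical value for table_name:
table_name = "customers"  # hypothetical table name
df = spark.sql(f"SELECT customer_id, spend FROM {table_name}")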
A data engineer has created a new database using the following command:
CREATE DATABASE IF NOT EXISTS customer360;
In which of the following locations will the customer360 database be located?
A. dbfs:/user/hive/database/customer360
B. dbfs:/user/hive/warehouse
C. dbfs:/user/hive/customer360
D. More information is needed to determine the correct response
E. dbfs:/user/hive/database
B. dbfs:/user/hive/warehouse
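The location can be verified with DESCRIBE DATABASE; without a LOCATION clause, the database directory (customer360.db) is created under the default warehouse path:
spark.sql("DESCRIBE DATABASE customer360").show(truncate=False)
# The Location row reads dbfs:/user/hive/warehouse/customer360.db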
A data engineer is attempting to drop a Spark SQL table my_table and runs the following command:
DROP TABLE IF EXISTS my_table;
After running this command, the engineer notices that the data files and metadata files have been deleted from the file system.
Which of the following describes why all of these files were deleted?
A. The table was managed
B. The table’s data was smaller than 10 GB
C. The table’s data was larger than 10 GB
D. The table was external
E. The table did not have a location
A. The table was managed
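Whether a table is managed can be confirmed before dropping it (my_table comes from the card):
# The Type row reads MANAGED or EXTERNAL; dropping a managed table
# removes both its metadata and its underlying data files.
spark.sql("DESCRIBE EXTENDED my_table").show(truncate=False)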
In which of the following scenarios should a data engineer use the MERGE INTO command instead of the INSERT INTO command?
A. When the location of the data needs to be changed
B. When the target table is an external table
C. When the source table can be deleted
D. When the target table cannot contain duplicate records
E. When the source is not a Delta table
D. When the target table cannot contain duplicate records
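A minimal MERGE sketch, assuming hypothetical target and updates tables keyed on id; matched rows are updated in place rather than inserted again, which is what prevents duplicates:
spark.sql("""
    MERGE INTO target t
    USING updates s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")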
A data engineer needs to create a table in Databricks using data from a CSV file at location /path/to/csv.
They run the following command:
[Image: CREATE TABLE statement with a blank clause]
Which of the following lines of code fills in the above blank to successfully complete the task?
A. None of these lines of code are needed to successfully complete the task
B. USING CSV
C. FROM CSV
D. USING DELTA
E. FROM "path/to/csv"
B. USING CSV
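A sketch of the completed command; the path comes from the card, while the column list and option are hypothetical:
spark.sql("""
    CREATE TABLE my_csv_table
    (id INT, name STRING)       -- hypothetical schema
    USING CSV
    OPTIONS (header = 'true')   -- hypothetical option
    LOCATION '/path/to/csv'
""")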
A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.
The code block used by the data engineer is below:
[Image: Structured Streaming read/write code with a blank in the trigger clause]
If the data engineer only wants the query to process all of the available data in as many batches as required, which of the following lines of code should the data engineer use to fill in the blank?
A. processingTime(1)
B. trigger(availableNow=True)
C. trigger(parallelBatch=True)
D. trigger(processingTime="once")
E. trigger(continuous="once")
B. trigger(availableNow=True)
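A hedged sketch of the streaming job, with hypothetical table and checkpoint names:
(spark.readStream.table("source_table")
    .writeStream
    .trigger(availableNow=True)  # process all available data in as many batches as needed, then stop
    .option("checkpointLocation", "/path/to/checkpoint")
    .toTable("target_table"))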
A data engineer has developed a data pipeline to ingest data from a JSON source using Auto Loader, but the engineer has not provided any type inference or schema hints in their pipeline. Upon reviewing the data, the data engineer has noticed that all of the columns in the target table are of the string type despite some of the fields only including float or boolean values.
Which of the following describes why Auto Loader inferred all of the columns to be of the string type?
A. There was a type mismatch between the specific schema and the inferred schema
B. JSON data is a text-based format
C. Auto Loader only works with string data
D. All of the fields had at least one null value
E. Auto Loader cannot infer the schema of ingested data
B. JSON data is a text-based format
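If typed columns are wanted, Auto Loader supports column-type inference and explicit schema hints; a sketch with hypothetical paths and field names:
(spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.inferColumnTypes", "true")                    # infer types instead of defaulting to string
    .option("cloudFiles.schemaHints", "price FLOAT, active BOOLEAN")  # or declare them explicitly
    .option("cloudFiles.schemaLocation", "/path/to/schema")
    .load("/path/to/source"))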
A Delta Live Tables pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE.
The pipeline is configured to run in Development mode using Continuous Pipeline Mode.
Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after clicking Start to update the pipeline?
A. All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated.
B. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist until the pipeline is shut down.
C. All datasets will be updated once and the pipeline will persist without any processing. The compute resources will persist but go unused.
D. All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing.
E. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.
E. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing.
Which of the following data workloads will utilize a Gold table as its source?
A. A job that enriches data by parsing its timestamps into a human-readable format
B. A job that aggregates uncleaned data to create standard summary statistics
C. A job that cleans data by removing malformatted records
D. A job that queries aggregated data designed to feed into a dashboard
E. A job that ingests raw data from a streaming source into the Lakehouse
D. A job that queries aggregated data designed to feed into a dashboard
Which of the following must be specified when creating a new Delta Live Tables pipeline?
A. A key-value pair configuration
B. The preferred DBU/hour cost
C. A path to cloud storage location for the written data
D. A location of a target database for the written data
E. At least one notebook library to be executed
E. At least one notebook library to be executed