General Flashcards

1
Q

In the UI, where can you go to grant SELECT access to a user

A

Data Explorer

2
Q

What is Delta Lake

A

An open-source storage format that builds on Parquet, adding capabilities for reliability, security, and performance

3
Q

How is the data organized in storage when managing a delta table

A

The data is stored as one or more Parquet files; the transaction log is stored as one or more JSON files (in the _delta_log directory); and each transaction creates a new data file and a new log file
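The layout described above can be sketched as a toy directory listing. This is an illustration only; the file names and log contents are simplified stand-ins, not the real Delta protocol:

```python
# Toy sketch (NOT the real Delta format) of a managed Delta table's storage
# after two transactions: each transaction adds a new Parquet data file and
# a new JSON entry in the _delta_log directory.
import json

table_dir = {
    # transaction 0: initial write
    "my_table/part-00000.snappy.parquet": b"<parquet bytes>",
    "my_table/_delta_log/00000000000000000000.json": json.dumps(
        {"add": {"path": "part-00000.snappy.parquet"}}
    ),
    # transaction 1: an update creates a NEW data file and a NEW log file
    "my_table/part-00001.snappy.parquet": b"<parquet bytes>",
    "my_table/_delta_log/00000000000000000001.json": json.dumps(
        {"add": {"path": "part-00001.snappy.parquet"},
         "remove": {"path": "part-00000.snappy.parquet"}}
    ),
}

log_files = [p for p in table_dir if "_delta_log" in p]
print(len(log_files))  # one log file per transaction: 2
```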

4
Q

What is the underlying technology that makes Auto Loader work

A

Structured Streaming

5
Q

When should you use Auto Loader vs COPY INTO

A
  • You want to load from a location that contains files on the order of millions or more; Auto Loader can discover files more efficiently
  • Auto Loader supports file notification
  • Data schema evolves frequently
6
Q

When loading with Auto Loader, how do you deal with an evolving schema

A

The mergeSchema option infers the schema across multiple files and merges each file's schema

7
Q

How can you use merge to deduplicate upon writing

A

MERGE INTO target USING source
ON target.key = source.key
WHEN NOT MATCHED THEN INSERT *

8
Q

How do you use merge to delete all target rows that have no matches in the source table?

A

MERGE INTO target USING source
ON target.key = source.key
WHEN NOT MATCHED BY SOURCE THEN DELETE

9
Q

Where can you include timeouts in jobs

A

In the task configuration (each task can have its own timeout)

10
Q

How can you automate an alert

A

By having it on a refresh schedule

11
Q

How do you grant read capability to a table

A

GRANT SELECT, USAGE ON TABLE customers TO `some@email.com`

12
Q

What type of constraint keeps the bad records and adds them to the target dataset

A

CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01')

13
Q

What type of constraint drops bad records

A

CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION DROP ROW

14
Q

What type of constraint fails when there is a bad record

A

CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION FAIL UPDATE

15
Q

When creating a db what is the default location of the database

A

dbfs:/user/hive/warehouse

16
Q

How do you create an external table

A

CREATE TABLE transactions (id INT, desc STRING) USING DELTA LOCATION '/mnt/delta/transactions'

17
Q

When you drop an external Delta table using the SQL command DROP TABLE table_name, how does it impact the metadata (delta log, history) and the data in storage?

A

Drops the table from the metastore, but keeps the metadata (delta log, history) and the data in storage

18
Q

What is the best way to query external csv files located on DBFS Storage to inspect the data using SQL?

A

SELECT * FROM csv.`dbfs:/location/csv_files/`

19
Q

How do you create a table from CSV with headers and a delimiter of | in SQL

A

CREATE TABLE customer (customerId INT, customerName STRING)
USING CSV
OPTIONS (header = "true", delimiter = "|")

20
Q

How can you do a distinct count that ignores null values

A

COUNT(DISTINCT column_name) (COUNT over a column already skips NULL values)

21
Q

What command combines two datasets while eliminating duplicates

A

UNION
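Both this card and the previous one describe standard SQL semantics, which can be checked quickly with SQLite (used here only because it ships with Python; Spark SQL behaves the same way for these cases):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE a (x INTEGER)")
cur.executemany("INSERT INTO a VALUES (?)", [(1,), (1,), (2,), (None,)])

# COUNT(DISTINCT x) ignores NULLs: only the values 1 and 2 are counted
distinct_count = cur.execute("SELECT COUNT(DISTINCT x) FROM a").fetchone()[0]
print(distinct_count)  # 2

# UNION removes duplicates across the combined result; UNION ALL keeps them
union_rows = cur.execute("SELECT x FROM a UNION SELECT x FROM a").fetchall()
union_all_rows = cur.execute("SELECT x FROM a UNION ALL SELECT x FROM a").fetchall()
print(len(union_rows), len(union_all_rows))  # 3 vs 8
```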

22
Q

What operations are not allowed on streaming data

A

Multiple (chained) streaming aggregations
LIMIT / taking the first N rows
DISTINCT deduplication
Sorting without a preceding aggregation

23
Q

What are some examples of operations allowed on streaming data

A

sum()
max()
count()

24
Q

What technique does structured streaming use to ensure recovery of failures during stream processing?

A

Structured Streaming uses checkpointing and write-ahead logs to record the offset range of data being processed during each trigger interval.

25
Q

What is an example of creating a generated data column

A

GENERATED ALWAYS AS (CAST(orderTime as DATE))
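CAST(orderTime AS DATE) truncates a timestamp to its calendar date; the same idea in plain Python (an illustration, not Spark code):

```python
from datetime import datetime

# A sample timestamp standing in for the orderTime column
order_time = datetime(2024, 6, 1, 14, 30, 5)

# Casting to DATE keeps only the calendar-date part
print(order_time.date())  # 2024-06-01
```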

26
Q

Which plane hosts jobs/pipelines and queries

A

Control Plane

27
Q

T/F Databricks lakehouse does not support Stored Procedures

A

True

28
Q

What keyword means you are creating an external table

A

LOCATION
CREATE TABLE table_name (column column_data_type, …) USING format LOCATION "dbfs:/"

29
Q

What is an example of a UDF

A

CREATE FUNCTION udf_convert(temp DOUBLE, measure STRING)
RETURNS DOUBLE
RETURN CASE WHEN measure = 'F' THEN (temp * 9/5) + 32
ELSE (temp - 32) * 5/9
END
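The CASE arithmetic can be sanity-checked with a plain-Python mirror of the function (udf_convert below is a hypothetical helper, not a Databricks API, and it assumes measure names the target unit: 'F' converts Celsius to Fahrenheit, anything else converts Fahrenheit to Celsius):

```python
def udf_convert(temp: float, measure: str) -> float:
    """Plain-Python mirror of the SQL UDF above (hypothetical helper)."""
    if measure == "F":
        # Celsius -> Fahrenheit
        return (temp * 9 / 5) + 32
    # Fahrenheit -> Celsius
    return (temp - 32) * 5 / 9

print(udf_convert(100, "F"))  # 212.0 (boiling point)
print(udf_convert(32, "C"))   # 0.0   (freezing point)
```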

30
Q

What is a command that will create a view on top of a delta stream

A

spark.readStream.format("delta").table("sales").createOrReplaceTempView("streaming_vw")

31
Q

How does structured streaming achieve end to end fault tolerance?

A

First, Structured Streaming uses checkpointing and write-ahead logs to record the offset range of data being processed during each trigger interval.

Next, the streaming sinks are designed to be idempotent
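A toy illustration (not Spark code) of why recorded offsets plus an idempotent sink give end-to-end fault tolerance: after a failure, the engine replays the in-flight batch from the checkpoint, and the sink ignores writes it has already committed, so no duplicates land in the output.

```python
class IdempotentSink:
    """Toy sink: ignores a (batch_id, rows) write it has already committed."""

    def __init__(self):
        self.committed = {}  # batch_id -> rows

    def write(self, batch_id: int, rows: list) -> None:
        if batch_id in self.committed:  # replay after failure: skip
            return
        self.committed[batch_id] = rows

sink = IdempotentSink()
sink.write(0, ["a", "b"])
sink.write(1, ["c"])
# Simulate recovery: the checkpoint cannot prove batch 1 finished,
# so the engine replays it -- the sink dedupes by batch id.
sink.write(1, ["c"])

total = sum(len(rows) for rows in sink.committed.values())
print(total)  # 3, not 4
```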

32
Q

What are the three types of expectations with Delta Live Tables

A

Expect (warn) - records that violate the expectation are added to the target dataset along with valid records (CONSTRAINT valid_timestamp EXPECT (timestamp > '2012-01-01'))
Drop - invalid records are dropped (CONSTRAINT valid_current_page EXPECT (current_page_id IS NOT NULL AND current_page_title IS NOT NULL) ON VIOLATION DROP ROW)
Fail - halt execution immediately when a record fails validation (CONSTRAINT valid_count EXPECT (count > 0) ON VIOLATION FAIL UPDATE)

33
Q

When building a DLT pipeline you have two options to create live tables; what is the main difference between CREATE STREAMING LIVE TABLE and CREATE LIVE TABLE?

A

CREATE STREAMING LIVE TABLE is used when working with streaming data sources and incremental data

34
Q

Where can you see the jobs history in the UI

A

In the Jobs UI, select the job you are interested in; under Runs you can see currently active runs and the last 60 days of historical runs

35
Q

What are the different ways you can schedule a job in Databricks workspace?

A

Cron, on demand

36
Q

On SQL Endpoint(SQL Warehouse) single cluster, how many queries can run at a time

A

10

37
Q

Which section in the UI can be used to manage permissions and grants to tables?

A

Data Explorer

38
Q

What are two examples of privileges not available in Unity Catalog

A

DELETE and UPDATE

39
Q

How do you transfer a table’s ownership

A

ALTER TABLE table_name OWNER TO `group`
