Databricks Flashcards

(153 cards)

1
Q

What are the 2 main components of Databricks?

A

The control plane: stores notebook commands and workspace configurations.
The data plane: hosts compute resources (clusters).

2
Q

What are the 3 different Databricks services?

A

Data science and engineering workspace
SQL
Machine learning

3
Q

What is a cluster?

A

A set of compute resources on which you run data engineering and data science workloads, either as a set of commands in a notebook or as a job.

4
Q

What are the 2 cluster types?

A

All-purpose clusters: used for interactive notebook work.
Job clusters: run automated jobs.

5
Q

How long does Databricks retain cluster configuration information?

A

30 days.
Pin an all-purpose cluster to retain its configuration beyond 30 days.

6
Q

What are the three cluster modes?

A

Standard clusters: process large amounts of data with Apache Spark.
Single Node clusters: jobs that use small amounts of data or non-distributed workloads such as single-node machine learning libraries.
High Concurrency clusters: groups of users who need to share resources or run ad-hoc jobs. Administrators usually create High Concurrency clusters, and Databricks recommends enabling autoscaling for them.

7
Q

To ensure that all data at rest is
encrypted for all storage types, including shuffle data that is stored temporarily on your cluster’s local
disks:

A

You can enable local disk encryption.

8
Q

To reduce cluster start time

A

you can attach a cluster to a predefined pool of idle instances, for the driver
and worker nodes

9
Q

Which magic command do you use to run a notebook from
another notebook?

A

%run ../Includes/Classroom-Setup-1.2

10
Q

What is Databricks utilities and how can you use it to list out
directories of files from Python cells?

A

display(dbutils.fs.ls("/databricks-datasets"))

11
Q

What function should you use when you have tabular data
returned by a Python cell?

A

display()

12
Q

What is Databricks Repos?

A

provides repository-level
integration with Git providers, allowing you to work in an environment that is backed by revision control
using Git

13
Q

What is the definition of a Delta Lake?

A

technology at the heart of the Databricks Lakehouse platform. It is an open source
technology that enables building a data lakehouse on top of existing storage systems.

14
Q

How does Delta Lake address the data lake pain points to
ensure reliable, ready-to-go data?

A

ACID Transactions – Delta Lake adds ACID transactions to data lakes. ACID stands for atomicity,
consistency, isolation, and durability

15
Q

Describe how Delta Lake brings ACID transactions to object
storage

A

By adding ACID transactions on top of object storage, Delta Lake solves these pain points:
Difficult to append data
Difficult to modify existing data
Jobs failing midway
Real-time operations are not easy
Costly to keep historical data versions

16
Q

Is Delta Lake the default for all tables created in Databricks?

A

Yes, Delta Lake is the default format for all tables created in Databricks.

17
Q

What data objects are in the Databricks Lakehouse?

A

Catalog, database, table, view, function

18
Q

What is a metastore?

A

Contains all of the metadata that defines data objects in the lakehouse. Options include:
Unity Catalog
Hive metastore
External metastore

19
Q

What is a catalog?

A

the highest abstraction (or coarsest grain) in the Databricks Lakehouse relational model.
catalog_name.database_name.table_name

20
Q

What is a Delta Lake table?

A

stores data as a directory of files on
cloud object storage and registers table metadata to the metastore within a catalog and schema

21
Q

What is the syntax to create a Delta Table?

A

CREATE TABLE students
(id INT, name STRING, value DOUBLE);
CREATE TABLE IF NOT EXISTS students
(id INT, name STRING, value DOUBLE)

22
Q

What is the syntax to insert data?

A

INSERT INTO students
VALUES
(4, "Ted", 4.7),
(5, "Tiffany", 5.5),
(6, "Vini", 6.3)

23
Q

What is the syntax to update particular records of a table?

A

UPDATE students
SET value = value + 1
WHERE name LIKE "T%"

24
Q

What is the syntax to delete particular records of a table?

A

DELETE FROM students
WHERE value > 6

25
What is the syntax for merge and what are the benefits of using merge?
MERGE INTO table_a a
USING table_b b
ON a.col_name = b.col_name
WHEN MATCHED AND b.col = X THEN UPDATE SET *
WHEN MATCHED AND a.col = Y THEN DELETE
WHEN NOT MATCHED AND b.col = Z THEN INSERT *
26
Deduplication case (python syntax)
(deltaTable.alias("t")
 .merge(historicalUpdates.alias("s"), "t.loan_id = s.loan_id")
 .whenNotMatchedInsertAll()
 .execute())
27
What is the syntax to delete a table?
DROP TABLE students
28
What is Hive?
Hive is a data warehouse system built on top of Apache Hadoop, an open-source framework used to efficiently store and process large datasets.
29
What are the two commands to see metadata about a table?
DESCRIBE EXTENDED students
DESCRIBE DETAIL students
30
What is the syntax to display the Delta Lake files?
%python display(dbutils.fs.ls(f"{DA.paths.user_db}/students")) DESCRIBE DETAIL students
31
Describe the Delta Lake files, their format and directory structure
%python display(dbutils.fs.ls(f"{DA.paths.user_db}/students/_delta_log"))
32
What does the query engine do using the transaction logs when we query a Delta Lake table?
It resolves all the files that are valid in the current version and ignores all other data files. You can look at a particular transaction log and see whether records were inserted / updated / deleted.
%python
display(spark.sql(f"SELECT * FROM json.`{DA.paths.user_db}/students/_delta_log/00000000000000000007.json`"))
33
What commands do you use to compact small files and index tables?
The OPTIMIZE command combines files toward an optimal size:
OPTIMIZE events
OPTIMIZE events WHERE date >= '2017-01-01'
OPTIMIZE events WHERE date >= current_timestamp() - INTERVAL 1 day ZORDER BY (eventType)
34
How do you review a history of table transactions?
Because all changes to the Delta Lake table are stored in the transaction log: DESCRIBE HISTORY students
35
How do you query and roll back to a previous table version?
SELECT * FROM students VERSION AS OF 3
36
How to roll back after deletion?
RESTORE TABLE students TO VERSION AS OF 8
37
What command do you use to clean up stale data files and what are the consequences of using this command?
VACUUM removes data files that are no longer referenced by the table and are older than the retention threshold. Consequence: you lose the ability to time travel back to versions older than the retention period.
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
SET spark.databricks.delta.vacuum.logging.enabled = true;
VACUUM students RETAIN 0 HOURS DRY RUN
38
What is the advantage of using Delta cache?
It accelerates repeated reads by caching copies of remote data on the cluster nodes' local storage, so queries avoid re-fetching data from cloud object storage.
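Whether the cache is on by default depends on the worker instance type; a minimal sketch of enabling it explicitly from a Python cell, using the documented config key:
spark.conf.set("spark.databricks.io.cache.enabled", "true")  # enable the disk (Delta) cache for this cluster session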
39
What is the syntax to create a database with default location (no location specified)?
CREATE DATABASE IF NOT EXISTS db_name_default_location;
40
What is the syntax to create a database with specified location?
CREATE DATABASE IF NOT EXISTS db_name_custom_location LOCATION 'path/db_name_custom_location.db';
41
How do you get metadata information of a database? Where are the databases located (difference between default vs custom location)
DESCRIBE DATABASE EXTENDED db_name;
With the default location, the database directory sits under dbfs:/user/hive/warehouse/ with a .db suffix; with a custom location, it sits at the path given in the LOCATION clause.
42
What's the best practice when creating databases?
declare a location for a given database
43
What is the syntax for creating a table in a database with default location and inserting data?
USE db_name_default_location;
CREATE OR REPLACE TABLE managed_table_in_db_with_default_location (width INT, length INT, height INT);
INSERT INTO managed_table_in_db_with_default_location VALUES (3, 2, 1);
SELECT * FROM managed_table_in_db_with_default_location;
44
What is the syntax for a table in a database with custom location?
USE db_name_custom_location;
CREATE OR REPLACE TABLE managed_table_in_db_with_custom_location (width INT, length INT, height INT);
INSERT INTO managed_table_in_db_with_custom_location VALUES (3, 2, 1);
SELECT * FROM managed_table_in_db_with_custom_location;
45
Where are managed tables located in a database?
in the LOCATION of the database it is registered to
46
How can you find their location?
DESCRIBE EXTENDED managed_table_in_db;
47
What is the syntax to create an external table?
CREATE OR REPLACE TABLE external_table
LOCATION 'path/external_table'
AS SELECT * FROM temp_delays;
SELECT * FROM external_table;
Python:
df.write.option("path", "/path/to/empty/directory").saveAsTable("table_name")
48
What happens when you drop tables ?(Managed tables)
The table's directory, along with its log and data files, is deleted; only the database directory remains.
DROP TABLE managed_table_in_db_with_default_location;
DROP TABLE managed_table_in_db_with_custom_location;
49
What happens when you drop an unmanaged (external) table?
The table definition no longer exists in the metastore, but the underlying data remains intact.
DROP TABLE external_table;
We still have access to the underlying data files:
%python
tbl_path = f"{DA.paths.working_dir}/external_table"
files = dbutils.fs.ls(tbl_path)
display(files)
50
What is the command to drop the database and its underlying tables and views?
Default location: DROP DATABASE db_name_default_location CASCADE;
Custom location: DROP DATABASE db_name_custom_location CASCADE;
51
How can you show a list of tables and views?
SHOW TABLES;
52
What is the difference between Views, Temp Views & Global Temp Views?
View: persisted across multiple sessions, just like a table; does not process or write any data.
Temp view: not persisted across sessions; scoped to the current Spark session.
Global temp view: scoped to the cluster level and registered to the separate global_temp database. Use views with appropriate table ACLs instead of global temporary views.
(A PySpark sketch of the two temp-view scopes follows below.)
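A minimal PySpark sketch contrasting the two temp-view scopes (the DataFrame and view names are hypothetical):
df = spark.range(3)  # hypothetical example DataFrame

# Session-scoped: visible only in the current Spark session
df.createOrReplaceTempView("my_temp_view")
spark.sql("SELECT * FROM my_temp_view").show()

# Cluster-scoped: registered to the global_temp database, visible across sessions on the cluster
df.createOrReplaceGlobalTempView("my_global_temp_view")
spark.sql("SELECT * FROM global_temp.my_global_temp_view").show()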
53
What is the syntax for view?
CREATE VIEW view_delays_abq_lax AS SELECT * FROM external_table WHERE origin = 'ABQ' AND destination = 'LAX'; SELECT * FROM view_delays_abq_lax;
54
What is the syntax for temp view?
CREATE TEMPORARY VIEW temp_view_delays_gt_120 AS SELECT * FROM external_table WHERE delay > 120 ORDER BY delay ASC; SELECT * FROM temp_view_delays_gt_120;
55
What is syntax for global temp view?
CREATE GLOBAL TEMPORARY VIEW global_temp_view_dist_gt_1000 AS
SELECT * FROM external_table WHERE distance > 1000;
SELECT * FROM global_temp.global_temp_view_dist_gt_1000;
To list global temp views:
SHOW TABLES IN global_temp;
56
What is the syntax to select from global temp views?
SELECT * FROM global_temp.name_of_the_global_temp_view;
57
What is the syntax for defining a CTE in a subquery?
SELECT max(total_delay) AS `Longest Delay (in minutes)`
FROM (
  WITH delayed_flights(total_delay) AS (
    SELECT delay FROM external_table)
  SELECT * FROM delayed_flights
);
58
What is the syntax for defining a CTE in a subquery expression?
SELECT (
  WITH distinct_origins AS (
    SELECT DISTINCT origin FROM external_table)
  SELECT count(origin) AS `Number of Distinct Origins`
  FROM distinct_origins
) AS `Number of Different Origin Airports`;
59
What is the syntax for defining a CTE in a CREATE VIEW statement?
CREATE OR REPLACE VIEW BOS_LAX AS
WITH origin_destination(origin_airport, destination_airport) AS (
  SELECT origin, destination FROM external_table)
SELECT * FROM origin_destination
WHERE origin_airport = 'BOS' AND destination_airport = 'LAX';
SELECT count(origin_airport) AS `Number of Delayed Flights from BOS to LAX` FROM BOS_LAX;
60
How do you query data from a single file?
SELECT * FROM file_format.`/path/to/file` SELECT * FROM json.`${da.paths.datasets}/raw/events-kafka/001.json`
61
How do you query a directory of files?
SELECT * FROM json.`${da.paths.datasets}/raw/events-kafka`
62
How do you create references to files?
CREATE OR REPLACE TEMP VIEW events_temp_view AS SELECT * FROM json.`${da.paths.datasets}/raw/events-kafka/`; SELECT * FROM events_temp_view
63
How do you extract text files as raw strings?
SELECT * FROM text.`${da.paths.datasets}/raw/events-kafka/`
64
How do you extract the raw bytes and metadata of a file?
SELECT * FROM binaryFile.`${da.paths.datasets}/raw/events-kafka/`
65
Explain why executing a direct query against CSV files rarely returns the desired result.
Because the header row can be extracted as a table row, all columns can be loaded as a single column, and nested or delimiter-containing data can be truncated.
SELECT * FROM csv.`${da.paths.working_dir}/sales-csv`
66
Describe the syntax required to extract data from most formats against external sources.
CREATE TABLE table_identifier (col_name1 col_type1, ...)
USING data_source
OPTIONS (key1 = "val1", key2 = "val2", ...)
LOCATION = path
67
Using Spark SQL DDL to create a table against an external CSV source
CREATE TABLE sales_csv
(order_id LONG, email STRING, transactions_timestamp LONG, total_item_quantity INTEGER, purchase_revenue_in_usd DOUBLE, unique_items INTEGER, items STRING)
USING CSV
OPTIONS (header = "true", delimiter = "|")
LOCATION "${da.paths.working_dir}/sales-csv"
68
What happens to the data, metadata and options during table declaration for these external sources?
All the metadata and options passed during table declaration will be persisted to the metastore.
69
What is the syntax to show all of the metadata associated with the table definition?
DESCRIBE EXTENDED sales_csv
70
How can you manually refresh the cache of your data?
REFRESH TABLE sales_csv
71
What is the syntax to extract data from SQL Databases?
Generic form:
CREATE TABLE USING JDBC
OPTIONS (
  url = "jdbc:{databaseServerType}://{jdbcHostname}:{jdbcPort}",
  dbtable = "{jdbcDatabase}.table",
  user = "{jdbcUsername}",
  password = "{jdbcPassword}"
)
Example:
DROP TABLE IF EXISTS users_jdbc;
CREATE TABLE users_jdbc
USING JDBC
OPTIONS (
  url = "jdbc:sqlite:/${da.username}_ecommerce.db",
  dbtable = "users"
)
SELECT * FROM users_jdbc
72
Explain the two basic approaches that Spark uses to interact with external SQL databases and their limits.
1. Move the entire source table(s) to Databricks and then execute the logic on the currently active cluster. Limitation: significant overhead from network transfer latency.
2. Push the query down to the external SQL database and only transfer the results back to Databricks. Limitation: the query logic executes in source systems that are not optimized for big data queries.
(A PySpark sketch of both approaches follows below.)
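A PySpark sketch of the two approaches, assuming a hypothetical PostgreSQL source and credentials; dbtable pulls the whole table so filtering happens in Spark, while the query option pushes the work down so only results cross the network:
jdbc_url = "jdbc:postgresql://db-host:5432/shop"  # hypothetical connection
props = {"user": "reader", "password": "secret"}

# Approach 1: transfer the whole table, then apply logic on the cluster
full_df = (spark.read.format("jdbc")
           .option("url", jdbc_url)
           .option("dbtable", "public.users")
           .options(**props)
           .load())
ca_users = full_df.where("state = 'CA'")

# Approach 2: push the query down to the source database
pushed_df = (spark.read.format("jdbc")
             .option("url", jdbc_url)
             .option("query", "SELECT * FROM public.users WHERE state = 'CA'")
             .options(**props)
             .load())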
73
What is a CTAS statement and what is the syntax?
CTAS (CREATE TABLE AS SELECT) creates and populates a table using data retrieved from an input query, inferring the schema automatically:
CREATE OR REPLACE TABLE sales AS
SELECT * FROM parquet.`${da.paths.datasets}/raw/sales-historical/`;
DESCRIBE EXTENDED sales;
74
Do CTAS support manual schema declaration?
No. CTAS statements infer schema from the query results and do not support manual schema declaration or additional file options.
75
What is the syntax to overcome the limitation when trying to ingest data from CSV files?
CREATE OR REPLACE TEMP VIEW sales_tmp_vw
(order_id LONG, email STRING, transactions_timestamp LONG, total_item_quantity INTEGER, purchase_revenue_in_usd DOUBLE, unique_items INTEGER, items STRING)
USING CSV
OPTIONS (
  path = "${da.paths.datasets}/raw/sales-csv",
  header = "true",
  delimiter = "|"
);
CREATE TABLE sales_delta AS SELECT * FROM sales_tmp_vw;
SELECT * FROM sales_delta
76
How do you filter and rename columns from existing tables during table creation?
CREATE OR REPLACE TABLE purchases AS
SELECT order_id AS id, transaction_timestamp, purchase_revenue_in_usd AS price
FROM sales;
SELECT * FROM purchases
77
What is a generated column and how do you declare schemas with generated columns?
A generated column's values are automatically computed from other columns in the Delta table, based on a user-specified expression:
CREATE OR REPLACE TABLE purchase_dates (
  id STRING,
  transaction_timestamp STRING,
  price STRING,
  date DATE GENERATED ALWAYS AS (
    cast(cast(transaction_timestamp/1e6 AS TIMESTAMP) AS DATE))
    COMMENT "generated based on `transactions_timestamp` column")
78
Which built-in Spark SQL commands are useful for file ingestion (for the select clause)?
current_timestamp() records the timestamp when the logic is executed; input_file_name() records the source data file for each record in the table
79
What are the three options when creating tables?
A COMMENT is added to allow for easier discovery of table contents.
A LOCATION is specified, which results in an external (rather than managed) table.
The table is PARTITIONED BY a date column.
CREATE OR REPLACE TABLE users_pii
COMMENT "Contains PII"
LOCATION "${da.paths.working_dir}/tmp/users_pii"
PARTITIONED BY (first_touch_date)
AS
  SELECT *,
    cast(cast(user_first_touch_timestamp/1e6 AS TIMESTAMP) AS DATE) first_touch_date,
    current_timestamp() updated,
    input_file_name() source_file
  FROM parquet.`${da.paths.datasets}/raw/users-historical/`;
SELECT * FROM users_pii;
80
As a best practice, should you default to partitioned tables for most use cases when working with Delta Lake?
you should default to nonpartitioned tables for most use cases when working with Delta Lake
81
What are the two options to copy Delta Lake tables and what are the use cases?
DEEP CLONE: fully copies both data and metadata from the source table to a target; the copy occurs incrementally, but copying large datasets can take a while.
CREATE OR REPLACE TABLE purchases_clone DEEP CLONE purchases
SHALLOW CLONE: copies only the Delta transaction logs, so no data moves; useful for quickly testing changes without risking modifications to the source table.
CREATE OR REPLACE TABLE purchases_shallow_clone SHALLOW CLONE purchases
82
What are the multiple benefits of overwriting tables instead of deleting and recreating tables?
Much faster, because it does not need to list the directory recursively or delete any files.
The old version of the table still exists and can easily be retrieved using Time Travel.
It is atomic: concurrent queries can still read the table while you are overwriting it.
Due to ACID guarantees, if the overwrite fails the table remains in its previous state.
83
What are the two easy methods to accomplish complete overwrites?
CREATE OR REPLACE TABLE: completely redefines the contents of the target table (and creates it if needed).
CREATE OR REPLACE TABLE events AS
SELECT * FROM parquet.`${da.paths.datasets}/raw/events-historical`
INSERT OVERWRITE: can only overwrite an existing table and fails if we try to change the schema.
INSERT OVERWRITE sales
SELECT * FROM parquet.`${da.paths.datasets}/raw/sales-historical/`
84
What is the syntax to atomically append new rows to an existing Delta table? Is the command idempotent?
INSERT INTO sales
SELECT * FROM parquet.`${da.paths.datasets}/raw/sales-30m`
It is not idempotent: re-executing it appends the same records again, creating duplicates.
85
What is the syntax for the the MERGE SQL operation and the benefits of using merge?
MERGE INTO target a
USING source b
ON {merge_condition}
WHEN MATCHED THEN {matched_action}
WHEN NOT MATCHED THEN {not_matched_action}
Benefits: 1. updates, inserts, and deletes are completed as a single transaction; 2. multiple conditions can be added in addition to matching fields; 3. it provides extensive options for implementing custom logic.
86
How can you use merge for deduplication?
MERGE INTO events a
USING events_update b
ON a.user_id = b.user_id AND a.event_timestamp = b.event_timestamp
WHEN NOT MATCHED AND b.traffic_source = 'email' THEN INSERT *
87
What is the syntax to have an idempotent option to incrementally ingest data from external systems?
COPY INTO sales
FROM "${da.paths.datasets}/raw/sales-30m"
FILEFORMAT = PARQUET
88
How is COPY INTO different than Auto Loader?
COPY INTO is aimed at a SQL analyst doing batch execution; Auto Loader requires Structured Streaming.
89
What is the syntax to count null values?
SELECT * FROM table_name WHERE col_name IS NULL
SELECT count_if(col_name IS NULL) AS new_col_name FROM table_name
90
What is the syntax to count for distinct values in a table for a specific column?
SELECT COUNT(DISTINCT(col_1, col_2)) FROM table_name WHERE col_1 IS NOT NULL
91
What is the syntax to deal with binary-encoded JSON values in a human readable format?
CREATE OR REPLACE TEMP VIEW events_strings AS SELECT string(key), string(value) FROM events_raw; SELECT * FROM events_strings
92
What is the syntax to parse JSON objects into struct types with Spark SQL?
CREATE OR REPLACE TEMP VIEW parsed_events AS SELECT from_json(value, schema_of_json('{insert_example_schema_here}')) AS json FROM events_strings; SELECT * FROM parsed_events
93
Once a JSON string is unpacked to a struct type, what is the syntax to flatten the fields into columns?
CREATE OR REPLACE TEMP VIEW new_events_final AS SELECT json.* FROM parsed_events; SELECT * FROM new_events_final
94
What is the syntax for exploding arrays of structs?
SELECT user_id, event_timestamp, event_name, explode(items) AS item FROM events
The explode function lets us put each element in an array on its own row.
95
What is the syntax to collect arrays?
SELECT user_id,
  collect_set(event_name) AS event_history,
  array_distinct(flatten(collect_set(items.item_id))) AS cart_history
FROM events
GROUP BY user_id
collect_set collects unique values for a field, including fields within arrays. flatten combines multiple arrays into a single array. array_distinct removes duplicate elements from an array.
96
What is the syntax for a semi-join?
SELECT columns FROM table_1 WHERE EXISTS ( SELECT values FROM table_2 WHERE table_2.column = table_1.column);
97
What is the syntax for FILTER ?
FILTER(items, i -> i.item_id LIKE "%K") AS king_items
FILTER: the name of the higher-order function.
items: the name of our input array.
i: the name of the iterator variable. You choose this name and then use it in the lambda function; it iterates over the array, cycling each value into the function one at a time.
->: indicates the start of the function.
i.item_id LIKE "%K": the function. Each value is checked to see whether it ends with the capital letter K; if it does, it is filtered into the new column, king_items.
98
What is the syntax for EXISTS?
EXISTS(categories, c -> c = "Company Blog") AS companyFlag
Say we want to flag all blog posts with "Company Blog" in the categories field: the EXISTS function marks which entries include that category.
99
What is the syntax for TRANSFORM ?
TRANSFORM(king_items, k -> CAST(k.item_revenue_in_usd * 100 AS INT)) AS item_revenues
We extract the item's revenue value, multiply it by 100, and cast the result to an integer.
100
What is the syntax for REDUCE ?
REDUCE(co2_level, 0, (c, acc) -> c + acc, acc ->(acc div size(co2_level)))
101
What is the syntax to define and register SQL UDFs? How do you then apply that function to the data?
CREATE OR REPLACE FUNCTION function_name(param TYPE)
RETURNS return_type
RETURN expression
You then apply it to data like any built-in function: SELECT function_name(col) FROM table_name (see the sketch below).
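A concrete sketch, run from a Python cell via spark.sql (the function, column, and table names are hypothetical):
# Define a persistent SQL UDF
spark.sql("""
CREATE OR REPLACE FUNCTION yelling(text STRING)
RETURNS STRING
RETURN concat(upper(text), '!!!')
""")

# Apply it to data like any built-in function
display(spark.sql("SELECT yelling(food) FROM foods"))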
102
What are SQL UDFs governed by?
by the same Access Control Lists (ACLs) as databases, tables, or views
103
What permissions must a user have on the function to use a SQL UDF? Describe their scoping.
The user must have USAGE and SELECT permissions on the function to use it.
104
What is the benefit of using SQL UDFs?
allow a handful of users to define the complex logic needed for common reporting and analytic queries
105
What is the syntax to turn SQL queries into Python strings?
print(""" SELECT * FROM table_name """)
106
What is the syntax to execute SQL from a Python cell?
spark.sql("SELECT * FROM table_name")
107
What function in python do you call to render a query the way it would appear in a normal SQL notebook?
display(spark.sql("SELECT * FROM table_name"))
108
What is the syntax to define a function in Python?
def return_new_string(string_arg): return "The string passed to this function was " + string_arg
109
What is the syntax for f-strings?
f"I can substitute {my_string} here"
110
How can f-strings be used for SQL queries?
table_name = "users" filter_clause = "WHERE state = 'CA'" query = f""" SELECT * FROM {table_name} {filter_clause} """ print(query)
111
What is the syntax for if / else clauses wrapped in a function?
def foods_i_like(food):
    if food == "beans":
        print(f"I love {food}")
    elif food == "potatoes":
        print(f"My favorite vegetable is {food}")
    elif food != "beef":
        print(f"Do you have any good recipes for {food}?")
    else:
        print(f"I don't eat {food}")
112
What are assert statements and what is the syntax?
assert checks that a condition is true and raises an AssertionError if it is not. Example asserting that the number 2 is an integer: assert type(2) == int
113
Why do we use try / except statements and what is the syntax?
try / except provides robust error handling; when a nonnumeric string is passed, an informative message is printed out.
def try_int(num_string):
    try:
        int(num_string)
        result = f"{num_string} is a number."
    except:
        result = f"{num_string} is not a number!"
    print(result)
114
What is the downside of using try / except statements?
No exception is raised when something goes wrong; implementing logic that suppresses errors in this way can cause code to fail silently.
115
What is the syntax for try / except statements where you return an informative error message?
def three_times(number):
    try:
        return int(number) * 3
    except ValueError as e:
        print(f"You passed the string variable '{number}'.\n")
        print(f"Try passing an integer instead.")
        return None
116
How do you apply these concepts to execute SQL logic on Databricks, for example to avoid a SQL injection attack?
Wrapping spark.sql() in a simple function lets us execute arbitrary SQL queries, optionally display the results, and always return the resulting DataFrame:
def simple_query_function(query, preview=True):
    query_result = spark.sql(query)
    if preview:
        display(query_result)
    return query_result

result = simple_query_function(query)
117
What is the purpose of Auto Loader?
Provides a way for data teams to load raw data from cloud object stores at lower costs and latencies; allows you to continuously ingest data into Delta Lake; use it as a general best practice when ingesting data from cloud object storage.
118
What are the 4 arguments using Auto Loader with automatic schema inference and evolution?
data_source: Auto Loader detects new files as they arrive in this location and queues them for ingestion; passed to the .load() method.
source_format: while the format for all Auto Loader queries is cloudFiles, the format of the source data should always be specified for the cloudFiles.format option.
table_name: Spark Structured Streaming supports writing directly to Delta Lake tables by passing a table name as a string to the .table() method; you can either append to an existing table or create a new one.
checkpoint_directory: passed to the checkpointLocation and cloudFiles.schemaLocation options. Checkpoints keep track of streaming progress, while the schema location tracks updates to the fields in the source dataset.
(A wiring sketch follows below.)
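A sketch wiring the four arguments together in PySpark (the wrapper function and the placeholder paths in the example call are illustrative, not fixed course code):
def autoload_to_table(data_source, source_format, table_name, checkpoint_directory):
    query = (spark.readStream
             .format("cloudFiles")
             .option("cloudFiles.format", source_format)
             .option("cloudFiles.schemaLocation", checkpoint_directory)
             .load(data_source)
             .writeStream
             .option("checkpointLocation", checkpoint_directory)
             .option("mergeSchema", "true")
             .table(table_name))
    return query

# Example call with placeholder paths
query = autoload_to_table("/source/landing", "json", "target_table", "/checkpoints/target_table")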
119
What is the benefit of Auto Loader compared to structured streaming?
With cloudFiles.schemaLocation, Auto Loader will infer the schema, whereas traditional Structured Streaming will not; Auto Loader scans the first gigabytes of data and infers the schema for you.
120
What keyword indicates that you're using Auto Loader rather than a traditional stream for ingesting?
cloudFiles
121
What is the _rescued_data column?
to capture any data that might be malformed and not fit into the table otherwise
122
What is the data type encoded by Auto Loader for fields in a text-based file format?
STRING type
123
Historically, what were the two inefficient ways to land new data?
Reprocess all records in a source directory to calculate current results. Implement custom logic to identify new data that's arrived since the last time a table was updated.
124
How do you track the ingestion progress?
%sql DESCRIBE HISTORY target_table
125
What is Spark Structured Streaming?
Extends the functionality of Apache Spark to allow for simplified configuration and bookkeeping when processing incremental datasets. Allows users to interact with ever-growing data sources as if they were just a static table of records, by treating infinite data as a table.
126
What is the syntax to read a stream?
(spark.readStream .table("bronze") .createOrReplaceTempView("streaming_tmp_vw"))
127
How can you transform streaming data?
%sql SELECT device_id, count(device_id) AS total_recordings FROM streaming_tmp_vw GROUP BY device_id
128
What are the 3 most important settings when writing a stream to Delta Lake tables?
1. Checkpointing, via the checkpointLocation option.
2. Output mode: .outputMode("append") (the default, appends new rows) or .outputMode("complete") (recalculates and rewrites the full result each trigger).
3. Trigger intervals, which control when the next set of data is processed.
(A sketch combining all three follows below.)
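A minimal sketch showing all three settings on a write, assuming an existing streaming DataFrame streaming_df and a placeholder checkpoint path:
(streaming_df.writeStream
 .option("checkpointLocation", "/checkpoints/silver")  # 1. checkpointing
 .outputMode("append")                                 # 2. output mode
 .trigger(processingTime="10 seconds")                 # 3. trigger interval
 .table("silver_table"))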
129
What is the syntax to load data from a streaming temp view back to a DataFrame, and then query the table that we wrote out to?
(spark.table("device_counts_tmp_vw") .writeStream .option("checkpointLocation", f"{DA.paths.checkpoints}/silver") .outputMode("complete") .trigger(availableNow=True) .table("device_counts") .awaitTermination() # This optional method blocks execution of the next cell until the incremental batch write has succeeded)
130
what is a bronze table?
contains raw data ingested from various sources (JSON files, RDBMS data, IoT data, to name a few examples). Bronze makes sure that data is appended incrementally and grows over time. We're interested in retaining the full unprocessed history of each dataset in an efficient storage format which will provide us with the ability to recreate any state of a given data system.
131
What is a silver table?
provides a more refined view of our data. We can join fields from various bronze tables to enrich streaming records, or update account statuses based on recent activity. The silver layer might contain many pipelines and silver tables. Various different views for a given dataset. The goal is that this silver layer becomes that validated single source of truth for our data. This is the dream that the data lake could have been: Correct schema, deduplicated records, but no aggregations for our business users yet.
132
What is a gold table?
highly refined and aggregated data. Data that has been transformed to knowledge. Updates to these tables will be completed as part of regularly scheduled production workloads, which helps control costs and allows SLAs for data freshness to be established. Gold tables provide business level aggregates often used for reporting and dashboarding. This would include aggregations such as daily active website users, weekly sales per store, or gross revenue per quarter by department. The end outputs are actionable insights, dashboards and reports of business metrics. Gold tables will often be stored in a separate storage container to help avoid cloud limits on data requests. In general, because aggregations, joins and filtering are being handled before data is written to the golden layer, query performance on data in the gold tables should be exceptional.
133
Bronze: what additional metadata could you add for enhanced discoverability?
Examples: source file names that are being ingested, recording of the time where that data was originally processed.
134
how you can configure a read on a raw JSON source using Auto Loader with schema inference?
(spark.readStream .format("cloudFiles") .option("cloudFiles.format", "json") .option("cloudFiles.schemaHints", "time DOUBLE") .option("cloudFiles.schemaLocation", f"{DA.paths.checkpoints}/bronze") .load(DA.paths.data_landing_location) .createOrReplaceTempView("recordings_raw_temp"))
135
What happens with the ACID guarantees that Delta Lake brings to your data when you choose to merge this data with other data sources?
Delta Lake ensures that only fully successful commits are reflected in your tables. If you choose to merge these data with other data sources, be aware of how those sources version data and what sort of consistency guarantees they have.
136
Describe what happens at the silver level, when we enrich our data.
We join the recordings data with the PII to add patient names, parse the recording time into the 'yyyy-MM-dd HH:mm:ss' format so it is human-readable, and perform a quality check by excluding heart rates <= 0.
(spark.readStream
 .table("bronze")
 .createOrReplaceTempView("bronze_tmp"))
%sql
CREATE OR REPLACE TEMPORARY VIEW recordings_w_pii AS (
  SELECT device_id, a.mrn, b.name,
    cast(from_unixtime(time, 'yyyy-MM-dd HH:mm:ss') AS timestamp) time,
    heartrate
  FROM bronze_tmp a
  INNER JOIN pii b ON a.mrn = b.mrn
  WHERE heartrate > 0)
(spark.table("recordings_w_pii")
 .writeStream
 .format("delta")
 .option("checkpointLocation", f"{DA.paths.checkpoints}/recordings_enriched")
 .outputMode("append")
 .table("recordings_enriched"))
%sql
SELECT COUNT(*) FROM recordings_enriched
137
Describe what happens at the Gold level.
We read a stream of data from recordings_enriched and write another stream to create an aggregate gold table of daily averages for each patient.
(spark.readStream
 .table("recordings_enriched")
 .createOrReplaceTempView("recordings_enriched_temp"))
%sql
CREATE OR REPLACE TEMP VIEW patient_avg AS (
  SELECT mrn, name, mean(heartrate) avg_heartrate, date_trunc("DD", time) date
  FROM recordings_enriched_temp
  GROUP BY mrn, name, date_trunc("DD", time))
138
What is .trigger(availableNow=True) and when is it used?
It lets us keep the strengths of Structured Streaming while triggering the job once to process all available data in micro-batches.
(spark.table("patient_avg")
 .writeStream
 .format("delta")
 .outputMode("complete")
 .option("checkpointLocation", f"{DA.paths.checkpoints}/daily_avg")
 .trigger(availableNow=True)  # you want the benefits of streaming but as a single batch
 .table("daily_patient_avg"))
139
What are the important considerations for complete output mode with Delta?
When using complete output mode, we rewrite the entire state of our table each time our logic runs. While this is ideal for calculating aggregates, we cannot read a stream from this directory, as Structured Streaming assumes data is only being appended.
140
What is the syntax for declaring a bronze layer table using Auto Loader and DLT?
CREATE OR REFRESH STREAMING LIVE TABLE sales_orders_raw
COMMENT "The raw sales orders, ingested from /databricks-datasets."
AS SELECT * FROM cloud_files("/databricks-datasets/retail-org/sales_orders/", "json", map("cloudFiles.inferColumnTypes", "true"))
141
When scheduling a Job, what are the two options to configure the cluster where the task runs?
New Job Cluster Existing All-Purpose Clusters
142
How can you view runs for a Job and the details of the runs?
To view details of the run, including the start time, duration, and status, hover over the bar in the Job Runs row. To view details of each task, including the start time, duration, cluster, and status, hover over the cell for that task
143
What are the recommendations for cluster configuration for specific job types?
Make sure to use shared job clusters, and choose the correct cluster type for your job.
144
What is the data explorer, how do you access it and what does it allow you to do?
Accessed from the Data tab in the sidebar, it allows users and admins to navigate databases, tables, and views; explore data schema, metadata, and history; and set and modify permissions of relational entities.
145
What is Unity catalog?
Databricks' unified governance solution for data and AI. It centralizes access control, auditing, and lineage: it shows where data came from, who created it and when, how it has been modified over time, how it's being used, and more.
146
What are the four key functional areas for data governance?
Data Access Control: who has access to what?
Data Access Audit: understand who accessed what and when, and what they did (the compliance aspect).
Data Lineage: which data objects feed downstream data objects; if you make a change to an upstream table, how does that affect downstream objects, and vice versa.
Data Discovery: being able to find your data and see what actually exists.
147
What are the default permissions for users and admins in DBSQL?
Admins: can view all objects registered to the metastore and control permissions for other users in the workspace.
Users: have no permissions on anything registered to the metastore other than objects they create in DBSQL; before users can create any databases, tables, or views, they must have create and usage privileges specifically granted to them.
Permissions are set using groups configured by an administrator; Access Control Lists (ACLs) are used to control permissions.
148
What are the 6 objects for which Databricks allows you to configure permissions?
CATALOG: controls access to the entire data catalog.
DATABASE: controls access to a database.
TABLE: controls access to a managed or external table.
VIEW: controls access to SQL views.
FUNCTION: controls access to a named function.
ANY FILE: controls access to the underlying filesystem. Users granted access to ANY FILE can bypass the restrictions put on the catalog, databases, tables, and views by reading from the filesystem directly.
149
For each object owner, describe what they can grant privileges for.
Databricks administrator: all objects in the catalog and the underlying filesystem.
Catalog owner: all objects in the catalog.
Database owner: all objects in the database.
Table owner: only the table (similar options apply for views and functions).
150
Describe all the privileges that can be configured in Data Explorer.
ALL PRIVILEGES: gives all privileges (is translated into all the privileges below).
SELECT: gives read access to an object.
MODIFY: gives the ability to add, delete, and modify data to or from an object.
READ_METADATA: gives the ability to view an object and its metadata.
USAGE: does not give any abilities on its own, but is an additional requirement to perform any action on a database object.
CREATE: gives the ability to create an object (for example, a table in a database).
(A grant sketch follows below.)
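A sketch of granting object-level privileges from a Python cell (the database and group names are hypothetical):
# Give the analysts group the ability to use a database and read its tables and metadata
spark.sql("GRANT USAGE ON DATABASE retail_db TO `analysts`")
spark.sql("GRANT SELECT, READ_METADATA ON DATABASE retail_db TO `analysts`")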
151
What is the command to generate a new database and grant permissions to all users in the DBSQL query editor?
To enable the ability to create databases and tables in the default catalog using Databricks SQL:
GRANT usage, create ON CATALOG hive_metastore TO users
To confirm this has run successfully:
SHOW GRANT ON CATALOG hive_metastore
152
In pyspark dataframe API, what should we use to load data from from temp view back to dataframe?
spark.table
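A one-line sketch (the view name is hypothetical):
df = spark.table("device_counts_tmp_vw")  # returns a DataFrame (streaming if the view was built on a stream)
df.printSchema()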