Databricks Flashcards

(153 cards)

1
Q

What are the 2 main components of Databricks?

A

The control plane: stores notebook commands and workspace configurations.
The data plane: hosts compute resources (clusters).

2
Q

What are the 3 different Databricks services?

A

Data science and engineering workspace
SQL
Machine learning

3
Q

What is a cluster?

A

A set of compute resources on which you run data engineering and data science workloads, either as a set of commands in a notebook or as a job.

4
Q

What are the 2 cluster types?

A

All-purpose clusters: used for interactive notebook work.
Job clusters: run automated jobs.

5
Q

How long does Databricks retain cluster configuration information?

A

30 days.
Pin an all-purpose cluster to retain its configuration beyond 30 days.

6
Q

What are the three cluster modes?

A

Standard clusters: process large amounts of data with Apache Spark.
Single Node clusters: jobs that use small amounts of data or non-distributed workloads such as single-node machine learning libraries.
High Concurrency clusters: groups of users who need to share resources or run ad-hoc jobs. Administrators usually create High Concurrency clusters, and Databricks recommends enabling autoscaling for them.

7
Q

To ensure that all data at rest is
encrypted for all storage types, including shuffle data that is stored temporarily on your cluster’s local
disks:

A

You can enable local disk encryption.

8
Q

To reduce cluster start time

A

you can attach a cluster to a predefined pool of idle instances, for the driver
and worker nodes

9
Q

Which magic command do you use to run a notebook from
another notebook?

A

%run ../Includes/Classroom-Setup-1.2

10
Q

What is Databricks utilities and how can you use it to list out
directories of files from Python cells?

A

display(dbutils.fs.ls("/databricks-datasets"))

11
Q

What function should you use when you have tabular data
returned by a Python cell?

A

display()

12
Q

What is Databricks Repos?

A

provides repository-level
integration with Git providers, allowing you to work in an environment that is backed by revision control
using Git

13
Q

What is the definition of a Delta Lake?

A

technology at the heart of the Databricks Lakehouse platform. It is an open source
technology that enables building a data lakehouse on top of existing storage systems.

14
Q

How does Delta Lake address the data lake pain points to
ensure reliable, ready-to-go data?

A

ACID Transactions – Delta Lake adds ACID transactions to data lakes. ACID stands for atomicity,
consistency, isolation, and durability

15
Q

Describe how Delta Lake brings ACID transactions to object
storage

A

By adding ACID transactions on top of object storage, Delta Lake solves these pain points:
Difficult to append data
Difficult to modify existing data
Jobs failing midway
Real-time operations are not easy
Costly to keep historical data versions

16
Q

Is Delta Lake the default for all tables created in Databricks?

A

Yes, Delta Lake is the default format for all tables created in Databricks.

17
Q

What data objects are in the Databricks Lakehouse?

A

Catalog, database, table, view, function

18
Q

What is a metastore?

A

Contains all of the metadata that defines data objects in the lakehouse. Options include:
Unity Catalog
Hive metastore
External metastore

19
Q

What is a catalog?

A

the highest abstraction (or coarsest grain) in the Databricks Lakehouse relational model.
catalog_name.database_name.table_name

20
Q

What is a Delta Lake table?

A

stores data as a directory of files on
cloud object storage and registers table metadata to the metastore within a catalog and schema

21
Q

What is the syntax to create a Delta Table?

A

CREATE TABLE students
(id INT, name STRING, value DOUBLE);
CREATE TABLE IF NOT EXISTS students
(id INT, name STRING, value DOUBLE)

22
Q

What is the syntax to insert data?

A

INSERT INTO students
VALUES
(4, "Ted", 4.7),
(5, "Tiffany", 5.5),
(6, "Vini", 6.3)

23
Q

What is the syntax to update particular records of a table?

A

UPDATE students
SET value = value + 1
WHERE name LIKE "T%"

24
Q

What is the syntax to delete particular records of a table?

A

DELETE FROM students
WHERE value > 6

25
What is the syntax for merge and what are the benefits of using merge?
MERGE INTO table_a a
USING table_b b
ON a.col_name = b.col_name
WHEN MATCHED AND b.col = X THEN UPDATE SET *
WHEN MATCHED AND a.col = Y THEN DELETE
WHEN NOT MATCHED AND b.col = Z THEN INSERT *
26
Deduplication case (python syntax)
(deltaTable.alias("t")
 .merge(historicalUpdates.alias("s"), "t.loan_id = s.loan_id")
 .whenNotMatchedInsertAll()
 .execute())
27
What is the syntax to delete a table?
DROP TABLE students
28
What is Hive?
Hive is a data warehouse system built on top of Apache Hadoop, an open-source framework used to efficiently store and process large datasets.
29
What are the two commands to see metadata about a table?
DESCRIBE EXTENDED students
DESCRIBE DETAIL students
30
What is the syntax to display the Delta Lake files?
%python display(dbutils.fs.ls(f"{DA.paths.user_db}/students")) DESCRIBE DETAIL students
31
Describe the Delta Lake files, their format and directory structure
%python display(dbutils.fs.ls(f"{DA.paths.user_db}/students/_delta_log"))
32
What does the query engine do using the transaction logs when we query a Delta Lake table?
It resolves all the files that are valid in the current version and ignores all other data files. You can look at a particular transaction log and see whether records were inserted / updated / deleted.
%python
display(spark.sql(f"SELECT * FROM json.`{DA.paths.user_db}/students/_delta_log/00000000000000000007.json`"))
33
What commands do you use to compact small files and index tables?
The OPTIMIZE command combines files toward an optimal size:
OPTIMIZE events
OPTIMIZE events WHERE date >= '2017-01-01'
OPTIMIZE events WHERE date >= current_timestamp() - INTERVAL 1 day ZORDER BY (eventType)
34
How do you review a history of table transactions?
Because all changes to the Delta Lake table are stored in the transaction log: DESCRIBE HISTORY students
35
How do you query and roll back to a previous table version?
SELECT * FROM students VERSION AS OF 3
36
How to roll back after deletion?
RESTORE TABLE students TO VERSION AS OF 8
37
What command do you use to clean up stale data files and what are the consequences of using this command?
VACUUM removes data files that are no longer referenced by the table and are older than the retention threshold. Consequence: you lose the ability to time travel back to versions older than the retention period.
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
SET spark.databricks.delta.vacuum.logging.enabled = true;
VACUUM students RETAIN 0 HOURS DRY RUN
38
What is the advantage of using Delta cache?
It accelerates repeated reads by caching copies of remote data on the cluster nodes' local storage, so queries avoid re-fetching data from cloud object storage.
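Whether the cache is on by default depends on the worker instance type; a minimal sketch of enabling it explicitly from a Python cell, using the documented config key:
spark.conf.set("spark.databricks.io.cache.enabled", "true")  # enable the disk (Delta) cache for this cluster session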
39
What is the syntax to create a database with default location (no location specified)?
CREATE DATABASE IF NOT EXISTS db_name_default_location;
40
What is the syntax to create a database with specified location?
CREATE DATABASE IF NOT EXISTS db_name_custom_location LOCATION 'path/db_name_custom_location.db';
41
How do you get metadata information of a database? Where are the databases located (difference between default vs custom location)
DESCRIBE DATABASE EXTENDED db_name;
With the default location, the database directory sits under dbfs:/user/hive/warehouse/ with a .db suffix; with a custom location, it sits at the path given in the LOCATION clause.
42
What's the best practice when creating databases?
declare a location for a given database
43
What is the syntax for creating a table in a database with default location and inserting data?
USE db_name_default_location;
CREATE OR REPLACE TABLE managed_table_in_db_with_default_location (width INT, length INT, height INT);
INSERT INTO managed_table_in_db_with_default_location VALUES (3, 2, 1);
SELECT * FROM managed_table_in_db_with_default_location;
44
What is the syntax for a table in a database with custom location?
USE db_name_custom_location;
CREATE OR REPLACE TABLE managed_table_in_db_with_custom_location (width INT, length INT, height INT);
INSERT INTO managed_table_in_db_with_custom_location VALUES (3, 2, 1);
SELECT * FROM managed_table_in_db_with_custom_location;
45
Where are managed tables located in a database?
in the LOCATION of the database it is registered to
46
How can you find their location?
DESCRIBE EXTENDED managed_table_in_db;
47
What is the syntax to create an external table?
CREATE OR REPLACE TABLE external_table
LOCATION 'path/external_table'
AS SELECT * FROM temp_delays;
SELECT * FROM external_table;
Python:
df.write.option("path", "/path/to/empty/directory").saveAsTable("table_name")
48
What happens when you drop tables ?(Managed tables)
The table's directory, along with its log and data files, is deleted; only the database directory remains.
DROP TABLE managed_table_in_db_with_default_location;
DROP TABLE managed_table_in_db_with_custom_location;
49
What happens when you drop an unmanaged (external) table?
The table definition no longer exists in the metastore, but the underlying data remains intact.
DROP TABLE external_table;
We still have access to the underlying data files:
%python
tbl_path = f"{DA.paths.working_dir}/external_table"
files = dbutils.fs.ls(tbl_path)
display(files)
50
What is the command to drop the database and its underlying tables and views?
Default location: DROP DATABASE db_name_default_location CASCADE;
Custom location: DROP DATABASE db_name_custom_location CASCADE;
51
How can you show a list of tables and views?
SHOW TABLES;
52
What is the difference between Views, Temp Views & Global Temp Views?
View: persisted across multiple sessions, just like a table; does not process or write any data.
Temp view: not persisted across sessions; scoped to the current Spark session.
Global temp view: scoped to the cluster level and registered to the separate global_temp database. Use views with appropriate table ACLs instead of global temporary views.
(A PySpark sketch of the two temp-view scopes follows below.)
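A minimal PySpark sketch contrasting the two temp-view scopes (the DataFrame and view names are hypothetical):
df = spark.range(3)  # hypothetical example DataFrame

# Session-scoped: visible only in the current Spark session
df.createOrReplaceTempView("my_temp_view")
spark.sql("SELECT * FROM my_temp_view").show()

# Cluster-scoped: registered to the global_temp database, visible across sessions on the cluster
df.createOrReplaceGlobalTempView("my_global_temp_view")
spark.sql("SELECT * FROM global_temp.my_global_temp_view").show()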
53
What is the syntax for view?
CREATE VIEW view_delays_abq_lax AS SELECT * FROM external_table WHERE origin = 'ABQ' AND destination = 'LAX'; SELECT * FROM view_delays_abq_lax;
54
What is the syntax for temp view?
CREATE TEMPORARY VIEW temp_view_delays_gt_120 AS SELECT * FROM external_table WHERE delay > 120 ORDER BY delay ASC; SELECT * FROM temp_view_delays_gt_120;
55
What is syntax for global temp view?
CREATE GLOBAL TEMPORARY VIEW global_temp_view_dist_gt_1000 AS
SELECT * FROM external_table WHERE distance > 1000;
SELECT * FROM global_temp.global_temp_view_dist_gt_1000;
To list global temp views:
SHOW TABLES IN global_temp;
56
What is the syntax to select from global temp views?
SELECT * FROM global_temp.name_of_the_global_temp_view;
57
What is the syntax for defining a CTE in a subquery?
SELECT max(total_delay) AS `Longest Delay (in minutes)`
FROM (
  WITH delayed_flights(total_delay) AS (
    SELECT delay FROM external_table)
  SELECT * FROM delayed_flights
);
58
What is the syntax for defining a CTE in a subquery expression?
SELECT (
  WITH distinct_origins AS (
    SELECT DISTINCT origin FROM external_table)
  SELECT count(origin) AS `Number of Distinct Origins`
  FROM distinct_origins
) AS `Number of Different Origin Airports`;
59
What is the syntax for defining a CTE in a CREATE VIEW statement?
CREATE OR REPLACE VIEW BOS_LAX AS
WITH origin_destination(origin_airport, destination_airport) AS (
  SELECT origin, destination FROM external_table)
SELECT * FROM origin_destination
WHERE origin_airport = 'BOS' AND destination_airport = 'LAX';
SELECT count(origin_airport) AS `Number of Delayed Flights from BOS to LAX` FROM BOS_LAX;
60
How do you query data from a single file?
SELECT * FROM file_format.`/path/to/file` SELECT * FROM json.`${da.paths.datasets}/raw/events-kafka/001.json`
61
How do you query a directory of files?
SELECT * FROM json.`${da.paths.datasets}/raw/events-kafka`
62
How do you create references to files?
CREATE OR REPLACE TEMP VIEW events_temp_view AS SELECT * FROM json.`${da.paths.datasets}/raw/events-kafka/`; SELECT * FROM events_temp_view
63
How do you extract text files as raw strings?
SELECT * FROM text.`${da.paths.datasets}/raw/events-kafka/`
64
How do you extract the raw bytes and metadata of a file?
SELECT * FROM binaryFile.`${da.paths.datasets}/raw/events-kafka/`
65
Explain why executing a direct query against CSV files rarely returns the desired result.
Because the header row can be extracted as a table row, all columns can be loaded as a single column, and nested or delimiter-containing data can be truncated.
SELECT * FROM csv.`${da.paths.working_dir}/sales-csv`
66
Describe the syntax required to extract data from most formats against external sources.
CREATE TABLE table_identifier (col_name1 col_type1, ...)
USING data_source
OPTIONS (key1 = "val1", key2 = "val2", ...)
LOCATION = path
67
Using Spark SQL DDL to create a table against an external CSV source
CREATE TABLE sales_csv
(order_id LONG, email STRING, transactions_timestamp LONG, total_item_quantity INTEGER, purchase_revenue_in_usd DOUBLE, unique_items INTEGER, items STRING)
USING CSV
OPTIONS (header = "true", delimiter = "|")
LOCATION "${da.paths.working_dir}/sales-csv"
68
What happens to the data, metadata and options during table declaration for these external sources?
All the metadata and options passed during table declaration will be persisted to the metastore.
69
What is the syntax to show all of the metadata associated with the table definition?
DESCRIBE EXTENDED sales_csv
70
How can you manually refresh the cache of your data?
REFRESH TABLE sales_csv
71
What is the syntax to extract data from SQL Databases?
Generic form:
CREATE TABLE USING JDBC
OPTIONS (
  url = "jdbc:{databaseServerType}://{jdbcHostname}:{jdbcPort}",
  dbtable = "{jdbcDatabase}.table",
  user = "{jdbcUsername}",
  password = "{jdbcPassword}"
)
Example:
DROP TABLE IF EXISTS users_jdbc;
CREATE TABLE users_jdbc
USING JDBC
OPTIONS (
  url = "jdbc:sqlite:/${da.username}_ecommerce.db",
  dbtable = "users"
)
SELECT * FROM users_jdbc
72
Explain the two basic approaches that Spark uses to interact with external SQL databases and their limits.
1. Move the entire source table(s) to Databricks and then execute the logic on the currently active cluster. Limitation: significant overhead from network transfer latency.
2. Push the query down to the external SQL database and only transfer the results back to Databricks. Limitation: the query logic executes in source systems that are not optimized for big data queries.
(A PySpark sketch of both approaches follows below.)
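A PySpark sketch of the two approaches, assuming a hypothetical PostgreSQL source and credentials; dbtable pulls the whole table so filtering happens in Spark, while the query option pushes the work down so only results cross the network:
jdbc_url = "jdbc:postgresql://db-host:5432/shop"  # hypothetical connection
props = {"user": "reader", "password": "secret"}

# Approach 1: transfer the whole table, then apply logic on the cluster
full_df = (spark.read.format("jdbc")
           .option("url", jdbc_url)
           .option("dbtable", "public.users")
           .options(**props)
           .load())
ca_users = full_df.where("state = 'CA'")

# Approach 2: push the query down to the source database
pushed_df = (spark.read.format("jdbc")
             .option("url", jdbc_url)
             .option("query", "SELECT * FROM public.users WHERE state = 'CA'")
             .options(**props)
             .load())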
73
What is a CTAS statement and what is the syntax?
CTAS (CREATE TABLE AS SELECT) creates and populates a table using data retrieved from an input query, inferring the schema automatically:
CREATE OR REPLACE TABLE sales AS
SELECT * FROM parquet.`${da.paths.datasets}/raw/sales-historical/`;
DESCRIBE EXTENDED sales;
74
Do CTAS support manual schema declaration?
No. CTAS statements infer schema from the query results and do not support manual schema declaration or additional file options.
75
What is the syntax to overcome the limitation when trying to ingest data from CSV files?
CREATE OR REPLACE TEMP VIEW sales_tmp_vw
(order_id LONG, email STRING, transactions_timestamp LONG, total_item_quantity INTEGER, purchase_revenue_in_usd DOUBLE, unique_items INTEGER, items STRING)
USING CSV
OPTIONS (
  path = "${da.paths.datasets}/raw/sales-csv",
  header = "true",
  delimiter = "|"
);
CREATE TABLE sales_delta AS SELECT * FROM sales_tmp_vw;
SELECT * FROM sales_delta
76
How do you filter and rename columns from existing tables during table creation?
CREATE OR REPLACE TABLE purchases AS
SELECT order_id AS id, transaction_timestamp, purchase_revenue_in_usd AS price
FROM sales;
SELECT * FROM purchases
77
What is a generated column and how do you declare schemas with generated columns?
A generated column's values are automatically computed from other columns in the Delta table, based on a user-specified expression:
CREATE OR REPLACE TABLE purchase_dates (
  id STRING,
  transaction_timestamp STRING,
  price STRING,
  date DATE GENERATED ALWAYS AS (
    cast(cast(transaction_timestamp/1e6 AS TIMESTAMP) AS DATE))
    COMMENT "generated based on `transactions_timestamp` column")
78
Which built-in Spark SQL commands are useful for file ingestion (for the select clause)?
current_timestamp() records the timestamp when the logic is executed; input_file_name() records the source data file for each record in the table
79
What are the three options when creating tables?
A COMMENT is added to allow for easier discovery of table contents.
A LOCATION is specified, which results in an external (rather than managed) table.
The table is PARTITIONED BY a date column.
CREATE OR REPLACE TABLE users_pii
COMMENT "Contains PII"
LOCATION "${da.paths.working_dir}/tmp/users_pii"
PARTITIONED BY (first_touch_date)
AS
  SELECT *,
    cast(cast(user_first_touch_timestamp/1e6 AS TIMESTAMP) AS DATE) first_touch_date,
    current_timestamp() updated,
    input_file_name() source_file
  FROM parquet.`${da.paths.datasets}/raw/users-historical/`;
SELECT * FROM users_pii;
80
As a best practice, should you default to partitioned tables for most use cases when working with Delta Lake?
you should default to nonpartitioned tables for most use cases when working with Delta Lake
81
What are the two options to copy Delta Lake tables and what are the use cases?
DEEP CLONE: fully copies both data and metadata from the source table to a target; the copy occurs incrementally, but copying large datasets can take a while.
CREATE OR REPLACE TABLE purchases_clone DEEP CLONE purchases
SHALLOW CLONE: copies only the Delta transaction logs, so no data moves; useful for quickly testing changes without risking modifications to the source table.
CREATE OR REPLACE TABLE purchases_shallow_clone SHALLOW CLONE purchases
82
What are the multiple benefits of overwriting tables instead of deleting and recreating tables?
Much faster, because it does not need to list the directory recursively or delete any files.
The old version of the table still exists and can easily be retrieved using Time Travel.
It is atomic: concurrent queries can still read the table while you are overwriting it.
Due to ACID guarantees, if the overwrite fails the table remains in its previous state.
83
What are the two easy methods to accomplish complete overwrites?
CREATE OR REPLACE TABLE: completely redefines the contents of the target table (and creates it if needed).
CREATE OR REPLACE TABLE events AS
SELECT * FROM parquet.`${da.paths.datasets}/raw/events-historical`
INSERT OVERWRITE: can only overwrite an existing table and fails if we try to change the schema.
INSERT OVERWRITE sales
SELECT * FROM parquet.`${da.paths.datasets}/raw/sales-historical/`
84
What is the syntax to atomically append new rows to an existing Delta table? Is the command idempotent?
INSERT INTO sales
SELECT * FROM parquet.`${da.paths.datasets}/raw/sales-30m`
It is not idempotent: re-executing it appends the same records again, creating duplicates.
85
What is the syntax for the the MERGE SQL operation and the benefits of using merge?
MERGE INTO target a
USING source b
ON {merge_condition}
WHEN MATCHED THEN {matched_action}
WHEN NOT MATCHED THEN {not_matched_action}
Benefits: 1. updates, inserts, and deletes are completed as a single transaction; 2. multiple conditions can be added in addition to matching fields; 3. it provides extensive options for implementing custom logic.
86
How can you use merge for deduplication?
MERGE INTO events a
USING events_update b
ON a.user_id = b.user_id AND a.event_timestamp = b.event_timestamp
WHEN NOT MATCHED AND b.traffic_source = 'email' THEN INSERT *
87
What is the syntax to have an idempotent option to incrementally ingest data from external systems?
COPY INTO sales
FROM "${da.paths.datasets}/raw/sales-30m"
FILEFORMAT = PARQUET
88
How is COPY INTO different than Auto Loader?
COPY INTO is aimed at a SQL analyst doing batch execution; Auto Loader requires Structured Streaming.
89
What is the syntax to count null values?
SELECT * FROM table_name WHERE col_name IS NULL
SELECT count_if(col_name IS NULL) AS new_col_name FROM table_name
90
What is the syntax to count for distinct values in a table for a specific column?
SELECT COUNT(DISTINCT(col_1, col_2)) FROM table_name WHERE col_1 IS NOT NULL
91
What is the syntax to deal with binary-encoded JSON values in a human readable format?
CREATE OR REPLACE TEMP VIEW events_strings AS SELECT string(key), string(value) FROM events_raw; SELECT * FROM events_strings
92
What is the syntax to parse JSON objects into struct types with Spark SQL?
CREATE OR REPLACE TEMP VIEW parsed_events AS SELECT from_json(value, schema_of_json('{insert_example_schema_here}')) AS json FROM events_strings; SELECT * FROM parsed_events
93
Once a JSON string is unpacked to a struct type, what is the syntax to flatten the fields into columns?
CREATE OR REPLACE TEMP VIEW new_events_final AS SELECT json.* FROM parsed_events; SELECT * FROM new_events_final
94
What is the syntax for exploding arrays of structs?
SELECT user_id, event_timestamp, event_name, explode(items) AS item FROM events
The explode function lets us put each element in an array on its own row.
95
What is the syntax to collect arrays?
SELECT user_id,
  collect_set(event_name) AS event_history,
  array_distinct(flatten(collect_set(items.item_id))) AS cart_history
FROM events
GROUP BY user_id
collect_set collects unique values for a field, including fields within arrays. flatten combines multiple arrays into a single array. array_distinct removes duplicate elements from an array.
96
What is the syntax for a semi-join?
SELECT columns FROM table_1 WHERE EXISTS ( SELECT values FROM table_2 WHERE table_2.column = table_1.column);
97
What is the syntax for FILTER ?
FILTER(items, i -> i.item_id LIKE "%K") AS king_items
FILTER: the name of the higher-order function.
items: the name of our input array.
i: the name of the iterator variable. You choose this name and then use it in the lambda function; it iterates over the array, cycling each value into the function one at a time.
->: indicates the start of the function.
i.item_id LIKE "%K": the function. Each value is checked to see whether it ends with the capital letter K; if it does, it is filtered into the new column, king_items.
98
What is the syntax for EXISTS?
EXISTS(categories, c -> c = "Company Blog") AS companyFlag
Say we want to flag all blog posts with "Company Blog" in the categories field: the EXISTS function marks which entries include that category.
99
What is the syntax for TRANSFORM ?
TRANSFORM(king_items, k -> CAST(k.item_revenue_in_usd * 100 AS INT)) AS item_revenues
We extract the item's revenue value, multiply it by 100, and cast the result to an integer.
100
What is the syntax for REDUCE ?
REDUCE(co2_level, 0, (c, acc) -> c + acc, acc ->(acc div size(co2_level)))
101
What is the syntax to define and register SQL UDFs? How do you then apply that function to the data?
CREATE OR REPLACE FUNCTION function_name(param TYPE)
RETURNS return_type
RETURN expression
You then apply it to data like any built-in function: SELECT function_name(col) FROM table_name (see the sketch below).
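A concrete sketch, run from a Python cell via spark.sql (the function, column, and table names are hypothetical):
# Define a persistent SQL UDF
spark.sql("""
CREATE OR REPLACE FUNCTION yelling(text STRING)
RETURNS STRING
RETURN concat(upper(text), '!!!')
""")

# Apply it to data like any built-in function
display(spark.sql("SELECT yelling(food) FROM foods"))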
102
What are SQL UDFs governed by?
by the same Access Control Lists (ACLs) as databases, tables, or views
103
What permissions must a user have on the function to use a SQL UDF? Describe their scoping.
The user must have USAGE and SELECT permissions on the function to use it.
104
What is the benefit of using SQL UDFs?
allow a handful of users to define the complex logic needed for common reporting and analytic queries
105
What is the syntax to turn SQL queries into Python strings?
print(""" SELECT * FROM table_name """)
106
What is the syntax to execute SQL from a Python cell?
spark.sql("SELECT * FROM table_name")
107
What function in python do you call to render a query the way it would appear in a normal SQL notebook?
display(spark.sql("SELECT * FROM table_name"))
108
What is the syntax to define a function in Python?
def return_new_string(string_arg): return "The string passed to this function was " + string_arg
109
What is the syntax for f-strings?
f"I can substitute {my_string} here"
110
How can f-strings be used for SQL queries?
table_name = "users" filter_clause = "WHERE state = 'CA'" query = f""" SELECT * FROM {table_name} {filter_clause} """ print(query)
111
What is the syntax for if / else clauses wrapped in a function?
def foods_i_like(food):
    if food == "beans":
        print(f"I love {food}")
    elif food == "potatoes":
        print(f"My favorite vegetable is {food}")
    elif food != "beef":
        print(f"Do you have any good recipes for {food}?")
    else:
        print(f"I don't eat {food}")
112
What are assert statements and what is the syntax?
assert checks that a condition is true and raises an AssertionError if it is not. Example asserting that the number 2 is an integer: assert type(2) == int
113
Why do we use try / except statements and what is the syntax?
try / except provides robust error handling; when a nonnumeric string is passed, an informative message is printed out.
def try_int(num_string):
    try:
        int(num_string)
        result = f"{num_string} is a number."
    except:
        result = f"{num_string} is not a number!"
    print(result)
114
What is the downside of using try / except statements?
No exception is raised when something goes wrong; implementing logic that suppresses errors in this way can cause code to fail silently.
115
What is the syntax for try / except statements where you return an informative error message?
def three_times(number):
    try:
        return int(number) * 3
    except ValueError as e:
        print(f"You passed the string variable '{number}'.\n")
        print(f"Try passing an integer instead.")
        return None
116
How do you apply these concepts to execute SQL logic on Databricks, for example to avoid a SQL injection attack?
Wrapping spark.sql() in a simple function lets us execute arbitrary SQL queries, optionally display the results, and always return the resulting DataFrame:
def simple_query_function(query, preview=True):
    query_result = spark.sql(query)
    if preview:
        display(query_result)
    return query_result

result = simple_query_function(query)
117
What is the purpose of Auto Loader?
Provides a way for data teams to load raw data from cloud object stores at lower costs and latencies; allows you to continuously ingest data into Delta Lake; use it as a general best practice when ingesting data from cloud object storage.
118
What are the 4 arguments using Auto Loader with automatic schema inference and evolution?
data_source: Auto Loader detects new files as they arrive in this location and queues them for ingestion; passed to the .load() method.
source_format: while the format for all Auto Loader queries is cloudFiles, the format of the source data should always be specified for the cloudFiles.format option.
table_name: Spark Structured Streaming supports writing directly to Delta Lake tables by passing a table name as a string to the .table() method; you can either append to an existing table or create a new one.
checkpoint_directory: passed to the checkpointLocation and cloudFiles.schemaLocation options. Checkpoints keep track of streaming progress, while the schema location tracks updates to the fields in the source dataset.
(A wiring sketch follows below.)
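A sketch wiring the four arguments together in PySpark (the wrapper function and the placeholder paths in the example call are illustrative, not fixed course code):
def autoload_to_table(data_source, source_format, table_name, checkpoint_directory):
    query = (spark.readStream
             .format("cloudFiles")
             .option("cloudFiles.format", source_format)
             .option("cloudFiles.schemaLocation", checkpoint_directory)
             .load(data_source)
             .writeStream
             .option("checkpointLocation", checkpoint_directory)
             .option("mergeSchema", "true")
             .table(table_name))
    return query

# Example call with placeholder paths
query = autoload_to_table("/source/landing", "json", "target_table", "/checkpoints/target_table")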
119
What is the benefit of Auto Loader compared to structured streaming?
With cloudFiles.schemaLocation, Auto Loader will infer the schema, whereas traditional Structured Streaming will not; Auto Loader scans the first gigabytes of data and infers the schema for you.
120
What keyword indicates that you're using Auto Loader rather than a traditional stream for ingesting?
cloudFiles
121
What is the _rescued_data column?
to capture any data that might be malformed and not fit into the table otherwise
122
What is the data type encoded by Auto Loader for fields in a text-based file format?
STRING type
123
Historically, what were the two inefficient ways to land new data?
Reprocess all records in a source directory to calculate current results. Implement custom logic to identify new data that's arrived since the last time a table was updated.
124
How do you track the ingestion progress?
%sql DESCRIBE HISTORY target_table
125
What is Spark Structured Streaming?
Extends the functionality of Apache Spark to allow for simplified configuration and bookkeeping when processing incremental datasets. Allows users to interact with ever-growing data sources as if they were just a static table of records, by treating infinite data as a table.
126
What is the syntax to read a stream?
(spark.readStream .table("bronze") .createOrReplaceTempView("streaming_tmp_vw"))
127
How can you transform streaming data?
%sql SELECT device_id, count(device_id) AS total_recordings FROM streaming_tmp_vw GROUP BY device_id
128
What are the 3 most important settings when writing a stream to Delta Lake tables?
1. Checkpointing, via the checkpointLocation option.
2. Output mode: .outputMode("append") (the default, appends new rows) or .outputMode("complete") (recalculates and rewrites the full result each trigger).
3. Trigger intervals, which control when the next set of data is processed.
(A sketch combining all three follows below.)
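A minimal sketch showing all three settings on a write, assuming an existing streaming DataFrame streaming_df and a placeholder checkpoint path:
(streaming_df.writeStream
 .option("checkpointLocation", "/checkpoints/silver")  # 1. checkpointing
 .outputMode("append")                                 # 2. output mode
 .trigger(processingTime="10 seconds")                 # 3. trigger interval
 .table("silver_table"))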
129
What is the syntax to load data from a streaming temp view back to a DataFrame, and then query the table that we wrote out to?
(spark.table("device_counts_tmp_vw") .writeStream .option("checkpointLocation", f"{DA.paths.checkpoints}/silver") .outputMode("complete") .trigger(availableNow=True) .table("device_counts") .awaitTermination() # This optional method blocks execution of the next cell until the incremental batch write has succeeded)
130
what is a bronze table?
contains raw data ingested from various sources (JSON files, RDBMS data, IoT data, to name a few examples). Bronze makes sure that data is appended incrementally and grows over time. We're interested in retaining the full unprocessed history of each dataset in an efficient storage format which will provide us with the ability to recreate any state of a given data system.
131
What is a silver table?
provides a more refined view of our data. We can join fields from various bronze tables to enrich streaming records, or update account statuses based on recent activity. The silver layer might contain many pipelines and silver tables. Various different views for a given dataset. The goal is that this silver layer becomes that validated single source of truth for our data. This is the dream that the data lake could have been: Correct schema, deduplicated records, but no aggregations for our business users yet.
132
What is a gold table?
highly refined and aggregated data. Data that has been transformed to knowledge. Updates to these tables will be completed as part of regularly scheduled production workloads, which helps control costs and allows SLAs for data freshness to be established. Gold tables provide business level aggregates often used for reporting and dashboarding. This would include aggregations such as daily active website users, weekly sales per store, or gross revenue per quarter by department. The end outputs are actionable insights, dashboards and reports of business metrics. Gold tables will often be stored in a separate storage container to help avoid cloud limits on data requests. In general, because aggregations, joins and filtering are being handled before data is written to the golden layer, query performance on data in the gold tables should be exceptional.
133
Bronze: what additional metadata could you add for enhanced discoverability?
Examples: source file names that are being ingested, recording of the time where that data was originally processed.
134
how you can configure a read on a raw JSON source using Auto Loader with schema inference?
(spark.readStream .format("cloudFiles") .option("cloudFiles.format", "json") .option("cloudFiles.schemaHints", "time DOUBLE") .option("cloudFiles.schemaLocation", f"{DA.paths.checkpoints}/bronze") .load(DA.paths.data_landing_location) .createOrReplaceTempView("recordings_raw_temp"))
135
What happens with the ACID guarantees that Delta Lake brings to your data when you choose to merge this data with other data sources?
Delta Lake ensures that only fully successful commits are reflected in your tables. If you choose to merge these data with other data sources, be aware of how those sources version data and what sort of consistency guarantees they have.
136
Describe what happens at the silver level, when we enrich our data.
We join the recordings data with the PII to add patient names, parse the recording time into the 'yyyy-MM-dd HH:mm:ss' format so it is human-readable, and perform a quality check by excluding heart rates <= 0.
(spark.readStream
 .table("bronze")
 .createOrReplaceTempView("bronze_tmp"))
%sql
CREATE OR REPLACE TEMPORARY VIEW recordings_w_pii AS (
  SELECT device_id, a.mrn, b.name,
    cast(from_unixtime(time, 'yyyy-MM-dd HH:mm:ss') AS timestamp) time,
    heartrate
  FROM bronze_tmp a
  INNER JOIN pii b ON a.mrn = b.mrn
  WHERE heartrate > 0)
(spark.table("recordings_w_pii")
 .writeStream
 .format("delta")
 .option("checkpointLocation", f"{DA.paths.checkpoints}/recordings_enriched")
 .outputMode("append")
 .table("recordings_enriched"))
%sql
SELECT COUNT(*) FROM recordings_enriched
137
Describe what happens at the Gold level.
We read a stream of data from recordings_enriched and write another stream to create an aggregate gold table of daily averages for each patient.
(spark.readStream
 .table("recordings_enriched")
 .createOrReplaceTempView("recordings_enriched_temp"))
%sql
CREATE OR REPLACE TEMP VIEW patient_avg AS (
  SELECT mrn, name, mean(heartrate) avg_heartrate, date_trunc("DD", time) date
  FROM recordings_enriched_temp
  GROUP BY mrn, name, date_trunc("DD", time))
138
What is .trigger(availableNow=True) and when is it used?
It lets us keep the strengths of Structured Streaming while triggering the job once to process all available data in micro-batches.
(spark.table("patient_avg")
 .writeStream
 .format("delta")
 .outputMode("complete")
 .option("checkpointLocation", f"{DA.paths.checkpoints}/daily_avg")
 .trigger(availableNow=True)  # you want the benefits of streaming but as a single batch
 .table("daily_patient_avg"))
139
What are the important considerations for complete output mode with Delta?
When using complete output mode, we rewrite the entire state of our table each time our logic runs. While this is ideal for calculating aggregates, we cannot read a stream from this directory, as Structured Streaming assumes data is only being appended.
140
What is the syntax for declaring a bronze layer table using Auto Loader and DLT?
CREATE OR REFRESH STREAMING LIVE TABLE sales_orders_raw
COMMENT "The raw sales orders, ingested from /databricks-datasets."
AS SELECT * FROM cloud_files("/databricks-datasets/retail-org/sales_orders/", "json", map("cloudFiles.inferColumnTypes", "true"))
141
When scheduling a Job, what are the two options to configure the cluster where the task runs?
New Job Cluster Existing All-Purpose Clusters
142
How can you view runs for a Job and the details of the runs?
To view details of the run, including the start time, duration, and status, hover over the bar in the Job Runs row. To view details of each task, including the start time, duration, cluster, and status, hover over the cell for that task
143
What are the recommendations for cluster configuration for specific job types?
Make sure to use shared job clusters, and choose the correct cluster type for your job.
144
What is the data explorer, how do you access it and what does it allow you to do?
Accessed from the Data tab in the sidebar, it allows users and admins to navigate databases, tables, and views; explore data schema, metadata, and history; and set and modify permissions of relational entities.
145
What is Unity catalog?
Databricks' unified governance solution for data and AI. It centralizes access control, auditing, and lineage: it shows where data came from, who created it and when, how it has been modified over time, how it's being used, and more.
146
What are the four key functional areas for data governance?
Data Access Control: who has access to what?
Data Access Audit: understand who accessed what and when, and what they did (the compliance aspect).
Data Lineage: which data objects feed downstream data objects; if you make a change to an upstream table, how does that affect downstream objects, and vice versa.
Data Discovery: being able to find your data and see what actually exists.
147
What are the default permissions for users and admins in DBSQL?
Admins: can view all objects registered to the metastore and control permissions for other users in the workspace.
Users: have no permissions on anything registered to the metastore other than objects they create in DBSQL; before users can create any databases, tables, or views, they must have create and usage privileges specifically granted to them.
Permissions are set using groups configured by an administrator; Access Control Lists (ACLs) are used to control permissions.
148
What are the 6 objects for which Databricks allows you to configure permissions?
CATALOG: controls access to the entire data catalog.
DATABASE: controls access to a database.
TABLE: controls access to a managed or external table.
VIEW: controls access to SQL views.
FUNCTION: controls access to a named function.
ANY FILE: controls access to the underlying filesystem. Users granted access to ANY FILE can bypass the restrictions put on the catalog, databases, tables, and views by reading from the filesystem directly.
149
For each object owner, describe what they can grant privileges for.
Databricks administrator: all objects in the catalog and the underlying filesystem.
Catalog owner: all objects in the catalog.
Database owner: all objects in the database.
Table owner: only the table (similar options apply for views and functions).
150
Describe all the privileges that can be configured in Data Explorer.
ALL PRIVILEGES: gives all privileges (is translated into all the privileges below).
SELECT: gives read access to an object.
MODIFY: gives the ability to add, delete, and modify data to or from an object.
READ_METADATA: gives the ability to view an object and its metadata.
USAGE: does not give any abilities on its own, but is an additional requirement to perform any action on a database object.
CREATE: gives the ability to create an object (for example, a table in a database).
(A grant sketch follows below.)
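A sketch of granting object-level privileges from a Python cell (the database and group names are hypothetical):
# Give the analysts group the ability to use a database and read its tables and metadata
spark.sql("GRANT USAGE ON DATABASE retail_db TO `analysts`")
spark.sql("GRANT SELECT, READ_METADATA ON DATABASE retail_db TO `analysts`")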
151
What is the command to generate a new database and grant permissions to all users in the DBSQL query editor?
To enable the ability to create databases and tables in the default catalog using Databricks SQL:
GRANT usage, create ON CATALOG hive_metastore TO users
To confirm this has run successfully:
SHOW GRANT ON CATALOG hive_metastore
152
In pyspark dataframe API, what should we use to load data from from temp view back to dataframe?
spark.table
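A one-line sketch (the view name is hypothetical):
df = spark.table("device_counts_tmp_vw")  # returns a DataFrame (streaming if the view was built on a stream)
df.printSchema()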