ETL with Spark SQL and Python (Test Qs) Flashcards

1
Q

You were asked to create a table that can store the data below. orderTime is a timestamp, but when the finance team queries this data they normally prefer orderTime in date format. You would like to create a calculated column that converts the orderTime timestamp to a date and stores it. Fill in the blank to complete the DDL.

orderId | orderTime             | units
1       | 01-01-2020 09:10:24AM | 100
2       | 01-01-2020 10:30:30AM | 10

CREATE TABLE orders (
orderId int,
orderTime timestamp,
orderdate date _____________________________________________ ,
units int)

Choose one:

AS DEFAULT (CAST(orderTime as DATE))

GENERATED DEFAULT AS (CAST(orderTime as DATE))

GENERATED ALWAYS AS (CAST(orderTime as DATE))

AS (CAST(orderTime as DATE))

Delta Lake does not support calculated columns; values should be inserted into the table as a parameter.

A

GENERATED ALWAYS AS (CAST(orderTime as DATE))

Delta Lake supports generated columns, which are a special type of column whose values are automatically generated based on a user-specified function over other columns in the Delta table. When you write to a table with generated columns and you do not explicitly provide values for them, Delta Lake automatically computes the values.

Note: Databricks also supports partitioning using generated columns.
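
For reference, a minimal sketch of the completed DDL (generated columns require Delta, which is the default table format on Databricks):

CREATE TABLE orders (
orderId int,
orderTime timestamp,
orderdate date GENERATED ALWAYS AS (CAST(orderTime AS DATE)),
units int)
USING DELTA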

2
Q

Which one of the following is not a Databricks lakehouse object?

Tables

Functions

Stored Procedures

Views

Catalog

Database/Schemas

A

Stored Procedures

3
Q

What type of table is created when you create a Delta table with the below command?

CREATE TABLE transactions USING DELTA LOCATION "dbfs:/mnt/bronze/transactions"

Managed Delta table

Managed Table

Temp Table

External Table

Delta Lake Table

A

Any time a table is created using the LOCATION keyword, it is considered an external table. Below is the syntax:

Syntax

CREATE TABLE table_name ( column column_data_type…) USING format LOCATION "dbfs:/"

format -> DELTA, JSON, CSV, PARQUET, TEXT
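
As a quick check (a sketch, assuming the transactions table above), you can confirm the table type with:

DESCRIBE TABLE EXTENDED transactions

The Type row in the output shows EXTERNAL for tables created with LOCATION and MANAGED for tables created without it.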

4
Q

Which of the following commands can be used to drop a managed Delta table and the underlying files in storage?

DROP TABLE table_name

DROP TABLE table_name CASCADE

DROP TABLE table_name INCLUDE_FILES

DROP TABLE table and run VACUUM command

Use the DROP TABLE table_name command and manually delete files using the command dbutils.fs.rm("/path", True)

A

The answer is DROP TABLE table_name.

When a managed table is dropped, the table definition is dropped from the metastore, and everything including the data, metadata, and history is also dropped from storage.
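
For contrast, a sketch (sales_managed is a hypothetical managed table; transactions is the external table from the previous question):

DROP TABLE sales_managed -- managed table: metastore entry and data files are both removed
DROP TABLE transactions -- external table: only the metastore entry is removed

For the external table, the underlying files would still have to be deleted separately, e.g. with dbutils.fs.rm("dbfs:/mnt/bronze/transactions", True) from Python.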

5
Q

Which of the following is the correct statement for a session-scoped temporary view?

Temporary views are stored in memory

Temporary views are lost once the notebook is detached and re-attached

Temporary views can still be accessed even if the notebook is detached and attached

Temporary views can still be accessed even if the cluster is restarted

Temporary views are created in the local_temp database

A

The answer is: Temporary views are lost once the notebook is detached and re-attached.

There are two types of temporary views that can be created: session-scoped and global.

A local/session-scoped temporary view is only available within a Spark session, so another notebook on the same cluster cannot access it. If the notebook is detached and reattached, the local temporary view is lost.

A global temporary view is available to all notebooks on the cluster; if the cluster restarts, the global temporary view is lost.
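
A minimal sketch of a session-scoped view (view and table names are hypothetical):

CREATE OR REPLACE TEMPORARY VIEW recent_orders AS
SELECT * FROM orders WHERE units > 50;

SELECT * FROM recent_orders -- works in this notebook's session, but not from another notebook, and not after detach/re-attach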

6
Q

Which of the following is correct for the global temporary view?

global temporary views cannot be accessed once the notebook is detached and attached

global temporary views can be accessed across many clusters

global temporary views can still be accessed even if the notebook is detached and attached

global temporary views can still be accessed even if the cluster is restarted

global temporary views are created in a database called temp database

A

The answer is: global temporary views can still be accessed even if the notebook is detached and attached.

There are two types of temporary views that can be created: local and global.

· A local temporary view is only available within a Spark session, so another notebook on the same cluster cannot access it. If the notebook is detached and reattached, the local temporary view is lost.

· A global temporary view is available to all notebooks on the cluster; even if the notebook is detached and reattached it can still be accessed, but if the cluster is restarted the global temporary view is lost.
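
A minimal sketch (names are hypothetical); global temporary views live in the global_temp database and must be qualified with it:

CREATE OR REPLACE GLOBAL TEMPORARY VIEW recent_orders_global AS
SELECT * FROM orders WHERE units > 50;

SELECT * FROM global_temp.recent_orders_global -- accessible from other notebooks on the same cluster until the cluster restarts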

7
Q

You are currently reloading the customer_sales table using the below query:

INSERT OVERWRITE customer_sales
SELECT * FROM customers c
INNER JOIN sales_monthly s on s.customer_id = c.customer_id

After you ran the above command, the marketing team quickly wanted to review the old data that was in the table. How does INSERT OVERWRITE impact the data in the customer_sales table if you want to see the version of the data from before running the above statement?

Overwrites the data in the table, including all historical versions of the data; you cannot time travel to previous versions.

Overwrites the current version of the data but preserves all historical versions of the data, you can time travel to previous versions.

Overwrites the current version of the data but clears all historical versions of the data, so you cannot time travel to previous versions

Appends the data to the current version, you can time travel to previous versions

By default, overwrites the data and schema, you cannot perform time travel.

A

The answer is: INSERT OVERWRITE overwrites the current version of the data but preserves all historical versions of the data; you can time travel to previous versions.

INSERT OVERWRITE customer_sales
SELECT * FROM customers c
INNER JOIN sales s on s.customer_id = c.customer_id

Let’s assume this is the second time you are running the above statement. You can still query the prior version of the data using time travel; any DML/DDL operation except DROP TABLE creates new Parquet files, so you can still access the previous versions of the data.

SQL syntax for time travel:

SELECT * FROM table_name VERSION AS OF [version number]

With the customer_sales example:

SELECT * FROM customer_sales VERSION AS OF 1 -- previous version

SELECT * FROM customer_sales VERSION AS OF 2 -- current version

You can see all historical changes on the table using DESCRIBE HISTORY table_name.

Note: the main difference between INSERT OVERWRITE and CREATE OR REPLACE TABLE (CRAS) is that CRAS can modify the schema of the table, i.e., it can add new columns or change the data types of existing columns. By default, INSERT OVERWRITE only overwrites the data.

INSERT OVERWRITE can also be used to update the schema when spark.databricks.delta.schema.autoMerge.enabled is set to true; if this option is not enabled and there is a schema mismatch, the INSERT OVERWRITE command will fail.

Any DML/DDL operation(except DROP TABLE) on the Delta table preserves the historical version of the data.
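
A sketch of enabling that option before the overwrite (using the tables from the question):

SET spark.databricks.delta.schema.autoMerge.enabled = true;

INSERT OVERWRITE customer_sales
SELECT * FROM customers c
INNER JOIN sales_monthly s on s.customer_id = c.customer_id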

8
Q

Which of the following SQL statements can be used to query a table while eliminating duplicate rows from the query results?

SELECT DISTINCT * FROM table_name

SELECT DISTINCT * FROM table_name HAVING COUNT (*) > 1

SELECT DISTINCT_ROWS * FROM table_name

SELECT * FROM table_name GROUP BY * HAVING COUNT (*) > 1

SELECT * FROM table_name GROUP BY * HAVING COUNT (*) < 1

A

The answer is SELECT DISTINCT * FROM table_name

9
Q

Which of the below SQL statements can be used to create a SQL UDF to convert Celsius to Fahrenheit and vice versa? You need to pass two parameters to this function: the actual temperature, and a one-letter code (F or C) that identifies whether it needs to be converted to Fahrenheit or Celsius.

select udf_convert(60,'C') will result in 15.5

select udf_convert(10,'F') will result in 50

Select one:

CREATE UDF FUNCTION udf_convert(temp DOUBLE, measure STRING)
RETURNS DOUBLE
RETURN CASE WHEN measure == 'F' then (temp * 9/5) + 32
ELSE (temp - 32) * 5/9
END

CREATE UDF FUNCTION udf_convert(temp DOUBLE, measure STRING)
RETURN CASE WHEN measure == 'F' then (temp * 9/5) + 32
ELSE (temp - 32) * 5/9
END

CREATE FUNCTION udf_convert(temp DOUBLE, measure STRING)
RETURN CASE WHEN measure == 'F' then (temp * 9/5) + 32
ELSE (temp - 32) * 5/9
END

CREATE FUNCTION udf_convert(temp DOUBLE, measure STRING)
RETURNS DOUBLE
RETURN CASE WHEN measure == 'F' then (temp * 9/5) + 32
ELSE (temp - 32) * 5/9
END

CREATE USER FUNCTION udf_convert(temp DOUBLE, measure STRING)
RETURNS DOUBLE
RETURN CASE WHEN measure == 'F' then (temp * 9/5) + 32
ELSE (temp - 32) * 5/9
END

A

CREATE FUNCTION udf_convert(temp DOUBLE, measure STRING)
RETURNS DOUBLE
RETURN CASE WHEN measure == 'F' then (temp * 9/5) + 32
ELSE (temp - 32) * 5/9
END

10
Q

You are trying to calculate the total sales made by all employees by parsing a complex struct data type that stores employee and sales data. How would you approach this in SQL?

Table definition,

batchId INT, performance ARRAY<STRUCT<employeeId: BIGINT, sales: INT>>, insertDate TIMESTAMP

Sample data of performance column

[
{ "employeeId": 1234,
"sales": 10000 },

{ "employeeId": 3232,
"sales": 30000 }
]

Calculate the total sales made by all the employees.

Sample data with create table syntax for the data:

create or replace table sales as
select 1 as batchId,
from_json('[{ "employeeId":1234, "sales": 10000 },{ "employeeId":3232, "sales": 30000 }]',
'ARRAY<STRUCT<employeeId: BIGINT, sales: INT>>') as performance,
current_timestamp() as insertDate
union all
select 2 as batchId,
from_json('[{ "employeeId":1235, "sales": 10500 },{ "employeeId":3233, "sales": 32000 }]',
'ARRAY<STRUCT<employeeId: BIGINT, sales: INT>>') as performance,
current_timestamp() as insertDate

Select one:

WITH CTE as (SELECT EXPLODE (performance) FROM table_name)
SELECT SUM (performance.sales) FROM CTE

select aggregate(flatten(collect_list(performance.sales)), 0, (x, y) -> x + y)
as total_sales from sales

WITH CTE as (SELECT FLATTEN (performance) FROM table_name)
SELECT SUM (performance.sales) FROM CTE

SELECT SUM(SLICE(performance, sales)) FROM employee

select reduce(flatten(collect_list(performance.sales)), 0, (x, y) -> x + y)
as total_sales from sales

A

The answer is:

select aggregate(flatten(collect_list(performance.sales)), 0, (x, y) -> x + y)
as total_sales from sales

A nested struct can be queried using the . notation; performance.sales gives you access to all the sales values in the performance column.

Note: an option that used performance:sales instead of performance.sales would be wrong, because ':' is only used when referring to JSON strings, and here we are dealing with a struct data type. For the exam, make sure to understand whether you are dealing with JSON data or struct data.
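
Running the winning query against the sample sales table created above returns the expected total (10000 + 30000 + 10500 + 32000):

select aggregate(flatten(collect_list(performance.sales)), 0, (x, y) -> x + y)
as total_sales from sales
-- total_sales: 82500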

11
Q

Which of the following statements can be used to test that the number of rows in a table is equal to 10, in Python?

row_count = spark.sql("select count(*) from table").collect()[0][0]

assert (row_count == 10, "Row count did not match")

assert row_count == 10, "Row count did not match"

assert if ( row_count == 10, "Row count did not match")

assert if row_count == 10, "Row count did not match"

assert row_count = 10, "Row count did not match"

A

The answer is: assert row_count == 10, "Row count did not match"
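
Putting it together, a minimal sketch (assuming a table named table exists):

row_count = spark.sql("select count(*) from table").collect()[0][0]

# raises AssertionError("Row count did not match") when the count is not 10
assert row_count == 10, "Row count did not match"

Note that the variant with parentheses, assert (row_count == 10, "Row count did not match"), asserts a two-element tuple, which is always truthy, so it never fails.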

12
Q

How do you handle failures gracefully when writing code in PySpark? Fill in the blanks to complete the below statement.

_____

spark.read.table("table_name").select("column").write.mode("append").saveAsTable("new_table_name")

_____

print(f"query failed")

try: failure:

try: catch:

try: except:

try: fail:

try: error:

A

The answer is try: and except:
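
A minimal sketch of the completed statement (table and column names are the placeholders from the question):

try:
    spark.read.table("table_name").select("column").write.mode("append").saveAsTable("new_table_name")
except Exception as e:
    print(f"query failed: {e}")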

13
Q

You are working on a process to query a table based on batch date. The batch date is an input parameter and is expected to change every time the program runs. What is the best way to parameterize the query so that it runs without manually changing the batch date?

Create a notebook parameter for the batch date, assign the value to a Python variable, and use a Spark DataFrame to filter the data based on the Python variable

Create a dynamic view that can calculate the batch date automatically and use the view to query the data

There is no way we can combine Python variables and Spark code

Manually edit the code every time to change the batch date

Store the batch date in the Spark config and use a Spark DataFrame to filter the data based on the Spark config

A

Create a notebook parameter for the batch date, assign the value to a Python variable, and use a Spark DataFrame to filter the data based on the Python variable.
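
A minimal sketch using a notebook widget as the parameter (widget, table, and column names are hypothetical):

# create the notebook parameter and read it into a Python variable
dbutils.widgets.text("batch_date", "2020-01-01")
batch_date = dbutils.widgets.get("batch_date")

# use the variable to filter a Spark DataFrame
df = spark.read.table("sales_monthly").filter(f"batch_date = '{batch_date}'")
display(df)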
