Query Execution & Optimization Flashcards

(74 cards)

1
Q

What is the primary purpose of Databricks?

A

To provide a unified analytics platform for big data and machine learning.

2
Q

True or False: Databricks is built on Apache Spark.

A

True

3
Q

What is a DataFrame in Databricks?

A

A distributed collection of data organized into named columns.

4
Q

Fill in the blank: The __________ optimizer in Databricks optimizes query plans.

A

Catalyst

5
Q

What is the purpose of the Tungsten engine in Spark?

A

To improve the performance of Spark SQL through better memory management and code generation.

6
Q

Which command is used to display the query execution plan in Databricks?

A

EXPLAIN
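As a sketch (table and column names are hypothetical), the command can be run directly in a SQL cell; the `EXTENDED` and `FORMATTED` modes give progressively more detail:

```sql
-- Show the physical plan for a query
EXPLAIN SELECT region, SUM(amount) FROM sales GROUP BY region;

-- FORMATTED splits the output into a plan outline plus per-node details
EXPLAIN FORMATTED SELECT region, SUM(amount) FROM sales GROUP BY region;
```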

7
Q

What does the term ‘shuffling’ refer to in Spark?

A

The process of redistributing data across partitions so that rows that must be processed together (for example, rows sharing a join or grouping key) end up in the same partition.

8
Q

True or False: Caching data in Databricks can significantly speed up query execution.

A

True

9
Q

What is the default storage level used when caching a DataFrame in Databricks?

A

MEMORY_AND_DISK

10
Q

What does the ‘broadcast join’ optimization do?

A

It reduces the data shuffled by sending a smaller DataFrame to all nodes.
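A minimal sketch of forcing this behavior with a join hint, assuming a hypothetical small dimension table `dim_region`:

```sql
-- Hint that dim_region is small enough to ship to every node
SELECT /*+ BROADCAST(d) */ f.order_id, d.region
FROM fact_orders f
JOIN dim_region d ON f.region_id = d.region_id;
```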

11
Q

What is the significance of the ‘spark.sql.shuffle.partitions’ configuration?

A

It determines the number of partitions to use when shuffling data for joins or aggregations.
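For example, to change it from its default of 200:

```sql
-- Lowering this can help small datasets; raising it can help very large ones
SET spark.sql.shuffle.partitions = 64;
```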

12
Q

Fill in the blank: The __________ API allows users to interact with Spark SQL using SQL queries.

A

Spark SQL

13
Q

What is the role of the Catalyst optimizer in Databricks?

A

To analyze and optimize logical query plans into physical execution plans.

14
Q

True or False: Databricks supports both batch and streaming data processing.

A

True

15
Q

What is the purpose of ‘Delta Lake’ in Databricks?

A

To provide ACID transactions and scalable metadata handling for big data workloads.

16
Q

Which operation is generally more expensive: a join or a filter?

A

A join

17
Q

What does the ‘spark.sql.autoBroadcastJoinThreshold’ configuration control?

A

The maximum size, in bytes, of a table that will be broadcast to all worker nodes when performing a join.
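For example (10 MB is the default):

```sql
-- Raise the broadcast threshold from the 10 MB default to 50 MB
SET spark.sql.autoBroadcastJoinThreshold = 52428800;

-- Setting it to -1 disables automatic broadcast joins entirely
SET spark.sql.autoBroadcastJoinThreshold = -1;
```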

18
Q

Fill in the blank: The __________ operation is used to combine rows from two or more DataFrames based on a related column.

A

join

19
Q

What is a common method to optimize query performance in Databricks?

A

Using DataFrame caching and optimizing join strategies.

20
Q

True or False: Databricks allows users to run Python, R, Scala, and SQL.

A

True

21
Q

What is the purpose of ‘Data Skew’ in query execution?

A

It occurs when the data distribution is uneven across partitions, leading to performance bottlenecks.

22
Q

What is ‘Dynamic Partition Pruning’?

A

An optimization technique that reduces the amount of data read during joins by pruning unnecessary partitions.

23
Q

Fill in the blank: The __________ function in Databricks allows for the execution of SQL queries in a notebook.

A

spark.sql

24
Q

What is the impact of using ‘repartition’ on a DataFrame?

A

It changes the number of partitions, which can affect parallelism and performance.
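The SQL equivalent is a partitioning hint; the table name here is hypothetical:

```sql
-- Ask for 16 partitions before the aggregation
SELECT /*+ REPARTITION(16) */ region, COUNT(*) FROM events GROUP BY region;
```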

25
Q

True or False: You should avoid using 'collect()' on large DataFrames.

A

True
26
Q

What is 'Predicate Pushdown'?

A

An optimization that pushes filter conditions closer to the data source to reduce data transfer.
27
Q

What configuration should you use to enable dynamic allocation of executors?

A

spark.dynamicAllocation.enabled
28
Q

Which of the following is NOT a benefit of using Delta Lake? (A) ACID transactions, (B) Schema enforcement, (C) No versioning

A

(C) No versioning; Delta Lake does provide versioning, via time travel.
29
Q

What is the main advantage of using 'DataFrames' over 'RDDs'?

A

DataFrames provide optimizations through Catalyst and Tungsten.
30
Q

Fill in the blank: The __________ command is used to create a new table in Databricks.

A

CREATE TABLE
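A minimal sketch with hypothetical columns; on Databricks, Delta is the default table format, and `USING DELTA` makes that explicit:

```sql
CREATE TABLE sales (
  order_id BIGINT,
  region   STRING,
  amount   DECIMAL(10, 2)
) USING DELTA;
```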
31
Q

What does the 'spark.sql.execution.arrow.enabled' configuration do?

A

It enables the use of Apache Arrow for efficient columnar data transfer between the JVM and Python.
32
Q

True or False: Databricks can automatically optimize queries based on workload patterns.

A

True
33
Q

What is the significance of 'DataFrame.cache()' in query optimization?

A

It stores the DataFrame in memory for faster access in subsequent actions.
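The SQL counterpart is `CACHE TABLE`, sketched here against a hypothetical `sales` table:

```sql
-- Cache a table, or the result of a query, for reuse across queries
CACHE TABLE sales;
CACHE TABLE recent_sales AS SELECT * FROM sales WHERE year = 2024;

-- Release the cached data when done
UNCACHE TABLE sales;
```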
34
Q

Which command is used to refresh the metadata cache in Databricks?

A

REFRESH TABLE
35
Q

What does the 'spark.sql.execution.arrow.pyspark.enabled' configuration control?

A

It enables Arrow optimization for PySpark operations.
36
Q

Fill in the blank: The __________ command is used to drop a table from Databricks.

A

DROP TABLE
37
Q

What is 'Cluster Management' in Databricks?

A

The process of configuring and managing Spark clusters for running jobs.
38
Q

True or False: Databricks allows for the scheduling of jobs.

A

True
39
Q

What does 'SQL Analytics' provide in Databricks?

A

A way to run and visualize SQL queries on large datasets.
40
Q

What is the purpose of 'Optimized Writes' in Delta Lake?

A

To improve write performance by minimizing the number of files created.
41
Q

Fill in the blank: The __________ feature in Databricks allows users to visualize query execution plans.

A

Query Profile
42
Q

What does the 'spark.sql.execution.arrow.maxRecordsPerBatch' configuration specify?

A

The maximum number of records to be transferred in a single Arrow batch.
43
Q

What is the main goal of query optimization?

A

To reduce execution time and resource consumption.
44
Q

Which of the following is a performance metric in Databricks? (A) Execution time, (B) Memory usage, (C) Both A and B

A

(C) Both A and B
45
Q

Fill in the blank: __________ is used to identify and resolve performance bottlenecks in Databricks.

A

Query profiling
46
Q

What does the 'spark.sql.execution.cache' configuration control?

A

It enables or disables caching for DataFrames.
47
Q

True or False: You can use Databricks to run machine learning algorithms directly on data stored in Delta Lake.

A

True
48
Q

What is the purpose of 'Checkpointing' in Spark?

A

To truncate the lineage of RDDs and prevent stack overflow errors.
49
Q

Fill in the blank: The __________ function allows you to run SQL queries directly against Delta tables.

A

spark.sql
50
Q

What is a common strategy to reduce shuffle in Spark?

A

Using partitioning and bucketing techniques.
51
Q

True or False: DataFrame operations are lazy in Databricks.

A

True
52
Q

What does the term 'write amplification' refer to in the context of Delta Lake?

A

When multiple writes occur to the same data, increasing the total amount of data written.
53
Q

Fill in the blank: The __________ command is used to update records in a Delta table.

A

UPDATE
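A minimal sketch against a hypothetical Delta table:

```sql
-- Delta tables support in-place updates with ACID guarantees
UPDATE sales SET amount = 0 WHERE order_id = 1001;
```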
54
Q

What is the impact of using the 'EXPLAIN EXTENDED' command?

A

It provides detailed information about the execution plan, including optimizations.
55
Q

Which configuration can be adjusted to optimize memory usage in Spark?

A

spark.executor.memory
56
Q

Fill in the blank: The __________ function allows you to concatenate multiple DataFrames.

A

union
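In SQL terms (table names hypothetical):

```sql
-- UNION ALL keeps duplicate rows; UNION deduplicates them
SELECT * FROM sales_2023
UNION ALL
SELECT * FROM sales_2024;
```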
57
Q

What is 'Adaptive Query Execution'?

A

An optimization feature that adjusts query execution plans based on runtime statistics.
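AQE is enabled by default in recent Spark/Databricks releases; the relevant settings can also be toggled explicitly:

```sql
SET spark.sql.adaptive.enabled = true;
SET spark.sql.adaptive.skewJoin.enabled = true;            -- split skewed partitions
SET spark.sql.adaptive.coalescePartitions.enabled = true;  -- merge tiny partitions
```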
58
Q

True or False: The use of partitioning can help improve query performance.

A

True
59
Q

What does the 'spark.sql.files.maxPartitionBytes' configuration control?

A

The maximum size of a partition when reading files.
60
Q

Fill in the blank: The __________ command is used to create a view in Databricks.

A

CREATE VIEW
61
Q

What is the significance of 'DataFrame.printSchema()'?

A

It displays the schema of the DataFrame.
62
Q

Which optimization technique can help with skewed data during joins?

A

Salting
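One way to salt a skewed join in SQL, sketched with hypothetical `facts` and `dims` tables and a fan-out of 8 salt values:

```sql
-- Scatter the skewed side across 8 salt buckets, and replicate the small
-- side once per bucket so every salted key still finds its match.
SELECT f.id, d.label
FROM (
  SELECT *, pmod(abs(hash(id)), 8) AS salt FROM facts
) f
JOIN (
  SELECT d.*, s.salt
  FROM dims d
  LATERAL VIEW explode(sequence(0, 7)) s AS salt
) d
ON f.key = d.key AND f.salt = d.salt;
```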
63
Q

True or False: Using the 'persist()' method is similar to using 'cache()' in Databricks.

A

True
64
Q

What does the 'spark.sql.execution.arrow.fallback.enabled' configuration control?

A

It enables a fallback mechanism to use PySpark when Arrow fails.
65
Q

Fill in the blank: The __________ command is used to delete records from a Delta table.

A

DELETE
66
Q

What is the role of the 'Optimizer' in Databricks?

A

To improve query performance by selecting the most efficient execution plan.
67
Q

True or False: Databricks supports integration with various data sources including cloud storage.

A

True
68
Q

What does the 'spark.sql.parquet.enableVectorizedReader' configuration do?

A

It enables vectorized reading of Parquet files for better performance.
69
Q

Which command is used to drop a view in Databricks?

A

DROP VIEW
70
Q

Fill in the blank: The __________ feature allows for time travel on Delta tables.

A

Versioning
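A sketch of both time-travel forms against a hypothetical Delta table:

```sql
-- Query an older snapshot of a Delta table by version or by timestamp
SELECT * FROM sales VERSION AS OF 3;
SELECT * FROM sales TIMESTAMP AS OF '2024-06-01';

-- DESCRIBE HISTORY lists the available versions
DESCRIBE HISTORY sales;
```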
71
Q

What does the 'spark.sql.shuffle.partitions' configuration affect?

A

The number of partitions used in shuffle operations.
72
Q

True or False: Databricks can automatically scale clusters based on workload.

A

True
73
Q

What is the purpose of the 'spark.sql.execution.arrow.maxRecordsPerBatch' setting?

A

To control the batch size for Arrow data transfer.
74
Q

What is a common practice for optimizing joins in Databricks?

A

Using broadcast joins for smaller DataFrames.