Query Execution & Optimization Flashcards
(74 cards)
What is the primary purpose of Databricks?
To provide a unified analytics platform for big data and machine learning.
True or False: Databricks is built on Apache Spark.
True
What is a DataFrame in Databricks?
A distributed collection of data organized into named columns.
Fill in the blank: The __________ optimizer in Databricks optimizes query plans.
Catalyst
What is the purpose of the Tungsten engine in Spark?
To improve the performance of Spark SQL through better memory management and code generation.
Which command is used to display the query execution plan in Databricks?
EXPLAIN
What does the term ‘shuffling’ refer to in Spark?
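The process of redistributing data across partitions (and usually across executors) so that records with the same key end up in the same partition, as required by wide operations such as joins and aggregations.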
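Example: a minimal pure-Python sketch of the hash partitioning step behind a shuffle. It does not use Spark; the function name `shuffle_by_key` and the sample records are hypothetical, and `hash(key) % n` is only a simplified stand-in for Spark's real partitioner.

```python
from collections import defaultdict

def shuffle_by_key(partitions, num_output_partitions):
    """Redistribute (key, value) records so equal keys share one partition."""
    output = defaultdict(list)
    for partition in partitions:
        for key, value in partition:
            # Simplified stand-in for a hash partitioner (assumption).
            target = hash(key) % num_output_partitions
            output[target].append((key, value))
    return [output[i] for i in range(num_output_partitions)]

# Before: records for key 1 are scattered over all three input partitions.
before = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")], [(2, "e"), (1, "f")]]
after = shuffle_by_key(before, num_output_partitions=2)

for index, partition in enumerate(after):
    print(index, partition)
```

After the call, every record with a given key sits in exactly one output partition, which is what a join or aggregation on that key needs.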
True or False: Caching data in Databricks can significantly speed up query execution.
True
What is the default storage level used when caching a DataFrame in Databricks?
MEMORY_AND_DISK
What does the ‘broadcast join’ optimization do?
It avoids shuffling the larger DataFrame by sending a full copy of the smaller DataFrame to every executor, so each partition can be joined locally.
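Example: a pure-Python sketch of the idea behind a broadcast hash join, assuming an inner join on the first tuple element. The function `broadcast_hash_join` and the `orders` and `countries` data are hypothetical and only model the behavior; Spark's implementation differs.

```python
def broadcast_hash_join(large_partitions, small_rows):
    """Inner-join each large-side partition against a full copy of the small side."""
    # "Broadcast": build one lookup table from the small side. In Spark a
    # copy of this table would be shipped to every executor.
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)

    joined = []
    for partition in large_partitions:
        # Each partition is joined in place; the large side never moves.
        joined.append([
            (key, left, right)
            for key, left in partition
            for right in lookup.get(key, [])
        ])
    return joined

orders = [[(1, "order-a"), (2, "order-b")], [(3, "order-c"), (1, "order-d")]]
countries = [(1, "DE"), (2, "FR")]
result = broadcast_hash_join(orders, countries)
print(result)
```

The large side keeps its original partitioning, so no shuffle of `orders` is modeled here; only the small lookup table is replicated.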
What is the significance of the ‘spark.sql.shuffle.partitions’ configuration?
It determines the number of partitions to use when shuffling data for joins or aggregations.
Fill in the blank: The __________ module allows users to query structured data in Spark using SQL queries.
Spark SQL
What is the role of the Catalyst optimizer in Databricks?
To analyze and optimize logical query plans and translate them into physical execution plans.
True or False: Databricks supports both batch and streaming data processing.
True
What is the purpose of ‘Delta Lake’ in Databricks?
To provide ACID transactions and scalable metadata handling for big data workloads.
Which operation is generally more expensive: a join or a filter?
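A join, because it usually requires shuffling data between partitions, whereas a filter is applied within each partition.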
What does the ‘spark.sql.autoBroadcastJoinThreshold’ configuration control?
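The maximum size in bytes of a table that Spark will automatically broadcast to all executors for a join (default 10 MB; setting it to -1 disables automatic broadcasting).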
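Example: a simplified model of the size check this setting controls, written in plain Python. The function `would_auto_broadcast` is hypothetical and not a Spark API; the real planner works from estimated table statistics and also considers join type and hints.

```python
# 10 MB, the documented default of spark.sql.autoBroadcastJoinThreshold.
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024

def would_auto_broadcast(estimated_size_bytes, threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Return True if a table of this estimated size passes the size check."""
    if threshold_bytes < 0:
        # A threshold of -1 turns automatic broadcasting off.
        return False
    return estimated_size_bytes <= threshold_bytes

print(would_auto_broadcast(2 * 1024 * 1024))        # small lookup table
print(would_auto_broadcast(500 * 1024 * 1024))      # large fact table
print(would_auto_broadcast(2 * 1024 * 1024, -1))    # broadcasting disabled
```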
Fill in the blank: The __________ operation is used to combine rows from two or more DataFrames based on a related column.
join
What is a common method to optimize query performance in Databricks?
Using DataFrame caching and optimizing join strategies.
True or False: Databricks allows users to run Python, R, Scala, and SQL.
True
What is ‘data skew’ in query execution?
An uneven distribution of data across partitions, which leaves a few tasks doing most of the work and causes performance bottlenecks.
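Example: a pure-Python sketch of how one hot key skews partition sizes, and how salting (one common mitigation, not named in the card itself) spreads that key out. The data, the modulo partitioning rule, and the rotating salt are all assumptions chosen for illustration.

```python
from collections import Counter

NUM_PARTITIONS = 4

def partition_sizes(partition_ids):
    """Count how many records land in each partition."""
    counts = Counter(partition_ids)
    return [counts.get(i, 0) for i in range(NUM_PARTITIONS)]

# 90 records share the hot key 0; keys 1..10 have one record each.
keys = [0] * 90 + list(range(1, 11))

# Plain key partitioning: every record with key 0 goes to partition 0.
skewed = partition_sizes(k % NUM_PARTITIONS for k in keys)

# Salting: add a rotating salt to the key before partitioning, so the
# hot key's records are spread over all partitions.
salted = partition_sizes(
    (k + i % NUM_PARTITIONS) % NUM_PARTITIONS for i, k in enumerate(keys)
)

print("skewed:", skewed)
print("salted:", salted)
```

In a real join, salting one side also requires replicating the other side once per salt value so that matching rows still meet.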
What is ‘Dynamic Partition Pruning’?
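An optimization technique that, at run time, uses the filter results from one side of a join to skip partitions of the other (partitioned) table that cannot contain matching rows, reducing the amount of data read.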
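Example: a pure-Python sketch of the pruning idea, assuming a fact table partitioned by region and a small dimension table with a filter on it. The names `sales_partitions`, `regions`, and `join_with_pruning` are hypothetical; this models the concept only, not Spark's planner.

```python
# Fact table stored as partitions keyed by region (hypothetical data).
sales_partitions = {
    "EU": [("EU", 100), ("EU", 250)],
    "US": [("US", 75)],
    "APAC": [("APAC", 40), ("APAC", 60)],
}
# Dimension table rows: (region, is_active).
regions = [("EU", True), ("US", False), ("APAC", False)]

def join_with_pruning(fact_partitions, dim_rows):
    """Scan only the fact partitions that can match the filtered dimension rows."""
    # Step 1: evaluate the dimension-side filter first, at run time.
    wanted = {region for region, active in dim_rows if active}
    # Step 2: read only fact partitions whose key survived the filter.
    scanned = [name for name in fact_partitions if name in wanted]
    rows = [row for name in scanned for row in fact_partitions[name]]
    return scanned, rows

scanned, rows = join_with_pruning(sales_partitions, regions)
print("partitions scanned:", scanned)
print("rows read:", rows)
```

Only one of the three partitions is read, because the set of matching regions is known before the fact table is scanned.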
Fill in the blank: The __________ function in Databricks allows for the execution of SQL queries in a notebook.
spark.sql
What is the impact of using ‘repartition’ on a DataFrame?
It changes the number of partitions by performing a full shuffle, which can improve parallelism and balance but adds the cost of moving the data.
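Example: a pure-Python sketch of round-robin redistribution, which is how repartitioning to a fixed number of partitions without a key can be pictured. The function below is a hypothetical model and not Spark's `repartition` itself.

```python
def repartition(partitions, num_partitions):
    """Spread all records evenly over num_partitions in round-robin order."""
    output = [[] for _ in range(num_partitions)]
    records = (record for partition in partitions for record in partition)
    for index, record in enumerate(records):
        # Every record may move to a new partition, which is why the
        # real operation involves a full shuffle.
        output[index % num_partitions].append(record)
    return output

uneven = [[1, 2, 3, 4, 5, 6, 7], [8], []]
balanced = repartition(uneven, 4)
print(balanced)
```

Three uneven partitions become four equally sized ones, so four tasks can work in parallel on similar amounts of data.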