Query Execution & Optimization Flashcards by Ozzy Campos

What is the primary purpose of Databricks?

To provide a unified analytics platform for big data and machine learning.

True or False: Databricks is built on Apache Spark.

True

What is a DataFrame in Databricks?

A distributed collection of data organized into named columns.

Fill in the blank: The __________ optimizer in Databricks optimizes query plans.

Catalyst

What is the purpose of the Tungsten engine in Spark?

To improve the performance of Spark SQL through better memory management and code generation.

Which command is used to display the query execution plan in Databricks?

EXPLAIN

What does the term ‘shuffling’ refer to in Spark?

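The process of redistributing data across partitions (and usually across executors) so that rows with the same key end up in the same partition, as wide operations such as joins and aggregations require.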
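
To make the idea concrete, here is a minimal pure-Python sketch of a shuffle. It is not Spark code: `shuffle_by_key` and the sample records are hypothetical, and a CRC32 hash stands in for whatever partitioner the engine actually uses.

```python
import zlib


def shuffle_by_key(partitions, num_output_partitions):
    """Redistribute (key, value) records so equal keys share a partition."""
    output = [[] for _ in range(num_output_partitions)]
    for partition in partitions:
        for key, value in partition:
            # Deterministic hash: the same key always maps to the same target.
            target = zlib.crc32(key.encode("utf-8")) % num_output_partitions
            output[target].append((key, value))
    return output


# Hypothetical input: key "a" starts out spread over three partitions.
input_partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("a", 6)],
]
shuffled = shuffle_by_key(input_partitions, 2)
```

After the call, every record for a given key sits in a single output partition, which is what lets a later per-key aggregation run without further data movement.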

True or False: Caching data in Databricks can significantly speed up query execution.

True

What is the default storage level used when caching a DataFrame in Databricks?

MEMORY_AND_DISK

What does the ‘broadcast join’ optimization do?

It avoids shuffling the larger DataFrame by sending a full copy of the smaller DataFrame to every executor.
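
The mechanism can be sketched in plain Python. This is an illustration only, not Spark's implementation: `broadcast_hash_join` and the sample data are hypothetical.

```python
def broadcast_hash_join(large_partitions, small_rows):
    """Inner-join each partition of the large side against the small side."""
    # "Broadcast": build one lookup table from the small side. Every
    # partition of the large side reads it locally, so large rows never move.
    lookup = {key: value for key, value in small_rows}
    joined = []
    for partition in large_partitions:
        joined.append([
            (key, value, lookup[key])
            for key, value in partition
            if key in lookup  # inner join: unmatched rows are dropped
        ])
    return joined


# Hypothetical data: orders keyed by customer id, plus a small lookup table.
orders = [[(1, "book"), (2, "pen")], [(1, "lamp"), (3, "desk")]]
countries = [(1, "DE"), (2, "FR")]
result = broadcast_hash_join(orders, countries)
```

In PySpark, the corresponding hint is `large_df.join(broadcast(small_df), "key")`, with `broadcast` imported from `pyspark.sql.functions`; the DataFrame and column names here are placeholders.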

What is the significance of the ‘spark.sql.shuffle.partitions’ configuration?

It determines the number of partitions to use when shuffling data for joins or aggregations.

Fill in the blank: The __________ API allows users to interact with Spark SQL using SQL queries.

Spark SQL

What is the role of the Catalyst optimizer in Databricks?

To analyze and optimize logical query plans into physical execution plans.

True or False: Databricks supports both batch and streaming data processing.

True

What is the purpose of ‘Delta Lake’ in Databricks?

To provide ACID transactions and scalable metadata handling for big data workloads.

Which operation is generally more expensive: a join or a filter?

A join

What does the ‘spark.sql.autoBroadcastJoinThreshold’ configuration control?

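The maximum size, in bytes, of a table or DataFrame that Spark will automatically broadcast for a join. The default is 10 MB; setting it to -1 disables automatic broadcasting.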
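
The threshold rule can be written as a small check. This is a simplified sketch: the function name is hypothetical, and the real planner works from estimated statistics and also considers join type and hints.

```python
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024  # Spark's documented default: 10 MB


def eligible_for_auto_broadcast(estimated_size_bytes,
                                threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Return True if a relation of this estimated size may be auto-broadcast."""
    if threshold_bytes < 0:
        # A threshold of -1 turns automatic broadcasting off entirely.
        return False
    return 0 <= estimated_size_bytes <= threshold_bytes


small_ok = eligible_for_auto_broadcast(5 * 1024 * 1024)
large_ok = eligible_for_auto_broadcast(50 * 1024 * 1024)
disabled = eligible_for_auto_broadcast(5 * 1024 * 1024, threshold_bytes=-1)
```

An explicit broadcast hint is not subject to this threshold, so a table above the limit can still be broadcast on request.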

Fill in the blank: The __________ operation is used to combine rows from two or more DataFrames based on a related column.

join

What is a common method to optimize query performance in Databricks?

Using DataFrame caching and optimizing join strategies.

True or False: Databricks allows users to run Python, R, Scala, and SQL.

True

What is ‘Data Skew’ in query execution?

It occurs when the data distribution is uneven across partitions, leading to performance bottlenecks.

What is ‘Dynamic Partition Pruning’?

An optimization technique that reduces the amount of data read during joins by pruning unnecessary partitions.
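
A toy model of the idea, in plain Python. It is a sketch under stated assumptions, not Spark's implementation: the fact table is a dict of partitions keyed by date, and `join_with_partition_pruning` is a hypothetical helper.

```python
def join_with_partition_pruning(fact_partitions, dim_rows, dim_predicate):
    """Join a partitioned fact table to a filtered dimension table."""
    # Step 1: evaluate the filter on the small dimension side at run time.
    wanted_keys = {key for key, attrs in dim_rows if dim_predicate(attrs)}
    # Step 2: scan only the fact partitions whose partition key survived.
    scanned = [key for key in fact_partitions if key in wanted_keys]
    joined = [(key, row) for key in scanned for row in fact_partitions[key]]
    return joined, scanned


# Hypothetical fact table, partitioned by date.
fact_partitions = {
    "2024-01-01": ["sale-1", "sale-2"],
    "2024-01-02": ["sale-3"],
    "2024-01-03": ["sale-4", "sale-5"],
}
# Hypothetical dimension table with one row per date.
dim_dates = [
    ("2024-01-01", {"is_holiday": True}),
    ("2024-01-02", {"is_holiday": False}),
    ("2024-01-03", {"is_holiday": False}),
]
rows, scanned = join_with_partition_pruning(
    fact_partitions, dim_dates, lambda attrs: attrs["is_holiday"]
)
```

The filter is written against the dimension table only, yet it decides which fact partitions are read. That decision can only be made while the query runs, which is why the pruning is called dynamic.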

Fill in the blank: The __________ function in Databricks allows for the execution of SQL queries in a notebook.

spark.sql

What is the impact of using ‘repartition’ on a DataFrame?

It changes the number of partitions, which can affect parallelism and performance.

True or False: You should avoid using 'collect()' on large DataFrames.

True

What is 'Predicate Pushdown'?

An optimization that pushes filter conditions closer to the data source to reduce data transfer.
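
The effect can be shown with a simulated data source. This is a sketch with hypothetical names (`scan`, `is_de`), and it counts returned rows as a stand-in for data transferred.

```python
def scan(source_rows, pushed_filter=None):
    """Simulated data source; returns (rows, number_of_rows_transferred)."""
    if pushed_filter is not None:
        # The source applies the filter itself, so fewer rows leave it.
        rows = [r for r in source_rows if pushed_filter(r)]
    else:
        rows = list(source_rows)
    return rows, len(rows)


# Hypothetical table: 100 rows, 10 of which have country "DE".
source = [{"id": i, "country": "DE" if i % 10 == 0 else "US"} for i in range(100)]


def is_de(row):
    return row["country"] == "DE"


# Without pushdown: transfer everything, then filter in the engine.
all_rows, transferred_without = scan(source)
result_without = [r for r in all_rows if is_de(r)]

# With pushdown: the same filter runs inside the source.
result_with, transferred_with = scan(source, pushed_filter=is_de)
```

Both paths produce the same result; only the number of rows leaving the source differs.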

What configuration should you use to enable dynamic allocation of executors?

spark.dynamicAllocation.enabled

Which of the following is NOT a benefit of using Delta Lake? (A) ACID transactions, (B) Schema enforcement, (C) No versioning

C) No versioning

What is the main advantage of using 'DataFrames' over 'RDDs'?

DataFrames provide optimizations through Catalyst and Tungsten.

Fill in the blank: The __________ command is used to create a new table in Databricks.

CREATE TABLE

What does the 'spark.sql.execution.arrow.enabled' configuration do?

It enables the use of Apache Arrow for efficient columnar data transfer between the JVM and Python. Since Spark 3.0 this name is deprecated in favor of 'spark.sql.execution.arrow.pyspark.enabled'.

True or False: Databricks can automatically optimize queries based on workload patterns.

True

What is the significance of 'DataFrame.cache()' in query optimization?

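It marks the DataFrame to be cached (in memory, spilling to disk if needed) the first time an action computes it, so subsequent actions can reuse the result instead of recomputing it.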

Which command is used to refresh the metadata cache in Databricks?

REFRESH TABLE

What does the 'spark.sql.execution.arrow.pyspark.enabled' configuration control?

It enables Arrow-based columnar data transfer in PySpark, for example in toPandas() and when creating a Spark DataFrame from a pandas DataFrame.

Fill in the blank: The __________ command is used to drop a table from Databricks.

DROP TABLE

What is 'Cluster Management' in Databricks?

The process of configuring and managing Spark clusters for running jobs.

True or False: Databricks allows for the scheduling of jobs.

True

What does 'SQL Analytics' provide in Databricks?

A way to run and visualize SQL queries on large datasets. The product has since been renamed Databricks SQL.

What is the purpose of 'Optimized Writes' in Delta Lake?

To reduce the number of small files by rebalancing data before it is written, so that each write produces fewer, larger files.

Fill in the blank: The __________ feature in Databricks allows users to visualize query execution plans.

Query Profile

What does the 'spark.sql.execution.arrow.maxRecordsPerBatch' configuration specify?

The maximum number of records to be transferred in a single Arrow batch.

What is the main goal of query optimization?

To reduce execution time and resource consumption.

Which of the following is a performance metric in Databricks? (A) Execution time, (B) Memory usage, (C) Both A and B

C) Both A and B

Fill in the blank: __________ is used to identify and resolve performance bottlenecks in Databricks.

Query profiling

What does the 'spark.sql.execution.cache' configuration control?

This does not appear to be a documented Spark configuration. DataFrame caching is controlled explicitly with cache(), persist(), and unpersist().

True or False: You can use Databricks to run machine learning algorithms directly on data stored in Delta Lake.

True

What is the purpose of 'Checkpointing' in Spark?

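To truncate the lineage of an RDD or DataFrame by saving it to reliable storage, which shortens recovery after failures and prevents stack overflow errors caused by very long lineages.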

Fill in the blank: The __________ function allows you to run SQL queries directly against Delta tables.

spark.sql

What is a common strategy to reduce shuffle in Spark?

Using partitioning and bucketing techniques.

True or False: DataFrame operations are lazy in Databricks.

True
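
Lazy evaluation can be modeled with a tiny class that records operations and runs them only on an action. `LazyFrame` is a hypothetical stand-in, not a Spark or Databricks class.

```python
class LazyFrame:
    """Records transformations; nothing runs until collect() is called."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Transformation: extend the plan, do no work.
        return LazyFrame(self._rows, self._steps + (("filter", predicate),))

    def select(self, func):
        # Transformation: extend the plan, do no work.
        return LazyFrame(self._rows, self._steps + (("map", func),))

    def collect(self):
        # Action: execute the recorded plan step by step.
        rows = self._rows
        for kind, fn in self._steps:
            if kind == "filter":
                rows = [r for r in rows if fn(r)]
            else:
                rows = [fn(r) for r in rows]
        return list(rows)


calls = []


def is_even(x):
    calls.append(x)  # record every evaluation of the predicate
    return x % 2 == 0


plan = LazyFrame([1, 2, 3, 4]).filter(is_even).select(lambda x: x * 10)
calls_before_action = len(calls)  # still 0: building the plan ran nothing
result = plan.collect()           # the action triggers execution
```

Because the whole plan is known before anything runs, an optimizer has the chance to reorder or combine steps. This sketch simply runs them in order.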

What does the term 'write amplification' refer to in the context of Delta Lake?

When a small logical change forces far more data to be physically written, for example when updating a few rows causes entire data files to be rewritten.
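
As a rough illustration of write amplification, assume a copy-on-write update in which changing one row rewrites the whole data file that contains it. The file and row sizes below are hypothetical.

```python
def write_amplification(bytes_changed, bytes_rewritten):
    """Ratio of bytes physically written to bytes logically changed."""
    return bytes_rewritten / bytes_changed


file_size_bytes = 128 * 1024 * 1024  # hypothetical 128 MiB data file
row_size_bytes = 200                 # hypothetical size of the updated row

# Updating one row rewrites the entire file under the assumption above.
factor = write_amplification(row_size_bytes, file_size_bytes)
```

Under these assumptions a single-row update writes several hundred thousand times more data than it changes.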

Fill in the blank: The __________ command is used to update records in a Delta table.

UPDATE

What does the 'EXPLAIN EXTENDED' command show?

The parsed, analyzed, and optimized logical plans as well as the physical plan, which makes the optimizer's rewrites visible.

Which configuration can be adjusted to optimize memory usage in Spark?

spark.executor.memory

Fill in the blank: The __________ function allows you to concatenate multiple DataFrames.

union

What is 'Adaptive Query Execution'?

An optimization feature that adjusts query execution plans based on runtime statistics.
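
A sketch of switching on the related settings. The three configuration keys are real Spark 3 settings, but defaults vary by Spark and Databricks runtime version, so treat the values as an assumption. `apply_settings` is a hypothetical helper that works with any object exposing `set(key, value)`.

```python
# Adaptive Query Execution settings (Spark 3.x configuration keys).
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    # Merge small shuffle partitions after a stage finishes.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Split oversized partitions in sort-merge joins.
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def apply_settings(conf, settings):
    """Apply each key/value pair through conf.set(key, value)."""
    for key, value in settings.items():
        conf.set(key, value)
    return conf
```

In a notebook where `spark` is the active session, the call would be `apply_settings(spark.conf, AQE_SETTINGS)`.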

True or False: The use of partitioning can help improve query performance.

True

What does the 'spark.sql.files.maxPartitionBytes' configuration control?

The maximum size of a partition when reading files.

Fill in the blank: The __________ command is used to create a view in Databricks.

CREATE VIEW

What is the significance of 'DataFrame.printSchema()'?

It displays the schema of the DataFrame.

Which optimization technique can help with skewed data during joins?

Salting
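
The effect of salting on partition sizes can be simulated in plain Python. This is a sketch, not Spark code: the helper names and the key distribution are hypothetical, and CRC32 stands in for the engine's hash partitioner. In a real join, the other side must also be replicated once per salt value so that matching rows still meet.

```python
import zlib
from collections import Counter


def partition_of(key, num_partitions):
    """Deterministic hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def salted_key(key, row_index, num_salts):
    # Append a salt so one hot key becomes num_salts distinct keys.
    return f"{key}_{row_index % num_salts}"


NUM_PARTITIONS = 8
NUM_SALTS = 8
keys = ["hot"] * 800 + ["cold"] * 8  # hypothetical skewed join-key column

# Rows per partition without salting: every "hot" row lands together.
plain = Counter(partition_of(k, NUM_PARTITIONS) for k in keys)

# Rows per partition with salting: "hot" rows spread over several partitions.
salted = Counter(
    partition_of(salted_key(k, i, NUM_SALTS), NUM_PARTITIONS)
    for i, k in enumerate(keys)
)
```

Comparing `max(plain.values())` with `max(salted.values())` shows how much smaller the largest partition becomes once the hot key is split.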

True or False: Using the 'persist()' method is similar to using 'cache()' in Databricks.

True

What does the 'spark.sql.execution.arrow.fallback.enabled' configuration control?

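It lets PySpark fall back to the regular, non-Arrow conversion path when an Arrow-based conversion fails. In Spark 3.0 and later the setting is named 'spark.sql.execution.arrow.pyspark.fallback.enabled'.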

Fill in the blank: The __________ command is used to delete records from a Delta table.

DELETE

What is the role of the 'Optimizer' in Databricks?

To improve query performance by selecting the most efficient execution plan.

True or False: Databricks supports integration with various data sources including cloud storage.

True

What does the 'spark.sql.parquet.enableVectorizedReader' configuration do?

It enables vectorized reading of Parquet files for better performance.

Which command is used to drop a view in Databricks?

DROP VIEW

Fill in the blank: The __________ feature allows for time travel on Delta tables.

Versioning

What does the 'spark.sql.shuffle.partitions' configuration affect?

The number of partitions used in shuffle operations.

True or False: Databricks can automatically scale clusters based on workload.

True

What is the purpose of the 'spark.sql.execution.arrow.maxRecordsPerBatch' setting?

To control the batch size for Arrow data transfer.

What is a common practice for optimizing joins in Databricks?

Using broadcast joins for smaller DataFrames.

Query Execution & Optimization Flashcards

(74 cards)