Storage Engines & File Structures Flashcards
(75 cards)
What is the primary storage engine used by Databricks?
Delta Lake
True or False: Delta Lake supports ACID transactions.
True
Fill in the blank: Delta Lake uses __________ to manage metadata.
transaction logs
What file format does Delta Lake use for storing data?
Parquet
What is the main benefit of using Delta Lake over traditional data lakes?
Support for ACID transactions and schema enforcement
Name one feature of Delta Lake that improves data reliability.
Time travel
Which command is used to convert a Parquet table to a Delta table?
CONVERT TO DELTA
True or False: Delta Lake can be used with both batch and streaming data.
True
What is the purpose of the Delta Lake ‘OPTIMIZE’ command?
To compact small files into larger files for better performance.
Fill in the blank: Delta Lake stores data in __________ format.
columnar
What is the role of the Delta Lake transaction log?
To keep track of all changes made to a Delta table.
Which storage format is recommended for high-performance data processing in Databricks?
Parquet
True or False: Databricks supports both cloud storage and on-premises storage.
True
What is the maximum file size recommended for optimal performance in Delta Lake?
1 GB
What feature allows Delta Lake to recover from failures?
Transaction logs
What does the ‘VACUUM’ command do in Delta Lake?
Removes old files that are no longer needed.
Fill in the blank: The Delta Lake architecture is built on top of __________.
Apache Spark
What is the purpose of schema evolution in Delta Lake?
To allow changes to the schema of a table without needing to rewrite the entire dataset.
Which file structure is optimized for read-heavy workloads in Databricks?
Columnar storage
True or False: Delta Lake supports concurrent writes.
True
What is a key advantage of using columnar file formats like Parquet?
Increased compression and improved query performance.
What does the ‘MERGE’ command do in Delta Lake?
It allows for upserts (update or insert) into a Delta table.
Fill in the blank: Delta Lake enables __________ data processing.
unified
What is the purpose of data partitioning in Databricks?
To improve query performance and manageability.