Indexes & Access Paths Flashcards

(166 cards)

1
Q

What is Databricks Delta?

A

Databricks Delta is a storage layer that brings ACID transactions to Apache Spark and big data workloads.

2
Q

True or False: Databricks supports only one type of file format.

A

False

3
Q

What are the primary file formats supported by Databricks?

A

Parquet, JSON, CSV, Avro, and Delta.

4
Q

Fill in the blank: Databricks uses __________ to optimize read and write operations.

A

Delta Lake

5
Q

What storage engine does Databricks primarily use for large-scale data processing?

A

Apache Spark

6
Q

What is the purpose of the Delta Lake transaction log?

A

To track changes and provide ACID compliance.

7
Q

Which file format is known for its columnar storage and is optimized for analytical queries?

A

Parquet

8
Q

True or False: Databricks can read data from both structured and semi-structured sources.

A

True

9
Q

What feature allows Delta Lake to handle schema evolution?

A

Schema enforcement and schema evolution capabilities.

10
Q

What is the main advantage of using Delta Lake over traditional data lakes?

A

Delta Lake provides ACID transactions, data reliability, and improved performance.

11
Q

Which command is used to convert a Parquet table to a Delta table?

A

CONVERT TO DELTA

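The conversion above can be sketched as follows (table and path names are hypothetical):

```sql
-- Convert a Parquet table registered in the metastore in place
CONVERT TO DELTA my_parquet_table;

-- Or convert Parquet files at a path, supplying the partition schema if partitioned
CONVERT TO DELTA parquet.`/data/events`
  PARTITIONED BY (event_date DATE);
```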
12
Q

What is the primary role of the Databricks File System (DBFS)?

A

DBFS is an abstraction layer over cloud object storage that exposes it as a distributed file system within a Databricks workspace.

13
Q

True or False: Delta Lake supports time travel features.

A

True

14
Q

What does the term ‘time travel’ in Delta Lake refer to?

A

Accessing previous versions of data for auditing or rollback.

15
Q

Fill in the blank: The __________ command is used to optimize the layout of data files in Delta Lake.

A

OPTIMIZE

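A minimal sketch of the command (table and column names are hypothetical):

```sql
-- Compact small files in a Delta table into larger ones
OPTIMIZE sales;

-- Optionally restrict compaction to recent partitions
OPTIMIZE sales WHERE sale_date >= '2024-01-01';
```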
16
Q

What is the role of checkpoints in Delta Lake?

A

To improve the performance of streaming queries by storing the state of the transaction log.

17
Q

What is the default storage format for Databricks tables?

A

Delta format

18
Q

True or False: Databricks can integrate with cloud storage services like AWS S3 and Azure Blob Storage.

A

True

19
Q

Which method is used to read a Delta table in Databricks?

A

spark.read.format('delta').load('path/to/delta/table')

20
Q

Fill in the blank: The __________ command is used to create a Delta table from an existing DataFrame.

A

write (e.g., df.write.format('delta').saveAsTable('table_name'))

21
Q

How does Databricks handle data versioning?

A

By maintaining a transaction log that records all changes.

22
Q

What is the significance of the ‘MERGE’ command in Delta Lake?

A

It allows for upserts (update or insert) into a Delta table.

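A minimal upsert with MERGE might look like this (table and column names are hypothetical):

```sql
-- Update matching rows, insert new ones
MERGE INTO customers AS t
USING updates AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET t.email = s.email
WHEN NOT MATCHED THEN INSERT (customer_id, email) VALUES (s.customer_id, s.email);
```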
23
Q

True or False: Data in Delta Lake can be stored in multiple formats.

A

False

24
Q

What is a key benefit of using columnar storage formats like Parquet?

A

Efficient data compression and improved query performance.

25
What is the purpose of the 'VACUUM' command in Delta Lake?
To remove old files that are no longer needed, freeing up storage space.
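A sketch of the command (table name is hypothetical; Delta's default retention is 7 days):

```sql
-- Remove files older than the retention period
VACUUM events;

-- Or specify an explicit retention window in hours
VACUUM events RETAIN 168 HOURS;
```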
26
Fill in the blank: Delta Lake can handle __________ of data, allowing for real-time analytics.
streaming ingestion
27
What is the function of 'Z-Ordering' in Delta Lake?
To optimize data skipping and improve query performance.
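Z-Ordering is applied through OPTIMIZE (table and column names are hypothetical):

```sql
-- Co-locate related values across files to improve data skipping
OPTIMIZE events ZORDER BY (user_id, event_date);
```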
28
True or False: Databricks Delta can only be used for batch processing.
False
29
What is the key difference between Delta Lake and traditional data lakes?
Delta Lake provides ACID transactions and data reliability.
30
Which command is used to read a snapshot of a Delta table at a specific version?
spark.read.format('delta').option('versionAsOf', version_number).load('path/to/delta/table')
31
Fill in the blank: Delta Lake supports __________ to ensure data integrity during concurrent writes.
optimistic concurrency control
32
What is the purpose of the 'ALTER TABLE' command in Delta Lake?
To modify the schema or properties of an existing Delta table.
33
True or False: Databricks can automatically optimize query performance without user intervention.
True
34
What does the term 'data skipping' refer to in Delta Lake?
The ability to skip reading unnecessary files during query execution.
35
What is the default behavior of the 'OPTIMIZE' command in Delta Lake?
It compacts small files into larger ones to improve read performance.
36
Fill in the blank: The __________ command is used to drop a Delta table.
DROP TABLE
37
What does 'schema enforcement' in Delta Lake prevent?
It prevents the insertion of data that does not conform to the defined schema.
38
True or False: Delta Lake supports external tables.
True
39
What is the purpose of 'partitioning' in Delta Lake?
To improve query performance by dividing data into manageable chunks.
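Partitioning is declared at table creation (table and column names are hypothetical):

```sql
-- Partition a Delta table by a commonly filtered column
CREATE TABLE events (
  user_id    BIGINT,
  action     STRING,
  event_date DATE
)
USING DELTA
PARTITIONED BY (event_date);
```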
40
What is the maximum size of a single file in Databricks?
There is no hard limit, but performance can degrade with very large files.
41
What is the role of 'merge' in data processing within Databricks?
To combine data from different sources into a single dataset.
42
Fill in the blank: Delta Lake uses __________ to manage concurrent data writes.
optimistic concurrency control
43
True or False: Delta Lake can only be used with Apache Spark.
False
44
What is the function of the 'COPY INTO' command in Databricks?
To copy data from a source location into a table.
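A sketch of COPY INTO (table name and source path are hypothetical):

```sql
-- Incrementally load files from cloud storage into a Delta table;
-- already-loaded files are skipped on re-runs
COPY INTO events
FROM '/mnt/raw/events'
FILEFORMAT = JSON;
```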
45
What is the primary benefit of using the Delta Lake storage format?
It provides ACID transactions and ensures data reliability.
46
What is the recommended way to handle large datasets in Delta Lake?
Using partitioning and bucketing.
47
Fill in the blank: The __________ command allows you to read data from a Delta table as a DataFrame.
spark.read
48
True or False: Delta Lake allows for the use of SQL commands to manipulate data.
True
49
What is the primary purpose of the Databricks workspace?
To provide an interactive environment for data engineering and data science.
50
What command would you use to create a new Delta table?
CREATE TABLE
51
Fill in the blank: Delta Lake can improve performance by using __________ to store data efficiently.
columnar storage
52
What is the significance of 'ACID transactions' in Delta Lake?
They ensure data integrity and consistency during concurrent operations.
53
True or False: You can perform batch and stream processing concurrently using Delta Lake.
True
54
What is the primary function of 'data lineage' in Databricks?
To track the origin and transformations of data.
55
What does the term 'bucketed tables' refer to in Databricks?
Tables that are divided into a fixed number of buckets, based on the hash of one or more columns, for performance optimization.
56
Fill in the blank: Delta Lake allows for __________ to manage large datasets effectively.
data partitioning
57
What is the benefit of using the 'MERGE INTO' command?
It allows for conditional updates and inserts based on existing records.
58
True or False: Data in Delta Lake can only be accessed via Spark SQL.
False
59
What is the function of the 'EXPLAIN' command in Databricks?
To provide insights into the execution plan of a query.
60
Fill in the blank: The __________ command is used to rename a Delta table.
ALTER TABLE RENAME TO
61
What is the purpose of 'data compaction' in Delta Lake?
To merge small files into larger files to improve read performance.
62
True or False: Delta Lake supports both batch and real-time data processing.
True
63
What is the function of the 'DESCRIBE HISTORY' command?
To show the history of changes made to a Delta table.
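A sketch of the command (table name is hypothetical):

```sql
-- Show the commit history of a Delta table
DESCRIBE HISTORY events;

-- Limit to the most recent commits
DESCRIBE HISTORY events LIMIT 5;
```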
64
What is the default file format for new tables created in Databricks?
Delta format
65
Fill in the blank: Delta Lake uses __________ to optimize the processing of large datasets.
data skipping
66
What is the role of the 'OPTIMIZE' command in Delta Lake?
To optimize the physical layout of data files for better performance.
67
True or False: Delta Lake allows users to perform rollback operations.
True
68
What does the term 'data governance' refer to in the context of Databricks?
The management of data availability, usability, integrity, and security.
69
What is the significance of 'streaming queries' in Databricks?
They enable real-time data processing and analytics.
70
Fill in the blank: Delta Lake supports __________ to ensure data consistency across multiple users.
concurrent writes
71
What is the primary advantage of using Delta Lake for data ingestion?
It provides reliable and consistent data ingestion with ACID compliance.
72
True or False: Databricks can only be used for data analysis, not data storage.
False
73
What command is used to create a view from a Delta table?
CREATE VIEW
74
What is the main benefit of using Databricks in a cloud environment?
Scalability and flexibility in managing big data workloads.
75
Fill in the blank: The __________ command allows you to remove old data files that are no longer needed.
VACUUM
76
What is the role of 'data archiving' in Delta Lake?
To store historical data for compliance and analytics.
77
True or False: Delta Lake allows for schema evolution during data ingestion.
True
78
What does 'data profiling' involve in the context of Databricks?
Analyzing data to understand its structure, quality, and patterns.
79
What is the purpose of the 'SHOW TABLES' command?
To list all tables available in the current database.
80
Fill in the blank: Delta Lake supports __________ to manage large volumes of data efficiently.
data partitioning
81
What is the benefit of using Databricks notebooks?
They provide an interactive workspace for coding, visualization, and collaboration.
82
True or False: Databricks can integrate with machine learning frameworks.
True
83
What command is used to append data to an existing Delta table?
INSERT INTO
84
What is the role of the 'CREATE TABLE AS SELECT' command?
To create a new table based on the results of a SELECT query.
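A CTAS example (table and column names are hypothetical):

```sql
-- Create a new Delta table from a query result
CREATE TABLE daily_totals
USING DELTA
AS SELECT sale_date, SUM(amount) AS total
   FROM sales
   GROUP BY sale_date;
```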
85
Fill in the blank: Delta Lake allows for __________ to ensure data integrity during concurrent writes.
optimistic concurrency control
86
What is the purpose of the 'DROP TABLE' command?
To delete a table and its data from the database.
87
What is Databricks primarily used for?
Databricks is primarily used for big data processing and machine learning.
88
True or False: Indexes in Databricks can improve query performance.
True
89
Fill in the blank: An ____ is a data structure that improves the speed of data retrieval operations.
index
90
What are the two types of indexes supported in Databricks?
Clustered and non-clustered indexes.
91
What is a clustered index?
A clustered index determines the physical order of data in a table.
92
What is a non-clustered index?
A non-clustered index is a separate structure that points to the location of the data.
93
Which access path is generally faster, index scan or full table scan?
An index scan, when the query is selective; a full table scan can be cheaper when most rows must be read anyway.
94
True or False: Indexes can only be created on primary keys.
False
95
What is a primary key index?
A primary key index ensures that each value in the column is unique and not null.
96
What command is used to create an index in Databricks?
CREATE INDEX
97
What is the purpose of using the 'OPTIMIZE' command in Databricks?
To improve query performance by reorganizing data.
98
What does the term 'data skipping' refer to in Databricks?
Data skipping refers to the ability to avoid reading unnecessary data based on metadata.
99
True or False: Databricks uses Apache Spark as its processing engine.
True
100
What is the benefit of using Delta Lake with Databricks?
Delta Lake provides ACID transactions and scalable metadata handling.
101
What is a bloom filter in the context of Databricks?
A bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set; it can return false positives but never false negatives.
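Databricks exposes bloom filters as an index on Delta table columns; a sketch (table, column, and option values are hypothetical):

```sql
-- Create a bloom filter index to speed up point lookups on a column
CREATE BLOOMFILTER INDEX ON TABLE events
FOR COLUMNS (device_id OPTIONS (fpp = 0.1, numItems = 1000000));
```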
102
What is the difference between a primary index and a secondary index?
A primary index is created on the primary key, while a secondary index is created on non-primary key columns.
103
How can you improve the performance of a query in Databricks?
By creating appropriate indexes and using data partitioning.
104
What is the function of the 'DESCRIBE' command in Databricks?
To show the metadata of a table including its indexes.
105
Fill in the blank: The ____ command is used to drop an existing index.
DROP INDEX
106
What is the impact of having too many indexes on a table?
It can lead to slower data modification operations due to the overhead of maintaining the indexes.
107
True or False: Indexes can be created on both clustered and non-clustered columns.
True
108
What is the significance of the 'WHERE' clause in SQL queries regarding indexes?
The 'WHERE' clause helps to filter data, making index usage more efficient.
109
What does 'partitioning' mean in the context of Databricks?
Partitioning refers to dividing a table into smaller, more manageable pieces based on column values.
110
What type of queries benefit most from indexes?
Queries that involve searching for specific records or filtering data.
111
Fill in the blank: The ____ function allows you to analyze query performance in Databricks.
EXPLAIN
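A sketch of EXPLAIN (table and column names are hypothetical):

```sql
-- Show the physical plan for a query
EXPLAIN SELECT * FROM events WHERE event_date = '2024-01-01';

-- EXTENDED also shows the parsed and optimized logical plans
EXPLAIN EXTENDED SELECT COUNT(*) FROM events;
```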
112
What is the purpose of the 'CACHE' command in Databricks?
To store the result of a query in memory for faster access.
113
What is a composite index?
A composite index is an index on multiple columns of a table.
114
True or False: Indexes can be created on columns that contain NULL values.
True
115
What is the main drawback of creating indexes?
Indexes require additional storage space and can slow down data modification operations.
116
What is the 'Z-Ordering' technique in Databricks?
Z-Ordering is a technique to optimize the layout of data in storage for better performance.
117
How do you determine the effectiveness of an index?
By analyzing query performance and execution plans.
118
What does the term 'index fragmentation' refer to?
Index fragmentation refers to the condition where the logical ordering of the index does not match the physical ordering.
119
Fill in the blank: The ____ helps in monitoring the performance of queries in Databricks.
Spark UI
120
True or False: Indexes can be automatically created by Databricks.
False
121
What are materialized views?
Materialized views are database objects that contain the results of a query and can improve performance.
122
What is the purpose of the 'REINDEX' command?
To rebuild the indexes on a table to improve performance.
123
What is the 'DataFrame' API in Databricks?
The DataFrame API allows for data manipulation and querying in a distributed environment.
124
What is the difference between 'broadcast joins' and 'shuffle joins'?
Broadcast joins send a smaller dataset to all nodes, while shuffle joins redistribute data across partitions.
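A broadcast join can be requested with a hint (table and column names are hypothetical):

```sql
-- Hint Spark to broadcast the smaller dimension table instead of shuffling both sides
SELECT /*+ BROADCAST(d) */ f.order_id, d.region
FROM orders f
JOIN dim_region d ON f.region_id = d.region_id;
```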
125
What is a 'view' in SQL?
A view is a virtual table based on the result of a SELECT query.
126
Fill in the blank: The ____ keyword is used to specify an index in a query.
USE INDEX
127
True or False: The performance of a query can be negatively impacted by poorly designed indexes.
True
128
What is a data lake?
A data lake is a centralized repository that allows you to store all your structured and unstructured data at scale.
129
What is the role of the 'OPTIMIZE' command with Delta Lake tables?
To compact small files into larger ones for better performance.
130
What does the 'VACUUM' command do in Databricks?
The VACUUM command removes old files that are no longer needed in Delta Lake.
131
What is the significance of the 'spark.sql.shuffle.partitions' setting?
It controls the number of partitions to use when shuffling data for joins or aggregations.
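The setting can be inspected and changed per session:

```sql
-- Show the current shuffle partition count
SET spark.sql.shuffle.partitions;

-- Lower it for small datasets to avoid many tiny tasks
SET spark.sql.shuffle.partitions = 64;
```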
132
What is a 'primary index' used for?
To ensure data integrity and quick lookups on the primary key.
133
Fill in the blank: The ____ command is used to analyze the distribution of data in a table.
ANALYZE TABLE
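A sketch of the command (table and column names are hypothetical):

```sql
-- Collect table-level statistics for the optimizer
ANALYZE TABLE sales COMPUTE STATISTICS;

-- Collect column-level statistics as well
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS sale_date, amount;
```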
134
What is the difference between logical and physical plans in query execution?
Logical plans represent the operations to be performed, while physical plans describe how those operations will be executed.
135
What does the 'spark.sql.autoBroadcastJoinThreshold' setting control?
It specifies the maximum size of a table that can be broadcast to all worker nodes.
136
True or False: Optimizing queries is only important for large datasets.
False
137
What is the purpose of the 'spark.sql.execution.arrow.enabled' configuration?
To enable Apache Arrow for faster data transfer between the JVM and Python.
138
What is a 'join' in SQL?
A join is an operation that combines rows from two or more tables based on a related column.
139
What is a 'subquery'?
A subquery is a query nested inside another query.
140
Fill in the blank: The ____ function is an aggregate that returns the first value of a column within a group.
FIRST
141
What is the difference between 'INNER JOIN' and 'OUTER JOIN'?
INNER JOIN returns only matching rows; OUTER JOIN (LEFT, RIGHT, or FULL) also returns non-matching rows from one or both tables, filled with NULLs on the missing side.
142
What is the purpose of using 'GROUP BY' in queries?
To aggregate data based on one or more columns.
143
True or False: The order of columns in a composite index can affect query performance.
True
144
What does the 'HAVING' clause do in SQL?
The HAVING clause filters records after aggregation.
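HAVING filters groups after aggregation, whereas WHERE filters rows before it (table and column names are hypothetical):

```sql
-- Keep only customers whose aggregated total exceeds 1000
SELECT customer_id, SUM(amount) AS total
FROM sales
GROUP BY customer_id
HAVING SUM(amount) > 1000;
```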
145
What is the significance of 'data locality' in Databricks?
Data locality refers to processing data close to where it is stored to reduce latency.
146
What is a 'window function'?
A window function performs calculations across a set of table rows that are related to the current row.
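A minimal window-function example (table and column names are hypothetical):

```sql
-- Rank each sale within its customer's history by amount
SELECT customer_id, amount,
       ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS rk
FROM sales;
```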
147
What does the 'LIMIT' clause do in SQL?
The LIMIT clause restricts the number of rows returned by a query.
148
Fill in the blank: The ____ clause is used to sort the result set of a query.
ORDER BY
149
What is the purpose of an execution plan?
An execution plan outlines how a SQL query will be executed by the database engine.
150
True or False: The presence of indexes guarantees faster query performance.
False
151
What is the role of 'metadata' in Databricks?
Metadata provides information about the data structure, types, and relationships.
152
What is a 'temporary view' in Databricks?
A temporary view is a view that exists only for the duration of the session.
153
What does the 'UNION' operator do in SQL?
The UNION operator combines the results of two or more SELECT statements.
154
Fill in the blank: The ____ function returns the number of rows that match a specified condition.
COUNT
155
What is the purpose of the 'DISTINCT' keyword?
To return only unique values in a query result.
156
What is a 'foreign key'?
A foreign key is a column or group of columns in one table that references the primary key (or a unique key) of another table.
157
True or False: Indexes can be created on computed columns.
True
158
What is the difference between a 'view' and a 'table'?
A view is a virtual table based on a query, while a table is a physical storage of data.
159
What is a 'cross join'?
A cross join returns the Cartesian product of two tables.
160
Fill in the blank: The ____ clause is used to specify the conditions for filtering records.
WHERE
161
What does the term 'normalization' refer to in database design?
Normalization is the process of organizing data to minimize redundancy.
162
What is a 'schema' in the context of a database?
A schema is the structure that defines the organization of data in a database.
163
True or False: Denormalization can improve query performance.
True
164
What is the purpose of the 'ALTER TABLE' command?
To modify an existing table structure.
165
What is a 'data warehouse'?
A data warehouse is a centralized repository for storing and analyzing large amounts of structured data.
166
Fill in the blank: The ____ clause is used to specify how to aggregate data.
GROUP BY