Indexes & Access Paths Flashcards

(166 cards)

1
Q

What is Databricks Delta?

A

Databricks Delta is a storage layer that brings ACID transactions to Apache Spark and big data workloads.

2
Q

True or False: Databricks supports only one type of file format.

A

False

3
Q

What are the primary file formats supported by Databricks?

A

Parquet, JSON, CSV, Avro, and Delta.

4
Q

Fill in the blank: Databricks uses __________ to optimize read and write operations.

A

Delta Lake

5
Q

What storage engine does Databricks primarily use for large-scale data processing?

A

Apache Spark

6
Q

What is the purpose of the Delta Lake transaction log?

A

To track changes and provide ACID compliance.

7
Q

Which file format is known for its columnar storage and is optimized for analytical queries?

A

Parquet

8
Q

True or False: Databricks can read data from both structured and semi-structured sources.

A

True

9
Q

What feature allows Delta Lake to handle schema evolution?

A

Schema enforcement and schema evolution capabilities.

10
Q

What is the main advantage of using Delta Lake over traditional data lakes?

A

Delta Lake provides ACID transactions, data reliability, and improved performance.

11
Q

Which command is used to convert a Parquet table to a Delta table?

A

CONVERT TO DELTA

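The conversion above can be sketched as follows (table and path names are hypothetical):

```sql
-- Convert a Parquet table registered in the metastore in place
CONVERT TO DELTA my_parquet_table;

-- Or convert Parquet files at a path, supplying the partition schema if partitioned
CONVERT TO DELTA parquet.`/data/events`
  PARTITIONED BY (event_date DATE);
```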
12
Q

What is the primary role of the Databricks File System (DBFS)?

A

DBFS is an abstraction layer over cloud object storage that exposes it as a distributed file system within a Databricks workspace.

13
Q

True or False: Delta Lake supports time travel features.

A

True

14
Q

What does the term ‘time travel’ in Delta Lake refer to?

A

Accessing previous versions of data for auditing or rollback.

15
Q

Fill in the blank: The __________ command is used to optimize the layout of data files in Delta Lake.

A

OPTIMIZE

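A minimal sketch of the command (table and column names are hypothetical):

```sql
-- Compact small files in a Delta table into larger ones
OPTIMIZE sales;

-- Optionally restrict compaction to recent partitions
OPTIMIZE sales WHERE sale_date >= '2024-01-01';
```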
16
Q

What is the role of checkpoints in Delta Lake?

A

To improve the performance of streaming queries by storing the state of the transaction log.

17
Q

What is the default storage format for Databricks tables?

A

Delta format

18
Q

True or False: Databricks can integrate with cloud storage services like AWS S3 and Azure Blob Storage.

A

True

19
Q

Which method is used to read a Delta table in Databricks?

A

spark.read.format('delta').load('path/to/delta/table')

20
Q

Fill in the blank: The __________ command is used to create a Delta table from an existing DataFrame.

A

write (e.g., df.write.format('delta').saveAsTable('table_name'))

21
Q

How does Databricks handle data versioning?

A

By maintaining a transaction log that records all changes.

22
Q

What is the significance of the ‘MERGE’ command in Delta Lake?

A

It allows for upserts (update or insert) into a Delta table.

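A minimal upsert with MERGE might look like this (table and column names are hypothetical):

```sql
-- Update matching rows, insert new ones
MERGE INTO customers AS t
USING updates AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET t.email = s.email
WHEN NOT MATCHED THEN INSERT (customer_id, email) VALUES (s.customer_id, s.email);
```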
23
Q

True or False: Data in Delta Lake can be stored in multiple formats.

A

False

24
Q

What is a key benefit of using columnar storage formats like Parquet?

A

Efficient data compression and improved query performance.

25
What is the purpose of the 'VACUUM' command in Delta Lake?
To remove old files that are no longer needed, freeing up storage space.
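A sketch of the command (table name is hypothetical; Delta's default retention is 7 days):

```sql
-- Remove files older than the retention period
VACUUM events;

-- Or specify an explicit retention window in hours
VACUUM events RETAIN 168 HOURS;
```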
26
Fill in the blank: Delta Lake can handle __________ of data, allowing for real-time analytics.
streaming ingestion
27
What is the function of 'Z-Ordering' in Delta Lake?
To optimize data skipping and improve query performance.
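Z-Ordering is applied through OPTIMIZE (table and column names are hypothetical):

```sql
-- Co-locate related values across files to improve data skipping
OPTIMIZE events ZORDER BY (user_id, event_date);
```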
28
True or False: Databricks Delta can only be used for batch processing.
False
29
What is the key difference between Delta Lake and traditional data lakes?
Delta Lake provides ACID transactions and data reliability.
30
Which command is used to read a snapshot of a Delta table at a specific version?
spark.read.format('delta').option('versionAsOf', version_number).load('path/to/delta/table')
31
Fill in the blank: Delta Lake supports __________ to ensure data integrity during concurrent writes.
optimistic concurrency control
32
What is the purpose of the 'ALTER TABLE' command in Delta Lake?
To modify the schema or properties of an existing Delta table.
33
True or False: Databricks can automatically optimize query performance without user intervention.
True
34
What does the term 'data skipping' refer to in Delta Lake?
The ability to skip reading unnecessary files during query execution.
35
What is the default behavior of the 'OPTIMIZE' command in Delta Lake?
It compacts small files into larger ones to improve read performance.
36
Fill in the blank: The __________ command is used to drop a Delta table.
DROP TABLE
37
What does 'schema enforcement' in Delta Lake prevent?
It prevents the insertion of data that does not conform to the defined schema.
38
True or False: Delta Lake supports external tables.
True
39
What is the purpose of 'partitioning' in Delta Lake?
To improve query performance by dividing data into manageable chunks.
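Partitioning is declared at table creation (table and column names are hypothetical):

```sql
-- Partition a Delta table by a commonly filtered column
CREATE TABLE events (
  user_id    BIGINT,
  action     STRING,
  event_date DATE
)
USING DELTA
PARTITIONED BY (event_date);
```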
40
What is the maximum size of a single file in Databricks?
There is no hard limit, but performance can degrade with very large files.
41
What is the role of 'merge' in data processing within Databricks?
To combine data from different sources into a single dataset.
42
Fill in the blank: Delta Lake uses __________ to manage concurrent data writes.
optimistic concurrency control
43
True or False: Delta Lake can only be used with Apache Spark.
False
44
What is the function of the 'COPY INTO' command in Databricks?
To copy data from a source location into a table.
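A sketch of COPY INTO (table name and source path are hypothetical):

```sql
-- Incrementally load files from cloud storage into a Delta table;
-- already-loaded files are skipped on re-runs
COPY INTO events
FROM '/mnt/raw/events'
FILEFORMAT = JSON;
```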
45
What is the primary benefit of using the Delta Lake storage format?
It provides ACID transactions and ensures data reliability.
46
What is the recommended way to handle large datasets in Delta Lake?
Using partitioning and bucketing.
47
Fill in the blank: The __________ command allows you to read data from a Delta table as a DataFrame.
spark.read
48
True or False: Delta Lake allows for the use of SQL commands to manipulate data.
True
49
What is the primary purpose of the Databricks workspace?
To provide an interactive environment for data engineering and data science.
50
What command would you use to create a new Delta table?
CREATE TABLE
51
Fill in the blank: Delta Lake can improve performance by using __________ to store data efficiently.
columnar storage
52
What is the significance of 'ACID transactions' in Delta Lake?
They ensure data integrity and consistency during concurrent operations.
53
True or False: You can perform batch and stream processing concurrently using Delta Lake.
True
54
What is the primary function of 'data lineage' in Databricks?
To track the origin and transformations of data.
55
What does the term 'bucketed tables' refer to in Databricks?
Tables that are divided into a fixed number of buckets, based on the hash of one or more columns, for performance optimization.
56
Fill in the blank: Delta Lake allows for __________ to manage large datasets effectively.
data partitioning
57
What is the benefit of using the 'MERGE INTO' command?
It allows for conditional updates and inserts based on existing records.
58
True or False: Data in Delta Lake can only be accessed via Spark SQL.
False
59
What is the function of the 'EXPLAIN' command in Databricks?
To provide insights into the execution plan of a query.
60
Fill in the blank: The __________ command is used to rename a Delta table.
ALTER TABLE RENAME TO
61
What is the purpose of 'data compaction' in Delta Lake?
To merge small files into larger files to improve read performance.
62
True or False: Delta Lake supports both batch and real-time data processing.
True
63
What is the function of the 'DESCRIBE HISTORY' command?
To show the history of changes made to a Delta table.
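A sketch of the command (table name is hypothetical):

```sql
-- Show the commit history of a Delta table
DESCRIBE HISTORY events;

-- Limit to the most recent commits
DESCRIBE HISTORY events LIMIT 5;
```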
64
What is the default file format for new tables created in Databricks?
Delta format
65
Fill in the blank: Delta Lake uses __________ to optimize the processing of large datasets.
data skipping
66
What is the role of the 'OPTIMIZE' command in Delta Lake?
To optimize the physical layout of data files for better performance.
67
True or False: Delta Lake allows users to perform rollback operations.
True
68
What does the term 'data governance' refer to in the context of Databricks?
The management of data availability, usability, integrity, and security.
69
What is the significance of 'streaming queries' in Databricks?
They enable real-time data processing and analytics.
70
Fill in the blank: Delta Lake supports __________ to ensure data consistency across multiple users.
concurrent writes
71
What is the primary advantage of using Delta Lake for data ingestion?
It provides reliable and consistent data ingestion with ACID compliance.
72
True or False: Databricks can only be used for data analysis, not data storage.
False
73
What command is used to create a view from a Delta table?
CREATE VIEW
74
What is the main benefit of using Databricks in a cloud environment?
Scalability and flexibility in managing big data workloads.
75
Fill in the blank: The __________ command allows you to remove old data files that are no longer needed.
VACUUM
76
What is the role of 'data archiving' in Delta Lake?
To store historical data for compliance and analytics.
77
True or False: Delta Lake allows for schema evolution during data ingestion.
True
78
What does 'data profiling' involve in the context of Databricks?
Analyzing data to understand its structure, quality, and patterns.
79
What is the purpose of the 'SHOW TABLES' command?
To list all tables available in the current database.
80
Fill in the blank: Delta Lake supports __________ to manage large volumes of data efficiently.
data partitioning
81
What is the benefit of using Databricks notebooks?
They provide an interactive workspace for coding, visualization, and collaboration.
82
True or False: Databricks can integrate with machine learning frameworks.
True
83
What command is used to append data to an existing Delta table?
INSERT INTO
84
What is the role of the 'CREATE TABLE AS SELECT' command?
To create a new table based on the results of a SELECT query.
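A CTAS example (table and column names are hypothetical):

```sql
-- Create a new Delta table from a query result
CREATE TABLE daily_totals
USING DELTA
AS SELECT sale_date, SUM(amount) AS total
   FROM sales
   GROUP BY sale_date;
```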
85
Fill in the blank: Delta Lake allows for __________ to ensure data integrity during concurrent writes.
optimistic concurrency control
86
What is the purpose of the 'DROP TABLE' command?
To delete a table and its data from the database.
87
What is Databricks primarily used for?
Databricks is primarily used for big data processing and machine learning.
88
True or False: Indexes in Databricks can improve query performance.
True
89
Fill in the blank: An ____ is a data structure that improves the speed of data retrieval operations.
index
90
What are the two types of indexes supported in Databricks?
Clustered and non-clustered indexes.
91
What is a clustered index?
A clustered index determines the physical order of data in a table.
92
What is a non-clustered index?
A non-clustered index is a separate structure that points to the location of the data.
93
Which access path is generally faster, index scan or full table scan?
An index scan, when the query is selective; a full table scan can be cheaper when most rows must be read anyway.
94
True or False: Indexes can only be created on primary keys.
False
95
What is a primary key index?
A primary key index ensures that each value in the column is unique and not null.
96
What command is used to create an index in Databricks?
CREATE INDEX
97
What is the purpose of using the 'OPTIMIZE' command in Databricks?
To improve query performance by reorganizing data.
98
What does the term 'data skipping' refer to in Databricks?
Data skipping refers to the ability to avoid reading unnecessary data based on metadata.
99
True or False: Databricks uses Apache Spark as its processing engine.
True
100
What is the benefit of using Delta Lake with Databricks?
Delta Lake provides ACID transactions and scalable metadata handling.
101
What is a bloom filter in the context of Databricks?
A bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set; it can return false positives but never false negatives.
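Databricks exposes bloom filters as an index on Delta table columns; a sketch (table, column, and option values are hypothetical):

```sql
-- Create a bloom filter index to speed up point lookups on a column
CREATE BLOOMFILTER INDEX ON TABLE events
FOR COLUMNS (device_id OPTIONS (fpp = 0.1, numItems = 1000000));
```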
102
What is the difference between a primary index and a secondary index?
A primary index is created on the primary key, while a secondary index is created on non-primary key columns.
103
How can you improve the performance of a query in Databricks?
By creating appropriate indexes and using data partitioning.
104
What is the function of the 'DESCRIBE' command in Databricks?
To show the metadata of a table including its indexes.
105
Fill in the blank: The ____ command is used to drop an existing index.
DROP INDEX
106
What is the impact of having too many indexes on a table?
It can lead to slower data modification operations due to the overhead of maintaining the indexes.
107
True or False: Indexes can be created on both clustered and non-clustered columns.
True
108
What is the significance of the 'WHERE' clause in SQL queries regarding indexes?
The 'WHERE' clause helps to filter data, making index usage more efficient.
109
What does 'partitioning' mean in the context of Databricks?
Partitioning refers to dividing a table into smaller, more manageable pieces based on column values.
110
What type of queries benefit most from indexes?
Queries that involve searching for specific records or filtering data.
111
Fill in the blank: The ____ function allows you to analyze query performance in Databricks.
EXPLAIN
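A sketch of EXPLAIN (table and column names are hypothetical):

```sql
-- Show the physical plan for a query
EXPLAIN SELECT * FROM events WHERE event_date = '2024-01-01';

-- EXTENDED also shows the parsed and optimized logical plans
EXPLAIN EXTENDED SELECT COUNT(*) FROM events;
```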
112
What is the purpose of the 'CACHE' command in Databricks?
To store the result of a query in memory for faster access.
113
What is a composite index?
A composite index is an index on multiple columns of a table.
114
True or False: Indexes can be created on columns that contain NULL values.
True
115
What is the main drawback of creating indexes?
Indexes require additional storage space and can slow down data modification operations.
116
What is the 'Z-Ordering' technique in Databricks?
Z-Ordering is a technique to optimize the layout of data in storage for better performance.
117
How do you determine the effectiveness of an index?
By analyzing query performance and execution plans.
118
What does the term 'index fragmentation' refer to?
Index fragmentation refers to the condition where the logical ordering of the index does not match the physical ordering.
119
Fill in the blank: The ____ helps in monitoring the performance of queries in Databricks.
Spark UI
120
True or False: Indexes can be automatically created by Databricks.
False
121
What are materialized views?
Materialized views are database objects that contain the results of a query and can improve performance.
122
What is the purpose of the 'REINDEX' command?
To rebuild the indexes on a table to improve performance.
123
What is the 'DataFrame' API in Databricks?
The DataFrame API allows for data manipulation and querying in a distributed environment.
124
What is the difference between 'broadcast joins' and 'shuffle joins'?
Broadcast joins send a smaller dataset to all nodes, while shuffle joins redistribute data across partitions.
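A broadcast join can be requested with a hint (table and column names are hypothetical):

```sql
-- Hint Spark to broadcast the smaller dimension table instead of shuffling both sides
SELECT /*+ BROADCAST(d) */ f.order_id, d.region
FROM orders f
JOIN dim_region d ON f.region_id = d.region_id;
```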
125
What is a 'view' in SQL?
A view is a virtual table based on the result of a SELECT query.
126
Fill in the blank: The ____ keyword is used to specify an index in a query.
USE INDEX
127
True or False: The performance of a query can be negatively impacted by poorly designed indexes.
True
128
What is a data lake?
A data lake is a centralized repository that allows you to store all your structured and unstructured data at scale.
129
What is the role of the 'OPTIMIZE' command with Delta Lake tables?
To compact small files into larger ones for better performance.
130
What does the 'VACUUM' command do in Databricks?
The VACUUM command removes old files that are no longer needed in Delta Lake.
131
What is the significance of the 'spark.sql.shuffle.partitions' setting?
It controls the number of partitions to use when shuffling data for joins or aggregations.
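The setting can be inspected and changed per session:

```sql
-- Show the current shuffle partition count
SET spark.sql.shuffle.partitions;

-- Lower it for small datasets to avoid many tiny tasks
SET spark.sql.shuffle.partitions = 64;
```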
132
What is a 'primary index' used for?
To ensure data integrity and quick lookups on the primary key.
133
Fill in the blank: The ____ command is used to analyze the distribution of data in a table.
ANALYZE TABLE
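A sketch of the command (table and column names are hypothetical):

```sql
-- Collect table-level statistics for the optimizer
ANALYZE TABLE sales COMPUTE STATISTICS;

-- Collect column-level statistics as well
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS sale_date, amount;
```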
134
What is the difference between logical and physical plans in query execution?
Logical plans represent the operations to be performed, while physical plans describe how those operations will be executed.
135
What does the 'spark.sql.autoBroadcastJoinThreshold' setting control?
It specifies the maximum size of a table that can be broadcast to all worker nodes.
136
True or False: Optimizing queries is only important for large datasets.
False
137
What is the purpose of the 'spark.sql.execution.arrow.enabled' configuration?
To enable Apache Arrow for faster data transfer between the JVM and Python.
138
What is a 'join' in SQL?
A join is an operation that combines rows from two or more tables based on a related column.
139
What is a 'subquery'?
A subquery is a query nested inside another query.
140
Fill in the blank: The ____ function is an aggregate that returns the first value of a column within a group.
FIRST
141
What is the difference between 'INNER JOIN' and 'OUTER JOIN'?
INNER JOIN returns only matching rows; OUTER JOIN (LEFT, RIGHT, or FULL) also returns non-matching rows from one or both tables, filled with NULLs on the missing side.
142
What is the purpose of using 'GROUP BY' in queries?
To aggregate data based on one or more columns.
143
True or False: The order of columns in a composite index can affect query performance.
True
144
What does the 'HAVING' clause do in SQL?
The HAVING clause filters records after aggregation.
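HAVING filters groups after aggregation, whereas WHERE filters rows before it (table and column names are hypothetical):

```sql
-- Keep only customers whose aggregated total exceeds 1000
SELECT customer_id, SUM(amount) AS total
FROM sales
GROUP BY customer_id
HAVING SUM(amount) > 1000;
```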
145
What is the significance of 'data locality' in Databricks?
Data locality refers to processing data close to where it is stored to reduce latency.
146
What is a 'window function'?
A window function performs calculations across a set of table rows that are related to the current row.
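A minimal window-function example (table and column names are hypothetical):

```sql
-- Rank each sale within its customer's history by amount
SELECT customer_id, amount,
       ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS rk
FROM sales;
```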
147
What does the 'LIMIT' clause do in SQL?
The LIMIT clause restricts the number of rows returned by a query.
148
Fill in the blank: The ____ clause is used to sort the result set of a query.
ORDER BY
149
What is the purpose of an execution plan?
An execution plan outlines how a SQL query will be executed by the database engine.
150
True or False: The presence of indexes guarantees faster query performance.
False
151
What is the role of 'metadata' in Databricks?
Metadata provides information about the data structure, types, and relationships.
152
What is a 'temporary view' in Databricks?
A temporary view is a view that exists only for the duration of the session.
153
What does the 'UNION' operator do in SQL?
The UNION operator combines the results of two or more SELECT statements.
154
Fill in the blank: The ____ function returns the number of rows that match a specified condition.
COUNT
155
What is the purpose of the 'DISTINCT' keyword?
To return only unique values in a query result.
156
What is a 'foreign key'?
A foreign key is a column or group of columns in one table that references the primary key (or a unique key) of another table.
157
True or False: Indexes can be created on computed columns.
True
158
What is the difference between a 'view' and a 'table'?
A view is a virtual table based on a query, while a table is a physical storage of data.
159
What is a 'cross join'?
A cross join returns the Cartesian product of two tables.
160
Fill in the blank: The ____ clause is used to specify the conditions for filtering records.
WHERE
161
What does the term 'normalization' refer to in database design?
Normalization is the process of organizing data to minimize redundancy.
162
What is a 'schema' in the context of a database?
A schema is the structure that defines the organization of data in a database.
163
True or False: Denormalization can improve query performance.
True
164
What is the purpose of the 'ALTER TABLE' command?
To modify an existing table structure.
165
What is a 'data warehouse'?
A data warehouse is a centralized repository for storing and analyzing large amounts of structured data.
166
Fill in the blank: The ____ clause is used to specify how to aggregate data.
GROUP BY