Foundations of Database Systems Flashcards

(46 cards)

1
Q

What is Databricks?

A

Databricks is a unified analytics platform that provides a cloud-based environment for data engineering, data science, and machine learning.

2
Q

What is Delta Lake in Databricks?

A

Delta Lake is an open-source storage layer in Databricks that brings ACID transaction support to Apache Spark and big data workloads. It enables reliable data lakes by supporting features like schema enforcement, time travel, and scalable metadata handling. Delta Lake ensures consistency and performance across streaming and batch jobs, making it ideal for modern data engineering workflows.

3
Q

Which programming languages are natively supported by Databricks?

A

Python, R, Scala, and SQL.

4
Q

What is the primary purpose of Databricks Runtime?

A

Databricks Runtime is the optimized execution environment that Databricks provides on top of Apache Spark. It bundles performance improvements, integrated libraries and engines (such as Delta Lake, MLflow, and Photon), and optimized connectors.

5
Q

What is the role of a cluster in Databricks?

A

A cluster in Databricks is a set of compute resources (a driver node and worker nodes) on which notebooks and jobs run.

6
Q

Which storage options can be used with Databricks?

A

Azure Blob Storage, AWS S3, and ADLS Gen2.

7
Q

What is the purpose of the Databricks Jobs feature?

A

The Databricks Jobs feature is used to automate the execution of notebooks, Python scripts, JARs, or workflows on a scheduled or triggered basis. It enables teams to build and manage production-grade data pipelines and ML workflows directly within the platform. Jobs can be monitored, retried on failure, and configured with dependencies to support complex workflows.

8
Q

What is Databricks SQL, and how is it used?

A

Databricks SQL is a workspace environment for running SQL queries on data stored in the Lakehouse. It allows analysts and engineers to explore data, create dashboards, and build visualizations using familiar SQL syntax. It supports Delta Lake tables, integrates with BI tools, and provides a fully managed, scalable SQL engine optimized for performance.

9
Q

What is the significance of the spark.sql.shuffle.partitions setting in Databricks?

A

The spark.sql.shuffle.partitions setting controls the number of partitions used when Spark performs shuffle operations, such as joins and aggregations. By default, it’s set to 200, but tuning this value can significantly impact performance. Lowering it can reduce overhead for small datasets, while increasing it may help parallelize work better for large datasets.
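
A minimal sketch of inspecting and tuning this setting in a notebook; the value 64 below is only an illustrative choice for a small dataset:

```
# Check the current shuffle partition count (200 by default)
spark.conf.get("spark.sql.shuffle.partitions")

# Lower it to reduce scheduling overhead for a small dataset
spark.conf.set("spark.sql.shuffle.partitions", 64)
```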

10
Q

What does the display() command do in a Databricks notebook?

A

The display() command in a Databricks notebook renders DataFrames or query results as interactive tables or visualizations. It allows users to quickly explore data, apply filters, and generate charts without writing additional code. While specific to Databricks, it enhances interactivity and is often used for data exploration and presentation.

11
Q

What is MLflow in relation to Databricks?

A

MLflow is an open-source platform integrated into Databricks for managing the end-to-end machine learning lifecycle. It supports tracking experiments, packaging models, managing model versions, and deploying them to production. Databricks provides a seamless MLflow experience, allowing teams to collaborate and operationalize ML workflows efficiently within the Lakehouse environment.

12
Q

What is the purpose of Databricks’ Auto-Scaling feature?

A

Databricks’ Auto-Scaling feature automatically adjusts the number of worker nodes in a cluster based on workload demands. It helps optimize resource usage by scaling up during heavy computation and scaling down when demand decreases. This leads to cost savings and improved efficiency without manual intervention.

13
Q

What type of data storage does Delta Lake provide?

A

Delta Lake provides transactional storage for big data workloads by layering ACID compliance on top of cloud object stores like AWS S3 or Azure Data Lake Storage. It combines the scalability of a data lake with the reliability and consistency of a traditional database. This enables robust data pipelines and simplifies batch and streaming unification.

14
Q

What is the command to read a Delta table in Databricks?

A

To read a Delta table in Databricks, you can use:

spark.read.format("delta").load("path/to/table")

for file-based access, or:

spark.table("table_name")

for a table registered in the metastore. Both options load the table as a DataFrame for further processing. Using the delta format ensures access to versioned, ACID-compliant data.

15
Q

What is the default language for Databricks notebooks?

A

Python

16
Q

What is the purpose of the cache() method in Spark?

A

The cache() method in Spark stores a DataFrame or RDD in memory across the cluster, making future actions on the same data much faster. It’s useful when the same dataset will be reused multiple times in a workflow. Caching improves performance by avoiding repeated computation or data retrieval from storage.
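
A minimal sketch of the usual pattern, assuming a DataFrame df with an illustrative amount column already exists:

```
# Mark the DataFrame for caching; Spark materializes it on the first action
df.cache()

# The first action fills the cache, later actions reuse it
df.count()
df.filter(df.amount > 100).count()

# Free the memory once the data is no longer needed
df.unpersist()
```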

17
Q

How does Databricks integrate with Apache Kafka?

A

Databricks integrates with Apache Kafka to support real-time data streaming into and out of the Lakehouse. Using the readStream and writeStream APIs in Spark Structured Streaming, Databricks can consume Kafka topics or publish processed data back. This enables near real-time analytics, event-driven pipelines, and machine learning on streaming data.
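
A hedged sketch of the read side, assuming a reachable Kafka broker and an events topic (both placeholders):

```
# Subscribe to a Kafka topic as a streaming DataFrame
raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load())

# Kafka keys and values arrive as binary, so cast them before use
decoded = raw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```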

18
Q

What does the spark.sqlContext object provide in Databricks?

A

The spark.sqlContext object in Databricks provides an entry point for working with structured data using SQL and DataFrames. It enables operations like querying tables, registering temporary views, and running SQL commands within a Spark application. While now largely replaced by SparkSession, it remains accessible for backward compatibility.

19
Q

What function does the groupBy() method perform in Spark?

A

The groupBy() method in Spark groups rows in a DataFrame based on one or more columns, allowing aggregation operations like count(), sum(), or avg() to be applied to each group. It’s essential for summarizing and analyzing data by categories or keys. Grouping helps extract meaningful insights from large datasets efficiently.
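
For example, a minimal sketch assuming a DataFrame df with illustrative category and amount columns:

```
from pyspark.sql import functions as F

# Row count and average amount per category
summary = (df.groupBy("category")
    .agg(F.count("*").alias("n"), F.avg("amount").alias("avg_amount")))
```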

20
Q

What is the command to write a DataFrame to a Delta table?

A

To write a DataFrame to a Delta table, you can use:

dataframe.write.format("delta").save("path/to/table")

to save to a specific location, or:

dataframe.write.format("delta").saveAsTable("table_name")

to save as a managed Delta table in the metastore. Both commands ensure ACID transactions and versioning features of Delta Lake are enabled.

21
Q

What is the purpose of the join() method in Spark?

A

The join() method in Spark combines two DataFrames based on a common key or condition. It allows you to merge related data from different sources, similar to SQL joins. This is essential for integrating datasets and performing comprehensive analysis across tables.
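
A minimal sketch, assuming orders and customers DataFrames that share a customer_id column:

```
# Inner join on the shared key; "left", "outer", etc. are also valid values for how
enriched = orders.join(customers, on="customer_id", how="inner")
```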

22
Q

What are the types of clusters that can be created in Databricks?

A

Databricks supports two main types of clusters: Interactive clusters, used for ad hoc analysis, development, and exploration; and Job clusters, which are temporary clusters created to run automated jobs or workflows. Interactive clusters persist until manually terminated, while job clusters are created and terminated automatically for specific tasks.

23
Q

What are the differences between RDDs and DataFrames in Spark?

A

RDDs (Resilient Distributed Datasets) are low-level distributed collections of objects offering fine-grained control but requiring more code and manual optimization. DataFrames provide a higher-level, tabular abstraction with schema support, enabling optimized execution through Spark’s Catalyst optimizer and easier integration with SQL. DataFrames generally offer better performance and simpler APIs for data processing.

24
Q

What is the purpose of the writeStream() method in Spark Structured Streaming?

A

The writeStream() method is used to define the output sink for streaming data in Spark Structured Streaming. It specifies where and how the streaming results are written, such as to files, Kafka, or memory, and controls options like output mode and checkpointing. This method enables continuous, incremental processing of real-time data streams.
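
A hedged sketch, assuming a streaming DataFrame named events and placeholder output paths:

```
# Continuously append streaming results to a Delta sink with checkpointing
query = (events.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/events")
    .start("/tmp/delta/events"))
```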

25
Q

What is the significance of the `spark.conf` settings in Databricks?

A

The `spark.conf` settings allow users to configure Spark’s runtime behavior and performance parameters within a Databricks environment. They control aspects such as memory allocation, shuffle partitions, and query optimizations. Proper tuning of these settings can greatly improve job efficiency and resource utilization.

26
Q

How does Databricks utilize Parquet files?

A

Databricks extensively uses Parquet as the default file format for storing data because of its efficient columnar storage and compression. Parquet enables fast querying and reduced storage costs, making it ideal for big data analytics. Delta Lake builds on Parquet by adding transactional capabilities and schema enforcement.

27
Q

What is the primary function of Databricks' Unity Catalog?

A

Databricks' Unity Catalog provides a unified governance solution for managing data and AI assets across all Databricks workspaces. It centralizes metadata, enforces fine-grained access controls, and enables auditing and lineage tracking. This ensures secure and compliant data sharing and collaboration in large organizations.

28
Q

What is the `spark.sql.catalog` configuration used for in Databricks?

A

The `spark.sql.catalog` configuration specifies the catalog implementation that Spark uses to access metadata about databases and tables. It allows users to define custom catalogs or connect to external metastore services, enabling flexible management of table metadata. This setting is important for integrating with Unity Catalog or Hive metastores.

29
Q

What does the `dropDuplicates()` method do in a DataFrame?

A

The `dropDuplicates()` method removes duplicate rows from a DataFrame based on all or specified columns. It helps clean data by ensuring each record is unique, which is essential for accurate analysis and processing. This method can be used to prevent data quality issues caused by redundant entries.
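
A minimal sketch, assuming a DataFrame df with illustrative key columns:

```
# Remove rows that are exact duplicates across all columns
deduped = df.dropDuplicates()

# Or deduplicate on a subset of key columns
deduped_keys = df.dropDuplicates(["customer_id", "order_date"])
```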

30
Q

What is the command to create a new DataFrame from a CSV file?

A

To create a new DataFrame from a CSV file, use:

spark.read.format("csv").option("header", "true").load("path/to/file.csv")

This command reads the CSV file with headers and loads it as a DataFrame for further processing.

31
Q

What is the purpose of the `select()` method in a DataFrame?

A

The `select()` method is used to choose specific columns from a DataFrame. It creates a new DataFrame containing only the selected columns, which helps focus analysis and reduce data size. This method is essential for shaping data before transformations or aggregations.

32
Q

What is the difference between a 'static' and a 'streaming' DataFrame in Spark?

A

A 'static' DataFrame represents a fixed dataset loaded into memory, suitable for batch processing. A 'streaming' DataFrame is continuously updated as new data arrives, supporting real-time processing with incremental results. Streaming DataFrames enable applications like real-time analytics and event-driven pipelines.

33
Q

What is the purpose of 'checkpointing' in Spark Streaming?

A

Checkpointing in Spark Streaming saves the state of a streaming application periodically to durable storage. It enables fault tolerance by allowing the system to recover from failures without data loss. Checkpointing also helps maintain stateful transformations and ensures exactly-once processing guarantees.

34
Q

What command is used to stop a running cluster in Databricks?

A

In Databricks, clusters are typically stopped through the UI by selecting the cluster and clicking “Terminate.” Programmatically, you can use the Databricks REST API’s `clusters/delete` endpoint to stop a cluster. Stopping a cluster releases its resources and stops billing until restarted.

35
Q

What are ACID transactions, and how does Databricks support them?

A

ACID transactions ensure data operations are Atomic, Consistent, Isolated, and Durable, providing reliability in database systems. Databricks supports ACID transactions through Delta Lake, which adds transactional guarantees on top of data lakes. This enables safe concurrent reads and writes, schema enforcement, and rollback capabilities.

36
Q

What is the function of the `filter()` method in a DataFrame?

A

The `filter()` method selects rows from a DataFrame that meet a specified condition or set of conditions. It’s used to narrow down data to relevant subsets for analysis or processing. Filtering helps improve efficiency by working only with the necessary data.
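
A minimal sketch, assuming a DataFrame df with an illustrative year column:

```
# Keep only rows matching a condition
recent = df.filter(df.year >= 2020)

# The same condition expressed as a SQL-style string
recent = df.filter("year >= 2020")
```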

37
Q

What is the command to list all available tables in a Databricks SQL context?

A

To list all available tables, you can use the SQL command:

SHOW TABLES

This displays all tables accessible in the current database or schema, helping users discover and explore available data assets.

38
Q

What does the `count()` method do in a DataFrame?

A

The `count()` method returns the total number of rows in a DataFrame. It’s useful for quickly determining dataset size and validating data processing steps. Counting helps monitor the volume of data being analyzed or transformed.

39
Q

What is the function of the `orderBy()` method in a DataFrame?

A

The `orderBy()` method sorts the rows of a DataFrame based on one or more columns, either ascending or descending. It is used to organize data for better readability or to prepare for operations that require sorted input. Sorting helps in tasks like ranking, reporting, and presenting results.

40
Q

What is the command to save a DataFrame as a Parquet file?

A

To save a DataFrame as a Parquet file, use:

dataframe.write.format("parquet").save("path/to/output")

This command writes the data in an efficient, columnar format suitable for fast analytics and storage optimization.

41
Q

What is the purpose of the `collect()` method in a DataFrame?

A

The `collect()` method retrieves all the rows of a DataFrame from the distributed cluster to the driver program as a local array. It’s useful for small datasets when you need to access data locally but can cause memory issues with large datasets. Use it cautiously to avoid performance bottlenecks.
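
A minimal sketch of that caution in practice, assuming a DataFrame df:

```
# Safe: only a bounded number of rows reach the driver
sample = df.limit(10).collect()

# Risky on large data: pulls every row to the driver
# all_rows = df.collect()
```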

42
Q

What does the `withColumn()` method do in a DataFrame?

A

The `withColumn()` method creates a new column or replaces an existing one in a DataFrame based on a specified expression or transformation. It’s commonly used to add calculated fields or modify data without changing the original DataFrame. This method supports chaining multiple transformations efficiently.
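
A minimal sketch, assuming a DataFrame df with illustrative price and quantity columns:

```
from pyspark.sql import functions as F

# Add a derived column; the source DataFrame itself is left unchanged
with_total = df.withColumn("total", F.col("price") * F.col("quantity"))
```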

43
Q

What is the command to run a SQL query in Databricks?

A

In Databricks notebooks, you can run a SQL query by prefixing the cell with %sql. For example:

%sql
SELECT * FROM table_name

This allows you to execute SQL directly within a notebook cell and view the results interactively.

44
Q

What is the significance of the `broadcast()` function in Spark?

A

The `broadcast()` function in Spark distributes a small dataset to all worker nodes, enabling efficient joins with large datasets. By broadcasting the smaller dataset, Spark avoids expensive data shuffles and reduces network I/O, improving join performance. It’s especially useful in scenarios where one table is much smaller than the other.
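
A minimal sketch, assuming a large DataFrame and a small lookup DataFrame that share an id column (names are placeholders):

```
from pyspark.sql.functions import broadcast

# Hint Spark to ship the small table to every executor instead of shuffling both sides
result = big_df.join(broadcast(small_lookup), on="id")
```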

45
Q

What is the command to create a new cluster in Databricks?

A

Clusters in Databricks are typically created through the UI by specifying configuration details like node type, size, and runtime version. Programmatically, you can use the Databricks REST API `clusters/create` endpoint with a JSON payload defining cluster parameters. This allows automation of cluster provisioning.

46
Q

What is the primary function of the `union()` method in a DataFrame?

A

The `union()` method combines the rows of two DataFrames with the same schema into a single DataFrame. It appends the datasets vertically without removing duplicates. This is useful for merging datasets from different sources or time periods for unified analysis.
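
A minimal sketch, assuming two DataFrames with the same schema (names are placeholders):

```
# Append the rows of one DataFrame to another (matches columns by position)
combined = jan_sales.union(feb_sales)

# unionByName matches columns by name rather than position
combined = jan_sales.unionByName(feb_sales)
```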