Foundations of Database Systems Flashcards
(46 cards)
What is Databricks?
Databricks is a unified analytics platform that provides a cloud-based environment for data engineering, data science, and machine learning.
What is a Delta Lake in Databricks?
Delta Lake is an open-source storage layer in Databricks that brings ACID transaction support to Apache Spark and big data workloads. It enables reliable data lakes by supporting features like schema enforcement, time travel, and scalable metadata handling. Delta Lake ensures consistency and performance across streaming and batch jobs, making it ideal for modern data engineering workflows.
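A minimal sketch of two of those features, assuming a Databricks notebook where spark is the pre-created SparkSession and /mnt/delta/events is a hypothetical path to an existing Delta table:
# Read the current version of the Delta table as a DataFrame
df = spark.read.format("delta").load("/mnt/delta/events")

# Time travel: read the table as it existed at an earlier version
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/delta/events")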
Which programming languages are natively supported by Databricks?
Python, R, Scala, and SQL.
What is the primary purpose of Databricks Runtime?
The Databricks Runtime is a Databricks-specific, optimized runtime environment built on top of Apache Spark. It includes performance improvements, integrated libraries (like Delta Lake, MLflow, Photon), and optimized connectors.
What is the role of a cluster in Databricks?
A cluster in Databricks is a set of computation resources that run jobs and notebooks.
Which storage options can be used with Databricks?
Azure Blob Storage, AWS S3, and ADLS Gen2.
What is the purpose of the Databricks Jobs feature?
The Databricks Jobs feature is used to automate the execution of notebooks, Python scripts, JARs, or workflows on a scheduled or triggered basis. It enables teams to build and manage production-grade data pipelines and ML workflows directly within the platform. Jobs can be monitored, retried on failure, and configured with dependencies to support complex workflows.
What is Databricks SQL, and how is it used?
Databricks SQL is a workspace environment for running SQL queries on data stored in the Lakehouse. It allows analysts and engineers to explore data, create dashboards, and build visualizations using familiar SQL syntax. It supports Delta Lake tables, integrates with BI tools, and provides a fully managed, scalable SQL engine optimized for performance.
What is the significance of the spark.sql.shuffle.partitions setting in Databricks?
The spark.sql.shuffle.partitions setting controls the number of partitions used when Spark performs shuffle operations, such as joins and aggregations. By default, it’s set to 200, but tuning this value can significantly impact performance. Lowering it can reduce overhead for small datasets, while increasing it may help parallelize work better for large datasets.
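A minimal tuning sketch from a notebook; the value 64 is purely illustrative:
# Inspect the current shuffle partition count (200 by default)
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Lower it before running joins/aggregations on a small dataset
spark.conf.set("spark.sql.shuffle.partitions", "64")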
What does the display() command do in a Databricks notebook?
The display() command in a Databricks notebook renders DataFrames or query results as interactive tables or visualizations. It allows users to quickly explore data, apply filters, and generate charts without writing additional code. While specific to Databricks, it enhances interactivity and is often used for data exploration and presentation.
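For example (the table name here is hypothetical):
# Load a table and render it as an interactive, chartable result grid
df = spark.table("sales")   # hypothetical table name
display(df)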
What is MLflow in relation to Databricks?
MLflow is an open-source platform integrated into Databricks for managing the end-to-end machine learning lifecycle. It supports tracking experiments, packaging models, managing model versions, and deploying them to production. Databricks provides a seamless MLflow experience, allowing teams to collaborate and operationalize ML workflows efficiently within the Lakehouse environment.
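A minimal experiment-tracking sketch with the MLflow Python API; the parameter and metric names are illustrative:
import mlflow

# Each run records parameters, metrics, and artifacts for later comparison
with mlflow.start_run():
    mlflow.log_param("max_depth", 5)     # illustrative hyperparameter
    mlflow.log_metric("rmse", 0.42)      # illustrative evaluation metric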
What is the purpose of Databricks’ Auto-Scaling feature?
Databricks’ Auto-Scaling feature automatically adjusts the number of worker nodes in a cluster based on workload demands. It helps optimize resource usage by scaling up during heavy computation and scaling down when demand decreases. This leads to cost savings and improved efficiency without manual intervention.
What type of data storage does Delta Lake provide?
Delta Lake provides transactional storage for big data workloads by layering ACID compliance on top of cloud object stores like AWS S3 or Azure Data Lake Storage. It combines the scalability of a data lake with the reliability and consistency of a traditional database. This enables robust data pipelines and simplifies batch and streaming unification.
What is the command to read a Delta table in Databricks?
To read a Delta table in Databricks, you can use spark.read.format("delta").load("path/to/table") for file-based access, or spark.table("table_name") for a registered table in the metastore. Both options load the table as a DataFrame for further processing. Using the delta format ensures access to versioned, ACID-compliant data.
What is the default language for Databricks notebooks?
Python
What is the purpose of the cache() method in Spark?
The cache() method in Spark stores a DataFrame or RDD in memory across the cluster, making future actions on the same data much faster. It’s useful when the same dataset will be reused multiple times in a workflow. Caching improves performance by avoiding repeated computation or data retrieval from storage.
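A typical caching pattern, sketched with a hypothetical source path:
df = spark.read.parquet("/mnt/raw/events")   # hypothetical path
df.cache()                                   # mark the DataFrame for in-memory caching

df.count()                                   # first action materializes the cache
df.filter("country = 'US'").count()          # later actions reuse the cached data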
How does Databricks integrate with Apache Kafka?
Databricks integrates with Apache Kafka to support real-time data streaming into and out of the Lakehouse. Using the readStream and writeStream APIs in Spark Structured Streaming, Databricks can consume Kafka topics or publish processed data back. This enables near real-time analytics, event-driven pipelines, and machine learning on streaming data.
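A minimal consumer sketch; the broker address and topic name are assumptions:
# Subscribe to a Kafka topic as a streaming DataFrame
stream_df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
    .option("subscribe", "events")                       # hypothetical topic
    .load())

# Kafka values arrive as bytes; cast to string before further processing
messages = stream_df.selectExpr("CAST(value AS STRING) AS value")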
What does the spark.sqlContext object provide in Databricks?
The spark.sqlContext object in Databricks provides an entry point for working with structured data using SQL and DataFrames. It enables operations like querying tables, registering temporary views, and running SQL commands within a Spark application. While now largely replaced by SparkSession, it remains accessible for backward compatibility.
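For example, using the sqlContext handle that Databricks notebooks have historically provided alongside spark (the view name is illustrative):
df = spark.range(5)
df.createOrReplaceTempView("numbers")              # register a temporary view

legacy = sqlContext.sql("SELECT * FROM numbers")   # legacy SQLContext entry point
modern = spark.sql("SELECT * FROM numbers")        # preferred SparkSession equivalent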
What function does the groupBy() method perform in Spark?
The groupBy() method in Spark groups rows in a DataFrame based on one or more columns, allowing aggregation operations like count(), sum(), or avg() to be applied to each group. It’s essential for summarizing and analyzing data by categories or keys. Grouping helps extract meaningful insights from large datasets efficiently.
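A small aggregation sketch with illustrative data and column names:
from pyspark.sql import functions as F

sales = spark.createDataFrame(
    [("US", 100.0), ("US", 250.0), ("DE", 80.0)],
    ["country", "amount"],
)

# Count rows and sum amounts per country
summary = sales.groupBy("country").agg(
    F.count("*").alias("orders"),
    F.sum("amount").alias("total_amount"),
)
summary.show()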
What is the command to write a DataFrame to a Delta table?
To write a DataFrame to a Delta table, you can use:
dataframe.write.format("delta").save("path/to/table")
to save to a specific location, or:
dataframe.write.format("delta").saveAsTable("table_name")
to save as a managed Delta table in the metastore. Both commands preserve Delta Lake’s ACID transaction and versioning guarantees.
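For instance, the same path-based write with an explicit save mode (the path and mode are illustrative):
(dataframe.write
    .format("delta")
    .mode("overwrite")           # or "append" for incremental loads
    .save("/mnt/delta/sales"))   # hypothetical path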
What is the purpose of the join() method in Spark?
The join() method in Spark combines two DataFrames based on a common key or condition. It allows you to merge related data from different sources, similar to SQL joins. This is essential for integrating datasets and performing comprehensive analysis across tables.
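A minimal sketch with two hypothetical DataFrames sharing a customer_id key:
orders = spark.createDataFrame([(1, 101), (2, 102)], ["order_id", "customer_id"])
customers = spark.createDataFrame([(101, "Alice"), (102, "Bob")], ["customer_id", "name"])

# Inner join on the shared key; other types include "left", "right", and "outer"
enriched = orders.join(customers, on="customer_id", how="inner")
enriched.show()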
What are the types of clusters that can be created in Databricks?
Databricks supports two main types of clusters: Interactive clusters, used for ad hoc analysis, development, and exploration; and Job clusters, which are temporary clusters created to run automated jobs or workflows. Interactive clusters persist until manually terminated, while job clusters are created and terminated automatically for specific tasks.
What are the differences between RDDs and DataFrames in Spark?
RDDs (Resilient Distributed Datasets) are low-level distributed collections of objects offering fine-grained control but requiring more code and manual optimization. DataFrames provide a higher-level, tabular abstraction with schema support, enabling optimized execution through Spark’s Catalyst optimizer and easier integration with SQL. DataFrames generally offer better performance and simpler APIs for data processing.
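A small side-by-side sketch of the two abstractions, using illustrative data:
# RDD: low-level, schema-free, transformations written by hand
rdd = spark.sparkContext.parallelize([("US", 1), ("US", 2), ("DE", 3)])
rdd_totals = rdd.reduceByKey(lambda a, b: a + b)

# DataFrame: tabular, schema-aware, optimized by the Catalyst optimizer
df = spark.createDataFrame([("US", 1), ("US", 2), ("DE", 3)], ["country", "n"])
df_totals = df.groupBy("country").sum("n")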
What is the purpose of the writeStream method in Spark Structured Streaming?
The writeStream method is used to define the output sink for streaming data in Spark Structured Streaming. It specifies where and how the streaming results are written, such as to files, Kafka, or memory, and controls options like output mode and checkpointing. This method enables continuous, incremental processing of real-time data streams.
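A minimal sink definition, continuing the Kafka sketch above; the checkpoint and output paths are hypothetical:
query = (messages.writeStream
    .format("delta")                                           # sink: a Delta table
    .outputMode("append")                                      # add only new rows each micro-batch
    .option("checkpointLocation", "/mnt/checkpoints/events")   # hypothetical checkpoint path
    .start("/mnt/delta/events_stream"))                        # hypothetical output path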