Foundations of Database Systems Flashcards
(87 cards)
What is Databricks?
Databricks is a unified analytics platform that provides a cloud-based environment for data engineering, data science, and machine learning.
True or False: Databricks is built on Apache Spark.
True
Fill in the blank: Databricks allows users to create notebooks that support _____ programming languages.
multiple
What is a Delta Lake in Databricks?
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.
Which programming languages are natively supported by Databricks?
Python, R, Scala, and SQL.
What is the primary purpose of Databricks Runtime?
Databricks Runtime is an optimized Spark environment designed to improve performance and provide a stable platform for data processing.
What feature of Databricks allows for collaborative data science?
Databricks notebooks enable real-time collaboration among data scientists and engineers.
True or False: Databricks supports both batch and streaming data processing.
True
What is the role of a cluster in Databricks?
A cluster in Databricks is a set of computation resources that run jobs and notebooks.
Which storage options can be used with Databricks?
Azure Blob Storage, AWS S3, and ADLS Gen2.
What is the purpose of the Databricks Jobs feature?
The Jobs feature allows users to schedule and manage automated workflows for running notebooks or Spark jobs.
Fill in the blank: _____ is used in Databricks to perform data manipulation and analysis using SQL.
Databricks SQL
What is the significance of the ‘spark.sql.shuffle.partitions’ setting?
It controls the number of partitions to use when shuffling data for joins or aggregations.
True or False: Databricks provides built-in support for machine learning libraries.
True
What does the command ‘display()’ do in a Databricks notebook?
It visualizes the output of a DataFrame in a user-friendly manner.
Which service can be integrated with Databricks for data warehousing?
Snowflake
What is MLflow in relation to Databricks?
MLflow is an open-source platform for managing the machine learning lifecycle, integrated with Databricks.
Fill in the blank: Databricks provides _____ for managing and deploying machine learning models.
MLflow
What is the purpose of Databricks’ Auto-Scaling feature?
Auto-Scaling automatically adjusts the number of nodes in a cluster based on workload demands.
True or False: Databricks does not support version control for notebooks.
False
What does the ‘SQL Analytics’ workspace in Databricks provide?
It offers a SQL interface for querying data and creating visualizations.
What type of data storage does Delta Lake provide?
ACID-compliant storage with support for time travel and schema evolution.
What is the command to read a Delta table in Databricks?
spark.read.format(‘delta’).load(‘/path/to/delta/table’)
Fill in the blank: Databricks supports _____ for real-time data processing.
Structured Streaming