Databricks Flashcards
What are the 2 main components of Databricks?
The control plane: stores notebook commands and workspace configurations.
The data plane: hosts compute resources (clusters).
What are the 3 different Databricks services?
Data science and engineering workspace
SQL
Machine learning
What is a cluster?
A set of compute resources on which you run data engineering and data science workloads, either as a set of commands in a notebook or as a job.
What are the 2 cluster types?
All-purpose clusters: analyze data interactively using notebooks.
Job clusters: run automated jobs
How long does Databricks retain cluster configuration information?
30 days
Pin an all-purpose cluster to keep its configuration beyond 30 days.
What are the three cluster modes?
Standard clusters: process large amounts of data with Apache Spark.
Single Node clusters: jobs that use small amounts of data or non-distributed
workloads such as single-node machine learning libraries.
High Concurrency clusters: for groups of users who need to share resources or run ad-hoc jobs. Administrators usually create High Concurrency clusters.
Databricks recommends enabling autoscaling for High Concurrency clusters.
How do you ensure that all data at rest is encrypted for all storage types, including shuffle data stored temporarily on your cluster's local disks?
Enable local disk encryption.
How can you reduce cluster start time?
Attach the cluster to a predefined pool of idle instances for the driver and worker nodes.
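A minimal Python sketch tying the two previous cards together, assuming the Databricks Clusters REST API (POST /api/2.0/clusters/create); the workspace URL, token, runtime version, and pool ID are placeholders, not values from this deck.

import requests

# Placeholders (assumptions): substitute your workspace URL, personal access token, and pool ID
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "demo-cluster",
    "spark_version": "13.3.x-scala2.12",   # placeholder Databricks Runtime version
    "num_workers": 2,
    "instance_pool_id": "<pool-id>",        # draw driver and worker nodes from a predefined pool of idle instances
    "enable_local_disk_encryption": True,   # encrypt data at rest on the cluster's local disks, including shuffle data
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
print(resp.json())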
Which magic command do you use to run a notebook from
another notebook?
%run ../Includes/Classroom-Setup-1.2
What is Databricks Utilities (dbutils) and how can you use it to list directories of files from Python cells?
display(dbutils.fs.ls("/databricks-datasets"))
What function should you use when you have tabular data
returned by a Python cell?
display()
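A minimal sketch of that pattern in a Python notebook cell (the column names and rows below are made up):

# Assumes a Databricks notebook, where `spark` and `display` are predefined
df = spark.createDataFrame(
    [(1, "Ada"), (2, "Grace")],   # made-up sample rows
    ["id", "name"],
)
display(df)   # renders the DataFrame as an interactive table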
What is Databricks Repos?
provides repository-level integration with Git providers, allowing you to work in an environment that is backed by revision control using Git.
What is the definition of a Delta Lake?
technology at the heart of the Databricks Lakehouse platform. It is an open source
technology that enables building a data lakehouse on top of existing storage systems.
How does Delta Lake address the data lake pain points to ensure reliable, ready-to-go data?
ACID Transactions – Delta Lake adds ACID transactions to data lakes. ACID stands for atomicity,
consistency, isolation, and durability
Describe how Delta Lake brings ACID transactions to object
storage
ACID transactions address these object-storage pain points:
Appending data is difficult
Modifying existing data is difficult
Jobs fail midway
Real-time operations are not easy
Keeping historical data versions is costly (see the sketch below)
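One way to see the last point addressed (a sketch, assuming the students Delta table used later in this deck already exists): the Delta transaction log retains prior table versions that you can inspect and query.

# Assumes a Databricks notebook and an existing `students` Delta table
display(spark.sql("DESCRIBE HISTORY students"))               # one row per committed table version
display(spark.sql("SELECT * FROM students VERSION AS OF 0"))  # query an earlier version (time travel)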
Is Delta Lake the default for all tables created in Databricks?
Yes, Delta Lake is the default format for all tables created in Databricks.
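A quick check (a sketch; demo_tbl is a hypothetical throwaway table name): create a table without specifying a format and inspect its provider.

# `demo_tbl` is a hypothetical name; no USING clause is given, so the default format applies
spark.sql("CREATE TABLE IF NOT EXISTS demo_tbl (id INT)")
display(spark.sql("DESCRIBE EXTENDED demo_tbl"))   # the Provider row reports `delta`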
What data objects are in the Databricks Lakehouse?
Catalog, database, table, view, function
What is a metastore?
contains all of the metadata that defines data objects in the lakehouse. Options include:
Unity Catalog
Hive metastore
External metastore
What is a catalog?
the highest abstraction (or coarsest grain) in the Databricks Lakehouse relational model.
catalog_name.database_name.table_name
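A sketch of referencing a table through the full three-level namespace from Python (the catalog, database, and table names are placeholders):

# Placeholder names; substitute your own catalog, database (schema), and table
df = spark.table("main.default.students")                  # catalog_name.database_name.table_name
display(spark.sql("SELECT * FROM main.default.students"))  # the same reference in SQL issued from Python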
What is a Delta Lake table?
stores data as a directory of files on cloud object storage and registers table metadata to the metastore within a catalog and schema.
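To see the "directory of files" part (a sketch, assuming the students table from the next card exists and you can read its storage location): look up the table's location in its metadata and list it.

# Assumes an existing `students` Delta table
location = spark.sql("DESCRIBE DETAIL students").first()["location"]
display(dbutils.fs.ls(location))   # data files plus the _delta_log/ transaction log directory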
What is the syntax to create a Delta Table?
CREATE TABLE students
(id INT, name STRING, value DOUBLE);
CREATE TABLE IF NOT EXISTS students
(id INT, name STRING, value DOUBLE);
What is the syntax to insert data?
INSERT INTO students
VALUES
(4, "Ted", 4.7),
(5, "Tiffany", 5.5),
(6, "Vini", 6.3);
What is the syntax to update particular records of a table?
UPDATE students
SET value = value + 1
WHERE name LIKE "T%"
What is the syntax to delete particular records of a table?
DELETE FROM students
WHERE value > 6