Azure Databricks Flashcards

1
Q

What is medallion architecture?

A

A medallion architecture is a data design pattern used to logically organize data in a lakehouse, with the goal of incrementally and progressively improving the structure and quality of data as it flows through each layer of the architecture (from Bronze ⇒ Silver ⇒ Gold layer tables).
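
As a rough illustration of the Bronze ⇒ Silver ⇒ Gold flow, the sketch below uses plain Python lists in place of lakehouse tables. The sample records, the field names (`customer`, `amount`), and the function names `to_silver` and `to_gold` are assumptions made up for this example; they are not part of any Databricks API.

```python
from collections import defaultdict

# Bronze: raw records exactly as they landed (hypothetical sample data).
bronze = [
    {"customer": " Alice ", "amount": "10.50"},
    {"customer": "bob", "amount": "4"},
    {"customer": "Alice", "amount": "not-a-number"},  # bad row
    {"customer": "BOB", "amount": "6.00"},
]


def to_silver(rows):
    """Cleanse and conform: normalize names, parse amounts, drop bad rows."""
    silver_rows = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        name = row["customer"].strip().lower()
        silver_rows.append({"customer": name, "amount": amount})
    return silver_rows


def to_gold(rows):
    """Aggregate to a business-level view: total spend per customer."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Each step only reads the layer before it, so the Silver and Gold results can be rebuilt from the raw Bronze records at any time.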

2
Q

In medallion architecture, what is the bronze layer?

A

The landing zone for raw data.

3
Q

In medallion architecture, what is the silver layer?

A

Cleansed and conformed data.

In the Silver layer of the lakehouse, the data from the Bronze layer is combined, organized, and cleaned up to create a comprehensive view of the important aspects of the business, such as customers, stores, transactions, and reference tables. This helps to ensure that the Silver layer provides a unified and reliable representation of the enterprise’s key information.

4
Q

T or F: From a data modeling perspective, the Silver layer tends to use data models that are closer to third normal form (3NF).

A

True

5
Q

In medallion architecture, what is the gold layer?

A

Curated business-level tables. The Gold layer uses more denormalized, read-optimized data models with fewer joins (flattened data).
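
To make the contrast with the Silver layer concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for lakehouse tables. The table and column names are hypothetical. The only point is that the Gold-style table is joined once, up front, so readers query it without joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Silver-style, normalized tables (hypothetical schema).
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 10.5), (11, 2, 4.0), (12, 2, 6.0);

    -- Gold-style, flattened table: the join is done once, ahead of time.
    CREATE TABLE gold_customer_sales AS
    SELECT c.name AS customer_name,
           COUNT(o.order_id) AS order_count,
           SUM(o.amount) AS total_amount
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name;
    """
)

# Consumers read the Gold table directly, with no joins.
gold_rows = conn.execute(
    "SELECT customer_name, order_count, total_amount "
    "FROM gold_customer_sales ORDER BY customer_name"
).fetchall()
print(gold_rows)
```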

6
Q

What are the benefits of a lakehouse architecture? (5 items)

A

Simple data model

Easy to understand and implement

Enables incremental ETL

Can recreate your tables from raw data at any time

ACID transactions, time travel

7
Q

What is the concept of a data mesh?

A

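A data mesh is a decentralized approach in which domain teams own and publish their data as products. The medallion architecture is compatible with this concept: Bronze and Silver tables can be joined together in a “one-to-many” fashion, meaning that the data in a single upstream table could be used to generate multiple downstream tables.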
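
The “one-to-many” idea can be sketched in a few lines of plain Python. The table contents and the helper `total_by` are hypothetical, chosen only to show one upstream table feeding two downstream tables.

```python
# Hypothetical Silver table: one cleaned record per transaction.
silver_transactions = [
    {"store": "north", "customer": "alice", "amount": 10.5},
    {"store": "north", "customer": "bob", "amount": 4.0},
    {"store": "south", "customer": "bob", "amount": 6.0},
]


def total_by(rows, key):
    """Sum the `amount` field, grouped by the given key."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0.0) + row["amount"]
    return totals


# One upstream table, two downstream tables.
gold_sales_by_store = total_by(silver_transactions, "store")
gold_sales_by_customer = total_by(silver_transactions, "customer")
print(gold_sales_by_store)
print(gold_sales_by_customer)
```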

8
Q

What is Databricks?

A

Databricks offers a powerful workspace that integrates various components and tools to simplify the end-to-end data lifecycle. It provides capabilities for data ingestion, data preparation, data exploration, machine learning, and visualization.

9
Q

What programming languages does Databricks support?

A

Through its notebooks, the platform supports multiple programming languages, including Python, Scala, R, and SQL.

10
Q

T or F: Databricks has Apache Spark capabilities

A

T

11
Q

T or F: You can manage Spark clusters through APIs with Databricks.

A

T
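
As a hedged sketch of what managing clusters through an API can look like, the code below builds a request to the Databricks Clusters REST API list endpoint using only the Python standard library. The workspace URL and token are placeholders you would supply yourself, the function names are invented for this example, and the API version in the path (`2.0` here) may differ in your workspace.

```python
import json
import urllib.request


def build_list_clusters_request(workspace_url, token):
    """Build (but do not send) a request to the Clusters API list endpoint."""
    url = workspace_url.rstrip("/") + "/api/2.0/clusters/list"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def list_cluster_names(workspace_url, token):
    """Send the request and return the cluster names found in the response."""
    request = build_list_clusters_request(workspace_url, token)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # Assumes the response holds a "clusters" list with "cluster_name" fields.
    return [c.get("cluster_name") for c in payload.get("clusters", [])]
```

Calling `list_cluster_names` needs a real workspace URL and a valid access token; `build_list_clusters_request` can be inspected offline.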

12
Q

What do Databricks alerts do?

A

You can set up an alert to monitor certain data streams or queries and automatically create support tickets when they exceed certain thresholds.

13
Q

What are Apache Spark clusters?

A

Spark is a distributed data processing solution that makes use of clusters to scale processing across multiple compute nodes. Each Spark cluster has a driver node to coordinate processing jobs, and one or more worker nodes on which the processing occurs.
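
The driver/worker split can be pictured with a small analogy in plain Python. This is not Spark: it runs threads on one machine rather than separate compute nodes, and the names `driver` and `count_words` are made up for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(partition):
    """Work done on a 'worker': count the words in one partition of lines."""
    return sum(len(line.split()) for line in partition)


def driver(lines, num_workers=2):
    """The 'driver': split the data, hand out partitions, combine the results."""
    partitions = [lines[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    return sum(partial_counts)


lines = ["spark scales out", "one driver", "many workers do the work"]
total_words = driver(lines)
print(total_words)
```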

14
Q

What is the Databricks File System (DBFS)?

A

The nodes in a Spark cluster have access to a shared, distributed file system in which they can access and operate on data files. The Databricks File System (DBFS) enables you to mount cloud storage and use it to work with and persist file-based data.

15
Q

What is a notebook?

A

A notebook is an interactive document that combines code, its output, and text. Writing code in notebooks is one of the most common ways for data analysts, data scientists, data engineers, and developers to work with Spark.

16
Q

What is a Hive metastore?

A

Hive is an open source technology used to define a relational abstraction layer of tables over file-based data. The metastore holds the definitions (metadata) of those tables, which can then be queried using SQL syntax.

17
Q

What is Delta Lake?

A

Delta Lake builds on the relational table schema abstraction over files in the data lake to add support for SQL semantics commonly found in relational database systems. These semantics include, for example, DDL statements and DML statements.
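
For readers unsure what DDL and DML statements look like, here is a minimal sketch using Python's built-in `sqlite3` module. SQLite is only a stand-in here, not Delta Lake, and the `products` table is hypothetical. The same categories of statement are what the card refers to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL (data definition language): define the table's schema.
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)"
)

# DML (data manipulation language): insert, update, and delete rows.
conn.execute("INSERT INTO products VALUES (1, 'widget', 2.5), (2, 'gadget', 4.0)")
conn.execute("UPDATE products SET price = 3.0 WHERE id = 1")
conn.execute("DELETE FROM products WHERE id = 2")

remaining = conn.execute("SELECT id, name, price FROM products").fetchall()
print(remaining)
```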

18
Q

How is Databricks used as a data workflow tool?

A

It can be used to move data from one source to another and transform that data along the way. As a result, the tools that integrate best with Databricks are those that can receive the data flow. For example, Databricks can send its data to data stores such as Azure Synapse Analytics for further analytics, or to Azure Data Lake.

18
Q

How is Azure Active Directory used in Databricks?

A

Each user, or each group, can be granted various levels of permissions to your resources or even your data.

19
Q

What is an Azure Databricks workspace?

A

An Azure Databricks workspace is an analytics platform based on Apache Spark. Databricks integrates Spark into Azure and makes working with Spark seamless for collaborators such as data engineers and data scientists.

19
Q

What is a SQL warehouse?

A

SQL Warehouses are relational compute resources with endpoints that enable client applications to connect to an Azure Databricks workspace and use SQL to work with data in tables.