Lakehouse Flashcards

(49 cards)

1
Q

What is a lakehouse?

A

A lakehouse presents as a database and is built on top of a data lake using Delta format tables.

2
Q

What capabilities do lakehouses combine?

A

The SQL-based analytical capabilities of a relational data warehouse and the flexibility and scalability of a data lake.

3
Q

What types of data formats can lakehouses store?

A

All data formats.

4
Q

What is the advantage of lakehouses being cloud-based?

A

They can scale automatically and provide high availability and disaster recovery.

5
Q

What processing engines do lakehouses use?

A

Spark and SQL engines.

6
Q

What is the schema-on-read format?

A

Data is organized in a schema-on-read format, meaning the schema is defined as needed rather than having a predefined schema.

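A minimal notebook sketch of schema-on-read, assuming some JSON order files already sit in the lakehouse Files area; the path and fields are illustrative, and the same files could later be read again with a different schema.

```python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# The schema is declared when the files are read, not when they were written.
schema = StructType([
    StructField("order_id", StringType()),
    StructField("region", StringType()),
    StructField("amount", DoubleType()),
])

# `spark` is the SparkSession predefined in a Fabric notebook; the path is illustrative.
orders = spark.read.schema(schema).json("Files/raw/orders/")
orders.printSchema()
```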
7
Q

What does ACID stand for in the context of lakehouses?

A

Atomicity, Consistency, Isolation, Durability.

8
Q

What are the roles of different users in a lakehouse?

A

Data engineers, data scientists, and data analysts access and use data.

9
Q

What is the ETL process?

A

Extract, Transform, Load.

10
Q

What types of data sources can be ingested into a lakehouse?

A

Local files, databases, or APIs.

11
Q

What are Fabric shortcuts?

A

Links to data in external sources, such as Azure Data Lake Storage Gen2, or in other OneLake locations.

12
Q

What tools can be used to transform ingested data?

A

Apache Spark with notebooks or Dataflows Gen2.

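A minimal sketch of the notebook approach, assuming raw CSV files have already been ingested into the lakehouse Files area; the path, columns, and cleaning steps are illustrative.

```python
from pyspark.sql import functions as F

# `spark` is the SparkSession predefined in a Fabric notebook; the path is illustrative.
raw = spark.read.option("header", "true").csv("Files/raw/orders.csv")

# Typical light transformations: deduplicate, cast types, drop invalid rows.
cleaned = (
    raw.dropDuplicates()
       .withColumn("amount", F.col("amount").cast("double"))
       .filter(F.col("amount") > 0)
)
cleaned.show(5)
```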
13
Q

What is the purpose of Data Factory pipelines?

A

To orchestrate different ETL activities and land prepared data into the lakehouse.

14
Q

What familiar tool do Dataflows Gen2 utilize?

A

Power Query.

15
Q

How can you analyze data in a lakehouse?

A

Using SQL.

16
Q

What can be developed in Power BI using a lakehouse?

A

Reports.

17
Q

How is lakehouse access managed?

A

Through workspace roles or item-level sharing.

18
Q

What are sensitivity labels used for in lakehouses?

A

They are a data governance feature used to classify and protect data in the lakehouse.

19
Q

True or False: Item-level sharing is best for granting access for read-only needs.

A

True.

20
Q

Fill in the blank: Lakehouses support _______ transactions through Delta Lake formatted tables.

A

ACID.
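A minimal notebook sketch of what ACID support looks like in practice, assuming a Delta table named sales_orders with a status column already exists; the table and column names are illustrative.

```python
from delta.tables import DeltaTable

# `spark` is the SparkSession predefined in a Fabric notebook; names are illustrative.
orders = DeltaTable.forName(spark, "sales_orders")

# Delta Lake commits this update as a single transaction: either every matching
# row is updated and the change becomes visible, or none are.
orders.update(
    condition="status = 'pending'",
    set={"status": "'processed'"},
)
```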

21
Q

What is a key benefit of using a lakehouse for analytics?

A

A scalable analytics solution that maintains data consistency.

22
Q

What three items are automatically created in your workspace when you create a new lakehouse?

A

The lakehouse itself (which contains shortcuts, folders, files, and tables), the default semantic model, and the SQL analytics endpoint.

The lakehouse serves as a central hub for data management.

23
Q

What does the Semantic model (default) provide for Power BI report developers?

A

An easy data source.

The Semantic model simplifies data representation for reporting.

24
Q

What is the purpose of the SQL analytics endpoint in a lakehouse?

A

Allows read-only access to query data with SQL.

This endpoint enables SQL-based interaction with the lakehouse data.

25
Q

In what two modes can you work with data in the lakehouse?

A

Lakehouse mode and SQL analytics endpoint mode.

Each mode offers different capabilities for managing and querying data.

26
Q

What is the first step in the ETL process for a lakehouse?

A

Ingesting data into your lakehouse.

This step is crucial for preparing data for analysis.

27
Q

List the methods to ingest data into a lakehouse.

A

* Upload local files
* Dataflows Gen2
* Notebooks
* Data Factory pipelines

Each method has its own use case and benefits.
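A minimal sketch of the notebook method, assuming a CSV file has already been uploaded to the lakehouse Files area; the path, options, and table name are illustrative.

```python
# `spark` is the SparkSession predefined in a Fabric notebook; the path is illustrative.
df = (
    spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("Files/uploads/customers.csv")
)

# Land the raw data as a managed table in the lakehouse Tables area.
df.write.mode("overwrite").saveAsTable("customers_raw")
```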
28
Q

What should you consider when ingesting data to determine your loading pattern?

A

Whether to load all raw data as files or use staging tables.

This decision impacts performance and data processing efficiency.

29
Q

What can Spark job definitions be used for in a lakehouse?

A

To submit batch or streaming jobs to Spark clusters.

This allows for processing large volumes of data efficiently.
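A minimal sketch of a script that could be attached to a Spark job definition for a batch run, assuming the lakehouse already contains a table named orders_curated; the app name, tables, and aggregation are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# A standalone script builds its own session rather than relying on a notebook's.
spark = SparkSession.builder.appName("nightly_orders_batch").getOrCreate()

# Batch job: aggregate an existing lakehouse table and write the result back.
orders = spark.read.table("orders_curated")
totals = orders.groupBy("region").agg(F.sum("amount").alias("total_amount"))
totals.write.mode("overwrite").saveAsTable("orders_region_totals")

spark.stop()
```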
30
Q

What is the purpose of shortcuts in a lakehouse?

A

To integrate data while keeping it stored in external storage.

Shortcuts enhance data accessibility across different storage solutions.

31
Q

How are source data permissions and credentials managed when using shortcuts?

A

They are managed by OneLake.

This central management simplifies access control across data sources.

32
Q

What is required for a user to access data through a shortcut to another OneLake location?

A

The user must have permissions in the target location to read the data.

This ensures secure and authorized access to the data.

33
Q

Where can shortcuts be created?

A

In both lakehouses and KQL databases.

This versatility allows for broader data integration options.

34
Q

True or False: Shortcuts appear as a folder in the lake.

A

True.

This structure allows for organized data management within the lakehouse.
35
Q

What is the main role of data transformations in the data loading process?

A

Most data requires transformations before loading into tables.

36
Q

What tools can be used to transform and load data?

A

The same tools used to ingest data can also transform and load data.
37
Q

What is a Delta table?

A

A table stored in the Delta Lake format, the table format used by the lakehouse; transformed data can be loaded either as files or as Delta tables.
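A minimal notebook sketch of both load targets, with a small inline DataFrame standing in for already-transformed data; the paths, table name, and columns are illustrative.

```python
# `spark` is the SparkSession predefined in a Fabric notebook; the data is illustrative.
cleaned = spark.createDataFrame(
    [("1001", "West", 250.0), ("1002", "East", 125.5)],
    ["order_id", "region", "amount"],
)

# Load it as files in the lakehouse Files area...
cleaned.write.mode("overwrite").parquet("Files/curated/orders/")

# ...or as a Delta table in the Tables area (Delta is the lakehouse table format).
cleaned.write.mode("overwrite").format("delta").saveAsTable("orders_curated")
```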
38
Q

Who favors notebooks for data engineering tasks?

A

Data engineers familiar with different programming languages including PySpark, SQL, and Scala.
39
Q

What interface do Dataflows Gen2 use?

A

The Power Query interface.
40
Q

What do pipelines provide in the ETL process?

A

A visual interface to perform and orchestrate ETL processes.

41
Q

How complex can pipelines be?

A

Pipelines can be as simple or as complex as needed.

42
Q

What is required for data to be used after ingestion?

A

Data must be transformed and loaded.

43
Q

What do Fabric items provide for organizations?

A

The flexibility needed for every organization.
44
Q

What tools can data scientists use for exploring and training machine learning models?

A

Notebooks or Data Wrangler.
45
Q

What can report developers create using the semantic model?

A

Power BI reports.

46
Q

What can analysts use the SQL analytics endpoint for?

A

To query, filter, aggregate, and explore data in lakehouse tables.
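A rough Python sketch of querying the SQL analytics endpoint from outside Fabric over the standard SQL Server (TDS) protocol; the driver, server address, database, table, and columns are all illustrative assumptions, and the same query could be run in the Fabric SQL query editor or any other SQL client.

```python
import pyodbc

# Connection details are placeholders; in practice you copy the SQL connection
# string from the lakehouse's SQL analytics endpoint in the Fabric portal.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=your-endpoint.datawarehouse.fabric.microsoft.com;"  # hypothetical
    "Database=sales_lakehouse;"                                 # hypothetical
    "Authentication=ActiveDirectoryInteractive;"
    "Encrypt=yes;"
)

cursor = conn.cursor()
# Filter and aggregate rows from a lakehouse table (read-only T-SQL).
cursor.execute(
    "SELECT region, SUM(amount) AS total_amount "
    "FROM orders_curated "
    "GROUP BY region "
    "ORDER BY total_amount DESC"
)
for region, total_amount in cursor.fetchall():
    print(region, total_amount)
conn.close()
```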
47
Q

What is the benefit of combining Power BI with a data lakehouse?

A

You can implement an end-to-end analytics solution on a single platform.

48
Q

Fill in the blank: After data is ingested, transformed, and loaded, it's ready for _______.

A

others to use.

49
Q

True or False: Dataflows Gen2 are excellent for developers familiar with SQL only.

A

False.