Prepare data Flashcards

(6 cards)

1
Q

You have an Azure SQL database that contains a customer dimension table. The table contains two columns named CustomerID and CustomerCompositeKey.

You have a Fabric workspace that contains a Dataflow Gen2 query that connects to the database.

You need to use Dataflows Query Editor to identify which of the two columns contains non-duplicate values per customer.

Which option should you use?

A

Column distribution – distinct values
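The column profile's "distinct values" count is compared against the row count: a column with no duplicates has exactly one distinct value per row. The same check can be sketched in pandas (the sample data below is purely illustrative):

```python
import pandas as pd

# Hypothetical sample of the customer dimension table (illustrative data only).
df = pd.DataFrame({
    "CustomerID": [1, 2, 2, 3],
    "CustomerCompositeKey": ["A-1", "B-2", "B-3", "C-4"],
})

# A column whose distinct-value count equals the row count
# contains no duplicate values.
for col in ["CustomerID", "CustomerCompositeKey"]:
    print(col, "unique per row:", df[col].nunique() == len(df))
```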

2
Q

You have a Fabric lakehouse named Lakehouse1.

You use a notebook in Lakehouse1 to explore customer data.

You need to identify the rows of a DataFrame named df_customers in which any of the columns (axis 1 of the DataFrame) are NULL.

Which statement should you run?

A

df_customers[df_customers.isnull().any(axis=1)]
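To see how the mask works: isnull() flags each NULL cell, and any(axis=1) collapses those flags across the columns, so the resulting boolean Series selects rows with at least one NULL. A minimal demo with made-up data:

```python
import pandas as pd
import numpy as np

# Hypothetical customer data containing a missing value (illustrative only).
df_customers = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "City": ["Seattle", np.nan, "Austin"],
})

# isnull() flags each NULL cell; any(axis=1) collapses across columns,
# so the mask selects rows where any column is NULL.
rows_with_nulls = df_customers[df_customers.isnull().any(axis=1)]
print(rows_with_nulls)
```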

3
Q

You have a Fabric tenant that contains two lakehouses named Lakehouse1 and Lakehouse2. Lakehouse1 contains a table named FactSales that is partitioned by a column named CustomerID.

You need to create a shortcut to the FactSales table in Lakehouse2. The shortcut must only connect to data for CustomerID 100.

What should you do?

A

As you create the shortcut, select the CustomerID=100 folder under the FactSales folder in Tables.
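Partitioned lakehouse tables use Hive-style folder names of the form Column=value, which is why a single partition folder can be targeted by a shortcut. A minimal sketch of that on-disk layout, using made-up rows and plain CSV files rather than the Delta/Parquet files Fabric actually writes:

```python
import csv
from pathlib import Path
from tempfile import mkdtemp

# Hypothetical FactSales rows (illustrative only).
rows = [
    {"CustomerID": 100, "Amount": 10.0},
    {"CustomerID": 100, "Amount": 25.0},
    {"CustomerID": 200, "Amount": 5.0},
]

# One folder per partition value, mirroring the Hive-style
# CustomerID=<value> layout a partitioned table uses on disk.
root = Path(mkdtemp()) / "FactSales"
for cid in sorted({r["CustomerID"] for r in rows}):
    part_dir = root / f"CustomerID={cid}"
    part_dir.mkdir(parents=True)
    with open(part_dir / "part-0000.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["CustomerID", "Amount"])
        writer.writeheader()
        writer.writerows(r for r in rows if r["CustomerID"] == cid)

# A shortcut scoped to CustomerID 100 points at this single folder:
print(root / "CustomerID=100")
```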

4
Q

You have a Fabric tenant that contains a lakehouse named Lakehouse1.

You have a large 1 TB dataset in an external data source.

You need to recommend a method to ingest the dataset into Lakehouse1. The solution must provide the highest throughput. The solution must be suitable for developers who prefer the low-code/no-code option.

What should you recommend?

A

Use the Copy data activity of a pipeline to copy the data.
The Copy data activity is a low-code/no-code option and provides the highest throughput when copying large datasets.

5
Q

You have a Fabric tenant that contains a workspace named Workspace1. Workspace1 contains a data pipeline named Pipeline1 that runs in the US-West Azure region. Workspace1 also contains a semantic model named SemanticModel1 and a warehouse named Warehouse1.

You need to ensure that Pipeline1 runs at midnight (12:00 AM), and that the schedule is set to the UTC-0 time zone.

How should you configure the schedule for Pipeline1?

A

For Pipeline1, set the scheduler time zone to UTC-0.
Each data pipeline artifact in a workspace has its own scheduler time zone setting, so configuring UTC-0 there applies only to that pipeline and does not affect other items in the workspace.
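The setting matters because Pipeline1 runs in the US-West region: midnight UTC is not midnight Pacific time, so scheduling in the region's local time would shift the run. A small sketch with Python's zoneinfo (the date is illustrative):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Midnight in UTC-0, the required schedule for Pipeline1.
run_time_utc = datetime(2024, 6, 1, 0, 0, tzinfo=timezone.utc)

# The same instant expressed in US-West (Pacific) local time,
# where the pipeline physically runs.
run_time_pacific = run_time_utc.astimezone(ZoneInfo("America/Los_Angeles"))

print(run_time_utc.isoformat())      # 2024-06-01T00:00:00+00:00
print(run_time_pacific.isoformat())  # 2024-05-31T17:00:00-07:00
```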

6
Q

You have a Fabric warehouse.

You have an Azure SQL database that contains a fact table named Sales and a second table named ExceptionRecords. Both tables contain a unique key column named RecordID.

You plan to ingest the Sales table into the warehouse.

You need to use Dataflow Gen2 to configure a merge type to ensure that the Sales table excludes any records found in the ExceptionRecords table, and that query folding is maintained.

Which applied steps should you use?

A

Merge (left anti join) applied step, and then the expand columns applied step
A left anti join ensures that only rows not found in the ExceptionRecords table are loaded, and the expand columns step ensures that query folding is maintained for performance.
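Dataflow Gen2 configures this merge in Power Query, but the relational effect of a left anti join can be sketched in pandas, which has no native anti-join type and instead uses the merge indicator (the data below is made up):

```python
import pandas as pd

# Hypothetical Sales and ExceptionRecords tables (illustrative only).
sales = pd.DataFrame({"RecordID": [1, 2, 3, 4], "Amount": [10, 20, 30, 40]})
exceptions = pd.DataFrame({"RecordID": [2, 4]})

# Left join with indicator=True tags each row as 'both' or 'left_only';
# keeping 'left_only' rows emulates a left anti join, excluding every
# Sales record that also appears in ExceptionRecords.
merged = sales.merge(exceptions, on="RecordID", how="left", indicator=True)
anti = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(anti)  # RecordIDs 1 and 3 remain
```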
