Ingest and transform data Flashcards
Which Microsoft Fabric Real-Time Intelligence component should you use to ingest and transform a stream of real-time data?
Eventstream
What do temporal window transformations enable you to do?
Aggregate event data in a stream based on specific time periods.
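Eventstream applies these windows through its no-code editor; as a rough illustration of the same idea, here is a minimal PySpark Structured Streaming sketch. The built-in rate test source and the column names stand in for a real event feed, and it aggregates the stream over tumbling 5-minute windows.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, avg, col

spark = SparkSession.builder.getOrCreate()

# Built-in test source standing in for an event stream; emits `timestamp` and `value`
events = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 10)
          .load())

# Tumbling (non-overlapping) 5-minute window: aggregate events per time period
windowed = (events
            .groupBy(window(col("timestamp"), "5 minutes"))
            .agg(avg("value").alias("avg_value")))

query = (windowed.writeStream
         .outputMode("complete")
         .format("console")
         .start())
```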
What are the four data ingestion options available in Microsoft Fabric for loading data into a data warehouse?
COPY (Transact-SQL) statement, data pipelines, dataflows, and cross-warehouse ingestion.
What are the supported data sources and file formats for the COPY (Transact-SQL) statement in Warehouse?
Azure Data Lake Storage (ADLS) Gen2 and Azure Blob Storage, with PARQUET and CSV file formats.
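A minimal sketch of running COPY against the warehouse's SQL endpoint from Python, assuming pyodbc is installed; the connection string, storage account, container, path, and table name are all placeholders.

```python
import pyodbc

# Hypothetical connection to the warehouse's SQL endpoint
conn = pyodbc.connect("<connection string to the warehouse SQL endpoint>")

copy_sql = """
COPY INTO dbo.Sales
FROM 'https://<account>.dfs.core.windows.net/<container>/sales/*.parquet'
WITH (
    FILE_TYPE = 'PARQUET'
    -- add a CREDENTIAL clause here if the storage location is not publicly readable
);
"""

cursor = conn.cursor()
cursor.execute(copy_sql)
conn.commit()
```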
What is the recommended minimum file size when working with external data on files in Microsoft Fabric?
At least 4 MB.
What is a data pipeline?
A sequence of activities to orchestrate a data ingestion or transformation process
You want to use a pipeline to copy data to a folder with a specified name for each run. What should you do?
Add a parameter to the pipeline and use it to specify the folder name for each run
You have previously run a pipeline containing multiple activities. What’s the best way to check how long each individual activity took to complete?
View the run details in the run history.
You want to include data in an external Azure Data Lake Store Gen2 location in your lakehouse, without the requirement to copy the data. What should you do?
Create a shortcut.
You want to use Apache Spark to interactively explore data in a file in the lakehouse. What should you do?
Create a notebook.
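A minimal sketch of interactive exploration in a Fabric notebook, assuming the notebook is attached to the lakehouse (so a `spark` session is already available) and a hypothetical CSV file under Files.

```python
# Path is hypothetical: point it at the file you want to explore
df = (spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("Files/raw/orders.csv"))

df.printSchema()
display(df.limit(10))   # display() renders an interactive preview in Fabric notebooks
```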
What is a Dataflow Gen2?
A way to import and transform data with Power Query Online.
Which workload experience lets you create a Dataflow Gen2?
Data Factory.
You need to connect to and transform data to be loaded into a Fabric lakehouse. You aren't comfortable using Spark notebooks, so you decide to use Dataflow Gen2. How would you complete this task?
Open the Data Factory workload > create a Dataflow Gen2 to transform the data > add your lakehouse as the data destination.
Which tool is best suited for data transformation in Fabric when dealing with large-scale data that will continue to grow?
- Dataflows (Gen2)
- Pipelines
- Notebooks
Notebooks
Your company is implementing real-time data processing using Spark Structured Streaming in Microsoft Fabric. Data from IoT devices needs to be stored in a Delta table.
You need to ensure efficient processing of streaming data while preventing errors due to data changes.
What should you do?
Use ignoreChanges.
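A minimal sketch, assuming a Fabric notebook and hypothetical table paths, of reading an existing Delta table as a stream with the ignoreChanges option and appending the results to another Delta table.

```python
# ignoreChanges tolerates updates/deletes in the source Delta table by
# re-emitting rewritten files instead of failing the streaming query.
iot_stream = (spark.readStream
              .format("delta")
              .option("ignoreChanges", "true")
              .load("Tables/iot_raw"))          # hypothetical lakehouse table path

query = (iot_stream.writeStream
         .format("delta")
         .option("checkpointLocation", "Files/checkpoints/iot_clean")
         .outputMode("append")
         .start("Tables/iot_clean"))
```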
Your company has implemented a Microsoft Fabric lakehouse to store and analyze data from multiple sources. The data is used for generating Microsoft Power BI reports and requires regular updates.
You need to ensure that the data in the Microsoft Fabric lakehouse is updated incrementally to reflect changes from the source systems.
Which method should you use to achieve incremental updates?
Implement a watermark strategy.
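One hedged way a watermark strategy can look in a notebook: keep the last-loaded timestamp in a control table, load only rows modified since then, and advance the watermark afterwards. All table and column names here are hypothetical.

```python
from pyspark.sql import functions as F

# Control table holding the last loaded timestamp per source (hypothetical)
last_wm_row = (spark.table("staging_watermark")
               .filter(F.col("source") == "sales_db")
               .select("last_modified")
               .first())
last_wm = last_wm_row["last_modified"] if last_wm_row else "1900-01-01"

# Pull only rows changed since the last load from the (hypothetical) source table
changes = spark.table("source_sales").filter(F.col("ModifiedDate") > F.lit(last_wm))

# Append the delta into the lakehouse table, then advance the watermark
changes.write.format("delta").mode("append").saveAsTable("sales")

new_wm = changes.agg(F.max("ModifiedDate")).first()[0]
if new_wm is not None:
    spark.sql(
        f"UPDATE staging_watermark SET last_modified = '{new_wm}' WHERE source = 'sales_db'"
    )
```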
You are implementing a data warehouse using Microsoft Fabric. You need to integrate data from multiple sources, including Microsoft Azure Data Lake Storage Gen2 and Microsoft SQL Server databases.
You need to design a process to efficiently load data into tables while ensuring data quality and consistency.
Each correct answer presents part of the solution. Which two actions should you take?
Use Data Factory pipelines for ETL orchestration and T-SQL execution.
Use dataflows to ingest and transform data from Azure Data Lake Storage Gen2.
Your company uses a Microsoft Fabric data warehouse to store frequently updated customer transaction data.
You need to design an ETL process that minimizes load on source systems while ensuring only new or changed data is loaded.
What should you do?
Use Change Data Capture (CDC) for tracking source data changes.
You are implementing a data warehouse solution using Microsoft Fabric. The data warehouse integrates data from multiple sources, including Microsoft Azure Data Lake Storage Gen2 and Microsoft SQL Server databases. You need to transform and load the data into dimensional model tables for reporting purposes.
Design an ETL process that efficiently loads data while ensuring high data quality and consistency.
Each correct answer presents part of the solution. Which three actions should you perform?
Stage data before loading.
Transform data to match the model.
Use Data Factory pipelines.
A company uses a lakehouse architecture with Microsoft Fabric. The data engineering team needs to transform large datasets in Delta format for machine learning.
You need to perform data transformations efficiently using Microsoft Fabric tools.
What should you use?
Apache Spark notebooks.
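A minimal Spark notebook sketch, with hypothetical table paths and columns, that reads a Delta dataset, derives aggregate features for machine learning, and writes the result back in Delta format.

```python
from pyspark.sql import functions as F

# Hypothetical Delta table of raw telemetry
raw = spark.read.format("delta").load("Tables/telemetry_raw")

# Derive per-device daily features
features = (raw
            .withColumn("reading_date", F.to_date("reading_ts"))
            .groupBy("device_id", "reading_date")
            .agg(F.avg("temperature").alias("avg_temp"),
                 F.max("humidity").alias("max_humidity")))

# Write the result back as Delta, partitioned for efficient downstream reads
(features.write
 .format("delta")
 .mode("overwrite")
 .partitionBy("reading_date")
 .save("Tables/telemetry_features"))
```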
You are using Microsoft Fabric to manage data ingestion and transformation. Your task is to set up a data pipeline to ingest batch data from multiple CSV files stored in Azure Blob Storage.
You need to ensure the process is efficient and handles errors effectively.
Each correct answer presents part of the solution. Which three actions should you perform?
Specify a location for rejected rows.
Use the COPY statement to load data.
Use wildcards in the path to load multiple files.
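A hedged sketch combining those three elements in one COPY statement executed from Python; the connection string, storage account, container, and table names are placeholders.

```python
import pyodbc

# Hypothetical connection to the warehouse's SQL endpoint
conn = pyodbc.connect("<connection string to the warehouse SQL endpoint>")

copy_csv_sql = """
COPY INTO dbo.StagedOrders
FROM 'https://<account>.blob.core.windows.net/<container>/orders/2024/*.csv'
WITH (
    FILE_TYPE = 'CSV',
    FIRSTROW = 2,                  -- skip the header row
    FIELDTERMINATOR = ',',
    ERRORFILE = 'https://<account>.blob.core.windows.net/<container>/rejected/'
);
"""

cursor = conn.cursor()
cursor.execute(copy_csv_sql)
conn.commit()
```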
You use Microsoft Fabric to manage data across warehouses and lakehouses.
You need to integrate data from a warehouse and a lakehouse into a single table for analysis.
What should you use?
CREATE TABLE AS SELECT (CTAS) statement.
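A minimal CTAS sketch, assuming the warehouse and lakehouse share a workspace so the lakehouse table can be referenced by a three-part name through its SQL analytics endpoint; all object names are hypothetical.

```python
import pyodbc

# Hypothetical connection to the warehouse's SQL endpoint
conn = pyodbc.connect("<connection string to the warehouse SQL endpoint>")

ctas_sql = """
CREATE TABLE dbo.CustomerSales AS
SELECT o.OrderID, o.Amount, c.CustomerName
FROM dbo.Orders AS o                      -- warehouse table
JOIN SalesLakehouse.dbo.Customers AS c    -- lakehouse table via cross-database query
    ON o.CustomerID = c.CustomerID;
"""

cursor = conn.cursor()
cursor.execute(ctas_sql)
conn.commit()
```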
Your organization uses Microsoft Fabric to process real-time IoT data monitoring environmental conditions. The data includes temperature and humidity readings streamed into a Microsoft KQL database.
You need to ensure efficient data ingestion and near real-time querying for reporting.
What should you do?
Implement Spark Structured Streaming to write data to a Delta table.
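A minimal Structured Streaming sketch, using the built-in rate source as a stand-in for the IoT feed and hypothetical paths, that lands readings in a Delta table on a short trigger so they are queryable in near real time.

```python
# Built-in test source standing in for the real IoT stream
readings = (spark.readStream
            .format("rate")
            .option("rowsPerSecond", 100)
            .load()
            .withColumnRenamed("value", "sensor_value"))

# Append the stream to a Delta table; a short trigger keeps latency low
query = (readings.writeStream
         .format("delta")
         .option("checkpointLocation", "Files/checkpoints/iot_readings")
         .trigger(processingTime="30 seconds")
         .outputMode("append")
         .start("Tables/iot_readings"))
```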