DP-203 Flashcards

1
Q

You have a table in an Azure Synapse Analytics dedicated SQL pool. The table was created by using the following Transact-SQL statement.

You need to alter the table to meet the following requirements:
✑ Ensure that users can identify the current manager of employees.
✑ Support creating an employee reporting hierarchy for your entire company.
✑ Provide fast lookup of the managers’ attributes such as name and job title.
Which column should you add to the table?
A. [ManagerEmployeeID] [smallint] NULL
B. [ManagerEmployeeKey] [smallint] NULL
C. [ManagerEmployeeKey] [int] NULL
D. [ManagerName] varchar NULL

A

C. [ManagerEmployeeKey] [int] NULL
Explain: We need an extra column to identify the manager. Use the same data type as the EmployeeKey column, which is an int.

2
Q

You have an Azure Synapse workspace named MyWorkspace that contains an Apache Spark database named mytestdb.
You run the following command in an Azure Synapse Analytics Spark pool in MyWorkspace.
CREATE TABLE mytestdb.myParquetTable(
EmployeeID int,
EmployeeName string,
EmployeeStartDate date)

USING Parquet
You then use Spark to insert a row into mytestdb.myParquetTable. The row contains the following data.
EmployeeName: Alice
EmployeeID: 24
EmployeeStartDate: 2020-01-25
One minute later, you execute the following query from a serverless SQL pool in MyWorkspace.

SELECT EmployeeID
FROM mytestdb.dbo.myParquetTable
WHERE EmployeeName = 'Alice';
What will be returned by the query?
A. 24
B. an error
C. a null value

A

An error
Explanation: Table names are converted to lowercase and must be queried using the lowercase name.

3
Q

Which role works with Azure Cognitive Services, Cognitive Search, and the Bot Framework?
(Azure Databricks)
A data engineer
A data scientist
An AI engineer

A

An AI engineer

4
Q

Azure Databricks encapsulates which Apache Storage technology?
(Azure Databricks)
Apache HDInsight
Apache Hadoop
Apache Spark

A

Apache Spark
Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform.

5
Q

Which security feature does Azure Databricks not support?
Azure Active Directory
Shared Access Keys
Role-based access

A

Shared Access Keys
Shared Access Keys are a security feature used within Azure storage accounts.

6
Q

Which of the following Azure Databricks components provides support for R, SQL, Python, Scala, and Java?
MLlib
GraphX
Spark Core API

A

Spark Core API

7
Q

Which Notebook format is used in Databricks?

DBC
.notebook
.spark

A

DBC
DBC is the supported Databricks notebook archive format. There is no .notebook or .spark file format available.

8
Q

Which browsers are recommended for best use with Databricks Notebook?
Chrome and Firefox
Microsoft Edge and IE 11
Safari and Microsoft Edge

A

Chrome and Firefox
Microsoft Edge and IE 11 are not recommended because of faulty rendering of iFrames, but Safari is also an acceptable browser.

9
Q

How do you connect your Spark cluster to the Azure Blob?
By calling the .connect() function on the Spark Cluster.
By mounting it
By calling the .connect() function on the Azure Blob

A

By mounting it
Mounts require Azure credentials such as SAS keys and give access to a virtually infinite store for your data. The .connect() function is not a valid method.
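For illustration, a minimal sketch of a mount in a Databricks notebook (Python); the container, storage account, mount point, and secret scope/key names are assumptions, and dbutils/spark are available only inside Databricks:
# Hedged sketch: mount an Azure Blob container onto DBFS using a SAS token kept in a secret scope.
dbutils.fs.mount(
    source="wasbs://mycontainer@mystorageaccount.blob.core.windows.net",
    mount_point="/mnt/mydata",
    extra_configs={
        "fs.azure.sas.mycontainer.mystorageaccount.blob.core.windows.net":
            dbutils.secrets.get(scope="my-scope", key="my-sas-token")
    }
)
# Once mounted, the container is readable like any other DBFS path.
df = spark.read.csv("/mnt/mydata/sales.csv", header=True)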

10
Q

How does Spark connect to databases like MySQL, Hive and other data stores?
JDBC
ODBC
Using the REST API Layer

A

JDBC
JDBC stands for Java Database Connectivity and is a Java API for connecting to databases such as MySQL, Hive, and other data stores. ODBC is not an option, and there is no REST API layer for this purpose.
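For illustration, a minimal PySpark sketch of a JDBC read; the host, database, table, and credential names are assumptions, and the matching JDBC driver jar must be attached to the cluster:
# Hedged sketch: read a MySQL table over JDBC into a Spark DataFrame.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://myserver.example.com:3306/sales")   # assumed server/database
      .option("dbtable", "orders")                                     # assumed table
      .option("user", "spark_reader")
      .option("password", dbutils.secrets.get(scope="my-scope", key="mysql-pwd"))
      .option("driver", "com.mysql.cj.jdbc.Driver")
      .load())
df.show(5)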

11
Q

How do you specify parameters when reading data?
Using .option() during your read allows you to pass key/value pairs specifying aspects of your read
Using .parameter() during your read allows you to pass key/value pairs specifying aspects of your read
Using .keys() during your read allows you to pass key/value pairs specifying aspects of your read

A

Using .option()
Using .option() during your read allows you to pass key/value pairs specifying aspects of your read. For instance, options for reading CSV data include header, delimiter, and inferSchema.
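For illustration, a minimal PySpark sketch (the path is an assumption) showing those CSV options:
df = (spark.read
      .option("header", "true")       # first row contains column names
      .option("delimiter", ",")       # field separator
      .option("inferSchema", "true")  # sample the file to guess column types
      .csv("/mnt/mydata/sales.csv"))  # assumed path
df.printSchema()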

12
Q

By default, how are corrupt records dealt with using spark.read.json()
They appear in a column called “_corrupt_record”
They get deleted automatically
They throw an exception and exit the read operation

A

They appear in a column called “_corrupt_record”
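For illustration, a minimal PySpark sketch (the path is an assumption); with the default PERMISSIVE mode, malformed JSON lines are kept and surfaced in _corrupt_record rather than failing the read:
df = spark.read.json("/mnt/raw/events.json")   # assumed path
df.printSchema()    # includes _corrupt_record only if malformed lines were found
df.show(truncate=False)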

13
Q

What is the recommended storage format to use with Spark?
JSON
XML
Apache Parquet

A

Apache Parquet
Apache Parquet is a highly optimized solution for data storage and is the recommended option for storage.
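For illustration, a minimal PySpark sketch (paths are assumptions) of persisting a DataFrame as Parquet and reading it back; Parquet keeps the schema and stores data column by column, which speeds up analytic reads:
df = spark.read.csv("/mnt/mydata/sales.csv", header=True, inferSchema=True)   # assumed path
df.write.mode("overwrite").parquet("/mnt/curated/sales_parquet")              # assumed path
spark.read.parquet("/mnt/curated/sales_parquet").show(5)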

14
Q

You need to develop a pipeline for processing data. The pipeline must meet the following requirements:
Scale up and down resources for cost reduction
Use an in-memory data processing engine to speed up ETL and machine learning operations.
Use streaming capabilities
Provide the ability to code in SQL, Python, Scala, and R
Integrate workspace collaboration with Git
What should you use?

HDInsight Spark Cluster
Azure Stream Analytics
HDInsight Hadoop Cluster
Azure SQL Data Warehouse
HDInsight Kafka Cluster
HDInsight Storm Cluster

A

HDInsight Spark Cluster

15
Q

You plan to perform batch processing in Azure Databricks once daily.

Which type of Databricks cluster should you use?

job
interactive
High Concurrency

A

Job

16
Q

You are a data engineer implementing a lambda architecture on Microsoft Azure. You use an open-source big data solution to collect, process, and maintain data. The analytical data store performs poorly.
You must implement a solution that meets the following requirements:
Provide data warehousing
Reduce ongoing management activities
Deliver SQL query responses in less than one second
You need to create an HDInsight cluster to meet the requirements.

Which type of cluster should you create?

Apache HBase
Apache Hadoop
Interactive Query
Apache Spark

A

Apache Spark
Apache Spark for Azure HDInsight is a processing framework that runs large-scale data analytics applications.
Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch processing and stream processing methods, and minimizing the latency involved in querying big data.

17
Q

You plan to perform batch processing in Azure Databricks once daily.

Which type of Databricks cluster should you use?
High Concurrency
interactive
automated

A

Automated
You use interactive clusters to analyze data collaboratively with interactive notebooks. You use automated clusters to run fast and robust automated jobs.

18
Q

Your company plans to create an event processing engine to handle streaming data from Twitter.
The data engineering team uses Azure Event Hubs to ingest the streaming data.
You need to implement a solution that uses Azure Databricks to receive the streaming data from the Azure Event Hubs.

Which three actions should you recommend be performed in sequence?
A. Create and configure a Notebook that consumes the streaming data.
B. Import data from Blob storage
C. Use Environment variables to define the Apache Spark connection.
D. Configure the JDBC or ODBC connector
E. Deploy Azure Databricks service
F. Deploy a Spark cluster and then attach the required libraries to the cluster.

A

E-F-A

19
Q

You are developing a solution using a Lambda architecture on Microsoft Azure.
The data at rest layer must meet the following requirements:

Data storage:
- Serve as a repository for high volumes of large files in various formats.
- Implement optimized storage for big data analytics workloads.
- Ensure that data can be organized using a hierarchical structure.

Batch processing:
- Use a managed solution for in-memory computation processing.
- Natively support Scala, Python, and R programming languages.
- Provide the ability to resize and terminate the cluster automatically.

Analytical data store:
- Support parallel processing.
- Use columnar storage.
- Support SQL-based languages.

You need to identify the correct technologies to build the Lambda architecture.

Which technologies should you use?
Data Storage:
A. Azure SQL Database
B. Azure Blob Storage
C. Azure Cosmos DB
D. Azure Data Lake
Batch Processing:
A. HDInsight Spark
B. HDInsight Hadoop
C. Azure Databricks
D. HDInsight Interactive Query
Analytical data store:
A. HDInsight HBase
B. Azure SQL Data Warehouse
C. Azure Analysis Services
D. Azure Cosmos DB

A

D-A-B
A key mechanism that allows Azure Data Lake Storage Gen2 to provide file system performance at object storage scale and prices is the addition of a hierarchical namespace.
Apache Spark is an open-source, parallel-processing framework that supports in-memory processing to boost the performance of big-data analysis applications.
HDInsight is a managed Hadoop service. Use it to deploy and manage Hadoop clusters in Azure. For batch processing, you can use Spark, Hive, Hive LLAP, or MapReduce.
SQL Data Warehouse is a cloud-based Enterprise Data Warehouse (EDW) that uses Massively Parallel Processing (MPP).
SQL Data Warehouse stores data into relational tables with columnar storage.

20
Q

You create an Azure Databricks cluster and specify an additional library to install.
When you attempt to load the library to a notebook, the library is not found.
You need to identify the cause of the issue.

What should you review?

workspace logs
notebook logs
global init scripts logs
cluster event logs

A

global init scripts logs
Init scripts are shell scripts that run during the startup of each cluster node before the Spark driver or worker JVM starts. Databricks customers use init scripts for various purposes such as installing custom libraries, launching background processes, or applying enterprise security policies.

21
Q

You need to collect application metrics, streaming query events, and application log messages for an Azure Databricks cluster.
Which type of library and workspace should you implement?
Library:
A. Azure Databricks Monitoring Library
B. Azure Management Monitoring Library
C. PyTorch
D. TensorFlow
Workspace:
A. Azure Databricks
B. Azure Log Analytics
C. Azure Machine Learning

A

A-B
You can send application logs and metrics from Azure Databricks to a Log Analytics workspace. It uses the Azure Databricks Monitoring Library, which is available on GitHub.

22
Q

You have an Azure Databricks resource.
You need to log actions that relate to compute changes triggered by the Databricks resources.
Which Databricks services should you log?

workspace
SSH
DBFS
clusters
jobs

A

Clusters
An Azure Databricks cluster is a set of computation resources and configurations on which you run data engineering, data science, and data analytics workloads.

23
Q

Your company analyzes images from security cameras and sends alerts to security teams that respond to unusual activity. The solution uses Azure Databricks.
You need to send Apache Spark level events, Spark Structured Streaming metrics, and application metrics to Azure Monitor.
Which three actions should you perform in sequence?
A. Create a data source in Azure Monitor
B. Configure the Databricks cluster to use the databricks monitoring library.
C. Deploy Grafana to open an Azure VM.
D. Build a spark-listeners-loganalytics-1.0-SNAPSHOT.jar Jar file.
E. Create Dropwizard counters in the application code.

A

B-D-E
You can send application logs and metrics from Azure Databricks to a Log Analytics workspace.
Spark uses a configurable metrics system based on the Dropwizard Metrics Library.
Prerequisites: Configure your Azure Databricks cluster to use the monitoring library.

24
Q

You plan to build a structured streaming solution in Azure Databricks. The solution will count new events in five-minute intervals and report only events that arrive during the interval. The output will be sent to a Delta Lake table.

Which output mode should you use?
complete
update
append

A

Append
Append Mode: Only new rows appended in the result table since the last trigger are written to external storage. This is applicable only for the queries where existing rows in the Result Table are not expected to change.
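For illustration, a minimal PySpark Structured Streaming sketch that counts events per five-minute window and appends each completed window to a Delta table; the rate source is a stand-in and the paths are assumptions:
from pyspark.sql.functions import window, col

events = spark.readStream.format("rate").load()     # stand-in source emitting (timestamp, value) rows

counts = (events
          .withWatermark("timestamp", "5 minutes")                # required for append mode with aggregation
          .groupBy(window(col("timestamp"), "5 minutes"))
          .count())

query = (counts.writeStream
         .format("delta")
         .outputMode("append")                                    # each window is written once, after it closes
         .option("checkpointLocation", "/mnt/checkpoints/event_counts")   # assumed path
         .start("/mnt/delta/event_counts"))                               # assumed path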

25
Q

You have an Azure Data Lake Storage Gen2 account that contains JSON files for customers. The files contain two attributes named FirstName and LastName.
You need to copy the data from the JSON files to an Azure Synapse Analytics table by using Azure Databricks. A new column must be created that concatenates the FirstName and LastName values.
You create the following components:
- A destination table in Azure Synapse
- An Azure Blob storage container
- A service principal
Which five actions should you perform in sequence next in a Databricks notebook?
A. Specify a temporary folder to stage the data
B. Write the results to a table in Azure Synapse
C. Write the results to Data Lake Storage.
D. Drop the data frame
E. Perform transformations on the data frame
F. Mount the Data Lake Storage onto DBFS
G. Perform transformations on the file
H. Read the file into a data frame

A

H-E-A-B-D
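For illustration only, a minimal PySpark sketch of that H-E-A-B-D sequence; the storage, JDBC, and table names are assumptions, and the Azure Synapse (formerly SQL DW) connector stages data through the temporary Blob folder given in tempDir:
from pyspark.sql.functions import concat_ws

df = spark.read.json("abfss://customers@mydatalake.dfs.core.windows.net/json/")   # H: read the files into a data frame (assumed path)
df = df.withColumn("FullName", concat_ws(" ", "FirstName", "LastName"))           # E: transform the data frame

(df.write
   .format("com.databricks.spark.sqldw")
   .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydw")   # assumed server/database
   .option("tempDir", "wasbs://staging@myblobaccount.blob.core.windows.net/tmp")         # A: temporary staging folder (assumed)
   .option("forwardSparkAzureStorageCredentials", "true")
   .option("dbTable", "dbo.Customers")                                                   # B: write to the Azure Synapse table (assumed name)
   .mode("append")
   .save())

del df                                                                                    # D: drop the data frame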

26
Q

You are designing an Azure Databricks interactive cluster.
You need to ensure that the cluster meets the following requirements:
- Enable auto-termination
- Retain cluster configuration indefinitely after cluster termination.

What should you recommend?

Start the cluster after it is terminated.
Pin the cluster
Clone the cluster after it is terminated.
Terminate the cluster manually at process completion.

A

Pin the cluster
To keep an interactive cluster configuration even after it has been terminated for more than 30 days, an administrator can pin a cluster to the cluster list.

27
Q

You are designing an Azure Databricks table. The table will ingest an average of 20 million streaming events per day.
You need to persist the events in the table for use in incremental load pipeline jobs in Azure Databricks. The solution must minimize storage costs and incremental load times.

What should you include in the solution?

Partition by DateTime fields.
Sink to Azure Queue storage.
Include a watermark column.
Use a JSON format for physical data storage.

A

Sink to Azure Queue storage
The Databricks ABS-AQS connector uses Azure Queue Storage (AQS) to provide an optimized file source that lets you find new files written to an Azure Blob storage (ABS) container without repeatedly listing all of the files. This provides two major advantages:
- Lower latency: no need to list nested directory structures on ABS, which is slow and resource intensive.
- Lower costs: no more costly LIST API requests made to ABS.

28
Q

You are designing an Azure Databricks cluster that runs user-defined local processes.
You need to recommend a cluster configuration that meets the following requirements:
- Minimize query latency.
- Maximize the number of users that can run queries on the cluster at the same time.
- Reduce overall costs without compromising other requirements.

Which cluster type should you recommend?
(Azure Databricks)
Standard with Auto Termination
High Concurrency with Autoscaling
High Concurrency with Auto Termination
Standard with Autoscaling

A

High Concurrency with Autoscaling.
The key benefits of High Concurrency clusters are that they provide fine-grained sharing for maximum resource utilization and minimum query latencies.

29
Q

You are planning a solution to aggregate streaming data that originates in Apache Kafka and is output to Azure Data Lake Storage Gen2. The developers who will implement the stream processing solution use Java.

Which service should you recommend using to process the streaming data?
(Azure Databricks)
Azure Event Hubs
Azure Data Factory
Azure Stream Analytics
Azure Databricks

A

Azure Databricks

30
Q

You need to implement an Azure Databricks cluster that automatically connects to Azure Data Lake Storage Gen2 by using Azure Active Directory (Azure AD) integration.
How should you configure the new cluster?
Tier:
A. Premium
B. Standard
Advanced option to enable:
A. Azure Data Lake Storage credential passthrough.
B. Table Access Control

A

Premium - Azure Data Lake Storage credential passthrough.
Credential passthrough requires an Azure Databricks Premium Plan
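For illustration, a minimal sketch (account, container, and path are assumptions) of reading ADLS Gen2 on a cluster that has credential passthrough enabled, so the notebook user's Azure AD identity is used and no keys appear in the code:
df = spark.read.csv(
    "abfss://data@mydatalake.dfs.core.windows.net/sales/2021/",   # assumed account/container/path
    header=True)
df.show(5)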

31
Q

You create an Azure Databricks cluster and specify an additional library to install.
When you attempt to load the library to a notebook, the library is not found.
You need to identify the cause of the issue.
What should you review?
A. notebook logs
B. cluster event logs
C. global init scripts logs
D. workspace logs

A

cluster event logs
Cluster event logs capture cluster lifecycle events such as creation, termination, and configuration edits.
Apache Spark driver and worker logs can be used for debugging.
Cluster init-script logs are valuable for debugging init scripts.

32
Q

You are planning a streaming data solution that will use Azure Databricks. The solution will stream sales transaction data from an online store. The solution has the following specifications:
- The output data will contain items purchased, quantity, line total sales amount, and line total tax amount.
- Line total sales amount and line total tax amount will be aggregated in Databricks.
- Sales transactions will never be updated. Instead, new rows will be added to adjust a sale.

You need to recommend an output mode for the dataset that will be processed by using Structured Streaming. The solution must minimize duplicate data.

What should you recommend?

Update
Complete
Append

A

Append

33
Q

Which Azure Data Factory component contains the transformation logic or the analysis commands of the Azure Data Factory’s work?

Linked Services
Datasets
Activities
Pipelines

A

Activities
Activities contain the transformation logic or the analysis commands of Azure Data Factory's work.
Linked Services are objects that are used to define the connection to data stores or compute resources in Azure.
Datasets represent data structures within the data store that is being referenced by the Linked Service object.
Pipelines are a logical grouping of activities.

34
Q

You have a new Azure Data Factory environment.
You need to periodically analyze pipeline executions from the last 60 days to identify trends in execution durations. The solution must use Azure Log Analytics to query the data and create charts.

Which diagnostic settings should you configure in Data Factory?

A

Log type: Pipeline runs
Storage location: An Azure storage account

35
Q

You have an activity in an Azure Data Factory pipeline. The activity calls a stored procedure in a data warehouse in Azure Synapse Analytics and runs daily.
You need to verify the duration of the activity when it ran last.
What should you use?

A. The sys.dm_pdw_wait_stats dynamic management view in Azure Synapse Analytics
B. An Azure Resource Manager template
C. Activity runs in Azure Monitor
D. Activity log in Azure Synapse Analytics

A

Activity runs in Azure Monitor

36
Q

You plan to monitor an Azure data factory by using the Monitor & Manage app.
You need to identify the status and duration of activities that reference a table in a source database.
Which three actions should you perform in sequence?
A. From the data factory monitoring app, add the source user property to the activity runs table.
B. From the data factory monitoring app, add the source user property to the pipeline runs table.
C. From the data factory authoring UI, publish the pipelines.
D. From the data factory monitoring app, add a linked service to the pipeline runs table.
E. From the data factory authoring UI, generate a user property for source on all activities.
F. From the data factory authoring UI, generate a user property for source on all the datasets.

A

E-B-C

37
Q

You have an Azure subscription that contains an Azure Storage account.
You plan to implement changes to a data storage solution to meet regulatory and compliance standards.
Every day, Azure needs to identify and delete blobs that were NOT modified during the last 100 days.

Solution: You schedule an Azure Data Factory pipeline with a delete activity.

Yes
No

A

Yes
You can use the Delete Activity in Azure Data Factory to delete files or folders from on-premises storage stores or cloud storage stores.
Azure Blob storage is supported.
Note: You can also apply an Azure Blob storage lifecycle policy.

38
Q

You use Azure Data Factory to prepare data to be queried by Azure Synapse Analytics serverless SQL pools.
Files are initially ingested into an Azure Data Lake Storage Gen2 account as 10 small JSON files. Each file contains the same data attributes and data from a subsidiary of your company.

You need to move the files to a different folder and transform the data to meet the following requirements:
- Provide the fastest possible query times.
- Automatically infer the schema from the underlying files.

How should you configure the Data Factory copy activity?
Copying behavior:
A. Flatten hierarchy
B. Merge files
C. Preserve hierarchy
Sink file type:
A. CSV
B. Json
C. Parquet
D. Text

A

Merge files & Parquet
1) Merge files - The question says the data is initially ingested as 10 small JSON files, and there is no hint of hierarchy or partition information, so the files should be merged for better query performance.

2) Parquet - A columnar format that always gives better performance for this kind of data.

39
Q

You are creating an Azure Data Factory data flow that will ingest data from a CSV file, cast columns to specified types of data, and insert the data into a table in an Azure Synapse Analytics dedicated SQL pool. The CSV file contains three columns named username, comment, and date.
The data flow already contains the following:
- A source transformation.
- A Derived Column transformation to set the appropriate types of data.
- A sink transformation to land the data in the pool.

You need to ensure that the data flow meets the following requirements:
- All valid rows must be written to the destination table.
- Truncation errors in the comment column must be avoided proactively.
- Any rows containing comment values that will cause truncation errors upon insert must be written to a file in blob storage.

Which two actions should you perform?

To the data flow, add a sink transformation to write the rows to a file in blob storage.
To the data flow, add a Conditional Split transformation to separate the rows that will cause truncation errors.
To the data flow, add a filter transformation to filter out rows that will cause truncation errors.
Add a select transformation to select only the rows that will cause truncation errors.

A

To the data flow, add a sink transformation to write the rows to a file in blob storage.

To the data flow, add a Conditional Split transformation to separate the rows that will cause truncation errors.

40
Q

You build an Azure Data Factory pipeline to move data from an Azure Data Lake Storage Gen2 container to a database in an Azure Synapse Analytics dedicated SQL pool.
Data in the container is stored in the following folder structure.
/in/{YYYY}/{MM}/{DD}/{HH}/{mm}
The earliest folder is /in/2021/01/01/00/00. The latest folder is /in/2021/01/15/01/45.

You need to configure a pipeline trigger to meet the following requirements:
- Existing data must be loaded.
- Data must be loaded every 30 minutes.
- Late-arriving data of up to two minutes must be included in the load for the time at which the data should have arrived.

How should you configure the pipeline trigger?
Type:
A. Event
B. On demand
C. Schedule
D. Tumbling window
Additional properties:
A. Prefix /in/, Event: Blob created
B. Recurrence: 30 minutes, Start time: 2021-01-01T00:00
C. Recurrence: 30 minutes, Start time: 2021-01-01T00:00, Delay: 2 minutes.
D. Recurrence: 32 minutes, Start time: 2021-01-15T01:45

A

Tumbling window
Recurrence: 30 minutes, Start time: 2021-01-01T00:00, Delay: 2 minutes.
Explain: To be able to use the Delay parameter, select the tumbling window trigger.

41
Q

You have two Azure Data Factory instances named ADFdev and ADFprod. ADFdev connects to an Azure DevOps Git repository.
You publish changes from the main branch of the Git repository to ADFdev.
You need to deploy the artifacts from ADFdev to ADFprod.

What should you do first?
(Azure Data Factory)
From ADFdev, modify the Git configuration.
From ADFdev, create a linked service.
From Azure DevOps, create a release pipeline.
From Azure DevOps, update the main branch.

A

From Azure DevOps, create a release pipeline.

43
Q

You have a self-hosted integration runtime in Azure Data Factory.
The current status of the integration runtime has the following configurations:
- Status: Running
- Type: Self-Hosted
- Version: 4.4.7292.1
- Running / Registered Node(s): 1/1
- High Availability Enabled: False
- Linked Count: 0
- Queue Length: 0
- Average Queue Duration. 0.00s

The integration runtime has the following node details:
- Name: X-M
- Status: Running
- Version: 4.4.7292.1
- Available Memory: 7697MB
- CPU Utilization: 6%
- Network (In/Out): 1.21KBps/0.83KBps
- Concurrent Jobs (Running/Limit): 2/14
- Role: Dispatcher/Worker
- Credential Status: In Sync
If the X-M node becomes unavailable, all executed pipelines will:
A. Fail until the node comes back online.
B. Switch to another integration runtime
C. Exceed the CPU limit.
The number of concurrent jobs and the CPU usage indicate the concurrent jobs value should be:
A. Raised
B. Lowered
C. left as is

A

Fail until the node comes back online
Lowered

44
Q

You have an Azure data factory.
You need to examine the pipeline failures from the last 60 days.

What should you use?

A. The Activity log blade for the Data Factory resource
B. The Monitor & Manage app in Data Factory
C. The Resource health blade for the Data Factory resource
D. Azure Monitor

A

Azure Monitor
Data Factory stores pipeline-run data for only 45 days. Use Azure Monitor if you want to keep that data for a longer time.

45
Q

A company plans to develop solutions to perform batch processing of multiple sets of geospatial data.
You need to implement the solutions.

Which Azure services should you use?

A

Box 1: HDInsight Tools for Visual Studio
Azure HDInsight Tools for Visual Studio Code is an extension in the Visual Studio Code Marketplace for developing Hive Interactive Query, Hive Batch Job and PySpark Job against Microsoft HDInsight.

Box 2: Hive View
You can use Apache Ambari Hive View with Apache Hadoop in HDInsight. The Hive View allows you to author, optimize, and run Hive queries from your web browser.

Box 3: HDInsight REST API
Azure HDInsight REST APIs are used to create and manage HDInsight resources through Azure Resource Manager.

46
Q

You have an Azure Data Factory pipeline that performs an incremental load of source data to an Azure Data Lake Storage Gen2 account.
Data to be loaded is identified by a column named LastUpdatedDate in the source table.
You plan to execute the pipeline every four hours.
You need to ensure that the pipeline execution meets the following requirements:
- Automatically retries the execution when the pipeline run fails due to concurrency or throttling limits.
- Supports backfilling existing data in the table.

Which type of trigger should you use?

A. event
B. on-demand
C. schedule
D. tumbling window

A

Tumbling window
Tumbling window trigger: can be used to process historical data (backfill) and also supports Delay, Max concurrency, retry policy, and other settings.

47
Q

You are building an Azure Data Factory solution to process data received from Azure Event Hubs, and then ingested into an Azure Data Lake Storage Gen2 container.
The data will be ingested every five minutes from devices into JSON files. The files have the following naming pattern.
/{deviceType}/in/{YYYY}/{MM}/{DD}/{HH}/{deviceID}{YYYY}{MM}{DD}{HH}{mm}.json
You need to prepare the data for batch data processing so that there is one dataset per hour per deviceType. The solution must minimize read times.
How should you configure the sink for the copy activity?
Parameter:
@pipeline().TriggerTime
@pipeline().TriggerType
@trigger().outputs.windowStartTime
@trigger().startTime
Naming pattern:
/{deviceID}/out/{YYYY}/{MM}/{DD}/{HH}.json
/{YYYY}/{MM}/{DD}/{deviceType}.json
/{YYYY}/{MM}/{DD}/{HH}.json
/{YYYY}/{MM}/{DD}/{HH}_{deviceType}.json
Copy behavior:
Add dynamic content
Flatten hierarchy
Merge files

A

@trigger().outputs.windowStartTime
/{YYYY}/{MM}/{DD}/{HH}_{deviceType}.json
Merge files
Explain:
1) @trigger().outputs.windowStartTime - this output is from a tumbling window trigger, and is required to identify the correct directory at the /{HH}/ level. Using windowStartTime will give the hour with complete data. The @trigger().startTime is for a schedule trigger, which corresponds to the hour for which data has not arrived yet.
2) /{YYYY}/{MM}/{DD}/{HH}_{deviceType}.json is the naming pattern to achieve an hourly dataset for each device type.
3) Multiple files for each device type will exist on the source side, since the naming pattern starts with {deviceID}… so the files must be merged in the sink to create a single file per device type.

48
Q

You have the following Azure Data Factory pipelines:
- Ingest Data from System1
- Ingest Data from System2
- Populate Dimensions
- Populate Facts

Ingest Data from System1 and Ingest Data from System2 have no dependencies. Populate Dimensions must execute after Ingest Data from System1 and Ingest Data from System2. Populate Facts must execute after Populate Dimensions pipeline. All the pipelines must execute every eight hours.

What should you do to schedule the pipelines for execution?
(Azure Data Factory)
Add an event trigger to all four pipelines.
Add a schedule trigger to all four pipelines.
Create a parent pipeline that contains the four pipelines and use a schedule trigger.
Create a parent pipeline that contains the four pipelines and use an event trigger.

A

Create a parent pipeline that contains the four pipelines and use a schedule trigger.

49
Q

You have an Azure Data Factory instance that contains two pipelines named Pipeline1 and Pipeline2.
Pipeline1 has the activities shown in the following exhibit.
Stored procedure1 -> (x) Set variable 1
Pipeline2 has the activities shown in the following exhibit.
Execute pipeline1 -> (x) Set variable1

You execute Pipeline2, and Stored procedure1 in Pipeline1 fails.
What is the status of the pipeline runs?
Pipeline1 and Pipeline2 succeeded.
Pipeline1 and Pipeline2 failed.
Pipeline1 succeeded and Pipeline2 failed.
Pipeline1 failed and Pipeline2 succeeded.

A

Pipeline1 and Pipeline2 succeeded.
Explain: Activities are linked together via dependencies. A dependency has a condition of one of the following: Succeeded, Failed, Skipped, or Completed.
Consider Pipeline1:
If we have a pipeline with two activities where Activity2 has a failure dependency on Activity1, the pipeline will not fail just because Activity1 failed. If Activity1 fails and Activity2 succeeds, the pipeline will succeed. This scenario is treated as a try-catch block by Data Factory.

The failure dependency means this pipeline reports success.
Note:
If we have a pipeline containing Activity1 and Activity2, and Activity2 has a success dependency on Activity1, it will only execute if Activity1 is successful. In this scenario, if Activity1 fails, the pipeline will fail.

The trick is that Pipeline1 only has a Failure dependency between the activities, which results in a succeeded pipeline when the stored procedure fails. If the Success output were also linked to a follow-up activity and the stored procedure failed, the pipeline would indeed be marked as failed.

50
Q

You have an Azure Data Factory instance named ADF1 and two Azure Synapse Analytics workspaces named WS1 and WS2.
ADF1 contains the following pipelines:
● P1: Uses a copy activity to copy data from a nonpartitioned table in a dedicated SQL pool of WS1 to an Azure Data Lake Storage Gen2 account
● P2: Uses a copy activity to copy data from text-delimited files in an Azure Data Lake Storage Gen2 account to a nonpartitioned table in a dedicated SQL pool of WS2

You need to configure P1 and P2 to maximize parallelism and performance.
Which dataset settings should you configure for the copy activity of each pipeline?
P1:
Set the copy method to Bulk insert
Set the copy method to Polybase
Set the isolation level to Repeatable read
Set the partition option to dynamic range
P2:
Set the copy method to Bulk insert
Set the copy method to Polybase
Set the isolation level to Repeatable read
Set the partition option to dynamic range

A

Set the copy method to Polybase
Set the copy method to Polybase
PolyBase supports both export to and import from ADLS.
PolyBase does support delimited text files, which contradicts the question’s official answer. “Currently PolyBase can load data from UTF-8 and UTF-16 encoded delimited text files as well as the popular Hadoop file formats RC File, ORC, and Parquet (non-nested format).”

51
Q

You have an Azure Storage account that generates 200,000 new files daily. The file names have a format of {YYYY}/{MM}/{DD}/{HH}/{CustomerID}.csv.
You need to design an Azure Data Factory solution that will load new data from the storage account to an Azure Data Lake once hourly. The solution must minimize load times and costs.
How should you configure the solution?
Load method:
Full load.
Incremental load.
Load individual files as they arrive.
Trigger:
Fixed schedule.
New file.
Tumbling window.

A

Incremental load
Tumbling window
Explain:
Tumbling windows are a series of fixed-sized, non-overlapping, and contiguous time intervals.

52
Q

You are designing a solution that will copy Parquet files stored in an Azure Blob storage account to an Azure Data Lake Storage Gen2 account.
The data will be loaded daily to the data lake and will use a folder structure of {Year}/{Month}/{Day}/.
You need to design a daily Azure Data Factory data load to minimize the data transfer between the two accounts.
Which two configurations should you include in the design?

A. Specify a file naming pattern for the destination.
B. Delete the files in the destination before loading the data.
C. Filter by the last modified date of the source files.
D. Delete the source files after they are copied.

A

Filter by the last modified date of the source files.
Delete the source files after they are copied.
Explain: Copy only the daily files by using filtering.

53
Q

You have an Azure Data Factory version 2 (V2) resource named Df1. Df1 contains a linked service.
You have an Azure Key vault named vault1 that contains an encryption key named key1.
You need to encrypt Df1 by using key1.
What should you do first?

A. Add a private endpoint connection to vault1.
B. Enable Azure role-based access control on vault1.
C. Remove the linked service from Df1.
D. Create a self-hosted integration runtime.

A

Remove the linked service from Df1.

54
Q

You have an Azure subscription that contains an Azure Data Lake Storage account. The storage account contains a data lake named DataLake1.
You plan to use an Azure data factory to ingest data from a folder in DataLake1, transform the data, and land the data in another folder.
You need to ensure that the data factory can read and write data from any folder in the DataLake1 file system. The solution must meet the following requirements:
● Minimize the risk of unauthorized user access.
● Use the principle of least privilege.
● Minimize maintenance effort.
How should you configure access to the storage account for the data factory?
Use:
A. Azure Active Directory.
B. a shared access signature (SAS)
C. a shared key
To authenticate by using:
A. a managed identity
B. a stored access policy
C. an Authorization header

A

Azure Active Directory
A managed identity
Explain:
On Azure, managed identities eliminate the need for developers to manage credentials by providing an identity for the Azure resource in Azure AD and using it to obtain Azure Active Directory (Azure AD) tokens.

55
Q

You have an Azure Data Factory pipeline that has the activities shown in the following exhibit.
Web1(green line) -> (x) set variable1-> Stored procedure1
Web1(orange line) -> (x) set variable 2
Stored procedure1 will execute and Set variable1:
A. Fail
B. Succeed
C. Complete
If web1 fails and Set variable2 succeeds the pipeline status will be:
A. Failed
B. Succeeded
C. Cancelled

A

Succeed
Failed
Explain:
Green line indicates success path.
Orange line indicates fail path.

56
Q

You have several Azure Data Factory pipelines that contain a mix of the following types of activities:
● Wrangling data flow
● Notebook
● Copy
● Jar
Which two Azure services should you use to debug the activities? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point

Azure Synapse Analytics
Azure HDInsight
Azure Machine Learning
Azure Data Factory
Azure Databricks

A

Azure Data Factory
Azure Databricks
Explain:
1. Data wrangling is only supported by ADF, not Synapse Analytics.
2. Notebook and Jar activities require Databricks.
57
Q

You plan to create an Azure Data Factory pipeline that will include a mapping data flow.
You have JSON data containing objects that have nested arrays.
You need to transform the JSON-formatted data into a tabular dataset. The dataset must have one row for each item in the arrays.

Which transformation method should you use in the mapping data flow?
A. new branch
B. unpivot
C. alter row
D. flatten

A

Flatten
Use the flatten transformation to take array values inside hierarchical structures such as JSON and unroll them into individual rows. This process is known as denormalization.
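For illustration, an analogous denormalization in PySpark rather than the ADF flatten transformation itself; explode() unrolls a nested array into one row per array item (the path and column names are assumptions):
from pyspark.sql.functions import explode, col

orders = spark.read.json("/mnt/raw/orders.json")                       # assumed path; each order has an "items" array
flat = orders.select("orderId", explode(col("items")).alias("item"))   # one row per array item
flat.select("orderId", "item.sku", "item.quantity").show()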

58
Q

You have an Azure Data Factory pipeline that is triggered hourly.
The pipeline has had 100% success for the past seven days.
The pipeline execution fails, and two retries that occur 15 minutes apart also fail. The third failure returns the following error.

ErrorCode=UserErrorFileNotFound,’Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=ADLS Gen2 operation failed for: Operation returned an invalid status code ‘NotFound’.
Account: ‘contosoproduksouth’. Filesystem: wwi.
Path: ‘BIKES/CARBON/year=2021/month=01/day=10/hour=06’.
ErrorCode: ‘PathNotFound’.
Message: ‘The specified path does not exist.’.
RequestId: ‘6d269b78-901f-001b-4924-e7a7bc000000’.
TimeStamp: ‘Sun, 10 Jan 2021 07:45:05

What is a possible cause of the error?

A. The parameter used to generate year=2021/month=01/day=10/hour=06 was incorrect.
B. From 06:00 to 07:00 on January 10, 2021, there was no data in wwi/BIKES/CARBON.
C. From 06:00 to 07:00 on January 10, 2021, the file format of data in wwi/BIKES/CARBON was incorrect.
D. The pipeline was triggered too early.

A

B. From 06:00 to 07:00 on January 10, 2021, there was no data in wwi/BIKES/CARBON.
Explain: The error message indicates a missing file, which matches answer B: missing data from 06:00. The process retried three times, 15 minutes apart, which explains why the error was generated at 07:45.

59
Q

Which of the following terms refers to the scale of compute that is being used in an Azure Synapse Analytics server?

RTU
DWU
DTU

A

DWU
Data Warehouse Units (DWUs) are the measure of compute scale assigned to an Azure SQL Data Warehouse.
RTU is a compute scale unit of Cosmos DB. DTU is a compute scale unit of Azure SQL Database.

60
Q

Encrypted communication is turned on automatically when connecting to an Azure SQL Database or Azure Synapse Analytics. True or False?

True
False

A

True
Azure SQL Database enforces encryption (SSL/TLS) at all times for all connections.

61
Q

You are receiving an error message in Azure Synapse Analytics. You want to view information about the service to help solve the problem. What can you use to quickly check the availability of the service?

Network performance monitor
Diagnose and solve problems
Azure monitor

A

Diagnose and solve problems
Diagnose and solve problems can quickly show you the service availability.
Azure Monitor helps you collect, analyze, and act on telemetry from both cloud and on-premises environments.
Network Performance Monitor measures the performance and reachability of the networks that you have configured.

62
Q

You have an enterprise data warehouse in Azure Synapse Analytics named DW1 on a server named Server1.
You need to verify whether the size of the transaction log file for each distribution of DW1 is smaller than 160 GB.

What should you do?
A. On the master database, execute a query against the sys.dm_pdw_nodes_os_performance_counters dynamic management view.
B. From Azure Monitor in the Azure portal, execute a query against the logs of DW1.
C. On DW1, execute a query against the sys.database_files dynamic management view.
D. Execute a query against the logs of DW1 by using the Get-AzOperationalInsightSearchResult PowerShell cmdlet.

A

On the master database, execute a query against the sys.dm_pdw_nodes_os_performance_counters.
-- Transaction log size
SELECT
instance_name AS distribution_db, cntr_value*1.0/1048576 AS log_file_size_used_GB, pdw_node_id
FROM sys.dm_pdw_nodes_os_performance_counters
WHERE
instance_name LIKE 'Distribution_%'
AND counter_name = 'Log File(s) Used Size (KB)'

63
Q

You have an enterprise data warehouse in Azure Synapse Analytics.
You need to monitor the data warehouse to identify whether you must scale up to a higher service level to accommodate the current workloads.

Which is the best metric to monitor?
More than one answer choice may achieve the goal. Select the BEST answer.

CPU percentage
DWU used
DWU percentage
Data IO percentage

A

DWU used
Defined as DWU limit * DWU percentage

64
Q

You have an enterprise data warehouse in Azure Synapse Analytics that contains a table named FactOnlineSales. The table contains data from the start of 2009 to the end of 2012.
You need to improve the performance of queries against FactOnlineSales by using table partitions. The solution must meet the following requirements:
- Create four partitions based on the order date.
- Ensure that each partition contains all the orders placed during a given calendar year.
How should you complete the T-SQL command?
Create Table [dbo].factonlinesales
([onlinesaleskey] [int] NOT NULL,
[Orderdatekey] [datetime] NOT NULL,
[Storekey] [int] NOT NULL,
[productkey] [int] NOT NULL,
[customerkey] [int] NOT NULL,
[salesordernumber] [nvarchar] (20) NOT NULL,
[salesquantity] [int] NOT NULL,
[salesamount] [int] NOT NULL,
[unitprice] [money] NULL)
WITH (clustered columnstore index,
PARTITION ( [Orderdatekey] RANGE
A. RIGHT
B. LEFT
FOR VALUES
A. 20090101, 20121231
B. 20100101,20110101,20120101
C. 20090101,20100101,20110101,20120101

A

LEFT
20090101,20100101,20110101,20120101
Explain:
RANGE LEFT: Specifies that the boundary value belongs to the partition on the left (lower values). The default is LEFT.
FOR VALUES ( boundary_value [,…n] ) specifies the boundary values for the partition. boundary_value is a constant expression.

65
Q

You are designing an enterprise data warehouse in Azure Synapse Analytics. You plan to load millions of rows of data into the data warehouse each day.
You must ensure that staging tables are optimized for data loading.
You need to design the staging tables.

What type of tables should you recommend?
Round-robin distributed table
Hash-distributed table
Replicated table
External table

A

Round-robin distributed table

66
Q

You have a SQL pool in Azure Synapse.
You discover that some queries fail or take a long time to complete.
You need to monitor for transactions that have rolled back.

Which dynamic management view should you query?
sys.dm_pdw_nodes_tran_database_transactions
sys.dm_pdw_waits
sys.dm_pdw_request_steps
sys.dm_pdw_exec_sessions

A

sys.dm_pdw_nodes_tran_database_transactions
Explain:
You can use Dynamic Management Views (DMVs) to monitor your workload including investigating query execution in SQL pool.
SELECT
SUM(CASE WHEN t.database_transaction_next_undo_lsn IS NOT NULL THEN 1 ELSE 0 END), t.pdw_node_id, nod.[type]
FROM sys.dm_pdw_nodes_tran_database_transactions t
JOIN sys.dm_pdw_nodes nod ON t.pdw_node_id = nod.pdw_node_id
GROUP BY t.pdw_node_id, nod.[type]

67
Q

You are designing the folder structure for an Azure Data Lake Storage Gen2 container.
Users will query data by using a variety of services including Azure Databricks and Azure Synapse Analytics serverless SQL pools. The data will be secured by subject area. Most queries will include data from the current year or current month.

Which folder structure should you recommend to support fast queries and simplified folder security?

A. /{SubjectArea}/{DataSource}/{DD}/{MM}/{YYYY}/{FileData}{YYYY}{MM}{DD}.csv
B. /{DD}/{MM}/{YYYY}/{SubjectArea}/{DataSource}/{FileData}{YYYY}{MM}{DD}.csv
C. /{YYYY}/{MM}/{DD}/{SubjectArea}/{DataSource}/{FileData}{YYYY}{MM}{DD}.csv
D. /{SubjectArea}/{DataSource}/{YYYY}/{MM}/{DD}/{FileData}{YYYY}{MM}{DD}.csv

A

D. /{SubjectArea}/{DataSource}/{YYYY}/{MM}/{DD}/{FileData}{YYYY}{MM}{DD}.csv
Explain:
There’s an important reason to put the date at the end of the directory structure. If you want to lock down certain regions or subject matters to users/groups, then you can easily do so with the POSIX permissions.

68
Q

You are building an Azure Stream Analytics query that will receive input data from Azure IoT Hub and write the results to Azure Blob storage.
You need to calculate the difference in readings per sensor per hour.

How should you complete the query?
SELECT sensorId,
growth = reading -
A. LAG
B. LAST
C. LEAD
(reading) OVER (PARTITION BY sensorId
A. LIMIT DURATION
B. OFFSET
C. WHEN
(hour, 1))
FROM input

A

LAG
LIMIT DURATION
Explain:
The LAG analytic operator allows one to look up a previous event in an event stream, within certain constraints. It is very useful for computing the rate of growth of a variable, detecting when a variable crosses a threshold, or when a condition starts or stops being true.

69
Q

You have an Azure subscription that contains a logical Microsoft SQL server named Server1. Server1 hosts an Azure Synapse Analytics SQL dedicated pool named Pool1.
You need to recommend a Transparent Data Encryption (TDE) solution for Server1.
The solution must meet the following requirements:
- Track the usage of encryption keys.
- Maintain the access of client apps to Pool1 in the event of an Azure datacenter outage that affects the availability of the encryption keys.

What should you include in the recommendation?
To track encryption key usage:
A. Always encrypted
B. TDE with customer-managed keys
C. TDE with platform-managed keys
To maintain client app access in the event of a datacenter outage:
A. Create and configure Azure Key vaults into 2 azure regions.
B. Enable Advanced Data Security on server1.
C. Implement the client apps by using Microsoft.NET framework data provider

A

TDE with customer-managed keys
Create and configure Azure Key vaults into 2 azure regions.
Explain:
Customer-managed keys are stored in the Azure Key Vault. You can monitor how and when your key vaults are accessed, and by whom.
The contents of your key vault are replicated within the region and to a secondary region at least 150 miles away, but within the same geography to maintain high durability of your keys and secrets.

70
Q

You plan to create an Azure Synapse Analytics dedicated SQL pool.
You need to minimize the time it takes to identify queries that return confidential information as defined by the company's data privacy regulations and the users who executed the queries.

Which two components should you include in the solution?

A. Sensitivity-classification labels applied to columns that contain confidential information
B. Resource tags for databases that contain confidential information
C. Audit logs sent to a Log Analytics workspace
D. Dynamic data masking for columns that contain confidential information

A

Sensitivity-classification labels applied to columns that contain confidential information.
Audit logs sent to a Log Analytics workspace.

71
Q

You have a partitioned table in an Azure Synapse Analytics dedicated SQL pool. You need to design queries to maximize the benefits of partition elimination. What should you include in the Transact-SQL queries?

JOIN
WHERE
DISTINCT
GROUP BY

A

WHERE

72
Q

You implement an enterprise data warehouse in Azure Synapse Analytics.
You have a large fact table that is 10 terabytes (TB) in size.
Incoming queries use the primary key SaleKey column to retrieve data as displayed in the following table:

You need to distribute the large fact table across multiple nodes to optimize performance of the table.

Which technology should you use?

A. Hash distributed table with clustered index
B. Hash distributed table with clustered Column-store index
C. Round robin distributed table with clustered index
D. Round robin distributed table with clustered Column-store index
E. Heap table with distribution replicate

A

B. Hash distributed table with clustered Column-store index.
Explain:
Hash-distributed tables improve query performance on large fact tables.
Column-store indexes can achieve up to 100x better performance on analytics and data warehousing workloads and up to 10x better data compression than traditional row-store indexes.

73
Q

You have a SQL pool in Azure Synapse.
You plan to load data from Azure Blob storage to a staging table. Approximately 1 million rows of data will be loaded daily. The table will be truncated before each daily load.
You need to create the staging table. The solution must minimize how long it takes to load the data to the staging table.

How should you configure the table?
Distribution:
A. Hash
B. Replicated
C. Round robin
Indexing:
A. Clustered
B. Clustered column-store
C. Heap
Partitioning:
A. Date
B. None

A

Round robin
Heap
None
Explain:
Round-robin - this is the simplest distribution model, not great for querying but fast to process

Heap - a heap is a table without a clustered index. Adding a clustered index to a staging table that is entirely truncated daily wastes compute resources; no clustered index = heap.

No partitions - Partitioning by date is useful when the staging destination already has data, because you can hide the new partition while inserting, complete the load, and then unhide it.
However, the question states that the table will be truncated before each daily load, so it is a true staging table with no existing data and no users with access, and there is no reason to partition by date.

74
Q

You have an Azure Synapse Analytics workspace named WS1 that contains an Apache Spark pool named Pool1.
You plan to create a database named DB1 in Pool1.
You need to ensure that when tables are created in DB1, the tables are available automatically as external tables to the built-in serverless SQL pool.
Which formats should you use for the tables in DB1?

CSV
ORC
JSON
Parquet

A

CSV
Parquet
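For illustration, a minimal Spark SQL sketch run from a Synapse Spark pool (database and table names are assumptions); tables created with the Parquet or CSV format are synchronized automatically to the built-in serverless SQL pool as external tables:
spark.sql("CREATE DATABASE IF NOT EXISTS DB1")
spark.sql("""
    CREATE TABLE IF NOT EXISTS DB1.Sales (Id INT, Amount DOUBLE)
    USING PARQUET
""")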

75
Q

You are designing a financial transactions table in an Azure Synapse Analytics dedicated SQL pool. The table will have a clustered columnstore index and will include the following columns:
● TransactionType: 40 million rows per transaction type
● CustomerSegment: 4 million per customer segment
● TransactionMonth: 65 million rows per month
● AccountType: 500 million rows per account type

You have the following query requirements:
● Analysts will most commonly analyze transactions for a given month.
● Transactions analysis will typically summarize transactions by transaction type, customer segment, and/or account type
You need to recommend a partition strategy for the table to minimize query times.

On which column should you recommend partitioning the table?

CustomerSegment
AccountType
TransactionType
TransactionMonth

A

TransactionMonth

76
Q

You have an Azure Synapse Analytics dedicated SQL pool named Pool1. Pool1 contains a partitioned fact table named dbo.Sales and a staging table named stg.Sales that has the matching table and partition definitions.
You need to overwrite the content of the first partition in dbo.Sales with the content of the same partition in stg.Sales. The solution must minimize load times.
What should you do?
A. Insert the data from stg.Sales into dbo.Sales.
B. Switch the first partition from dbo.Sales to stg.Sales.
C. Switch the first partition from stg.Sales to dbo.Sales.
D. Update dbo.Sales from stg.Sales.

A

Switch the first partition from stg.Sales to dbo.Sales.
Explain: Since the need is to overwrite dbo.Sales with the content of stg.Sales. SWITCH source to target.

77
Q

You are designing a slowly changing dimension (SCD) for supplier data in an Azure Synapse Analytics dedicated SQL pool.
You plan to keep a record of changes to the available fields.
The supplier data contains the following columns.

Which three additional columns should you add to the data to create a Type 2 SCD?

Surrogate primary key
Effective start date
Business key
Last modified date
Effective end date
Foreign key

A

Surrogate primary key
Effective start date
Effective end date

78
Q

You have a Microsoft SQL Server database that uses a third normal form schema.
You plan to migrate the data in the database to a star schema in an Azure Synapse Analytics dedicated SQL pool.
You need to design the dimension tables. The solution must optimize read operations.
What should you include in the solution?
Transform data for the dimension tables by:
A. Maintaining to a third normal form
B. Normalizing to a fourth normal form
C. Denormalizing to a second normal form
For the primary key columns in the dimension tables, use:
A. New IDENTITY column
B. A new computed column
C. The business key column from the source system

A

Denormalizing to a second normal form.
New IDENTITY column.
Explain:
Denormalization increases data retrieval performance at the cost of introducing update anomalies to a database.

79
Q

You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool. The table contains purchases from suppliers for a retail store. FactPurchase will contain the following columns.

FactPurchase will have 1 million rows of data added daily and will contain three years of data.
Transact-SQL queries similar to the following query will be executed daily.
SELECT
SupplierKey,
StockItemKey,
COUNT(*)
FROM FactPurchase
WHERE 1=1
AND DateKey >= 20210101
AND DateKey <= 20210131
GROUP BY SupplierKey, StockItemKey
Which table distribution will minimize query times?

Replicated
Hash-distributed on PurchaseKey
Round-robin
Hash-distributed on DateKey

A

Hash-distributed on PurchaseKey
Explain:
Do not use a date column. All data for the same date lands in the same distribution. If several users are all filtering on the same date, then only 1 of the 60 distributions does all the processing work.

80
Q

You are designing a data mart for the human resources (HR) department at your company. The data mart will contain employee information and employee transactions.
From a source system, you have a flat extract that has the following fields:
● EmployeeID
● FirstName
● LastName
● Recipient
● GrossAmount
● TransactionID
● GovernmentID
● NetAmountPaid
● TransactionDate

You need to design a star schema data model in an Azure Synapse Analytics dedicated SQL pool for the data mart.
Which two tables should you create?

a dimension table for Transaction
a dimension table for Employee Transaction
a dimension table for Employee
a fact table for Employee
a fact table for Transaction

A

A dimension table for Employee
A fact table for Transaction

81
Q

You are building an Azure Synapse Analytics dedicated SQL pool that will contain a fact table for transactions from the first half of the year 2020.
You need to ensure that the table meets the following requirements:
● Minimizes the processing time to delete data that is older than 10 years
● Minimizes the I/O for queries that use year-to-date values
How should you complete the Transact-SQL statement?
With (
A. Clustered column-store index
B. Distribution
C. Partition
D. Truncate_target
(
A. [TransactionDateID]
B. [TransactionDateID], [TransactionKeyID]
C. Hash([TransactionTypeID])
D. Round_robin
RANGE RIGHT FOR VALUES
(20200101,20200201,20200301,20200401,20200501,20200601)

A

Partition
[TransactionDateID]
Explain:
Partition is used with RANGE RIGHT FOR VALUES.

82
Q

You are performing exploratory analysis of the bus fare data in an Azure Data Lake Storage Gen2 account by using an Azure Synapse Analytics serverless SQL pool.
You execute the Transact-SQL query shown in the following exhibit.
SELECT payment_type, SUM(fare_amount) AS fare_total
FROM OPENROWSET(
BULK 'csv/busfare/tripdata_2020*.csv',
DATA_SOURCE = 'Bus Data',
FORMAT = 'CSV', PARSER_VERSION = '2.0',
FIRSTROW = 2)
What do the query results include?
A. Only CSV files in the tripdata_2020 subfolder.
B. All files that have file names that begin with tripdata_2020.
C. All CSV files that have file names that contain tripdata_2020.
D. Only CSV files that have file names that begin with tripdata_2020.

A

D. Only CSV files that have file names that begin with tripdata_2020.

83
Q

You need to implement a Type 3 slowly changing dimension (SCD) for product category data in an Azure Synapse Analytics dedicated SQL pool.
You have a table that was created by using the following Transact-SQL statement.

Which two columns should you add to the table? Each correct answer presents part of the solution?

[EffectiveStartDate] [datetime] NOT NULL,
[CurrentProductCategory] [nvarchar] (100) NOT NULL,
[EffectiveEndDate] [datetime] NULL,
[ProductCategory] [nvarchar] (100) NOT NULL,
[OriginalProductCategory] [nvarchar] (100) NOT NULL,

A