Exam questions Flashcards
(325 cards)
You have a table in an Azure Synapse Analytics dedicated SQL pool. The table was created by using the following Transact-SQL statement.
CREATE TABLE [dbo].[DimEmployee] ( [EmployeeKey] [int] IDENTITY(1,1) NOT NULL, [EmployeeID] [int] NOT NULL, [FirstName] [varchar](100) NOT NULL, [LastName] [varchar](100) NOT NULL, [JobTitle] [varchar](100) NULL, [LastHireDate] [date] NULL, [StreetAddress] [varchar](500) NOT NULL, [City] [varchar](200) NOT NULL, [StateProvince] [varchar](50) NOT NULL, [PostalCode] [varchar](10) NOT NULL )
You need to alter the table to meet the following requirements:
✑ Ensure that users can identify the current manager of employees.
✑ Support creating an employee reporting hierarchy for your entire company.
✑ Provide fast lookup of the managers’ attributes such as name and job title.
Which column should you add to the table?
A. [ManagerEmployeeID] [smallint] NULL
B. [ManagerEmployeeKey] [smallint] NULL
C. [ManagerEmployeeKey] [int] NULL
D. [ManagerName] varchar NULL
Correct Answer: C
We need an extra column to identify the manager. Use the same data type as the EmployeeKey column: int.
C is correct because the manager column's data type must match the data type of the surrogate key it references.
Reference:
https://docs.microsoft.com/en-us/analysis-services/tabular-models/hierarchies-ssas-tabular
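A minimal sketch of the change (answer C); the self-join below is illustrative, showing how the new key gives fast lookup of manager attributes:

```sql
-- Add the self-referencing manager key; int matches EmployeeKey.
ALTER TABLE [dbo].[DimEmployee]
ADD [ManagerEmployeeKey] [int] NULL;

-- Manager attributes are then one self-join away on the surrogate key:
SELECT e.[FirstName], e.[LastName],
       m.[FirstName] AS ManagerFirstName,
       m.[LastName]  AS ManagerLastName,
       m.[JobTitle]  AS ManagerJobTitle
FROM [dbo].[DimEmployee] AS e
LEFT JOIN [dbo].[DimEmployee] AS m
    ON e.[ManagerEmployeeKey] = m.[EmployeeKey];
```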
You have an Azure Synapse workspace named MyWorkspace that contains an Apache Spark database named mytestdb.
You run the following command in an Azure Synapse Analytics Spark pool in MyWorkspace.
CREATE TABLE mytestdb.myParquetTable( EmployeeID int, EmployeeName string, EmployeeStartDate date)
USING Parquet
You then use Spark to insert a row into mytestdb.myParquetTable. The row contains the following data.
EmployeeName|EmployeeStartDate|EmployeeID
Alice | 2020-01-25 | 24
One minute later, you execute the following query from a serverless SQL pool in MyWorkspace.
SELECT EmployeeID FROM mytestdb.dbo.myParquetTable WHERE EmployeeName = 'Alice';
What will be returned by the query?
A. 24
B. an error
C. a null value
I did a test: I waited one minute, ran the query in a serverless SQL pool, and received 24 as the result, so I don't understand why B has been voted so heavily. The answer is A (24) without a doubt.
The debate on B follows:
Answer is B, but not because of the lowercase. The case has nothing to do with the error.
If you look attentively, you will notice that we create table mytestdb.myParquetTable, but the select statement contains the reference to table mytestdb.dbo.myParquetTable (!!! - dbo).
Here is the error message I got:
Error: spark_catalog requires a single-part namespace, but got [mytestdb, dbo].
I just tried to run the commands, and the error you got occurs because you queried through the Spark pool. I did that as a test and got the exact same error. To query the data using the Spark pool, you don't use the ".dbo" reference; that only works if you're using a Synapse serverless SQL pool.
So the correct answer is A!
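To summarize the debate, the difference is how each engine resolves the table name; a minimal sketch, assuming the table above exists:

```sql
-- From the Spark pool: use the two-part name. Spark's catalog has no dbo schema,
-- so mytestdb.dbo.myParquetTable fails there with the "single-part namespace" error.
SELECT EmployeeID FROM mytestdb.myParquetTable WHERE EmployeeName = 'Alice';

-- From the serverless SQL pool: the shared Spark database is exposed through a
-- dbo schema, so the three-part name in the question works and returns 24.
SELECT EmployeeID FROM mytestdb.dbo.myParquetTable WHERE EmployeeName = 'Alice';
```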
HARD
DRAG DROP -
You have a table named SalesFact in an enterprise data warehouse in Azure Synapse Analytics. SalesFact contains sales data from the past 36 months and has the following characteristics:
✑ Is partitioned by month
✑ Contains one billion rows
✑ Has clustered columnstore index
At the beginning of each month, you need to remove data from SalesFact that is older than 36 months as quickly as possible.
Which three actions should you perform in sequence in a stored procedure? To answer, move the appropriate actions from the list of actions to the answer area and arrange them in the correct order.
Select and Place:
Actions
- Switch the partition containing the stale data from SalesFact to SalesFact_Work.
- Truncate the partition containing the stale data.
- Drop the SalesFact_Work table.
- Create an empty table named SalesFact_Work that has the same schema as SalesFact.
- Execute a DELETE statement where the value in the Date column is more than 36 months ago.
- Copy the data to a new table by using CREATE TABLE AS SELECT (CTAS).
Step 1: Create an empty table named SalesFact_work that has the same schema as SalesFact.
Step 2: Switch the partition containing the stale data from SalesFact to SalesFact_Work.
SQL Data Warehouse supports partition splitting, merging, and switching. To switch partitions between two tables, you must ensure that the partitions align on their respective boundaries and that the table definitions match.
Loading data into partitions with partition switching is a convenient way to stage new data in a table that is not visible to users, and then switch the new data in.
Step 3: Drop the SalesFact_Work table.
Reference:
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-partition
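A hedged sketch of the three steps; the distribution and partition columns (ProductKey, DateKey), boundary values, and partition number are assumptions, since the question does not give the table definition:

```sql
-- Step 1: empty table with a matching schema, distribution, and aligned
-- partition boundaries (required for the switch to succeed).
CREATE TABLE dbo.SalesFact_Work
WITH
(
    DISTRIBUTION = HASH(ProductKey),        -- must match SalesFact
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (DateKey RANGE RIGHT FOR VALUES (20210101, 20210201 /* ... */))
)
AS SELECT * FROM dbo.SalesFact WHERE 1 = 2; -- copies schema only, no rows

-- Step 2: switch the stale partition out of SalesFact (metadata-only, fast).
ALTER TABLE dbo.SalesFact SWITCH PARTITION 1 TO dbo.SalesFact_Work PARTITION 1;

-- Step 3: discard the switched-out data.
DROP TABLE dbo.SalesFact_Work;
```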
You have files and folders in Azure Data Lake Storage Gen2 for an Azure Synapse workspace as shown in the following exhibit.
/topfolder/
- File1.csv
- /folder1/File2.csv
- /folder2/File3.csv
- File4.csv
You create an external table named ExtTable that has LOCATION=’/topfolder/’.
When you query ExtTable by using an Azure Synapse Analytics serverless SQL pool, which files are returned?
A. File2.csv and File3.csv only
B. File1.csv and File4.csv only
C. File1.csv, File2.csv, File3.csv, and File4.csv
D. File1.csv only
I believe the answer should be B.
For a serverless pool, a wildcard must be added to the location for it to traverse subfolders.
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#arguments-create-external-table
“Serverless SQL pool can recursively traverse folders only if you specify /** at the end of path.”
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-folders-multiple-csv-files
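A sketch of the external table as described; the column list, data source, and file format names are hypothetical, since the question does not provide them:

```sql
-- With this LOCATION a serverless SQL pool reads only the top-level files,
-- so the query returns File1.csv and File4.csv only (answer B).
CREATE EXTERNAL TABLE ExtTable
(
    Col1 varchar(100)                -- hypothetical column
)
WITH
(
    LOCATION = '/topfolder/',
    DATA_SOURCE = MyDataSource,      -- assumed to exist
    FILE_FORMAT = MyCsvFormat        -- assumed to exist
);

-- To also pick up File2.csv and File3.csv, recurse with:
-- LOCATION = '/topfolder/**'
```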
HOTSPOT -
You are planning the deployment of Azure Data Lake Storage Gen2.
You have the following two reports that will access the data lake:
✑ Report1: Reads three columns from a file that contains 50 columns.
✑ Report2: Queries a single record based on a timestamp.
You need to recommend in which format to store the data in the data lake to support the reports. The solution must minimize read times.
What should you recommend for each report? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Report1: * Avro * CSV * Parquet * TSV Report2: * Avro * CSV * Parquet * TSV
1: Parquet - column-oriented binary file format
2: AVRO- Row based format, and has logical type timestamp
https://youtu.be/UrWthx8T3UY
You are designing the folder structure for an Azure Data Lake Storage Gen2 container.
Users will query data by using a variety of services including Azure Databricks and Azure Synapse Analytics serverless SQL pools. The data will be secured by subject area. Most queries will include data from the current year or current month.
Which folder structure should you recommend to support fast queries and simplified folder security?
A. /{SubjectArea}/{DataSource}/{DD}/{MM}/{YYYY}/{FileData}{YYYY}{MM}{DD}.csv
B. /{DD}/{MM}/{YYYY}/{SubjectArea}/{DataSource}/{FileData}{YYYY}{MM}{DD}.csv
C. /{YYYY}/{MM}/{DD}/{SubjectArea}/{DataSource}/{FileData}{YYYY}{MM}{DD}.csv
D. /{SubjectArea}/{DataSource}/{YYYY}/{MM}/{DD}/{FileData}{YYYY}{MM}{DD}.csv
Correct Answer: D
There’s an important reason to put the date at the end of the directory structure. If you want to lock down certain regions or subject matters to users/groups, then you can easily do so with the POSIX permissions. Otherwise, if there was a need to restrict a certain security group to viewing just the UK data or certain planes, with the date structure in front a separate permission would be required for numerous directories under every hour directory. Additionally, having the date structure in front would exponentially increase the number of directories as time went on.
Note: In IoT workloads, there can be a great deal of data being landed in the data store that spans across numerous products, devices, organizations, and customers. It’s important to pre-plan the directory layout for organization, security, and efficient processing of the data for down-stream consumers. A general template to consider might be the following layout:
{Region}/{SubjectMatter(s)}/{yyyy}/{mm}/{dd}/{hh}/
Serverless SQL pools offer a straightforward method of querying data, including CSV, JSON, and Parquet formats, stored in Azure Storage.
So, laying out the CSV files within Azure Storage in a Hive-style folder hierarchy, i.e. /{yyyy}/{mm}/{dd}/, helps SQL query the data much faster, since only the partitioned segment of the data is read.
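As a sketch of why structure D queries fast in a serverless SQL pool, the date folders can be pruned with filepath(); the storage account, container, subject area, and column below are hypothetical:

```sql
-- Read only the current month's partition under /{SubjectArea}/{DataSource}/{YYYY}/{MM}/{DD}/.
SELECT TOP 10 r.SaleAmount
FROM OPENROWSET(
        BULK 'https://account1.dfs.core.windows.net/container1/Sales/POS/*/*/*/*.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
     ) WITH (SaleAmount decimal(10, 2)) AS r
WHERE r.filepath(1) = '2024'   -- first wildcard: {YYYY}
  AND r.filepath(2) = '03';    -- second wildcard: {MM}
```

Only files whose folder path matches the filepath() filters are read, which is what makes the date-at-the-end layout fast to query.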
HOTSPOT -
You need to output files from Azure Data Factory.
Which file format should you use for each type of output? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Columnar format: * Avro * GZip * Parquet * TXT JSON with a timestamp: * Avro * GZip * Parquet * TXT
Box 1: Parquet -
Parquet stores data in columns, while Avro stores data in a row-based format. By their very nature, column-oriented data stores are optimized for read-heavy analytical workloads, while row-based databases are best for write-heavy transactional workloads.
Box 2: Avro -
An Avro schema is created using JSON format.
AVRO supports timestamps.
Note: Azure Data Factory supports the following file formats (not GZip or TXT).
Avro format -
✑ Binary format
✑ Delimited text format
✑ Excel format
✑ JSON format
✑ ORC format
✑ Parquet format
✑ XML format
Reference:
https://www.datanami.com/2018/05/16/big-data-file-formats-demystified
HOTSPOT -
You use Azure Data Factory to prepare data to be queried by Azure Synapse Analytics serverless SQL pools.
Files are initially ingested into an Azure Data Lake Storage Gen2 account as 10 small JSON files. Each file contains the same data attributes and data from a subsidiary of your company.
You need to move the files to a different folder and transform the data to meet the following requirements:
✑ Provide the fastest possible query times.
✑ Automatically infer the schema from the underlying files.
How should you configure the Data Factory copy activity? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Copy behavior: * Flatten hierarchy * Merge files * Preserve hierarchy Sink file type: * CSV * JSON * Parquet * TXT
1. Merge Files
2. Parquet
https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-performance-tuning-guidance
Larger files lead to better performance and reduced costs.
Typically, analytics engines such as HDInsight have a per-file overhead that involves tasks such as listing, checking access, and performing various metadata operations. If you store your data as many small files, this can negatively affect performance. In general, organize your data into larger sized files for better performance (256 MB to 100 GB in size).
Hard
HOTSPOT -
You have a data model that you plan to implement in a data warehouse in Azure Synapse Analytics as shown in the following exhibit.
Dim_Employee
* iEmployeeID
* vcEmployeeLastName
* vcEmployeeMName
* vcEmployeeFirstName
* dtEmployeeHireDate
* dtEmployeeLevel
* dtEmployeeLastPromotion
Fact_DailyBookings
* iDailyBookingsID
* iCustomerID
* iTimeID
* iEmployeeID
* iItemID
* iQuantityOrdered
* dExchangeRate
* iCountryofOrigin
* mUnitPrice
Dim_Customer
* iCustomerID
* vcCustomerName
* vcCustomerAddress1
* vcCustomerCity
Dim_Time
* iTimeID
* iCalendarDay
* iCalendarWeek
* iCalendarMonth
* vcDayofWeek
* vcDayofMonth
* vcDayofYear
* iHolidayIndicator
All the dimension tables will be less than 2 GB after compression, and the fact table will be approximately 6 TB. The dimension tables will be relatively static with very few data inserts and updates.
Which type of table should you use for each table? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Dim_Customer: * Hash distributed * Round-robin * Replicated Dim_Employee: * Hash distributed * Round-robin * Replicated Dim_Time: * Hash distributed * Round-robin * Replicated Fact_DailyBookings: * Hash distributed * Round-robin * Replicated
Box 1: Replicated -
Replicated tables are ideal for small star-schema dimension tables, because the fact table is often distributed on a column that is not compatible with the connected dimension tables. If this case applies to your schema, consider changing small dimension tables currently implemented as round-robin to replicated.
Box 2: Replicated -
Box 3: Replicated -
Box 4: Hash-distributed -
For Fact tables use hash-distribution with clustered columnstore index. Performance improves when two hash tables are joined on the same distribution column.
Reference:
https://azure.microsoft.com/en-us/updates/reduce-data-movement-and-make-your-queries-more-efficient-with-the-general-availability-of-replicated-tables/ https://azure.microsoft.com/en-us/blog/replicated-tables-now-generally-available-in-azure-sql-data-warehouse/
The answer is correct.
The dimensions are under 2 GB, so there is no point in using hash distribution.
Common distribution methods for tables:
The table category often determines which option to choose for distributing the table.
Table category Recommended distribution option
Fact -Use hash-distribution with clustered columnstore index. Performance improves when two hash tables are joined on the same distribution column.
Dimension - Use replicated for smaller tables. If tables are too large to store on each Compute node, use hash-distributed.
Staging - Use round-robin for the staging table. The load with CTAS is fast. Once the data is in the staging table, use INSERT…SELECT to move the data to production tables.
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-overview#common-distribution-methods-for-tables
SIMILAR TO ANOTHER QUESTION, BUT SAME ANSWERS
HOTSPOT -
You have an Azure Data Lake Storage Gen2 container.
Data is ingested into the container, and then transformed by a data integration application. The data is NOT modified after that. Users can read files in the container but cannot modify the files.
You need to design a data archiving solution that meets the following requirements:
✑ New data is accessed frequently and must be available as quickly as possible.
✑ Data that is older than five years is accessed infrequently but must be available within one second when requested.
✑ Data that is older than seven years is NOT accessed. After seven years, the data must be persisted at the lowest cost possible.
✑ Costs must be minimized while maintaining the required availability.
How should you manage the data? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point
Hot Area:
Five-year-old data: * Delete the blob. * Move to archive storage. * Move to cool storage. * Move to hot storage. Seven-year-old data: * Delete the blob. * Move to archive storage. * Move to cool storage. * Move to hot storage.
Box 1: Move to cool storage -
Box 2: Move to archive storage -
Archive - Optimized for storing data that is rarely accessed and stored for at least 180 days with flexible latency requirements, on the order of hours.
The linked documentation shows a comparison of premium performance block blob storage and the hot, cool, and archive access tiers.
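These two moves can be automated with a blob lifecycle management policy; a sketch, assuming ages are counted from last modification (1825 days ≈ 5 years, 2555 days ≈ 7 years):

```json
{
  "rules": [
    {
      "name": "archive-old-data",
      "enabled": true,
      "type": "Lifecycle",
      "definition": {
        "filters": { "blobTypes": [ "blockBlob" ] },
        "actions": {
          "baseBlob": {
            "tierToCool":    { "daysAfterModificationGreaterThan": 1825 },
            "tierToArchive": { "daysAfterModificationGreaterThan": 2555 }
          }
        }
      }
    }
  ]
}
```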
DRAG DROP -
You need to create a partitioned table in an Azure Synapse Analytics dedicated SQL pool.
How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.
NOTE: Each correct selection is worth one point.
Select and Place:
Values
* CLUSTERED INDEX
* COLLATE
* DISTRIBUTION
* PARTITION
* PARTITION FUNCTION
* PARTITION SCHEME
Answer Area
CREATE TABLE table1 ( ID INTEGER, col1 VARCHAR(10), col2 VARCHAR(10) ) WITH ( <XXXXXXXXXXXX> = HASH(ID), <YYYYYYYYYYYYY> (ID RANGE LEFT FOR VALUES (1, 1000000, 2000000)) )
Box 1: DISTRIBUTION -
Table distribution options include DISTRIBUTION = HASH ( distribution_column_name ), assigns each row to one distribution by hashing the value stored in distribution_column_name.
Box 2: PARTITION -
Table partition options. Syntax:
PARTITION ( partition_column_name RANGE [ LEFT | RIGHT ] FOR VALUES ( [ boundary_value [,…n] ] ))
Reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-azure-sql-data-warehouse
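The completed statement, with the two values filled in:

```sql
CREATE TABLE table1
(
    ID   INTEGER,
    col1 VARCHAR(10),
    col2 VARCHAR(10)
)
WITH
(
    DISTRIBUTION = HASH(ID),
    PARTITION (ID RANGE LEFT FOR VALUES (1, 1000000, 2000000))
);
```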
You need to design an Azure Synapse Analytics dedicated SQL pool that meets the following requirements:
✑ Can return an employee record from a given point in time.
✑ Maintains the latest employee information.
✑ Minimizes query complexity.
How should you model the employee data?
A. as a temporal table
B. as a SQL graph table
C. as a degenerate dimension table
D. as a Type 2 slowly changing dimension (SCD) table
Correct Answer: D 🗳️
A Type 2 SCD supports versioning of dimension members. Often the source system doesn’t store versions, so the data warehouse load process detects and manages changes in a dimension table. In this case, the dimension table must use a surrogate key to provide a unique reference to a version of the dimension member. It also includes columns that define the date range validity of the version (for example, StartDate and EndDate) and possibly a flag column (for example,
IsCurrent) to easily filter by current dimension members.
Reference:
https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-dimension-types
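A minimal sketch of such a Type 2 SCD table; the validity and flag column names are illustrative:

```sql
CREATE TABLE dbo.DimEmployee
(
    EmployeeKey int IDENTITY(1,1) NOT NULL,  -- surrogate key, one per version
    EmployeeID  int          NOT NULL,       -- business key from the source
    FirstName   varchar(100) NOT NULL,
    JobTitle    varchar(100) NULL,
    StartDate   date         NOT NULL,       -- validity range of this version
    EndDate     date         NULL,
    IsCurrent   bit          NOT NULL        -- flag for easy current-member filters
);

-- Point-in-time lookup stays simple:
-- SELECT * FROM dbo.DimEmployee
-- WHERE EmployeeID = @id
--   AND @asOf >= StartDate AND (@asOf < EndDate OR EndDate IS NULL);
```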
Hard
You have an enterprise-wide Azure Data Lake Storage Gen2 account. The data lake is accessible only through an Azure virtual network named VNET1.
You are building a SQL pool in Azure Synapse that will use data from the data lake.
Your company has a sales team. All the members of the sales team are in an Azure Active Directory group named Sales. POSIX controls are used to assign the
Sales group access to the files in the data lake.
You plan to load data to the SQL pool every hour.
You need to ensure that the SQL pool can load the sales data from the data lake.
Which three actions should you perform? Each correct answer presents part of the solution.
NOTE: Each correct selection is worth one point.
A. Add the managed identity to the Sales group.
B. Use the managed identity as the credentials for the data load process.
C. Create a shared access signature (SAS).
D. Add your Azure Active Directory (Azure AD) account to the Sales group.
E. Use the shared access signature (SAS) as the credentials for the data load process.
F. Create a managed identity.
F. Create a managed identity.
A. Add the managed identity to the Sales group.
B. Use the managed identity as the credentials for the data load process.
The managed identity grants permissions to the dedicated SQL pools in the workspace.
Note: Managed identity for Azure resources is a feature of Azure Active Directory. The feature provides Azure services with an automatically managed identity in
Azure AD -
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/security/synapse-workspace-managed-identity
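Once the managed identity is created and added to the Sales group, the hourly load can use it as the credential; a sketch with a hypothetical table and storage path:

```sql
-- Load from the data lake using the workspace managed identity; the identity's
-- membership in the Sales group grants it the POSIX ACLs on the files.
COPY INTO dbo.StageSales
FROM 'https://datalake1.dfs.core.windows.net/sales/in/*.parquet'
WITH
(
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);
```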
HOTSPOT -
You have an Azure Synapse Analytics dedicated SQL pool that contains the users shown in the following table.
[MISSING STUFF]
User1 executes a query on the database, and the query returns the results shown in the following exhibit.
[MISSING STUFF]
User1 is the only user who has access to the unmasked data.
Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic.
NOTE: Each correct selection is worth one point.
Hot Area:
When User2 queries the YearlyIncome column, the values returned will be [answer choice]. * a random number * the values stored in the database * XXXX * 0 When User1 queries the BirthDate column, the values returned will be [answer choice]. * a random date * the values stored in the database * xxxX * 1900-01-01
Box 1: 0 -
The YearlyIncome column is of the money data type.
The Default masking function: Full masking according to the data types of the designated fields
✑ Use a zero value for numeric data types (bigint, bit, decimal, int, money, numeric, smallint, smallmoney, tinyint, float, real).
Box 2: the values stored in the database
Users with administrator privileges are always excluded from masking, and see the original data without any mask.
Reference:
https://docs.microsoft.com/en-us/azure/azure-sql/database/dynamic-data-masking-overview
* Use 01-01-1900 for date/time data types (date, datetime2, datetime, datetimeoffset, smalldatetime, time).
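A sketch of masking rules that would produce the behavior described; the table name is hypothetical:

```sql
-- Default masks per data type: money -> 0, date -> 1900-01-01.
ALTER TABLE dbo.DimCustomer
ALTER COLUMN YearlyIncome ADD MASKED WITH (FUNCTION = 'default()');

ALTER TABLE dbo.DimCustomer
ALTER COLUMN BirthDate ADD MASKED WITH (FUNCTION = 'default()');

-- User1 sees the stored values because of a permission like this:
GRANT UNMASK TO User1;
```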
You have an enterprise data warehouse in Azure Synapse Analytics.
Using PolyBase, you create an external table named [Ext].[Items] to query Parquet files stored in Azure Data Lake Storage Gen2 without importing the data to the data warehouse.
The external table has three columns.
You discover that the Parquet files have a fourth column named ItemID.
Which command should you run to add the ItemID column to the external table?
A.
ALTER EXTERNAL TABLE [Ext].[Items] ADD [ItemID] int;
B.
DROP EXTERNAL FILE FORMAT parquetfile1; CREATE EXTERNAL FILE FORMAT parquetfile1 WITH ( FORMAT_TYPE = PARQUET, DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec' );
C.
DROP EXTERNAL TABLE [Ext].[Items] CREATE EXTERNAL TABLE [Ext].[Items] ( [ItemID] [int] NULL, [ItemName] nvarchar(50) NULL, [ItemType] nvarchar(20) NULL, [ItemDescription] nvarchar(250) ) WITH ( LOCATION = '/Items/', DATA_SOURCE = AzureDataLakeStore, FILE_FORMAT = PARQUET, REJECT_TYPE = VALUE, REJECT_VALUE = 0 );
D.
ALTER TABLE [Ext].[Items] ADD [ItemID] int;
C is correct, since “altering the schema or format of an external SQL table is not supported”.
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/external-sql-tables
HOTSPOT -
You have two Azure Storage accounts named Storage1 and Storage2. Each account holds one container and has the hierarchical namespace enabled. The system has files that contain data stored in the Apache Parquet format.
You need to copy folders and files from Storage1 to Storage2 by using a Data Factory copy activity. The solution must meet the following requirements:
✑ No transformations must be performed.
✑ The original folder structure must be retained.
✑ Minimize time required to perform the copy activity.
How should you configure the copy activity? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Source dataset type: * Binary * Parquet * Delimited text Copy activity copy behavior: * FlattenHierarchy * MergeFiles * PreserveHierarchy
Box 1: Parquet -
For Parquet datasets, the type property of the copy activity source must be set to ParquetSource.
Box 2: PreserveHierarchy -
PreserveHierarchy (default): Preserves the file hierarchy in the target folder. The relative path of the source file to the source folder is identical to the relative path of the target file to the target folder.
Incorrect Answers:
✑ FlattenHierarchy: All files from the source folder are in the first level of the target folder. The target files have autogenerated names.
✑ MergeFiles: Merges all files from the source folder to one file. If the file name is specified, the merged file name is the specified name. Otherwise, it’s an autogenerated file name.
Reference:
https://docs.microsoft.com/en-us/azure/data-factory/format-parquet https://docs.microsoft.com/en-us/azure/data-factory/connector-azure-data-lake-storage
The answer seems correct: the data is already stored as Parquet, and the requirement is to perform no transformations, so the answer is right.
You have an Azure Data Lake Storage Gen2 container that contains 100 TB of data.
You need to ensure that the data in the container is available for read workloads in a secondary region if an outage occurs in the primary region. The solution must minimize costs.
Which type of data redundancy should you use?
A. geo-redundant storage (GRS)
B. read-access geo-redundant storage (RA-GRS)
C. zone-redundant storage (ZRS)
D. locally-redundant storage (LRS)
B is right
Geo-redundant storage (with GRS or GZRS) replicates your data to another physical location in the secondary region to protect against regional outages. However, that data is available to be read only if the customer or Microsoft initiates a failover from the primary to secondary region. When you enable read access to the secondary region, your data is available to be read at all times, including in a situation where the primary region becomes unavailable.
You plan to implement an Azure Data Lake Gen 2 storage account.
You need to ensure that the data lake will remain available if a data center fails in the primary Azure region. The solution must minimize costs.
Which type of replication should you use for the storage account?
A. geo-redundant storage (GRS)
B. geo-zone-redundant storage (GZRS)
C. locally-redundant storage (LRS)
D. zone-redundant storage (ZRS)
First, about the Question:
What fails? -> The (complete) DataCenter, not the region and not components inside a DataCenter.
So, what helps us in this situation?
LRS: "...copies your data synchronously three times within a single physical location in the primary region." The important part here is the SINGLE PHYSICAL LOCATION (meaning inside the same data center, so in our scenario none of the copies would work anymore).
-> C is wrong.
ZRS: “…copies your data synchronously across three Azure availability zones in the primary region” (meaning, in different Data Centers. In our scenario this would meet the requirements)
-> D is right
GRS/GZRS: are like LRS/ZRS but with the Data Centers in different azure regions. This works too but is more expensive than ZRS. So ZRS is the right answer.
https://docs.microsoft.com/en-us/azure/storage/common/storage-redundancy
Hard
HOTSPOT -
You have a SQL pool in Azure Synapse.
You plan to load data from Azure Blob storage to a staging table. Approximately 1 million rows of data will be loaded daily. The table will be truncated before each daily load.
You need to create the staging table. The solution must minimize how long it takes to load the data to the staging table.
How should you configure the table? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
Distribution: * Hash * Replicated * Round-robin Indexing: * Clustered * Clustered columnstore * Heap Partitioning: * Date * None
Distribution: Round-Robin
Indexing: Heap
Partitioning: None
Round-robin - this is the simplest distribution model, not great for querying but fast to process
Heap - no brainer when creating staging tables
No partitions - this is a staging table, why add effort to partition, when truncated daily?
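A sketch of such a staging table; the table and column names are hypothetical:

```sql
-- Heap + round-robin staging table, no partitions: fastest to load into.
CREATE TABLE dbo.Stage_DailyLoad
(
    OrderID   int,
    OrderDate date,
    Amount    decimal(10, 2)
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
);

-- Daily pattern: truncate, then reload from Blob storage.
TRUNCATE TABLE dbo.Stage_DailyLoad;
```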
You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool. The table contains purchases from suppliers for a retail store. FactPurchase will contain the following columns.
Name             | Data type    | Nullable
PurchaseKey      | bigint       | No
DateKey          | int          | No
SupplierKey      | int          | No
StockItemKey     | int          | No
PurchaseOrderID  | int          | No
OrderedQuantity  | int          | Yes
OrderedOuters    | int          | No
ReceivedOuters   | int          | No
Package          | nvarchar(50) | No
IsOrderFinalized | bit          | No
LineageKey       | int          | No
FactPurchase will have 1 million rows of data added daily and will contain three years of data.
Transact-SQL queries similar to the following query will be executed daily.
SELECT SupplierKey, StockItemKey, IsOrderFinalized, COUNT(*)
FROM FactPurchase
WHERE DateKey >= 20210101 AND DateKey <= 20210131
GROUP BY SupplierKey, StockItemKey, IsOrderFinalized
Which table distribution will minimize query times?
A. replicated
B. hash-distributed on PurchaseKey
C. round-robin
D. hash-distributed on IsOrderFinalized
Correct Answer: B
Hash-distributed tables improve query performance on large fact tables.
To balance the parallel processing, select a distribution column that:
✑ Has many unique values. The column can have duplicate values. All rows with the same value are assigned to the same distribution. Since there are 60 distributions, some distributions can have > 1 unique values while others may end with zero values.
✑ Does not have NULLs, or has only a few NULLs.
✑ Is not a date column.
Incorrect Answers:
C: Round-robin tables are useful for improving loading speed.
Reference:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
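A sketch of the table distributed per answer B; the trailing columns are elided for brevity:

```sql
CREATE TABLE dbo.FactPurchase
(
    PurchaseKey      bigint NOT NULL,  -- many unique values, no NULLs, not a date
    DateKey          int    NOT NULL,
    SupplierKey      int    NOT NULL,
    StockItemKey     int    NOT NULL,
    IsOrderFinalized bit    NOT NULL
    -- remaining columns as listed above
)
WITH
(
    DISTRIBUTION = HASH(PurchaseKey),
    CLUSTERED COLUMNSTORE INDEX
);
```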
Is it hash-distributed on PurchaseKey rather than on IsOrderFinalized because IsOrderFinalized yields fewer distinct values (rows contain only yes/no values) compared to PurchaseKey?
HOTSPOT -
From a website analytics system, you receive data extracts about user interactions such as downloads, link clicks, form submissions, and video plays.
The data contains the following columns (sample values in parentheses):
- EventCategory (Videos)
- EventAction (Play)
- EventLabel (Contoso Promotional)
- ChannelGrouping (Social)
- TotalEvents (150)
- UniqueEvents (120)
- SessionsWithEvents (99)
- Date (15 Jan 2021)
You need to design a star schema to support analytical queries of the data. The star schema will contain four tables including a date dimension.
To which table should you add each column? To answer, select the appropriate options in the answer area.
NOTE: Each correct selection is worth one point.
Hot Area:
EventCategory: * DimChannel * DimDate * DimEvent * FactEvents ChannelGrouping: * DimChannel * DimDate * DimEvent * FactEvents TotalEvents: * DimChannel * DimDate * DimEvent * FactEvents
Box 1: DimEvent -
Box 2: DimChannel -
Box 3: FactEvents -
Fact tables store observations or events, and can be sales orders, stock balances, exchange rates, temperatures, etc
Reference:
https://docs.microsoft.com/en-us/power-bi/guidance/star-schema
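A sketch of the four-table star schema implied by the answers; the surrogate keys and types are illustrative:

```sql
CREATE TABLE dbo.DimEvent
(
    EventKey      int,                 -- hypothetical surrogate key
    EventCategory varchar(50),
    EventAction   varchar(50),
    EventLabel    varchar(100)
);

CREATE TABLE dbo.DimChannel
(
    ChannelKey      int,
    ChannelGrouping varchar(50)
);

CREATE TABLE dbo.DimDate
(
    DateKey int,
    [Date]  date
);

CREATE TABLE dbo.FactEvents
(
    DateKey            int,            -- references DimDate
    EventKey           int,            -- references DimEvent
    ChannelKey         int,            -- references DimChannel
    TotalEvents        int,
    UniqueEvents       int,
    SessionsWithEvents int
);
```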
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain description data that has an average length of 1.1 MB.
You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics.
You need to prepare the files to ensure that the data copies quickly.
Solution: You convert the files to compressed delimited text files.
Does this meet the goal?
A. Yes
B. No
The answer is A
All file formats have different performance characteristics. For the fastest load, use compressed delimited text files.
Reference:
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/guidance-for-loading-data
Compression not only helps reduce the size or space a file occupies in storage, but also increases the speed of file movement during transfer.
Hard
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain description data that has an average length of 1.1 MB.
You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics.
You need to prepare the files to ensure that the data copies quickly.
Solution: You copy the files to a table that has a columnstore index.
Does this meet the goal?
A. Yes
B. No
Correct Answer: B
Instead convert the files to compressed delimited text files.
Reference:
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/guidance-for-loading-data
From the documentation, loads to heap tables are faster than loads to indexed tables. So it is better to use a heap table than a columnstore index table in this case.
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-index#heap-tables
Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.
After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.
You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain description data that has an average length of 1.1 MB.
You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics.
You need to prepare the files to ensure that the data copies quickly.
Solution: You modify the files to ensure that each row is more than 1 MB.
Does this meet the goal?
A. Yes
B. No
Correct Answer: B
Instead convert the files to compressed delimited text files.
Reference:
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/guidance-for-loading-data
No; rows need to be less than 1 MB. A batch size between 100 K and 1 M rows is the recommended baseline for determining optimal batch size capacity.