Data engineering Flashcards
(36 cards)
Which 2 file formats store data in column-based and row-based form, respectively?
Parquet = column-based file format
Avro = row-based file format
What does this SQL command do?
DBCC PDW_SHOWSPACEUSED('[schema].[table]')
This command is specific to Microsoft's SQL Server Parallel Data Warehouse (the dedicated SQL pool).
DBCC stands for Database Console Commands.
PDW_SHOWSPACEUSED shows information about the storage used by a table. Among other things, it returns the row count per distribution, which can be summed to get the total row count of the table.
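A minimal usage sketch (the schema and table name dbo.FactSales are hypothetical):

-- Returns one row per distribution (60 by default).
-- Summing the ROWS column across the result gives the table's total row count.
DBCC PDW_SHOWSPACEUSED('dbo.FactSales');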
There is a Retry option for activities in Azure Data Factory. What does this option specify?
How many times an activity should be retried before it is considered failed (which can in turn fail the whole pipeline).
There are 3 types of dimensions in dimensional (fact/dim) modelling. What are the characteristics of Type 0, 1 and 2?
Type 0 = Static: data is entered once and never changes.
Type 1 = The data is simply overwritten: if a record needs updating, the original row is updated in place and no history is kept.
Type 2 = These dimensions are classified as slowly changing dimensions and often hold multiple records for the same instance to track historical data. They typically have start-date, end-date and IsCurrent attributes to help identify the current record. Soft deletion can also be handled here (table sketch below).
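A minimal T-SQL sketch of what a Type 2 dimension table might look like (all names and the date handling are illustrative assumptions):

-- Each change to a customer inserts a new row; the previous row is closed off.
CREATE TABLE dbo.DimCustomer
(
    CustomerKey  INT           NOT NULL,  -- surrogate key, new value per version
    CustomerId   INT           NOT NULL,  -- business key, repeated across versions
    CustomerName NVARCHAR(100) NOT NULL,
    StartDate    DATE          NOT NULL,
    EndDate      DATE          NULL,      -- NULL (or e.g. 9999-12-31) for the current version
    IsCurrent    BIT           NOT NULL
);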
In an Azure Stream Analytics job, you see a high number of backlogged input events. How can you solve this problem?
Increase the number of streaming units assigned to the job.
How many distributions does a default dedicated SQL pool in Azure Synapse have?
60
What are the 3 kinds of distribution that you can use in a dedicated SQL pool?
Round-robin (default)
Hash
Replicated
How do these 3 distributions work?
Round-robin: the same concept as in load balancing; the first row is placed on the first distribution, the second row on the second distribution, and so on.
Hash: like ordinary hashing, it takes an input x (one or more column values), calculates a hash and places the row on the corresponding distribution; rows with the same hash always end up on the same distribution. This can, however, create data skew.
Replicated: simply replicates the table across all compute nodes.
Assuming a data warehouse under normal conditions (star schema, typical table sizes, typical join patterns and so on), where would each of the 3 distribution types be used?
Hash = large fact tables
Replicated = smaller dimension tables
Round-robin = the staging layer, where rows just need to land somewhere before being transformed (see the DDL sketch below).
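A hedged DDL sketch of the three options in a dedicated SQL pool (table, schema and column names are made up):

-- Large fact table: hash-distributed on a join/aggregation column
CREATE TABLE dbo.FactSales (SaleId BIGINT, CustomerKey INT, Amount DECIMAL(18,2))
WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX);

-- Small dimension table: replicated to every compute node
CREATE TABLE dbo.DimProduct (ProductKey INT, ProductName NVARCHAR(100))
WITH (DISTRIBUTION = REPLICATE);

-- Staging table: round-robin, rows just need to land somewhere
CREATE TABLE stg.SalesStaging (SaleId BIGINT, RawPayload NVARCHAR(4000))
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);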
There are several Azure Stream Analytics jobs in place. We want the data stream segmented into distinct time segments with no overlapping events. Which windowing function should we use?
Tumbling window.
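A minimal Stream Analytics query sketch (the input/output aliases and column names are assumptions):

-- Fixed, non-overlapping 30-second windows; each event belongs to exactly one window
SELECT DeviceId, COUNT(*) AS EventCount
INTO [asa-output]
FROM [asa-input] TIMESTAMP BY EventTime
GROUP BY DeviceId, TumblingWindow(second, 30)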
For a normal General Purpose v2 storage account, what needs to be enabled for it to behave like a Data Lake Storage Gen2 account?
Hierarchical namespace.
There are several Azure Stream Analytics jobs in place. We want to output events only for points in time when the content of the window actually changes. Which windowing function should we use?
Sliding window.
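The same sketch as above, swapping in a sliding window (aliases and column names are again assumptions):

-- Output is produced only when an event enters or exits the 5-minute window,
-- i.e. when the window's content actually changes
SELECT DeviceId, COUNT(*) AS EventCount
INTO [asa-output]
FROM [asa-input] TIMESTAMP BY EventTime
GROUP BY DeviceId, SlidingWindow(minute, 5)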
What is data skew?
When you use hash distribution and a hashed value appears very often, data accumulates on a few distributions while the others barely hold any. This causes sub-optimal performance.
Which service could you use to host your data warehouse?
Azure Synapse Analytics
Which sorts of SQL pools are able to work with EXTERNAL TABLES?
Both the serverless and dedicated SQL pools can use EXTERNAL TABLES.
Are EXTERNAL TABLES persisted or just a logical view on data?
A logical view.
What is PolyBase?
It is the most efficient way to move large amounts of data from external sources (such as CSV, JSON, Avro and Parquet files) into the dedicated SQL pool.
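A hedged sketch of the usual PolyBase load pattern (the data source, file format and all object names are illustrative and assumed to exist):

-- 1) Expose the files in the data lake as an external table (a logical view, not persisted)
CREATE EXTERNAL TABLE ext.Sales (SaleId BIGINT, Amount DECIMAL(18,2))
WITH (LOCATION = '/sales/2024/', DATA_SOURCE = MyDataLake, FILE_FORMAT = MyParquetFormat);

-- 2) Load it into a distributed, persisted table with CTAS
CREATE TABLE dbo.Sales
WITH (DISTRIBUTION = HASH(SaleId), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM ext.Sales;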
Based on Microsoft's own numbers, what is the minimum number of rows you should have per distribution and partition to get good performance?
1 million rows.
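As a worked example: a clustered columnstore fact table with 12 monthly partitions on the default 60 distributions would need roughly 60 × 12 × 1,000,000 ≈ 720 million rows before that level of partitioning pays off.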
In Azure Data Factory, if you want to store pipeline-run data for 60 days, what should you configure?
A diagnostic setting (pipeline-run data is only retained for 45 days by default).
What is a heap table?
Heap tables are tables that do not have a clustered index, so the data is not stored in any particular order. Because there is no index overhead and no sorting on insert, data loads are faster. This makes heaps a good option for small or temporary (staging) tables.
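A minimal sketch in a dedicated SQL pool, where the heap is requested explicitly (names are illustrative):

-- No clustered index: inserts skip sorting, which suits fast staging loads
CREATE TABLE dbo.StagingOrders (OrderId INT, Payload NVARCHAR(4000))
WITH (HEAP, DISTRIBUTION = ROUND_ROBIN);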
What is a Type 3 dimension?
In this approach you add columns instead of rows. So if a name changes in a dimension record, you add a new column named something like "changed name", and the existing name column might be renamed to "original name".
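A small T-SQL sketch of the Type 3 pattern (the column names and the key value 42 are made up):

-- Keep the old value in an extra column on the same row instead of adding a new row
ALTER TABLE dbo.DimCustomer ADD OriginalName NVARCHAR(100) NULL;

UPDATE dbo.DimCustomer
SET OriginalName = CustomerName,   -- preserve the previous value
    CustomerName = 'New Name'      -- overwrite with the current value
WHERE CustomerId = 42;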
What are indexes?
Indexes in SQL are additional structures within a database that speed up data retrieval operations. They are categorized into rowstore and columnstore indexes. Some rowstore indexes (clustered) change the physical order of the data stored in the table (which is row-wise); others (non-clustered) create a separate structure that enables faster search and access to rows. Columnstore indexes, on the other hand, organize data physically in a column-wise format, optimizing for queries that read large volumes of data but only a few columns.
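A few illustrative DDL statements (table and index names are assumptions):

-- Rowstore: a clustered index defines the physical row order,
-- a non-clustered index is a separate lookup structure
CREATE CLUSTERED INDEX IX_Orders_OrderDate ON dbo.Orders (OrderDate);
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId ON dbo.Orders (CustomerId);

-- Columnstore: data stored column-wise, optimized for large analytical scans
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales ON dbo.FactSales;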
In a normal relational database system, how is the data normally stored?
Row-wise.
In a data warehouse system, how is the data normally stored?
Column-wise.