DP-200 Study Notes Flashcards

1
Q

What is a data schema?

A

It’s the skeleton of your data: how the data is structured and how it relates to itself.

2
Q

What is the difference between a physical schema and a logical schema?

A

A physical schema describes how the data is physically stored on disk.

A logical schema is the abstract design: the tables, views and relationships.

3
Q

What’s the difference between a snowflake schema and a star schema?

A

A star schema has one fact table and one level of dimension tables. It’s not normalised and very easy to query.

A snowflake schema has multiple levels of dimension tables. It’s normalised and much slower to query (because you need more joins).
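
For example, a typical star-schema query joins the fact table directly to each dimension (table and column names here are hypothetical):

SELECT d.CalendarYear, p.Category, SUM(f.SalesAmount) AS TotalSales
FROM FactSales f
JOIN DimDate d ON f.DateKey = d.DateKey
JOIN DimProduct p ON f.ProductKey = p.ProductKey
GROUP BY d.CalendarYear, p.Category;

In a snowflake schema, DimProduct would itself be normalised into further tables (e.g. a separate DimCategory), adding extra joins to the same query.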

4
Q

What’s the general difference between an Azure Data Lake, a SQL Database and a SQL Data Warehouse?

A

A data warehouse is used for complex queries with some strong business incentive.

A database holds very recent, operational data, such as how many current orders there are in the business.

A data lake is where you store all your data before its refinement (unstructured).

5
Q

Briefly describe NoSQL

A
  • No schema
  • Best when there is a large amount of data whose relationships and shape change over time
  • Horizontal scaling (shards out across machines)
  • Typically JSON/document, key-value, graph and wide-column data
6
Q

Briefly describe a Data Lake

A

Data in its original form, sitting and waiting until it is needed for processing.

Ingest from the source into the Data Lake using Data Factory. Databricks can then be used to process it for reporting.

7
Q

What are the 5 consistency levels of CosmosDB?

A

Strong, Bounded Staleness, Session, Consistent Prefix, Eventual.

8
Q

What does it mean that CosmosDB is multimodel?

A

It supports the SQL API, Cassandra API, MongoDB API, Gremlin API and Table API.

9
Q

What is the SLA and latency of CosmosDB?

A

The SLA is 99.999% availability (for multi-region accounts).

Latency is <10 ms at the 99th percentile.

10
Q

How many copies of your data are there in CosmosDB?

A

Copies of your data = n × 4, where n is the number of regions you are in (each region holds 4 replicas; e.g. 3 regions → 12 copies).

11
Q

What is the idea behind Logical Partitions?

A

Use a partition key that spreads data across partitions well.
Each logical partition is capped at 10 GB.
Throughput costs depend on the number of containers, the choice of keys and the workload distribution.

12
Q

What is Azure Table Storage?

A

It’s a NoSQL key-value store, commonly used for large amounts of structured data: datasets that don’t require complex joins or queries.
Use it when you don’t need global replication (for which you might use CosmosDB instead).

13
Q

What is Azure Data Warehouse?

A

A repository of non-redundant data.
ETL takes data from the lake into the warehouse.
The cloud is a good place for this because of the scalability, cost and performance.
Fact tables contain the measures and keys to all the dimensions.
Dimension tables contain the attributes.

14
Q

How is Polybase useful for data warehousing in Azure?

A

Polybase allows a data warehouse to be used for ELT: you load the data into the warehouse before you transform it.

15
Q

What are the three types of sharding (table distribution) used in a data warehouse?

A

Hash, Round Robin, Replicate

16
Q

What’s the difference between hash, round robin and replicate?

A

Hash gives the highest query performance for large tables.
Round Robin is the easiest to create and is good for staging tables.
Replicate is fastest for small tables, because a full copy lives on every compute node.
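
As a sketch, here is how each distribution is declared in SQL Data Warehouse T-SQL (table and column names are hypothetical):

-- Large fact table: hash-distribute on a column with many distinct values
CREATE TABLE dbo.FactSales (SaleId int, CustomerKey int, Amount money)
WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX);

-- Staging table: round robin needs no distribution key
CREATE TABLE dbo.StageSales (SaleId int, CustomerKey int, Amount money)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);

-- Small dimension table: a full copy on every compute node
CREATE TABLE dbo.DimDate (DateKey int, CalendarYear int)
WITH (DISTRIBUTION = REPLICATE);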

17
Q

Is Azure SQL Database vertical or horizontal scaling?

A

It is vertical scaling

18
Q

What are the three options for Azure SQL Database?

A

Single Database, elastic pool or managed instance

19
Q

When do you use a single database in Azure SQL Database?

A

You use a single database when you want one database with its own guaranteed set of resources.

20
Q

When do you use elastic pool in Azure SQL Database?

A

An elastic pool is a group of databases that share resources.

21
Q

What is a managed instance of Azure SQL Database?

A

A managed instance shares resources in a similar way to an elastic pool, but offers functionality much closer to on-prem SQL Server.

22
Q

What are the two pricing models for Azure SQL Database?

A

You can buy vCores or DTUs.

23
Q

What’s the difference between DTUs and vCores?

A

DTUs are pre-configured bundles of compute, storage and IO. If you use more than 300 DTUs it is usually better to switch to vCores. vCores are better for high-cost databases, or when you need to monitor and trim costs; DTUs are best when running inexpensive databases.

24
Q

How do DTUs and vCores differ in terms of pricing?

A

With DTUs you pay one fixed price that covers your compute as well as data storage and backup retention. With vCores, compute and storage are billed separately.

25
Q

What is Hyperscale in Azure SQL Database?

A

Hyperscale, available with the vCore model, allows databases to grow far beyond the usual limits (up to 100 TB).

26
Q

Can you use vCores or DTUs for elastic pools in Azure SQL Database?

A

Yes, you can use both

27
Q

When should you use SQL Database Managed Instance?

A

It’s ideal for migrating from on-prem to the Azure cloud. It has native virtual network (VNet) integration and automated backups, but it is more expensive than an elastic pool.

28
Q

Explain Azure SQL Database Backups

A

Backups are automatic and stored in read-access geo-redundant storage (RA-GRS).

Full backups are taken weekly, differential backups every 12 hours and transaction log backups every 5-10 minutes.

Backups are good for restoring deleted tables, restoring to a different region and restoring to a point in time.

Long-term retention keeps backups for longer as blobs.

29
Q

What is dynamic data masking?

A

DDM limits who can see sensitive data items by masking them in query results in real time; the underlying data is unchanged.

30
Q

How do you implement DDM?

A

Using the Azure portal, or by adding a masking rule in PowerShell. Masks can also be defined directly in T-SQL.
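
A minimal T-SQL sketch, assuming a hypothetical dbo.Customers table and ReportingUser:

-- Add a mask to an existing column
ALTER TABLE dbo.Customers
ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

-- Let a privileged user see unmasked data
GRANT UNMASK TO ReportingUser;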

31
Q

Should you use DDM with other security techniques?

A

Yes! DDM complements other security features such as encryption, auditing and access control; it is not a replacement for them.

32
Q

What does the following datamask do?

EMAIL varchar(100) MASKED WITH (FUNCTION = 'email()') NULL

A

It shows the first letter of the email address and the constant suffix “.com”, masking the rest (e.g. aXX@XXXX.com).

33
Q

What does the following datamask do?

FirstName varchar(100) MASKED WITH (FUNCTION = 'partial(1, "XXXXXXX", 0)') NULL

A

Shows the first letter of FirstName and masks the rest with “XXXXXXX” (expose 1 leading character, pad with the custom string, expose 0 trailing characters).

34
Q

What does it mean to encrypt data at rest?

A

Transparent Data Encryption (TDE): encrypting at the file level, preventing the data from being copied to another location and read or modified. Done using AES and Triple DES encryption.

CosmosDB uses secure key storage and encrypts at rest by default.

Data Lake: encryption is on by default and the keys are managed by you.
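
In Azure SQL Database TDE is on by default, but as a sketch it can be toggled per database in T-SQL (the database name is hypothetical):

-- Turn on transparent data encryption for one database
ALTER DATABASE SalesDb SET ENCRYPTION ON;

-- Check which databases are encrypted
SELECT name, is_encrypted FROM sys.databases;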

35
Q

What does it mean to encrypt data in motion?

A

TLS/SSL is used for transferring data between the cloud and the customer.
Perfect Forward Secrecy (PFS) is also used to protect data moving between customers’ client systems and the cloud.

Shared access signatures (SAS)
  • Delegate access to Azure Storage objects

Data Lake
  • HTTPS is the only protocol available for REST

36
Q

What is batch processing?

A

It’s a method of running high-volume, repetitive jobs without user interaction.

  • In a bank: a nightly run of all the transactions
  • The challenge is input data not being correct, and what happens if an error occurs
  • Batch processing tools: Data Factory, U-SQL, Spark, Pig or Hive
  • Extract, Transform and Load (ETL): the output is stored in a structured data warehouse
37
Q

What is Apache Spark?

A

A cluster-computing framework for solving problems and working on big data projects.

38
Q

What is a driver process in Apache Spark?

A

The driver process maintains information about the Spark application, responds to the program or input, and analyses and schedules work across the cluster.

39
Q

What does an executor process do in Apache Spark?

A

The executor does the work that the driver process tells it to do.

40
Q

Can multiple applications work on the same cluster in Apache Spark?

A

Yes!

41
Q

What is RDD in Apache Spark?

A

RDD stands for Resilient Distributed Dataset: fault-tolerant data that exists across several nodes.

It is the fundamental data structure of Apache Spark.

42
Q

What are transformations and actions in Apache Spark?

A
  • Transformations are lazy: data is only loaded when necessary
  • Transformations only execute when the driver requests information
  • An action returns a value after running the computation
43
Q

What is Databricks in Azure?

A

Jupyter notebooks on managed Apache Spark clusters, basically.

44
Q

What is Polybase and what is it used for?

A

It allows you to bypass the ETL process in favour of the ELT process.

Polybase performs the transformation steps within the data warehouse.

45
Q

What are the steps involved in Polybase?

A
  1. Extract the source data into text files
  2. Land the data into Data Lake or Azure Blob
  3. Prepare the data for loading
  4. Load the data into SQL Data Warehouse staging table with Polybase
  5. Transform the data
  6. Insert the data into production tables
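
A minimal T-SQL sketch of steps 4-6, assuming an external data source and file format have already been created (all names here are hypothetical):

-- Step 4: external table over the text files landed in the lake, loaded with CTAS
CREATE EXTERNAL TABLE ext.Sales (SaleId int, Amount money)
WITH (LOCATION = '/sales/', DATA_SOURCE = MyLake, FILE_FORMAT = CsvFormat);

CREATE TABLE dbo.StageSales WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM ext.Sales;

-- Steps 5-6: transform, then insert into the production table
INSERT INTO dbo.FactSales
SELECT SaleId, Amount FROM dbo.StageSales;
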
46
Q

What is Stream Analytics?

A

Real-time processing and analysis of streaming data.

47
Q

What are some use cases of stream analytics?

A

Security or fraud detection, traffic sensors, healthcare.

48
Q

Where does stream analytics get data from usually?

A

Azure Event Hubs or IoT Hub.

49
Q

What are the 3 components to Stream Analytics job?

A

Input, query and output

50
Q

What are queries based on in Stream analytics?

A

They are based on SQL and use different windowing functions.

51
Q

What is the idea behind windowing in Stream analytics?

A

You have to put some sort of boundary on streaming data.

52
Q

Explain a tumbling window?

A

A tumbling window has a fixed length. The next window is placed right after the end of the previous one on the time axis. Tumbling windows do not overlap and span the whole time domain, i.e. each event is assigned to exactly one window
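
For example, in the Stream Analytics query language (input/output names and the EventTime column are hypothetical), counting events per fixed 10-second window:

SELECT COUNT(*) AS EventCount, System.Timestamp() AS WindowEnd
INTO [output]
FROM [input] TIMESTAMP BY EventTime
GROUP BY TumblingWindow(second, 10)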

53
Q

Explain a hopping window?

A

A hopping window is similar to a tumbling window. You still have a set window length, but there is now a “hop”, an overlap, so a single event can fall into more than one window.
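
For example, a 10-second window that hops forward every 5 seconds (same hypothetical names as above):

SELECT COUNT(*) AS EventCount
INTO [output]
FROM [input] TIMESTAMP BY EventTime
GROUP BY HoppingWindow(second, 10, 5)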

54
Q

Explain a sliding window?

A

Sliding windows are, for example, used to compute moving averages.

The window slides along the time axis, calculating e.g. the number of events per window; output is produced whenever an event enters or exits the window.
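
For example, a count over the last 60 seconds (same hypothetical names):

SELECT COUNT(*) AS EventCount
INTO [output]
FROM [input] TIMESTAMP BY EventTime
GROUP BY SlidingWindow(second, 60)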

55
Q

Explain a session window?

A

A session window is used to group events that happen close together. There is usually a burst of them and then a gap.

A session window has a timeout and a maximum size; any events that occur within the timeout of each other are added to the same window.
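
For example, grouping events separated by gaps of no more than 5 minutes, capping each window at 10 minutes (same hypothetical names):

SELECT COUNT(*) AS EventCount
INTO [output]
FROM [input] TIMESTAMP BY EventTime
GROUP BY SessionWindow(minute, 5, 10)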

56
Q

When are pipelines triggered in Data Factory?

A

At certain times of day (schedule triggers) or when an event occurs (event-based triggers).

57
Q

What are linked services used for in Data Factory?

A

They define the connection information needed to connect to external data stores (like a connection string); a dataset then uses a linked service to point at the actual data.

58
Q

How do you monitor Storage Accounts in Azure?

A

Use Azure Storage Analytics for metrics about the storage services and logs for blobs, queues and tables.

Useful for tracking requests, analysing usage trends and diagnosing issues with storage.

Metrics gives you nice fancy graphs.
Diagnostic settings is basically settings.
Alerts allow you to set alerts if usage is getting high.

59
Q

What’s the difference between Azure Monitor and Azure Storage Analytics?

A

Azure Monitor is a centralised place for all of your monitoring. It assists with troubleshooting issues: diagnose, trace and debug failures in near real time, optimise performance, reduce bottlenecks and reduce costs.

  • Application monitoring: performance and functioning of your code
  • Guest OS monitoring data: the operating system
  • Azure resource monitoring
  • Azure subscription monitoring data: overall health of Azure and your subscription
  • Azure tenant monitoring data

Logs and Metrics are the two pieces of Azure Monitor. Metrics = graphs. Logs = output.

You can create a rule to tell Azure Monitor to take corrective action and notify you of critical conditions, via email or SMS.

Autoscale allows you to provision the right amount of resources to handle the load, based on Azure Monitor metrics.

60
Q

How do you monitor in Azure SQL Database?

A

Key metrics: CPU usage, wait statistics (why queries are waiting on resources), I/O usage (the I/O limits of the underlying storage) and memory usage.

Tools include

  • Metrics charts: for DTU consumption and resources approaching max utilisation
  • SQL Server Management Studio: performance dashboard to monitor the top resource-consuming queries
  • SQL Database Advisor: view recommendations for creating and dropping indexes, parameterising queries and fixing schema issues
  • Azure SQL Intelligent Insights: automatic monitoring of database performance
  • Automatic tuning: let Azure automatically apply changes
61
Q

How do you monitor in SQL Data Warehouse?

A

SQL Data Warehouse has the same metrics, and you mostly use Azure Monitor.

62
Q

What are the three ways to gather metrics?

A
  • Performance metrics on the Metrics page in the side panel
  • Performance metrics using Azure Monitor
  • Performance metrics on the Account page
63
Q

How do you set up alerts in CosmosDB?

A

You can set up alerts in the portal. Rules can be configured through CosmosDB alert rules.

Alert rules can be set up in the side panel, with the resource, condition and action flow.

64
Q

What are the four main monitoring items in Stream Analytics?

A

The four main ones are

  • SU % utilization (the more SUs, the more CPU and memory your job gets: how well are we using the streaming units?)
  • Runtime errors (should be 0; you can test to see)
  • Watermark delay: the most reliable signal of job health, calculated by comparing when an event first occurs with when it is processed and output
  • Input deserialisation errors: the input stream has malformed messages, such as a missing semicolon

Alerts can be set up with the resource, condition and action flow.
65
Q

What does monitoring in Data Factory look like?

A

You can configure alerts by defining logic, severity and notification type.

No code required. Stored run data is kept for 45 days.

Use Azure Monitor for complex queries and monitoring across multiple data factories; advanced scenarios are handled there. For example, you can create an alert for pipeline failures in Data Factory.

66
Q

What is monitoring like in Azure Databricks?

A

Different from the other products.

Ganglia is used to check whether the cluster is healthy.
Azure Monitor is recommended but isn’t natively configured for Databricks. Grafana is an open-source graphing tool used to visualise the metrics.

67
Q

How do you optimise Synapse Analytics (SQL Data Warehouse)?

A
  1. Have you checked for Advisor recommendations?
  2. Polybase: use Polybase for ELT, and use CTAS (CREATE TABLE AS SELECT)
  3. Hash-distribute large tables
  4. Remember the rule of 60 (don’t over-partition)
  5. Minimise transaction size
  6. Reduce query result sizes
  7. Minimise possible column size
  8. Maximise throughput by breaking gzip files into 60+ files
68
Q

How do you optimise data lake?

A

Hard to do.
Parallel copy in Data Factory will maximise parallelisation.
Keep file sizes between 256 MB and 100 GB where possible.

69
Q

How do you optimise Stream Analytics?

A

Review the SU % utilisation metric

  • 80% is the red line
  • Start with 6 SUs for queries not using PARTITION BY
  • Outputs need to be partitioned
  • Events should go to the same partition of your input
  • PARTITION BY should be used in all steps (see the sketch below)
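
A minimal sketch of a fully partitioned Stream Analytics step, assuming a PartitionId column and hypothetical input/output names:

SELECT PartitionId, COUNT(*) AS EventCount
INTO [output]
FROM [input] TIMESTAMP BY EventTime PARTITION BY PartitionId
GROUP BY PartitionId, TumblingWindow(minute, 1)
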
70
Q

How do you optimise SQL Database?

A

Intelligent Performance tab
  • Summary of database performance and the first stop for optimisation

Performance Recommendations
  • Indexes to create or drop
  • Schema issues identified in the database
  • Queries that can benefit from being parameterised

Query Performance Insight
  • Insight into DTU consumption
  • Top CPU-consuming queries
  • Drill down into detail on query results

Automatic tuning
  • ML to apply appropriate tuning