DP-200 Study Notes Flashcards

1
Q

What is a data schema?

A

It’s the skeleton of your data: how the data is structured and how it relates to itself.

2
Q

What is the difference between a physical schema and a logical schema?

A

A physical schema describes how the data is physically stored on disk.

A logical schema is the abstract design: the tables, views and relationships.

3
Q

What’s the difference between a snowflake schema and a star schema?

A

A star schema has one fact table and one level of dimension tables. It’s not normalised and very easy to query.

A snowflake schema has multiple levels of dimension tables. It’s normalised and much slower to query (because you need more joins).
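
For example, a typical star-schema query joins the fact table directly to each dimension (table and column names here are hypothetical):

SELECT d.CalendarYear, p.Category, SUM(f.SalesAmount) AS TotalSales
FROM FactSales f
JOIN DimDate d ON f.DateKey = d.DateKey
JOIN DimProduct p ON f.ProductKey = p.ProductKey
GROUP BY d.CalendarYear, p.Category;

In a snowflake schema, DimProduct would itself be normalised into further tables (e.g. a separate DimCategory), adding extra joins to the same query.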

4
Q

What’s the general difference between an Azure Data Lake, a SQL Database and a SQL Data Warehouse?

A

A data warehouse is used for complex queries with some strong business incentive.

A database holds very recent, operational data, such as how many current orders there are in the business.

A data lake is where you store all your data before its refinement (unstructured).

5
Q

Briefly describe NoSQL

A
  • No schema
  • Best when there is a large amount of data whose relationships and shape change over time
  • Horizontal scaling (shards out across machines)
  • Typically JSON/document, key-value, graph and wide-column data
6
Q

Briefly describe a Data Lake

A

Data in its original form, sitting and waiting until it is needed for processing.

Ingest from the source into the Data Lake using Data Factory. Databricks can then be used to process it for reporting.

7
Q

What are the 5 consistency levels of CosmosDB?

A

Strong, Bounded Staleness, Session, Consistent Prefix, Eventual.

8
Q

What does it mean that CosmosDB is multimodel?

A

It supports the SQL API, Cassandra API, MongoDB API, Gremlin API and Table API.

9
Q

What is the SLA and latency of CosmosDB?

A

The SLA is 99.999% availability (for multi-region accounts).

Latency is <10 ms at the 99th percentile.

10
Q

How many copies of your data are there in CosmosDB?

A

Copies of your data = n × 4, where n is the number of regions you are in (each region holds 4 replicas; e.g. 3 regions → 12 copies).

11
Q

What is the idea behind Logical Partitions?

A

Use a partition key that spreads data across partitions well.
Each logical partition is capped at 10 GB.
Throughput costs depend on the number of containers, the choice of keys and the workload distribution.

12
Q

What is Azure Table Storage?

A

It’s a NoSQL key-value store, commonly used for large amounts of structured data: datasets that don’t require complex joins or queries.
Use it when you don’t need global replication (for which you might use CosmosDB instead).

13
Q

What is Azure Data Warehouse?

A

A repository of non-redundant data.
ETL takes data from the lake into the warehouse.
The cloud is a good place for this because of the scalability, cost and performance.
Fact tables contain the measures and keys to all the dimensions.
Dimension tables contain the attributes.

14
Q

How is Polybase useful for data warehousing in Azure?

A

Polybase allows a data warehouse to be used for ELT: you load the data into the warehouse before you transform it.

15
Q

What are the three types of sharding (table distribution) used in a data warehouse?

A

Hash, Round Robin, Replicate

16
Q

What’s the difference between hash, round robin and replicate?

A

Hash gives the highest query performance for large tables.
Round Robin is the easiest to create and is good for staging tables.
Replicate is fastest for small tables, because a full copy lives on every compute node.
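
As a sketch, here is how each distribution is declared in SQL Data Warehouse T-SQL (table and column names are hypothetical):

-- Large fact table: hash-distribute on a column with many distinct values
CREATE TABLE dbo.FactSales (SaleId int, CustomerKey int, Amount money)
WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX);

-- Staging table: round robin needs no distribution key
CREATE TABLE dbo.StageSales (SaleId int, CustomerKey int, Amount money)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);

-- Small dimension table: a full copy on every compute node
CREATE TABLE dbo.DimDate (DateKey int, CalendarYear int)
WITH (DISTRIBUTION = REPLICATE);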

17
Q

Is Azure SQL Database vertical or horizontal scaling?

A

It is vertical scaling

18
Q

What are the three options for Azure SQL Database?

A

Single Database, elastic pool or managed instance

19
Q

When do you use a single database in Azure SQL Database?

A

You use a single database when you want one database with its own guaranteed set of resources.

20
Q

When do you use elastic pool in Azure SQL Database?

A

An elastic pool is a group of databases that share resources.

21
Q

What is a managed instance of Azure SQL Database?

A

A managed instance shares resources in a similar way to an elastic pool, but offers functionality much closer to on-prem SQL Server.

22
Q

What are the two pricing models for Azure SQL Database?

A

You can buy vCores or DTUs.

23
Q

What’s the difference between DTUs and vCores?

A

DTUs are pre-configured bundles of compute, storage and IO. If you use more than 300 DTUs it is usually better to switch to vCores. vCores are better for high-cost databases, or when you need to monitor and trim costs; DTUs are best when running inexpensive databases.

24
Q

How do DTUs and vCores differ in terms of pricing?

A

With DTUs you pay one fixed price that covers your compute as well as data storage and backup retention. With vCores, compute and storage are billed separately.

25
Q

What is Hyperscale in Azure SQL Database?

A

Hyperscale, available with the vCore model, allows databases to grow far beyond the usual limits (up to 100 TB).

26
Q

Can you use vCores or DTUs for elastic pools in Azure SQL Database?

A

Yes, you can use both

27
Q

When should you use SQL Database Managed Instance?

A

It’s ideal for migrating from on-prem to the Azure cloud. It has native virtual network (VNet) integration and automated backups, but it is more expensive than an elastic pool.

28
Q

Explain Azure SQL Database Backups

A

Backups are automatic and stored in read-access geo-redundant storage (RA-GRS).

Full backups are taken weekly, differential backups every 12 hours and transaction log backups every 5-10 minutes.

Backups are good for restoring deleted tables, restoring to a different region and restoring to a point in time.

Long-term retention keeps backups for longer as blobs.

29
Q

What is dynamic data masking?

A

DDM limits who can see sensitive data items by masking them in query results in real time; the underlying data is unchanged.

30
Q

How do you implement DDM?

A

Using the Azure portal, or by adding a masking rule in PowerShell. Masks can also be defined directly in T-SQL.
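
A minimal T-SQL sketch, assuming a hypothetical dbo.Customers table and ReportingUser:

-- Add a mask to an existing column
ALTER TABLE dbo.Customers
ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

-- Let a privileged user see unmasked data
GRANT UNMASK TO ReportingUser;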

31
Q

Should you use DDM with other security techniques?

A

Yes! DDM complements other security features such as encryption, auditing and access control; it is not a replacement for them.

32
Q

What does the following datamask do?

EMAIL varchar(100) MASKED WITH (FUNCTION = 'email()') NULL

A

It shows the first letter of the email address and the constant suffix “.com”, masking the rest (e.g. aXX@XXXX.com).

33
Q

What does the following datamask do?

FirstName varchar(100) MASKED WITH (FUNCTION = 'partial(1, "XXXXXXX", 0)') NULL

A

Shows the first letter of FirstName and masks the rest with “XXXXXXX” (expose 1 leading character, pad with the custom string, expose 0 trailing characters).

34
Q

What does it mean to encrypt data at rest?

A

Transparent Data Encryption (TDE): encrypting at the file level, preventing the data from being copied to another location and read or modified. Done using AES and Triple DES encryption.

CosmosDB uses secure key storage and encrypts at rest by default.

Data Lake: encryption is on by default and the keys are managed by you.
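
In Azure SQL Database TDE is on by default, but as a sketch it can be toggled per database in T-SQL (the database name is hypothetical):

-- Turn on transparent data encryption for one database
ALTER DATABASE SalesDb SET ENCRYPTION ON;

-- Check which databases are encrypted
SELECT name, is_encrypted FROM sys.databases;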

35
Q

What does it mean to encrypt data in motion?

A

TLS/SSL is used for transferring data between the cloud and the customer.
Perfect Forward Secrecy (PFS) is also used to protect data moving between customers’ client systems and the cloud.

Shared access signatures (SAS)
  • Delegate access to Azure Storage objects

Data Lake
  • HTTPS is the only protocol available for REST

36
Q

What is batch processing?

A

It’s a method of running high-volume, repetitive jobs without user interaction.

  • In a bank: a nightly run of all the transactions
  • The challenge is input data not being correct, and what happens if an error occurs
  • Batch processing tools: Data Factory, U-SQL, Spark, Pig or Hive
  • Extract, Transform and Load (ETL): the output is stored in a structured data warehouse
37
Q

What is Apache Spark?

A

A cluster-computing framework for solving problems and working on big data projects.

38
Q

What is a driver process in Apache Spark?

A

The driver process maintains information about the Spark application, responds to the program or input, and analyses and schedules work across the cluster.

39
Q

What does an executor process do in Apache Spark?

A

The executor does the work that the driver process tells it to do.

40
Q

Can multiple applications work on the same cluster in Apache Spark?

A

Yes!

41
Q

What is RDD in Apache Spark?

A

RDD stands for Resilient Distributed Dataset: fault-tolerant data that exists across several nodes.

It is the fundamental data structure of Apache Spark.

42
Q

What are transformations and actions in Apache Spark?

A
  • Transformations are lazy: data is only loaded when necessary
  • Transformations only execute when the driver requests information
  • An action returns a value after running the computation
43
Q

What is Databricks in Azure?

A

Jupyter notebooks on managed Apache Spark clusters, basically.

44
Q

What is Polybase and what is it used for?

A

It allows you to bypass the ETL process in favour of the ELT process.

Polybase performs the transformation steps within the data warehouse.

45
Q

What are the steps involved in Polybase?

A
  1. Extract the source data into text files
  2. Land the data into Data Lake or Azure Blob
  3. Prepare the data for loading
  4. Load the data into SQL Data Warehouse staging table with Polybase
  5. Transform the data
  6. Insert the data into production tables
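
A minimal T-SQL sketch of steps 4-6, assuming an external data source and file format have already been created (all names here are hypothetical):

-- Step 4: external table over the text files landed in the lake, loaded with CTAS
CREATE EXTERNAL TABLE ext.Sales (SaleId int, Amount money)
WITH (LOCATION = '/sales/', DATA_SOURCE = MyLake, FILE_FORMAT = CsvFormat);

CREATE TABLE dbo.StageSales WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM ext.Sales;

-- Steps 5-6: transform, then insert into the production table
INSERT INTO dbo.FactSales
SELECT SaleId, Amount FROM dbo.StageSales;
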
46
Q

What is Stream Analytics?

A

Real-time processing and analysis of streaming data.

47
Q

What are some use cases of stream analytics?

A

Security or fraud detection, traffic sensors, healthcare.

48
Q

Where does stream analytics get data from usually?

A

Azure Event Hubs or IoT Hub.

49
Q

What are the 3 components to Stream Analytics job?

A

Input, query and output

50
Q

What are queries based on in Stream analytics?

A

They are based on SQL and use different windowing functions.

51
Q

What is the idea behind windowing in Stream analytics?

A

You have to put some sort of boundary on streaming data.

52
Q

Explain a tumbling window?

A

A tumbling window has a fixed length. The next window is placed right after the end of the previous one on the time axis. Tumbling windows do not overlap and span the whole time domain, i.e. each event is assigned to exactly one window
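
For example, in the Stream Analytics query language (input/output names and the EventTime column are hypothetical), counting events per fixed 10-second window:

SELECT COUNT(*) AS EventCount, System.Timestamp() AS WindowEnd
INTO [output]
FROM [input] TIMESTAMP BY EventTime
GROUP BY TumblingWindow(second, 10)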

53
Q

Explain a hopping window?

A

A hopping window is similar to a tumbling window. You still have a set window length, but there is now a “hop”, an overlap, so a single event can fall into more than one window.
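
For example, a 10-second window that hops forward every 5 seconds (same hypothetical names as above):

SELECT COUNT(*) AS EventCount
INTO [output]
FROM [input] TIMESTAMP BY EventTime
GROUP BY HoppingWindow(second, 10, 5)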

54
Q

Explain a sliding window?

A

Sliding windows are, for example, used to compute moving averages.

The window slides along the time axis, calculating e.g. the number of events per window; output is produced whenever an event enters or exits the window.
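
For example, a count over the last 60 seconds (same hypothetical names):

SELECT COUNT(*) AS EventCount
INTO [output]
FROM [input] TIMESTAMP BY EventTime
GROUP BY SlidingWindow(second, 60)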

55
Q

Explain a session window?

A

A session window is used to group events that happen close together. There is usually a burst of them and then a gap.

A session window has a timeout and a maximum size; any events that occur within the timeout of each other are added to the same window.
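
For example, grouping events separated by gaps of no more than 5 minutes, capping each window at 10 minutes (same hypothetical names):

SELECT COUNT(*) AS EventCount
INTO [output]
FROM [input] TIMESTAMP BY EventTime
GROUP BY SessionWindow(minute, 5, 10)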

56
Q

When are pipelines triggered in Data Factory?

A

At certain times of day (schedule triggers) or when an event occurs (event-based triggers).

57
Q

What are linked services used for in Data Factory?

A

They define the connection information needed to connect to external data stores (like a connection string); a dataset then uses a linked service to point at the actual data.

58
Q

How do you monitor Storage Accounts in Azure?

A

Use Azure Storage Analytics for metrics about the storage services and logs for blobs, queues and tables.

Useful for tracking requests, analysing usage trends and diagnosing issues with storage.

Metrics gives you nice fancy graphs.
Diagnostic settings is basically settings.
Alerts allow you to set alerts if usage is getting high.

59
Q

What’s the difference between Azure Monitor and Azure Storage Analytics?

A

Azure Monitor is a centralised place for all of your monitoring. It assists with troubleshooting issues: diagnose, trace and debug failures in near real time, optimise performance, reduce bottlenecks and reduce costs.

  • Application monitoring: performance and functioning of your code
  • Guest OS monitoring data: the operating system
  • Azure resource monitoring
  • Azure subscription monitoring data: overall health of Azure and your subscription
  • Azure tenant monitoring data

Logs and Metrics are the two pieces of Azure Monitor. Metrics = graphs. Logs = output.

You can create a rule to tell Azure Monitor to take corrective action and notify you of critical conditions, via email or SMS.

Autoscale allows you to provision the right amount of resources to handle the load, based on Azure Monitor metrics.

60
Q

How do you monitor in Azure SQL Database?

A

Key metrics: CPU usage, wait statistics (why queries are waiting on resources), I/O usage (the I/O limits of the underlying storage) and memory usage.

Tools include

  • Metrics charts: for DTU consumption and resources approaching max utilisation
  • SQL Server Management Studio: performance dashboard to monitor the top resource-consuming queries
  • SQL Database Advisor: view recommendations for creating and dropping indexes, parameterising queries and fixing schema issues
  • Azure SQL Intelligent Insights: automatic monitoring of database performance
  • Automatic tuning: let Azure automatically apply changes
61
Q

How do you monitor in SQL Data Warehouse?

A

SQL Data Warehouse has the same metrics, and you mostly use Azure Monitor.

62
Q

What are the three ways to gather metrics?

A
  • Performance metrics on the Metrics page in the side panel
  • Performance metrics using Azure Monitor
  • Performance metrics on the Account page
63
Q

How do you set up alerts in CosmosDB?

A

You can set up alerts in the portal. Rules can be configured through CosmosDB alert rules.

Alert rules can be set up in the side panel, with the resource, condition and action flow.

64
Q

What are the four main monitoring items in Stream Analytics?

A

The four main ones are

  • SU % utilization (the more SUs, the more CPU and memory your job gets: how well are we using the streaming units?)
  • Runtime errors (should be 0; you can test to see)
  • Watermark delay: the most reliable signal of job health, calculated by comparing when an event first occurs with when it is processed and output
  • Input deserialisation errors: the input stream has malformed messages, such as a missing semicolon

Alerts can be set up with the resource, condition and action flow.
65
Q

What does monitoring in Data Factory look like?

A

You can configure alerts by defining logic, severity and notification type.

No code required. Stored run data is kept for 45 days.

Use Azure Monitor for complex queries and monitoring across multiple data factories; advanced scenarios are handled there. For example, you can create an alert for pipeline failures in Data Factory.

66
Q

What is monitoring like in Azure Databricks?

A

Different from the other products.

Ganglia is used to check whether the cluster is healthy.
Azure Monitor is recommended but isn’t natively configured for Databricks. Grafana is an open-source graphing tool used to visualise the metrics.

67
Q

How do you optimise Synapse Analytics (SQL Data Warehouse)?

A
  1. Have you checked for Advisor recommendations?
  2. Polybase: use Polybase for ELT, and use CTAS (CREATE TABLE AS SELECT)
  3. Hash-distribute large tables
  4. Remember the rule of 60 (don’t over-partition)
  5. Minimise transaction size
  6. Reduce query result sizes
  7. Minimise possible column size
  8. Maximise throughput by breaking gzip files into 60+ files
68
Q

How do you optimise data lake?

A

Hard to do.
Parallel copy in Data Factory will maximise parallelisation.
Keep file sizes between 256 MB and 100 GB where possible.

69
Q

How do you optimise Stream Analytics?

A

Review the SU % utilisation metric

  • 80% is the red line
  • Start with 6 SUs for queries not using PARTITION BY
  • Outputs need to be partitioned
  • Events should go to the same partition of your input
  • PARTITION BY should be used in all steps (see the sketch below)
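
A minimal sketch of a fully partitioned Stream Analytics step, assuming a PartitionId column and hypothetical input/output names:

SELECT PartitionId, COUNT(*) AS EventCount
INTO [output]
FROM [input] TIMESTAMP BY EventTime PARTITION BY PartitionId
GROUP BY PartitionId, TumblingWindow(minute, 1)
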
70
Q

How do you optimise SQL Database?

A

Intelligent Performance tab
  • Summary of database performance and the first stop for optimisation

Performance Recommendations
  • Indexes to create or drop
  • Schema issues identified in the database
  • Queries that can benefit from being parameterised

Query Performance Insight
  • Insight into DTU consumption
  • Top CPU-consuming queries
  • Drill down into detail on query results

Automatic tuning
  • ML to apply appropriate tuning