Introduction to Data Engineering Flashcards
The role of a data engineer
- the primary role responsible for integrating, transforming, and consolidating data from various structured and unstructured data systems into structures that are suitable for building analytics solutions.
- helps ensure that data pipelines and data stores are high-performing, efficient, organized, and reliable, given a specific set of business requirements and constraints.
Types of Data
- Structured
- Unstructured
- Semi-structured
Structured data
- primarily comes from table-based source systems
- the rows and columns are aligned consistently throughout the file.
- common examples: comma-separated values (CSV) files and relational database (RDB) tables
Semi-structured Data
- such as JavaScript Object Notation (JSON) files, which may require flattening before they can be loaded into a tabular structure (see the flattening sketch below).
- doesn’t have to fit neatly into a table structure.
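Example: a minimal sketch of flattening nested JSON records into a tabular shape using pandas (the records and field names are made up for illustration).

```python
import pandas as pd

# Hypothetical nested JSON records, e.g. exported from an application API.
orders = [
    {"id": 1, "customer": {"name": "Ana", "city": "Lisbon"}, "total": 42.5},
    {"id": 2, "customer": {"name": "Ben", "city": "Porto"}, "total": 17.0},
]

# json_normalize flattens the nested "customer" object into columns such as
# "customer.name" and "customer.city", so the data fits a table structure.
flat = pd.json_normalize(orders)
print(flat)
```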
Unstructured data
- data stored as key-value pairs that don't adhere to standard relational models
- other commonly used types include Portable Document Format (PDF) files, word processor documents, and images
Main data operations
- Data integration
- Data transformation
- Data consolidation
Data integration
establishing links between operational and analytical services and data sources to enable secure, reliable access to data across multiple systems.
Data transformation
- typically performed as part of an extract, transform, and load (ETL) process
- the data is prepared to support downstream analytical needs (see the ETL sketch below)
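Example: a minimal ETL sketch with pandas, assuming a hypothetical sales.csv extract and a Parquet output file; in Azure this step is more often implemented with Azure Data Factory or Apache Spark.

```python
import pandas as pd

# Extract: read raw operational data (hypothetical file name and columns).
raw = pd.read_csv("sales.csv")

# Transform: parse dates, drop incomplete rows, aggregate to a daily grain.
raw["order_date"] = pd.to_datetime(raw["order_date"])
daily = (
    raw.dropna(subset=["amount"])
       .assign(order_day=lambda df: df["order_date"].dt.date)
       .groupby("order_day", as_index=False)["amount"]
       .sum()
       .rename(columns={"amount": "daily_revenue"})
)

# Load: write the prepared data in an analytics-friendly format
# (requires a Parquet engine such as pyarrow).
daily.to_parquet("daily_revenue.parquet", index=False)
```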
Data consolidation
- the process of combining data that has been extracted from multiple data sources into a consistent structure - usually to support analytics and reporting.
- data from operational systems is extracted, transformed, and loaded into analytical stores such as a data lake or data warehouse (see the consolidation sketch below).
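Example: a minimal consolidation sketch with pandas, combining two hypothetical extracts (a CRM CSV file and a web sign-up JSON file) into one consistent structure.

```python
import pandas as pd

# Hypothetical extracts from two different source systems.
crm = pd.read_csv("crm_customers.csv")    # columns: cust_id, full_name, country
web = pd.read_json("web_signups.json")    # columns: id, name, country

# Align both sources to a single consistent schema.
crm = crm.rename(columns={"cust_id": "customer_id", "full_name": "name"})
web = web.rename(columns={"id": "customer_id"})

# Combine and de-duplicate, then load into an analytical store format.
customers = pd.concat([crm, web], ignore_index=True).drop_duplicates("customer_id")
customers.to_parquet("customers_consolidated.parquet", index=False)
```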
Operational data
transactional data that is generated and stored by applications, often in a relational or non-relational database.
Analytical data
data that has been optimized for analysis and reporting, often in a data warehouse.
Streaming data
perpetual sources of data that generate data values in real-time, often relating to specific events.
Data pipelines
- are used to orchestrate activities that transfer and transform data.
- Pipelines are the primary way in which data engineers implement repeatable extract, transform, and load (ETL) solutions that can be triggered based on a schedule or in response to events (see the conceptual sketch below).
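Conceptual sketch only (plain Python, not the Azure Data Factory or Azure Synapse pipeline API): a pipeline as an ordered set of dependent activities that a trigger runs as a unit.

```python
from datetime import datetime

def extract():
    print("copy raw files from the source systems")

def transform():
    print("clean and reshape the data for analysis")

def load():
    print("write the prepared data to the analytical store")

# A pipeline is an ordered set of dependent activities.
ACTIVITIES = [extract, transform, load]

def run_pipeline():
    """Run each activity in order; a real orchestrator also adds
    parameters, monitoring, and retries."""
    print(f"pipeline run started at {datetime.now().isoformat()}")
    for activity in ACTIVITIES:
        activity()

# In Azure, a schedule- or event-based trigger would start each run;
# here a run is started manually.
run_pipeline()
```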
Data lakes
- a storage repository that holds large amounts of data in native, raw formats.
- optimized for scaling to massive volumes (terabytes or petabytes) of data.
- data typically comes from multiple heterogeneous sources, and may be structured, semi-structured, or unstructured.
GOAL: store everything in its original, untransformed state.
Data warehouse
- a centralized repository of integrated data from one or more disparate sources.
- stores current and historical data in relational tables that are organized into a schema that optimizes performance for analytical queries.
Apache Spark
a parallel processing framework that takes advantage of in-memory processing and distributed file storage (see the PySpark sketch below).
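Example: a minimal PySpark sketch of distributed, in-memory processing; the file path and column names (region, amount) are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-intro").getOrCreate()

# Read a hypothetical CSV file from distributed storage into a DataFrame.
sales = spark.read.option("header", True).csv("/data/sales.csv")

# cache() keeps the parsed data in executor memory for reuse.
sales = sales.withColumn("amount", F.col("amount").cast("double")).cache()

# The aggregation runs in parallel across the cluster's executors.
sales.groupBy("region").agg(F.sum("amount").alias("total_sales")).show()

spark.stop()
```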
Core Azure technologies used to implement data engineering workloads include:
- Azure Synapse Analytics
- Azure Data Lake Storage Gen2
- Azure Stream Analytics
- Azure Data Factory
- Azure Databricks
- Azure Event Hubs
Data Lake pt2
- provides file-based storage, usually in a distributed file system that supports high scalability for massive volumes of data.
- can store structured, semi-structured, and unstructured files in the data lake and then consume them from there in big data processing technologies, such as Apache Spark.
Azure Data Lake Storage
- a comprehensive, massively scalable, secure, and cost-effective data lake solution for high performance analytics built into Azure.
- combines a file system with a storage platform to help you quickly identify insights into your data.
- combines analytics performance with the tiering and data lifecycle management capabilities of Blob storage and the high-availability, security, and durability capabilities of Azure Storage.
- can be used as the basis for both real-time and batch solutions.
Benefits
- Hadoop compatible access
- Security
- Performance
- Data redundancy
Hadoop compatible access
- data can be treated and accessed just as if it were stored in a Hadoop Distributed File System (HDFS)
- can store the data in one place and access it through compute technologies including Azure Databricks, Azure HDInsight, and Azure Synapse Analytics without moving the data between environments.
- can use storage formats such as Parquet, which is highly compressed and performs well across multiple platforms using an internal columnar storage format (see the sketch below).
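Example: a hedged PySpark sketch of reading files straight from Data Lake Storage Gen2 over its HDFS-compatible (abfss) endpoint and writing Parquet back; the account, container, and paths are placeholders, and authentication setup is omitted.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adls-access").getOrCreate()

# Placeholder container/account; abfss:// is the Data Lake Storage Gen2
# HDFS-compatible endpoint (authentication configuration not shown).
base = "abfss://mycontainer@mystorageaccount.dfs.core.windows.net"

# Read raw CSV files directly from the data lake, as if from HDFS.
orders = spark.read.option("header", True).csv(f"{base}/raw/orders/")

# Write back as Parquet: compressed, columnar, and usable without moving the
# data between Azure Databricks, Azure HDInsight, and Azure Synapse Analytics.
orders.write.mode("overwrite").parquet(f"{base}/curated/orders/")
```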
Security
- supports access control lists (ACLs)
- Portable Operating System Interface (POSIX) permissions
- can set permissions at a directory level or file level for the data stored within the data lake (see the ACL sketch below)
- encrypted at rest
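Example: a hedged sketch of setting a POSIX-style ACL on a directory, assuming the azure-storage-file-datalake and azure-identity packages; the account, container, directory, and credential are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder account URL; DefaultAzureCredential resolves a credential from
# the environment (for example a local login or a managed identity).
service = DataLakeServiceClient(
    account_url="https://mystorageaccount.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# Directory-level permissions: owner rwx, owning group r-x, others no access.
directory = (
    service.get_file_system_client("mycontainer")
           .get_directory_client("raw/sales")
)
directory.set_access_control(acl="user::rwx,group::r-x,other::---")
```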
Performance
organizes the stored data into a hierarchy of directories and subdirectories, much like a file system, for easier navigation.
Data redundancy
takes advantage of the Azure Blob replication models that provide data redundancy in a single data center with locally redundant storage (LRS), or to a secondary region by using the Geo-redundant storage (GRS) option.