Data Engineering Flashcards

(132 cards)

1
Q

What is Data Engineering?

A

Data engineering is about designing, building, and managing systems that handle data efficiently. It involves collecting, storing, processing, and transforming raw data into meaningful information that businesses can use.

2
Q

What is the primary focus of Data Engineering?

A

Building data pipelines and managing data infrastructure.

3
Q

What does a Data Analyst do?

A

Analyzes and interprets data to extract insights for business decisions

4
Q

What is the role of Data Scientists?

A

Analyze data to derive insights

5
Q

How do Data Engineers support Data Scientists?

A

Ensure data is clean, reliable, and available

6
Q

What does a DBA (Database Administrator) do?

A

Manages and optimizes databases

7
Q

What are examples of databases used in data engineering?

A
  • PostgreSQL
  • MySQL
  • SQL Server
8
Q

What are examples of Data Warehouses?

A
  • Snowflake
  • Redshift
  • BigQuery
9
Q

What are examples of Data Lakes?

A
  • Azure Data Lake
  • AWS S3
  • HDFS
10
Q

What are examples of NoSQL databases?

A
  • MongoDB
  • Cassandra
11
Q

How do data engineers use cloud platforms?

A

To build scalable data pipelines and warehouses

12
Q

Difference between Batch and Streaming Data Processing?

A
  • Batch Processing: Processes large amounts of data at scheduled intervals (e.g., ADF, Apache Spark)
  • Streaming Processing: Processes data in real time as it arrives (e.g., Kafka, Azure Stream Analytics)
    Scenario: Use batch processing for end-of-day sales reports and stream processing for real-time fraud detection.
13
Q

What is the Ingestion Layer in data architecture?

A

Collects data from multiple sources (e.g., APIs, databases, logs)

14
Q

What is the Storage Layer in data architecture?

A

Stores raw, processed, and structured data (e.g., Data Lake, Data Warehouse).

15
Q

What is the Processing Layer in data architecture?

A

Transforms and enriches data (e.g., Spark, Databricks, ADF).

16
Q

What is the Serving Layer in data architecture?

A

Exposes data for analysis (e.g., Power BI, Tableau)

17
Q

What is a Slowly Changing Dimension (SCD)?

A

A method to handle historical changes in dimension tables

18
Q

What are the types of Slowly Changing Dimensions?

A
  • SCD Type 1: Overwrites old data (e.g., updating a customer address)
  • SCD Type 2: Keeps history by adding new records (see the sketch below)
  • SCD Type 3: Stores previous and current values in separate columns
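A minimal SCD Type 2 sketch in T-SQL (the dim.DimCustomer table and its IsCurrent/EffectiveDate/EndDate tracking columns are hypothetical):

DECLARE @CustomerKey INT = 42,
        @NewAddress  VARCHAR(200) = '12 New Street';   -- hypothetical incoming values

IF EXISTS (SELECT 1 FROM dim.DimCustomer
           WHERE CustomerKey = @CustomerKey AND IsCurrent = 1 AND Address <> @NewAddress)
BEGIN
    -- Expire the current version instead of overwriting it (Type 1 would overwrite).
    UPDATE dim.DimCustomer
    SET IsCurrent = 0, EndDate = GETDATE()
    WHERE CustomerKey = @CustomerKey AND IsCurrent = 1;

    -- Insert the new version as the current row, preserving history.
    INSERT INTO dim.DimCustomer (CustomerKey, Address, EffectiveDate, EndDate, IsCurrent)
    VALUES (@CustomerKey, @NewAddress, GETDATE(), NULL, 1);
END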
19
Q

What is Data Partitioning?

A

Splits large datasets into smaller chunks for faster queries and processing

20
Q

What are the types of Data Partitioning?

A
  • Horizontal Partitioning: Divides data by row (e.g., by date, region)
  • Vertical Partitioning: Stores specific columns separately
    Scenario: Partitioning sales data by year (year=2023, year=2024) speeds up queries for a specific year; see the query sketch below.
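A partition-pruned query sketch for a Synapse serverless SQL pool, assuming a hypothetical raw/sales/year=YYYY/ folder layout in the azdelab storage account used elsewhere in this deck; filepath(1) returns the value matched by the first wildcard, so the engine can skip the other year folders:

SELECT r.filepath(1) AS sale_year, COUNT(*) AS row_count
FROM OPENROWSET(
    BULK 'https://azdelab.dfs.core.windows.net/raw/sales/year=*/*.parquet',
    FORMAT = 'PARQUET'
) AS r
WHERE r.filepath(1) = '2024'     -- partition filter: only the year=2024 folder is read
GROUP BY r.filepath(1);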
21
Q

What are the benefits of Data Partitioning?

A

Improves query performance, reduces scan time, and enhances parallel processing

22
Q

Difference between OLTP and OLAP?

A
  • OLTP (Online Transaction Processing): Used for real-time transactions (e.g., banking, e-commerce). Handles transactional data (e.g., inserting, updating records). Used for day-to-day operations.
  • OLAP (Online Analytical Processing): Used for analytics and reporting (e.g., data warehouses). Handles analytical queries on historical data. Used for business intelligence.
  • OLTP schemas are normalized, while OLAP uses denormalized schemas for fast queries.
23
Q

What is Change Data Capture (CDC)?

A

Captures changes in a database and propagates them to a target system.
Used in real-time data replication and incremental ETL
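A sketch using SQL Server's built-in CDC feature (the dbo.Orders table is hypothetical):

-- Enable CDC at the database level, then for a specific table.
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Orders',   -- hypothetical source table
    @role_name     = NULL;        -- no gating role in this sketch

-- Changed rows can then be read from the generated change function, e.g.:
-- SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');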

24
Q

What is the purpose of a schema in databases?

A

A schema defines the structure of data (e.g., tables, columns, data types). It ensures data consistency and helps in querying and analysis.

25
What is SQL?
SQL (Structured Query Language) is used to query and manipulate relational databases. Data engineers use SQL to extract, transform, and analyze data.
26
Difference between structured, unstructured, and semi-structured data?
* Structured Data: Organized in a fixed format (schema), like tables with rows and columns (e.g., SQL tables).
* Unstructured Data: No predefined structure; data that doesn't follow a specific format, making it harder to organize and analyze (e.g., images, videos, text).
* Semi-structured Data: Has some structure but doesn't fit neatly into tables; often includes tags or markers to separate data elements (e.g., XML, JSON, NoSQL databases).
27
What is a NoSQL database?
NoSQL databases store unstructured or semi-structured data and are highly scalable. Examples include MongoDB and Cassandra. Use them for flexible schemas and high-speed data access.
28
What is version control?
Version control tracks changes to code and scripts. It ensures collaboration, reproducibility, and rollback in case of errors.
29
What is the role of APIs in data engineering?
APIs allow data engineers to extract data from external systems (e.g., social media platforms, payment gateways).
30
Difference between primary key and foreign key?
* Primary Key: Uniquely identifies a record in a table.
* Foreign Key: Links two tables by referencing the primary key of another table (see the sketch below).
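A minimal sketch with hypothetical Customers and Orders tables:

-- Customers.customer_id is the primary key; Orders.customer_id is a foreign key referencing it.
CREATE TABLE Customers (
    customer_id   INT PRIMARY KEY,
    customer_name VARCHAR(100) NOT NULL
);

CREATE TABLE Orders (
    order_id    INT PRIMARY KEY,
    customer_id INT NOT NULL,
    order_date  DATE,
    CONSTRAINT FK_Orders_Customers FOREIGN KEY (customer_id)
        REFERENCES Customers (customer_id)
);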
31
What is Data Normalization?
Normalization organizes data to reduce redundancy and improve integrity. It involves splitting tables and defining relationships.
32
What is Data Replication?
Replication is copying data to multiple locations to ensure availability and fault tolerance. It's important for disaster recovery and high availability.
33
What is a Data Model?
Defines how data is structured and related. To design one:
* Identify entities (e.g., customers, orders).
* Define relationships (e.g., one customer can place many orders).
* Normalize data to reduce redundancy.
34
What is a Star Schema?
* A star schema is a data warehouse design with a central fact table connected to dimension tables.
* Dimension tables are linked to the fact table using foreign keys.
* The layout looks like a star, with the fact table at the center.
* Simple and easy to understand.
* Used for fast query performance in analytical workloads (see the example query below).
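An example query against a hypothetical star schema (FactSales joined to DimDate and DimProduct on their surrogate keys):

SELECT d.calendar_year,
       p.product_category,
       SUM(f.sale_amount) AS total_sales
FROM FactSales f
JOIN DimDate    d ON f.date_key    = d.date_key     -- dimension joins via foreign keys
JOIN DimProduct p ON f.product_key = p.product_key
GROUP BY d.calendar_year, p.product_category;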
35
What is a Snowflake Schema?
* Similar to a star schema but more detailed: dimension tables are further broken down into sub-dimension tables.
* Data is normalized, reducing redundancy.
* The layout looks like a snowflake due to multiple table layers.
* More complex and scalable.
* Ideal for larger datasets requiring detailed analysis.
36
What is Data Governance?
Managing data quality, security, and compliance. It’s important to ensure data is accurate, secure, and used responsibly
37
What is a Fact Table?
* The central table in the schema.
* Contains the data you want to analyze, like sales numbers or transaction amounts.
* Each row represents a specific event or transaction.
38
What are Dimension Tables?
* Provide context to the data in the fact table.
* Contain descriptive information or attributes.
* Each row in a dimension table represents a unique value or category.
39
What are Parquet and ORC files?
Columnar storage formats optimized for big data processing. They reduce storage costs and improve query performance
40
What is Data Deduplication?
Removing duplicate records from a dataset. It’s achieved by identifying unique keys or using tools like Apache Spark
41
Difference between horizontal and vertical scaling?
* Horizontal Scaling: Adding more machines to a system.
* Vertical Scaling: Adding more resources (e.g., CPU, RAM) to a single machine.
42
Difference between Full Load and Incremental Load?
* Full Load: Replaces the entire dataset in the target system.
* Incremental Load: Updates only new or changed data (see the watermark sketch below).
**Scenario:** Use a full load for initial data migration and incremental loads for daily updates.
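A watermark-based incremental load sketch (the staging, dw, and etl tables are hypothetical):

-- Read the last load time recorded for this table.
DECLARE @LastWatermark DATETIME2 =
    (SELECT MAX(last_loaded_at) FROM etl.WatermarkTable WHERE table_name = 'Sales');

-- Load only rows changed since the previous run.
INSERT INTO dw.Sales (sale_id, amount, modified_at)
SELECT s.sale_id, s.amount, s.modified_at
FROM   staging.Sales s
WHERE  s.modified_at > @LastWatermark;

-- Advance the watermark for the next run.
UPDATE etl.WatermarkTable
SET    last_loaded_at = SYSDATETIME()
WHERE  table_name = 'Sales';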
43
What is Big Data?
* Refers to extremely large datasets that come from various sources and require advanced tools to store, process, and analyze effectively.
* Includes structured, semi-structured, and unstructured data that can range from terabytes to petabytes or even exabytes.
44
Key Characteristics of Big Data
* Volume: The massive amount of data generated every second from multiple sources.
* Variety: Data comes in different formats like text, images, videos, audio, and sensor data.
* Velocity: Data is generated at high speed, requiring quick processing.
* Veracity: The accuracy and reliability of the data.
* Value: Extracting meaningful insights from data creates business value.
45
What does ACID stand for?
* Atomicity
* Consistency
* Isolation
* Durability
These properties ensure database transactions are reliable and consistent.
46
Atomicity
A transaction is a single unit of work, meaning it must either be completed fully or not at all. For example, if you’re transferring money between two bank accounts, both the debit and credit actions must happen together. If one fails, both should fail to keep the system consistent
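A T-SQL sketch of the bank-transfer example (the Accounts table is hypothetical); both updates commit together or roll back together:

BEGIN TRY
    BEGIN TRANSACTION;

    UPDATE Accounts SET balance = balance - 100 WHERE account_id = 1;  -- debit
    UPDATE Accounts SET balance = balance + 100 WHERE account_id = 2;  -- credit

    COMMIT TRANSACTION;    -- both changes become permanent together
END TRY
BEGIN CATCH
    ROLLBACK TRANSACTION;  -- any failure undoes both updates
END CATCH;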
47
Consistency
A transaction should bring the database from one valid state to another. The database should always follow its rules and constraints. For example, if a rule says a person’s age must be over 18, a transaction trying to add someone younger than 18 should be stopped.
48
Isolation
Transactions should not interfere with each other. Each transaction must happen independently, so that no other transaction affects it. For example, if two people try to withdraw money from the same account at the same time, the system must run the transactions in isolation so they don't conflict and leave the balance inconsistent.
49
Durability
Once a transaction is completed, it must be permanent, even if there’s a system failure or crash. The changes made by the transaction should be saved and not lost.
50
What is the role of a data engineer in data security?
Ensures data is secure by implementing encryption, access controls, and monitoring systems.
51
How to handle duplicate records in a dataset?
Use SQL DISTINCT or ROW_NUMBER() to remove duplicates
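A ROW_NUMBER() deduplication sketch over a hypothetical dbo.Customers table, keeping the newest row per email:

WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY customer_email
                              ORDER BY updated_at DESC) AS rn
    FROM dbo.Customers
)
DELETE FROM ranked
WHERE rn > 1;   -- every row after the first per key is a duplicate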
52
How to optimize a slow SQL query?
* Use indexes to speed up searches.
* Choose the right join and select only necessary columns when joining.
* Use partitioning for large tables.
* Avoid SELECT * (fetch only required columns).
* Check the query execution plan for bottlenecks.
53
Steps to debug a job that has slowed down?
* Check data volume
* Review query performance
* Check resource allocation
* Optimize partitions and indexing
* Monitor network
54
What is the purpose of using indexes in SQL queries?
Indexes improve the speed of data retrieval operations on a database table; they allow the database to find rows quickly without scanning the entire table.
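A SQL Server-style sketch (table and columns hypothetical):

-- Nonclustered index to support lookups by customer and date.
CREATE NONCLUSTERED INDEX IX_Orders_Customer_Date
    ON dbo.Orders (customer_id, order_date);

-- The optimizer can now seek the index instead of scanning the whole table.
SELECT order_date, total_amount
FROM dbo.Orders
WHERE customer_id = 42;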
55
What should you check if a job's execution time has increased significantly?
1. Check data volume
2. Review query performance
3. Check resource allocation
4. Optimize partitions and indexing
5. Monitor network & I/O bottlenecks
Assessing these factors helps identify the cause of performance issues.
56
What is a recommended approach for loading 1 billion records into a database?
* Use bulk inserts instead of row-by-row inserts.
* Use partitioning and indexing to speed up writes.
* Disable indexes and constraints before the load and re-enable them after.
* Use parallel processing (e.g., Apache Spark, ADF).
* Store data in compressed formats like Parquet for efficiency.
57
How can you ensure data quality in a data pipeline?
1. Validation checks
2. Monitoring and logging
3. Unit testing
4. Data profiling
These steps help maintain data integrity throughout the pipeline.
58
What are common failures in a data pipeline?
1. Network failures
2. Schema changes
3. Data skew
4. Job failures
Each type of failure requires specific handling strategies.
59
What is a data warehouse?
A data warehouse is a centralized repository for structured data, optimized for querying and analysis. It provides a unified view of data for decision-making.
60
What are best practices for designing a Data Warehouse?
1. Use a Star Schema or Snowflake Schema
2. Optimize storage using columnar formats
3. Implement partitioning and clustering
4. Design for scalability
These practices enhance efficiency and performance.
61
What is the difference between a Data Lake and a Data Warehouse?
* Data Lake: Stores raw data (unstructured/semi-structured) without a predefined schema, allowing data in any format.
* Data Warehouse: Stores structured data organized into tables with a predefined schema, optimized for analytical queries.
* Data Lakehouse (e.g., Delta Lake, Snowflake): Combines the features of a data lake and a data warehouse, enabling both storage and analytics.
62
What is the role of Apache Spark in data engineering?
Apache Spark is a distributed computing framework used for processing large datasets. It is faster than Hadoop MapReduce and supports both batch and real-time processing.
63
What does Docker do in data engineering?
Docker is used to containerize applications, ensuring consistency across environments. Data engineers use it to deploy ETL pipelines and data processing tools.
64
How do you handle late-arriving data in a streaming pipeline?
1. Use watermarking
2. Store late data in a separate table
3. Use event-time processing
These strategies help manage data that arrives after the expected time.
65
What is the purpose of using encryption for sensitive data?
Encryption protects sensitive data from unauthorized access by encoding it. AES-256 is a common encryption standard.
66
What is the function of Apache Kafka in data engineering?
Apache Kafka is a distributed streaming platform used for building real-time data pipelines. It enables high-throughput, fault-tolerant messaging between systems.
67
Azure Synapse Analytics
* A cloud-based analytics service from Microsoft.
* Combines big data analytics and data warehousing into a single platform.
* Lets you analyze large amounts of both structured and unstructured data quickly and easily.
* Can handle very large datasets and run complex queries on both structured and unstructured data.
* Fast data processing and analytics with powerful querying capabilities.
* Supports both SQL and Spark.
* Scales up or down based on your workloads, ensuring flexibility and cost efficiency.
* Built-in integration with machine learning, data lakes, and other services.
Data processing here means collecting, storing, transforming, analyzing, and visualizing data.
68
Azure SQL Data Warehouse vs Azure Synapse
* Synapse
  - The evolution of Azure SQL Data Warehouse.
  - Key difference: combines big data analytics and data warehousing into a single platform.
  - Offers more, including big data processing for unstructured data and built-in integration with machine learning, data lakes, and other services.
* SQL Data Warehouse
  - Mainly focused on data warehousing and analytics for structured data.
69
Azure SQL Database vs Azure Synapse
* Synapse
  - Designed for big data analytics and data warehousing.
  - Can handle very large datasets.
  - Supports both SQL and Spark.
* SQL DB
  - A traditional relational database.
  - For smaller-scale applications.
  - Not optimized for big data.
70
SQL Pool ( Data Warehousing )
* Used for large-scale data warehousing.
* Used for structured data processing with traditional relational SQL queries.
* Can run complex queries on large datasets efficiently.
* Ideal for OLAP workloads, analytics, and data warehousing.
* Does not support unstructured data.
* Two types: dedicated SQL pools for high performance, and serverless SQL pools for ad-hoc queries without needing to set up or manage infrastructure.
71
Serverless SQL pools
* Can run SQL queries on data stored in Azure without needing to set up or manage infrastructure.
* Pay-per-query service; you only pay for the data you query.
* Best for on-demand, ad-hoc queries and small to medium workloads.
* Differences from dedicated pools:
  - No need to provision compute resources.
  - On-demand service where you pay per query.
  - More cost-effective for ad-hoc queries and smaller datasets.
72
Dedicated SQL pools
* A distributed database system that uses multiple nodes to store and process data.
* Used for performing complex queries on large datasets quickly.
* Optimized for large-scale data processing and analytics.
* Differences from serverless pools:
  - Provisioned resources for large-scale data warehousing.
  - You pay for a fixed amount of compute and storage.
  - Best for predictable, high-performance workloads.
  - Used for running SQL queries on structured data.
  - Optimized for analytics and data warehousing.
73
Spark Pool
* Used for big data processing with Apache Spark.
* Runs structured and unstructured data processing, big data analytics, and machine learning tasks on large datasets.
* Provides a scalable environment for running Spark jobs (the environment can scale up or down based on the size and complexity of the data being processed).
* Supports both structured and unstructured data.
* Uses languages like Python, Scala, and SQL.
* Ideal for unstructured data and machine learning.
74
Apache Spark
* An open-source framework used for big data processing.
* Helps run complex analytics, machine learning, and data transformation tasks on large datasets that are not easily processed by traditional relational databases.
75
Data Lake
* Used to store large amounts of structured, semi-structured, or unstructured data.
* Integrates with Azure Synapse, allowing you to store raw data in a scalable and secure environment.
* Data can later be processed and analyzed using SQL or Spark pools.
* Integration:
  - You can use PolyBase or serverless SQL pools to directly query data in the Data Lake without loading it into a dedicated SQL pool.
  - You can move data from the Data Lake into SQL pools using Azure Synapse pipelines or Azure Data Factory.
76
PolyBase
Allows querying external data sources without physically moving the data.
77
External Datasource
A connection that allows Synapse to access data stored outside of Synapse, such as Azure Blob Storage or Azure Data Lake.
78
External file format
Describes the structure of external data (e.g., CSV, Parquet) so Synapse knows how to read and interpret it.
79
External Table
* A table that points to data stored outside of Synapse.
* Allows you to query that data without moving it into Synapse (see the sketch below).
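A sketch of an external table over Parquet files, reusing the data source and file format names created in the credential examples later in this deck (the folder path and columns are assumptions):

CREATE EXTERNAL TABLE dbo.ext_green_tripdata
(
    VendorID      INT,
    trip_distance FLOAT,
    total_amount  FLOAT
)
WITH (
    LOCATION    = '/raw/green_tripdata/',   -- hypothetical folder in the data lake
    DATA_SOURCE = ext_ds_adls_azdelab,
    FILE_FORMAT = ext_ff_parquet
);

-- Query it like a normal table; the data stays in the data lake.
SELECT TOP 10 * FROM dbo.ext_green_tripdata;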
80
Materialized View
* Precomputed views that store the result of a query physically in the database.
* Improve performance by avoiding repetitive computation; querying one is faster than running the original query each time.
* Periodically refreshed to keep the data up to date (see the sketch below).
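A dedicated SQL pool sketch (dbo.FactSales and its columns are hypothetical):

-- Precompute an aggregate so repeated queries avoid rescanning the fact table.
CREATE MATERIALIZED VIEW dbo.mv_sales_by_region
WITH (DISTRIBUTION = HASH(region))
AS
SELECT region,
       COUNT_BIG(*)     AS order_count,
       SUM(sale_amount) AS total_sales
FROM dbo.FactSales
GROUP BY region;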
81
Regular View
* Like a saved query.
* Doesn't store data.
* Retrieves the latest data every time you query it.
82
Lake Database
* Used to manage and query large amounts of unstructured or semi-structured data in a data lake.
* Ideal for big data analytics and exploration.
83
Normal SQL query
* Runs on a single node.
* Used for smaller, less complex datasets.
84
Distributed SQL query
* Queries are split across multiple nodes in Synapse SQL pools.
* Faster for large datasets because query execution is parallelized across different resources.
85
Clustered Columnstore indexes
* Stores data in columns instead of rows.
* Ideal for large-scale analytics.
* Compresses the data to save space.
* Improves performance for queries that need to scan a lot of data.
* Efficient for large data analysis (column-based storage).
* When you define a Clustered Columnstore Index (CCI) on a table, every column is stored in a columnar format (compressed and optimized for fast scanning); see the sketch below.
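A sketch in the same style as the partition example later in this deck (table and columns hypothetical):

CREATE TABLE dbo.FactPageViews
(
    view_date  DATE,
    page_id    INT,
    view_count INT
)
WITH
(
    DISTRIBUTION = HASH(page_id),
    CLUSTERED COLUMNSTORE INDEX   -- every column stored in compressed columnar segments
);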
86
Heap Table
* A table without any index.
* Data is stored in no particular order; when you query it, the system has to scan the entire table to find what you need.
* Slower for large tables, but might be fine for smaller, less frequently accessed tables.
87
Clustered index
* Stores and sorts the data in the table in a specific order.
* There is only one clustered index per table because it rearranges the data physically.
* Great for quickly retrieving rows based on a column, like a primary key.
88
Non-Clustered index
* Like a lookup table that points to the data.
* Speeds up queries without changing the data order.
* A table can have multiple nonclustered indexes.
89
protocols
HTTPS and ABFSS are two different protocols for accessing data in Azure Data Lake Storage.
90
HTTPS protocols (Hypertext Transfer Protocol Secure)
* The standard protocol used for general secure web access and for securing data transfer over the internet.
* When you use HTTPS with Azure Data Lake, you access data over the web securely using URLs like `https://<storage-account>.dfs.core.windows.net`.
91
ABFSS (Azure Blob File System Secure)
* A specialized protocol designed for Azure Data Lake Storage Gen2.
* An optimized version of the HTTPS protocol for accessing large datasets, providing better integration with Azure analytics tools.
* ABFSS URLs look like `abfss://<container>@<storage-account>.dfs.core.windows.net`.
* Offers better performance and functionality for big data analytics and is preferred for data lake-related tasks.
92
parser version
When working with CSV file formats, the parser version selects the method used to parse data during load operations.
* Parser Version 1.0: Less flexible, basic functionality; can be slower or less efficient with large or complex files.
* Parser Version 2.0: Better performance; handles larger files more efficiently, especially for complex data or large datasets. Supports better error handling, column data type detection, and enhanced parsing options.
93
Collation
* Determines how text data is sorted and compared.
* Case sensitivity: whether uppercase and lowercase letters are treated the same or differently.
* Accent sensitivity: whether accented characters (like é or è) are treated as different from non-accented ones (like e).
* Sorting: how data is ordered, such as alphabetically or numerically.
* Consistency: ensures that comparisons and sorting happen the way you expect.
* Performance: the right collation can improve query performance when working with large datasets.
94
Distribution
Determines how data is stored across compute nodes for parallel processing and performance optimization.
95
Hash Distributed
* Data is "evenly distributed" across nodes based on a selected "hash key" (column). * best for "Large fact tables" in "star schema" (e.g., Sales Transactions). * Improves "query performance" by minimizing data movement.
96
Round Robin Distributed
* Data is "randomly" distributed across nodes "without a specific key". * best for "Staging tables" or tables with "no clear distribution column". * Simple and ensures even distribution but may cause "data movement during joins"
97
Replicated
* A "full copy" of the table is stored on "each compute node". * best for "Small dimension tables" used in "joins" (e.g., Product, Country). * "Eliminates data movement", improving join performance. * Not recommended for "large tables" due to storage overhead.
98
Partition
* In Azure Synapse, partitioning improves query performance by dividing large tables into smaller parts.
Benefits:
* Faster queries by scanning only relevant partitions.
* Improved data load performance.
* Better maintenance and indexing.

CREATE TABLE SalesData
(
    SalesID INT IDENTITY(1,1),
    ProductName VARCHAR(100),
    SaleAmount DECIMAL(10,2),
    SaleDate DATE
)
WITH
(
    DISTRIBUTION = HASH(SalesID),   -- choose an appropriate distribution
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (SaleDate RANGE RIGHT FOR VALUES ('2023-01-01', '2023-04-01', '2023-07-01', '2023-10-01'))
);
99
OPENROWSET
* Allows querying external data directly from files in Azure Data Lake Storage (ADLS) or Blob Storage without needing to load the data into a table.
Common use cases:
🔹 Ad-hoc analysis of raw files
🔹 Quick data exploration without creating tables
Limitations:
⚠ Only works in serverless SQL pools (not dedicated SQL pools)
⚠ No DML operations (INSERT/UPDATE/DELETE)
100
Control Node
🔹 **Control Node**
- Acts as the **brain** of the Synapse Dedicated SQL Pool.
- Manages query execution and optimization.
- Distributes work to **compute nodes**.
- Handles metadata, session control, and coordination.
**Analogy:**
- **Control Node = Manager** (assigns tasks).
- **Compute Nodes = Workers** (do the actual work).
101
Compute Node
🔹 **Compute Node**
- Executes the actual **query workload**.
- Stores and processes data using **distributed storage**.
- Works in parallel for **faster query performance**.
- The number of compute nodes depends on the **DWU (Data Warehouse Unit) scale**.
102
How to load data
* "Copy Command" - a simple way to copy data from sources like Azure Blob Storage * "PolyBase" - is a tool to load data from external sources like Hadoop or Azure Data Lake * "Azure Data Factory" - to move and transform data
103
handle security / how to secure
* Azure AD: for identity and authentication (secure login).
* Role-based Access Control (RBAC): to manage user access and grant permissions.
* Firewall rules: to restrict access to the Synapse workspace and to specific IP addresses.
104
handle big data
* Apache Spark: for big data processing; allows you to analyze unstructured data.
* Data Lake: integrates with Azure Data Lake, which can store massive amounts of unstructured data.
* SQL pool: for fast querying of structured data.
105
handle ETL processes
* Use a Synapse pipeline:
* Extract data from various sources.
* Transform data using built-in data flows or by running scripts (SQL, Spark).
* Load the transformed data into Azure Synapse or other data storage services.
106
handle schema changes
You can use standard ALTER TABLE commands; see the sketch below.
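A few hypothetical examples (table and column names assumed; dedicated SQL pools support a subset of ALTER TABLE options):

ALTER TABLE dbo.Customers ADD loyalty_tier VARCHAR(20) NULL;        -- add a new column
ALTER TABLE dbo.Customers ALTER COLUMN customer_name VARCHAR(200);  -- widen an existing column
ALTER TABLE dbo.Customers DROP COLUMN legacy_code;                  -- remove an obsolete column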
107
how to monitor performance
* Azure Synapse Studio: use the built-in monitoring features, such as the Monitor hub, to view pipeline, Spark, SQL pool (dedicated and serverless), and activity run metrics.
108
how to optimize performance
* use "Clustered ColumnStore Index" - to store data in column-wise format - is faster for analytics * distribute tables properly - use hash distribution for large table to evenly spread data across nodes
109
how to optimize costs
* use "serverless sql pools" for ad-hoc queries to avoid provision costs * paused "dedicated sql pools" when not in use to save costs * use "data compression" to reduce storage costs
110
Scale
* Synapse scales compute and storage independently.
* Compute scaling:
  - In dedicated SQL pools, you can scale compute resources (DWUs) up or down based on your workloads.
  - For Spark pools, you can adjust the number and type of nodes.
* Storage scaling: scalable by default.
* Best practices:
  - Scale compute resources up during peak times and down during idle times to save costs.
  - Use serverless SQL pools for smaller ad-hoc queries to avoid the cost of dedicated resources.
111
SELECT data from external csv file
SELECT Customersgender, COUNT(customer_id)
FROM OPENROWSET(
    BULK 'https://azdelab.dfs.core.windows.net/raw/Customers.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS [result]
GROUP BY Customersgender;
112
Copy into command
COPY INTO dbo.green_tripdata
(
    VendorID 1, store_and_fwd_flag 4, RatecodeID 5, PULocationID 6, DOLocationID 7,
    passenger_count 8, trip_distance 9, fare_amount 10, mta_tax 12, tip_amount 13,
    tolls_amount 14, ehail_fee 15, improvement_surcharge 16, total_amount 17,
    payment_type 18, trip_type 19, congestion_surcharge 20
)
FROM 'https://azdelab.dfs.core.windows.net/raw/green_tripdata_2019-12.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    MAXERRORS = 0,
    IDENTITY_INSERT = 'OFF'
    --,AUTO_CREATE_TABLE = 'ON'
)
113
Create External File Format, Data Source, Credential with Shared Access Signature
-- Create serverless database
CREATE DATABASE serverless_db_demo

-- Create external data source
USE serverless_db_demo

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Welcome@123'

CREATE DATABASE SCOPED CREDENTIAL adls_azdelab_credential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = 'sv=2022-11-02&ss=bfqt&srt=sco&sp=rwdlacupyx&se=2025-04-01T18:51:11Z&st=2025-03-01T10:51:11Z&spr=https&sig=4BoyfWPZ2%2FBauY9MbfoUEIJPSAySoof2bDipQz9O7R4%3D'
GO

CREATE EXTERNAL DATA SOURCE ext_ds_adls_azdelab
WITH (
    LOCATION = 'https://azdelab.dfs.core.windows.net/',
    CREDENTIAL = adls_azdelab_credential
);

-- Create file formats - supported formats are Parquet and delimited text
CREATE EXTERNAL FILE FORMAT ext_ff_delimited_text
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"')
);

CREATE EXTERNAL FILE FORMAT ext_ff_parquet
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec' --'org.apache.hadoop.io.compress.GzipCodec'
);
114
Create External File Format, Data Source, Credential with Managed Identity
CREATE DATABASE serverless_db_demo_1 COLLATE Latin1_General_100_BIN2_UTF8;

USE serverless_db_demo_1
GO

/* First, grant the 'Storage Blob Data Contributor' role to Synapse in the RBAC of the Storage Account */
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Welcome@123'

CREATE DATABASE SCOPED CREDENTIAL adls_azdelab_credential
WITH IDENTITY = 'Managed Identity'
GO

CREATE EXTERNAL DATA SOURCE ext_ds_adls_azdelab
WITH (
    LOCATION = 'https://azdelab.dfs.core.windows.net/',
    CREDENTIAL = adls_azdelab_credential
);

CREATE EXTERNAL FILE FORMAT ext_ff_parquet
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec' --'org.apache.hadoop.io.compress.GzipCodec'
);
115
Azure Data Factory (ADF)
* Azure Data Factory is a cloud-based data integration (**ETL/ELT**) service.
* Allows you to create, schedule, and orchestrate data pipelines.
* Supports data movement and transformation from various sources to different destinations.
* Supports both cloud and on-premises data sources.
116
Key components of Azure Data Factory
* Pipelines: groups of activities that perform ETL processes.
* Activities: individual tasks within a pipeline (e.g., Copy Activity, Data Flow).
* Datasets: represent data structures, pointing to data sources or destinations.
* Linked Services: connections to data stores and compute environments.
* Triggers: automate pipeline execution based on schedules or events.
117
Pipelines in Azure Data Factory
* A pipeline in ADF is a logical grouping of activities (ETL tasks) that together perform a task.
* A pipeline can contain one or more activities, such as copying data from one source to another or transforming data.
* Pipelines are used to orchestrate data movement, transformation, and load operations from source to destination.
118
'Activities' in Azure Data Factory
* Activities are the individual tasks in a pipeline that perform a specific operation.
* Types of activities in ADF:
  - Data movement activities (e.g., Copy Data)
  - Data transformation activities (e.g., Data Flow, HDInsight, Databricks)
  - Control flow activities (e.g., ForEach, If Condition, Until)
* Examples:
  - Copy Activity: copies data from source to destination.
  - Data Flow Activity: for data transformation.
  - Execute Pipeline Activity: calls another pipeline.
  - Stored Procedure Activity: executes a stored procedure.
119
Datasets in Azure Data Factory
* A dataset represents the structure of data within ADF and points to the source or destination data.
* For example, a dataset can represent a table in a database, a file in Blob Storage, or an Azure SQL Database table.
120
Linked service in Azure Data Factory
* Defines the connection information for data sources or destinations.
* Like a connection string that tells ADF how to connect to data stores, such as Azure SQL Database, Blob Storage, or on-premises databases.
121
Triggers in Azure Data Factory
* Triggers are used to automatically initiate pipeline execution. There are different types:
* Schedule Trigger: executes pipelines on a defined schedule.
* Event-based Trigger: executes pipelines when a specific event occurs, such as a file being uploaded to Blob Storage.
* Manual Trigger: initiated by the user when needed.
Scenario: A pipeline is scheduled to run daily to extract sales data from a SQL database and load it into a Data Lake.
122
Integration Runtimes (IR) in Azure Data Factory
* The compute infrastructure used by ADF to move data between data stores. There are three types:
  - Azure IR: for cloud-to-cloud data transfer.
  - Self-hosted IR: for on-premises data movement or hybrid scenarios.
  - Azure-SSIS IR: for running SSIS packages in the cloud.
123
'Copy Data' activity in Azure Data Factory
* The Copy Data activity is used to copy data from a source to a destination.
* It can handle data movement between on-premises and cloud sources.
Scenario: If you need to copy data from an on-premises SQL Server to Azure Blob Storage, you would configure a Copy Data activity to specify the source and destination.
124
Lookup activity in ADF
* The Lookup activity is used to retrieve a value or a set of values from a data source.
* It's commonly used to retrieve parameters or configurations before performing other activities.
Scenario: Before running a data copy operation, you may use a Lookup activity to check the last successful run time from a configuration table and pass that value as a parameter to filter records in the source system.
125
Mapping Data Flows in ADF
* Mapping Data Flows provide a visual, code-free way to design data transformations (e.g., joins, aggregations, lookups) within ADF, using a Spark-based execution engine
126
Data flow transformations in Azure Data Factory
* ADF's Data Flow provides a no-code, visually designed environment for building transformations.
* Unlike traditional ETL, where transformations happen in separate systems (e.g., SQL or Databricks), Data Flow integrates the entire pipeline from extraction to transformation in a visual, scalable manner.
* Data flows allow complex transformations like:
  - Filter: remove unwanted data.
  - Join: merge data from multiple sources.
  - Aggregate: group data by a column and apply aggregations like SUM, COUNT.
  - Derived Column: add or modify columns.
  - Sort: sort data based on one or more columns.
Scenario: If you want to calculate the total sales per region and filter out regions with sales below a threshold, you could use Aggregate and Filter transformations in Data Flow.
127
Parameter in ADF
* Parameters are used to pass dynamic values to pipelines, datasets, or activities.
* You can define pipeline parameters and pass values at runtime (e.g., `@pipeline().parameters.FileName`).
128
ETL and ELT
* ETL stands for Extract, Transform, Load:
  - Extract: pulling data from various sources (e.g., databases, APIs).
  - Transform: cleaning, filtering, and structuring the data.
  - Load: storing the data in a target system (e.g., data warehouse).
* Difference:
  - ETL (Extract, Transform, Load): data is transformed before loading into the target system (used in traditional data warehouses).
  - ELT (Extract, Load, Transform): data is first loaded and then transformed (used in modern cloud-based architectures like Snowflake, BigQuery).
**Why it's important:** ETL ensures data is consistent, accurate, and ready for analysis.
129
Difference between a Data Pipeline and a Data Flow in ADF
* Pipeline: a logical grouping of activities that perform a unit of work. Pipelines define the orchestration of data movements and transformations.
* Data Flow: a visual design tool for creating data transformation logic. It is part of the pipeline and focuses on data transformation.
Scenario: If you are moving data from SQL Server to Azure Data Lake and want to perform data cleansing, you might use a data flow for the transformation step inside the pipeline.
130
Difference between Copy Activity and Data Flow in ADF
* Copy Activity: used for simple data movement from source to destination with no or minimal transformation.
* Data Flow: used for complex data transformation within the pipeline before moving data to the destination. It is a visual interface for data transformation.
Scenario: If you just need to move data from a flat file in Blob Storage to an Azure SQL Database, you use the Copy Activity. If you need to clean or aggregate the data before loading it, you would use Data Flow.
131
Difference between Azure Data Factory and SSIS
* ADF is cloud-based, whereas SSIS is an on-premises ETL tool.
* ADF is serverless and scales dynamically, while SSIS requires dedicated servers.
* SSIS has rich transformations in SQL Server Data Tools (SSDT), while ADF relies on Mapping Data Flows for complex transformations.
132
Difference between Azure Data Factory (ADF) and Azure Synapse Analytics
ADF is focused on ETL and data integration, while Azure Synapse (formerly Azure SQL Data Warehouse) combines big data and data warehousing, offering on-demand SQL queries, Spark, and data exploration