Data Engineering Flashcards

(132 cards)

1
Q

What is Data Engineering?

A

Data engineering is about designing, building, and managing systems that handle data efficiently. It involves collecting, storing, processing, and transforming raw data into meaningful information that businesses can use.

2
Q

What is the primary focus of Data Engineering?

A

Building data pipelines and managing data infrastructure.

3
Q

What does a Data Analyst do?

A

Analyzes and interprets data to extract insights for business decisions

4
Q

What is the role of Data Scientists?

A

Analyze data to derive insights

5
Q

How do Data Engineers support Data Scientists?

A

Ensure data is clean, reliable, and available

6
Q

What does a DBA (Database Administrator) do?

A

Manages and optimizes databases

7
Q

What are examples of databases used in data engineering?

A
  • PostgreSQL
  • MySQL
  • SQL Server
8
Q

What are examples of Data Warehouses?

A
  • Snowflake
  • Redshift
  • BigQuery
9
Q

What are examples of Data Lakes?

A
  • Azure Data Lake
  • AWS S3
  • HDFS
10
Q

What are examples of NoSQL databases?

A
  • MongoDB
  • Cassandra
11
Q

How do data engineers use cloud platforms?

A

To build scalable data pipelines and warehouses

12
Q

Difference between Batch and Streaming Data Processing?

A
  • Batch Processing: Processes large amounts of data at scheduled intervals (e.g., ADF, Apache Spark)
  • Streaming Processing: Processes data in real time as it arrives (e.g., Kafka, Azure Stream Analytics)
    Scenario: Use batch processing for end-of-day sales reports and stream processing for real-time fraud detection.
13
Q

What is the Ingestion Layer in data architecture?

A

Collects data from multiple sources (e.g., APIs, databases, logs)

14
Q

What is the Storage Layer in data architecture?

A

Stores raw, processed, and structured data (e.g., Data Lake, Data Warehouse).

15
Q

What is the Processing Layer in data architecture?

A

Transforms and enriches data (e.g., Spark, Databricks, ADF).

16
Q

What is the Serving Layer in data architecture?

A

Exposes data for analysis (e.g., Power BI, Tableau)

17
Q

What is a Slowly Changing Dimension (SCD)?

A

A method to handle historical changes in dimension tables

18
Q

What are the types of Slowly Changing Dimensions?

A
  • SCD Type 1: Overwrites old data (e.g., updating a customer address)
  • SCD Type 2: Keeps history by adding new records (see the sketch below)
  • SCD Type 3: Stores previous and current values in separate columns
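A minimal SCD Type 2 sketch in T-SQL (the dim.DimCustomer table and its IsCurrent/EffectiveDate/EndDate tracking columns are hypothetical):

DECLARE @CustomerKey INT = 42,
        @NewAddress  VARCHAR(200) = '12 New Street';   -- hypothetical incoming values

IF EXISTS (SELECT 1 FROM dim.DimCustomer
           WHERE CustomerKey = @CustomerKey AND IsCurrent = 1 AND Address <> @NewAddress)
BEGIN
    -- Expire the current version instead of overwriting it (Type 1 would overwrite).
    UPDATE dim.DimCustomer
    SET IsCurrent = 0, EndDate = GETDATE()
    WHERE CustomerKey = @CustomerKey AND IsCurrent = 1;

    -- Insert the new version as the current row, preserving history.
    INSERT INTO dim.DimCustomer (CustomerKey, Address, EffectiveDate, EndDate, IsCurrent)
    VALUES (@CustomerKey, @NewAddress, GETDATE(), NULL, 1);
END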
19
Q

What is Data Partitioning?

A

Splits large datasets into smaller chunks for faster queries and processing

20
Q

What are the types of Data Partitioning?

A
  • Horizontal Partitioning: Divides data by row (e.g., by date, region)
  • Vertical Partitioning: Stores specific columns separately
    Scenario: Partitioning sales data by year (year=2023, year=2024) speeds up queries for a specific year; see the query sketch below.
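A partition-pruned query sketch for a Synapse serverless SQL pool, assuming a hypothetical raw/sales/year=YYYY/ folder layout in the azdelab storage account used elsewhere in this deck; filepath(1) returns the value matched by the first wildcard, so the engine can skip the other year folders:

SELECT r.filepath(1) AS sale_year, COUNT(*) AS row_count
FROM OPENROWSET(
    BULK 'https://azdelab.dfs.core.windows.net/raw/sales/year=*/*.parquet',
    FORMAT = 'PARQUET'
) AS r
WHERE r.filepath(1) = '2024'     -- partition filter: only the year=2024 folder is read
GROUP BY r.filepath(1);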
21
Q

What are the benefits of Data Partitioning?

A

Improves query performance, reduces scan time, and enhances parallel processing

22
Q

Difference between OLTP and OLAP?

A
  • OLTP (Online Transaction Processing): Used for real-time transactions (e.g., banking, e-commerce). Handles transactional data (e.g., inserting, updating records). Used for day-to-day operations.
  • OLAP (Online Analytical Processing): Used for analytics and reporting (e.g., data warehouses). Handles analytical queries on historical data. Used for business intelligence.
  • OLTP schemas are normalized, while OLAP uses denormalized schemas for fast queries.
23
Q

What is Change Data Capture (CDC)?

A

Captures changes in a database and propagates them to a target system.
Used in real-time data replication and incremental ETL
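A sketch using SQL Server's built-in CDC feature (the dbo.Orders table is hypothetical):

-- Enable CDC at the database level, then for a specific table.
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Orders',   -- hypothetical source table
    @role_name     = NULL;        -- no gating role in this sketch

-- Changed rows can then be read from the generated change function, e.g.:
-- SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');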

24
Q

What is the purpose of a schema in databases?

A

A schema defines the structure of data (e.g., tables, columns, data types). It ensures data consistency and helps in querying and analysis.

25
What is SQL?
SQL (Structured Query Language) is used to query and manipulate relational databases. Data engineers use SQL to extract, transform, and analyze data.
26
Difference between structured, unstructured, and semi-structured data?
* Structured Data: Organized in a fixed format (schema), like tables with rows and columns (e.g., SQL tables).
* Unstructured Data: No predefined structure; data that doesn't follow a specific format, making it harder to organize and analyze (e.g., images, videos, text).
* Semi-structured Data: Has some structure but doesn't fit neatly into tables; often includes tags or markers to separate data elements (e.g., XML, JSON, NoSQL databases).
27
What is a NoSQL database?
NoSQL databases store unstructured or semi-structured data and are highly scalable. Examples include MongoDB and Cassandra. Use them for flexible schemas and high-speed data access.
28
What is version control?
Version control tracks changes to code and scripts. It ensures collaboration, reproducibility, and rollback in case of errors.
29
What is the role of APIs in data engineering?
APIs allow data engineers to extract data from external systems (e.g., social media platforms, payment gateways).
30
Difference between primary key and foreign key?
* Primary Key: Uniquely identifies a record in a table.
* Foreign Key: Links two tables by referencing the primary key of another table (see the sketch below).
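A minimal sketch with hypothetical Customers and Orders tables:

-- Customers.customer_id is the primary key; Orders.customer_id is a foreign key referencing it.
CREATE TABLE Customers (
    customer_id   INT PRIMARY KEY,
    customer_name VARCHAR(100) NOT NULL
);

CREATE TABLE Orders (
    order_id    INT PRIMARY KEY,
    customer_id INT NOT NULL,
    order_date  DATE,
    CONSTRAINT FK_Orders_Customers FOREIGN KEY (customer_id)
        REFERENCES Customers (customer_id)
);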
31
What is Data Normalization?
Normalization organizes data to reduce redundancy and improve integrity. It involves splitting tables and defining relationships.
32
What is Data Replication?
Replication is copying data to multiple locations to ensure availability and fault tolerance. It's important for disaster recovery and high availability.
33
What is a Data Model?
Defines how data is structured and related. To design one:
* Identify entities (e.g., customers, orders).
* Define relationships (e.g., one customer can place many orders).
* Normalize data to reduce redundancy.
34
What is a Star Schema?
* A star schema is a data warehouse design with a central fact table connected to dimension tables.
* Dimension tables are linked to the fact table using foreign keys.
* The layout looks like a star, with the fact table at the center.
* Simple and easy to understand.
* Used for fast query performance in analytical workloads (see the example query below).
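An example query against a hypothetical star schema (FactSales joined to DimDate and DimProduct on their surrogate keys):

SELECT d.calendar_year,
       p.product_category,
       SUM(f.sale_amount) AS total_sales
FROM FactSales f
JOIN DimDate    d ON f.date_key    = d.date_key     -- dimension joins via foreign keys
JOIN DimProduct p ON f.product_key = p.product_key
GROUP BY d.calendar_year, p.product_category;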
35
What is a Snowflake Schema?
* Similar to a star schema but more detailed: dimension tables are further broken down into sub-dimension tables.
* Data is normalized, reducing redundancy.
* The layout looks like a snowflake due to multiple table layers.
* More complex and scalable.
* Ideal for larger datasets requiring detailed analysis.
36
What is Data Governance?
Managing data quality, security, and compliance. It’s important to ensure data is accurate, secure, and used responsibly
37
What is a Fact Table?
* The central table in the schema.
* Contains the data you want to analyze, like sales numbers or transaction amounts.
* Each row represents a specific event or transaction.
38
What are Dimension Tables?
* Provide context to the data in the fact table.
* Contain descriptive information or attributes.
* Each row in a dimension table represents a unique value or category.
39
What are Parquet and ORC files?
Columnar storage formats optimized for big data processing. They reduce storage costs and improve query performance
40
What is Data Deduplication?
Removing duplicate records from a dataset. It’s achieved by identifying unique keys or using tools like Apache Spark
41
Difference between horizontal and vertical scaling?
* Horizontal Scaling: Adding more machines to a system.
* Vertical Scaling: Adding more resources (e.g., CPU, RAM) to a single machine.
42
Difference between Full Load and Incremental Load?
* Full Load: Replaces the entire dataset in the target system.
* Incremental Load: Updates only new or changed data (see the watermark sketch below).
**Scenario:** Use a full load for initial data migration and incremental loads for daily updates.
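A watermark-based incremental load sketch (the staging, dw, and etl tables are hypothetical):

-- Read the last load time recorded for this table.
DECLARE @LastWatermark DATETIME2 =
    (SELECT MAX(last_loaded_at) FROM etl.WatermarkTable WHERE table_name = 'Sales');

-- Load only rows changed since the previous run.
INSERT INTO dw.Sales (sale_id, amount, modified_at)
SELECT s.sale_id, s.amount, s.modified_at
FROM   staging.Sales s
WHERE  s.modified_at > @LastWatermark;

-- Advance the watermark for the next run.
UPDATE etl.WatermarkTable
SET    last_loaded_at = SYSDATETIME()
WHERE  table_name = 'Sales';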
43
What is Big Data?
* Refers to extremely large datasets that come from various sources and require advanced tools to store, process, and analyze effectively.
* Includes structured, semi-structured, and unstructured data that can range from terabytes to petabytes or even exabytes.
44
Key Characteristics of Big Data
* Volume: The massive amount of data generated every second from multiple sources.
* Variety: Data comes in different formats like text, images, videos, audio, and sensor data.
* Velocity: Data is generated at high speed, requiring quick processing.
* Veracity: The accuracy and reliability of the data.
* Value: Extracting meaningful insights from data creates business value.
45
What does ACID stand for?
* Atomicity
* Consistency
* Isolation
* Durability
These properties ensure database transactions are reliable and consistent.
46
Atomicity
A transaction is a single unit of work, meaning it must either be completed fully or not at all. For example, if you’re transferring money between two bank accounts, both the debit and credit actions must happen together. If one fails, both should fail to keep the system consistent
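A T-SQL sketch of the bank-transfer example (the Accounts table is hypothetical); both updates commit together or roll back together:

BEGIN TRY
    BEGIN TRANSACTION;

    UPDATE Accounts SET balance = balance - 100 WHERE account_id = 1;  -- debit
    UPDATE Accounts SET balance = balance + 100 WHERE account_id = 2;  -- credit

    COMMIT TRANSACTION;    -- both changes become permanent together
END TRY
BEGIN CATCH
    ROLLBACK TRANSACTION;  -- any failure undoes both updates
END CATCH;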
47
Consistency
A transaction should bring the database from one valid state to another. The database should always follow its rules and constraints. For example, if a rule says a person’s age must be over 18, a transaction trying to add someone younger than 18 should be stopped.
48
Isolation
Transactions should not interfere with each other. Each transaction must happen independently, so that no other transaction affects it. For example, if two people try to withdraw money from the same account at the same time, the system must run the transactions in isolation so they don't conflict and leave the balance inconsistent.
49
Durability
Once a transaction is completed, it must be permanent, even if there’s a system failure or crash. The changes made by the transaction should be saved and not lost.
50
What is the role of a data engineer in data security?
Ensures data is secure by implementing encryption, access controls, and monitoring systems.
51
How to handle duplicate records in a dataset?
Use SQL DISTINCT or ROW_NUMBER() to remove duplicates
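A ROW_NUMBER() deduplication sketch over a hypothetical dbo.Customers table, keeping the newest row per email:

WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY customer_email
                              ORDER BY updated_at DESC) AS rn
    FROM dbo.Customers
)
DELETE FROM ranked
WHERE rn > 1;   -- every row after the first per key is a duplicate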
52
How to optimize a slow SQL query?
* Use indexes to speed up searches.
* Choose the right join and select only necessary columns when joining.
* Use partitioning for large tables.
* Avoid SELECT * (fetch only required columns).
* Check the query execution plan for bottlenecks.
53
Steps to debug a job that has slowed down?
* Check data volume
* Review query performance
* Check resource allocation
* Optimize partitions and indexing
* Monitor network
54
What is the purpose of using indexes in SQL queries?
Indexes improve the speed of data retrieval operations on a database table; they allow the database to find rows quickly without scanning the entire table.
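A SQL Server-style sketch (table and columns hypothetical):

-- Nonclustered index to support lookups by customer and date.
CREATE NONCLUSTERED INDEX IX_Orders_Customer_Date
    ON dbo.Orders (customer_id, order_date);

-- The optimizer can now seek the index instead of scanning the whole table.
SELECT order_date, total_amount
FROM dbo.Orders
WHERE customer_id = 42;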
55
What should you check if a job's execution time has increased significantly?
1. Check data volume
2. Review query performance
3. Check resource allocation
4. Optimize partitions and indexing
5. Monitor network & I/O bottlenecks
Assessing these factors helps identify the cause of performance issues.
56
What is a recommended approach for loading 1 billion records into a database?
* Use bulk inserts instead of row-by-row inserts.
* Use partitioning and indexing to speed up writes.
* Disable indexes and constraints before the load and re-enable them after.
* Use parallel processing (e.g., Apache Spark, ADF).
* Store data in compressed formats like Parquet for efficiency.
57
How can you ensure data quality in a data pipeline?
1. Validation checks
2. Monitoring and logging
3. Unit testing
4. Data profiling
These steps help maintain data integrity throughout the pipeline.
58
What are common failures in a data pipeline?
1. Network failures
2. Schema changes
3. Data skew
4. Job failures
Each type of failure requires specific handling strategies.
59
What is a data warehouse?
A data warehouse is a centralized repository for structured data, optimized for querying and analysis. It provides a unified view of data for decision-making.
60
What are best practices for designing a Data Warehouse?
1. Use a Star Schema or Snowflake Schema
2. Optimize storage using columnar formats
3. Implement partitioning and clustering
4. Design for scalability
These practices enhance efficiency and performance.
61
What is the difference between a Data Lake and a Data Warehouse?
* Data Lake: Stores raw data (unstructured/semi-structured) without a predefined schema, allowing data in any format.
* Data Warehouse: Stores structured data organized into tables with a predefined schema, optimized for analytical queries.
* Data Lakehouse (e.g., Delta Lake, Snowflake): Combines the features of a data lake and a data warehouse, enabling both storage and analytics.
62
What is the role of Apache Spark in data engineering?
Apache Spark is a distributed computing framework used for processing large datasets. It is faster than Hadoop MapReduce and supports both batch and real-time processing.
63
What does Docker do in data engineering?
Docker is used to containerize applications, ensuring consistency across environments. Data engineers use it to deploy ETL pipelines and data processing tools.
64
How do you handle late-arriving data in a streaming pipeline?
1. Use watermarking
2. Store late data in a separate table
3. Use event-time processing
These strategies help manage data that arrives after the expected time.
65
What is the purpose of using encryption for sensitive data?
Encryption protects sensitive data from unauthorized access by encoding it. AES-256 is a common encryption standard.
66
What is the function of Apache Kafka in data engineering?
Apache Kafka is a distributed streaming platform used for building real-time data pipelines. It enables high-throughput, fault-tolerant messaging between systems.
67
Azure Synapse Analytics
* A cloud-based analytics service from Microsoft.
* Combines big data analytics and data warehousing into a single platform.
* Lets you analyze large amounts of both structured and unstructured data quickly and easily.
* Can handle very large datasets and run complex queries on both structured and unstructured data.
* Fast data processing and analytics with powerful querying capabilities.
* Supports both SQL and Spark.
* Scales up or down based on your workloads, ensuring flexibility and cost efficiency.
* Built-in integration with machine learning, data lakes, and other services.
Data processing here means collecting, storing, transforming, analyzing, and visualizing data.
68
Azure SQL Data Warehouse vs Azure Synapse
* Synapse
  - The evolution of Azure SQL Data Warehouse.
  - Key difference: combines big data analytics and data warehousing into a single platform.
  - Offers more, including big data processing for unstructured data and built-in integration with machine learning, data lakes, and other services.
* SQL Data Warehouse
  - Mainly focused on data warehousing and analytics for structured data.
69
Azure SQL Database vs Azure Synapse
* Synapse
  - Designed for big data analytics and data warehousing.
  - Can handle very large datasets.
  - Supports both SQL and Spark.
* SQL DB
  - A traditional relational database.
  - For smaller-scale applications.
  - Not optimized for big data.
70
SQL Pool ( Data Warehousing )
* Used for large-scale data warehousing.
* Used for structured data processing with traditional relational SQL queries.
* Can run complex queries on large datasets efficiently.
* Ideal for OLAP workloads, analytics, and data warehousing.
* Does not support unstructured data.
* Two types: dedicated SQL pools for high performance, and serverless SQL pools for ad-hoc queries without needing to set up or manage infrastructure.
71
Serverless SQL pools
* Can run SQL queries on data stored in Azure without needing to set up or manage infrastructure.
* Pay-per-query service; you only pay for the data you query.
* Best for on-demand, ad-hoc queries and small to medium workloads.
* Differences from dedicated pools:
  - No need to provision compute resources.
  - On-demand service where you pay per query.
  - More cost-effective for ad-hoc queries and smaller datasets.
72
Dedicated SQL pools
* A distributed database system that uses multiple nodes to store and process data.
* Used for performing complex queries on large datasets quickly.
* Optimized for large-scale data processing and analytics.
* Differences from serverless pools:
  - Provisioned resources for large-scale data warehousing.
  - You pay for a fixed amount of compute and storage.
  - Best for predictable, high-performance workloads.
  - Used for running SQL queries on structured data.
  - Optimized for analytics and data warehousing.
73
Spark Pool
* Used for big data processing with Apache Spark.
* Runs structured and unstructured data processing, big data analytics, and machine learning tasks on large datasets.
* Provides a scalable environment for running Spark jobs (the environment can scale up or down based on the size and complexity of the data being processed).
* Supports both structured and unstructured data.
* Uses languages like Python, Scala, and SQL.
* Ideal for unstructured data and machine learning.
74
Apache Spark
* An open-source framework used for big data processing.
* Helps run complex analytics, machine learning, and data transformation tasks on large datasets that are not easily processed by traditional relational databases.
75
Data Lake
* Used to store large amounts of structured, semi-structured, or unstructured data.
* Integrates with Azure Synapse, allowing you to store raw data in a scalable and secure environment.
* Data can later be processed and analyzed using SQL or Spark pools.
* Integration:
  - You can use PolyBase or serverless SQL pools to directly query data in the Data Lake without loading it into a dedicated SQL pool.
  - You can move data from the Data Lake into SQL pools using Azure Synapse pipelines or Azure Data Factory.
76
PolyBase
Allows querying external data sources without physically moving the data.
77
External Datasource
A connection that allows Synapse to access data stored outside of Synapse, such as Azure Blob Storage or Azure Data Lake.
78
External file format
Describes the structure of external data (e.g., CSV, Parquet) so Synapse knows how to read and interpret it.
79
External Table
* A table that points to data stored outside of Synapse.
* Allows you to query that data without moving it into Synapse (see the sketch below).
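A sketch of an external table over Parquet files, reusing the data source and file format names created in the credential examples later in this deck (the folder path and columns are assumptions):

CREATE EXTERNAL TABLE dbo.ext_green_tripdata
(
    VendorID      INT,
    trip_distance FLOAT,
    total_amount  FLOAT
)
WITH (
    LOCATION    = '/raw/green_tripdata/',   -- hypothetical folder in the data lake
    DATA_SOURCE = ext_ds_adls_azdelab,
    FILE_FORMAT = ext_ff_parquet
);

-- Query it like a normal table; the data stays in the data lake.
SELECT TOP 10 * FROM dbo.ext_green_tripdata;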
80
Materialized View
* Precomputed views that store the result of a query physically in the database.
* Improve performance by avoiding repetitive computation; querying one is faster than running the original query each time.
* Periodically refreshed to keep the data up to date (see the sketch below).
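A dedicated SQL pool sketch (dbo.FactSales and its columns are hypothetical):

-- Precompute an aggregate so repeated queries avoid rescanning the fact table.
CREATE MATERIALIZED VIEW dbo.mv_sales_by_region
WITH (DISTRIBUTION = HASH(region))
AS
SELECT region,
       COUNT_BIG(*)     AS order_count,
       SUM(sale_amount) AS total_sales
FROM dbo.FactSales
GROUP BY region;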
81
Regular View
* Like a saved query.
* Doesn't store data.
* Retrieves the latest data every time you query it.
82
Lake Database
* Used to manage and query large amounts of unstructured or semi-structured data in a data lake.
* Ideal for big data analytics and exploration.
83
Normal SQL query
* Runs on a single node.
* Used for smaller, less complex datasets.
84
Distributed SQL query
* Queries are split across multiple nodes in Synapse SQL pools.
* Faster for large datasets because query execution is parallelized across different resources.
85
Clustered Columnstore indexes
* Stores data in columns instead of rows.
* Ideal for large-scale analytics.
* Compresses the data to save space.
* Improves performance for queries that need to scan a lot of data.
* Efficient for large data analysis (column-based storage).
* When you define a Clustered Columnstore Index (CCI) on a table, every column is stored in a columnar format (compressed and optimized for fast scanning); see the sketch below.
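A sketch in the same style as the partition example later in this deck (table and columns hypothetical):

CREATE TABLE dbo.FactPageViews
(
    view_date  DATE,
    page_id    INT,
    view_count INT
)
WITH
(
    DISTRIBUTION = HASH(page_id),
    CLUSTERED COLUMNSTORE INDEX   -- every column stored in compressed columnar segments
);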
86
Heap Table
* A table without any index.
* Data is stored in no particular order; when you query it, the system has to scan the entire table to find what you need.
* Slower for large tables, but might be fine for smaller, less frequently accessed tables.
87
Clustered index
* Stores and sorts the data in the table in a specific order.
* There is only one clustered index per table because it rearranges the data physically.
* Great for quickly retrieving rows based on a column, like a primary key.
88
Non-Clustered index
* Like a lookup table that points to the data.
* Speeds up queries without changing the data order.
* A table can have multiple nonclustered indexes.
89
protocols
HTTPS and ABFSS are two different protocols for accessing data in Azure Data Lake Storage.
90
HTTPS protocols (Hypertext Transfer Protocol Secure)
* The standard protocol used for general secure web access and for securing data transfer over the internet.
* When you use HTTPS with Azure Data Lake, you access data over the web securely using URLs like `https://<storage-account>.dfs.core.windows.net`.
91
ABFSS (Azure Blob File System Secure)
* A specialized protocol designed for Azure Data Lake Storage Gen2.
* An optimized version of the HTTPS protocol for accessing large datasets, providing better integration with Azure analytics tools.
* ABFSS URLs look like `abfss://<container>@<storage-account>.dfs.core.windows.net`.
* Offers better performance and functionality for big data analytics and is preferred for data lake-related tasks.
92
parser version
When working with CSV file formats, the parser version selects the method used to parse data during load operations.
* Parser Version 1.0: Less flexible, basic functionality; can be slower or less efficient with large or complex files.
* Parser Version 2.0: Better performance; handles larger files more efficiently, especially for complex data or large datasets. Supports better error handling, column data type detection, and enhanced parsing options.
93
Collation
* Determines how text data is sorted and compared.
* Case sensitivity: whether uppercase and lowercase letters are treated the same or differently.
* Accent sensitivity: whether accented characters (like é or è) are treated as different from non-accented ones (like e).
* Sorting: how data is ordered, such as alphabetically or numerically.
* Consistency: ensures that comparisons and sorting happen the way you expect.
* Performance: the right collation can improve query performance when working with large datasets.
94
Distribution
Determines how data is stored across compute nodes for parallel processing and performance optimization.
95
Hash Distributed
* Data is "evenly distributed" across nodes based on a selected "hash key" (column). * best for "Large fact tables" in "star schema" (e.g., Sales Transactions). * Improves "query performance" by minimizing data movement.
96
Round Robin Distributed
* Data is "randomly" distributed across nodes "without a specific key". * best for "Staging tables" or tables with "no clear distribution column". * Simple and ensures even distribution but may cause "data movement during joins"
97
Replicated
* A "full copy" of the table is stored on "each compute node". * best for "Small dimension tables" used in "joins" (e.g., Product, Country). * "Eliminates data movement", improving join performance. * Not recommended for "large tables" due to storage overhead.
98
Partition
* In Azure Synapse, partitioning improves query performance by dividing large tables into smaller parts.
Benefits:
* Faster queries by scanning only relevant partitions.
* Improved data load performance.
* Better maintenance and indexing.

CREATE TABLE SalesData
(
    SalesID INT IDENTITY(1,1),
    ProductName VARCHAR(100),
    SaleAmount DECIMAL(10,2),
    SaleDate DATE
)
WITH
(
    DISTRIBUTION = HASH(SalesID),   -- choose an appropriate distribution
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (SaleDate RANGE RIGHT FOR VALUES ('2023-01-01', '2023-04-01', '2023-07-01', '2023-10-01'))
);
99
OPENROWSET
* Allows querying external data directly from files in Azure Data Lake Storage (ADLS) or Blob Storage without needing to load the data into a table.
Common use cases:
🔹 Ad-hoc analysis of raw files
🔹 Quick data exploration without creating tables
Limitations:
⚠ Only works in serverless SQL pools (not dedicated SQL pools)
⚠ No DML operations (INSERT/UPDATE/DELETE)
100
Control Node
🔹 **Control Node**
- Acts as the **brain** of the Synapse Dedicated SQL Pool.
- Manages query execution and optimization.
- Distributes work to **compute nodes**.
- Handles metadata, session control, and coordination.
**Analogy:**
- **Control Node = Manager** (assigns tasks).
- **Compute Nodes = Workers** (do the actual work).
101
Compute Node
🔹 **Compute Node**
- Executes the actual **query workload**.
- Stores and processes data using **distributed storage**.
- Works in parallel for **faster query performance**.
- The number of compute nodes depends on the **DWU (Data Warehouse Unit) scale**.
102
How to load data
* "Copy Command" - a simple way to copy data from sources like Azure Blob Storage * "PolyBase" - is a tool to load data from external sources like Hadoop or Azure Data Lake * "Azure Data Factory" - to move and transform data
103
handle security / how to secure
* Azure AD: for identity and authentication (secure login).
* Role-based Access Control (RBAC): to manage user access and grant permissions.
* Firewall rules: to restrict access to the Synapse workspace and to specific IP addresses.
104
handle big data
* Apache Spark: for big data processing; allows you to analyze unstructured data.
* Data Lake: integrates with Azure Data Lake, which can store massive amounts of unstructured data.
* SQL pool: for fast querying of structured data.
105
handle ETL processes
* Use a Synapse pipeline:
* Extract data from various sources.
* Transform data using built-in data flows or by running scripts (SQL, Spark).
* Load the transformed data into Azure Synapse or other data storage services.
106
handle schema changes
You can use standard ALTER TABLE commands; see the sketch below.
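A few hypothetical examples (table and column names assumed; dedicated SQL pools support a subset of ALTER TABLE options):

ALTER TABLE dbo.Customers ADD loyalty_tier VARCHAR(20) NULL;        -- add a new column
ALTER TABLE dbo.Customers ALTER COLUMN customer_name VARCHAR(200);  -- widen an existing column
ALTER TABLE dbo.Customers DROP COLUMN legacy_code;                  -- remove an obsolete column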
107
how to monitor performance
* Azure Synapse Studio: use the built-in monitoring features, such as the Monitor hub, to view pipeline, Spark, SQL pool (dedicated and serverless), and activity run metrics.
108
how to optimize performance
* use "Clustered ColumnStore Index" - to store data in column-wise format - is faster for analytics * distribute tables properly - use hash distribution for large table to evenly spread data across nodes
109
how to optimize costs
* use "serverless sql pools" for ad-hoc queries to avoid provision costs * paused "dedicated sql pools" when not in use to save costs * use "data compression" to reduce storage costs
110
Scale
* Synapse scales compute and storage independently.
* Compute scaling:
  - In dedicated SQL pools, you can scale compute resources (DWUs) up or down based on your workloads.
  - For Spark pools, you can adjust the number and type of nodes.
* Storage scaling: scalable by default.
* Best practices:
  - Scale compute resources up during peak times and down during idle times to save costs.
  - Use serverless SQL pools for smaller ad-hoc queries to avoid the cost of dedicated resources.
111
SELECT data from external csv file
SELECT Customersgender, COUNT(customer_id)
FROM OPENROWSET(
    BULK 'https://azdelab.dfs.core.windows.net/raw/Customers.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS [result]
GROUP BY Customersgender;
112
Copy into command
COPY INTO dbo.green_tripdata
(
    VendorID 1, store_and_fwd_flag 4, RatecodeID 5, PULocationID 6, DOLocationID 7,
    passenger_count 8, trip_distance 9, fare_amount 10, mta_tax 12, tip_amount 13,
    tolls_amount 14, ehail_fee 15, improvement_surcharge 16, total_amount 17,
    payment_type 18, trip_type 19, congestion_surcharge 20
)
FROM 'https://azdelab.dfs.core.windows.net/raw/green_tripdata_2019-12.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    MAXERRORS = 0,
    IDENTITY_INSERT = 'OFF'
    --,AUTO_CREATE_TABLE = 'ON'
)
113
Create External File Format, Data Source, Credential with Shared Access Signature
-- Create serverless database
CREATE DATABASE serverless_db_demo

-- Create external data source
USE serverless_db_demo

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Welcome@123'

CREATE DATABASE SCOPED CREDENTIAL adls_azdelab_credential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = 'sv=2022-11-02&ss=bfqt&srt=sco&sp=rwdlacupyx&se=2025-04-01T18:51:11Z&st=2025-03-01T10:51:11Z&spr=https&sig=4BoyfWPZ2%2FBauY9MbfoUEIJPSAySoof2bDipQz9O7R4%3D'
GO

CREATE EXTERNAL DATA SOURCE ext_ds_adls_azdelab
WITH (
    LOCATION = 'https://azdelab.dfs.core.windows.net/',
    CREDENTIAL = adls_azdelab_credential
);

-- Create file formats - supported formats are Parquet and delimited text
CREATE EXTERNAL FILE FORMAT ext_ff_delimited_text
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"')
);

CREATE EXTERNAL FILE FORMAT ext_ff_parquet
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec' --'org.apache.hadoop.io.compress.GzipCodec'
);
114
Create External File Format, Data Source, Credential with Managed Identity
CREATE DATABASE serverless_db_demo_1 COLLATE Latin1_General_100_BIN2_UTF8;

USE serverless_db_demo_1
GO

/* First, grant the 'Storage Blob Data Contributor' role to Synapse in the RBAC of the Storage Account */
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Welcome@123'

CREATE DATABASE SCOPED CREDENTIAL adls_azdelab_credential
WITH IDENTITY = 'Managed Identity'
GO

CREATE EXTERNAL DATA SOURCE ext_ds_adls_azdelab
WITH (
    LOCATION = 'https://azdelab.dfs.core.windows.net/',
    CREDENTIAL = adls_azdelab_credential
);

CREATE EXTERNAL FILE FORMAT ext_ff_parquet
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec' --'org.apache.hadoop.io.compress.GzipCodec'
);
115
Azure Data Factory (ADF)
* Azure Data Factory is a cloud-based data integration (**ETL/ELT**) service.
* Allows you to create, schedule, and orchestrate data pipelines.
* Supports data movement and transformation from various sources to different destinations.
* Supports both cloud and on-premises data sources.
116
Key components of Azure Data Factory
* Pipelines: groups of activities that perform ETL processes.
* Activities: individual tasks within a pipeline (e.g., Copy Activity, Data Flow).
* Datasets: represent data structures, pointing to data sources or destinations.
* Linked Services: connections to data stores and compute environments.
* Triggers: automate pipeline execution based on schedules or events.
117
Pipelines in Azure Data Factory
* A pipeline in ADF is a logical grouping of activities (ETL tasks) that together perform a task.
* A pipeline can contain one or more activities, such as copying data from one source to another or transforming data.
* Pipelines are used to orchestrate data movement, transformation, and load operations from source to destination.
118
'Activities' in Azure Data Factory
* Activities are the individual tasks in a pipeline that perform a specific operation.
* Types of activities in ADF:
  - Data movement activities (e.g., Copy Data)
  - Data transformation activities (e.g., Data Flow, HDInsight, Databricks)
  - Control flow activities (e.g., ForEach, If Condition, Until)
* Examples:
  - Copy Activity: copies data from source to destination.
  - Data Flow Activity: for data transformation.
  - Execute Pipeline Activity: calls another pipeline.
  - Stored Procedure Activity: executes a stored procedure.
119
Datasets in Azure Data Factory
* A dataset represents the structure of data within ADF and points to the source or destination data.
* For example, a dataset can represent a table in a database, a file in Blob Storage, or an Azure SQL Database table.
120
Linked service in Azure Data Factory
* Defines the connection information for data sources or destinations.
* Like a connection string that tells ADF how to connect to data stores, such as Azure SQL Database, Blob Storage, or on-premises databases.
121
Triggers in Azure Data Factory
* Triggers are used to automatically initiate pipeline execution. There are different types:
* Schedule Trigger: executes pipelines on a defined schedule.
* Event-based Trigger: executes pipelines when a specific event occurs, such as a file being uploaded to Blob Storage.
* Manual Trigger: initiated by the user when needed.
Scenario: A pipeline is scheduled to run daily to extract sales data from a SQL database and load it into a Data Lake.
122
Integration Runtimes (IR) in Azure Data Factory
* The compute infrastructure used by ADF to move data between data stores. There are three types:
  - Azure IR: for cloud-to-cloud data transfer.
  - Self-hosted IR: for on-premises data movement or hybrid scenarios.
  - Azure-SSIS IR: for running SSIS packages in the cloud.
123
'Copy Data' activity in Azure Data Factory
* The Copy Data activity is used to copy data from a source to a destination.
* It can handle data movement between on-premises and cloud sources.
Scenario: If you need to copy data from an on-premises SQL Server to Azure Blob Storage, you would configure a Copy Data activity to specify the source and destination.
124
Lookup activity in ADF
* The Lookup activity is used to retrieve a value or a set of values from a data source.
* It's commonly used to retrieve parameters or configurations before performing other activities.
Scenario: Before running a data copy operation, you may use a Lookup activity to check the last successful run time from a configuration table and pass that value as a parameter to filter records in the source system.
125
Mapping Data Flows in ADF
* Mapping Data Flows provide a visual, code-free way to design data transformations (e.g., joins, aggregations, lookups) within ADF, using a Spark-based execution engine
126
Data flow transformations in Azure Data Factory
* ADF's Data Flow provides a no-code, visually designed environment for building transformations.
* Unlike traditional ETL, where transformations happen in separate systems (e.g., SQL or Databricks), Data Flow integrates the entire pipeline from extraction to transformation in a visual, scalable manner.
* Data flows allow complex transformations like:
  - Filter: remove unwanted data.
  - Join: merge data from multiple sources.
  - Aggregate: group data by a column and apply aggregations like SUM, COUNT.
  - Derived Column: add or modify columns.
  - Sort: sort data based on one or more columns.
Scenario: If you want to calculate the total sales per region and filter out regions with sales below a threshold, you could use Aggregate and Filter transformations in Data Flow.
127
Parameter in ADF
* Parameters are used to pass dynamic values to pipelines, datasets, or activities.
* You can define pipeline parameters and pass values at runtime (e.g., `@pipeline().parameters.FileName`).
128
ETL and ELT
* ETL stands for Extract, Transform, Load:
  - Extract: pulling data from various sources (e.g., databases, APIs).
  - Transform: cleaning, filtering, and structuring the data.
  - Load: storing the data in a target system (e.g., data warehouse).
* Difference:
  - ETL (Extract, Transform, Load): data is transformed before loading into the target system (used in traditional data warehouses).
  - ELT (Extract, Load, Transform): data is first loaded and then transformed (used in modern cloud-based architectures like Snowflake, BigQuery).
**Why it's important:** ETL ensures data is consistent, accurate, and ready for analysis.
129
Difference between a Data Pipeline and a Data Flow in ADF
* Pipeline: a logical grouping of activities that perform a unit of work. Pipelines define the orchestration of data movements and transformations.
* Data Flow: a visual design tool for creating data transformation logic. It is part of the pipeline and focuses on data transformation.
Scenario: If you are moving data from SQL Server to Azure Data Lake and want to perform data cleansing, you might use a data flow for the transformation step inside the pipeline.
130
Difference between Copy Activity and Data Flow in ADF
* Copy Activity: used for simple data movement from source to destination with no or minimal transformation.
* Data Flow: used for complex data transformation within the pipeline before moving data to the destination. It is a visual interface for data transformation.
Scenario: If you just need to move data from a flat file in Blob Storage to an Azure SQL Database, you use the Copy Activity. If you need to clean or aggregate the data before loading it, you would use Data Flow.
131
Difference between Azure Data Factory and SSIS
* ADF is cloud-based, whereas SSIS is an on-premises ETL tool.
* ADF is serverless and scales dynamically, while SSIS requires dedicated servers.
* SSIS has rich transformations in SQL Server Data Tools (SSDT), while ADF relies on Mapping Data Flows for complex transformations.
132
Difference between Azure Data Factory (ADF) and Azure Synapse Analytics
ADF is focused on ETL and data integration, while Azure Synapse (formerly Azure SQL Data Warehouse) combines big data and data warehousing, offering on-demand SQL queries, Spark, and data exploration