Data Engineering Part 2 Flashcards
(15 cards)
What does ETL stand for?
Extract, Transform, Load.
What is the purpose of ETL?
To move data from source systems into a centralized store, such as a data warehouse, in a form suitable for analysis.
What is the difference between ETL and ELT?
ETL transforms data before loading it into the target system; ELT loads raw data first and transforms it inside the target system afterward.
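A minimal sketch of the ETL/ELT contrast, assuming an in-memory SQLite database stands in for the target system. The sample rows and the table names (etl_sales, raw_sales, elt_sales) are hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical raw records from a source system (messy names, string amounts).
RAW_ROWS = [(" Alice ", "10"), ("BOB", "20"), ("carol", "30")]


def run_etl(conn):
    # ETL: transform in application code first, then load the cleaned rows.
    cleaned = [(name.strip().lower(), int(amount)) for name, amount in RAW_ROWS]
    conn.execute("CREATE TABLE etl_sales (name TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO etl_sales VALUES (?, ?)", cleaned)
    return conn.execute(
        "SELECT name, amount FROM etl_sales ORDER BY name"
    ).fetchall()


def run_elt(conn):
    # ELT: load the raw rows untouched, then transform inside the target with SQL.
    conn.execute("CREATE TABLE raw_sales (name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", RAW_ROWS)
    conn.execute(
        "CREATE TABLE elt_sales AS "
        "SELECT lower(trim(name)) AS name, CAST(amount AS INTEGER) AS amount "
        "FROM raw_sales"
    )
    return conn.execute(
        "SELECT name, amount FROM elt_sales ORDER BY name"
    ).fetchall()


conn = sqlite3.connect(":memory:")
etl_result = run_etl(conn)
elt_result = run_elt(conn)
```

Both paths end with the same cleaned table; what differs is where the transformation runs and whether the raw data is kept in the target.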
What is the ‘extract’ step in ETL?
Retrieving raw data from source systems.
What is the ‘transform’ step in ETL?
Cleaning, standardizing, or reshaping data for analysis or loading.
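As an illustration of the three kinds of transformation named above, here is a small sketch. The record layout (date and amount fields, US-style dates) is an assumption made for the example, not something a real source is guaranteed to provide.

```python
from datetime import datetime


def transform(records):
    """Clean, standardize, and reshape hypothetical raw order records."""
    totals = {}
    for rec in records:
        # Cleaning: skip records with a missing amount.
        if rec.get("amount") in (None, ""):
            continue
        # Standardizing: convert the date to ISO 8601 and the amount to float.
        day = datetime.strptime(rec["date"], "%m/%d/%Y").date().isoformat()
        amount = float(rec["amount"])
        # Reshaping: collapse row-level records into one total per day.
        totals[day] = totals.get(day, 0.0) + amount
    return totals


raw = [
    {"date": "01/02/2024", "amount": "10.5"},
    {"date": "01/02/2024", "amount": "4.5"},
    {"date": "01/03/2024", "amount": ""},
    {"date": "01/03/2024", "amount": "7"},
]
daily_totals = transform(raw)
```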
What is batch processing?
Processing accumulated data in large groups, typically on a schedule, rather than record by record.
What is stream processing?
Processing data continuously, record by record or in small windows, as it arrives.
When is batch processing preferred?
For historical analysis, or when real-time results aren’t required.
What is a micro-batch?
A small, time-bounded chunk of data that is processed as a unit in near-real-time processing.
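One way to picture micro-batching is to group timestamped events into fixed time windows. The sketch below assumes events arrive as (timestamp_in_seconds, payload) pairs; the function name micro_batches and the 5-second window are illustrative choices.

```python
from collections import defaultdict


def micro_batches(events, window_seconds):
    """Group (timestamp, payload) events into fixed, time-bounded windows."""
    batches = defaultdict(list)
    for ts, payload in events:
        # Every event belongs to the window that starts at the nearest
        # lower multiple of window_seconds.
        window_start = (ts // window_seconds) * window_seconds
        batches[window_start].append(payload)
    # Return the batches in window order.
    return [batches[start] for start in sorted(batches)]


events = [(0, "a"), (1, "b"), (4, "c"), (5, "d"), (9, "e"), (12, "f")]
batches = micro_batches(events, window_seconds=5)
```

A smaller window lowers latency but produces more, smaller batches; a very large window behaves like ordinary batch processing.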
What is latency in data pipelines?
The delay between when data is generated and when it has been processed or is available as output.
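Measured per record, latency is simply the processing timestamp minus the generation timestamp. A minimal sketch, with made-up timestamps:

```python
from datetime import datetime, timedelta


def latency_seconds(generated_at, processed_at):
    """Return the delay, in seconds, between generation and processing."""
    return (processed_at - generated_at).total_seconds()


# Hypothetical timestamps for a single record.
generated_at = datetime(2024, 1, 1, 12, 0, 0)
processed_at = generated_at + timedelta(seconds=2, milliseconds=500)
record_latency = latency_seconds(generated_at, processed_at)
```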
What is a DAG in data engineering?
A Directed Acyclic Graph representing dependencies between tasks.
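A small sketch of a task DAG using Python's standard-library graphlib module (Python 3.9+). The task names are hypothetical. Each task maps to the set of tasks it depends on, and a topological sort yields a valid execution order.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies):
    """Return an order in which every task runs after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical pipeline: each key depends on the tasks in its value set.
pipeline = {
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}
order = execution_order(pipeline)

# A cycle makes the graph invalid as a DAG: no valid order exists.
try:
    execution_order({"a": {"b"}, "b": {"a"}})
    cycle_detected = False
except CycleError:
    cycle_detected = True
```

The "acyclic" requirement is what guarantees that some valid execution order exists.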
What is Apache Airflow?
An open-source tool for authoring, scheduling, and monitoring workflows as code.
What is task scheduling?
Setting when and how often tasks in a pipeline should run.
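For a fixed-interval schedule, the run times follow directly from a start time and an interval. The sketch below is illustrative only; orchestration tools typically express this with cron expressions or presets rather than a helper like the hypothetical scheduled_runs.

```python
from datetime import datetime, timedelta


def scheduled_runs(start, interval, count):
    """Return the first `count` run times of a fixed-interval schedule."""
    return [start + i * interval for i in range(count)]


# Hypothetical daily schedule starting at midnight on 2024-01-01.
runs = scheduled_runs(datetime(2024, 1, 1), timedelta(days=1), 3)
```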
What is task dependency?
A rule that defines which task must complete before another can begin.
What is task failure recovery?
A mechanism for retrying failed tasks or resuming a pipeline from the point of failure instead of restarting it from the beginning.
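A minimal retry sketch, assuming that failures surface as exceptions and that the task is safe to run more than once. The names run_with_retries and flaky_task are hypothetical.

```python
import time


def run_with_retries(task, max_retries=3, delay_seconds=0.0):
    """Run task(); on an exception, retry up to max_retries more times."""
    attempt = 0
    while True:
        try:
            return task()
        except Exception:
            # Give up and re-raise once the retry budget is spent.
            if attempt >= max_retries:
                raise
            attempt += 1
            time.sleep(delay_seconds)


calls = {"count": 0}


def flaky_task():
    # Hypothetical task that fails twice before succeeding.
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "done"


result = run_with_retries(flaky_task, max_retries=3)
```

Retrying is only safe when the task is idempotent, so that a partial first attempt does not lead to duplicated or corrupted output.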