Data Engineering Part 2 Flashcards by James Clyne

What does ETL stand for?

Extract, Transform, Load.

What is the purpose of ETL?

To move data from source systems to a centralized location for analysis.

What is the difference between ETL and ELT?

ETL transforms data before loading; ELT loads raw data and transforms later.
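
The contrast can be sketched with an in-memory SQLite database standing in for the target system. The table names, the sample rows, and the cleaning rules below are assumptions made for illustration, not part of the card.

```python
import sqlite3

# Hypothetical raw source rows: untrimmed names, amounts stored as text.
RAW_ROWS = [(" Alice ", "10"), ("BOB", "25")]


def etl(conn, rows):
    # ETL: transform in application code first, then load the cleaned rows.
    cleaned = [(name.strip().lower(), int(amount)) for name, amount in rows]
    conn.execute("CREATE TABLE sales_etl (name TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", cleaned)


def elt(conn, rows):
    # ELT: load the raw rows untouched, then transform inside the database.
    conn.execute("CREATE TABLE sales_raw (name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", rows)
    conn.execute(
        "CREATE TABLE sales_elt AS "
        "SELECT LOWER(TRIM(name)) AS name, CAST(amount AS INTEGER) AS amount "
        "FROM sales_raw"
    )


conn = sqlite3.connect(":memory:")
etl(conn, RAW_ROWS)
elt(conn, RAW_ROWS)

query = "SELECT name, amount FROM {} ORDER BY amount"
etl_rows = conn.execute(query.format("sales_etl")).fetchall()
elt_rows = conn.execute(query.format("sales_elt")).fetchall()
```

Both paths end with the same cleaned table. The difference is where the transformation runs, and that in the ELT path the raw data stays available in `sales_raw`.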

What is the ‘extract’ step in ETL?

Retrieving raw data from source systems.

What is the ‘transform’ step in ETL?

Cleaning, standardizing, or reshaping data for analysis or loading.
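
As a rough illustration, one small function can show all three kinds of transformation. The record schema (`id`, `date`, `amount`) and the day/month/year input format are assumptions chosen for the example.

```python
from datetime import datetime


def transform(records):
    """Clean, standardize, and reshape raw records (hypothetical schema)."""
    out = {}
    for rec in records:
        # Cleaning: drop rows that have no usable id.
        if not rec.get("id"):
            continue
        # Standardizing: convert DD/MM/YYYY strings to ISO 8601 dates
        # and text amounts to floats.
        day = datetime.strptime(rec["date"], "%d/%m/%Y").date()
        out[rec["id"]] = {"date": day.isoformat(), "amount": float(rec["amount"])}
    # Reshaping: a list of rows becomes a dict keyed by id.
    return out


raw = [
    {"id": "a1", "date": "03/02/2024", "amount": "19.50"},
    {"id": "", "date": "04/02/2024", "amount": "5"},
]
clean = transform(raw)
```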

What is batch processing?

Processing large volumes of data at once on a scheduled basis.

What is stream processing?

Processing data in real time as it arrives.
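
A toy comparison with batch processing, assuming the "events" are plain numbers and the job is a running sum. A real stream would come from a message queue or socket rather than a list.

```python
def batch_total(events):
    # Batch: the whole dataset is available, so it is processed in one pass.
    return sum(events)


def stream_totals(events):
    # Stream: the result is updated as each event arrives, so a current
    # answer exists after every event rather than only at the end.
    total = 0
    for event in events:
        total += event
        yield total


events = [3, 1, 4]
running = list(stream_totals(iter(events)))
final = batch_total(events)
```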

When is batch processing preferred?

For historical analysis, or when real-time results aren't required.

What is a micro-batch?

Small, time-bounded data chunks used in near-real-time processing.
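
One way to picture this is to group timestamped events into fixed windows. The sketch assumes events are `(timestamp_seconds, value)` pairs already sorted by timestamp. The helper name `micro_batches` and the 5-second window are hypothetical.

```python
from itertools import groupby


def micro_batches(events, window_seconds):
    """Group time-sorted (timestamp, value) events into fixed windows."""
    # Events whose timestamps fall in the same window share a key.
    for window, group in groupby(events, key=lambda e: e[0] // window_seconds):
        yield window * window_seconds, [value for _, value in group]


events = [(0, "a"), (3, "b"), (5, "c"), (11, "d")]
batches = list(micro_batches(events, window_seconds=5))
```

Each yielded pair is the window start time and the values that arrived in that window, which a downstream step can then process as a small batch.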

What is latency in data pipelines?

The delay between when data is generated and when it is processed or its output becomes available.
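
In practice, latency is often measured by comparing two timestamps attached to a record. The timestamps below are invented for the example.

```python
from datetime import datetime, timedelta


def latency(created_at, processed_at):
    # Latency is the elapsed time between generation and processing.
    return processed_at - created_at


created = datetime(2024, 1, 1, 12, 0, 0)
processed = created + timedelta(seconds=2.5)
delay_seconds = latency(created, processed).total_seconds()
```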

What is a DAG in data engineering?

A Directed Acyclic Graph representing dependencies between tasks.
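
A minimal sketch using `graphlib` from the Python standard library (Python 3.9+). The task names are hypothetical. Each key maps to the set of tasks it depends on.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks that must finish before it.
dag = {
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# A topological order is a valid execution order for the tasks.
order = list(TopologicalSorter(dag).static_order())


def is_acyclic(graph):
    # A graph with a cycle has no valid execution order.
    try:
        list(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False
```

The "acyclic" requirement matters because a cycle, such as two tasks that each wait on the other, could never be scheduled.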

What is Apache Airflow?

An open-source tool for authoring, scheduling, and monitoring workflows as code.

What is task scheduling?

Setting when and how often tasks in a pipeline should run.
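
A simplified view of a fixed-interval schedule, computing the first few run times. The function `next_runs` is a hypothetical helper. Real schedulers usually also handle cron expressions, time zones, and missed runs.

```python
from datetime import datetime, timedelta


def next_runs(start, interval, count):
    # "When" is the start time; "how often" is the interval.
    return [start + i * interval for i in range(count)]


runs = next_runs(datetime(2024, 1, 1), timedelta(hours=6), count=3)
```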

What is task dependency?

A rule that defines which task must complete before another can begin.
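
The rule can be expressed as a check of which tasks are allowed to start given what has already finished. The task names and the helper `ready_tasks` are assumptions for illustration.

```python
def ready_tasks(dependencies, completed):
    """Return tasks that have not run yet and whose upstream tasks are all done."""
    return sorted(
        task
        for task, upstream in dependencies.items()
        if task not in completed and upstream <= completed
    )


deps = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

at_start = ready_tasks(deps, completed=set())
after_extract = ready_tasks(deps, completed={"extract"})
```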

What is task failure recovery?

A method to retry or resume failed steps in a pipeline.
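
One common form of recovery is retrying with exponential backoff. This is a minimal sketch, not any specific orchestrator's mechanism. `run_with_retries` and `flaky` are hypothetical names, and the delay is set to zero so the example finishes instantly.

```python
import time


def run_with_retries(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call task(); on failure, wait and retry up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))


calls = {"n": 0}


def flaky():
    # Simulated task that fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


result = run_with_retries(flaky, max_attempts=3, base_delay=0.0)
```

Retrying suits transient failures such as a brief network outage. Resuming from a checkpoint, the other option the card mentions, avoids redoing steps that already succeeded.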

Data Engineering Part 2 Flashcards

(15 cards)