Data Engineer Flashcards

(92 cards)

1
Data Engineer
A professional responsible for designing and maintaining data systems and pipelines.
2
Pipeline
A series of automated processes that move and transform data from one system to another.
3
ETL (Extract-Transform-Load)
A data integration process that extracts data from source systems, transforms it, and loads it into target storage.
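
For illustration, a minimal ETL sketch in Python using only the standard library; the file, database, and table names are hypothetical.

# Minimal ETL sketch (hypothetical file and table names).
import csv
import sqlite3

# Extract: read raw rows from a source CSV file.
with open("orders_raw.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: clean and reshape the records.
cleaned = [
    {"order_id": r["order_id"], "amount": round(float(r["amount"]), 2)}
    for r in rows
    if r.get("amount")  # drop rows with a missing amount
]

# Load: write the transformed rows into a target table.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (order_id, amount) VALUES (:order_id, :amount)", cleaned
)
conn.commit()
conn.close()
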
4
Structured Data
Data organized into a predefined format of rows and columns (e.g., tables in a SQL database).
5
Unstructured Data
Data that does not have a predefined format (e.g., images, videos, text documents).
6
Big Data
Large volumes of data that require advanced processing and analysis tools.
7
Scalable
Able to handle increased data or workload without performance issues.
8
Workflow
A sequence of tasks or processes to achieve a goal.
9
Collaboration
Working together with team members to achieve a shared goal.
10
Proficiency
A high level of skill in a particular area (e.g., “proficient in SQL”).
11
Background
Professional history, experience, and education.
12
Database
A structured collection of data stored electronically.
13
Data Warehouse
A central repository of integrated data used for reporting and analysis.
14
Data Lake
A large storage repository that holds raw, unstructured, and structured data.
15
Schema
The structure or blueprint of a database (tables, columns, data types).
16
Ingestion
The process of bringing data from various sources into a system.
17
Transformation
The process of cleaning, modifying, or enriching data for analysis.
18
Batch Processing
Data processing that happens in groups or batches at scheduled intervals.
19
Streaming Processing
Real-time data processing as the data arrives.
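
A toy Python contrast between the two processing modes above; this is only an in-memory illustration, not a real streaming framework, and the event names are made up.

# Toy contrast between batch and streaming processing (in-memory example).
def batch_process(records):
    # Batch: wait until the whole group is available, then process it at once.
    return [r.upper() for r in records]

def stream_process(record_source):
    # Streaming: handle each record as soon as it arrives.
    for record in record_source:
        yield record.upper()

events = ["click", "view", "purchase"]
print(batch_process(events))             # processed as one batch
for result in stream_process(iter(events)):
    print(result)                        # processed one event at a time
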
20
Data Modeling
Designing the structure and relationships of data in a database.
21
Normalization
Organizing data to reduce redundancy and improve efficiency.
22
Indexing
Creating data structures that improve the speed of data retrieval.
23
Metadata
Data that describes other data (e.g., file name, size, creation date).
24
SQL
Structured Query Language used to manage and query relational databases.
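
For illustration, a basic SQL query run through Python's built-in sqlite3 module; the users table and its rows are made up.

# A basic SQL query executed through Python's built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO users (name, country) VALUES (?, ?)",
    [("Ana", "PT"), ("Bo", "SE"), ("Cruz", "PT")],
)

# SELECT retrieves rows; WHERE filters them; ORDER BY sorts the result.
for row in conn.execute(
    "SELECT name FROM users WHERE country = ? ORDER BY name", ("PT",)
):
    print(row[0])   # Ana, Cruz
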
25
Python
A general-purpose programming language commonly used in data engineering for scripting and automation.
26
Apache Spark
A distributed computing system used for big data processing and analytics.
27
Apache Hadoop
An open-source framework that allows for the distributed storage and processing of large data sets.
28
Airflow
A platform to programmatically author, schedule, and monitor workflows (used to manage ETL pipelines).
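
A minimal Airflow DAG sketch, assuming a recent Airflow 2.x install; the dag_id, task ids, and callables are made-up examples, not a definitive pipeline.

# Minimal Airflow 2.x DAG sketch (dag_id, task ids, and callables are hypothetical).
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source")

def load():
    print("write data to the warehouse")

with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",           # run the pipeline once per day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task    # load runs only after extract succeeds
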
29
Docker
A tool that packages applications and their dependencies into containers for consistent deployment.
30
Git
A version control system for tracking changes in source code.
31
Jupyter Notebook
A web-based tool used to write and run code, often used for data analysis and visualization.
32
Shell Script
A script written to be executed by a Unix shell, often used for automation.
33
Cloud Platforms (AWS, GCP, Azure)
Services offering scalable infrastructure and tools for managing and processing data in the cloud.
34
Relational Database (RDBMS)
A database that organizes data into tables with rows and columns, using relationships between them.
35
Primary Key
A unique identifier for each record in a table.
36
Foreign Key
A field in one table that links to the primary key in another table.
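A sketch showing a primary key and a foreign key in SQLite DDL, driven from Python; the customers/orders tables are hypothetical.

# Primary key / foreign key sketch using SQLite DDL (hypothetical tables).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,   -- unique identifier for each customer
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),  -- foreign key
        amount      REAL
    );
""")
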
37
Index
A data structure that speeds up data retrieval from a database.
38
Query
A request to retrieve or manipulate data from a database.
39
Query Optimization
The process of improving a query’s performance.
40
Join
Combining rows from two or more tables based on a related column.
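A join sketch on toy data, again through sqlite3: each order row is matched to the customer row that shares its customer_id.

# INNER JOIN sketch: combine rows from two tables on a shared key (toy data).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Bo');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# Each order row is matched with the customer row that shares customer_id.
query = """
    SELECT c.name, SUM(o.amount) AS total_spent
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.name
"""
for name, total in conn.execute(query):
    print(name, total)   # Ana 65.0, Bo 15.0
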
41
Normalization
Structuring a database to reduce redundancy and improve efficiency.
42
Transaction
A sequence of database operations treated as a single logical unit.
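A transaction sketch with sqlite3 on a toy accounts table: both updates are applied together or not at all.

# Transaction sketch: both updates succeed together or neither is applied.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100.0), ("bob", 50.0)])
conn.commit()

try:
    with conn:   # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
except sqlite3.Error:
    print("transfer rolled back, balances unchanged")
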
43
Stored Procedure
A saved collection of SQL statements that can be executed as a single unit.
44
ACID Properties
The principles of transactions: Atomicity, Consistency, Isolation, Durability.
45
NoSQL Database
A non-relational database designed for flexible, scalable storage.
46
Sharding
Splitting data into smaller parts and storing them across multiple servers.
47
ELT (Extract-Load-Transform)
Similar to ETL, but transformation happens after loading into storage.
48
Workflow
The sequence of steps in a data process.
49
Orchestration
The automated coordination of tasks in a workflow (e.g., Airflow managing jobs).
50
DAG (Directed Acyclic Graph)
A diagram that shows tasks and their dependencies in a workflow without loops.
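A small DAG sketch using Python's standard-library graphlib (3.9+); the task names mimic a toy pipeline.

# DAG sketch: tasks and their dependencies, run in a valid order (Python 3.9+).
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "load": {"validate"},
    "report": {"load"},
}

# static_order() yields tasks so every dependency comes before its dependents.
for task in TopologicalSorter(dag).static_order():
    print(task)   # extract, transform, validate, load, report
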
51
Batch Processing
Handling data in groups at scheduled times.
52
Streaming Processing
Processing data in real time as it arrives.
53
Data Ingestion
Bringing data from multiple sources into a system.
54
Data Transformation
Cleaning, formatting, or enriching data before it’s stored or used.
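A transformation sketch with pandas (assumed to be installed); the file and column names are hypothetical.

# Data transformation sketch with pandas (hypothetical file and column names).
import pandas as pd

df = pd.read_csv("signups_raw.csv")

df["email"] = df["email"].str.strip().str.lower()          # normalize formatting
df = df.dropna(subset=["email"])                           # drop incomplete records
df["signup_date"] = pd.to_datetime(df["signup_date"])      # enforce a proper type
df["signup_month"] = df["signup_date"].dt.to_period("M")   # enrich with a derived field

df.to_csv("signups_clean.csv", index=False)                # hand off to the next stage
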
55
Job Scheduling
Automating when tasks run.
56
Data Sink
The destination system where processed data is stored.
57
Data Source
The origin of the data (e.g., an application database, an API, or log files).
58
AWS (Amazon Web Services)
A leading cloud platform offering storage, compute, and analytics services.
59
GCP (Google Cloud Platform)
Google’s cloud service, popular for BigQuery and machine learning tools.
60
Azure
Microsoft’s cloud service, integrated with enterprise systems.
61
S3 (Amazon Simple Storage Service)
AWS's object storage service, commonly used for data lakes.
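A boto3 sketch of putting an object into and getting it back from S3; the bucket, keys, and local files are hypothetical, and configured AWS credentials are assumed.

# S3 upload/download sketch with boto3 (bucket and key names are hypothetical;
# valid AWS credentials are assumed to be configured in the environment).
import boto3

s3 = boto3.client("s3")

# Objects are stored under a key inside a bucket rather than in a file hierarchy.
s3.upload_file("events.parquet", "my-data-lake-bucket", "raw/2024/05/events.parquet")
s3.download_file("my-data-lake-bucket", "raw/2024/05/events.parquet", "events_copy.parquet")
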
62
Redshift
AWS’s cloud data warehouse for analytics.
63
BigQuery
GCP’s fully managed serverless data warehouse.
64
Snowflake
A cloud-based data warehouse known for scalability and easy sharing.
65
Databricks
A data platform for big data analytics and AI, built on Apache Spark.
66
Data Lake
Central storage for raw data in original formats.
67
Data Warehouse
Structured storage for processed and query-ready data.
68
Serverless
Cloud model where infrastructure is managed automatically.
69
Scalability
The ability to handle growing amounts of work without performance loss.
70
Elasticity
Cloud feature where resources expand or shrink automatically based on demand.
71
Data Quality
How accurate, complete, reliable, and timely data is.
72
Data Governance
The framework of policies and standards that ensures data is managed properly across an organization.
73
Data Validation
The process of checking data for accuracy and completeness.
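A simple validation sketch in plain Python; the rules and field names are made up.

# Simple data validation sketch: flag records that fail basic checks (toy rules).
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate(record):
    """Return a list of problems found in one record (empty list means valid)."""
    problems = []
    if not record.get("id"):
        problems.append("missing id")
    if not EMAIL_RE.match(record.get("email", "")):
        problems.append("malformed email")
    if not (0 <= record.get("age", -1) <= 120):
        problems.append("age out of range")
    return problems

print(validate({"id": 7, "email": "ana@example.com", "age": 34}))  # []
print(validate({"email": "not-an-email", "age": 300}))             # three problems
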
74
Data Lineage
Tracking where data comes from, how it moves, and how it changes over time.
75
Data Integrity
The trustworthiness and consistency of data across its lifecycle.
76
Data Stewardship
Assigning responsibility to people for managing and protecting data.
77
Compliance
Following rules, laws, and standards (e.g., GDPR, HIPAA).
78
GDPR (General Data Protection Regulation)
European law on personal data protection and privacy.
79
PII (Personally Identifiable Information)
Any information that can identify an individual (e.g., name, email, SSN).
80
Anonymization
Modifying data to remove personally identifiable details.
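A simplified sketch of masking a direct identifier before sharing data; strictly speaking, salted hashing like this is usually called pseudonymization, and real anonymization also has to weigh re-identification risk.

# Simplified sketch: replace a direct identifier with a non-reversible token.
import hashlib

SALT = b"rotate-this-secret"   # hypothetical salt kept outside the shared dataset

def mask_email(email):
    # Replace the raw email with a stable, non-reversible token.
    return hashlib.sha256(SALT + email.lower().encode()).hexdigest()[:16]

record = {"email": "ana@example.com", "purchase": "laptop"}
shared = {"user_token": mask_email(record["email"]), "purchase": record["purchase"]}
print(shared)   # the direct identifier is gone, but purchases can still be analyzed
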
81
Data Catalog
A system that organizes and documents datasets across an organization.
82
Audit Trail
A record that shows who accessed or modified data, and when.
83
Stand-up Meeting
A short daily team meeting to discuss progress, plans, and blockers.
84
Sprint
A set period (usually 1–2 weeks) during which specific work is completed in agile teams.
85
Backlog
A prioritized list of work items that need to be done.
86
Deliverable
A completed product, feature, or report expected at the end of a sprint.
87
Blocker
An obstacle that prevents progress.
88
Dependency
A task that relies on another task being finished first.
89
Retrospective
A meeting at the end of a sprint to reflect on what went well and what can improve.
90
Collaboration
Working together to achieve a shared goal.
91
Cross-functional Team
A group with different expertise (e.g., engineers, analysts, PMs) working on the same project.
92
Communication Channel
A method of communication such as Slack, Teams, or email.