Data Engineer Flashcards

(92 cards)

1
Data Engineer
A professional responsible for designing and maintaining data systems and pipelines.
2
Pipeline
A series of automated processes that move and transform data from one system to another.
3
ETL (Extract-Transform-Load)
A data integration process that extracts data from source systems, transforms it, and loads it into target storage.
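
For illustration, a minimal ETL sketch in Python using only the standard library; the file, database, and table names are hypothetical.

# Minimal ETL sketch (hypothetical file and table names).
import csv
import sqlite3

# Extract: read raw rows from a source CSV file.
with open("orders_raw.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: clean and reshape the records.
cleaned = [
    {"order_id": r["order_id"], "amount": round(float(r["amount"]), 2)}
    for r in rows
    if r.get("amount")  # drop rows with a missing amount
]

# Load: write the transformed rows into a target table.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (order_id, amount) VALUES (:order_id, :amount)", cleaned
)
conn.commit()
conn.close()
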
4
Structured Data
Data organized into a predefined format of rows and columns (e.g., tables in a SQL database).
5
Unstructured Data
Data that does not have a predefined format (e.g., images, videos, text documents).
6
Big Data
Large volumes of data that require advanced processing and analysis tools.
7
Scalable
Able to handle increased data or workload without performance issues.
8
Workflow
A sequence of tasks or processes to achieve a goal.
9
Collaboration
Working together with team members to achieve a shared goal.
10
Proficiency
A high level of skill in a particular area (e.g., “proficient in SQL”).
11
Background
Professional history, experience, and education.
12
Database
A structured collection of data stored electronically.
13
Data Warehouse
A central repository of integrated data used for reporting and analysis.
14
Data Lake
A large storage repository that holds raw, unstructured, and structured data.
15
Schema
The structure or blueprint of a database (tables, columns, data types).
16
Ingestion
The process of bringing data from various sources into a system.
17
Transformation
The process of cleaning, modifying, or enriching data for analysis.
18
Batch Processing
Data processing that happens in groups or batches at scheduled intervals.
19
Streaming Processing
Real-time data processing as the data arrives.
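
A toy Python contrast between the two processing modes above; this is only an in-memory illustration, not a real streaming framework, and the event names are made up.

# Toy contrast between batch and streaming processing (in-memory example).
def batch_process(records):
    # Batch: wait until the whole group is available, then process it at once.
    return [r.upper() for r in records]

def stream_process(record_source):
    # Streaming: handle each record as soon as it arrives.
    for record in record_source:
        yield record.upper()

events = ["click", "view", "purchase"]
print(batch_process(events))             # processed as one batch
for result in stream_process(iter(events)):
    print(result)                        # processed one event at a time
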
20
Data Modeling
Designing the structure and relationships of data in a database.
21
Normalization
Organizing data to reduce redundancy and improve efficiency.
22
Indexing
Creating data structures that improve the speed of data retrieval.
23
Metadata
Data that describes other data (e.g., file name, size, creation date).
24
SQL
Structured Query Language used to manage and query relational databases.
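
For illustration, a basic SQL query run through Python's built-in sqlite3 module; the users table and its rows are made up.

# A basic SQL query executed through Python's built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO users (name, country) VALUES (?, ?)",
    [("Ana", "PT"), ("Bo", "SE"), ("Cruz", "PT")],
)

# SELECT retrieves rows; WHERE filters them; ORDER BY sorts the result.
for row in conn.execute(
    "SELECT name FROM users WHERE country = ? ORDER BY name", ("PT",)
):
    print(row[0])   # Ana, Cruz
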
25
Python
A general-purpose programming language commonly used in data engineering for scripting and automation.
26
Apache Spark
A distributed computing system used for big data processing and analytics.
27
Apache Hadoop
An open-source framework that allows for the distributed storage and processing of large data sets.
28
Airflow
A platform to programmatically author, schedule, and monitor workflows (used to manage ETL pipelines).
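
A minimal Airflow DAG sketch, assuming a recent Airflow 2.x install; the dag_id, task ids, and callables are made-up examples, not a definitive pipeline.

# Minimal Airflow 2.x DAG sketch (dag_id, task ids, and callables are hypothetical).
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source")

def load():
    print("write data to the warehouse")

with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",           # run the pipeline once per day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task    # load runs only after extract succeeds
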
29
Docker
A tool that packages applications and their dependencies into containers for consistent deployment.
30
Git
A version control system for tracking changes in source code.
31
Jupyter Notebook
A web-based tool used to write and run code, often used for data analysis and visualization.
32
Shell Script
A script written to be executed by a Unix shell, often used for automation.
33
Cloud Platforms (AWS, GCP, Azure)
Services offering scalable infrastructure and tools for managing and processing data in the cloud.
34
Relational Database (RDBMS)
A database that organizes data into tables with rows and columns, using relationships between them.
35
Primary Key
A unique identifier for each record in a table.
36
Foreign Key
A field in one table that links to the primary key in another table.
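A sketch showing a primary key and a foreign key in SQLite DDL, driven from Python; the customers/orders tables are hypothetical.

# Primary key / foreign key sketch using SQLite DDL (hypothetical tables).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,   -- unique identifier for each customer
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),  -- foreign key
        amount      REAL
    );
""")
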
37
Index
A data structure that speeds up data retrieval from a database.
38
Query
A request to retrieve or manipulate data from a database.
39
Query Optimization
The process of improving a query’s performance.
40
Join
Combining rows from two or more tables based on a related column.
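A join sketch on toy data, again through sqlite3: each order row is matched to the customer row that shares its customer_id.

# INNER JOIN sketch: combine rows from two tables on a shared key (toy data).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Bo');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# Each order row is matched with the customer row that shares customer_id.
query = """
    SELECT c.name, SUM(o.amount) AS total_spent
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.name
"""
for name, total in conn.execute(query):
    print(name, total)   # Ana 65.0, Bo 15.0
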
41
Normalization
Structuring a database to reduce redundancy and improve efficiency.
42
Transaction
A sequence of database operations treated as a single logical unit.
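A transaction sketch with sqlite3 on a toy accounts table: both updates are applied together or not at all.

# Transaction sketch: both updates succeed together or neither is applied.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100.0), ("bob", 50.0)])
conn.commit()

try:
    with conn:   # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
except sqlite3.Error:
    print("transfer rolled back, balances unchanged")
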
43
Stored Procedure
A saved collection of SQL statements that can be executed as a single unit.
44
ACID Properties
The principles of transactions: Atomicity, Consistency, Isolation, Durability.
45
NoSQL Database
A non-relational database designed for flexible, scalable storage.
46
Sharding
Splitting data into smaller parts and storing them across multiple servers.
47
ELT (Extract-Load-Transform)
Similar to ETL, but transformation happens after loading into storage.
48
Workflow
The sequence of steps in a data process.
49
Orchestration
The automated coordination of tasks in a workflow (e.g., Airflow managing jobs).
50
DAG (Directed Acyclic Graph)
A diagram that shows tasks and their dependencies in a workflow without loops.
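A small DAG sketch using Python's standard-library graphlib (3.9+); the task names mimic a toy pipeline.

# DAG sketch: tasks and their dependencies, run in a valid order (Python 3.9+).
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "load": {"validate"},
    "report": {"load"},
}

# static_order() yields tasks so every dependency comes before its dependents.
for task in TopologicalSorter(dag).static_order():
    print(task)   # extract, transform, validate, load, report
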
51
Batch Processing
Handling data in groups at scheduled times.
52
Streaming Processing
Processing data in real time as it arrives.
53
Data Ingestion
Bringing data from multiple sources into a system.
54
Data Transformation
Cleaning, formatting, or enriching data before it’s stored or used.
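A transformation sketch with pandas (assumed to be installed); the file and column names are hypothetical.

# Data transformation sketch with pandas (hypothetical file and column names).
import pandas as pd

df = pd.read_csv("signups_raw.csv")

df["email"] = df["email"].str.strip().str.lower()          # normalize formatting
df = df.dropna(subset=["email"])                           # drop incomplete records
df["signup_date"] = pd.to_datetime(df["signup_date"])      # enforce a proper type
df["signup_month"] = df["signup_date"].dt.to_period("M")   # enrich with a derived field

df.to_csv("signups_clean.csv", index=False)                # hand off to the next stage
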
55
Job Scheduling
Automating when tasks run.
56
Data Sink
The destination system where processed data is stored.
57
Data Source
The origin of the data (e.g., an application database, an API, or log files).
58
AWS (Amazon Web Services)
A leading cloud platform offering storage, compute, and analytics services.
59
GCP (Google Cloud Platform)
Google’s cloud service, popular for BigQuery and machine learning tools.
60
Azure
Microsoft’s cloud service, integrated with enterprise systems.
61
S3 (Amazon Simple Storage Service)
AWS's object storage service, commonly used for data lakes.
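A boto3 sketch of putting an object into and getting it back from S3; the bucket, keys, and local files are hypothetical, and configured AWS credentials are assumed.

# S3 upload/download sketch with boto3 (bucket and key names are hypothetical;
# valid AWS credentials are assumed to be configured in the environment).
import boto3

s3 = boto3.client("s3")

# Objects are stored under a key inside a bucket rather than in a file hierarchy.
s3.upload_file("events.parquet", "my-data-lake-bucket", "raw/2024/05/events.parquet")
s3.download_file("my-data-lake-bucket", "raw/2024/05/events.parquet", "events_copy.parquet")
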
62
Redshift
AWS’s cloud data warehouse for analytics.
63
BigQuery
GCP’s fully managed serverless data warehouse.
64
Snowflake
A cloud-based data warehouse known for scalability and easy sharing.
65
Databricks
A data platform for big data analytics and AI, built on Apache Spark.
66
Data Lake
Central storage for raw data in original formats.
67
Data Warehouse
Structured storage for processed and query-ready data.
68
Serverless
Cloud model where infrastructure is managed automatically.
69
Scalability
The ability to handle growing amounts of work without performance loss.
70
Elasticity
Cloud feature where resources expand or shrink automatically based on demand.
71
Data Quality
How accurate, complete, reliable, and timely data is.
72
Data Governance
The framework of policies and standards that ensures data is managed properly across an organization.
73
Data Validation
The process of checking data for accuracy and completeness.
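A simple validation sketch in plain Python; the rules and field names are made up.

# Simple data validation sketch: flag records that fail basic checks (toy rules).
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate(record):
    """Return a list of problems found in one record (empty list means valid)."""
    problems = []
    if not record.get("id"):
        problems.append("missing id")
    if not EMAIL_RE.match(record.get("email", "")):
        problems.append("malformed email")
    if not (0 <= record.get("age", -1) <= 120):
        problems.append("age out of range")
    return problems

print(validate({"id": 7, "email": "ana@example.com", "age": 34}))  # []
print(validate({"email": "not-an-email", "age": 300}))             # three problems
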
74
Data Lineage
Tracking where data comes from, how it moves, and how it changes over time.
75
Data Integrity
The trustworthiness and consistency of data across its lifecycle.
76
Data Stewardship
Assigning responsibility to people for managing and protecting data.
77
Compliance
Following rules, laws, and standards (e.g., GDPR, HIPAA).
78
GDPR (General Data Protection Regulation)
European law on personal data protection and privacy.
79
PII (Personally Identifiable Information)
Any information that can identify an individual (e.g., name, email, SSN).
80
Anonymization
Modifying data to remove personally identifiable details.
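A simplified sketch of masking a direct identifier before sharing data; strictly speaking, salted hashing like this is usually called pseudonymization, and real anonymization also has to weigh re-identification risk.

# Simplified sketch: replace a direct identifier with a non-reversible token.
import hashlib

SALT = b"rotate-this-secret"   # hypothetical salt kept outside the shared dataset

def mask_email(email):
    # Replace the raw email with a stable, non-reversible token.
    return hashlib.sha256(SALT + email.lower().encode()).hexdigest()[:16]

record = {"email": "ana@example.com", "purchase": "laptop"}
shared = {"user_token": mask_email(record["email"]), "purchase": record["purchase"]}
print(shared)   # the direct identifier is gone, but purchases can still be analyzed
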
81
Data Catalog
A system that organizes and documents datasets across an organization.
82
Audit Trail
A record that shows who accessed or modified data, and when.
83
Stand-up Meeting
A short daily team meeting to discuss progress, plans, and blockers.
84
Sprint
A set period (usually 1–2 weeks) during which specific work is completed in agile teams.
85
Backlog
A prioritized list of work items that need to be done.
86
Deliverable
A completed product, feature, or report expected at the end of a sprint.
87
Blocker
An obstacle that prevents progress.
88
Dependency
A task that relies on another task being finished first.
89
Retrospective
A meeting at the end of a sprint to reflect on what went well and what can improve.
90
Collaboration
Working together to achieve a shared goal.
91
Cross-functional Team
A group with different expertise (e.g., engineers, analysts, PMs) working on the same project.
92
Communication Channel
A method of communication such as Slack, Teams, or email.