Data Engineering Fundamentals Flashcards

1
Q

What are the three main types of data?

A

Structured, Unstructured, and Semi-structured

2
Q

Give an example of structured data.

A

Database table, CSV file, Excel spreadsheet

3
Q

What is a key characteristic of unstructured data?

A

It doesn’t have a predefined schema or structure.

4
Q

Give an example of unstructured data.

A

Image, audio file, email body text

5
Q

What is semi-structured data?

A

Data with some organization, such as tags or hierarchies, but without the rigid schema of structured data (for example, JSON or XML).
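
To make the answer above concrete, here is a minimal sketch that parses two JSON records. Both are keyed and nested, but they do not share a fixed set of fields. The records and field names are invented for illustration.

```python
import json

# Two hypothetical records: both are valid JSON, but their fields differ.
raw_records = [
    '{"id": 1, "name": "Ada", "tags": ["admin"], "address": {"city": "London"}}',
    '{"id": 2, "name": "Lin", "phone": "555-0100"}',
]

records = [json.loads(line) for line in raw_records]

# Keys give the data some organization, but each record may carry a different set.
field_sets = [sorted(record.keys()) for record in records]
shared_fields = sorted(set(field_sets[0]) & set(field_sets[1]))

# Optional or nested fields have to be read defensively.
cities = [record.get("address", {}).get("city") for record in records]

print(field_sets)
print(shared_fields)
print(cities)
```

Only `id` and `name` appear in both records, so a reader cannot assume the other fields exist.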

6
Q

What are the three Vs of Big Data?

Properties of Data

A

Volume, Velocity, Variety

7
Q

What does “Volume” refer to in the context of data?

A

The amount or size of data.

8
Q

What does “Velocity” refer to in the context of data?

A

The speed at which data is generated and processed.

9
Q

What does “Variety” refer to in the context of data?

A

The different types and sources of data.

10
Q

What is a data warehouse optimized for?

Data Warehouses vs. Data Lakes

A

Complex queries and analysis of structured data.

11
Q

What is a key characteristic of a data lake?

A

It can store vast amounts of raw data in its native format.

12
Q

What is the schema approach for a data warehouse?

A

Schema-on-write (schema is defined before data is loaded)

13
Q

What is the schema approach for a data lake?

A

Schema-on-read (schema is defined when data is read)
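
The two schema approaches can be contrasted in a small sketch. It uses an in-memory SQLite table as a stand-in for a warehouse and a list of raw JSON lines as a stand-in for a lake. The table, field names, and the `read_with_schema` helper are assumptions made for the example.

```python
import json
import sqlite3

# Schema-on-write: the table definition exists before any row is loaded,
# and rows that violate it are rejected at write time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER NOT NULL, action TEXT NOT NULL)")
conn.execute("INSERT INTO events VALUES (?, ?)", (1, "login"))

try:
    conn.execute("INSERT INTO events VALUES (?, ?)", (2, None))  # violates NOT NULL
    rejected_on_write = False
except sqlite3.IntegrityError:
    rejected_on_write = True

# Schema-on-read: raw lines are stored as-is; structure is applied only when reading.
raw_lake_lines = [
    '{"user_id": 1, "action": "login"}',
    '{"user_id": "2", "action": "logout", "extra": "kept in raw storage"}',
]

def read_with_schema(line):
    """Apply a (hypothetical) schema at read time: cast and select the wanted fields."""
    record = json.loads(line)
    return {"user_id": int(record["user_id"]), "action": str(record["action"])}

parsed_events = [read_with_schema(line) for line in raw_lake_lines]
print(rejected_on_write, parsed_events)
```

The bad row never enters the table, while the raw lines keep their inconsistent types and extra fields until a reader decides how to interpret them.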

14
Q

What is a data lakehouse?

Data Lakehouse and Data Mesh

A

A hybrid architecture combining features of data lakes and data warehouses.

15
Q

What is a key concept of a data mesh?

A

Domain-based data management: each business domain owns its data and serves it to others as a product.

16
Q

What does ETL stand for?

ETL Pipelines

A

Extract, Transform, Load

17
Q

What happens in the “Transform” stage of an ETL pipeline?

A

Data is cleaned, converted, and prepared for the target system.
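
The three ETL stages can be sketched with the standard library alone. The source text, column names, and the `payments` table below are hypothetical; a real pipeline would read from files, APIs, or databases.

```python
import csv
import io
import sqlite3

# Hypothetical source data with messy whitespace, casing, and a missing amount.
SOURCE_CSV = """name,amount,currency
 alice ,10.50,usd
BOB,,usd
carol,7,USD
"""

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: clean, convert, and drop rows that cannot be loaded."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue  # skip rows with a missing amount
        cleaned.append((
            row["name"].strip().title(),      # normalize whitespace and casing
            float(row["amount"]),             # convert text to a number
            row["currency"].strip().upper(),  # normalize currency codes
        ))
    return cleaned

def load(rows, conn):
    """Load: write the prepared rows into the target table."""
    conn.execute("CREATE TABLE payments (name TEXT, amount REAL, currency TEXT)")
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute("SELECT name, amount, currency FROM payments ORDER BY name").fetchall()
print(loaded)
```

Keeping each stage in its own function makes the Transform step easy to test in isolation.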

18
Q

What is CSV commonly used for?

Data Formats

A

Storing data in a tabular format, often for spreadsheets or databases.

19
Q

What is JSON commonly used for?

A

Data interchange, especially in web applications.
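
As a rough comparison of CSV and JSON, the sketch below writes the same record in both formats. The `order` record and the choice of `;` to flatten the list are assumptions for the example.

```python
import csv
import io
import json

# A hypothetical record with a nested field.
order = {"order_id": 42, "customer": "Ada", "items": ["pen", "ink"]}

# CSV is flat and tabular: the nested list has to be flattened into one cell.
csv_buffer = io.StringIO()
writer = csv.writer(csv_buffer)
writer.writerow(["order_id", "customer", "items"])
writer.writerow([order["order_id"], order["customer"], ";".join(order["items"])])
csv_text = csv_buffer.getvalue()

# JSON keeps the nesting and basic types, which suits data interchange.
json_text = json.dumps(order)
round_tripped = json.loads(json_text)

# Reading the CSV back yields strings only; types must be restored by the reader.
csv_row = next(csv.DictReader(io.StringIO(csv_text)))
print(csv_row["order_id"], type(csv_row["order_id"]).__name__)
print(round_tripped == order)
```

The JSON round trip returns the original structure, while the CSV reader hands back text that still needs parsing.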

20
Q

What is a key advantage of Avro?

A

It stores data and its schema together.

21
Q

What is Parquet optimized for?

A

Analytics and efficient querying of large datasets, thanks to its columnar storage layout.

22
Q

What is data lineage?

Data Modeling and Lineage

A

Tracking the flow and transformation of data.

23
Q

What is a star schema?

A

A data modeling technique often used in data warehouses, in which a central fact table links to surrounding dimension tables.
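
A minimal star schema can be sketched in SQLite through Python. The table names (`fact_sales`, `dim_product`, `dim_date`) and the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables describe the context of each fact.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);

    -- The fact table holds measures plus foreign keys to the dimensions.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        amount REAL
    );

    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO dim_date VALUES (10, 2023), (11, 2024);
    INSERT INTO fact_sales VALUES (1, 10, 20.0), (1, 11, 30.0), (2, 11, 50.0);
""")

# A typical star-schema query: join facts to dimensions, then aggregate.
sales_by_category_year = conn.execute("""
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
""").fetchall()
print(sales_by_category_year)
```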

24
Q

What is the purpose of indexing in a database?

Database Performance Optimization

A

To speed up data retrieval.
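
One way to see the effect of an index is to compare SQLite's query plan before and after creating one. The `orders` table and index name are hypothetical, and the exact plan wording varies between SQLite versions, so the sketch only prints it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

QUERY = "SELECT * FROM orders WHERE customer_id = 7"

def query_plan(connection, sql):
    """Return SQLite's plan description for a query as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text

plan_before = query_plan(conn, QUERY)  # no index yet: the whole table is scanned
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, QUERY)   # the plan should now mention the index

print(plan_before)
print(plan_after)
```

The index trades extra storage and slower writes for faster lookups on `customer_id`.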

25
What is database partitioning?
Dividing a large table or database into smaller, more manageable parts (partitions), so that queries and maintenance can target only the relevant ones.
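
The idea can be sketched in plain Python by range-partitioning rows on their month. The `orders` rows and the helper names are assumptions for the example; a real database handles partition routing internally.

```python
from collections import defaultdict

# Hypothetical rows: (order_id, order_date as ISO text, amount).
orders = [
    (1, "2024-01-15", 20.0),
    (2, "2024-01-30", 35.0),
    (3, "2024-02-02", 15.0),
    (4, "2024-03-11", 50.0),
]

def partition_by_month(rows):
    """Range-partition rows by the year-month prefix of their date."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[1][:7]].append(row)
    return dict(partitions)

partitions = partition_by_month(orders)

def february_total(parts):
    """Partition pruning: only the 2024-02 partition is read for this query."""
    return sum(amount for _, _, amount in parts.get("2024-02", []))

print(sorted(partitions))
print(february_total(partitions))
```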
26
What is the goal of stratified sampling? ## Footnote Data Sampling
To ensure representation of different subgroups in a sample.
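
A minimal sketch of stratified sampling, assuming proportional allocation (the same fraction is drawn from every subgroup). The `stratified_sample` helper and the region data are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every subgroup (stratum), at least one row each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group_rows in strata.values():
        k = max(1, round(len(group_rows) * fraction))
        sample.extend(rng.sample(group_rows, k))
    return sample

# Hypothetical population: 90 rows from region "north", 10 from region "south".
population = [("north", i) for i in range(90)] + [("south", i) for i in range(10)]
sample = stratified_sample(population, key=lambda row: row[0], fraction=0.1)

counts = defaultdict(int)
for region, _ in sample:
    counts[region] += 1
print(dict(counts))
```

A simple random sample of 10 rows could miss "south" entirely; sampling per stratum guarantees it is represented.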
27
What is data skew? ## Footnote Data Skew
An imbalance of data across partitions in a distributed system.
28
Give an example of a technique to address data skew.
Adaptive partitioning, salting, repartitioning.
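
Salting can be sketched as follows. This is a simplified illustration: the partitioner combines a CRC32 hash of the key with a round-robin salt so the result is repeatable, whereas real systems often append a random salt to the key. The workload and constants are invented.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 4
NUM_SALTS = 4

def partition_for(key, salt=0):
    """Stable partitioner for the (key, salt) pair; CRC32 keeps runs repeatable."""
    return (zlib.crc32(key.encode("utf-8")) + salt) % NUM_PARTITIONS

# Hypothetical skewed workload: one "hot" key dominates.
keys = ["hot_customer"] * 1000 + ["c1", "c2", "c3", "c4"] * 10

# Without salting, every "hot_customer" row lands in the same partition.
unsalted_load = Counter(partition_for(k) for k in keys)

# With salting, the hot key's rows are spread over several partitions.
salted_load = Counter(partition_for(k, salt=i % NUM_SALTS) for i, k in enumerate(keys))

print(max(unsalted_load.values()), max(salted_load.values()))
```

The cost of salting is a second aggregation step: partial results per (key, salt) pair have to be merged back per key.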
29
What is data completeness? ## Footnote Data Validation
Ensuring all required data is present.
30
What is data consistency?
Ensuring data values are consistent across different datasets.
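
Completeness and consistency checks can be sketched as two small functions. The datasets, field names, and helper names below are hypothetical.

```python
REQUIRED_FIELDS = ("customer_id", "email")

# Two hypothetical datasets that should agree on each customer's email.
crm_rows = [
    {"customer_id": 1, "email": "ada@example.com"},
    {"customer_id": 2, "email": ""},                 # incomplete: empty email
    {"customer_id": 3, "email": "lin@example.com"},
]
billing_rows = [
    {"customer_id": 1, "email": "ada@example.com"},
    {"customer_id": 3, "email": "lin@example.org"},  # inconsistent with the CRM
]

def incomplete_rows(rows, required=REQUIRED_FIELDS):
    """Completeness: return rows where a required field is missing or empty."""
    return [row for row in rows if any(row.get(f) in (None, "") for f in required)]

def inconsistent_ids(left, right, field="email"):
    """Consistency: return ids whose field value differs between the two datasets."""
    right_by_id = {row["customer_id"]: row for row in right}
    return [
        row["customer_id"]
        for row in left
        if row["customer_id"] in right_by_id
        and row[field] != right_by_id[row["customer_id"]][field]
    ]

print(incomplete_rows(crm_rows))
print(inconsistent_ids(crm_rows, billing_rows))
```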
31
What does the COUNT() function do in SQL? ## Footnote SQL
Counts rows: COUNT(*) counts all rows, while COUNT(column) counts only rows where that column is not NULL.
32
What does the GROUP BY clause do in SQL?
Groups rows based on the values in one or more columns.
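
COUNT() and GROUP BY can be tried together in an in-memory SQLite database. The `orders` table and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, coupon TEXT);
    INSERT INTO orders VALUES
        ('ada', 'SPRING'), ('ada', NULL), ('lin', NULL), ('lin', NULL), ('lin', 'FALL');
""")

# COUNT(*) counts rows; COUNT(coupon) counts only rows where coupon is not NULL.
total_rows, rows_with_coupon = conn.execute(
    "SELECT COUNT(*), COUNT(coupon) FROM orders"
).fetchone()

# GROUP BY forms one group per customer; COUNT(*) then runs once per group.
orders_per_customer = conn.execute(
    "SELECT customer, COUNT(*) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

print(total_rows, rows_with_coupon)
print(orders_per_customer)
```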
33
What is the purpose of pivoting in SQL?
To turn row values into columns, so that the distinct values of one column become separate columns in the result.
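
Some databases offer a dedicated PIVOT operator; a portable alternative is conditional aggregation, sketched here in SQLite. The `sales` table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, quarter TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('north', 'Q1', 10), ('north', 'Q2', 20),
        ('south', 'Q1', 5),  ('south', 'Q2', 15), ('south', 'Q2', 1);
""")

# Each CASE expression keeps only one quarter's amounts, so every quarter
# value in the rows becomes its own column in the result.
pivoted = conn.execute("""
    SELECT region,
           SUM(CASE WHEN quarter = 'Q1' THEN amount ELSE 0 END) AS q1,
           SUM(CASE WHEN quarter = 'Q2' THEN amount ELSE 0 END) AS q2
    FROM sales
    GROUP BY region
    ORDER BY region
""").fetchall()
print(pivoted)
```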
34
What does git init do? ## Footnote Git
Initializes a new Git repository.
35
What does git add do?
Adds changes to the staging area.
36
What does git commit do?
Records the staged changes in the repository as a new commit.
37
What does git branch do?
Lists, creates, or deletes branches.
38
What does git checkout do?
Switches to a different branch or commit; it can also restore files in the working tree.
39
What does git merge do?
Combines changes from different branches.
40
What does git push do?
Sends local commits to a remote repository.
41
What does git pull do?
Fetches changes from a remote repository and integrates them into the current branch (by merge or rebase).
42
What does git stash do?
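Temporarily shelves uncommitted changes so the working tree is clean; they can be reapplied later with git stash pop or git stash apply.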
43
What does git rebase do?
Reapplies commits on top of a different base commit, such as the tip of another branch.