Data Engineering Basics Flashcards

1
Q

What are the three types of data?

A

Structured, unstructured, and semi-structured.

2
Q

What is the definition of structured data?

A

Data that is organized according to a defined schema. It is typically found in relational databases and has a consistent structure of rows and columns.

3
Q

What is the definition of unstructured data?

A

Data that does not have a predefined structure. Examples include videos, audio files, images, emails, and word processing documents.

4
Q

What is the definition of semi-structured data?

A

Data that has some structure in the form of tags, hierarchies, or other patterns, but no rigid schema. XML and JSON are good examples.
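
As a small illustration (the records below are made up), JSON carries its own field names and nesting, yet two records in the same file need not share the same fields. A minimal sketch:

```python
import json

# Hypothetical records: both are valid JSON, but they do not share a fixed set of columns.
raw = '''
[
  {"id": 1, "name": "Ada", "tags": ["admin", "dev"]},
  {"id": 2, "name": "Lin", "address": {"city": "Oslo"}}
]
'''

records = json.loads(raw)

# The keys describe the data, but each record may carry different ones.
field_sets = [sorted(record.keys()) for record in records]
print(field_sets)

# Nested values are reached by key rather than by column position.
city = records[1]["address"]["city"]
print(city)
```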

5
Q

What is the definition of volume in data engineering terms?

A

It refers to the amount or size of the data, measured in units such as MB, GB, TB, or PB.

6
Q

What is the definition of velocity in data engineering terms?

A

It refers to the speed at which new data is generated, collected, and processed.

7
Q

What is the definition of variety in data engineering terms?

A

It refers to the different types, structures, and sources of data, for example structured, semi-structured, and unstructured data.

8
Q

What is the definition of a data warehouse?

A

It is a centralized repository optimized for analysis where data from different sources is stored in a structured format.

9
Q

What are some characteristics of a data warehouse?

A

Designed for complex queries
Typically loaded via an ETL process
Optimized for read-heavy operations

10
Q

What is the definition of a data lake?

A

A storage repository that holds vast amounts of raw data in its native format, including structured, semi-structured, and unstructured data. Examples include Amazon S3 and HDFS.

11
Q

What are some characteristics of a data lake?

A

No predefined schema

Data is loaded as-is, not preprocessed

Supports batch, real-time, and streaming processing

Can be queried for data transformation or exploration

12
Q

What is the difference between ETL and ELT?

A

ETL is traditionally used with data warehouses. You extract the data, transform it, and then load it.

ELT is typically used with data lakes. You extract the data, load it as-is, and then transform it as needed.
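
The difference in ordering can be sketched in a few lines. This is only an illustration: the rows, the `transform` function, and the in-memory lists standing in for a warehouse and a lake are all hypothetical.

```python
# Hypothetical raw rows from a source system (amounts arrive as strings).
source_rows = [{"order_id": 1, "amount": "19.99"}, {"order_id": 2, "amount": "5.00"}]


def transform(rows):
    """Cast amounts to float so they are ready for analysis."""
    return [{"order_id": r["order_id"], "amount": float(r["amount"])} for r in rows]


def etl(rows):
    """ETL: transform first, then load only the cleaned rows into the target."""
    warehouse = []
    warehouse.extend(transform(rows))
    return warehouse


def elt(rows):
    """ELT: load the raw rows as-is, then transform them after loading."""
    lake = list(rows)           # the raw copy is kept in the target
    curated = transform(lake)   # transformation happens later, on demand
    return lake, curated


warehouse = etl(source_rows)
lake, curated = elt(source_rows)
```

Both paths end with the same cleaned rows; the ELT path additionally keeps the untouched raw data available for other uses.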

13
Q

What is the downside of a data warehouse?

A

It is less agile. Because the schema is defined up front, new requirements may force schema changes and changes to the data already loaded.

14
Q

What is traditionally more cost-effective, a data lake or data warehouse?

A

A data lake, although its storage costs can grow with data volume and could eventually exceed data warehouse costs.

15
Q

What is a data lakehouse?

A

A hybrid of a data warehouse and a data lake. It keeps the flexible, low-cost storage of a lake while adding warehouse features such as ACID transactions. AWS Lake Formation is often given as an example.

16
Q

What is the difference between ODBC and JDBC?

A

JDBC is platform-independent but requires you to use Java. ODBC is language-independent, but its drivers are platform-dependent.

17
Q

What is Avro?

A

It is a binary format that stores both the data and its schema together. It is a good fit for big data and real-time processing systems.

18
Q

What is Parquet?

A

A columnar storage format optimized for analytics. It is well suited to large datasets that are queried by an analytics engine.
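
The sketch below is not Parquet itself; it only illustrates the row-versus-column layout idea with plain Python lists. The sample rows are made up.

```python
# Hypothetical row-oriented records, as a CSV file or transactional table would store them.
rows = [
    {"order_id": 1, "region": "EU", "amount": 20.0},
    {"order_id": 2, "region": "US", "amount": 35.5},
    {"order_id": 3, "region": "EU", "amount": 12.5},
]

# Column-oriented layout: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over one column touches only that column's values.
total_amount = sum(columns["amount"])
print(total_amount)
```

Analytical queries often read a few columns across many rows, which is why storing each column together tends to help.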

19
Q

What is data lineage?

A

A visual representation that traces the flow and transformation of data through its lifecycle, from source to final destination. It helps trace errors back to their source and may be required for compliance.

20
Q

What is schema evolution?

A

The ability to adapt and change the schema of a dataset over time without disrupting existing processes or systems, while maintaining backward compatibility. An example of a tool that supports this is the AWS Glue Schema Registry.
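
One common way to stay backward compatible is to add new fields with default values. The sketch below assumes a toy schema representation (a dict of field names to defaults) and a hypothetical `read_with_schema` helper; it is not how any particular registry works.

```python
# Hypothetical schema versions: v2 adds an optional field with a default value.
SCHEMA_V1 = {"id": None, "name": None}
SCHEMA_V2 = {"id": None, "name": None, "email": "unknown"}


def read_with_schema(record, schema):
    """Project a record onto a schema, filling missing fields with defaults."""
    return {field: record.get(field, default) for field, default in schema.items()}


old_record = {"id": 1, "name": "Ada"}                              # written with v1
new_record = {"id": 2, "name": "Lin", "email": "lin@example.com"}  # written with v2

# Backward compatibility: a v2 reader can still read v1 data.
upgraded = read_with_schema(old_record, SCHEMA_V2)

# Forward compatibility: a v1 reader ignores the field it does not know about.
downgraded = read_with_schema(new_record, SCHEMA_V1)
```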

21
Q

What is stratified sampling?

A

You divide your population into subgroups (strata) and randomly sample each one to ensure representation of all subgroups.
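
A minimal sketch of the idea, assuming a made-up population of orders and a hypothetical `stratified_sample` helper that draws the same fraction from each stratum:

```python
import random
from collections import defaultdict


def stratified_sample(items, key, fraction, seed=0):
    """Randomly sample the same fraction from each stratum (at least one item each)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical population: 80 "EU" orders and 20 "US" orders.
orders = [{"id": i, "region": "EU" if i < 80 else "US"} for i in range(100)]
sample = stratified_sample(orders, key=lambda o: o["region"], fraction=0.1)
```

With a 10% fraction, the sample keeps the 80/20 split of the population, so the smaller region is not left out by chance.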

22
Q

What is systematic sampling?

A

You select items from the population at a regular interval, usually starting from a random point. An example is picking every 4th order.
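
A short sketch of this, using a hypothetical `systematic_sample` helper and a made-up list of order IDs:

```python
import random


def systematic_sample(items, step, seed=0):
    """Pick every `step`-th item, starting from a random offset within the first interval."""
    start = random.Random(seed).randrange(step)
    return items[start::step]


# Hypothetical list of 20 order IDs; step=4 mirrors "every 4th order".
order_ids = list(range(1, 21))
sample = systematic_sample(order_ids, step=4)
```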

23
Q

What is data skew?

A

It refers to the unequal distribution or imbalance of data across various nodes or partitions.
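
To make this concrete, the sketch below partitions made-up events by key and compares the largest partition with the average. The toy `partition_for` function is an assumption for illustration, not the partitioner of any real engine.

```python
from collections import Counter

# Hypothetical events keyed by customer; one "hot" customer dominates the data.
events = ["cust_a"] * 90 + ["cust_b"] * 5 + ["cust_c"] * 5
NUM_PARTITIONS = 3


def partition_for(key):
    """Toy deterministic partitioner: sum of character codes modulo partition count."""
    return sum(ord(ch) for ch in key) % NUM_PARTITIONS


counts = Counter(partition_for(key) for key in events)
largest = max(counts.values())
average = len(events) / NUM_PARTITIONS

# A ratio well above 1 means one partition holds far more than its fair share.
skew_ratio = largest / average
```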

24
Q

What is data completeness?

A

Ensures all required data is present and essential parts are not missing.
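
A completeness check can be as simple as counting records with missing required fields. The field names and records below are hypothetical.

```python
REQUIRED_FIELDS = ("order_id", "customer_id", "amount")  # hypothetical required columns

# Hypothetical records: the second has a null value, the third lacks a field entirely.
records = [
    {"order_id": 1, "customer_id": "c1", "amount": 10.0},
    {"order_id": 2, "customer_id": None, "amount": 7.5},
    {"order_id": 3, "amount": 3.0},
]


def missing_fields(record):
    """Return the required fields that are absent or null in a record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) is None]


incomplete = {r["order_id"]: missing_fields(r) for r in records if missing_fields(r)}

# Share of records that have every required field populated.
completeness = 1 - len(incomplete) / len(records)
```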

25
Q

What is data consistency?

A

Ensures data values are consistent across datasets and do not contradict each other.
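
For example, the same customer's email stored in two systems should match. A minimal sketch with made-up data:

```python
# Hypothetical copies of the same customers held in two systems.
crm = {"c1": "ada@example.com", "c2": "lin@example.com"}
billing = {"c1": "ada@example.com", "c2": "lin@old-example.com"}

# Keys present in both datasets whose values contradict each other.
conflicts = {
    key: (crm[key], billing[key])
    for key in crm.keys() & billing.keys()
    if crm[key] != billing[key]
}
```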

26
Q

What is data accuracy?

A

Ensures data is correct, reliable, and reflects the real-world values it represents.

27
Q

What is data integrity?

A

Ensures data maintains its correctness and consistency over its lifecycle and across systems.

28
Q

What git command creates a repository?

A

git init

29
Q

What git command lists all branches?

A

git branch

30
Q

What git command switches to a specific branch?

A

git checkout <branch-name> (or git switch <branch-name> in Git 2.23 and later)

31
Q

What git command merges branches?

A

git merge <branch-name> (merges the named branch into the current branch)
