Data Engineering Part 1 Flashcards

Question 1

Q

What is structured data?

Answer

A

Data that conforms to a fixed schema, such as rows and columns in a relational table.

Question 2

Q

What is unstructured data?

Answer

A

Data without a predefined data model or schema, such as free-form text, images, or audio.

Question 3

Q

What is semi-structured data?

Answer

A

Data that has no rigid schema but carries self-describing structure such as tags or keys, as in JSON or XML.

Question 4

Q

What is a row-oriented storage format?

Answer

A

Stores all fields of a record together, row by row, which suits transactional systems that read and write whole records.

Question 5

Q

What is a column-oriented storage format?

Answer

A

Stores the values of each column together, which suits analytical queries that scan a few columns and allows better compression.
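
The difference between the two layouts can be sketched in memory. This is only an illustration with made-up records, not how any particular database lays out bytes on disk:

```python
# Three records with the same fields (illustrative data).
records = [
    {"id": 1, "region": "EU", "amount": 10.0},
    {"id": 2, "region": "US", "amount": 25.5},
    {"id": 3, "region": "EU", "amount": 4.5},
]

# Row-oriented: the fields of each record are kept together.
row_layout = [tuple(r.values()) for r in records]

# Column-oriented: the values of each column are kept together.
column_layout = {key: [r[key] for r in records] for key in records[0]}

# An aggregate over one column only needs that column's values.
total_amount = sum(column_layout["amount"])
```

Fetching one whole record is a single lookup in `row_layout`, while the sum reads only the `amount` list in `column_layout`.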

Question 6

Q

What is a CSV file?

Answer

A

Comma-Separated Values: a flat, plain-text format for tabular data.
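
A minimal sketch of reading CSV with Python's standard `csv` module. The sample text and column names are made up for illustration:

```python
import csv
import io

# A small CSV document held in memory (illustrative data).
text = "id,region,amount\n1,EU,10.0\n2,US,25.5\n"

# DictReader uses the first line as the header row.
rows = list(csv.DictReader(io.StringIO(text)))

# CSV carries no type information, so every value arrives as a string
# and has to be converted explicitly.
amounts = [float(row["amount"]) for row in rows]
```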

Question 7

Q

What is a JSON file?

Answer

A

JavaScript Object Notation: a lightweight text format that represents data as nested objects (key-value pairs), arrays, and scalar values.

Question 8

Q

What is a Parquet file?

Answer

A

A columnar storage format that supports efficient compression and encoding.

Question 9

Q

What is the advantage of Parquet over CSV?

Answer

A

Parquet stores data by column, compresses it, and embeds a typed schema, so analytical queries typically read fewer bytes than they would from CSV.

Question 10

Q

When is JSON preferred?

Answer

A

For hierarchical or nested data structures such as logs or API responses.
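
As a rough illustration, a nested API response can be parsed and navigated with the standard `json` module. The payload below is hypothetical:

```python
import json

# A hypothetical API response with an object nested inside an object
# and an array nested inside that.
response_text = '{"user": {"id": 7, "tags": ["admin", "beta"]}, "ok": true}'

response = json.loads(response_text)

# Nested values are reached by chaining keys and indexes.
first_tag = response["user"]["tags"][0]
```

Flattening the same data into a table would need either repeated rows or a separate table for `tags`.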

Question 11

Q

What is data compression?

Answer

A

Reducing file size by encoding data more efficiently.

Question 12

Q

What are common compression formats for data files?

Answer

A

gzip, Snappy, and bzip2.

Question 13

Q

Why is compression useful in data engineering?

Answer

A

It reduces storage costs and the volume of data moved through disk and network I/O, at the cost of some CPU time.
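
A small sketch of lossless compression using the standard `gzip` module. The repeated line is made-up sample data, chosen because repetitive data compresses well:

```python
import gzip

# 1000 copies of the same CSV line, encoded to bytes.
raw = ("2024-01-01,EU,10.0\n" * 1000).encode("utf-8")

compressed = gzip.compress(raw)
restored = gzip.decompress(compressed)

# Decompression returns the exact original bytes.
sizes = (len(raw), len(compressed))
```

How much smaller the output is depends heavily on the data; highly repetitive input like this shrinks far more than typical files.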

Question 14

Q

What is serialization?

Answer

A

Converting an object into a byte stream for storage or transmission.
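
One possible sketch of a serialization round trip, using JSON as the wire format. The `Event` class and both helper functions are hypothetical names introduced for this example:

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Event:
    id: int
    region: str


def serialize(event: Event) -> bytes:
    """Turn an in-memory object into bytes for storage or transmission."""
    return json.dumps(asdict(event)).encode("utf-8")


def deserialize(payload: bytes) -> Event:
    """Rebuild the object from its byte representation."""
    return Event(**json.loads(payload.decode("utf-8")))


payload = serialize(Event(id=1, region="EU"))
round_tripped = deserialize(payload)
```

Binary formats follow the same pattern but produce more compact bytes than JSON text.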

Question 15

Q

What is schema evolution?

Answer

A

The ability of a data format to handle schema changes over time, such as added or removed fields, while older data remains readable.
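
A simplified sketch of one common idea behind schema evolution: new fields get default values so that records written under an older schema can still be read. The schema, field names, and `read_record` helper are assumptions for illustration and do not reflect any specific format's rules:

```python
# Version 2 of a hypothetical schema adds "currency" with a default value.
SCHEMA_V2_DEFAULTS = {"id": None, "amount": 0.0, "currency": "USD"}


def read_record(stored: dict) -> dict:
    """Read a record written under this schema or an older version of it."""
    # Fields missing from old records take the default;
    # fields unknown to the current schema are ignored.
    return {
        field: stored.get(field, default)
        for field, default in SCHEMA_V2_DEFAULTS.items()
    }


old_record = {"id": 1, "amount": 10.0}  # written before "currency" existed
new_record = {"id": 2, "amount": 5.0, "currency": "EUR"}

upgraded = read_record(old_record)
```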

Question 16

Q

What is a distributed file system?

Answer

A

A file system that stores data across multiple nodes for fault tolerance and scalability.

Question 17

Q

What is HDFS?

Answer

A

Hadoop Distributed File System: a distributed file system designed for storing large files across a cluster.

Question 18

Q

What is object storage?

Answer

A

A type of storage where data is managed as objects with metadata and a unique ID.

Question 19

Q

Give an example of an object store.

Answer

A

Amazon S3.

Question 20

Q

What is a data partition?

Answer

A

A segment of a dataset produced by splitting it on values such as date or region, so that queries can read only the segments they need.
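
A minimal in-memory sketch of partitioning by a column value. The rows and the `partition_by` helper are hypothetical; real systems usually write each partition to its own directory or file:

```python
from collections import defaultdict

# Illustrative rows.
rows = [
    {"date": "2024-01-01", "region": "EU", "amount": 10.0},
    {"date": "2024-01-01", "region": "US", "amount": 25.5},
    {"date": "2024-01-02", "region": "EU", "amount": 4.5},
]


def partition_by(rows: list, key: str) -> dict:
    """Group rows into partitions keyed by the value of one column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)


by_date = partition_by(rows, "date")

# A query for one day reads only that day's partition.
jan_first = by_date["2024-01-01"]
```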

(20 cards)