
Data Engineering Part 1 Flashcards

(20 cards)

1
Q

What is structured data?

A

Data that resides in a fixed schema, such as rows and columns in a table.

2
Q

What is unstructured data?

A

Data without a predefined model, such as text, images, or audio.

3
Q

What is semi-structured data?

A

Data with some organizational properties, like JSON or XML.

4
Q

What is a row-oriented storage format?

A

Stores data row-by-row, suitable for transactional systems.

5
Q

What is a column-oriented storage format?

A

Stores data column-by-column, optimized for analytics and compression.
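
To make the contrast between the two layouts concrete, here is a minimal sketch in plain Python. The sample orders and field names are hypothetical, and in-memory lists only stand in for how a real storage engine lays bytes out on disk.

```python
# Hypothetical sample data: three orders with id, region, and amount.
rows = [
    {"id": 1, "region": "east", "amount": 10.0},
    {"id": 2, "region": "west", "amount": 25.5},
    {"id": 3, "region": "east", "amount": 7.25},
]

# Row-oriented layout: each record is kept together, so reading or
# updating one whole order touches a single entry.
order_2 = rows[1]

# Column-oriented layout: all values of one column are kept together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over one column reads only that column's values.
total_amount = sum(columns["amount"])
```

The row layout favors "fetch or update this one order" (transactional access), while the column layout favors "sum this one field across all orders" (analytical access).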

6
Q

What is a CSV file?

A

Comma-Separated Values: a flat, plain-text format for tabular data, with one record per line and fields separated by commas.
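
As a small illustration, the sketch below parses CSV text with Python's standard `csv` module. The sample content and column names are made up for the example.

```python
import csv
import io

# Hypothetical CSV content: a header line followed by two records.
text = "id,region,amount\n1,east,10.0\n2,west,25.5\n"

# DictReader uses the header line as the keys for each record.
records = list(csv.DictReader(io.StringIO(text)))

# CSV carries no type information, so every value arrives as a string
# and must be converted explicitly.
amounts = [float(record["amount"]) for record in records]
```

Note that the reader has to decide what each field means; the file itself stores only text.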

7
Q

What is a JSON file?

A

JavaScript Object Notation: a lightweight text format that represents data as key-value objects and arrays, which can be nested.

8
Q

What is a Parquet file?

A

A columnar storage format that supports efficient compression and encoding.
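
The sketch below illustrates two encodings that columnar formats commonly apply to a column: dictionary encoding followed by run-length encoding. It is a toy illustration in plain Python, not Parquet's actual on-disk format, and the helper names and sample column are hypothetical.

```python
from itertools import groupby

def dictionary_encode(values):
    """Replace each distinct value with a small integer code."""
    dictionary = []  # code -> value
    index = {}       # value -> code
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes

def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    return [(code, len(list(group))) for code, group in groupby(codes)]

# Hypothetical column with few distinct, often-repeated values.
region_column = ["east", "east", "east", "west", "west", "east"]
dictionary, codes = dictionary_encode(region_column)
runs = run_length_encode(codes)
```

Because a column holds values of one type that often repeat, encodings like these tend to work much better on column-wise data than on rows that mix unrelated fields.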

9
Q

What is the advantage of Parquet over CSV?

A

Parquet is columnar and compressed, so analytical queries can read only the columns they need and scan less data than they would with CSV.

10
Q

When is JSON preferred?

A

For hierarchical or nested data structures such as logs or API responses.
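
A minimal sketch of why nesting matters, using Python's standard `json` module. The payload below is a made-up API response, not one from any real service.

```python
import json

# Hypothetical API response: objects nested inside objects and arrays,
# with an optional "page" field present on only some events.
response_text = (
    '{"user": {"id": 7, "tags": ["admin", "beta"]},'
    ' "events": [{"type": "login"}, {"type": "view", "page": "/home"}]}'
)
response = json.loads(response_text)

# Nested values are reached by walking the structure.
first_tag = response["user"]["tags"][0]
event_types = [event["type"] for event in response["events"]]

# Records do not need identical fields; missing keys come back as None.
pages = [event.get("page") for event in response["events"]]
```

A flat format such as CSV would need extra columns or separate files to express the same hierarchy.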

11
Q

What is data compression?

A

Reducing file size by encoding data more efficiently.
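
As a quick sketch, the snippet below compresses highly repetitive data with Python's standard `gzip` module and restores it. The sample bytes are hypothetical, and real-world ratios depend heavily on the data.

```python
import gzip

# Hypothetical, highly repetitive data: the same CSV line 1000 times.
raw = b"2024-01-01,east,10.0\n" * 1000

compressed = gzip.compress(raw)
restored = gzip.decompress(compressed)

# Fraction of the original size that remains after compression.
ratio = len(compressed) / len(raw)
```

The round trip is lossless: `restored` is byte-for-byte identical to `raw`, while `compressed` is much smaller because the repetition is encoded compactly.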

12
Q

What are common compression formats for data files?

A

gzip, snappy, bzip2

13
Q

Why is compression useful in data engineering?

A

It reduces storage costs and improves I/O efficiency.

14
Q

What is serialization?

A

Converting an object into a byte stream for storage or transmission.
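
One possible illustration, assuming JSON encoded as UTF-8 is the chosen wire format (binary formats such as Avro or Protocol Buffers are common alternatives). The `Order` class is hypothetical.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class Order:
    """Hypothetical in-memory object to be stored or transmitted."""
    id: int
    region: str
    amount: float

order = Order(id=1, region="east", amount=10.0)

# Serialize: object -> dict -> JSON text -> bytes.
payload = json.dumps(asdict(order)).encode("utf-8")

# Deserialize: bytes -> JSON text -> dict -> object.
restored = Order(**json.loads(payload.decode("utf-8")))
```

The reverse step, rebuilding the object from bytes, is called deserialization.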

15
Q

What is schema evolution?

A

The ability of a data format to accommodate schema changes over time, such as added or removed fields, without breaking readers of existing data.
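
A minimal sketch of one common approach: the reader fills in defaults for fields that older records lack. The schema, field names, and default values here are hypothetical, and formats such as Avro define their own resolution rules.

```python
import json

# Hypothetical schema version 2: "currency" was added after version 1,
# so it needs a default for records written before the change.
SCHEMA_V2_DEFAULTS = {"id": None, "amount": None, "currency": "USD"}

def read_record(line):
    """Read a JSON record written under schema v1 or v2."""
    data = json.loads(line)
    # Keep only known fields and fill in defaults for missing ones.
    return {
        field: data.get(field, default)
        for field, default in SCHEMA_V2_DEFAULTS.items()
    }

old_record = read_record('{"id": 1, "amount": 10.0}')
new_record = read_record('{"id": 2, "amount": 5.0, "currency": "EUR"}')
```

Old data stays readable after the schema gains a field, which is the core of backward-compatible evolution.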

16
Q

What is a distributed file system?

A

A file system that stores data across multiple nodes for fault tolerance and scalability.
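
The idea can be sketched as a toy block-placement function: split a file into blocks and store each block on more than one node. The round-robin placement, node names, and `place_blocks` helper are assumptions for illustration; real systems use more sophisticated placement policies.

```python
def place_blocks(data, block_size, nodes, replicas):
    """Split data into blocks and assign each block to several nodes."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    placement = {}
    for number in range(len(blocks)):
        # Simple round-robin: consecutive nodes hold the copies.
        placement[number] = [
            nodes[(number + k) % len(nodes)] for k in range(replicas)
        ]
    return blocks, placement

blocks, placement = place_blocks(
    b"abcdefghij", block_size=4,
    nodes=["node-a", "node-b", "node-c"], replicas=2,
)

# Fault tolerance: if node-b fails, every block still has a live copy.
survivors = {
    number: [node for node in holders if node != "node-b"]
    for number, holders in placement.items()
}
```

Spreading blocks across nodes gives scalability, and keeping multiple copies of each block gives fault tolerance.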

17
Q

What is HDFS?

A

Hadoop Distributed File System: a distributed file system that splits large files into blocks and replicates them across the nodes of a cluster.

18
Q

What is object storage?

A

A type of storage where data is managed as objects with metadata and a unique ID.
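
To illustrate the model, here is a toy in-memory object store. The `ToyObjectStore` class and its methods are hypothetical and do not mirror any real service's API.

```python
import uuid

class ToyObjectStore:
    """Hypothetical store: a flat namespace of ID -> (data, metadata)."""

    def __init__(self):
        self._objects = {}

    def put(self, data, metadata):
        # Each object gets a unique ID; there is no directory hierarchy.
        object_id = str(uuid.uuid4())
        self._objects[object_id] = {"data": data, "metadata": dict(metadata)}
        return object_id

    def get(self, object_id):
        stored = self._objects[object_id]
        return stored["data"], stored["metadata"]

store = ToyObjectStore()
object_id = store.put(b"\x89PNG...", {"content-type": "image/png"})
data, metadata = store.get(object_id)
```

Unlike a file system, the object is addressed by its ID and retrieved as a whole, together with its metadata.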

19
Q

Give an example of an object store.

A

Amazon S3. Other examples include Google Cloud Storage and Azure Blob Storage.

20
Q

What is a data partition?

A

A segment of a dataset, split out by values such as date or region, so that queries can read only the segments they need.
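
A small sketch of partitioning by date. The `date=...` key naming imitates a common directory convention and is an assumption here, as are the sample records.

```python
from collections import defaultdict

# Hypothetical records, each carrying the value used for partitioning.
records = [
    {"date": "2024-01-01", "region": "east", "amount": 10.0},
    {"date": "2024-01-02", "region": "west", "amount": 25.5},
    {"date": "2024-01-02", "region": "east", "amount": 7.25},
]

# Group records into one partition per distinct date.
partitions = defaultdict(list)
for record in records:
    partitions[f"date={record['date']}"].append(record)

# A query for one day reads only that partition, not the whole dataset.
jan_2 = partitions["date=2024-01-02"]
jan_2_total = sum(record["amount"] for record in jan_2)
```

Skipping partitions that cannot match a query's filter is often called partition pruning.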