Data Engineering Part 1 Flashcards
(20 cards)
What is structured data?
Data that conforms to a fixed schema, such as rows and columns in a table.
What is unstructured data?
Data without a predefined model, such as text, images, or audio.
What is semi-structured data?
Data without a rigid schema that still carries organizing markers such as keys or tags, as in JSON or XML.
What is a row-oriented storage format?
A format that stores all fields of a record together, one row after another, which suits transactional systems that read and write whole records.
What is a column-oriented storage format?
A format that stores the values of each column together, which suits analytical queries that scan a few columns and lets similar values compress well.
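As a rough illustration of the two layouts, the same small table can be held row-wise or column-wise in plain Python. The sample records and variable names are invented for this sketch; real storage formats add encoding, indexing, and on-disk layout on top of this idea.

```python
# Hypothetical sample table: three orders with an id, region, and amount.
rows = [
    {"id": 1, "region": "EU", "amount": 10.0},
    {"id": 2, "region": "US", "amount": 25.5},
    {"id": 3, "region": "EU", "amount": 7.25},
]

# Row-oriented: each record is stored together, so fetching one whole
# record (a typical transactional lookup) touches a single element.
order_2 = rows[1]

# Column-oriented: each column is stored together, so an aggregate over
# one column (a typical analytical query) never touches the others.
columns = {key: [row[key] for row in rows] for key in rows[0]}
total_amount = sum(columns["amount"])
```

Because each list in `columns` holds values of one type and often many repeats (see `region`), a columnar layout also tends to compress better.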
What is a CSV file?
Comma-Separated Values: a flat, plain-text format for tabular data.
What is a JSON file?
JavaScript Object Notation: a lightweight text format that represents data as key-value pairs and arrays, which can be nested.
What is a Parquet file?
A columnar storage format that supports efficient compression and encoding.
What is the advantage of Parquet over CSV?
Parquet is columnar, compressed, and better for analytics workloads.
When is JSON preferred?
For hierarchical or nested data structures such as logs or API responses.
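A minimal sketch of why nesting favors JSON, using only Python's standard `json` and `csv` modules. The `response` payload and the flattened column names are assumptions made up for the example.

```python
import csv
import io
import json

# Hypothetical API response with a nested object and a list.
response = {"user": {"id": 7, "name": "Ada"}, "tags": ["admin", "beta"]}

# JSON keeps the nesting intact through a round trip.
text = json.dumps(response)
restored = json.loads(text)

# CSV is flat: nested fields must be flattened into columns first.
flat = {
    "user_id": response["user"]["id"],
    "user_name": response["user"]["name"],
    "tags": "|".join(response["tags"]),
}
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(flat))
writer.writeheader()
writer.writerow(flat)
csv_text = buffer.getvalue()
```

The flattening step is a design decision the writer has to make (here, joining tags with `|`), and the reader has to know about it to undo it. JSON needs no such agreement.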
What is data compression?
Reducing file size by encoding data more efficiently.
What are common compression formats for data files?
gzip, Snappy, and bzip2.
Why is compression useful in data engineering?
It reduces storage costs and improves I/O efficiency.
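The effect can be seen with Python's standard `gzip` module. This is a sketch on made-up, highly repetitive data; real files will compress by different amounts depending on their content.

```python
import gzip

# Hypothetical repetitive log data; repetition compresses well.
raw = b"2024-01-01,EU,ok\n" * 1000

# Compress, then decompress to confirm the round trip is lossless.
compressed = gzip.compress(raw)
restored = gzip.decompress(compressed)

# Fraction of the original size that must be stored or transferred.
ratio = len(compressed) / len(raw)
print(f"{len(raw)} bytes -> {len(compressed)} bytes (ratio {ratio:.3f})")
```

Fewer bytes on disk or on the network means less I/O, at the cost of some CPU time to compress and decompress.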
What is serialization?
Converting an object into a byte stream for storage or transmission.
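A small sketch of a serialization round trip, using JSON as the wire format. The `Event` class and its fields are hypothetical; other formats follow the same object-to-bytes-and-back pattern.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Event:  # hypothetical record type for illustration
    user_id: int
    action: str


event = Event(user_id=7, action="login")

# Serialize: object -> dict -> JSON text -> bytes.
payload = json.dumps(asdict(event)).encode("utf-8")

# Deserialize: bytes -> JSON text -> dict -> object.
decoded = Event(**json.loads(payload.decode("utf-8")))
```

The `payload` bytes can be written to a file or sent over a network, and `decoded` compares equal to the original `event`.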
What is schema evolution?
The ability of a data format to handle schema changes over time, such as added or removed fields, while keeping previously written data readable.
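One way to picture this is a reader that fills in defaults for fields older records lack. This is a hand-rolled sketch, not the mechanism of any particular format; the schema, field names, and `read_record` helper are all assumptions for illustration.

```python
# Hypothetical current schema (v2): adds a "country" field with a default.
SCHEMA_V2_DEFAULTS = {"id": None, "name": "", "country": "unknown"}


def read_record(record: dict) -> dict:
    """Read a record written under v1 or v2 using the v2 schema."""
    # Fields missing from older records take the default value;
    # fields unknown to this schema are ignored.
    return {
        key: record.get(key, default)
        for key, default in SCHEMA_V2_DEFAULTS.items()
    }


old = read_record({"id": 1, "name": "Ada"})                  # written with v1
new = read_record({"id": 2, "name": "Bo", "country": "SE"})  # written with v2
```

Both records come back with the same set of fields, so downstream code does not need to know which schema version wrote them.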
What is a distributed file system?
A file system that stores data across multiple nodes for fault tolerance and scalability.
What is HDFS?
Hadoop Distributed File System: a file system designed for storing large files across a cluster.
What is object storage?
A type of storage where data is managed as objects with metadata and a unique ID.
Give an example of an object store.
Amazon S3.
What is a data partition?
A segment of a dataset, created by splitting the data on values such as date or region so that queries can read only the segments they need.
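A minimal in-memory sketch of partitioning by date and region. The records are invented, and the `key=value` directory-style naming is a common convention assumed here, not something the cards prescribe.

```python
from collections import defaultdict

# Hypothetical records to be partitioned by date and region.
records = [
    {"date": "2024-01-01", "region": "EU", "amount": 10},
    {"date": "2024-01-01", "region": "US", "amount": 20},
    {"date": "2024-01-02", "region": "EU", "amount": 30},
]

partitions = defaultdict(list)
for record in records:
    # Directory-style partition key built from the partition columns.
    key = f"date={record['date']}/region={record['region']}"
    partitions[key].append(record)

# A query for one date only needs the partitions whose key matches.
jan_1 = [key for key in partitions if key.startswith("date=2024-01-01/")]
```

A query filtered to 2024-01-01 reads two of the three partitions and skips the rest, which is the access optimization the card refers to.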