Data Engineering Flashcards
(132 cards)
What is Data Engineering?
Data engineering is the practice of designing, building, and managing systems that handle data efficiently. It involves collecting, storing, processing, and transforming raw data into meaningful information that businesses can use.
What is the primary focus of Data Engineering?
Building data pipelines and managing data infrastructure.
What does a Data Analyst do?
Analyzes and interprets data to extract insights for business decisions
What is the role of Data Scientists?
Applies statistics and machine learning to data to derive insights and make predictions
How do Data Engineers support Data Scientists?
By ensuring data is clean, reliable, and available
What does a Database Administrator (DBA) do?
Manages and optimizes databases
What are examples of databases used in data engineering?
- PostgreSQL
- MySQL
- SQL Server
What are examples of Data Warehouses?
- Snowflake
- Redshift
- BigQuery
What are examples of Data Lakes?
- Azure Data Lake
- AWS S3
- HDFS
What are examples of NoSQL databases?
- MongoDB
- Cassandra
How do data engineers use cloud platforms?
To build scalable data pipelines and warehouses
Difference between Batch and Streaming Data Processing?
- Batch Processing: Processes large volumes of data at scheduled intervals (e.g., ADF, Apache Spark)
- Stream Processing: Processes data in real time as it arrives (e.g., Kafka, Azure Stream Analytics)
Scenario: Use batch processing for end-of-day sales reports and stream processing for real-time fraud detection.
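The scenario above can be sketched in a few lines of Python. This is only an illustration: the event list, the function names, and the 500.0 fraud threshold are assumptions made for the example, and a real pipeline would use a scheduler and a message broker rather than in-memory lists.

```python
from collections import defaultdict

# Hypothetical sales events as (store_id, amount) pairs.
EVENTS = [("A", 20.0), ("B", 5.0), ("A", 900.0), ("B", 7.5)]


def batch_daily_totals(events):
    """Batch: process the whole day's data in one scheduled run."""
    totals = defaultdict(float)
    for store, amount in events:
        totals[store] += amount
    return dict(totals)


def stream_flag_suspicious(events, threshold=500.0):
    """Streaming: inspect each event as it arrives and emit alerts at once."""
    for store, amount in events:
        if amount > threshold:  # assumed, simplistic fraud rule
            yield (store, amount)


daily_totals = batch_daily_totals(EVENTS)             # end-of-day report
alerts = list(stream_flag_suspicious(iter(EVENTS)))   # per-event checks
```

The batch function needs the full dataset before it produces anything, while the generator yields an alert as soon as a single suspicious event is seen.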
What is the Ingestion Layer in data architecture?
Collects data from multiple sources (e.g., APIs, databases, logs)
What is the Storage Layer in data architecture?
Stores raw, processed, and structured data (e.g., Data Lake, Data Warehouse).
What is the Processing Layer in data architecture?
Transforms and enriches data (e.g., Spark, Databricks, ADF).
What is the Serving Layer in data architecture?
Exposes data for analysis (e.g., Power BI, Tableau)
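The four layers can be pictured as four small functions chained together. The sketch below is a toy model under stated assumptions: the record fields (order_id, amount_cents), the function names, and the dict standing in for a data lake are all invented for illustration.

```python
import json


def ingest(raw_lines):
    """Ingestion layer: collect raw records from a source (here, JSON log lines)."""
    return [json.loads(line) for line in raw_lines]


def store(records, storage):
    """Storage layer: persist raw records (a dict stands in for a data lake)."""
    storage["raw"] = list(records)
    return storage


def process(storage):
    """Processing layer: drop incomplete rows and add a derived column."""
    storage["processed"] = [
        {**r, "amount_usd": round(r["amount_cents"] / 100, 2)}
        for r in storage["raw"]
        if r.get("amount_cents") is not None
    ]
    return storage


def serve(storage):
    """Serving layer: expose an aggregate that a BI tool could read."""
    return sum(r["amount_usd"] for r in storage["processed"])


raw_lines = [
    '{"order_id": 1, "amount_cents": 1250}',
    '{"order_id": 2, "amount_cents": null}',
    '{"order_id": 3, "amount_cents": 750}',
]
lake = process(store(ingest(raw_lines), {}))
total_usd = serve(lake)
```

Keeping the raw records alongside the processed ones mirrors the Storage Layer card: raw data stays available for reprocessing if the transformation logic changes.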
What is Slowly Changing Dimension (SCD)?
A method to handle historical changes in dimension tables
What are the types of Slowly Changing Dimensions?
- SCD Type 1: Overwrites old data (e.g., updating customer address)
- SCD Type 2: Keeps full history by adding a new record for each change
- SCD Type 3: Stores previous and current values in separate columns
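The three types differ only in what happens to the old value, which a small sketch can make concrete. The column names used here (valid_from, valid_to, is_current, previous_address) are common conventions chosen for illustration, not a fixed standard.

```python
from datetime import date


def scd_type1(row, new_address):
    """Type 1: overwrite in place; the old value is lost."""
    return {**row, "address": new_address}


def scd_type2(history, customer_id, new_address, change_date):
    """Type 2: close the current record and append a new current one."""
    updated = []
    for r in history:
        if r["customer_id"] == customer_id and r["is_current"]:
            r = {**r, "valid_to": change_date, "is_current": False}
        updated.append(r)
    updated.append({
        "customer_id": customer_id,
        "address": new_address,
        "valid_from": change_date,
        "valid_to": None,
        "is_current": True,
    })
    return updated


def scd_type3(row, new_address):
    """Type 3: keep only the previous value, in a separate column."""
    return {**row, "previous_address": row["address"], "address": new_address}


row = {"customer_id": 1, "address": "Old St"}
history = [{
    "customer_id": 1, "address": "Old St",
    "valid_from": date(2023, 1, 1), "valid_to": None, "is_current": True,
}]

t1 = scd_type1(row, "New Ave")
t2 = scd_type2(history, 1, "New Ave", date(2024, 6, 1))
t3 = scd_type3(row, "New Ave")
```

Type 2 is the only variant whose table grows with every change, which is the price of keeping the complete history.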
What is Data Partitioning?
Splits large datasets into smaller chunks for faster queries and processing
What are the types of Data Partitioning?
- Horizontal Partitioning: Divides by row (e.g., by date, region)
- Vertical Partitioning: Stores specific columns separately
Scenario: Partitioning sales data by year (year=2023, year=2024) to speed up queries for a specific year.
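A minimal sketch of the idea, assuming a hypothetical list of sales rows: rows are grouped under keys such as year=2023, and a query for one year reads only the matching partition instead of scanning everything.

```python
from collections import defaultdict

# Hypothetical sales rows.
SALES = [
    {"order_id": 1, "year": 2023, "amount": 100},
    {"order_id": 2, "year": 2024, "amount": 250},
    {"order_id": 3, "year": 2023, "amount": 50},
    {"order_id": 4, "year": 2024, "amount": 75},
]


def partition_by_year(rows):
    """Horizontal partitioning: group rows under keys like 'year=2023'."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[f"year={row['year']}"].append(row)
    return dict(partitions)


def total_for_year(partitions, year):
    """Read only the matching partition; other years are never scanned."""
    return sum(r["amount"] for r in partitions.get(f"year={year}", []))


partitions = partition_by_year(SALES)
total_2023 = total_for_year(partitions, 2023)
```

The key format mirrors the directory layout many data lakes use for partitioned files, where each key becomes a folder.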
What are the benefits of Data Partitioning?
Improves query performance, reduces scan time, and enhances parallel processing
Difference between OLTP and OLAP?
- OLTP (Online Transaction Processing): Used for real-time transactions (e.g., banking, e-commerce). Handles transactional data (e.g., inserting, updating records). Used for day-to-day operations
- OLAP (Online Analytical Processing): Used for analytics and reporting (e.g., data warehouses). Handles analytical queries on historical data. Used for business intelligence.
- OLTP schemas are typically normalized, while OLAP uses denormalized schemas for fast queries
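The contrast can be shown with Python's built-in sqlite3 module. This is a rough sketch, not a benchmark: the table and column names are made up for the example, and SQLite stands in for both the transactional database and the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized OLTP-style schema: customers and orders live in separate tables.
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT NOT NULL);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    amount REAL NOT NULL
);
""")

# OLTP workload: small inserts and single-row updates inside a transaction.
with conn:
    conn.execute("INSERT INTO customers VALUES (1, 'EU'), (2, 'US')")
    conn.executemany(
        "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
        [(1, 10.0), (1, 15.0), (2, 40.0)],
    )
    conn.execute("UPDATE orders SET amount = 20.0 WHERE id = 2")

# OLAP-style table: denormalized (region copied onto every order row)
# so analytical queries need no join.
conn.execute("""
CREATE TABLE sales_flat AS
SELECT o.id AS order_id, c.region, o.amount
FROM orders o JOIN customers c ON c.id = o.customer_id
""")

# OLAP workload: aggregate over many historical rows.
revenue_by_region = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales_flat GROUP BY region ORDER BY region"
))
```

The flat table repeats the region on every row, trading storage and update cost for simpler, faster reads.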
What is Change Data Capture (CDC)?
Captures changes in a database and propagates them to a target system.
Used in real-time data replication and incremental ETL
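One simple way to picture CDC is a snapshot diff: compare two versions of a source table, emit change events, and replay them on the target. The sketch below assumes tables are dicts keyed by primary key, and the event shape (op, key, row) is invented for illustration. Many production CDC tools read the database's transaction log instead of diffing snapshots.

```python
def capture_changes(before, after):
    """Diff two snapshots and emit insert, update, and delete events."""
    changes = []
    for key, row in after.items():
        if key not in before:
            changes.append({"op": "insert", "key": key, "row": row})
        elif before[key] != row:
            changes.append({"op": "update", "key": key, "row": row})
    for key in before:
        if key not in after:
            changes.append({"op": "delete", "key": key})
    return changes


def apply_changes(target, changes):
    """Replay change events on the target so it converges to the source."""
    for change in changes:
        if change["op"] in ("insert", "update"):
            target[change["key"]] = change["row"]
        elif change["op"] == "delete":
            target.pop(change["key"], None)
        else:
            raise ValueError(f"unknown operation: {change['op']}")
    return target


source_v1 = {1: {"name": "Ann"}, 2: {"name": "Bob"}, 3: {"name": "Cy"}}
source_v2 = {1: {"name": "Ann"}, 2: {"name": "Robert"}, 4: {"name": "Dee"}}

changes = capture_changes(source_v1, source_v2)
replica = apply_changes(dict(source_v1), changes)
```

Only the three changed rows travel to the target, which is what makes incremental ETL cheaper than reloading the full table.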
What is the purpose of a schema in databases?
A schema defines the structure of data (e.g., tables, columns, data types). It ensures data consistency and helps in querying and analysis.
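As a small illustration of how a schema enforces consistency, the sketch below declares a table with sqlite3 and shows invalid rows being rejected. The customers table, its columns, and the try_insert helper are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema: column names, data types, and constraints.
conn.execute("""
CREATE TABLE customers (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE,
    age   INTEGER CHECK (age >= 0)
)
""")


def try_insert(email, age):
    """Return True if the row satisfies the schema, False if it is rejected."""
    try:
        conn.execute(
            "INSERT INTO customers (email, age) VALUES (?, ?)", (email, age)
        )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert("ann@example.com", 30)
rejected_null = not try_insert(None, 25)                   # violates NOT NULL
rejected_negative = not try_insert("bob@example.com", -1)  # violates CHECK

# The schema can also be inspected, which helps when writing queries.
columns = [row[1] for row in conn.execute("PRAGMA table_info(customers)")]
```

Because the database rejects bad rows at write time, downstream queries can rely on every stored row having an email and a non-negative age.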