Introduction to Data Engineering Flashcards
The role of a data engineer
- the primary role responsible for integrating, transforming, and consolidating data from various structured and unstructured data systems into structures that are suitable for building analytics solutions.
- helps ensure that data pipelines and data stores are high-performing, efficient, organized, and reliable, given a specific set of business requirements and constraints.
Types of Data
- Structured
- Unstructured
- Semi-structured
Structured data
- primarily comes from table-based source systems
- the rows and columns are aligned consistently throughout the file.
- common examples: comma-separated values (CSV) files and relational database (RDB) tables
Semi-structured Data
- such as JavaScript Object Notation (JSON) files, which may require flattening before they can be loaded into a tabular structure (see the flattening sketch below).
- doesn’t have to fit neatly into a table structure.
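Example: a minimal sketch of flattening nested JSON records into a tabular shape using pandas (the records and field names are made up for illustration).

```python
import pandas as pd

# Hypothetical nested JSON records, e.g. exported from an application API.
orders = [
    {"id": 1, "customer": {"name": "Ana", "city": "Lisbon"}, "total": 42.5},
    {"id": 2, "customer": {"name": "Ben", "city": "Porto"}, "total": 17.0},
]

# json_normalize flattens the nested "customer" object into columns such as
# "customer.name" and "customer.city", so the data fits a table structure.
flat = pd.json_normalize(orders)
print(flat)
```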
Unstructured data
- data stored as key-value pairs that don't adhere to standard relational models
- other commonly used types include Portable Document Format (PDF) files, word processor documents, and images
Main data operations
- Data integration
- Data transformation
- Data consolidation
Data integration
establishing links between operational and analytical services and data sources to enable secure, reliable access to data across multiple systems.
Data transformation
- typically performed as part of an extract, transform, and load (ETL) process
- the data is prepared to support downstream analytical needs (see the ETL sketch below)
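Example: a minimal ETL sketch with pandas, assuming a hypothetical sales.csv extract and a Parquet output file; in Azure this step is more often implemented with Azure Data Factory or Apache Spark.

```python
import pandas as pd

# Extract: read raw operational data (hypothetical file name and columns).
raw = pd.read_csv("sales.csv")

# Transform: parse dates, drop incomplete rows, aggregate to a daily grain.
raw["order_date"] = pd.to_datetime(raw["order_date"])
daily = (
    raw.dropna(subset=["amount"])
       .assign(order_day=lambda df: df["order_date"].dt.date)
       .groupby("order_day", as_index=False)["amount"]
       .sum()
       .rename(columns={"amount": "daily_revenue"})
)

# Load: write the prepared data in an analytics-friendly format
# (requires a Parquet engine such as pyarrow).
daily.to_parquet("daily_revenue.parquet", index=False)
```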
Data consolidation
- the process of combining data that has been extracted from multiple data sources into a consistent structure - usually to support analytics and reporting.
- data from operational systems is extracted, transformed, and loaded into analytical stores such as a data lake or data warehouse (see the consolidation sketch below).
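Example: a minimal consolidation sketch with pandas, combining two hypothetical extracts (a CRM CSV file and a web sign-up JSON file) into one consistent structure.

```python
import pandas as pd

# Hypothetical extracts from two different source systems.
crm = pd.read_csv("crm_customers.csv")    # columns: cust_id, full_name, country
web = pd.read_json("web_signups.json")    # columns: id, name, country

# Align both sources to a single consistent schema.
crm = crm.rename(columns={"cust_id": "customer_id", "full_name": "name"})
web = web.rename(columns={"id": "customer_id"})

# Combine and de-duplicate, then load into an analytical store format.
customers = pd.concat([crm, web], ignore_index=True).drop_duplicates("customer_id")
customers.to_parquet("customers_consolidated.parquet", index=False)
```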
Operational data
transactional data that is generated and stored by applications, often in a relational or non-relational database.
Analytical data
data that has been optimized for analysis and reporting, often in a data warehouse.
Streaming data
perpetual sources of data that generate data values in real-time, often relating to specific events.
Data pipelines
- are used to orchestrate activities that transfer and transform data.
- Pipelines are the primary way in which data engineers implement repeatable extract, transform, and load (ETL) solutions that can be triggered based on a schedule or in response to events (see the conceptual sketch below).
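Conceptual sketch only (plain Python, not the Azure Data Factory or Azure Synapse pipeline API): a pipeline as an ordered set of dependent activities that a trigger runs as a unit.

```python
from datetime import datetime

def extract():
    print("copy raw files from the source systems")

def transform():
    print("clean and reshape the data for analysis")

def load():
    print("write the prepared data to the analytical store")

# A pipeline is an ordered set of dependent activities.
ACTIVITIES = [extract, transform, load]

def run_pipeline():
    """Run each activity in order; a real orchestrator also adds
    parameters, monitoring, and retries."""
    print(f"pipeline run started at {datetime.now().isoformat()}")
    for activity in ACTIVITIES:
        activity()

# In Azure, a schedule- or event-based trigger would start each run;
# here a run is started manually.
run_pipeline()
```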
Data lakes
- a storage repository that holds large amounts of data in native, raw formats.
- optimized for scaling to massive volumes (terabytes or petabytes) of data.
- data typically comes from multiple heterogeneous sources, and may be structured, semi-structured, or unstructured.
GOAL: store everything in its original, untransformed state.
Data warehouse
- a centralized repository of integrated data from one or more disparate sources.
- stores current and historical data in relational tables that are organized into a schema that optimizes performance for analytical queries.
Apache Spark
a parallel processing framework that takes advantage of in-memory processing and distributed file storage (see the PySpark sketch below).
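Example: a minimal PySpark sketch of distributed, in-memory processing; the file path and column names (region, amount) are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-intro").getOrCreate()

# Read a hypothetical CSV file from distributed storage into a DataFrame.
sales = spark.read.option("header", True).csv("/data/sales.csv")

# cache() keeps the parsed data in executor memory for reuse.
sales = sales.withColumn("amount", F.col("amount").cast("double")).cache()

# The aggregation runs in parallel across the cluster's executors.
sales.groupBy("region").agg(F.sum("amount").alias("total_sales")).show()

spark.stop()
```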
Core Azure technologies used to implement data engineering workloads include:
- Azure Synapse Analytics
- Azure Data Lake Storage Gen2
- Azure Stream Analytics
- Azure Data Factory
- Azure Databricks
- Azure Event Hubs
Data Lake pt2
- provides file-based storage, usually in a distributed file system that supports high scalability for massive volumes of data.
- can store structured, semi-structured, and unstructured files in the data lake and then consume them from there in big data processing technologies, such as Apache Spark.
Azure Data Lake Storage
- a comprehensive, massively scalable, secure, and cost-effective data lake solution for high performance analytics built into Azure.
- combines a file system with a storage platform to help you quickly identify insights into your data.
- combines analytics performance with the tiering and data lifecycle management capabilities of Blob storage and the high-availability, security, and durability capabilities of Azure Storage.
- can be used as the basis for both real-time and batch solutions.
Benefits
- Hadoop compatible access
- Security
- Performance
- Data redundancy
Hadoop compatible access
- data can be treated and accessed just as if it were stored in a Hadoop Distributed File System (HDFS)
- can store the data in one place and access it through compute technologies including Azure Databricks, Azure HDInsight, and Azure Synapse Analytics without moving the data between environments.
- can use storage formats such as Parquet, which is highly compressed and performs well across multiple platforms using an internal columnar storage format (see the sketch below).
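Example: a hedged PySpark sketch of reading files straight from Data Lake Storage Gen2 over its HDFS-compatible (abfss) endpoint and writing Parquet back; the account, container, and paths are placeholders, and authentication setup is omitted.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adls-access").getOrCreate()

# Placeholder container/account; abfss:// is the Data Lake Storage Gen2
# HDFS-compatible endpoint (authentication configuration not shown).
base = "abfss://mycontainer@mystorageaccount.dfs.core.windows.net"

# Read raw CSV files directly from the data lake, as if from HDFS.
orders = spark.read.option("header", True).csv(f"{base}/raw/orders/")

# Write back as Parquet: compressed, columnar, and usable without moving the
# data between Azure Databricks, Azure HDInsight, and Azure Synapse Analytics.
orders.write.mode("overwrite").parquet(f"{base}/curated/orders/")
```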
Security
- supports access control lists (ACLs)
- Portable Operating System Interface (POSIX) permissions
- can set permissions at a directory level or file level for the data stored within the data lake (see the ACL sketch below)
- encrypted at rest
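Example: a hedged sketch of setting a POSIX-style ACL on a directory, assuming the azure-storage-file-datalake and azure-identity packages; the account, container, directory, and credential are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder account URL; DefaultAzureCredential resolves a credential from
# the environment (for example a local login or a managed identity).
service = DataLakeServiceClient(
    account_url="https://mystorageaccount.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# Directory-level permissions: owner rwx, owning group r-x, others no access.
directory = (
    service.get_file_system_client("mycontainer")
           .get_directory_client("raw/sales")
)
directory.set_access_control(acl="user::rwx,group::r-x,other::---")
```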
Performance
organizes the stored data into a hierarchy of directories and subdirectories, much like a file system, for easier navigation.
Data redundancy
takes advantage of the Azure Blob replication models that provide data redundancy in a single data center with locally redundant storage (LRS), or to a secondary region by using the Geo-redundant storage (GRS) option.