Data Engineering Solutions Flashcards
Data pipelines are sequences of operations that?
Data pipelines are sequences of operations that copy, transform, load, and analyze data. There are common high-level design patterns that you see repeatedly in batch, streaming, and machine learning pipelines.
Understand the model of data pipelines.
A data pipeline is an abstract concept that captures the idea that data flows from one stage of processing to another. Data pipelines are modeled as directed acyclic graphs (DAGs). A graph is a set of nodes linked by edges. A directed graph has edges that flow in one direction, from one node to another, and an acyclic graph contains no cycles, so data never loops back to an earlier stage.
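As a minimal illustration of the DAG model (not tied to any GCP service; the stage names are hypothetical), a pipeline can be represented as a mapping from each stage to the stages that consume its output:

# A pipeline modeled as a directed acyclic graph: nodes are stages, edges point downstream.
pipeline_dag = {
    "ingest": ["transform"],
    "transform": ["store"],
    "store": ["analyze"],
    "analyze": [],  # terminal node: no outgoing edges, so no cycles
}

def downstream(stage, dag):
    """Return every stage reachable from the given stage, in visit order."""
    reached = []
    for nxt in dag[stage]:
        reached.append(nxt)
        reached.extend(downstream(nxt, dag))
    return reached

print(downstream("ingest", pipeline_dag))  # ['transform', 'store', 'analyze']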
Know the four stages in a data pipeline.
Ingestion is the process of bringing data into the GCP environment.
Transformation is the process of mapping data from the structure used in the source system to the structure used in the storage and analysis stages of the data pipeline.
Cloud Storage can be used as both the staging area for storing data immediately after ingestion and also as a long-term store for transformed data.
BigQuery can treat Cloud Storage data as external tables and query them.
Cloud Dataproc can use Cloud Storage as HDFS-compatible storage.
Analysis can take on several forms, from simple SQL querying and report generation to machine learning model training and data science analysis.
Know that the structure and function of data pipelines will vary according to the use case to which they are applied.
Three common types of pipelines are data warehousing pipelines, stream processing pipelines, and machine learning pipelines.
Know the common patterns in data warehousing pipelines.
Extract, transformation, and load (ETL) pipelines begin with extracting data from one or more data sources.
When multiple data sources are used, the extraction processes need to be coordinated.
This is because extractions are often time-based, so it is important that extracts from different sources cover the same time period. Extract, load, and transformation (ELT) processes are slightly different from ETL processes.
In an ELT process, data is loaded into a database before the data is transformed. Extraction and load procedures do not transform data. This kind of process is appropriate when data does not require changes from the source format. In a change data capture approach, each change in a source system is captured and recorded in a data store. This is helpful in cases where it is important to know all changes over time and not just the state of the database at the time of data extraction.
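A minimal ELT sketch, assuming the google-cloud-bigquery Python client: raw files are loaded from Cloud Storage into a staging table and then transformed with SQL inside the database. The bucket, dataset, and table names are hypothetical.

from google.cloud import bigquery  # assumes google-cloud-bigquery is installed

client = bigquery.Client()

# Extract and load: copy raw files from Cloud Storage into a staging table.
load_job = client.load_table_from_uri(
    "gs://example-bucket/raw/sales_*.csv",   # hypothetical source files
    "example_dataset.sales_raw",             # hypothetical staging table
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV, autodetect=True
    ),
)
load_job.result()  # wait for the load to complete

# Transform: reshape the data with SQL after it is already in the database.
client.query(
    """
    CREATE OR REPLACE TABLE example_dataset.sales_by_day AS
    SELECT DATE(sale_timestamp) AS sale_date, SUM(amount) AS total_amount
    FROM example_dataset.sales_raw
    GROUP BY sale_date
    """
).result()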
Understand the unique processing characteristics of stream processing.
This includes the difference between event time and processing time, sliding and tumbling windows, late-arriving data and watermarks, and missing data.
Event time is the time that something occurred at the place where the data is generated.
Processing time is the time that data arrives at the endpoint where data is ingested.
Sliding windows are used when you want to show how an aggregate, such as the average of the last three values, changes over time, and you want to update that stream of averages each time a new value arrives in the stream.
Tumbling windows are used when you want to aggregate data over a fixed period of time—for example, for the last one minute.
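A minimal sketch of the two window types, assuming the Apache Beam Python SDK; the readings, timestamps, and window sizes are hypothetical.

import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.window import TimestampedValue

with beam.Pipeline() as p:
    readings = (
        p
        | beam.Create([(10, 15.0), (70, 16.5), (130, 14.2)])  # (event-time seconds, value)
        | beam.Map(lambda kv: TimestampedValue(kv[1], kv[0]))  # attach event-time timestamps
    )

    # Tumbling windows: fixed, non-overlapping 60-second windows.
    per_minute = (
        readings
        | "Tumble" >> beam.WindowInto(window.FixedWindows(60))
        | "MeanPerMinute" >> beam.CombineGlobally(beam.combiners.MeanCombineFn()).without_defaults()
    )

    # Sliding windows: 180-second windows starting every 60 seconds, so each new
    # minute produces an updated three-minute average.
    rolling = (
        readings
        | "Slide" >> beam.WindowInto(window.SlidingWindows(size=180, period=60))
        | "RollingMean" >> beam.CombineGlobally(beam.combiners.MeanCombineFn()).without_defaults()
    )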
Know the components of a typical machine learning pipeline.
This includes data ingestion, data preprocessing, feature engineering, model training and evaluation, and deployment.
Data ingestion uses the same tools and services as data warehousing and streaming data pipelines. Cloud Storage is used for batch storage of datasets, whereas Cloud Pub/Sub can be used for the ingestion of streaming data. Feature engineering is a machine learning practice in which new attributes are introduced into a dataset. The new attributes are derived from one or more existing attributes.
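A small feature-engineering sketch using pandas; the columns and derived attributes are hypothetical.

import pandas as pd

sales = pd.DataFrame({
    "unit_price": [10.0, 12.5, 9.0],
    "quantity": [3, 1, 5],
    "order_ts": pd.to_datetime(["2023-01-02", "2023-01-07", "2023-01-08"]),
})

# New attributes derived from one or more existing attributes.
sales["order_total"] = sales["unit_price"] * sales["quantity"]
sales["order_day_of_week"] = sales["order_ts"].dt.dayofweek  # 0 = Monday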
Know that Cloud Pub/Sub is a managed message queue service.
Cloud Pub/Sub is a real-time messaging service that supports both push and pull subscription models.
It is a managed service, and it requires no provisioning of servers or clusters.
Cloud Pub/Sub will automatically scale as needed. Messaging queues are used in distributed systems to decouple services in a pipeline.
This allows one service to produce more output than the consuming service can process without adversely affecting the consuming service. This is especially helpful when one process is subject to spikes.
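A minimal sketch of the decoupled producer and consumer, assuming the google-cloud-pubsub Python client (version 2.x request style); the project, topic, and subscription names are hypothetical.

from google.cloud import pubsub_v1  # assumes google-cloud-pubsub is installed

# Producing service: publish messages to a topic.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("example-project", "example-topic")
publisher.publish(topic_path, data=b'{"order_id": 123}').result()  # returns the message ID

# Consuming service (pull subscription): read and acknowledge at its own pace.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("example-project", "example-subscription")
response = subscriber.pull(request={"subscription": subscription_path, "max_messages": 10})
for received in response.received_messages:
    print(received.message.data)
subscriber.acknowledge(request={
    "subscription": subscription_path,
    "ack_ids": [m.ack_id for m in response.received_messages],
})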
Know that Cloud Dataflow is a managed stream and batch processing service.
Cloud Dataflow is a core component for running pipelines that collect, transform, and output data. In the past, developers would typically create a stream processing pipeline (hot path) and a separate batch processing pipeline (cold path). Cloud Dataflow is based on Apache Beam, which is a model for combined stream and batch processing. Understand these key Cloud Dataflow concepts (a short Apache Beam sketch follows the list):
Pipelines
PCollection
Transforms
ParDo
Pipeline I/O
Aggregation
User-defined functions
Runner
Triggers
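A brief Apache Beam sketch that ties several of these concepts together; the file paths and parsing logic are hypothetical.

import apache_beam as beam

class ParseLine(beam.DoFn):              # a user-defined function applied by ParDo
    def process(self, line):
        product, amount = line.split(",")
        yield (product, float(amount))

with beam.Pipeline() as pipeline:        # the runner is selected through pipeline options
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://example-bucket/sales.csv")   # pipeline I/O
        | "Parse" >> beam.ParDo(ParseLine())                                # ParDo over a PCollection
        | "SumPerProduct" >> beam.CombinePerKey(sum)                        # aggregation
        | "Write" >> beam.io.WriteToText("gs://example-bucket/output/totals")
    )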
Know that Cloud Dataproc is a managed Hadoop and Spark service.
Cloud Dataproc makes it easy to create and destroy ephemeral clusters and to migrate from on-premises Hadoop clusters to GCP. A typical Cloud Dataproc cluster is configured with commonly used components of the Hadoop ecosystem, including Hadoop, Spark, Pig, and Hive. Cloud Dataproc clusters consist of two types of nodes: master nodes and worker nodes. The master node is responsible for managing and distributing the workload across the worker nodes.
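A sketch of creating and tearing down an ephemeral cluster, assuming the google-cloud-dataproc Python client (dataproc_v1); the project, region, cluster name, and machine types are hypothetical.

from google.cloud import dataproc_v1  # assumes google-cloud-dataproc is installed

project_id, region, cluster_name = "example-project", "us-central1", "ephemeral-etl-cluster"

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

# Create the ephemeral cluster, run jobs against it (not shown), then delete it.
cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
).result()

cluster_client.delete_cluster(
    request={"project_id": project_id, "region": region, "cluster_name": cluster_name}
).result()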
Know that Cloud Composer is a managed service implementing Apache Airflow.
Cloud Composer is used for scheduling and managing workflows. As pipelines become more complex and have to be resilient when errors occur, it becomes more important to have a framework for managing workflows so that you are not reinventing code for handling errors and other exceptional cases. Cloud Composer automates the scheduling and monitoring of workflows. Before you can run workflows with Cloud Composer, you will need to create an environment in GCP.
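A minimal Airflow DAG of the kind Cloud Composer schedules and monitors; the task contents, schedule, and retry settings are hypothetical, with retries standing in for the error handling mentioned above.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {"retries": 2, "retry_delay": timedelta(minutes=5)}  # basic error handling

with DAG(
    dag_id="example_daily_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    extract >> transform >> load  # the task dependencies form a DAG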
Understand what to consider when migrating from on-premises Hadoop and Spark to GCP.
Factors include migrating data, migrating jobs, and migrating HBase to Bigtable. Hadoop and Spark migrations can happen incrementally, especially since you will be using ephemeral clusters configured for specific jobs. There may be cases where you will have to keep an on-premises cluster while migrating some jobs and data to GCP. In those cases, you will have to keep data synchronized between environments. It is a good practice to migrate HBase databases to Bigtable, which provides consistent, scalable performance.
What is ingestion?
Ingestion (see Figure 3.3) is the process of bringing data into the GCP environment. This can occur in either batch or streaming mode.
In batch mode, datasets made up of one or more files are copied to GCP. Often these files are copied to Cloud Storage first. There are several ways to get data into Cloud Storage, including copying with gsutil, Transfer Service, and Transfer Appliance.
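A small batch-ingestion sketch, assuming the google-cloud-storage Python client; the bucket, object path, and local file are hypothetical.

from google.cloud import storage  # assumes google-cloud-storage is installed

client = storage.Client()
bucket = client.bucket("example-ingest-bucket")
blob = bucket.blob("raw/2023-01-01/sales.csv")   # staging location in Cloud Storage
blob.upload_from_filename("/tmp/sales.csv")      # local batch file to ingest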
Streaming ingestion receives data in increments, typically a single record or small batches of records, that continuously flow into an ingestion endpoint, typically a Cloud Pub/Sub topic.
Transformation
Transformation is the process of mapping data from the structure used in the source system to the structure used in the storage and analysis stages of the data pipeline. There are many kinds of transformations, including the following (several of them are sketched in code after the list):
Converting data types, such as converting a text representation of a date to a datetime data type
Substituting missing data with default or imputed values
Aggregating data; for example, averaging all CPU utilization metrics for an instance over the course of one minute
Filtering records that violate business logic rules, such as an audit log transaction with a date in the future
Augmenting data by joining records from distinct sources, such as joining data from an employee table with data from a sales table that includes the employee identifier of the person who made the sale
Dropping columns or attributes from a dataset when they will not be needed
Adding columns or attributes derived from input data; for example, the average of the previous three reported sales prices of a stock might be added to a row of data about the latest price for that stock
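A sketch of several of these transformations using pandas; the columns, rules, and values are hypothetical.

import pandas as pd

records = pd.DataFrame({
    "sale_date": ["2023-01-05", "2023-01-06", "2099-12-31"],
    "cpu_util": [0.41, None, 0.77],
    "region": ["us-east1", "us-east1", "europe-west1"],
})

# Convert data types: text representation of a date -> datetime.
records["sale_date"] = pd.to_datetime(records["sale_date"])

# Substitute missing data with an imputed value (here, the column mean).
records["cpu_util"] = records["cpu_util"].fillna(records["cpu_util"].mean())

# Filter records that violate a business logic rule (no dates in the future).
records = records[records["sale_date"] <= pd.Timestamp("2023-12-31")]

# Aggregate: average CPU utilization per region.
per_region = records.groupby("region")["cpu_util"].mean()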
What are the options for storage in a pipeline?
Storage
After data is ingested and transformed, it is often stored. Chapter 2, “Building and Operationalizing Storage Systems,” describes GCP storage systems in detail, but key points related to data pipelines will be reviewed here as well.
Cloud Storage can be used as both the staging area for storing data immediately after ingestion and also as a long-term store for transformed data. BigQuery can treat Cloud Storage data as external tables and query them. Cloud Dataproc can use Cloud Storage as HDFS-compatible storage.
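A sketch of treating Cloud Storage data as a BigQuery external table and querying it, assuming the google-cloud-bigquery Python client; the dataset, schema, and bucket are hypothetical.

from google.cloud import bigquery  # assumes google-cloud-bigquery is installed

client = bigquery.Client()

# Define an external table whose data stays in Cloud Storage.
client.query(
    """
    CREATE OR REPLACE EXTERNAL TABLE example_dataset.events_external (
      event_id INT64,
      event_ts TIMESTAMP
    )
    OPTIONS (format = 'CSV', uris = ['gs://example-bucket/events/*.csv'])
    """
).result()

# Query it like any other table.
for row in client.query(
    "SELECT COUNT(*) AS event_count FROM example_dataset.events_external"
).result():
    print(row.event_count)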
BigQuery is an analytical database that uses a columnar storage model that is highly efficient for data warehousing and analytic use cases.
Bigtable is a low-latency, wide-column NoSQL database used for time-series, IoT, and other high-volume write applications. Bigtable also supports the HBase API, making it a good storage option when migrating an on-premises HBase database on Hadoop (see Figure 3.5).
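A small sketch of a high-volume time-series write, assuming the google-cloud-bigtable Python client; the instance, table, column family, and row-key scheme are hypothetical.

import datetime

from google.cloud import bigtable  # assumes google-cloud-bigtable is installed

client = bigtable.Client(project="example-project")
table = client.instance("example-instance").table("sensor_readings")

# The row key combines the entity and the event time so that a series stays contiguous.
row = table.direct_row(b"sensor-42#2023-01-01T00:00:00")
row.set_cell("metrics", "temperature", b"21.5",
             timestamp=datetime.datetime.utcnow())
row.commit()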
Types of Data Pipelines
The structure and function of data pipelines will vary according to the use case to which they are applied, but three common types of pipelines are as follows:
Data warehousing pipelines
Stream processing pipelines
Machine learning pipelines
Data Warehousing Pipelines
Data warehouses are databases for storing data from multiple data sources, typically organized in a dimensional data model. Dimensional data models are denormalized; that is, they do not adhere to the rules of normalization used in transaction processing systems. This is done intentionally because the purpose of a data warehouse is to answer analytic queries efficiently, and highly normalized data models can require complex joins and significant amounts of I/O operations. Denormalized dimensional models keep related data together in a minimal number of tables so that few joins are required.
Collecting and restructuring data from online transaction processing systems is often a multistep process. Some common patterns in data warehousing pipelines are as follows:
Extraction, transformation, and load (ETL)
Extraction, load, and transformation (ELT)
Extraction and load
Change data capture
What is the difference between event time and processing time?
Event Time and Processing Time
Data in time-series streams is ordered by time. If a set of data A arrives before data B, then presumably the event described by A occurred before the event described by B. There is a subtle but important issue implied in the previous sentence, which is that you are actually dealing with two points in time in stream processing:
Event time is the time that something occurred at the place where the data is generated.
Processing time is the time that data arrives at the endpoint where data is ingested. Processing time could be defined as some other point in the data pipeline, such as the time that transformation starts.
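A minimal sketch of the two timestamps carried by a single streamed record; the field names are hypothetical.

import datetime
import json

event = {
    "sensor_id": "sensor-42",
    "reading": 21.5,
    "event_time": "2023-01-01T00:00:05Z",  # when it happened at the source
}

# At ingestion, the pipeline also records when the message actually arrived.
record = {
    **event,
    "processing_time": datetime.datetime.now(datetime.timezone.utc).isoformat(),
}
print(json.dumps(record))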
What is a watermark?
To help stream processing applications, you can use the concept of a watermark, which is basically a timestamp indicating that no data older than that timestamp will ever appear in the stream.
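A minimal sketch of how a watermark is applied: records whose event time falls behind the current watermark are treated as late. The timestamps are hypothetical.

import datetime

watermark = datetime.datetime(2023, 1, 1, 0, 5, 0)  # no on-time data older than this is expected

def is_late(event_time: datetime.datetime) -> bool:
    """A record is late if its event time is older than the current watermark."""
    return event_time < watermark

print(is_late(datetime.datetime(2023, 1, 1, 0, 3, 0)))  # True: the watermark has already passed
print(is_late(datetime.datetime(2023, 1, 1, 0, 6, 0)))  # False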
What is the difference between hot path and cold path?
Hot Path and Cold Path Ingestion
We have been considering a streaming-only ingestion process. This is sometimes called hot path ingestion. It reflects the latest data and makes it available as soon as possible, improving the timeliness of reporting data at the potential risk of a loss of accuracy. Cold path ingestion, by contrast, uses batch processing and favors completeness and accuracy over timeliness.
There are many use cases where this tradeoff is acceptable. For example, an online retailer having a flash sale would want to know sales figures in real time, even if they might be slightly off. Sales professionals running the flash sale need that data to adjust the parameters of the sale, and approximate, but not necessarily accurate, data meets their needs.
GCP Pipeline Components
GCP has several services that are commonly used components of pipelines, including?
Cloud Pub/Sub
Cloud Dataflow
Cloud Dataproc
Cloud Composer
A job is an executing pipeline in Cloud Dataflow. There are two ways to execute jobs: the traditional method and the template method.
With the traditional method, developers create a pipeline in a development environment and run the job from that environment. The template method separates development from staging and execution. With the template method, developers still create pipelines in a development environment, but they also create a template, which is a configured job specification. The specification can have parameters that are specified when a user runs the template. Google provides a number of templates, and you can create your own as well. See Figure 3.9 for examples of templates provided by Google.
What are the four main GCP compute products?
Compute Engine is GCP’s infrastructure-as-a-service (IaaS) product.
With Compute Engine, you have the greatest amount of control over your infrastructure relative to the other GCP compute services.
Kubernetes is a container orchestration system, and Kubernetes Engine is a managed Kubernetes service. With Kubernetes Engine, Google maintains the cluster and assumes responsibility for installing and configuring the Kubernetes platform on the cluster. Kubernetes Engine deploys Kubernetes on managed instance groups.
App Engine is GCP’s original platform-as-a-service (PaaS) offering. App Engine is designed to allow developers to focus on application development while minimizing their need to support the infrastructure that runs their applications. App Engine has two versions: App Engine Standard and App Engine Flexible.
Cloud Functions is a serverless, managed compute service for running code in response to events that occur in the cloud. Events are supported for Cloud Pub/Sub, Cloud Storage, HTTP events, Firebase, and Stackdriver Logging.
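A minimal sketch of a Cloud Function that responds to a Cloud Pub/Sub event, using the Python background-function signature; the function name is hypothetical.

import base64

def handle_message(event, context):
    """Runs each time a message is published to the configured topic."""
    payload = base64.b64decode(event["data"]).decode("utf-8")  # Pub/Sub data arrives base64-encoded
    print(f"Received message {context.event_id}: {payload}")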
Understand the definitions of availability, reliability, and scalability.
Availability is defined as the ability of a user to access a resource at a specific time. Availability is usually measured as the percentage of time a system is operational.
Reliability is defined as the probability that a system will meet service-level objectives for some duration of time. Reliability is often measured as the mean time between failures.
Scalability is the ability of a system to meet the demands of workloads as they vary over time.
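A small worked example of the availability definition: the percentage of time a system is operational translates directly into allowed downtime. The targets below are illustrative.

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

for availability in (0.99, 0.999, 0.9999):
    allowed_downtime = MINUTES_PER_MONTH * (1 - availability)
    print(f"{availability:.2%} availability -> {allowed_downtime:.1f} minutes of downtime per month")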