11. Designing ML Training Pipelines Flashcards

1
Q

What are ML pipelines for?

A

They let you instrument, orchestrate, and automate the complex steps of an ML workflow, from model development to prediction serving.

2
Q

What are the key functions of an ML pipeline?

A

Triggering pipeline runs on demand, on a schedule, or in response to specified events.
Integrating with ML metadata tracking to capture pipeline execution parameters and produce artifacts.

3
Q

What is an orchestrator for?

A

It runs the pipeline steps in sequence and automatically moves from one step to the next based on defined conditions, e.g., cleaning data, transforming data, training a model.

4
Q

How does an orchestrator help in the production phase?

A

It automates execution of the ML pipeline based on a schedule or on triggering conditions.

5
Q

How does an orchestrator help in the development phase?

A

It helps data scientists run ML experiment steps automatically.

6
Q

What is Kubeflow?

A

It is the ML toolkit for Kubernetes.
It builds on Kubernetes to deploy, scale, and manage complex systems.
It supports different frameworks, e.g., TensorFlow, PyTorch, MXNet.
Workflows can be deployed to various clouds or to local and on-premises platforms.

7
Q

What is a pipeline?

A

It is a description of an ML workflow in the form of a graph, including all of the components in the workflow and how the components relate to each other.

8
Q

What happens when a pipeline starts?

A

Kubernetes pods start Docker containers, and the containers start your program.

9
Q

What does the Kubeflow Pipelines platform consist of?

A

A UI for managing and tracking experiments, jobs, and runs
An engine for scheduling multistep ML workflows
An SDK for defining and manipulating pipelines and components
Notebooks for interacting with the system through the SDK
Support for orchestration, experimentation, and reuse

10
Q

Where can you run Kubeflow pipelines?

A

GKE
Vertex AI Pipelines
On-premises or local systems, for testing purposes

11
Q

How can you run Kubeflow pipelines or TensorFlow Extended pipelines without setting up Kubeflow or TFX infrastructure?

A

Use Vertex AI Pipelines, which automatically provisions and manages the underlying infrastructure. See the sketch below.
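For example, a hedged sketch of submitting a compiled pipeline to Vertex AI Pipelines with the google-cloud-aiplatform SDK; the project, region, bucket, and file names are placeholders.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Vertex AI Pipelines provisions and manages the infrastructure; you only
# supply the compiled pipeline spec and a Cloud Storage root for artifacts.
job = aiplatform.PipelineJob(
    display_name="demo-pipeline",
    template_path="pipeline.yaml",  # compiled KFP or TFX pipeline spec
    pipeline_root="gs://my-bucket/pipeline-root",
)
job.run()
```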

12
Q

What is data lineage in Vertex AI Pipelines?

A

Tracking the movement of data from its source to its consumption by a model.
Pipeline runs produce metadata and ML artifacts, e.g., the training data and hyperparameters used, from which lineage can be traced.

13
Q

What are characteristics of a pipeline?

A

Automates, monitors, and lets you experiment with parts of an ML workflow
Portable, scalable, and based on containers
Each individual part of the pipeline workflow is defined by code; this code is called a component, or step. Components are composed of inputs, outputs, and a container image location. A container image is a package that includes the component's executable code and a definition of its execution environment.
Components are self-contained sets of code that perform one part of the pipeline workflow.
You can build custom components or reuse pre-built components (a minimal sketch follows).
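A minimal sketch of that anatomy using the KFP v2 SDK; the function name and base image are illustrative.

```python
from kfp import dsl

# The decorator packages the function as a pipeline component: its typed
# parameters are the inputs, the return annotation is the output, and
# base_image is the container image that defines its environment.
@dsl.component(base_image="python:3.10")
def clean_data(raw_rows: int) -> int:
    """Illustrative step: pretend to drop one bad row."""
    return raw_rows - 1
```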

14
Q

What is the purpose of setting a threshold for deploying a model?

A

A model is deployed only when it reaches a certain performance level, e.g., a minimum evaluation metric.
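A hedged KFP v2 sketch of this gating pattern, using dsl.Condition (dsl.If in newer KFP releases); the component bodies and the 0.9 threshold are placeholders.

```python
from kfp import dsl

@dsl.component(base_image="python:3.10")
def evaluate_model() -> float:
    return 0.92  # stand-in for a real evaluation metric

@dsl.component(base_image="python:3.10")
def deploy_model():
    print("deploying model...")

@dsl.pipeline(name="gated-deploy")
def gated_deploy(threshold: float = 0.9):
    metric = evaluate_model()
    # The deploy step is skipped unless the metric clears the threshold.
    with dsl.Condition(metric.output > threshold):
        deploy_model()
```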

15
Q

What is Apache Airflow?

A

It is a workflow management platform for data engineering pipelines.
It lets you build and run workflows. A workflow is represented as a directed acyclic graph and contains pieces of work called tasks, arranged with dependencies and data flows.
It comes with a UI, a scheduler, and an executor.
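A minimal Airflow DAG sketch of that idea; the task names and schedule are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extracting data...")

def train():
    print("training model...")

# A two-task DAG run daily: extract must finish before train starts.
with DAG(
    dag_id="ml_training_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    train_task = PythonOperator(task_id="train", python_callable=train)
    extract_task >> train_task  # dependency: arranged as a DAG edge
```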

16
Q

What is ML Metadata in Vertex AI for?

A

It stores artifacts and metadata, letting you track lineage.

17
Q

What is Cloud Composer good for?

A

No installation or management overhead
Can use Airflow environments and tools
Designed to orchestrate data-driven workflows (ETL)
Best for batch workloads

18
Q

What is Cloud Composer?

A

It is a fully managed workflow orchestration service built on Apache Airflow.

19
Q

Compare Kubeflow Pipelines, Vertex AI Pipelines, and Cloud Composer.

A

Kubeflow Pipelines:
Orchestrates ML workflows (TensorFlow, PyTorch, or MXNet on Kubernetes), on-premises or in any cloud
You handle failures yourself
Vertex AI Pipelines:
Serverless pipelines
Orchestrates Kubeflow or TFX pipelines
Uses Kubeflow's failure management
Cloud Composer:
Orchestrates ETL pipelines using Apache Airflow
Workflows can be used for MLOps
Uses GCP's built-in metrics to manage failures

20
Q

How can you invoke Kubeflow Pipelines built with the Kubeflow Pipelines SDK?

A

Cloud Scheduler, or Pub/Sub plus Cloud Functions
Cloud Composer or Cloud Data Fusion
Argo (the built-in scheduler)
Apache Airflow orchestration and scheduling
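Whichever trigger is used, the call itself typically goes through the KFP SDK client. A sketch, where the host URL, file name, and argument are placeholders:

```python
import kfp

# Connect to the Kubeflow Pipelines endpoint and start a run from a
# compiled pipeline package.
client = kfp.Client(host="https://my-kfp-endpoint.example.com")
run = client.create_run_from_pipeline_package(
    pipeline_file="pipeline.yaml",
    arguments={"threshold": 0.9},
)
```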

21
Q

What are the two ways to schedule a Vertex AI pipeline?

A

With Cloud Scheduler: build and compile the pipeline, upload the compiled pipeline JSON to a Cloud Storage bucket, create a Cloud Function with an HTTP trigger, and create a Cloud Scheduler job that calls it.
With a Pub/Sub trigger: specify a Pub/Sub topic when you deploy the function; the function is called whenever a message is published to that topic.
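A hedged sketch of the Cloud Function piece of the Cloud Scheduler flow; the project, bucket, and object names are placeholders.

```python
from google.cloud import aiplatform

def trigger_pipeline(request):
    """HTTP-triggered Cloud Function called by a Cloud Scheduler job."""
    aiplatform.init(project="my-project", location="us-central1")
    job = aiplatform.PipelineJob(
        display_name="scheduled-run",
        template_path="gs://my-bucket/compiled_pipeline.json",
        pipeline_root="gs://my-bucket/pipeline-root",
    )
    job.submit()  # submit without blocking; the scheduler does not wait
    return "pipeline submitted"
```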

22
Q

What is system design with Kubeflow DSL?

A

Pipelines created in Kubeflow Pipelines are stored as YAML files executed by Argo. Kubeflow exposes a Python domain-specific language (DSL) for authoring pipelines.

23
Q

How do you create a pipeline with the Kubeflow DSL?

A

Create a container, as a simple Python function or as a Docker container.
Create an operation that references the container, along with the command-line arguments, data mounts, and variables to pass to it.
Sequence the operations (parallel or sequential).
Compile the pipeline into a YAML definition Kubeflow Pipelines can consume (see the sketch below).
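A minimal sketch of those steps with the KFP v2 DSL; the images, commands, and names are illustrative.

```python
from kfp import compiler, dsl

# Operations that reference a container image and its command line.
@dsl.container_component
def transform_data():
    return dsl.ContainerSpec(image="alpine", command=["echo"], args=["transform"])

@dsl.container_component
def train_model():
    return dsl.ContainerSpec(image="alpine", command=["echo"], args=["train"])

@dsl.pipeline(name="training-pipeline")
def training_pipeline():
    transform_task = transform_data()
    train_model().after(transform_task)  # sequential ordering

# Emit a YAML definition that Kubeflow Pipelines can consume.
compiler.Compiler().compile(training_pipeline, "pipeline.yaml")
```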

24
Q

How do you invoke a component in a pipeline?

A

You need to create a component op.

25
Q

How do you create a component op?

A

Implement a lightweight Python component:
Define a Python function and let Kubeflow package it.
Create a reusable component:
Write a component specification in a component.yaml file; component ops are created automatically from the specification.
Use predefined Google Cloud components:
These components help execute tasks using services, e.g., BigQuery or Dataflow. (A sketch of the options follows.)
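A hedged KFP v2 sketch of the first two options; the function body and file name are placeholders. The third option ships as the separate google-cloud-pipeline-components package.

```python
from kfp import components, dsl

# 1. Lightweight Python component: define a function, let KFP package it.
@dsl.component(base_image="python:3.10")
def add(a: int, b: int) -> int:
    return a + b

# 2. Reusable component: create the op from its component.yaml specification.
reusable_op = components.load_component_from_file("component.yaml")

# 3. Predefined Google Cloud components (e.g., for BigQuery or Dataflow)
#    are imported from the google-cloud-pipeline-components package.
```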

26
Q

What is TFX?

A

TFX is a platform for building and managing ML workflows in a production environment.
TFX pipelines orchestrate ML workflows on Apache Airflow, Apache Beam, or Kubeflow Pipelines.
TFX provides components as the parts of a pipeline.
TFX provides libraries for the base functionality, e.g., TFDV (TensorFlow Data Validation).

27
Q

What are the components of TFX pipeline?

A

ExampleGen (ingest, split data)
StatisticsGen (calculate statistics)
SchemaGen (examine statistics and create schema)
ExampleValidator (anomalies and missing values)
Transform (feature engineering)
Trainer (train)
Tuner (tune)
Evaluator (analyse training results & validate models)
Model Validator (servability)
Pusher (deploy)
Model Server (batch processing)

Hints: Elephants Stampede Safely Every Time To The Elephant’s Peaceful Mountain Meadow.
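A hedged sketch wiring the first few of these components together with the TFX Python API; the paths and names are placeholders, and the list is trimmed for brevity.

```python
from tfx import v1 as tfx

# Ingest and split data, compute statistics, infer a schema, and check for
# anomalies -- the first four components of the standard TFX pipeline.
example_gen = tfx.components.CsvExampleGen(input_base="data/")
statistics_gen = tfx.components.StatisticsGen(
    examples=example_gen.outputs["examples"])
schema_gen = tfx.components.SchemaGen(
    statistics=statistics_gen.outputs["statistics"])
example_validator = tfx.components.ExampleValidator(
    statistics=statistics_gen.outputs["statistics"],
    schema=schema_gen.outputs["schema"])

pipeline = tfx.dsl.Pipeline(
    pipeline_name="demo-pipeline",
    pipeline_root="gs://my-bucket/pipeline-root",
    components=[example_gen, statistics_gen, schema_gen, example_validator],
)
```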

28
Q

What are the characteristics of TFX pipeline?

A

Scalable and high-performance
Components can be used individually
A metadata store maintains the state of pipeline runs

29
Q

What are the ways to orchestrate TFX pipelines on GCP?

A

Use Kubeflow Pipelines running on GKE
Use Apache Airflow or Cloud Composer
Use Vertex AI Pipelines

30
Q

What is multicloud?

A

Interconnecting services from two or more different cloud providers.

31
Q

What are the ways to operate as multicloud with GCP?

A

Use GCP AI APIs to integrate with on-premises systems
Use BigQuery Omni to run BigQuery analytics on data stored in Amazon Web Services or Azure
Use data from AWS or Azure, via BigQuery Omni, to train ML models with Vertex AI

32
Q

What is hybrid cloud?

A

Combining a private computing environment (e.g., on-premises) with a public cloud computing environment.

33
Q

What is Anthos?

A

It is Google Cloud's hybrid and multicloud modernization platform.

34
Q

What are the features in Anthos?

A

BigQuery Omni for querying data
Hybrid AI, which offers Speech-to-Text on-premises
Running GKE on-premises
Running Cloud Run services on-premises