11. Designing ML Training Pipelines Flashcards

1
Q

What are ML pipelines for?

A

They let you instrument, orchestrate, and automate the complex steps of an ML workflow, from model development to prediction serving.

2
Q

What are the key functions of an ML pipeline?

A

Triggering pipeline runs on demand, on a schedule, or in response to specified events.
Integrating with ML metadata tracking to capture pipeline execution parameters and produce artifacts.

3
Q

What is an orchestrator for?

A

It runs the pipeline steps in sequence and automatically moves from one step to the next based on defined conditions, e.g., cleaning data, transforming data, training a model.

4
Q

How does an orchestrator help in the production phase?

A

It automates execution of the ML pipeline based on a schedule or on triggering conditions.

5
Q

How does an orchestrator help in the development phase?

A

It helps data scientists run ML experiment steps automatically.

6
Q

What is Kubeflow?

A

It is the ML toolkit for Kubernetes.
It builds on Kubernetes to deploy, scale, and manage complex systems.
It supports different frameworks, e.g., TensorFlow, PyTorch, MXNet.
Workflows can be deployed to various clouds or to local and on-premises platforms.

7
Q

What is a pipeline?

A

It is a description of an ML workflow in the form of a graph, including all of the components in the workflow and how the components relate to each other.

8
Q

What happens when a pipeline starts?

A

Kubernetes pods start Docker containers, and the containers start your program.

9
Q

What does the Kubeflow Pipelines platform consist of?

A

A UI for managing and tracking experiments, jobs, and runs
An engine for scheduling multistep ML workflows
An SDK for defining and manipulating pipelines and components
Notebooks for interacting with the system through the SDK
Support for orchestration, experimentation, and reuse

10
Q

Where can you run Kubeflow pipelines?

A

GKE
Vertex AI Pipelines
On-premises or local systems, for testing purposes

11
Q

How can you run Kubeflow pipelines or TensorFlow Extended pipelines without setting up Kubeflow or TFX infrastructure?

A

Use Vertex AI Pipelines, which automatically provisions and manages the underlying infrastructure. See the sketch below.
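For example, a hedged sketch of submitting a compiled pipeline to Vertex AI Pipelines with the google-cloud-aiplatform SDK; the project, region, bucket, and file names are placeholders.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Vertex AI Pipelines provisions and manages the infrastructure; you only
# supply the compiled pipeline spec and a Cloud Storage root for artifacts.
job = aiplatform.PipelineJob(
    display_name="demo-pipeline",
    template_path="pipeline.yaml",  # compiled KFP or TFX pipeline spec
    pipeline_root="gs://my-bucket/pipeline-root",
)
job.run()
```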

12
Q

What is data lineage in Vertex AI Pipelines?

A

Tracking the movement of data from its source to its consumption by a model.
Pipeline runs produce metadata and ML artifacts, e.g., the training data and hyperparameters used, from which lineage can be traced.

13
Q

What are characteristics of a pipeline?

A

Automates, monitors, and lets you experiment with parts of an ML workflow
Portable, scalable, and based on containers
Each individual part of the pipeline workflow is defined by code; this code is called a component, or step. Components are composed of inputs, outputs, and a container image location. A container image is a package that includes the component's executable code and a definition of its execution environment.
Components are self-contained sets of code that perform one part of the pipeline workflow.
You can build custom components or reuse pre-built components (a minimal sketch follows).
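A minimal sketch of that anatomy using the KFP v2 SDK; the function name and base image are illustrative.

```python
from kfp import dsl

# The decorator packages the function as a pipeline component: its typed
# parameters are the inputs, the return annotation is the output, and
# base_image is the container image that defines its environment.
@dsl.component(base_image="python:3.10")
def clean_data(raw_rows: int) -> int:
    """Illustrative step: pretend to drop one bad row."""
    return raw_rows - 1
```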

14
Q

What is the purpose of setting a threshold for deploying a model?

A

A model is deployed only when it reaches a certain performance level, e.g., a minimum evaluation metric.
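A hedged KFP v2 sketch of this gating pattern, using dsl.Condition (dsl.If in newer KFP releases); the component bodies and the 0.9 threshold are placeholders.

```python
from kfp import dsl

@dsl.component(base_image="python:3.10")
def evaluate_model() -> float:
    return 0.92  # stand-in for a real evaluation metric

@dsl.component(base_image="python:3.10")
def deploy_model():
    print("deploying model...")

@dsl.pipeline(name="gated-deploy")
def gated_deploy(threshold: float = 0.9):
    metric = evaluate_model()
    # The deploy step is skipped unless the metric clears the threshold.
    with dsl.Condition(metric.output > threshold):
        deploy_model()
```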

15
Q

What is Apache Airflow?

A

It is a workflow management platform for data engineering pipelines.
It lets you build and run workflows. A workflow is represented as a directed acyclic graph and contains pieces of work called tasks, arranged with dependencies and data flows.
It comes with a UI, a scheduler, and an executor.
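A minimal Airflow DAG sketch of that idea; the task names and schedule are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extracting data...")

def train():
    print("training model...")

# A two-task DAG run daily: extract must finish before train starts.
with DAG(
    dag_id="ml_training_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    train_task = PythonOperator(task_id="train", python_callable=train)
    extract_task >> train_task  # dependency: arranged as a DAG edge
```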

16
Q

What is ML Metadata in Vertex AI for?

A

It stores artifacts and metadata, letting you track lineage.

17
Q

What is Cloud Composer good for?

A

No installation or management overhead
Can use Airflow environments and tools
Designed to orchestrate data-driven workflows (ETL)
Best for batch workloads

18
Q

What is Cloud Composer?

A

It is a fully managed workflow orchestration service built on Apache Airflow.

19
Q

Compare Kubeflow Pipelines, Vertex AI Pipelines, and Cloud Composer.

A

Kubeflow Pipelines:
Orchestrates ML workflows (TensorFlow, PyTorch, or MXNet on Kubernetes), on-premises or in any cloud
You handle failures yourself
Vertex AI Pipelines:
Serverless pipelines
Orchestrates Kubeflow or TFX pipelines
Uses Kubeflow's failure management
Cloud Composer:
Orchestrates ETL pipelines using Apache Airflow
Workflows can be used for MLOps
Uses GCP's built-in metrics to manage failures

20
Q

How can you invoke Kubeflow Pipelines built with the Kubeflow Pipelines SDK?

A

Cloud Scheduler, or Pub/Sub plus Cloud Functions
Cloud Composer or Cloud Data Fusion
Argo (the built-in scheduler)
Apache Airflow orchestration and scheduling
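Whichever trigger is used, the call itself typically goes through the KFP SDK client. A sketch, where the host URL, file name, and argument are placeholders:

```python
import kfp

# Connect to the Kubeflow Pipelines endpoint and start a run from a
# compiled pipeline package.
client = kfp.Client(host="https://my-kfp-endpoint.example.com")
run = client.create_run_from_pipeline_package(
    pipeline_file="pipeline.yaml",
    arguments={"threshold": 0.9},
)
```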

21
Q

What are the two ways to schedule a Vertex AI pipeline?

A

With Cloud Scheduler: build and compile the pipeline, upload the compiled pipeline JSON to a Cloud Storage bucket, create a Cloud Function with an HTTP trigger, and create a Cloud Scheduler job that calls it.
With a Pub/Sub trigger: specify a Pub/Sub topic when you deploy the function; the function is called whenever a message is published to that topic.
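A hedged sketch of the Cloud Function piece of the Cloud Scheduler flow; the project, bucket, and object names are placeholders.

```python
from google.cloud import aiplatform

def trigger_pipeline(request):
    """HTTP-triggered Cloud Function called by a Cloud Scheduler job."""
    aiplatform.init(project="my-project", location="us-central1")
    job = aiplatform.PipelineJob(
        display_name="scheduled-run",
        template_path="gs://my-bucket/compiled_pipeline.json",
        pipeline_root="gs://my-bucket/pipeline-root",
    )
    job.submit()  # submit without blocking; the scheduler does not wait
    return "pipeline submitted"
```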

22
Q

What is system design with Kubeflow DSL?

A

Pipelines created in Kubeflow Pipelines are stored as YAML files executed by Argo. Kubeflow exposes a Python domain-specific language (DSL) for authoring pipelines.

23
Q

How do you create a pipeline with the Kubeflow DSL?

A

Create a container, as a simple Python function or as a Docker container.
Create an operation that references the container, along with the command-line arguments, data mounts, and variables to pass to it.
Sequence the operations (parallel or sequential).
Compile the pipeline into a YAML definition Kubeflow Pipelines can consume (see the sketch below).
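A minimal sketch of those steps with the KFP v2 DSL; the images, commands, and names are illustrative.

```python
from kfp import compiler, dsl

# Operations that reference a container image and its command line.
@dsl.container_component
def transform_data():
    return dsl.ContainerSpec(image="alpine", command=["echo"], args=["transform"])

@dsl.container_component
def train_model():
    return dsl.ContainerSpec(image="alpine", command=["echo"], args=["train"])

@dsl.pipeline(name="training-pipeline")
def training_pipeline():
    transform_task = transform_data()
    train_model().after(transform_task)  # sequential ordering

# Emit a YAML definition that Kubeflow Pipelines can consume.
compiler.Compiler().compile(training_pipeline, "pipeline.yaml")
```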

24
Q

How do you invoke a component in a pipeline?

A

You need to create a component op.

25
Q

How do you create a component op?

A

Implement a lightweight Python component:
Define a Python function and let Kubeflow package it.
Create a reusable component:
Write a component specification in a component.yaml file; component ops are created automatically from the specification.
Use predefined Google Cloud components:
These components help execute tasks using services, e.g., BigQuery or Dataflow. (A sketch of the options follows.)
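A hedged KFP v2 sketch of the first two options; the function body and file name are placeholders. The third option ships as the separate google-cloud-pipeline-components package.

```python
from kfp import components, dsl

# 1. Lightweight Python component: define a function, let KFP package it.
@dsl.component(base_image="python:3.10")
def add(a: int, b: int) -> int:
    return a + b

# 2. Reusable component: create the op from its component.yaml specification.
reusable_op = components.load_component_from_file("component.yaml")

# 3. Predefined Google Cloud components (e.g., for BigQuery or Dataflow)
#    are imported from the google-cloud-pipeline-components package.
```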

26
Q

What is TFX?

A

TFX is a platform for building and managing ML workflows in a production environment.
TFX pipelines orchestrate ML workflows on Apache Airflow, Apache Beam, or Kubeflow Pipelines.
TFX provides components as the parts of a pipeline.
TFX provides libraries for the base functionality, e.g., TFDV (TensorFlow Data Validation).

27
Q

What are the components of TFX pipeline?

A

ExampleGen (ingest, split data)
StatisticsGen (calculate statistics)
SchemaGen (examine statistics and create schema)
ExampleValidator (anomalies and missing values)
Transform (feature engineering)
Trainer (train)
Tuner (tune)
Evaluator (analyse training results & validate models)
Model Validator (servability)
Pusher (deploy)
Model Server (batch processing)

Hints: Elephants Stampede Safely Every Time To The Elephant’s Peaceful Mountain Meadow.
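A hedged sketch wiring the first few of these components together with the TFX Python API; the paths and names are placeholders, and the list is trimmed for brevity.

```python
from tfx import v1 as tfx

# Ingest and split data, compute statistics, infer a schema, and check for
# anomalies -- the first four components of the standard TFX pipeline.
example_gen = tfx.components.CsvExampleGen(input_base="data/")
statistics_gen = tfx.components.StatisticsGen(
    examples=example_gen.outputs["examples"])
schema_gen = tfx.components.SchemaGen(
    statistics=statistics_gen.outputs["statistics"])
example_validator = tfx.components.ExampleValidator(
    statistics=statistics_gen.outputs["statistics"],
    schema=schema_gen.outputs["schema"])

pipeline = tfx.dsl.Pipeline(
    pipeline_name="demo-pipeline",
    pipeline_root="gs://my-bucket/pipeline-root",
    components=[example_gen, statistics_gen, schema_gen, example_validator],
)
```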

28
Q

What are the characteristics of TFX pipeline?

A

Scalable and high-performance
Components can be used individually
A metadata store maintains the state of pipeline runs

29
Q

What are the ways to orchestrate TFX pipelines on GCP?

A

Use Kubeflow Pipelines running on GKE
Use Apache Airflow or Cloud Composer
Use Vertex AI Pipelines

30
Q

What is multicloud?

A

Interconnecting services from two or more different cloud providers.

31
Q

What are the ways to operate as multicloud with GCP?

A

Use GCP AI APIs to integrate with on-premises systems
Use BigQuery Omni to run BigQuery analytics on data stored in Amazon Web Services or Azure
Use data from AWS or Azure, via BigQuery Omni, to train ML models with Vertex AI

32
Q

What is hybrid cloud?

A

Combining a private computing environment (e.g., on-premises) with a public cloud computing environment.

33
Q

What is Anthos?

A

It is Google Cloud's hybrid and multicloud modernization platform.

34
Q

What are the features in Anthos?

A

BigQuery Omni for querying data
Hybrid AI, which offers Speech-to-Text on-premises
Running GKE on-premises
Running Cloud Run services on-premises