11. Designing ML Training Pipelines Flashcards
What are ML pipelines for?
They let you instrument, orchestrate, and automate the complex ML steps from development through prediction serving.
What are the key functions of an ML pipeline?
Triggering pipeline runs on demand, on a schedule, or in response to specified events.
Integrating with ML metadata tracking to capture pipeline execution parameters and to produce artifacts.
What is an orchestrator for?
It runs the pipeline steps in sequence and automatically moves from one step to the next based on defined conditions, e.g., cleaning data, transforming data, training a model.
How does an orchestrator help in the production phase?
It automates execution of the ML pipeline based on a schedule or on triggering conditions.
How does an orchestrator help in the development phase?
It helps data scientists run the ML experiment steps automatically.
What is Kubeflow?
It is the ML toolkit for Kubernetes.
It builds on Kubernetes for deploying, scaling, and managing complex systems.
It supports different frameworks, e.g., TensorFlow, PyTorch, MXNet.
Workflows can be deployed to various clouds, or to local or on-premises platforms.
What is a pipeline?
It is a description of an ML workflow in the form of a graph, including all of the components in the workflow and how the components relate to each other.
What happens when a pipeline starts?
The pods start Docker containers, and the containers start your program.
What does the Kubeflow Pipelines platform consist of?
A UI for managing and tracking experiments, jobs, and runs.
An engine for scheduling multistep ML workflows.
An SDK for defining and manipulating pipelines and components (see the sketch after this list).
Notebooks for interacting with the system using the SDK.
Support for orchestration, experimentation, and reuse.
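A minimal sketch of defining and compiling a pipeline with the SDK, assuming the kfp v2 Python SDK; the component names, step bodies, and bucket path are placeholders:

```python
# Minimal Kubeflow Pipelines sketch using the kfp v2 SDK.
# Paths and step bodies are placeholders, not a real training job.
from kfp import compiler, dsl


@dsl.component
def preprocess(raw_path: str) -> str:
    # In a real component this would clean/transform the data.
    return raw_path + "/clean"


@dsl.component
def train(clean_path: str) -> str:
    # In a real component this would train a model and save it.
    return clean_path + "/model"


@dsl.pipeline(name="demo-training-pipeline")
def training_pipeline(raw_path: str = "gs://my-bucket/data"):
    # The graph: train depends on preprocess via its output.
    prep_task = preprocess(raw_path=raw_path)
    train(clean_path=prep_task.output)


# Compile the pipeline to a spec the Kubeflow Pipelines engine can run.
compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")
```

Each decorated function becomes a containerized step, and passing one task's output to another defines the edges of the workflow graph.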
Where can you run Kubeflow pipelines?
GKE
Vertex AI Pipelines
On-premises or local systems, for testing purposes.
How can you run Kubeflow pipelines or TensorFlow Extended pipelines without setting up Kubeflow or TFX infrastructure?
Use Vertex AI Pipelines, which automatically provisions and manages the underlying infrastructure (see the sketch below).
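A minimal sketch of submitting a compiled pipeline spec to Vertex AI Pipelines with the google-cloud-aiplatform SDK; the project, region, and GCS paths are placeholders:

```python
# Sketch: run a compiled KFP pipeline on Vertex AI Pipelines.
# Project, region, and GCS paths below are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

job = aiplatform.PipelineJob(
    display_name="demo-training-pipeline",
    template_path="training_pipeline.yaml",        # compiled pipeline spec
    pipeline_root="gs://my-bucket/pipeline-root",  # where artifacts are stored
)
job.run()  # Vertex AI provisions and manages the infrastructure
```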
What is data lineage in Vertex AI Pipelines?
Tracking the movement of data from the source to consumption by a model.
Vertex AI Pipelines also produces metadata and ML artifacts that support lineage, e.g., the training data and hyperparameters used.
What are characteristics of a pipeline?
Pipelines automate, monitor, and support experimentation with parts of an ML workflow.
They are portable, scalable, and based on containers.
Each individual part of the pipeline workflow is defined by code; this code is a component, or step. A component is composed of inputs, outputs, and a container image location. A container image is a package that includes the component's executable code and a definition of its environment.
Components are self-contained sets of code that perform one part of the pipeline workflow.
You can build custom components or reuse pre-built components, as in the sketch below.
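As a sketch of that anatomy (inputs, outputs, container image), here is a custom container-based component using the kfp v2 SDK; the image and command are illustrative:

```python
# Sketch of a container-based component: inputs, outputs, and an image.
from kfp import dsl


@dsl.container_component
def say_hello(name: str, greeting: dsl.OutputPath(str)):
    # The component is fully described by a container image plus a command;
    # `name` is an input, `greeting` is a file path KFP provides for the output.
    return dsl.ContainerSpec(
        image="alpine",
        command=["sh", "-c", 'mkdir -p "$(dirname "$1")" && echo "Hello, $0!" > "$1"'],
        args=[name, greeting],
    )
```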
What is the purpose of setting a threshold for deploying a model?
The model is deployed only when it reaches a certain quality level, e.g., an evaluation metric above the threshold, as sketched below.
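A sketch of such a gate using the kfp SDK's dsl.Condition; the metric value and the 0.9 threshold are placeholders:

```python
# Sketch: deploy only when the evaluation metric clears a threshold.
from kfp import dsl


@dsl.component
def evaluate(model_path: str) -> float:
    # Placeholder: a real step would score the model on a test set.
    return 0.92


@dsl.component
def deploy(model_path: str):
    # Placeholder: a real step would push the model to a serving endpoint.
    print(f"Deploying {model_path}")


@dsl.pipeline(name="train-eval-deploy")
def pipeline(model_path: str = "gs://my-bucket/model"):
    metric_task = evaluate(model_path=model_path)
    # The deploy step runs only if the metric is at least 0.9 (the threshold).
    with dsl.Condition(metric_task.output >= 0.9):
        deploy(model_path=model_path)
```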
What is Apache Airflow?
It is a workflow management platform for data engineering pipelines.
It lets you build and run workflows. A workflow is represented as a directed acyclic graph and contains pieces of work called tasks, arranged with dependencies and data flows.
It comes with a UI, a scheduler and an executor.
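A minimal DAG sketch, assuming Airflow 2.x; the task bodies and the daily schedule are placeholders:

```python
# Sketch of an Airflow DAG: three tasks with dependencies, run daily.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("extract data")        # placeholder task body


def transform():
    print("transform data")      # placeholder task body


def train():
    print("train model")         # placeholder task body


with DAG(
    dag_id="ml_training",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # the scheduler triggers a run each day
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="train", python_callable=train)

    t1 >> t2 >> t3  # the dependencies form the directed acyclic graph
```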