Design and prepare a machine learning solution (20–25%) Flashcards

(35 cards)

1
Q

number of data formats

A

Three: Tabular/structured, Semi-structured, Unstructured

2
Q

Tabular/structured Data

A

Highly ordered data that follows a schema, e.g., CSV or Excel files

3
Q

Semi-Structured Data

A

Data organized as key-value pairs, like a dictionary; JSON is a common example

4
Q

Unstructured Data

A

Follows no schema or rules, e.g., images, video, audio, and documents

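The three formats above can be illustrated with a quick Python sketch (the sample records are made up for illustration):

```python
import csv
import io
import json

# Tabular/structured: every row follows the same schema (e.g., a CSV file).
csv_text = "id,name,price\n1,apple,0.50\n2,banana,0.25\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["name"])  # → apple

# Semi-structured: key-value pairs, like a dictionary (e.g., a JSON
# document); records don't all have to share the same keys.
doc = json.loads('{"id": 1, "name": "apple", "tags": ["fruit"]}')
print(doc["tags"])  # → ['fruit']

# Unstructured: raw bytes with no schema (e.g., image, video, or audio files).
raw = bytes([0x89, 0x50, 0x4E, 0x47])  # the first bytes of a PNG header
print(len(raw))  # → 4
```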
5
Q

What to do when your data format is unsuitable?

A

Transform it into a more suitable format

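As a minimal sketch of such a transformation, here is semi-structured JSON flattened into tabular CSV with Python's standard library (the records and field names are invented):

```python
import csv
import io
import json

# Semi-structured input: a list of JSON records.
records = json.loads('[{"id": 1, "name": "apple"}, {"id": 2, "name": "banana"}]')

# Transform to tabular: write the records out under a fixed schema.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name"], lineterminator="\n")
writer.writeheader()
writer.writerows(records)
print(buf.getvalue())
# id,name
# 1,apple
# 2,banana
```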
6
Q

Most common ways to store data for model training

A

Three: Azure Blob Storage, Azure Data Lake Storage, Azure SQL Database

7
Q

Azure Blob Storage

A

Can store structured and unstructured data; the cheapest option for storing unstructured data; the most basic storage option

8
Q

Azure Data Lake Storage Gen2

A

Stores CSVs and other files as unstructured data; the hierarchical namespace makes it easy to grant people access to specific items; capacity is virtually limitless

9
Q

Azure SQL Database

A

Stores only structured data, like a SQL table; ideal for data that doesn’t change over time

10
Q

What does Azure offer for other common database formats?

A

Azure Database for MySQL, Azure Database for PostgreSQL, Azure Database for MariaDB, Azure Cosmos DB Cassandra API, Azure Cosmos DB MongoDB API

11
Q

Need to store semi-structured data with an on-demand schema

A

Azure Cosmos DB SQL API

12
Q

Azure Synapse Analytics for pipelines

A

AKA Azure Synapse Pipelines. Data ingestion pipelines can be built through the UI or defined in JSON. Makes it easy to ETL data from a source into a data store

13
Q

Azure Databricks for pipelines

A

Code-first tool where you can use SQL, Python, or R to define pipelines. Fast because it runs on Spark clusters

14
Q

Azure Machine Learning for Pipelines

A

Provides auto-scaling compute clusters. Create ETL pipelines in the Designer or from multiple scripts. Not as scalable as Synapse or Databricks

15
Q

Six steps to train a model

A
  1. Define the problem
  2. Get the data
  3. Prepare the data
  4. Train the model
  5. Deploy the model
  6. Monitor the model
16
Q

What questions to ask when defining the problem

A

What is the desired model output, what type of model is needed, and what criteria make the model successful?

17
Q

Five common ML tasks

A

Regression, Classification, Forecasting, Computer Vision, NLP

18
Q

4 services to train ML models

A

Azure ML studio, Azure Databricks, Azure Synapse Analytics, Azure Cognitive Services

19
Q

Azure ML studio for training

A

Can use the UI, the Python SDK, or the CLI to manage workloads

20
Q

Azure Databricks for training

A

Uses Spark compute for efficient distributed processing

21
Q

Azure Synapse Analytics for training

A

Primarily for ETL, but does have ML capabilities. Works well at scale with big data

22
Q

Azure Cognitive Services for training

A

Collection of prebuilt ML models for tasks like image recognition. Models can be customized through transfer learning

23
Q

How to save time and effort with pre-built models

A

Azure Cognitive Services

24
Q

Keep ETL and data science within the same service

A

Azure Synapse Analytics or Azure Databricks

25
Q

Need full control of training and management

A

Azure ML or Azure Databricks

26
Q

Want to use the Python SDK

A

Azure ML

27
Q

Want to use a UI

A

Azure ML

28
Q

CPU vs GPU

A

CPU compute is cheaper and fine for smaller tasks; GPU compute is more expensive but better suited to bigger workloads

29
Q

General purpose or memory optimized?

A

General purpose for development, testing, and smaller datasets; memory optimized for larger datasets or for working in notebooks

30
Q

Spark compute

A

Spark clusters are distributed, so they work in parallel; they can use both GPUs and CPUs. Databricks and Azure Synapse Analytics use this

31
Q

Prediction options for endpoints

A

Real-time or batch

32
Q

Real-time predictions

A

Low latency: you need the answer NOW! Good for customer-facing services. Works on a single row of input data

33
Q

Batch predictions

A

Score new data in batches and save the results as a file. Good for forecasting or scheduled predictions. Works on multiple rows of data

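The two prediction modes can be sketched with a toy scoring function (the linear "model" and file name here are invented for illustration, not an Azure API):

```python
# Toy "model": scores a single row of features.
def score_row(row):
    return 2 * row["x"] + 1

# Real-time: one row in, one prediction out, with minimal latency.
def predict_realtime(row):
    return score_row(row)

# Batch: many rows in; results are collected and saved to a file.
def predict_batch(rows, path):
    results = [score_row(r) for r in rows]
    with open(path, "w") as f:
        f.writelines(f"{r}\n" for r in results)
    return results

print(predict_realtime({"x": 3}))  # → 7
print(predict_batch([{"x": 1}, {"x": 2}], "scores.txt"))  # → [3, 5]
```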
34
Q

Compute for real-time predictions

A

Container services such as Azure Container Instances (ACI) or Azure Kubernetes Service (AKS). The compute is always on and always costing money

35
Q

Compute for batch predictions

A

Compute clusters offer scoring in parallel; the compute spins down when not actively working