Deployment Flashcards

1
Q

Batch prediction

A

Periodically run your model on new data and cache the results in a database

Works if the universe of inputs is relatively small (e.g., one prediction per user)

Pros:
Simple to implement
Low latency to the user

Cons:
Doesn't scale to complex input types
Users don't get the most up-to-date predictions
Model staleness is hard to detect, and it happens frequently
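
A minimal sketch of a batch prediction job, assuming a SQLite database with `users` and `predictions` tables; the `load_model` and `fetch_features` helpers are placeholders for this sketch. A scheduler (e.g., cron) would run it periodically, and the app only reads the cached rows.

```python
import sqlite3

def load_model():
    # Placeholder for loading the real trained model.
    return lambda features: sum(features)

def fetch_features(conn, user_id):
    # Placeholder for whatever feature lookup the real pipeline does.
    return [user_id]

def run_batch_predictions(db_path="app.db"):
    model = load_model()
    conn = sqlite3.connect(db_path)
    user_ids = [row[0] for row in conn.execute("SELECT id FROM users")]
    for user_id in user_ids:
        score = model(fetch_features(conn, user_id))
        # Upsert so the serving path always reads the latest cached prediction.
        conn.execute(
            "INSERT OR REPLACE INTO predictions (user_id, score) VALUES (?, ?)",
            (user_id, float(score)),
        )
    conn.commit()
    conn.close()
```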

2
Q

Model in service

A

Package up your model and include it in your deployed web server
The web server loads the model and calls it to make predictions (store the weights on the web server, or on S3 and download them when needed)

Pros:
Reuses your existing infrastructure

Cons:
The web server and the model may be written in different languages
Models often change more frequently than web server code
A large model eats your web server's resources
Server hardware isn't optimized for the model - no GPU
Most important: the web server and the model may need to scale differently
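
A sketch of the "weights on S3" variant, with the model loaded once when the web server process starts; the bucket, key, and `MyModel` class here are placeholders, not names from the card.

```python
import boto3
import torch
import torch.nn as nn

S3_BUCKET = "my-models-bucket"  # placeholder
S3_KEY = "mymodel/weights.pt"   # placeholder
LOCAL_PATH = "/tmp/weights.pt"

class MyModel(nn.Module):
    # Placeholder architecture; the real model class lives in your codebase.
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 1)

    def forward(self, x):
        return self.linear(x)

# Download and load the weights once, when the web server process starts;
# every request handler in the existing server then reuses the in-memory model.
boto3.client("s3").download_file(S3_BUCKET, S3_KEY, LOCAL_PATH)
model = MyModel()
model.load_state_dict(torch.load(LOCAL_PATH, map_location="cpu"))
model.eval()

def predict(features):
    # Called from the web framework's request handler.
    with torch.no_grad():
        return model(torch.as_tensor(features, dtype=torch.float32)).tolist()
```
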
3
Q

Model as service

A

Most common deployment

Run your model on its own web server

The backend interacts with the model by making requests

Pros:
Dependable: model bugs are less likely to crash the web app

Scalable (pick optimal hardware)

Flexibility - easily reuse a model across multiple apps

Cons:
Adds latency
Adds infrastructure complexity
Now you have to run and maintain a model service (the ML engineer…)
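
A sketch of what "the backend makes requests" looks like from the app side; the service URL and JSON shape are assumptions for illustration.

```python
import requests

# Hypothetical model-service URL; to the backend, the model is just
# another HTTP dependency.
MODEL_SERVICE_URL = "http://model-service:8000/predict"

def get_prediction(features):
    response = requests.post(MODEL_SERVICE_URL, json={"features": features}, timeout=1.0)
    response.raise_for_status()
    return response.json()["prediction"]
```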

4
Q

REST APIs

A

Serving predictions in response to canonically formatted HTTP requests

Alternatives: gRPC, GraphQL
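
A minimal sketch of a prediction endpoint; the choice of FastAPI and the placeholder model are assumptions, not part of the card.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

def model(features):
    # Placeholder for the real loaded model.
    return sum(features)

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(request: PredictRequest):
    # Clients POST JSON like {"features": [1.0, 2.0]} and get JSON back.
    return {"prediction": model(request.features)}
```

Run with `uvicorn main:app` (assuming the file is named main.py).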

5
Q

Dependency management for model server

A

Model predictions depend on code, weights, and dependencies - all of them need to be on your web server

Hard to make consistent, hard to update

2 strategies:

  1. Constrain the dependencies for the model (one concrete approach is sketched below)
  2. Use containers
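
One way to apply strategy 1 is to export the model to a standardized format such as ONNX, so the server only needs an ONNX runtime rather than your full training stack; a sketch with a placeholder model:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 1))  # placeholder for the trained model
model.eval()

dummy_input = torch.randn(1, 10)

# The exported file can be served by any ONNX-compatible runtime, which
# constrains the dependencies the web server needs to just that runtime.
torch.onnx.export(model, dummy_input, "model.onnx")
```
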
6
Q

GPU or no GPU

A

Pros:
Same hardware as in training
Usually higher throughput

Cons:
More complex
More expensive - not the norm

7
Q

Concurrency - what is it?

A

What?
Multiple copies of the model running on different CPUs or cores

How?
Be careful about thread tuning - make sure each copy of the model runs with only the minimal number of threads it needs
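
A sketch of the thread-tuning point for PyTorch: when several copies of the model share one machine, cap the threads each copy uses so they don't oversubscribe the cores (the model here is a placeholder).

```python
import torch

# Must be set before the model does any work; with N worker processes on
# the machine, keep threads-per-copy minimal so the copies don't fight.
torch.set_num_threads(1)
torch.set_num_interop_threads(1)

model = torch.nn.Linear(10, 1)  # placeholder for the real model
model.eval()
```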

8
Q

What is Model distillation?

A

Train a smaller model to imitate your larger one

Can be finicky to do yourself, so it isn't used that often in practice

Exception - DistilBERT
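
A minimal sketch of the idea: train the student to match the teacher's softened output distribution. The temperature, models, and loss scaling here are assumptions, not details from the card.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with a temperature, then push the student
    # toward the teacher with KL divergence (scaled by T^2, as is common).
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature**2

# Inside a training loop (teacher frozen, student being trained):
#   loss = distillation_loss(student(x), teacher(x).detach())
```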

9
Q

What is quantization?

A

Used to reduce model size and increase speed:

What?
Execute some or all of the operations in your model with a smaller numerical representation than floats (e.g., INT8)

Some trade-offs with accuracy

How?

PyTorch and TensorFlow Lite have quantization built in
You can also run quantization-aware training, which often results in higher accuracy
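
A sketch of PyTorch's built-in post-training dynamic quantization (the model here is a placeholder):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
model.eval()

# Linear layers are replaced with INT8 versions: weights stored as int8,
# activations quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```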

10
Q

Caching (in model deployment)

A

Performance optimization.

What?
For some ML models, some inputs are more common than others.
Instead of running the model on them again, first check the cache

How?
Can get very fancy.
The basic way uses Python's built-in functools (functools.cache or functools.lru_cache)
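
The basic functools approach from the card, assuming the inputs are hashable; the body of `predict` is a placeholder for the real model call.

```python
import functools

@functools.lru_cache(maxsize=10_000)
def predict(text: str) -> float:
    # Placeholder for the expensive model call; repeated calls with the
    # same input are answered from the cache instead of rerunning the model.
    return float(len(text))
```

functools.cache (Python 3.9+) is the unbounded variant of the same decorator.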

11
Q

Batching (in model deployment)

A

What?

ML models often achieve higher throughput when predictions are computed in parallel, on batches of inputs

How?
Collect requests until you have a batch, run the prediction, and return results to the users.
Batch size needs to be tuned (throughput vs. latency)
Have a shortcut for when latency gets too long

You probably don't want to implement this yourself
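
A toy sketch of the mechanism, assuming a placeholder model that accepts a list of inputs; real serving frameworks implement this (plus the latency shortcut) far more robustly.

```python
import queue
import threading

MAX_BATCH_SIZE = 8
MAX_WAIT_SECONDS = 0.01  # latency shortcut: don't wait forever for a full batch

request_queue = queue.Queue()

def model(batch):
    # Placeholder: a real model would run one batched forward pass here.
    return [sum(x) for x in batch]

def batching_worker():
    while True:
        # Block for the first request, then collect more until the batch
        # is full or the wait times out.
        inputs, reply_queues = [], []
        features, reply_queue = request_queue.get()
        inputs.append(features)
        reply_queues.append(reply_queue)
        while len(inputs) < MAX_BATCH_SIZE:
            try:
                features, reply_queue = request_queue.get(timeout=MAX_WAIT_SECONDS)
            except queue.Empty:
                break
            inputs.append(features)
            reply_queues.append(reply_queue)
        for reply_queue, prediction in zip(reply_queues, model(inputs)):
            reply_queue.put(prediction)

threading.Thread(target=batching_worker, daemon=True).start()

def predict(features):
    # Called once per request; blocks until the batched prediction is ready.
    reply_queue = queue.Queue(maxsize=1)
    request_queue.put((features, reply_queue))
    return reply_queue.get()
```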

12
Q

Sharing the GPU

A

What?
Your model may not take up the whole GPU, so share it between multiple models

How?
Use a model serving solution that supports this out of the box (it is very hard to implement yourself)

13
Q

How to do horizontal scaling?

A

If you have too much traffic for one machine, you can split it among multiple machines

Spin up multiple copies of the service.
2 common methods:

  1. Container orchestration (Kubernetes)
  2. Serverless (AWS Lambda; see the sketch below)
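
A sketch of the serverless option in the shape AWS Lambda expects for a Python handler; the API Gateway style JSON event and the placeholder model are assumptions.

```python
import json

def load_model():
    # Placeholder: in practice, load weights packaged with the function
    # (or pulled from S3). Loading at module import lets warm invocations reuse it.
    return lambda features: sum(features)

model = load_model()

def lambda_handler(event, context):
    features = json.loads(event["body"])["features"]
    return {
        "statusCode": 200,
        "body": json.dumps({"prediction": model(features)}),
    }
```
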
14
Q

What is model deployment?

A

Serving is how you turn a model into something that can respond to requests; deployment is how you roll out, manage, and update these services.

How?
Gradually, instantly, and with deploy pipelines of models
Hopefully, your deployment library will take care of this for you.
