Deployment Flashcards

1
Q

Batch prediction

A

Periodically run your model on new data and cache the results in a database

Works if the universe of inputs is relatively small (e.g., one prediction per user)

Pros:
Simple to implement
Low latency to the user

Cons:
Doesn't scale to complex input types
Users don't get the most up-to-date predictions
Model staleness is hard to detect, and it happens frequently
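
A minimal sketch of a batch prediction job, assuming a SQLite database with `users` and `predictions` tables; the `load_model` and `fetch_features` helpers are placeholders for this sketch. A scheduler (e.g., cron) would run it periodically, and the app only reads the cached rows.

```python
import sqlite3

def load_model():
    # Placeholder for loading the real trained model.
    return lambda features: sum(features)

def fetch_features(conn, user_id):
    # Placeholder for whatever feature lookup the real pipeline does.
    return [user_id]

def run_batch_predictions(db_path="app.db"):
    model = load_model()
    conn = sqlite3.connect(db_path)
    user_ids = [row[0] for row in conn.execute("SELECT id FROM users")]
    for user_id in user_ids:
        score = model(fetch_features(conn, user_id))
        # Upsert so the serving path always reads the latest cached prediction.
        conn.execute(
            "INSERT OR REPLACE INTO predictions (user_id, score) VALUES (?, ?)",
            (user_id, float(score)),
        )
    conn.commit()
    conn.close()
```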

2
Q

Model in service

A

Package up your model and include it in your deployed web server
The web server loads the model and calls it to make predictions (store the weights on the web server, or on S3 and download them when needed)

Pros:
Reuses your existing infrastructure

Cons:
The web server and the model may be written in different languages
Models often change more frequently than web server code
A large model eats your web server's resources
Server hardware isn't optimized for the model - no GPU
Most important: the web server and the model may need to scale differently
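
A sketch of the "weights on S3" variant, with the model loaded once when the web server process starts; the bucket, key, and `MyModel` class here are placeholders, not names from the card.

```python
import boto3
import torch
import torch.nn as nn

S3_BUCKET = "my-models-bucket"  # placeholder
S3_KEY = "mymodel/weights.pt"   # placeholder
LOCAL_PATH = "/tmp/weights.pt"

class MyModel(nn.Module):
    # Placeholder architecture; the real model class lives in your codebase.
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 1)

    def forward(self, x):
        return self.linear(x)

# Download and load the weights once, when the web server process starts;
# every request handler in the existing server then reuses the in-memory model.
boto3.client("s3").download_file(S3_BUCKET, S3_KEY, LOCAL_PATH)
model = MyModel()
model.load_state_dict(torch.load(LOCAL_PATH, map_location="cpu"))
model.eval()

def predict(features):
    # Called from the web framework's request handler.
    with torch.no_grad():
        return model(torch.as_tensor(features, dtype=torch.float32)).tolist()
```
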
3
Q

Model as service

A

Most common deployment

Run your model on its own web server

The backend interacts with the model by making requests

Pros:
Dependable: model bugs are less likely to crash the web app

Scalable (pick optimal hardware)

Flexibility - easily reuse a model across multiple apps

Cons:
Adds latency
Adds infrastructure complexity
Now you have to run and maintain a model service (the ML engineer…)
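
A sketch of what "the backend makes requests" looks like from the app side; the service URL and JSON shape are assumptions for illustration.

```python
import requests

# Hypothetical model-service URL; to the backend, the model is just
# another HTTP dependency.
MODEL_SERVICE_URL = "http://model-service:8000/predict"

def get_prediction(features):
    response = requests.post(MODEL_SERVICE_URL, json={"features": features}, timeout=1.0)
    response.raise_for_status()
    return response.json()["prediction"]
```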

4
Q

REST APIs

A

Serving predictions in response to canonically formatted HTTP requests

Alternatives: gRPC, GraphQL
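
A minimal sketch of a prediction endpoint; the choice of FastAPI and the placeholder model are assumptions, not part of the card.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

def model(features):
    # Placeholder for the real loaded model.
    return sum(features)

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(request: PredictRequest):
    # Clients POST JSON like {"features": [1.0, 2.0]} and get JSON back.
    return {"prediction": model(request.features)}
```

Run with `uvicorn main:app` (assuming the file is named main.py).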

5
Q

Dependency management for model server

A

Model predictions depend on code, weights, and dependencies - all of them need to be on your web server

Hard to make consistent, hard to update

2 strategies:

  1. Constrain the dependencies for the model (one concrete approach is sketched below)
  2. Use containers
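
One way to apply strategy 1 is to export the model to a standardized format such as ONNX, so the server only needs an ONNX runtime rather than your full training stack; a sketch with a placeholder model:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 1))  # placeholder for the trained model
model.eval()

dummy_input = torch.randn(1, 10)

# The exported file can be served by any ONNX-compatible runtime, which
# constrains the dependencies the web server needs to just that runtime.
torch.onnx.export(model, dummy_input, "model.onnx")
```
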
6
Q

GPU or no GPU

A

Pros:
Same hardware as in training
Usually higher throughput

Cons:
More complex
More expensive - not the norm

7
Q

Concurrency - what is it?

A

What?
Multiple copies of the model running on different CPUs or cores

How?
Be careful about thread tuning - make sure each copy of the model runs with only the minimal number of threads it needs
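
A sketch of the thread-tuning point for PyTorch: when several copies of the model share one machine, cap the threads each copy uses so they don't oversubscribe the cores (the model here is a placeholder).

```python
import torch

# Must be set before the model does any work; with N worker processes on
# the machine, keep threads-per-copy minimal so the copies don't fight.
torch.set_num_threads(1)
torch.set_num_interop_threads(1)

model = torch.nn.Linear(10, 1)  # placeholder for the real model
model.eval()
```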

8
Q

What is Model distillation?

A

Train a smaller model to imitate your larger one

Can be finicky to do yourself, so it isn't used that often in practice

Exception - DistilBERT
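
A minimal sketch of the idea: train the student to match the teacher's softened output distribution. The temperature, models, and loss scaling here are assumptions, not details from the card.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with a temperature, then push the student
    # toward the teacher with KL divergence (scaled by T^2, as is common).
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature**2

# Inside a training loop (teacher frozen, student being trained):
#   loss = distillation_loss(student(x), teacher(x).detach())
```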

9
Q

What is quantization?

A

Used to reduce model size and increase speed:

What?
Execute some or all of the operations in your model with a smaller numerical representation than floats (e.g., INT8)

Some trade-offs with accuracy

How?

PyTorch and TensorFlow Lite have quantization built in
You can also run quantization-aware training, which often results in higher accuracy
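
A sketch of PyTorch's built-in post-training dynamic quantization (the model here is a placeholder):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
model.eval()

# Linear layers are replaced with INT8 versions: weights stored as int8,
# activations quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```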

10
Q

Caching (in model deployment)

A

Performance optimization.

What?
For some ML models, some inputs are more common than others.
Instead of running the model on them again, first check the cache

How?
Can get very fancy.
The basic way uses Python's built-in functools (functools.cache or functools.lru_cache)
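
The basic functools approach from the card, assuming the inputs are hashable; the body of `predict` is a placeholder for the real model call.

```python
import functools

@functools.lru_cache(maxsize=10_000)
def predict(text: str) -> float:
    # Placeholder for the expensive model call; repeated calls with the
    # same input are answered from the cache instead of rerunning the model.
    return float(len(text))
```

functools.cache (Python 3.9+) is the unbounded variant of the same decorator.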

11
Q

Batching (in model deployment)

A

What?

ML models often achieve higher throughput when predictions are computed in parallel, on batches of inputs

How?
Collect requests until you have a batch, run the prediction, and return results to the users.
Batch size needs to be tuned (throughput vs. latency)
Have a shortcut for when latency gets too long

You probably don't want to implement this yourself
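
A toy sketch of the mechanism, assuming a placeholder model that accepts a list of inputs; real serving frameworks implement this (plus the latency shortcut) far more robustly.

```python
import queue
import threading

MAX_BATCH_SIZE = 8
MAX_WAIT_SECONDS = 0.01  # latency shortcut: don't wait forever for a full batch

request_queue = queue.Queue()

def model(batch):
    # Placeholder: a real model would run one batched forward pass here.
    return [sum(x) for x in batch]

def batching_worker():
    while True:
        # Block for the first request, then collect more until the batch
        # is full or the wait times out.
        inputs, reply_queues = [], []
        features, reply_queue = request_queue.get()
        inputs.append(features)
        reply_queues.append(reply_queue)
        while len(inputs) < MAX_BATCH_SIZE:
            try:
                features, reply_queue = request_queue.get(timeout=MAX_WAIT_SECONDS)
            except queue.Empty:
                break
            inputs.append(features)
            reply_queues.append(reply_queue)
        for reply_queue, prediction in zip(reply_queues, model(inputs)):
            reply_queue.put(prediction)

threading.Thread(target=batching_worker, daemon=True).start()

def predict(features):
    # Called once per request; blocks until the batched prediction is ready.
    reply_queue = queue.Queue(maxsize=1)
    request_queue.put((features, reply_queue))
    return reply_queue.get()
```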

12
Q

Sharing the GPU

A

What?
Your model may not take up the whole GPU, so share it between multiple models

How?
Use a model serving solution that supports this out of the box (it is very hard to implement yourself)

13
Q

How to do horizontal scaling?

A

If you have too much traffic for one machine, you can split it among multiple machines

Spin up multiple copies of the service.
2 common methods:

  1. Container orchestration (Kubernetes)
  2. Serverless (AWS Lambda; see the sketch below)
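
A sketch of the serverless option in the shape AWS Lambda expects for a Python handler; the API Gateway style JSON event and the placeholder model are assumptions.

```python
import json

def load_model():
    # Placeholder: in practice, load weights packaged with the function
    # (or pulled from S3). Loading at module import lets warm invocations reuse it.
    return lambda features: sum(features)

model = load_model()

def lambda_handler(event, context):
    features = json.loads(event["body"])["features"]
    return {
        "statusCode": 200,
        "body": json.dumps({"prediction": model(features)}),
    }
```
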
14
Q

What is model deployment?

A

Serving is how you turn a model into something that can respond to requests; deployment is how you roll out, manage, and update these services.

How?
Gradually, instantly, and with deploy pipelines of models
Hopefully, your deployment library will take care of this for you.
