L8 - Cluster Management Flashcards

1
Q

What is resource allocation?

A

how much CPU/DRAM/disk/network bandwidth to allocate to each app

2
Q

What is resource assignment?

A

What should run on which physical nodes?

3
Q

What is private resource allocation? What is its other name?

A

each app receives a private, static set of resources

static partitioning

4
Q

Advantages of static partitioning?

A
  1. simplicity
  2. performance isolation
  3. allows specialised HW (e.g. not everyone needs a GPU)
5
Q

Disadvantages of static partitioning?

A
  1. low utilisation
  2. hard to handle failures
  3. hard to maintain

regarding 2 and 3: it is not clear how to migrate an app to another machine

6
Q

What 3 properties do we want the scheduler to fulfil in case of shared resource assignment?

A
  1. Fairness
  2. Efficient resource usage
  3. Isolation
7
Q

List the algorithms from the lecture for shared resource assignment.

A
  1. Fair queueing (extends 1) (for a single resource)
  2. Weighted max-min fair queueing (extends 2)
  3. Dominant resource fairness
  4. Token bucket
8
Q

What does work conserving mean? Which property implies that?

A

Resources should not remain idle while there are users whose demand is not fully satisfied.

This is implied by “Efficient resource usage”.

9
Q

Why do we want work conserving schedulers?

A

It keeps resources well-utilised.

It maximises overall throughput across different users.

10
Q

Name a strategy that is not work conserving.

A

time division multiplexing

11
Q

What are the different notions of fairness?

A
  1. Max-min fairness
  2. Dominant resource fairness
12
Q

What are the properties of max-min fairness?

A

share guarantee: each user gets at least 1/n of the resource, unless their demand is less

strategy-proof: users are not better off by asking for more than they need
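
extra (sketch): for a single resource, a max-min fair allocation can be computed by "water-filling": split the remaining capacity equally among unsatisfied users, fully satisfy anyone who needs less than that share, and repeat. The function name `max_min_fair` and the example numbers are made up for illustration; they are not from the lecture.

```python
def max_min_fair(capacity, demands):
    """Water-filling max-min fair allocation of one resource."""
    alloc = [0.0] * len(demands)
    active = [i for i, d in enumerate(demands) if d > 0]
    remaining = float(capacity)
    while active and remaining > 1e-12:
        share = remaining / len(active)
        # Users whose residual demand fits in the equal share get all they asked for.
        satisfied = [i for i in active if demands[i] - alloc[i] <= share]
        if not satisfied:
            # Everyone left wants more than the equal share: split evenly and stop.
            for i in active:
                alloc[i] += share
            break
        for i in satisfied:
            remaining -= demands[i] - alloc[i]
            alloc[i] = float(demands[i])
        active = [i for i in active if i not in satisfied]
    return alloc


# Capacity 10 shared by three users: the small demand is met in full,
# and the leftover is split equally between the other two.
print(max_min_fair(10, [2, 8, 8]))  # [2.0, 4.0, 4.0]
```

In the example, asking for 80 instead of 8 would not earn a user more than 4, which is the strategy-proofness property at work.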

13
Q

What does DRF try to achieve?

A

identify the dominant resource share of each user and maximise the minimum dominant share across all users
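
extra (sketch): one way to picture DRF is progressive filling: keep launching one task for the user with the smallest dominant share until that user's next task no longer fits. The code below is a simplified illustration under that assumption; the function name `drf` and the cluster sizes and task shapes (9 CPUs and 18 GB, tasks of <1 CPU, 4 GB> and <3 CPU, 1 GB>) are example values, not taken from the lecture.

```python
def drf(capacity, demands):
    """Progressive-filling sketch of Dominant Resource Fairness.

    capacity: total amount of each resource, e.g. [cpus, gb_of_memory]
    demands:  per-user resource vector needed by ONE task of that user
    Returns (tasks launched per user, dominant share per user).
    """
    n_res = len(capacity)
    used = [0.0] * n_res
    tasks = [0] * len(demands)
    dom_share = [0.0] * len(demands)
    while True:
        # Pick the user with the lowest dominant share (ties go to the lowest index).
        u = min(range(len(demands)), key=lambda i: dom_share[i])
        d = demands[u]
        # Stop as soon as the chosen user's next task does not fit.
        if any(used[r] + d[r] > capacity[r] for r in range(n_res)):
            break
        for r in range(n_res):
            used[r] += d[r]
        tasks[u] += 1
        # Dominant share = largest fraction of any single resource held by the user.
        dom_share[u] = max(tasks[u] * d[r] / capacity[r] for r in range(n_res))
    return tasks, dom_share


tasks, shares = drf([9, 18], [[1, 4], [3, 1]])
print(tasks)  # [3, 2]
```

Both users end with a dominant share of 2/3 (memory for the first user, CPU for the second). In this run 4 GB of memory is left unallocated when the loop stops.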

14
Q

What is the drawback of DRF?

A

not work conserving

15
Q

What is the issue with max-min fairness?

A

With max-min fairness, a user’s allocation depends on the demands of the other users sharing the resource, so there is no performance predictability.

16
Q

What is the goal of token buckets?

A

guarantee a baseline bandwidth, but also allow bounded bursts

17
Q

How does the token bucket idea work?

A

Control traffic by delaying requests until they accumulate sufficient tokens.
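
extra (sketch): a minimal token bucket, assuming tokens refill at a fixed `rate` per second up to a capacity `burst`, and that a request which finds too few tokens is delayed until the missing tokens have accumulated. The class and method names are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    def __init__(self, rate, burst):
        self.rate = rate      # tokens added per second (the baseline bandwidth)
        self.burst = burst    # bucket capacity (the bound on burst size)
        self.tokens = burst   # start with a full bucket
        self.last = 0.0       # time of the last refill, in seconds

    def _refill(self, now):
        # Add tokens for the elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def delay_for(self, cost, now):
        """Charge `cost` tokens at time `now`; return how long the request must wait."""
        self._refill(now)
        self.tokens -= cost  # a negative balance means the request waits for future tokens
        return 0.0 if self.tokens >= 0 else -self.tokens / self.rate


bucket = TokenBucket(rate=10, burst=5)
print(bucket.delay_for(5, now=0.0))  # burst of 5 goes through immediately
print(bucket.delay_for(2, now=0.0))  # bucket is empty, so this request is delayed
print(bucket.delay_for(1, now=1.0))  # bucket has refilled (capped at 5 tokens)
```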

18
Q

What does resource assignment try to optimise?

A
  1. performance
  2. resource utilisation
19
Q

Explain the first step of resource assignment.

A

Filter machines that satisfy hard constraints

e.g., VM may need a machine with a GPU

20
Q

Explain the second step of resource assignment.

A

Rank candidate nodes to find the machine that best satisfies soft constraints

e.g., best-fit to avoid resource fragmentation
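
extra (sketch): the two steps of resource assignment (filter on hard constraints, then rank on soft constraints) could look like this. The dictionary fields, the `place` function, and the crude "leftover" score that simply adds spare CPU and memory are assumptions made for illustration only.

```python
def place(task, machines):
    """Two-step placement: filter on hard constraints, then rank by best fit."""
    # Step 1: keep only machines that satisfy every hard constraint.
    candidates = [
        m for m in machines
        if m["free_cpu"] >= task["cpu"]
        and m["free_mem"] >= task["mem"]
        and (m["gpu"] or not task["needs_gpu"])
    ]
    if not candidates:
        return None  # no machine can host this task

    # Step 2: best fit, i.e. prefer the machine left with the least spare capacity,
    # so that large machines are not fragmented by small tasks.
    def leftover(m):
        return (m["free_cpu"] - task["cpu"]) + (m["free_mem"] - task["mem"])

    return min(candidates, key=leftover)["name"]


machines = [
    {"name": "a", "free_cpu": 16, "free_mem": 64, "gpu": True},
    {"name": "b", "free_cpu": 4, "free_mem": 8, "gpu": True},
    {"name": "c", "free_cpu": 2, "free_mem": 4, "gpu": False},
]
print(place({"cpu": 2, "mem": 4, "needs_gpu": True}, machines))   # b
print(place({"cpu": 2, "mem": 4, "needs_gpu": False}, machines))  # c
```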

21
Q

List different methods for cluster management system architecture.

A
  1. centralised
  2. distributed
  3. hierarchical e.g. two-level
22
Q

Next questions are about Borg. First, what is Borg?

A

Google’s centralised cluster manager

23
Q

What does Borgmaster do?

A

It is the main scheduler.
It polls Borglets every few seconds

extra: 5 replicas

24
Q

What does Borglet do?

A

A local agent on every machine: manages and monitors the tasks and resources on that machine.

extra: about 10k heterogeneous machines per cell, each running a Borglet

25
Q

What strategies does Borg deploy to achieve high utilisation?

A
  1. admission control
  2. efficient task-packing
  3. over-commitment
  4. machine sharing
26
Q

What is Kubernetes?

A

Cluster management for containerised applications;

  • manage complexity of container lifecycle and allocating/setting up hardware resources for the containers.
  • like an OS for your cloud cluster
27
Q

List container orchestration primitives!

A
  1. Resource scaling
  2. Resource allocation
  3. Load balancing
  4. Lifecycle and health
  5. Naming and discovery
  6. Storage volumes
  7. Logging and monitoring
  8. Debugging and introspection
  9. Identity and authorization
28
Q

Resource scaling

A

make sets of containers bigger or smaller

29
Q

Resource allocation

A

decide where my containers should run

30
Q

Load balancing

A

distribute traffic across a set of containers

31
Q

Lifecycle and health

A

keep my containers running despite failures

32
Q

Naming and discovery

A

find where my containers are now

33
Q

Storage volumes

A

provide data to containers

34
Q

Logging and monitoring

A

track what’s happening with my containers

35
Q

Debugging and introspection

A

enter or attach to containers

36
Q

Identity and authorization

A

control who can do things to my containers

37
Q

What do the Kubernetes containers do?

A

Handle package dependencies

38
Q

What is a pod?

A

A pod is the unit of scheduling and migration in Kubernetes.

a group of containers that share the same properties

39
Q

List those properties!

A
  1. Lifecycle: live together, die together
  2. Network: same IP address, same routes, iptables
  3. Storage volumes: can share data
  4. Intended to run a common task
40
Q

Kubernetes service?

A

A group of pods that work together

extra: provides load balancing among pod replicas

41
Q

How do you control pod placement in Kubernetes?

A

use labels and selectors
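
extra (sketch): conceptually, an equality-based selector matches every object whose labels contain all of the selector's key/value pairs. The snippet below models that idea in plain Python; `matching_nodes` and the label names are hypothetical, and it does not cover Kubernetes' set-based selectors.

```python
def matching_nodes(selector, nodes):
    """Return the names of nodes whose labels contain every selector pair."""
    return [
        name
        for name, labels in nodes.items()
        if all(labels.get(key) == value for key, value in selector.items())
    ]


nodes = {
    "n1": {"disk": "ssd", "gpu": "true"},
    "n2": {"disk": "hdd"},
}
print(matching_nodes({"disk": "ssd"}, nodes))  # ['n1']
```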

42
Q

How do you keep N pods running?

A

use ReplicaSets: a layer on top of the Pod API that ensures N copies of a pod are running

43
Q

What does the Horizontal Pod Autoscaler do?

A

automatically scale pods as needed
- based on CPU utilisation (or custom metrics)
- can set user-defined min/max bounds
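
extra (sketch): the core scaling rule documented for the Horizontal Pod Autoscaler is desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), clamped to the configured bounds. The function below only illustrates that rule; its name and parameters are made up, and it leaves out details of the real controller such as the tolerance band and stabilisation windows.

```python
import math


def desired_replicas(current_replicas, current_util, target_util,
                     min_replicas, max_replicas):
    """Scale proportionally to how far utilisation is from the target."""
    raw = math.ceil(current_replicas * current_util / target_util)
    # Clamp to the user-defined min/max bounds.
    return max(min_replicas, min(max_replicas, raw))


# 4 pods at 90% CPU with a 60% target: scale out to 6 pods.
print(desired_replicas(4, 90, 60, min_replicas=2, max_replicas=10))  # 6
```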

44
Q

What is a potential problem with relying only on CPU utilization as a scaling metric?

A

It works well for compute-bound apps, but the bottleneck may be I/O instead.

45
Q

What other metrics would you consider for auto-scaling besides CPU utilization?

A
  1. memory capacity
  2. memory BW
  3. network BW
46
Q

What properties does resource isolation try to achieve?

A
  1. Applications must not be able to affect each other’s performance
  2. Repeated runs of the same application should see similar behaviour
47
Q

What are the resource allocation mechanisms in Kubernetes?

A

Request: How much of a resource (CPU, RAM) the container is asking to use, with a strong guarantee of availability

Limit: Max amount of a resource the container can access

48
Q

Does the scheduler overcommit to requests?

A

No.

49
Q

List 3 Kubernetes Quality of Service classes.

A
  1. Guaranteed: highest protection
  2. Burstable: medium protection
  3. Best effort: lowest protection
50
Q

Relation of request and limit for Guaranteed class?

A

request > 0 && limit == request

51
Q

Relation of request and limit for Burstable class?

A

request > 0 && limit > request

52
Q

Relation of request and limit for Best effort class?

A

request == 0
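
extra (sketch): the three request/limit rules from the cards above, written as one function for a single resource. This follows the simplified rules on these cards only; the real Kubernetes classification looks at CPU and memory across all containers of a pod. The function name `qos_class` is made up for illustration.

```python
def qos_class(request, limit=None):
    """Classify one resource setting using the simplified rules on the cards."""
    if request == 0:
        return "BestEffort"   # nothing requested: lowest protection
    if limit == request:
        return "Guaranteed"   # request > 0 and limit == request
    return "Burstable"        # request > 0 and the limit differs (or is unset)


print(qos_class(2, 2))  # Guaranteed
print(qos_class(1, 4))  # Burstable
print(qos_class(0))     # BestEffort
```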

53
Q

What are the advantages of centralised design?

A

can make globally optimal decisions

54
Q

What are the drawbacks of centralised design?

A

scalability: hard to enforce consistency

55
Q

Name 2 two-level cluster managers

A

Mesos and YARN

56
Q

How does Mesos work?

A

Lecture on 05.04
Min: 3.5

57
Q

List two distributed cluster management algorithms.

A

Omega and Sparrow

58
Q

List two new challenges that serverless brings to cluster management, besides resource allocation and assignment.

A
  1. resource scaling: How many containers (“slots”) to keep warm for a function?
  2. request routing: To which node and “slot” do we send a particular invocation?
59
Q

What does Quasar try to solve?

A

Over-provisioning

60
Q

How does Quasar solve over-provisioning?

A

Don’t ask users for allocation request/resource demand.

They don’t really know it anyway.

61
Q

What do the users specify in this case? (Quasar)

A

performance goals

62
Q

What does the cluster manager do in this case? (Quasar)

A

profiles applications and dynamically adjusts resource allocations

63
Q

How does the cluster manager understand resource/performance tradeoffs? (Quasar)

A

It combines the following:

  1. Small signal from a short run of a new application
  2. Large signal from previously run applications
64
Q

What does the cluster manager do at the end? (still Quasar)

A

For each new application, it needs to recommend a resource allocation and assignment.

65
Q

How does one build a recommender system?

A

collaborative filtering

66
Q

What is collaborative filtering?

A

Predict the preferences of new users given the preferences of other users, using SVD and PQ reconstruction.

67
Q

What needs to be considered to recommend resource allocations to applications? (4)

A
  1. scale-out
  2. scale-up
  3. HW heterogeneity
  4. Interference
68
Q

What does scale out mean?

A

Use 4 nodes or a single node?

69
Q

What does scale up mean?

A

Use an 8-core VM or a single-core VM?

70
Q

List 3 steps of Quasar’s functionality.

A

Step 1: short profiling runs produce initial performance data.

Step 2: collaborative filtering techniques fill in missing data

Step 3: Greedy scheduler uses output to find the number and type of resources that maximise utilisation and performance.

71
Q

To summarise, what are the challenges of using shared clusters?

A
  1. Resource allocation: how many resources should an app get?
  2. Resource assignment: which specific resources does an app get?
  3. Variability: within an app (different phases), within datasets, and load