Distributed Machine Learning Flashcards

(16 cards)

1
Q

What are the 7 different communication patterns?

A
  1. Push
  2. Pull
  3. Broadcast
  4. Reduce
  5. All-reduce
  6. Wait
  7. Barrier
2
Q

What is the push communication pattern?

A

Machine A sends data to Machine B.
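
A minimal sketch of push using mpi4py (the library and the two-process setup are assumptions, not part of the card), launched with e.g. mpiexec -n 2 python push.py:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:                                  # Machine A
    comm.send({"weights": [0.1, 0.2]}, dest=1, tag=0)     # push data to B
elif comm.Get_rank() == 1:                                # Machine B
    data = comm.recv(source=0, tag=0)                     # receive the pushed data
```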

3
Q

What is the pull communication pattern?

A

Machine B requests data from Machine A.
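
A minimal sketch of pull, again assuming mpi4py and two processes: Machine B initiates the exchange by sending a request, then receives A's reply.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 1:                                        # Machine B initiates the pull
    comm.send("request", dest=0, tag=1)              # ask A for its data
    data = comm.recv(source=0, tag=2)                # receive A's reply
elif rank == 0:                                      # Machine A serves the request
    comm.recv(source=1, tag=1)                       # wait for B's request
    comm.send({"weights": [0.1, 0.2]}, dest=1, tag=2)
```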

4
Q

What is the broadcast communication pattern?

A

Machine A sends data to many machines.
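
A minimal mpi4py sketch (library choice assumed): rank 0 plays Machine A, and every rank ends up with the same object.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
params = {"lr": 0.01} if comm.Get_rank() == 0 else None
params = comm.bcast(params, root=0)   # rank 0 sends; bcast returns the same object on all ranks
```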

5
Q

What is the reduce communication pattern?

A

Compute some reduction of data from multiple machines and materialise the result on one machine.
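
A minimal mpi4py sketch (library choice assumed), using a sum as the reduction:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
local_value = comm.Get_rank() + 1                      # each machine holds some local data
total = comm.reduce(local_value, op=MPI.SUM, root=0)   # sum materialised only on rank 0
# total is the global sum on rank 0 and None on every other rank
```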

6
Q

What is the all-reduce communication pattern?

A

Compute some reduction of data from multiple machines and materialise the result on all of those machines.
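
The same mpi4py sketch as for reduce, but with allreduce: the sum is now materialised on every rank.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
local_value = comm.Get_rank() + 1                    # each machine holds some local data
total = comm.allreduce(local_value, op=MPI.SUM)      # sum materialised on every rank
```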

7
Q

What is the wait communication pattern?

A

One machine pauses its computation and waits for a signal from another machine.
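
A minimal mpi4py sketch (library choice assumed): a blocking receive acts as the wait, and the matching send is the signal.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
if comm.Get_rank() == 1:
    comm.recv(source=0, tag=9)       # blocks here until rank 0 sends the signal
    # ... continue once signalled ...
elif comm.Get_rank() == 0:
    comm.send(None, dest=1, tag=9)   # the signal; the payload itself is irrelevant
```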

8
Q

What is the barrier communication pattern?

A

Many machines wait until all of them have reached a given point in their execution, then all continue from there.
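
A minimal mpi4py sketch (library choice assumed):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
# ... each rank does its own work here ...
comm.Barrier()   # no rank proceeds past this line until every rank has reached it
# ... all ranks continue together from here ...
```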

9
Q

What is the key principle in distributed computing?

A

Overlapping computation and communication.
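
A minimal two-process mpi4py sketch of this principle (library and process count are assumptions): non-blocking isend/irecv start the communication, useful computation happens while the data is in flight, and wait() only blocks for whatever has not finished yet.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
other = 1 - rank                                            # assumes exactly two ranks

req = comm.isend(f"block from {rank}", dest=other, tag=3)   # start sending (non-blocking)
incoming = comm.irecv(source=other, tag=3)                  # start receiving (non-blocking)

local_result = sum(i * i for i in range(10_000))            # useful work while data is in flight

req.wait()                                                  # ensure our send has completed
remote_block = incoming.wait()                              # returns the received object
```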

10
Q

How does Stochastic Gradient Descent (SGD) work with all-reduce?

A
  1. The SGD update equation is split into M partial sums (one sub-equation per machine)
  2. Each of the M machines is assigned one of the partial sums and computes it on its share of the data
  3. After all local gradients are computed, the outer sum is performed using all-reduce
  4. After the all-reduce, the whole sum is present on all machines and can be used to update the model parameters
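
A minimal data-parallel sketch with mpi4py and NumPy (the library choices, the synthetic least-squares objective, and the fixed per-worker data shard are illustrative assumptions):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, world = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(rank)
X = rng.normal(size=(256, 10))                     # this worker's shard of the data
y = X @ np.ones(10) + 0.1 * rng.normal(size=256)

w = np.zeros(10)                                   # identical initial parameters on every worker
lr = 0.1

for step in range(100):
    # local gradient of the mean-squared error on this worker's shard
    grad = 2.0 * X.T @ (X @ w - y) / len(y)

    # all-reduce: sum the M local gradients, result materialised on every worker
    comm.Allreduce(MPI.IN_PLACE, grad, op=MPI.SUM)
    grad /= world                                  # average over workers

    w -= lr * grad                                 # every worker applies the identical update
```
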
11
Q

What is the advantage of SGD with All-reduce?

A

The algorithm is easy to reason about, since it is equivalent to minibatch SGD and the same hyperparameters can be used. It is also easy to implement.

12
Q

What is the disadvantage of SGD with All-reduce?

A

While the communication for the all-reduce is happening, the workers are idle. We are not overlapping computation and communication.

13
Q

How is parallel k-means implemented?

A

Parallel k-means is split into 3 phases:
1. Map
2. Combine
3. Reduce

14
Q

How is the map step done in parallel k-means?

A
  1. Compute the distances between the local points and the k centroids
  2. Assign each point to its nearest cluster
  3. Send the intermediate data to the combiner
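
A minimal NumPy sketch of the map step (the function name kmeans_map and the use of NumPy are illustrative assumptions):

```python
import numpy as np

def kmeans_map(points, centroids):
    """Map: assign each local point to its nearest centroid."""
    # distances: (n_points, k) matrix of squared Euclidean distances
    distances = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assignments = distances.argmin(axis=1)   # cluster index per point
    return assignments                       # intermediate data handed to the combiner
```
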
15
Q

How is the combine step done in parallel k-means?

A
  1. Compute the centroid statistics of each cluster from the local points
  2. Send the local sums of the values across dimensions and the number of samples
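
A minimal NumPy sketch of the combine step (kmeans_combine is an illustrative name; it consumes the assignments produced by the map sketch above):

```python
import numpy as np

def kmeans_combine(points, assignments, k):
    """Combine: per-cluster local sums and counts, to be sent to the reducer."""
    dims = points.shape[1]
    local_sums = np.zeros((k, dims))
    local_counts = np.zeros(k, dtype=int)
    for j in range(k):
        members = points[assignments == j]
        local_sums[j] = members.sum(axis=0)   # per-dimension sum of this cluster's local points
        local_counts[j] = len(members)        # number of local samples in this cluster
    return local_sums, local_counts
```
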
16
Q

How is the reduce step done in parallel k-means?

A
  1. Test for convergence
  2. Update the cluster centroids
  3. Return to the Map step until convergence is reached
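
A minimal NumPy sketch of the reduce step (kmeans_reduce is an illustrative name; all_sums and all_counts are assumed to be the per-worker outputs of the combine step gathered on the reducer):

```python
import numpy as np

def kmeans_reduce(all_sums, all_counts, old_centroids, tol=1e-4):
    """Reduce: combine the workers' partial sums, update centroids, test convergence."""
    total_sums = sum(all_sums)                    # element-wise sum over the (k, dims) arrays
    total_counts = sum(all_counts)                # element-wise sum over the (k,) count arrays
    new_centroids = total_sums / np.maximum(total_counts, 1)[:, None]  # avoid division by zero
    converged = np.linalg.norm(new_centroids - old_centroids) < tol
    return new_centroids, converged               # if not converged, loop back to the Map step
```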