Distributed Machine Learning Flashcards
(16 cards)
What are the 7 different communication patterns?
- Push
- Pull
- Broadcast
- Reduce
- All-reduce
- Wait
- Barrier
What is the push communication pattern?
Machine A sends data to Machine B.
What is the pull communication pattern?
Machine B requests data from Machine A
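A minimal mpi4py sketch of push and pull, assuming an MPI runtime (e.g. launched with `mpiexec -n 2`); here a blocking recv stands in for the pull request:
```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Push: machine A (rank 0) sends data to machine B (rank 1).
    comm.send({"weights": [0.1, 0.2]}, dest=1, tag=0)
elif rank == 1:
    # Pull: machine B (rank 1) obtains the data from rank 0.
    data = comm.recv(source=0, tag=0)
```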
What is the broadcast communication pattern?
Machine A sends data to many machines.
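A minimal broadcast sketch with mpi4py (the parameter dictionary is a made-up example):
```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Rank 0 ("Machine A") holds the data; all other ranks receive a copy.
params = {"lr": 0.01} if rank == 0 else None
params = comm.bcast(params, root=0)   # every rank now holds the same params
```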
What is the reduce communication pattern?
Compute some reduction of data from multiple machines and materialise the result on one machine.
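A minimal reduce sketch with mpi4py, summing a made-up per-machine value onto rank 0 only:
```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local_loss = float(rank + 1)                          # each machine's local value
total = comm.reduce(local_loss, op=MPI.SUM, root=0)   # materialised on rank 0 only
if rank == 0:
    print("global loss:", total)
else:
    assert total is None   # other ranks do not receive the result
```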
What is the all reduce communication pattern?
Compute some reduction of data from multiple machines and materialise the result on all those machines.
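A minimal all-reduce sketch with mpi4py; every rank ends up holding the global sum:
```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
local = comm.Get_rank() + 1.0               # each machine's local contribution
total = comm.allreduce(local, op=MPI.SUM)   # result materialised on every rank
```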
What is the wait communication pattern?
One machine pauses its computation and waits for a signal from another machine.
What is the barrier communication pattern?
Many machines wait until all of them have reached a given point in their execution, then all continue from there.
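A minimal sketch of wait and barrier with mpi4py (the rank numbers and the "go" signal are made-up):
```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Wait: rank 1 blocks until it receives a signal message from rank 0.
if rank == 0:
    comm.send("go", dest=1, tag=1)
elif rank == 1:
    signal = comm.recv(source=0, tag=1)   # pauses here until the signal arrives

# Barrier: no rank continues past this line until all ranks have reached it.
comm.Barrier()
```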
What is the key principle in distributed computing?
Overlapping computation and communication.
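One possible illustration, assuming an MPI-3 build that supports non-blocking collectives via mpi4py's Iallreduce; the layer names and shapes are placeholders:
```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
grad_layer2 = np.random.rand(1024)   # gradient that is already computed
grad_layer1 = np.empty(1024)

# Start communicating the layer-2 gradient in the background...
req = comm.Iallreduce(MPI.IN_PLACE, grad_layer2, op=MPI.SUM)

# ...while still computing the layer-1 gradient (placeholder computation).
grad_layer1[:] = np.random.rand(1024)

req.Wait()   # layer-2 gradient is now summed across all machines
```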
How does Stochastic Gradient Descent work with All-reduce?
1. The SGD gradient equation is split into M sub-equations.
2. Each of the M machines is assigned one of the summations.
3. After all local gradients are computed, the outer sum is performed using all-reduce.
4. After the all-reduce, the whole sum is present on all machines and can be used to update the model parameters (see the sketch below).
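A minimal data-parallel SGD sketch with mpi4py and numpy; the least-squares model, data shards, and learning rate are placeholder assumptions:
```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()
lr = 0.1

w = np.zeros(10)             # identical initial parameters on every machine
X = np.random.rand(32, 10)   # this machine's shard of the minibatch
y = np.random.rand(32)

for step in range(100):
    # Local gradient of a least-squares loss on this machine's shard.
    grad = X.T @ (X @ w - y) / len(y)

    # All-reduce: sum the partial gradients, then average over machines.
    comm.Allreduce(MPI.IN_PLACE, grad, op=MPI.SUM)
    grad /= size

    # Every machine applies the same update, so parameters stay in sync.
    w -= lr * grad
```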
What is the advantage of SGD with All-reduce?
The algorithm is easy to reason about as it’s equivalent to minibatch SGD. The same hyperparameters can be used. The algorithm is easy to implement.
What is the disadvantage of SGD with All-reduce?
While the communication for the all-reduce is happening, the workers are idle. We are not overlapping computation and communication.
How is parallel k-means implemented?
Parallel k-means is split into 3 phases:
1. Map
2. Combine
3. Reduce
How is the map step done in parallel k-means?
- Compute distances between the local points and the k centroids
- Assign each point to its nearest cluster
- Send the intermediate data to the combiner (see the sketch below)
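A possible map-step sketch in numpy (the function name and array shapes are assumptions):
```python
import numpy as np

def kmeans_map(points, centroids):
    """Assign each local point to its nearest centroid."""
    # Pairwise distances between local points and the k current centroids.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = np.argmin(dists, axis=1)   # cluster assignment per local point
    return labels                       # intermediate data passed to the combiner
```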
How is the combine step done in parallel k-means?
- Compute the partial centroid information for each cluster from the local points
- Send the local per-cluster sums across dimensions and the number of samples (see the sketch below)
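A possible combine-step sketch in numpy, computing local per-cluster sums and counts before any communication (names and shapes are assumptions):
```python
import numpy as np

def kmeans_combine(points, labels, k):
    sums = np.zeros((k, points.shape[1]))   # local sum of coordinates per cluster
    counts = np.zeros(k)                    # local number of points per cluster
    for j in range(k):
        mask = labels == j
        sums[j] = points[mask].sum(axis=0)
        counts[j] = mask.sum()
    return sums, counts                     # sent on to the reduce step
```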
How is the reduce step done in parallel k-means?
- Test convergence
- Update the cluster centroids
- Return to the Map step until convergence is reached (see the sketch below)
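A possible reduce-step sketch with mpi4py, aggregating the partial sums with an all-reduce, updating the centroids, and testing convergence (the tolerance is an assumption):
```python
from mpi4py import MPI
import numpy as np

def kmeans_reduce(comm, partial_sums, partial_counts, old_centroids, tol=1e-4):
    # In-place all-reduce turns the local partial results into global totals.
    comm.Allreduce(MPI.IN_PLACE, partial_sums, op=MPI.SUM)
    comm.Allreduce(MPI.IN_PLACE, partial_counts, op=MPI.SUM)

    # Update centroids from the global sums and counts (guard against empty clusters).
    new_centroids = partial_sums / np.maximum(partial_counts, 1)[:, None]

    # Converged when the centroids stop moving; otherwise go back to the map step.
    converged = np.linalg.norm(new_centroids - old_centroids) < tol
    return new_centroids, converged
```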