Parallelism with PyTorch Flashcards
Source: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html (17 cards)
T/F: DistributedDataParallel works only if the model is on a single device.
False.
The model can also span multiple devices (model parallelism). In that case the device_ids argument must be left unset when wrapping the model in DDP.
What are DDP processes?
They are workers. Each receives a replica of the model and batches of data.
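A minimal sketch of spawning two such worker processes on one machine. The Gloo backend (so it runs on CPU), the toy Linear model, the port, and the name demo_worker are illustrative choices, not from the card:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def demo_worker(rank, world_size):
    # placeholder rendezvous address/port for a single-machine run
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # every worker builds the same model; DDP keeps the replicas in sync
    model = torch.nn.Linear(10, 10)
    ddp_model = DDP(model)

    # each worker trains on its own batch of data
    out = ddp_model(torch.randn(20, 10))
    out.sum().backward()   # gradients are averaged across workers during this call

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(demo_worker, args=(2,), nprocs=2)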
How are gradients synced with DDP?
DDP registers an autograd hook on each model parameter.
The hook fires during the backward pass and triggers an all-reduce, so each parameter's gradient ends up as the mean of the gradients computed by the different processes.
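What the hooks achieve can be illustrated with a manual equivalent: averaging every parameter's gradient across processes after backward(). DDP actually does this inside autograd hooks with bucketed all-reduces; the loop below is only a conceptual sketch and the helper name average_gradients is made up:

```python
import torch.distributed as dist

def average_gradients(model):
    # assumes backward() has already been called in every process
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum over all processes
            param.grad /= world_size                           # turn the sum into a mean
```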
What are the differences between DataParallel (DP) and DistributedDataParallel (DDP)?
- DP is single-process, multi-threaded, and limited to a single machine (using its local GPUs). DDP is multi-process and can span multiple machines.
- DDP supports model parallelism when a model does not fit on a single GPU; DP currently does not (see the sketch below).
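A hedged sketch contrasting the two wrappers. The GPU IDs, the helper names, and the assumption that the process group is already initialized for the DDP case are all illustrative:

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_with_dp(model: nn.Module) -> nn.DataParallel:
    # DP: single process; each forward pass replicates the model across the
    # local GPUs in device_ids, scatters inputs and gathers outputs.
    return nn.DataParallel(model.cuda(), device_ids=[0, 1])

def wrap_with_ddp(model: nn.Module, rank: int) -> DDP:
    # DDP: called once in each process (possibly on different machines),
    # with the model already moved to that process's GPU.
    return DDP(model.to(rank), device_ids=[rank])
```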
Which is generally faster between DP and DDP?
DDP. Each process has its own Python interpreter (no GIL contention), and DDP avoids DP's per-iteration overhead of replicating the model, scattering inputs, and gathering outputs.
How does DDP work with model-parallel?
Every process runs its own model-parallel replica (spread over several GPUs), and DDP synchronizes those replicas so that the processes COLLECTIVELY behave like data parallelism applied to model-parallel replicas.
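A sketch along the lines of the tutorial's ToyMpModel example: each process splits its replica over two GPUs and DDP keeps the replicas in sync. The rank-to-GPU mapping assumes two GPUs per process:

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class ToyMpModel(nn.Module):
    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.net1 = nn.Linear(10, 10).to(dev0)   # first half on one GPU
        self.net2 = nn.Linear(10, 5).to(dev1)    # second half on another GPU

    def forward(self, x):
        x = torch.relu(self.net1(x.to(self.dev0)))
        return self.net2(x.to(self.dev1))

def build_ddp_mp_model(rank):
    # e.g. process 0 uses GPUs 0 and 1, process 1 uses GPUs 2 and 3
    dev0, dev1 = rank * 2, rank * 2 + 1
    mp_model = ToyMpModel(dev0, dev1)
    # device_ids must NOT be set when the module spans multiple devices
    return DDP(mp_model)
```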
Does DDP work only with multiple GPUs or also with multiple CPUs?
Not only with GPUs: DDP also supports CPU training through the Gloo backend, while NCCL is the usual backend for multi-GPU training.
torch.multiprocessing by itself only spawns the worker processes; it does not synchronize gradients.
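A minimal initialization sketch reflecting that distinction. Picking NCCL when GPUs are available and Gloo otherwise is a common convention, and the helper name, address, and port are placeholders:

```python
import os
import torch
import torch.distributed as dist

def init_distributed(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    # NCCL for GPU training, Gloo for CPU-only training
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)
```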
What does DistributedSampler do?
It partitions the dataset across processes, so each process (GPU) trains on its own non-overlapping shard of the data each epoch (sketch below).
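A sketch of plugging DistributedSampler into a DataLoader. The toy TensorDataset, batch size, and helper names are made up, and it assumes the process group is already initialized:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def make_loader():
    # toy dataset; in practice this is your real Dataset
    dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))
    # picks this process's shard using the current rank and world size
    sampler = DistributedSampler(dataset)
    return DataLoader(dataset, batch_size=32, sampler=sampler), sampler

def run_epochs(num_epochs: int) -> None:
    loader, sampler = make_loader()
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)   # reshuffles the shards every epoch
        for xb, yb in loader:
            pass                   # forward/backward on this process's shard
```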
What are process groups?
Subsets of the processes that take part in collective communication together. init_process_group creates the default group containing all processes; new_group builds smaller groups (sketch below).
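A small sketch creating a sub-group with dist.new_group, assuming a default group of at least two processes is already initialized:

```python
import torch
import torch.distributed as dist

def reduce_in_subgroup():
    # new_group must be called by ALL processes, even those not in the new group
    subgroup = dist.new_group(ranks=[0, 1])
    t = torch.ones(1)
    if dist.get_rank() in (0, 1):
        # only ranks 0 and 1 take part in this all-reduce
        dist.all_reduce(t, op=dist.ReduceOp.SUM, group=subgroup)
    return t
```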
T/F: 1 process always corresponds to 1 GPU.
False in general, but one process per GPU is the recommended setup: each process then works exclusively on its own device, avoiding contention and extra data movement between GPUs inside a process. When DDP is combined with model parallelism, a single process drives several GPUs.
T/F: DistributedDataParallel wraps around a nn.Module.
True. The nn.Module is passed as the first argument. The other important argument is device_ids, the list of GPU IDs where the model sits (left unset for CPU or multi-device models).
What does torchrun do?
It launches and supervises the training processes, setting the environment variables DDP reads (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT), and handles failures and interruptions.
Specifically, combined with application-level checkpointing, it lets training restart from the last snapshot taken before the interruption (sketch below).
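A hedged sketch of the restart pattern this enables. The snapshot filename, the save/resume helpers, saving only from rank 0, and the script name train.py are illustrative choices, not prescribed by torchrun:

```python
# Launch with e.g.:  torchrun --standalone --nproc_per_node=4 train.py
import os
import torch

SNAPSHOT_PATH = "snapshot.pt"   # assumed filename

def maybe_resume(model, optimizer):
    # torchrun restarts the script after a failure; pick up where we left off
    if os.path.exists(SNAPSHOT_PATH):
        snapshot = torch.load(SNAPSHOT_PATH, map_location="cpu")
        model.load_state_dict(snapshot["model"])
        optimizer.load_state_dict(snapshot["optimizer"])
        return snapshot["epoch"] + 1
    return 0

def save_snapshot(model, optimizer, epoch):
    # typically done by rank 0 only; torchrun sets the RANK env variable
    if int(os.environ.get("RANK", "0")) == 0:
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "epoch": epoch}, SNAPSHOT_PATH)
```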
Which kind of parallelism does DDP implement in PyTorch?
A: Single program/multiple data
B: Multiple programs/single data
C: Multiple programs/multiple data
A: Single program/multiple data (SPMD): every process runs the same script on a different shard of the data.
Why can DDP be faster than DP even on a single machine?
- DDP runs the optimizer step in every process separately. This is redundant, but it avoids broadcasting the updated parameters from one process to the others.
- Each DDP process has its own Python interpreter, so there is no slowdown from the GIL.
What are the two different communication methods between processes?
- Point-to-point (e.g. send/recv between two specific processes)
- Collective (e.g. all_reduce or broadcast over the whole group); see the sketch below
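A minimal sketch of both styles, assuming an initialized group of at least two processes; the tensor values are arbitrary:

```python
import torch
import torch.distributed as dist

def communicate(rank: int):
    # Point-to-point: one sender, one receiver.
    t = torch.zeros(1)
    if rank == 0:
        t += 42
        dist.send(t, dst=1)          # rank 0 sends to rank 1
    elif rank == 1:
        dist.recv(t, src=0)          # rank 1 receives from rank 0

    # Collective: every process in the group participates.
    x = torch.ones(1) * rank
    dist.all_reduce(x, op=dist.ReduceOp.SUM)   # x now holds the sum over all ranks
    return t, x
```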
What is the rank of a process?
Its unique integer ID within the process group, from 0 to world_size - 1. Rank 0 is commonly used for logging and saving checkpoints.
What is the world size?
The total number of processes participating in the job, i.e. the size of the process group.
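A tiny sketch reading both values once the process group is initialized:

```python
import torch.distributed as dist

rank = dist.get_rank()              # this process's ID: 0 .. world_size - 1
world_size = dist.get_world_size()  # total number of processes in the group
print(f"process {rank} of {world_size}")
```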