Parallelism with PyTorch Flashcards
Source: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html (17 cards)
T/F: DistributedDataParallel works only if the model is on a single device.
False.
The model can also span multiple devices (model parallelism). In that case the device_ids argument must be left unset when wrapping the model in DDP.
What are DDP processes?
They are workers. Each receives a replica of the model and batches of data.
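A minimal sketch of spawning two such worker processes on one machine. The Gloo backend (so it runs on CPU), the toy Linear model, the port, and the name demo_worker are illustrative choices, not from the card:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def demo_worker(rank, world_size):
    # placeholder rendezvous address/port for a single-machine run
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # every worker builds the same model; DDP keeps the replicas in sync
    model = torch.nn.Linear(10, 10)
    ddp_model = DDP(model)

    # each worker trains on its own batch of data
    out = ddp_model(torch.randn(20, 10))
    out.sum().backward()   # gradients are averaged across workers during this call

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(demo_worker, args=(2,), nprocs=2)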
How are gradients synced with DDP?
DDP registers an autograd hook on each model parameter.
The hook fires during the backward pass and triggers an all-reduce, so each parameter's gradient ends up as the mean of the gradients computed by the different processes.
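What the hooks achieve can be illustrated with a manual equivalent: averaging every parameter's gradient across processes after backward(). DDP actually does this inside autograd hooks with bucketed all-reduces; the loop below is only a conceptual sketch and the helper name average_gradients is made up:

```python
import torch.distributed as dist

def average_gradients(model):
    # assumes backward() has already been called in every process
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum over all processes
            param.grad /= world_size                           # turn the sum into a mean
```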
What are the differences between DataParallel (DP) and DistributedDataParallel (DDP)?
- DP is single-process, multi-threaded, and limited to a single machine (using its local GPUs). DDP is multi-process and can span multiple machines.
- DDP supports model parallelism when a model does not fit on a single GPU; DP currently does not (see the sketch below).
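A hedged sketch contrasting the two wrappers. The GPU IDs, the helper names, and the assumption that the process group is already initialized for the DDP case are all illustrative:

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_with_dp(model: nn.Module) -> nn.DataParallel:
    # DP: single process; each forward pass replicates the model across the
    # local GPUs in device_ids, scatters inputs and gathers outputs.
    return nn.DataParallel(model.cuda(), device_ids=[0, 1])

def wrap_with_ddp(model: nn.Module, rank: int) -> DDP:
    # DDP: called once in each process (possibly on different machines),
    # with the model already moved to that process's GPU.
    return DDP(model.to(rank), device_ids=[rank])
```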
Which is generally faster between DP and DDP?
DDP. Each process has its own Python interpreter (no GIL contention), and DDP avoids DP's per-iteration overhead of replicating the model, scattering inputs, and gathering outputs.
How does DDP work with model-parallel?
Every process runs its own model-parallel replica (spread over several GPUs), and DDP synchronizes those replicas so that the processes COLLECTIVELY behave like data parallelism applied to model-parallel replicas.
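A sketch along the lines of the tutorial's ToyMpModel example: each process splits its replica over two GPUs and DDP keeps the replicas in sync. The rank-to-GPU mapping assumes two GPUs per process:

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class ToyMpModel(nn.Module):
    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.net1 = nn.Linear(10, 10).to(dev0)   # first half on one GPU
        self.net2 = nn.Linear(10, 5).to(dev1)    # second half on another GPU

    def forward(self, x):
        x = torch.relu(self.net1(x.to(self.dev0)))
        return self.net2(x.to(self.dev1))

def build_ddp_mp_model(rank):
    # e.g. process 0 uses GPUs 0 and 1, process 1 uses GPUs 2 and 3
    dev0, dev1 = rank * 2, rank * 2 + 1
    mp_model = ToyMpModel(dev0, dev1)
    # device_ids must NOT be set when the module spans multiple devices
    return DDP(mp_model)
```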
Does DDP work only with multiple GPUs or also with multiple CPUs?
Not only with GPUs: DDP also supports CPU training through the Gloo backend, while NCCL is the usual backend for multi-GPU training.
torch.multiprocessing by itself only spawns the worker processes; it does not synchronize gradients.
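A minimal initialization sketch reflecting that distinction. Picking NCCL when GPUs are available and Gloo otherwise is a common convention, and the helper name, address, and port are placeholders:

```python
import os
import torch
import torch.distributed as dist

def init_distributed(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    # NCCL for GPU training, Gloo for CPU-only training
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)
```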
What does DistributedSampler do?
It partitions the dataset across processes, so each process (GPU) trains on its own non-overlapping shard of the data each epoch (sketch below).
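A sketch of plugging DistributedSampler into a DataLoader. The toy TensorDataset, batch size, and helper names are made up, and it assumes the process group is already initialized:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def make_loader():
    # toy dataset; in practice this is your real Dataset
    dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))
    # picks this process's shard using the current rank and world size
    sampler = DistributedSampler(dataset)
    return DataLoader(dataset, batch_size=32, sampler=sampler), sampler

def run_epochs(num_epochs: int) -> None:
    loader, sampler = make_loader()
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)   # reshuffles the shards every epoch
        for xb, yb in loader:
            pass                   # forward/backward on this process's shard
```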
What are process groups?
Subsets of the processes that take part in collective communication together. init_process_group creates the default group containing all processes; new_group builds smaller groups (sketch below).
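A small sketch creating a sub-group with dist.new_group, assuming a default group of at least two processes is already initialized:

```python
import torch
import torch.distributed as dist

def reduce_in_subgroup():
    # new_group must be called by ALL processes, even those not in the new group
    subgroup = dist.new_group(ranks=[0, 1])
    t = torch.ones(1)
    if dist.get_rank() in (0, 1):
        # only ranks 0 and 1 take part in this all-reduce
        dist.all_reduce(t, op=dist.ReduceOp.SUM, group=subgroup)
    return t
```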
T/F: 1 process always corresponds to 1 GPU.
False in general, but one process per GPU is the recommended setup: each process then works exclusively on its own device, avoiding contention and extra data movement between GPUs inside a process. When DDP is combined with model parallelism, a single process drives several GPUs.
T/F: DistributedDataParallel wraps around a nn.Module.
True. The nn.Module is passed as the first argument. The other important argument is device_ids, the list of GPU IDs where the model sits (left unset for CPU or multi-device models).
What does torchrun do?
It launches and supervises the training processes, setting the environment variables DDP reads (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT), and handles failures and interruptions.
Specifically, combined with application-level checkpointing, it lets training restart from the last snapshot taken before the interruption (sketch below).
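A hedged sketch of the restart pattern this enables. The snapshot filename, the save/resume helpers, saving only from rank 0, and the script name train.py are illustrative choices, not prescribed by torchrun:

```python
# Launch with e.g.:  torchrun --standalone --nproc_per_node=4 train.py
import os
import torch

SNAPSHOT_PATH = "snapshot.pt"   # assumed filename

def maybe_resume(model, optimizer):
    # torchrun restarts the script after a failure; pick up where we left off
    if os.path.exists(SNAPSHOT_PATH):
        snapshot = torch.load(SNAPSHOT_PATH, map_location="cpu")
        model.load_state_dict(snapshot["model"])
        optimizer.load_state_dict(snapshot["optimizer"])
        return snapshot["epoch"] + 1
    return 0

def save_snapshot(model, optimizer, epoch):
    # typically done by rank 0 only; torchrun sets the RANK env variable
    if int(os.environ.get("RANK", "0")) == 0:
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "epoch": epoch}, SNAPSHOT_PATH)
```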
Which kind of parallelism does DDP implement in PyTorch?
A: Single program/multiple data
B: Multiple programs/single data
C: Multiple programs/multiple data
A: Single program/multiple data (SPMD): every process runs the same script on a different shard of the data.
Why can DDP be faster than DP even on a single machine?
- DDP runs the optimizer step in every process separately. This is redundant, but it avoids broadcasting the updated parameters from one process to the others.
- Each DDP process has its own Python interpreter, so there is no slowdown from the GIL.
What are the two different communication methods between processes?
- Point-to-point (e.g. send/recv between two specific processes)
- Collective (e.g. all_reduce or broadcast over the whole group); see the sketch below
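A minimal sketch of both styles, assuming an initialized group of at least two processes; the tensor values are arbitrary:

```python
import torch
import torch.distributed as dist

def communicate(rank: int):
    # Point-to-point: one sender, one receiver.
    t = torch.zeros(1)
    if rank == 0:
        t += 42
        dist.send(t, dst=1)          # rank 0 sends to rank 1
    elif rank == 1:
        dist.recv(t, src=0)          # rank 1 receives from rank 0

    # Collective: every process in the group participates.
    x = torch.ones(1) * rank
    dist.all_reduce(x, op=dist.ReduceOp.SUM)   # x now holds the sum over all ranks
    return t, x
```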
What is the rank of a process?
Its unique integer ID within the process group, from 0 to world_size - 1. Rank 0 is commonly used for logging and saving checkpoints.
What is the world size?
The total number of processes participating in the job, i.e. the size of the process group.
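A tiny sketch reading both values once the process group is initialized:

```python
import torch.distributed as dist

rank = dist.get_rank()              # this process's ID: 0 .. world_size - 1
world_size = dist.get_world_size()  # total number of processes in the group
print(f"process {rank} of {world_size}")
```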