Parallelism with PyTorch Flashcards

Source: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html (17 cards)

1
Q

T/F: DistributedDataParallel works only if the model is on a single device.

A

False.

The wrapped model can also span multiple devices (sharding the model across GPUs within one process); in that case device_ids must not be passed to DDP.

2
Q

What are DDP processes?

A

They are workers. Each receives a replica of the model and batches of data.
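
A minimal sketch of how such worker processes are typically launched with torch.multiprocessing; the Gloo backend and the trivial worker body are illustrative, not the tutorial's exact code:

    import os
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def worker(rank, world_size):
        # One DDP worker: a real script would build its own model replica here,
        # wrap it in DDP, and consume its own share of the batches.
        os.environ.setdefault("MASTER_ADDR", "localhost")
        os.environ.setdefault("MASTER_PORT", "29500")
        dist.init_process_group("gloo", rank=rank, world_size=world_size)
        print(f"worker {rank} of {world_size} is up")
        dist.destroy_process_group()

    if __name__ == "__main__":
        world_size = 2  # e.g. two replicas on one machine
        mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)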

3
Q

How are gradients synced with DDP?

A

DDP registers an autograd hook on each model parameter.
The hook fires during the backward pass and triggers an all-reduce, so each parameter's gradient is replaced by the mean of the gradients computed by the different processes.
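
Conceptually, the hooks perform an all-reduce on each gradient. A simplified hand-rolled equivalent is sketched below (real DDP buckets gradients and overlaps this communication with the backward pass; the function name is illustrative):

    import torch.distributed as dist

    def sync_grads_manually(model, world_size):
        # Roughly what DDP's hooks achieve after loss.backward():
        # every process ends up with the mean gradient, so all replicas
        # apply identical parameter updates.
        for param in model.parameters():
            if param.grad is not None:
                dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
                param.grad /= world_size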

4
Q

What are the differences between DataParallel (DP) and DistributedDataParallel (DDP)?

A
  1. DP is single-process, multi-threaded, and only works on a single machine (though it can use multiple GPUs). DDP is multi-process and supports single- and multi-machine training.
  2. DDP supports model parallelism when a model does not fit on a single GPU; DP does not (see the sketch below).
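
A minimal sketch of the two wrappers, assuming model is an existing nn.Module, rank is this process's GPU index, and (for the DDP line) init_process_group has already been called:

    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    # DataParallel: one process, one thread per GPU, single machine only.
    dp_model = nn.DataParallel(model)

    # DistributedDataParallel: called once in each worker process.
    ddp_model = DDP(model.to(rank), device_ids=[rank])
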
5
Q

Which is generally faster between DP and DDP?

A

DDP. There is no GIL contention (separate processes instead of threads), no per-iteration model replication, and no scattering of inputs / gathering of outputs onto a single device.

6
Q

How does DDP work with model-parallel?

A

Each process runs its own model-parallel replica (the model spans several GPUs inside that process), and DDP synchronizes these replicas so the processes COLLECTIVELY perform data parallelism across them.
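
A sketch close to the tutorial's toy model-parallel module: each process builds one such model on two of its GPUs and wraps it in DDP (the per-rank device assignments are illustrative):

    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    class ToyMpModel(nn.Module):
        # The two halves of the model live on two different GPUs of this process.
        def __init__(self, dev0, dev1):
            super().__init__()
            self.dev0, self.dev1 = dev0, dev1
            self.net1 = nn.Linear(10, 10).to(dev0)
            self.relu = nn.ReLU()
            self.net2 = nn.Linear(10, 5).to(dev1)

        def forward(self, x):
            x = self.relu(self.net1(x.to(self.dev0)))
            return self.net2(x.to(self.dev1))

    # e.g. rank 0 owns GPUs 0 and 1, rank 1 owns GPUs 2 and 3:
    # mp_model = ToyMpModel("cuda:0", "cuda:1")
    # ddp_mp_model = DDP(mp_model)  # no device_ids for a multi-GPU module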

7
Q

Does DDP work only with multiple GPUs or also with multiple CPUs?

A

DDP works with both. Use the NCCL backend for GPU training and the Gloo backend for CPU processes; torch.multiprocessing (or torchrun) is only the launcher that starts the worker processes in either case.
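
A sketch of picking the backend at initialization; rank and world_size are assumed to be provided by the launcher:

    import torch
    import torch.distributed as dist

    # NCCL is the recommended backend for GPU training, Gloo for CPU processes.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)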

8
Q

What does DistributedSampler do?

A

It splits the dataset across the DDP processes, so that each process (GPU) works on its own non-overlapping subset of the data in every epoch.
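
A minimal sketch, assuming dataset is an existing Dataset, the process group is already initialized, and num_epochs is a placeholder:

    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    # Each process gets a different, non-overlapping shard of the dataset.
    sampler = DistributedSampler(dataset)  # uses the default process group
    loader = DataLoader(dataset, batch_size=32, shuffle=False, sampler=sampler)

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # re-shuffle the shards every epoch
        for batch in loader:
            ...  # forward/backward on this replica only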

9
Q

What are process groups?

A

A process group is the set of processes that can discover each other and communicate; it is what the collective operations run over. init_process_group creates the default group containing all the processes of the job.
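
The tutorial creates the default process group with helpers along these lines (the address and port values are placeholders):

    import os
    import torch.distributed as dist

    def setup(rank, world_size):
        # Rendezvous information so the processes can find each other.
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "12355"
        # Creates the default group containing all world_size processes.
        dist.init_process_group("gloo", rank=rank, world_size=world_size)

    def cleanup():
        dist.destroy_process_group()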

10
Q

T/F: 1 process always corresponds to 1 GPU.

A

True in the recommended setup: spawn one process per GPU (the exception is model parallelism, where one process drives several GPUs). Keeping one GPU per process avoids extra data synchronization between GPUs within the same process.
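
In that setup each process pins itself to its own device, for example via the LOCAL_RANK variable that torchrun sets (a sketch):

    import os
    import torch

    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun per worker
    torch.cuda.set_device(local_rank)           # this process only drives "its" GPU
    device = torch.device("cuda", local_rank)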

11
Q

T/F: DistributedDataParallel wraps around a nn.Module.

A

True. The nn.Module is passed as the first argument. The other important argument is device_ids, the list of GPU IDs on which the module sits (it must be left unset when the module spans multiple GPUs).

12
Q

What does torchrun do?

A

It launches the worker processes and handles failures/interruptions.

Specifically, it sets the environment variables the workers need (RANK, LOCAL_RANK, WORLD_SIZE, ...) and restarts the workers on failure; combined with snapshot/checkpoint logic in the training script, training resumes from the last snapshot saved before the interruption.
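
A hedged sketch of the snapshot logic such a script pairs with torchrun; the path, dictionary keys, and helper names are placeholders, not the tutorial's exact code:

    import os
    import torch

    SNAPSHOT_PATH = "snapshot.pt"  # placeholder path

    def save_snapshot(ddp_model, epoch):
        # Typically done by rank 0 every few epochs.
        torch.save({"model_state": ddp_model.module.state_dict(), "epoch": epoch},
                   SNAPSHOT_PATH)

    def maybe_load_snapshot(ddp_model):
        # After torchrun restarts the workers, each one resumes from here
        # instead of starting from scratch.
        if os.path.exists(SNAPSHOT_PATH):
            snapshot = torch.load(SNAPSHOT_PATH, map_location="cpu")
            ddp_model.module.load_state_dict(snapshot["model_state"])
            return snapshot["epoch"] + 1
        return 0

Such a script is started with, e.g., torchrun --standalone --nproc_per_node=4 train.py (train.py being a placeholder file name).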

13
Q

Which paradigm does parallelism in PyTorch follow?
A: Single program / multiple data
B: Multiple programs / single data
C: Multiple programs / multiple data

A

A: Single program/multiple data

14
Q

Why can DDP be faster than DP even on a single machine?

A
  1. DDP runs the optimizer step on every process separately. The computation is redundant, but it avoids broadcasting the updated parameters from one process to the others.
  2. Each DDP process has its own Python interpreter, so there is no slowdown from GIL contention.
15
Q

What are the two different communication methods between processes?

A
  • Point-to-point
  • Collective (see the sketch below)
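
A sketch with one call of each kind, assuming an initialized process group and a known rank:

    import torch
    import torch.distributed as dist

    # Point-to-point: exactly one sender and one receiver.
    if rank == 0:
        dist.send(torch.tensor([42.0]), dst=1)
    elif rank == 1:
        buf = torch.zeros(1)
        dist.recv(buf, src=0)

    # Collective: every process in the group participates.
    x = torch.ones(1)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)  # x now holds world_size on every rank
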
16
Q

What is the rank of a process?

A

The unique ID of the process within the process group, ranging from 0 to world_size - 1 (rank 0 is conventionally treated as the main process).

17
Q

What is the world size?

A

The total number of processes participating in the job (i.e., the number of processes in the process group).
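
Both values can be queried once the process group is initialized (a minimal sketch):

    import torch.distributed as dist

    # Valid after init_process_group has been called in this process.
    rank = dist.get_rank()              # this process's ID: 0 .. world_size - 1
    world_size = dist.get_world_size()  # total number of processes in the group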