Conversation

xhzhao commented Mar 21, 2018

  • overview
    The goal of this PR is to add MPI support to torch.nn.parallel.DistributedDataParallel().
    AFAIK, PyTorch DDP only supports the nccl and gloo backends, and I think it would be great to support the mpi backend when the user only has CPUs, especially for researchers with supercomputer access.
    reference issue

  • code change:
    I only added a few lines to the init() and forward() functions, without changing the CUDA code (apart from indentation).

  • validation:
    This code passed my test case for the mnist example: https://github.com/xhzhao/examples/tree/master/mnist
    I'm also looking to add one more case to pytorch/test/test_distributed.py, but I could not find a way to launch an MPI task from Python; I hope this gets fixed in the future (see the launch note after this list).
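For context, here is a minimal sketch of the gradient-averaging idea behind CPU data parallelism with torch.distributed; the function name average_gradients and the explicit world_size argument are illustrative, not this PR's actual code:

import torch.distributed as dist

def average_gradients(model, world_size):
    # After loss.backward(), all-reduce each gradient across ranks and
    # divide by the number of workers so every rank holds the mean gradient.
    for param in model.parameters():
        if param.requires_grad and param.grad is not None:
            dist.all_reduce(param.grad.data)
            param.grad.data /= world_size

On the launch question: a torch.distributed job using the mpi backend is normally started from the shell, e.g. mpirun -np 4 python mnist.py, rather than from inside Python, which is likely why the test harness cannot drive it directly.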

ezyang commented Mar 21, 2018

CC @teng-li

@pytorchbot test this please

NB: I don't think CI is covering mpi atm; that should be fixed

apaszke commented Mar 21, 2018

Thanks for the PR, but I think we should look for alternative ways to add support for CPU training. Right now this adds a third possible code path, for a third backend, and that's just going to be unmaintainable in the long run. It's not using most of the code in this file, which suggests it might be better to make it a class in a separate file.

xhzhao commented Mar 23, 2018

@apaszke Thanks for the feedback. I'm looking to make it a class in a separate file, but I'm wondering what the class name should be. Can I use a new name like torch.nn.parallel.MPIDistributedDataParallel()?

The key point is that the user would have to call a different API for distributed CPU and GPU training.
Does that make sense?

xhzhao commented Mar 30, 2018

Any proposal for the new class name to support distributed CPU training?

soumith commented Mar 30, 2018

@xhzhao torch.nn.parallel.DistributedDataParallelMPI() seems better

apaszke commented Mar 30, 2018

Sorry for the late reply. Actually, it shouldn't have MPI in the name. It will work with any backend that supports CPU, so it's more like DistributedDataParallelCPU.

teng-li commented Mar 30, 2018

@apaszke then the existing DDP should probably be renamed to DistributedDataParallelGPU?

apaszke commented Mar 30, 2018

Idk, we can do that if you feel strongly about it. I don't mind leaving it as is.

teng-li commented Mar 30, 2018

@apaszke I will do it later, once the CPU DDP gets merged

teng-li commented Mar 30, 2018

@xhzhao Besides MPI, the TCP and Gloo backends also support CPU collective ops; you could mention the supported backends in the module comments later.
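A possible shape for that module comment, as a rough sketch (the wording is mine, not the merged docstring):

class DistributedDataParallelCPU(torch.nn.Module):
    r"""Implements distributed data parallelism for modules on the CPU.

    Works with any torch.distributed backend that provides CPU
    collective ops, e.g. ``mpi``, ``gloo``, and ``tcp``.
    """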

ezyang added the "awaiting response (this tag is deprecated)" label Mar 30, 2018
xhzhao commented Apr 1, 2018

@teng-li will do

ezyang commented Apr 2, 2018

@pytorchbot test this please

apaszke left a comment

This needs a test.

Also, is there any difference between allreduce_params in the regular DistributedDataParallel and what you have here? It would be better to avoid code duplication and refactor it into a shared function.

def forward(self, *inputs, **kwargs):
    if self.first_call:
        self.weight_broadcast()
        self.first_call = False

self.first_call = True

def allreduce_params():
    if (self.needs_reduction):
        ...
        if param.requires_grad and param.grad is not None:
            tp = type(param.data)
            if tp not in buckets:
                buckets[tp] = []
            buckets[tp].append(param)

        for tp in buckets:
            ...
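Filling in the elided lines, the bucketed all-reduce presumably looks roughly like the sketch below. allreduce_params is a closure defined inside the wrapper module, so self is in scope; the needs_reduction reset and the coalescing via torch._utils are my reconstruction of the hidden context, not the verbatim diff:

def allreduce_params():
    # Group gradients by tensor type, flatten each bucket into one
    # contiguous tensor, all-reduce it, average, and copy the result back.
    if self.needs_reduction:
        self.needs_reduction = False
        buckets = {}
        for param in self.module.parameters():
            if param.requires_grad and param.grad is not None:
                tp = type(param.data)
                if tp not in buckets:
                    buckets[tp] = []
                buckets[tp].append(param)
        for tp in buckets:
            grads = [param.grad.data for param in buckets[tp]]
            coalesced = torch._utils._flatten_dense_tensors(grads)
            dist.all_reduce(coalesced)
            coalesced /= dist.get_world_size()
            synced = torch._utils._unflatten_dense_tensors(coalesced, grads)
            for buf, s in zip(grads, synced):
                buf.copy_(s)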

.. warning::
    This module works only with the ``mpi`` backend.
    Other backends such as ``gloo`` and ``tcp`` are not tested yet.

xhzhao commented Apr 8, 2018

@apaszke thanks for your feedback, I updated the code again:

  • This class should just have its own test, and we should remove this warning
    Done. A test case is set up for this new interface under the name test_DistributedDataParallelCPU(),
    which uses test_DistributedDataParallel() as a reference.
    The CI log shows that the tcp and gloo backends passed this test case, but the mpi backend does not exercise it. I built PyTorch from source on my PC and the mpi backend passed this test case.
  • Remove first_call and broadcast weights in init()
    Done
  • Code sharing for allreduce_params
    This function is completely different from the GPU implementation, so I don't think we can reuse it.
  • Please don't use parentheses in if statements
    Done
  • nit: please make buckets a defaultdict(list)
    Done (see the sketch after this list)
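For the defaultdict nit, the change amounts to something like this (variable names follow the fragments quoted above):

from collections import defaultdict

buckets = defaultdict(list)  # no explicit "if tp not in buckets" check needed
for param in self.module.parameters():
    if param.requires_grad and param.grad is not None:
        buckets[type(param.data)].append(param)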

xhzhao commented Apr 11, 2018

Any update on the code review?

apaszke commented Apr 11, 2018

I will get back to you this week, sorry for the delay

@Stonesjtu

Well, I'm not sold on the name DDP-CPU. MPI != CPU.

As far as I know, there are at least three CUDA-aware MPI implementations available, and I managed to compile PyTorch with OpenMPI 1.10.7; most MPI primitives seem to work on GPU memory.

apaszke left a comment

Two things, and it should be good to merge.

# Shuffle the input so that DDP input is different
input_cpu = input_cpu[torch.randperm(global_bs)]

self._barrier()


if param.requires_grad:
    param.register_hook(allreduce_hook)

def weight_broadcast(self):
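The hook registration above is from the diff; the body of weight_broadcast is hidden, but it presumably amounts to broadcasting every parameter from rank 0, roughly:

def weight_broadcast(self):
    # Send rank 0's parameters to every other rank so all workers
    # start training from identical weights.
    for param in self.module.parameters():
        dist.broadcast(param.data, 0)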


teng-li commented Apr 16, 2018

@Stonesjtu If you want GPU training, why not just use the existing DistributedDataParallel (DDP) module? I believe CUDA-aware MPI should work with it. My understanding is that CPU DDP is what's currently missing, so let's just get a CPU version of DDP working with all supported CPU backends; that makes more sense IMHO.

@xhzhao I am wondering if you could test your implementation with the Gloo backend as well; that would be super useful.

xhzhao commented Apr 16, 2018

@Stonesjtu the PR title no longer quite matches the actual goal, since the discussion has moved on.
@teng-li I agree with you; the goal of this PR is now to create a CPU version of DDP that works with all supported CPU backends. We already added a test case for the gloo backend, please see this log:

14:29:12 Running distributed tests for the gloo backend
14:29:12 test_DistributedDataParallel (__main__.TestDistBackend) ... ok
14:29:12 test_DistributedDataParallelCPU (__main__.TestDistBackend) ... ok
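For reference, backend selection with the torch.distributed API of that era looks roughly like this; the init_method address, world_size, and rank values are placeholders:

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallelCPU

dist.init_process_group(backend='gloo',  # or 'mpi' / 'tcp'
                        init_method='tcp://127.0.0.1:23456',
                        world_size=2, rank=rank)
model = DistributedDataParallelCPU(model)  # wraps an existing nn.Module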


raise unittest.SkipTest("worldsize is too small to run group tests")

elif BACKEND == 'mpi':
    WORLD_SIZE = os.environ['WORLD_SIZE']


apaszke merged commit f2c9975 into pytorch:master Apr 17, 2018
apaszke commented Apr 17, 2018

Thanks a lot @xhzhao!
