Conversation

xhzhao commented Mar 21, 2018

  • overview
    The goal of this PR is to add MPI support to torch.nn.parallel.DistributedDataParallel().
    AFAIK, PyTorch DDP only supports the nccl and gloo backends, and I think it would be great to support the mpi backend when the user only has CPUs, especially for researchers with supercomputer access.
    reference issue

  • code change:
    I only added a few lines to the init() and forward() functions, without changing the CUDA code (apart from indentation).

  • validation:
    This code passed my test case for the mnist example: https://github.com/xhzhao/examples/tree/master/mnist
    I'm also looking to add one more case to pytorch/test/test_distributed.py, but I could not find a way to launch an MPI task from Python; I hope this gets fixed in the future (see the launch note after this list).
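For context, here is a minimal sketch of the gradient-averaging idea behind CPU data parallelism with torch.distributed; the function name average_gradients and the explicit world_size argument are illustrative, not this PR's actual code:

import torch.distributed as dist

def average_gradients(model, world_size):
    # After loss.backward(), all-reduce each gradient across ranks and
    # divide by the number of workers so every rank holds the mean gradient.
    for param in model.parameters():
        if param.requires_grad and param.grad is not None:
            dist.all_reduce(param.grad.data)
            param.grad.data /= world_size

On the launch question: a torch.distributed job using the mpi backend is normally started from the shell, e.g. mpirun -np 4 python mnist.py, rather than from inside Python, which is likely why the test harness cannot drive it directly.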

ezyang commented Mar 21, 2018

CC @teng-li

@pytorchbot test this please

NB: I don't think CI is covering mpi atm; that should be fixed

apaszke commented Mar 21, 2018

Thanks for the PR, but I think we should look for alternative ways to add support for CPU training. Right now this adds a third possible code path, for a third backend, and that's just going to be unmaintainable in the long run. It's not using most of the code in this file, which suggests it might be better to make it a class in a separate file.

xhzhao commented Mar 23, 2018

@apaszke Thanks for the feedback. I'm looking to make it a class in a separate file, but I'm wondering what the class name should be. Can I use a new name like torch.nn.parallel.MPIDistributedDataParallel()?

The key point is that the user would have to call a different API for distributed CPU and GPU training.
Does that make sense?

xhzhao commented Mar 30, 2018

Any proposal for the new class name to support distributed CPU training?

soumith commented Mar 30, 2018

@xhzhao torch.nn.parallel.DistributedDataParallelMPI() seems better

apaszke commented Mar 30, 2018

Sorry for the late reply. Actually, it shouldn't have MPI in the name. It will work with any backend that supports CPU, so it's more like DistributedDataParallelCPU.

teng-li commented Mar 30, 2018

@apaszke then the existing DDP should probably be renamed to DistributedDataParallelGPU?

apaszke commented Mar 30, 2018

Idk, we can do that if you feel strongly about it. I don't mind leaving it as is.

teng-li commented Mar 30, 2018

@apaszke I will do it later, once the CPU DDP gets merged

teng-li commented Mar 30, 2018

@xhzhao Besides MPI, the TCP and Gloo backends also support CPU collective ops; you could mention the supported backends in the module comments later.
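A possible shape for that module comment, as a rough sketch (the wording is mine, not the merged docstring):

class DistributedDataParallelCPU(torch.nn.Module):
    r"""Implements distributed data parallelism for modules on the CPU.

    Works with any torch.distributed backend that provides CPU
    collective ops, e.g. ``mpi``, ``gloo``, and ``tcp``.
    """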

ezyang added the "awaiting response (this tag is deprecated)" label Mar 30, 2018
xhzhao commented Apr 1, 2018

@teng-li will do

ezyang commented Apr 2, 2018

@pytorchbot test this please

apaszke left a comment

This needs a test.

Also, is there any difference between allreduce_params in the regular DistributedDataParallel and what you have here? It would be better to avoid code duplication and refactor it into a shared function.

def forward(self, *inputs, **kwargs):
    if self.first_call:
        self.weight_broadcast()
        self.first_call = False

self.first_call = True

def allreduce_params():
    if (self.needs_reduction):
        ...
        if param.requires_grad and param.grad is not None:
            tp = type(param.data)
            if tp not in buckets:
                buckets[tp] = []
            buckets[tp].append(param)

        for tp in buckets:
            ...
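Filling in the elided lines, the bucketed all-reduce presumably looks roughly like the sketch below. allreduce_params is a closure defined inside the wrapper module, so self is in scope; the needs_reduction reset and the coalescing via torch._utils are my reconstruction of the hidden context, not the verbatim diff:

def allreduce_params():
    # Group gradients by tensor type, flatten each bucket into one
    # contiguous tensor, all-reduce it, average, and copy the result back.
    if self.needs_reduction:
        self.needs_reduction = False
        buckets = {}
        for param in self.module.parameters():
            if param.requires_grad and param.grad is not None:
                tp = type(param.data)
                if tp not in buckets:
                    buckets[tp] = []
                buckets[tp].append(param)
        for tp in buckets:
            grads = [param.grad.data for param in buckets[tp]]
            coalesced = torch._utils._flatten_dense_tensors(grads)
            dist.all_reduce(coalesced)
            coalesced /= dist.get_world_size()
            synced = torch._utils._unflatten_dense_tensors(coalesced, grads)
            for buf, s in zip(grads, synced):
                buf.copy_(s)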

.. warning::
    This module works only with the ``mpi`` backend.
    Other backends such as ``gloo`` and ``tcp`` are not tested yet.

xhzhao commented Apr 8, 2018

@apaszke thanks for your feedback, I updated the code again:

  • This class should just have its own test, and we should remove this warning
    Done. A test case is set up for this new interface under the name test_DistributedDataParallelCPU(),
    which uses test_DistributedDataParallel() as a reference.
    The CI log shows that the tcp and gloo backends passed this test case, but the mpi backend does not exercise it. I built PyTorch from source on my PC and the mpi backend passed this test case.
  • Remove first_call and broadcast weights in init()
    Done
  • Code sharing for allreduce_params
    This function is completely different from the GPU implementation, so I don't think we can reuse it.
  • Please don't use parentheses in if statements
    Done
  • nit: please make buckets a defaultdict(list)
    Done (see the sketch after this list)
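For the defaultdict nit, the change amounts to something like this (variable names follow the fragments quoted above):

from collections import defaultdict

buckets = defaultdict(list)  # no explicit "if tp not in buckets" check needed
for param in self.module.parameters():
    if param.requires_grad and param.grad is not None:
        buckets[type(param.data)].append(param)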

xhzhao commented Apr 11, 2018

Any update on the code review?

apaszke commented Apr 11, 2018

I will get back to you this week, sorry for the delay

@Stonesjtu

Well, I'm not sold on the name DDP-CPU. MPI != CPU.

As far as I know, there are at least three CUDA-aware MPI implementations available, and I managed to compile PyTorch with OpenMPI 1.10.7; most MPI primitives seem to work on GPU memory.

apaszke left a comment

Two things, and it should be good to merge.

# Shuffle the input so that DDP input is different
input_cpu = input_cpu[torch.randperm(global_bs)]

self._barrier()


if param.requires_grad:
    param.register_hook(allreduce_hook)

def weight_broadcast(self):
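The hook registration above is from the diff; the body of weight_broadcast is hidden, but it presumably amounts to broadcasting every parameter from rank 0, roughly:

def weight_broadcast(self):
    # Send rank 0's parameters to every other rank so all workers
    # start training from identical weights.
    for param in self.module.parameters():
        dist.broadcast(param.data, 0)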


teng-li commented Apr 16, 2018

@Stonesjtu If you want GPU training, why not just use the existing DistributedDataParallel (DDP) module? I believe CUDA-aware MPI should work with it. My understanding is that CPU DDP is what's currently missing, so let's just get a CPU version of DDP working with all supported CPU backends; that makes more sense IMHO.

@xhzhao I am wondering if you could test your implementation with the Gloo backend as well; that would be super useful.

xhzhao commented Apr 16, 2018

@Stonesjtu the PR title no longer quite matches the actual goal, since the discussion has moved on.
@teng-li I agree with you; the goal of this PR is now to create a CPU version of DDP that works with all supported CPU backends. We already added a test case for the gloo backend, please see this log:

14:29:12 Running distributed tests for the gloo backend
14:29:12 test_DistributedDataParallel (__main__.TestDistBackend) ... ok
14:29:12 test_DistributedDataParallelCPU (__main__.TestDistBackend) ... ok
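For reference, backend selection with the torch.distributed API of that era looks roughly like this; the init_method address, world_size, and rank values are placeholders:

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallelCPU

dist.init_process_group(backend='gloo',  # or 'mpi' / 'tcp'
                        init_method='tcp://127.0.0.1:23456',
                        world_size=2, rank=rank)
model = DistributedDataParallelCPU(model)  # wraps an existing nn.Module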


raise unittest.SkipTest("worldsize is too small to run group tests")

elif BACKEND == 'mpi':
    WORLD_SIZE = os.environ['WORLD_SIZE']


apaszke merged commit f2c9975 into pytorch:master Apr 17, 2018
apaszke commented Apr 17, 2018

Thanks a lot @xhzhao!
