Improve the documentation of DistributedDataParallel #42471

0x4f5da2 · 2020-08-03T18:23:15Z

Fixes #{issue number}

The DistributedDataParallel documentation says that 'gradients from each node are averaged', but it is not clear what this implies in practice. Many people, including me, have misunderstood this part. I added a note to the documentation to make it more straightforward and user-friendly.

Here is some toy code to illustrate my point:

non-DistributedDataParallel version

import torch
import torch.nn as nn

x = torch.tensor([-1, 2, -3, 4], dtype=torch.float).view(-1, 1)
print("input:", x)

model = nn.Linear(in_features=1, out_features=1, bias=False)
model.weight.data.zero_()
model.weight.data.add_(1.0)

opti = torch.optim.SGD(model.parameters(), lr=0.001)
opti.zero_grad()

y = model(x)

label = torch.zeros(4, 1, dtype=torch.float)
# Sum-reduced squared error: per-sample gradients add up over the whole batch.
loss = torch.sum((y - label)**2)

loss.backward()
opti.step()

print("grad:", model.weight.grad)
print("updated weight:\n", model.weight)

# OUTPUT
# $ python test.py
# input: tensor([[-1.],
#         [ 2.],
#         [-3.],
#         [ 4.]])
# grad: tensor([[60.]])
# updated weight:
#  Parameter containing:
# tensor([[0.9400]], requires_grad=True)

DistributedDataParallel version

import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.multiprocessing import Process

def run(rank, size):
    # Each rank gets its own half of the 4-sample batch: rank 0 -> [-1, 2], rank 1 -> [-3, 4].
    x = torch.tensor([-(1 + 2 * rank), 2 + 2 * rank], dtype=torch.float).view(-1, 1)
    print("input:", x)

    model = nn.Linear(in_features=1, out_features=1, bias=False)
    model.weight.data.zero_()
    model.weight.data.add_(1.0)
    model = torch.nn.parallel.DistributedDataParallel(model)

    opti = torch.optim.SGD(model.parameters(), lr=0.001)
    opti.zero_grad()

    y = model(x)

    label = torch.zeros(2, 1, dtype=torch.float)
    loss = torch.sum((y.view(-1, 1) - label)**2)

    loss.backward()
    opti.step()

    if rank == 0:
        print("grad:", model.module.weight.grad)
        print("updated weight:\n", model.module.weight)


def init_process(rank, size, fn, backend="gloo"):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)


if __name__ == "__main__":
    size = 2
    process = []
    for rank in range(size):
        p = Process(target=init_process, args=(rank, size, run))
        p.start()
        process.append(p)

    for p in process:
        p.join()

# OUTPUT
# $ python test_d.py
# input: tensor([[-3.],
#         [ 4.]])input: tensor([[-1.],
#         [ 2.]])

# grad: tensor([[30.]])
# updated weight:
#  Parameter containing:
# tensor([[0.9700]], requires_grad=True)
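
The two results above (a gradient of 60 on one process, 30 under DistributedDataParallel) can be reproduced by hand. The sketch below is plain Python with no PyTorch and no process group. The helper `grad_sum_sq` is a hypothetical name introduced only for this illustration, and it assumes the same setup as the toy code: a single weight `w = 1`, zero labels, and a sum-reduced squared-error loss.

```python
# Hand-computed gradients for the toy model y = w * x with loss = sum((y - 0)**2).
# `grad_sum_sq` is a hypothetical helper for illustration, not a PyTorch API.

def grad_sum_sq(w, xs):
    """Return d/dw of sum((w * x)**2) over xs, i.e. sum(2 * w * x**2)."""
    return sum(2.0 * w * x * x for x in xs)

w = 1.0
lr = 0.001
full_batch = [-1.0, 2.0, -3.0, 4.0]
shards = [[-1.0, 2.0], [-3.0, 4.0]]  # rank 0 and rank 1 inputs from the toy code

# Single process: one gradient over all four samples.
single_grad = grad_sum_sq(w, full_batch)

# DDP-style: each rank computes a local gradient, then the gradients are averaged.
local_grads = [grad_sum_sq(w, shard) for shard in shards]
ddp_grad = sum(local_grads) / len(local_grads)

print("single-process grad:", single_grad)  # 60.0
print("per-rank grads:", local_grads)       # [10.0, 50.0]
print("averaged grad:", ddp_grad)           # 30.0
print("updated weights:", round(w - lr * single_grad, 4), round(w - lr * ddp_grad, 4))
```

With a sum-reduced loss, the averaged gradient is the full-batch gradient divided by the number of ranks, which is why the two scripts end up with different weights.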

dr-ci · 2020-08-03T18:25:12Z

💊 CI failures summary and remediations

As of commit 7eae5c6:

✅ None of the CI failures appear to be your fault 💚

2/10 tentatively recognized as flaky ❄️
8/10 broken upstream at merge base ed44269 on Aug 03 from 10:10am to 12:08pm PDT (4 commits; ed44269 - 1b9cd74)

❄️ 2 failures tentatively classified as flaky

but reruns have not yet been triggered to confirm:

pytorch_linux_xenial_py3_6_gcc5_4_build (1/2)

Step: "Build" ❄️

Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4:8bdba785b1eac4d297d5f5930f979518012a56e0 not found

DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4:8bdba785b1eac4d297d5f5930f979518012a56e0 
Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4:8bdba785b1eac4d297d5f5930f979518012a56e0 not found

pytorch_bazel_build (2/2)

Step: "Bazel Build" ❄️

Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc7:8bdba785b1eac4d297d5f5930f979518012a56e0 not found

DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc7:8bdba785b1eac4d297d5f5930f979518012a56e0 
Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc7:8bdba785b1eac4d297d5f5930f979518012a56e0 not found

🚧 1 ongoing upstream failure:

These were probably caused by upstream breakages that are not fixed yet:

pytorch_linux_bionic_py3_7_conda_build since Aug 03

🚧 7 fixed upstream failures:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

Since your merge base is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

pytorch_linux_xenial_py3_clang5_mobile_build on Aug 03 from 10:10am to 12:08pm PDT (4 commits; ed44269 - 1b9cd74)
pytorch_libtorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_build on Aug 03 from 10:10am to 12:08pm PDT (4 commits; ed44269 - 1b9cd74)
pytorch_linux_xenial_py3_clang5_mobile_custom_build_static on Aug 03 from 10:10am to 12:08pm PDT (4 commits; ed44269 - 1b9cd74)
pytorch_xla_linux_bionic_py3_6_clang9_build on Aug 03 from 10:10am to 12:08pm PDT (4 commits; ed44269 - 1b9cd74)
pytorch_linux_bionic_py3_6_clang9_build on Aug 03 from 10:10am to 12:08pm PDT (4 commits; ed44269 - 1b9cd74)
pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_build on Aug 03 from 10:10am to 12:08pm PDT (4 commits; ed44269 - 1b9cd74)
pytorch_linux_xenial_py3_clang5_asan_build on Aug 03 from 10:10am to 12:08pm PDT (4 commits; ed44269 - 1b9cd74)

This comment was automatically generated by Dr. CI.

This comment has been revised 12 times.

mrshenli

Thanks for adding this!

mrshenli

Lint failure is real. Could you please fix this? Thanks

  {
    path: 'torch/nn/parallel/distributed.py',
    start_line: 150,
    end_line: 150,
    start_column: 1,
    end_column: 1,
    annotation_level: 'failure',
    message: '[W293] blank line contains whitespace'
  }
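
A minimal sketch of one way to locate the lines that a W293 annotation like the one above points at, assuming the file contents are available as a string. The function name `whitespace_only_lines` is hypothetical and is not part of flake8 or PyTorch; it only mimics the "blank line contains whitespace" condition.

```python
# Hypothetical helper: find lines that look blank but still contain spaces or tabs,
# which is the condition described by the W293 message.

def whitespace_only_lines(text):
    """Return 1-based numbers of lines that are non-empty yet contain only whitespace."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line and not line.strip()
    ]

sample = "a\n   \n\nb\t\n \t"
print(whitespace_only_lines(sample))  # [2, 5]
```

Lines 2 and 5 of the sample contain only spaces or a tab, so they are reported; the truly empty line 3 and the trailing tab after `b` on line 4 are not.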

facebook-github-bot

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2020-08-04T16:24:56Z

@mrshenli merged this pull request in b56db30.

Summary: Sorry for my previous inaccurate [PR](#42471 (comment) ). Here is some toy code to illustrate my point:

* non-DistributedDataParallel version

```python
import torch

if __name__ == "__main__":
    torch.manual_seed(0)
    inp = torch.randn(1,16)
    inp = torch.cat([inp, inp], dim=0)
    model = torch.nn.Linear(16, 2)
    loss_func = torch.nn.CrossEntropyLoss()
    opti = torch.optim.SGD(model.parameters(), lr=0.001)
    opti.zero_grad()
    loss = loss_func(model(inp), torch.tensor([0, 0]))
    loss.backward()
    opti.step()

    print("grad:", model.weight.grad)
    print("updated weight:\n", model.weight)
```

* DistributedDataParallel version

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.multiprocessing import Process

def run(rank, size):
    torch.manual_seed(0)
    x = torch.randn(1,16)

    model = torch.nn.Linear(16, 2)
    model = torch.nn.parallel.DistributedDataParallel(model)
    loss_func = torch.nn.CrossEntropyLoss()
    opti = torch.optim.SGD(model.parameters(), lr=0.001)
    opti.zero_grad()

    y = model(x)
    label = torch.tensor([0])
    loss = loss_func(y, label)

    loss.backward()
    opti.step()

    if rank == 0:
        print("grad:", model.module.weight.grad)
        print("updated weight:\n", model.module.weight)

def init_process(rank, size, fn, backend="gloo"):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    size = 2
    process = []
    for rank in range(size):
        p = Process(target=init_process, args=(rank, size, run))
        p.start()
        process.append(p)

    for p in process:
        p.join()
```

Both pieces of code produce the same output.

Pull Request resolved: #47156

Reviewed By: mruberry

Differential Revision: D24675199

Pulled By: mrshenli

fbshipit-source-id: 1238a63350a32a824b4b8c0018dc80454ea502bb
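
As a rough check of the follow-up summary above: its scripts use `CrossEntropyLoss`, whose default reduction is the mean over the batch. The sketch below swaps in a mean-reduced squared error on a single weight so the arithmetic can be done by hand. It is plain Python, the helper `grad_mean_sq` and the sample values are assumptions made for illustration, and it assumes every rank holds the same number of samples.

```python
# Mean-reduced loss: loss = mean((w * x)**2) for the toy model y = w * x.
# `grad_mean_sq` is a hypothetical helper for illustration, not a PyTorch API.

def grad_mean_sq(w, xs):
    """Return d/dw of mean((w * x)**2) over xs, i.e. sum(2 * w * x**2) / len(xs)."""
    return sum(2.0 * w * x * x for x in xs) / len(xs)

w = 1.0
full_batch = [-1.0, 2.0, -3.0, 4.0]
shards = [[-1.0, 2.0], [-3.0, 4.0]]  # assumed equal-sized per-rank batches

# Gradient of the mean loss over the full batch on one process.
full_grad = grad_mean_sq(w, full_batch)

# Each rank takes the mean over its own shard, then the gradients are averaged.
local_grads = [grad_mean_sq(w, shard) for shard in shards]
averaged_grad = sum(local_grads) / len(local_grads)

print(full_grad, local_grads, averaged_grad)  # 15.0 [5.0, 25.0] 15.0
```

With a mean-reduced loss and equal shard sizes, the average of the per-rank gradients equals the full-batch gradient, which is consistent with the summary's statement that both scripts print the same result.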

Improve the document of DistributedDataParallel

765edc4

0x4f5da2 requested a review from apaszke as a code owner August 3, 2020 18:23

pytorchbot added the open source label Aug 3, 2020

mrshenli approved these changes Aug 4, 2020

mrshenli requested changes Aug 4, 2020

remove white space

7eae5c6

0x4f5da2 requested a review from mrshenli August 4, 2020 03:52

mrshenli approved these changes Aug 4, 2020

facebook-github-bot reviewed Aug 4, 2020

facebook-github-bot closed this in b56db30 Aug 4, 2020

facebook-github-bot added the merged label Aug 4, 2020

mruberry added the Merged label Oct 28, 2020

0x4f5da2 mentioned this pull request Oct 31, 2020

Fix inaccurate note in DistributedDataParallel #47156

Closed
