
Conversation

@0x4f5da2 (Contributor) commented Aug 3, 2020

Fixes #{issue number}

The statement in the DistributedDataParallel documentation that 'gradients from each node are averaged' is not clear on its own. Many people, including me, had a completely wrong understanding of this part. This PR adds a note to the documentation to make it more straightforward and user friendly.

Here is some toy code to illustrate my point:

  • non-DistributedDataParallel version

    import torch
    import torch.nn as nn
    
    x = torch.tensor([-1, 2, -3, 4], dtype=torch.float).view(-1, 1)
    print("input:", x)
    
    model = nn.Linear(in_features=1, out_features=1, bias=False)
    model.weight.data.zero_()
    model.weight.data.add_(1.0)
    
    opti = torch.optim.SGD(model.parameters(), lr=0.001)
    opti.zero_grad()
    
    y = model(x)
    
    label = torch.zeros(4, 1, dtype=torch.float)
    loss = torch.sum((y - label)**2)
    
    loss.backward()
    opti.step()
    
    print("grad:", model.weight.grad)
    print("updated weight:\n", model.weight)
    
    # OUTPUT
    # $ python test.py
    # input: tensor([[-1.],
    #         [ 2.],
    #         [-3.],
    #         [ 4.]])
    # grad: tensor([[60.]])
    # updated weight:
    #  Parameter containing:
    # tensor([[0.9400]], requires_grad=True)
  • DistributedDataParallel version

    import os
    import torch
    import torch.nn as nn
    import torch.distributed as dist
    from torch.multiprocessing import Process
    
    def run(rank, size):
        x = torch.tensor([-(1 + 2 * rank), 2 + 2 * rank], dtype=torch.float).view(-1, 1)
        print("input:", x)
    
        model = nn.Linear(in_features=1, out_features=1, bias=False)
        model.weight.data.zero_()
        model.weight.data.add_(1.0)
        model = torch.nn.parallel.DistributedDataParallel(model)
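        # DDP hooks into backward(): each parameter's gradient is all-reduced
        # across the ranks and divided by the world size, so every rank ends
        # up with the average of the local gradients, not their sum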
    
        opti = torch.optim.SGD(model.parameters(), lr=0.001)
        opti.zero_grad()
    
        y = model(x)
    
        label = torch.zeros(2, 1, dtype=torch.float)
        loss = torch.sum((y.view(-1, 1) - label)**2)
    
        loss.backward()
        opti.step()
    
        if rank == 0:
            print("grad:", model.module.weight.grad)
            print("updated weight:\n", model.module.weight)
    
    
    def init_process(rank, size, fn, backend="gloo"):
        os.environ['MASTER_ADDR'] = '127.0.0.1'
        os.environ['MASTER_PORT'] = '29500'
        dist.init_process_group(backend, rank=rank, world_size=size)
        fn(rank, size)
    
    
    if __name__ == "__main__":
        size = 2
        process = []
        for rank in range(size):
            p = Process(target=init_process, args=(rank, size, run))
            p.start()
            process.append(p)
    
        for p in process:
            p.join()
    
    # OUTPUT
    # $ python test_d.py
    # input: tensor([[-3.],
    #         [ 4.]])
    # input: tensor([[-1.],
    #         [ 2.]])
    # (the prints from the two ranks interleave nondeterministically)
    
    # grad: tensor([[30.]])
    # updated weight:
    #  Parameter containing:
    # tensor([[0.9700]], requires_grad=True)
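
To make these numbers concrete, here is a hand-check of the two runs (my own sketch, not part of the original PR). With w = 1 and loss L(w) = sum_i (w * x_i)^2, the gradient is dL/dw = 2 * w * sum_i x_i^2:

    # hand-check of the outputs above (a sketch, not from the original PR)
    xs = [-1.0, 2.0, -3.0, 4.0]
    w = 1.0
    full_grad = 2 * w * sum(x * x for x in xs)     # 60.0: single-process grad
    rank0_grad = 2 * w * ((-1.0) ** 2 + 2.0 ** 2)  # 10.0: local grad on rank 0
    rank1_grad = 2 * w * ((-3.0) ** 2 + 4.0 ** 2)  # 50.0: local grad on rank 1
    ddp_grad = (rank0_grad + rank1_grad) / 2       # 30.0: DDP averages, not sums
    print(full_grad, ddp_grad)                     # 60.0 30.0

The DDP run reports grad 30 (the average of the two local gradients) rather than 60 (the gradient over the concatenated batch), which is exactly the behavior the added note documents.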

@0x4f5da2 requested a review from apaszke as a code owner, August 3, 2020 18:23
dr-ci bot commented Aug 3, 2020

💊 CI failures summary and remediations

As of commit 7eae5c6 (more details on the Dr. CI page):


None of the CI failures appear to be your fault 💚



❄️ 2 failures tentatively classified as flaky, but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (1/2)

Step: "Build" (full log | diagnosis details | 🔁 rerun) ❄️

Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4:8bdba785b1eac4d297d5f5930f979518012a56e0 not found
DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4:8bdba785b1eac4d297d5f5930f979518012a56e0 
Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc5.4:8bdba785b1eac4d297d5f5930f979518012a56e0 not found 

See CircleCI build pytorch_bazel_build (2/2)

Step: "Bazel Build" (full log | diagnosis details | 🔁 rerun) ❄️

Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc7:8bdba785b1eac4d297d5f5930f979518012a56e0 not found
DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc7:8bdba785b1eac4d297d5f5930f979518012a56e0 
Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-gcc7:8bdba785b1eac4d297d5f5930f979518012a56e0 not found 

🚧 1 ongoing upstream failure:

This was probably caused by an upstream breakage that is not fixed yet:


🚧 7 fixed upstream failures:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch:

Since your merge base is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

Check out the recency history of this "viable master" tracking branch.


This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 12 times.

@mrshenli (Contributor) left a comment:

Thanks for adding this!

@mrshenli (Contributor) left a comment:

The lint failure is real. Could you please fix it? Thanks!

  {
    path: 'torch/nn/parallel/distributed.py',
    start_line: 150,
    end_line: 150,
    start_column: 1,
    end_column: 1,
    annotation_level: 'failure',
    message: '[W293] blank line contains whitespace'
  }

@0x4f5da2 requested a review from mrshenli, August 4, 2020 03:52
@facebook-github-bot left a comment:

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mrshenli merged this pull request in b56db30.

facebook-github-bot pushed a commit that referenced this pull request Nov 10, 2020
Summary:
Sorry for my previous inaccurate [PR](#42471 (comment)).

Here is some toy code to illustrate my point:

* non-DistributedDataParallel version

```python
import torch

if __name__ == "__main__":
    torch.manual_seed(0)
    inp = torch.randn(1,16)
    inp = torch.cat([inp, inp], dim=0)
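    # duplicating the sample mimics the total data the two DDP ranks see below;
    # CrossEntropyLoss defaults to reduction='mean', so the loss over the pair
    # equals the per-sample loss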
    model = torch.nn.Linear(16, 2)
    loss_func = torch.nn.CrossEntropyLoss()
    opti = torch.optim.SGD(model.parameters(), lr=0.001)
    opti.zero_grad()
    loss = loss_func(model(inp), torch.tensor([0, 0]))
    loss.backward()
    opti.step()

    print("grad:", model.weight.grad)
    print("updated weight:\n", model.weight)
```

* DistributedDataParallel version

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.multiprocessing import Process

def run(rank, size):
    torch.manual_seed(0)
    x = torch.randn(1,16)

    model = torch.nn.Linear(16, 2)
    model = torch.nn.parallel.DistributedDataParallel(model)
    loss_func = torch.nn.CrossEntropyLoss()
    opti = torch.optim.SGD(model.parameters(), lr=0.001)
    opti.zero_grad()

    y = model(x)

    label = torch.tensor([0])
    loss = loss_func(y, label)

    loss.backward()
    opti.step()

    if rank == 0:
        print("grad:", model.module.weight.grad)
        print("updated weight:\n", model.module.weight)

def init_process(rank, size, fn, backend="gloo"):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    size = 2
    process = []
    for rank in range(size):
        p = Process(target=init_process, args=(rank, size, run))
        p.start()
        process.append(p)

    for p in process:
        p.join()
```

Both pieces of code produce the same output.
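
My reading of why they match (a sketch; the commit does not spell this out): CrossEntropyLoss defaults to reduction='mean', so the single-process script averages the loss over the two identical samples, while DDP averages the two identical per-rank gradients; both paths yield the per-sample gradient. A quick single-process check:

```python
# sanity check (my sketch, not part of the commit): with reduction='mean' and a
# duplicated input, the full-batch gradient equals one sample's gradient, which
# is also what DDP's all-reduce-and-average yields for identical per-rank inputs
import torch

torch.manual_seed(0)
inp = torch.randn(1, 16)
model = torch.nn.Linear(16, 2)
loss_func = torch.nn.CrossEntropyLoss()

model.zero_grad()
loss_func(model(inp), torch.tensor([0])).backward()
per_sample_grad = model.weight.grad.clone()  # one rank's local gradient

model.zero_grad()
batch = torch.cat([inp, inp], dim=0)         # the duplicated single-process batch
loss_func(model(batch), torch.tensor([0, 0])).backward()
print(torch.allclose(model.weight.grad, per_sample_grad))  # True
```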

Pull Request resolved: #47156

Reviewed By: mruberry

Differential Revision: D24675199

Pulled By: mrshenli

fbshipit-source-id: 1238a63350a32a824b4b8c0018dc80454ea502bb