Skip to content

Conversation

@0x4f5da2
Copy link
Contributor

Sorry for my previous inaccurate PR.

Here are some toy code to illustrate my point:

  • non-DistributedDataParallel version
import torch


if __name__ == "__main__":
    torch.manual_seed(0)
    inp = torch.randn(1,16)
    inp = torch.cat([inp, inp], dim=0)
    model = torch.nn.Linear(16, 2)
    loss_func = torch.nn.CrossEntropyLoss()
    opti = torch.optim.SGD(model.parameters(), lr=0.001)
    opti.zero_grad()
    loss = loss_func(model(inp), torch.tensor([0, 0]))
    loss.backward()
    opti.step()

    print("grad:", model.weight.grad)
    print("updated weight:\n", model.weight)
  • DistributedDataParallel version
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.multiprocessing import Process

def run(rank, size):
    torch.manual_seed(0)
    x = torch.randn(1,16)

    model = torch.nn.Linear(16, 2)
    model = torch.nn.parallel.DistributedDataParallel(model)
    loss_func = torch.nn.CrossEntropyLoss()
    opti = torch.optim.SGD(model.parameters(), lr=0.001)
    opti.zero_grad()

    y = model(x)

    label = torch.tensor([0])
    loss = loss_func(y, label)

    loss.backward()
    opti.step()

    if rank == 0:
        print("grad:", model.module.weight.grad)
        print("updated weight:\n", model.module.weight)


def init_process(rank, size, fn, backend="gloo"):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)


if __name__ == "__main__":
    size = 2
    process = []
    for rank in range(size):
        p = Process(target=init_process, args=(rank, size, run))
        p.start()
        process.append(p)

    for p in process:
        p.join()

Both of these two pieces of code have the same output.

@0x4f5da2 0x4f5da2 requested a review from apaszke as a code owner October 31, 2020 16:22
@facebook-github-bot
Copy link
Contributor

Hi @0x4f5da2!

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@facebook-github-bot facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Oct 31, 2020
@codecov
Copy link

codecov bot commented Oct 31, 2020

Codecov Report

Merging #47156 (c54a194) into master (9e0102c) will decrease coverage by 10.29%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           master   #47156       +/-   ##
===========================================
- Coverage   71.10%   60.81%   -10.30%     
===========================================
  Files        1805     2748      +943     
  Lines      190006   253997    +63991     
===========================================
+ Hits       135103   154456    +19353     
- Misses      54903    99541    +44638     

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@dr-ci
Copy link

dr-ci bot commented Nov 9, 2020

💊 CI failures summary and remediations

As of commit 7670876 (more details on the Dr. CI page):


  • 1/7 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)
  • 6/7 broken upstream at merge base 9e0102c since Nov 08

🚧 6 ongoing upstream failures:

These were probably caused by upstream breakages that are not fixed yet:


Extra GitHub checks: 1 failed


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 7 times.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@mrshenli merged this pull request in 3dd2663.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue open source

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants