Improve the documentation of DistributedDataParallel #42471
💊 CI failures summary and remediations
As of commit 7eae5c6 (more details on the Dr. CI page): ✅ None of the CI failures appear to be your fault 💚
❄️ 2 failures tentatively classified as flaky, but reruns have not yet been triggered to confirm.
mrshenli
left a comment
Thanks for adding this!
mrshenli
left a comment
Lint failure is real. Could you please fix this? Thanks
{
path: 'torch/nn/parallel/distributed.py',
start_line: 150,
end_line: 150,
start_column: 1,
end_column: 1,
annotation_level: 'failure',
message: '[W293] blank line contains whitespace'
}
facebook-github-bot
left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary:
Sorry for my previous inaccurate [PR](#42471 (comment) ).

Here is some toy code to illustrate my point:

* non-DistributedDataParallel version

```python
import torch

if __name__ == "__main__":
    torch.manual_seed(0)
    # One random sample, duplicated to form a batch of two identical rows.
    inp = torch.randn(1,16)
    inp = torch.cat([inp, inp], dim=0)
    model = torch.nn.Linear(16, 2)
    loss_func = torch.nn.CrossEntropyLoss()
    opti = torch.optim.SGD(model.parameters(), lr=0.001)
    opti.zero_grad()
    loss = loss_func(model(inp), torch.tensor([0, 0]))
    loss.backward()
    opti.step()

    print("grad:", model.weight.grad)
    print("updated weight:\n", model.weight)
```

* DistributedDataParallel version

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.multiprocessing import Process


def run(rank, size):
    # Same seed on every rank, so each process sees the same sample and weights.
    torch.manual_seed(0)
    x = torch.randn(1,16)
    model = torch.nn.Linear(16, 2)
    model = torch.nn.parallel.DistributedDataParallel(model)
    loss_func = torch.nn.CrossEntropyLoss()
    opti = torch.optim.SGD(model.parameters(), lr=0.001)
    opti.zero_grad()
    y = model(x)
    label = torch.tensor([0])
    loss = loss_func(y, label)
    loss.backward()
    opti.step()

    if rank == 0:
        print("grad:", model.module.weight.grad)
        print("updated weight:\n", model.module.weight)


def init_process(rank, size, fn, backend="gloo"):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)


if __name__ == "__main__":
    size = 2
    process = []
    for rank in range(size):
        p = Process(target=init_process, args=(rank, size, run))
        p.start()
        process.append(p)

    for p in process:
        p.join()
```

Both pieces of code produce the same output.

Pull Request resolved: #47156
Reviewed By: mruberry
Differential Revision: D24675199
Pulled By: mrshenli
fbshipit-source-id: 1238a63350a32a824b4b8c0018dc80454ea502bb
Fixes #{issue number}
The phrase 'gradients from each node are averaged' in the documentation of DistributedDataParallel is not clear on its own. Many people, including me, have misunderstood this part completely. I added a note to the documentation to make it more straightforward and more user-friendly.
Here is some toy code to illustrate my point:
non-DistributedDataParallel version
DistributedDataParallel version
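The arithmetic behind "gradients from each node are averaged" can also be checked without any distributed setup. The following is a minimal sketch in plain Python, not the code from this PR: the one-parameter model `y = w * x`, the data, and the helper names `grad` and `averaged_grad` are all hypothetical, and the two "nodes" are just two equal-sized slices of one batch. It suggests that averaging per-node gradients matches the single-node gradient when the loss is averaged over the batch, and is smaller by a factor equal to the number of nodes when the loss is summed.

```python
import math


def grad(w, xs, ys, reduction="mean"):
    """Gradient w.r.t. w of the squared-error loss for the toy model y = w * x."""
    total = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))
    return total / len(xs) if reduction == "mean" else total


# Hypothetical data: one batch of four samples.
w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 1.0, 0.0, 3.0]

# Assumption: two "nodes", each holding an equal-sized half of the batch.
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]


def averaged_grad(reduction):
    """Average the per-node gradients, mimicking an all-reduce mean."""
    local = [grad(w, sx, sy, reduction) for sx, sy in shards]
    return sum(local) / len(local)


full_mean = grad(w, xs, ys, "mean")    # single node, loss averaged over the batch
full_sum = grad(w, xs, ys, "sum")      # single node, loss summed over the batch
ddp_like_mean = averaged_grad("mean")  # per-node mean loss, gradients averaged
ddp_like_sum = averaged_grad("sum")    # per-node summed loss, gradients averaged

# Mean-reduced loss: averaged gradient equals the full-batch gradient.
print(math.isclose(ddp_like_mean, full_mean))
# Sum-reduced loss: averaged gradient is smaller by the number of nodes.
print(math.isclose(ddp_like_sum * len(shards), full_sum))
```

The equality in the mean-reduced case relies on the local batches having the same size; with unequal shards, a plain average of per-node gradients would weight samples differently.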