[reland] Make grad point to bucket buffer in DDP to save memory usage #44344

zhaojuanmao · 2020-09-08T23:28:43Z

Stack from ghstack:

#44344 [reland] Make grad point to bucket buffer in DDP to save memory usage

reland #41954

Add an argument to the DDP API that enables or disables letting grads point to views. When it is disabled, DDP behaves exactly as it does today. When it is enabled, both variable.grad() and the grad in the distautograd context point to the bucket buffer in DDP, which saves memory.
In that case grad is a view of the bucket buffer tensors. To keep this compatible with optimizer.zero_grad(), we made changes in #41283.

Also note that we cannot make variable.grad() point to the bucket buffer at construction time, because we want to keep grad undefined for unused parameters.
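
The idea can be illustrated without PyTorch. The sketch below is a simplified model, not the DDP implementation: `Param`, `Bucket`, `accumulate_grad`, `average_in_place` and the `grad_as_view` flag are all hypothetical names, and a standard-library `memoryview` stands in for a tensor view. It shows that a grad bound as a view sees the reduced values without a copy back, and that a parameter that never receives a gradient keeps `grad` undefined.

```python
from array import array


class Param:
    """Hypothetical stand-in for a parameter; `grad` starts out undefined."""

    def __init__(self, name, numel):
        self.name = name
        self.numel = numel
        self.grad = None


class Bucket:
    """One flat buffer that stores several gradients back to back."""

    def __init__(self, params):
        self.buffer = array("d", [0.0] * sum(p.numel for p in params))
        self.offsets = {}
        offset = 0
        for p in params:
            self.offsets[p.name] = offset
            offset += p.numel

    def view_for(self, param):
        # Slicing a memoryview does not copy; it aliases the bucket storage.
        start = self.offsets[param.name]
        return memoryview(self.buffer)[start:start + param.numel]


def accumulate_grad(bucket, param, values, grad_as_view):
    """Write a locally computed gradient into the bucket and bind param.grad."""
    view = bucket.view_for(param)
    for i, value in enumerate(values):
        view[i] = value
    # View mode aliases the bucket; copy mode keeps a second full-size buffer.
    param.grad = view if grad_as_view else array("d", values)


def average_in_place(bucket, world_size):
    """Stand-in for allreduce followed by division by the world size."""
    for i in range(len(bucket.buffer)):
        bucket.buffer[i] /= world_size


used, unused = Param("used", 3), Param("unused", 2)
bucket = Bucket([used, unused])

# View mode: reduced values are visible through param.grad with no copy back.
accumulate_grad(bucket, used, [2.0, 4.0, 6.0], grad_as_view=True)
average_in_place(bucket, world_size=2)
view_result = used.grad.tolist()

# Copy mode: param.grad is a separate buffer and stays stale until copied back.
accumulate_grad(bucket, used, [2.0, 4.0, 6.0], grad_as_view=False)
average_in_place(bucket, world_size=2)
copy_result = used.grad.tolist()
```

Binding the view only when a gradient is actually written, rather than when the bucket is built, is what lets `unused.grad` stay `None` in this model.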

Differential Revision: D23588186

dr-ci · 2020-09-08T23:39:02Z

💊 CI failures summary and remediations

As of commit 3568201 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

codecov · 2020-09-11T03:16:48Z

Codecov Report

❗ No coverage uploaded for pull request base (gh/zhaojuanmao/53/base@f3cce29).
The diff coverage is n/a.

@@                    Coverage Diff                    @@
##             gh/zhaojuanmao/53/base   #44344   +/-   ##
=========================================================
  Coverage                          ?   68.09%           
=========================================================
  Files                             ?      393           
  Lines                             ?    50972           
  Branches                          ?        0           
=========================================================
  Hits                              ?    34707           
  Misses                            ?    16265           
  Partials                          ?        0

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f3cce29...3568201.

mrshenli

Lint errors are real.

Could you please build the docs to verify the new doc strings are shown correctly? Thanks!

torch/csrc/distributed/c10d/init.cpp

torch/csrc/distributed/c10d/reducer.cpp

torch/nn/parallel/distributed.py

test/distributed/test_c10d.py

test/distributed/test_c10d.py

torch/csrc/distributed/c10d/reducer.cpp

torch/csrc/distributed/c10d/reducer.h

torch/nn/parallel/distributed.py

mrshenli

LGTM!

Master was broken and is fixed now. Could you please rebase to the latest commit and wait for all tests to pass before landing? Thanks!

mrshenli · 2020-09-24T01:39:51Z

torch/nn/parallel/distributed.py

+                      between gradients and allreduce communication buckets.
+                      When gradients are views, "detach_()" cannot be called on the
+                      gradients. If hitting such errors, please fix it by referring to
+                      the :meth:torch.optim.Optimizer.zero_grad function in


I might be wrong, but does :meth:torch.optim.Optimizer.zero_grad also render correctly as a link?

zhaojuanmao · 2020-09-24T16:41:55Z

failures are not related

zhaojuanmao · 2020-09-24T20:41:46Z

merge conflict, rebase

facebook-github-bot · 2020-09-25T04:17:39Z

This pull request has been merged in c6500bc.

Summary: Fixes #{issue number} This is resubmit for PR #42897 . Together with fix for Windows build issue introduced by PR #44344 . Pull Request resolved: #45335 Reviewed By: zou3519 Differential Revision: D23931471 Pulled By: mrshenli fbshipit-source-id: f49b5a114944c1450b32934b3292170be064f494

zhaojuanmao requested review from albanD, apaszke, mrshenli, pietern and pritamdamania87 as code owners September 8, 2020 23:28

This was referenced Sep 8, 2020

move rebuild buckets from end of first iteration to beginning of second iteration #44326

Closed

refactor intialize bucket views #44330

Closed

mrshenli reviewed Sep 14, 2020

pritamdamania87 reviewed Sep 16, 2020

zhaojuanmao mentioned this pull request Sep 16, 2020

[reland] move rebuild buckets from end of first iteration to beginning of second iteration #44798

Closed

zhaojuanmao requested a review from rohan-varma as a code owner September 17, 2020 00:29

mrshenli reviewed Sep 23, 2020

mrshenli reviewed Sep 23, 2020

mrshenli approved these changes Sep 24, 2020

zhaojuanmao mentioned this pull request Sep 24, 2020

[ci-all tests] Make grad point to bucket buffer in DDP to save memory usage #45265

Closed

facebook-github-bot closed this in c6500bc Sep 25, 2020

facebook-github-bot added the merged label Sep 25, 2020

gunandrose4u mentioned this pull request Sep 25, 2020

[reland]Enable distributed package on windows, Gloo backend supported only #45335

Closed

facebook-github-bot deleted the gh/zhaojuanmao/53/head branch September 28, 2020 14:17

mruberry added the Merged label Oct 28, 2020

zhaojuanmao mentioned this pull request Apr 20, 2021

Avoid keeping two copies of gradients (param.grad and buckets) in DDP #37030

Closed

6 participants
