[dist_optim] introduce distributed functional optimizer #45221
Conversation
Stack from ghstack:

This PR introduces a distributed functional optimizer so that the distributed optimizer can reuse the functional optimizer APIs and maintain its own state. This enables a TorchScript-compatible functional optimizer when using the distributed optimizer, which helps get rid of the GIL and improves the overall performance of training, especially distributed model parallel training.

Differential Revision: D23935256
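For illustration, a minimal sketch of what a functional optimizer in this spirit could look like; the class name, constructor arguments, and update rule below are assumptions for the sketch, not the PR's actual code:

```python
from typing import List, Optional

import torch
from torch import Tensor


# Hypothetical sketch: a TorchScript class that owns its parameters and
# per-parameter state, and applies explicitly passed gradients (instead of
# reading param.grad), so it can be constructed and stepped on a remote
# worker without going through the regular Python optimizer API.
@torch.jit.script
class _FunctionalAdagradSketch(object):
    def __init__(self, params: List[Tensor], lr: float = 1e-2, eps: float = 1e-10):
        self.params = params
        self.lr = lr
        self.eps = eps
        # State is kept inside the optimizer object itself.
        self.state_sums = [torch.zeros_like(p) for p in params]

    def step(self, gradients: List[Optional[Tensor]]):
        with torch.no_grad():
            for i in range(len(self.params)):
                grad = gradients[i]
                if grad is not None:
                    self.state_sums[i].addcmul_(grad, grad, value=1.0)
                    std = self.state_sums[i].sqrt().add_(self.eps)
                    self.params[i].addcdiv_(grad, std, value=-self.lr)


# Example usage: gradients are collected by the caller (e.g. from the
# distributed autograd context) and handed to the optimizer explicitly.
params = [torch.randn(4, requires_grad=True)]
opt = _FunctionalAdagradSketch(params, lr=0.1)
opt.step([torch.ones(4)])
```

Because the class is scripted and gradients are passed in explicitly, a remote worker can step it from compiled code rather than through the Python optimizer interface.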
💊 CI failures summary (Dr. CI, as of commit 500ac24):
XLA failure: job pytorch_xla_linux_bionic_py3_6_clang9_test is failing.
pritamdamania87 left a comment:
Looks good overall, requesting changes to see if we can further optimize the local optimizer step to run completely in TorchScript.
@@ -0,0 +1,90 @@
from typing import List, Dict, Optional
nit: maybe rename this file to functional_adagrad.py
torch/distributed/optim/adagrad.py (Outdated)
# NOTE: This should be only used by distributed optimizer internals
# and not meant to expose to the user.
@torch.jit.script
class FunctionalAdagrad(object):
nit: Should this be _FunctionalAdagrad for now to indicate it's internal only?
# TODO: use foreach API in optim.functional to do all the computation

def _make_sparse(grad, grad_indices, values):
@vincentqb Could you review these changes?
LGTM. What is the advantage of writing it this way?
It's mainly because nested functions are not supported by TorchScript, so I moved it out as a separate function. Also, I feel like defining a nested function inside a for loop is not good practice.
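For context, a small illustrative example (hypothetical names, not the PR's code) of the restructuring described here: TorchScript cannot compile a nested function definition, so the helper is hoisted to module level where it can be compiled and called from scripted code.

```python
import torch
from torch import Tensor

# TorchScript cannot compile a nested definition like this:
#
#   @torch.jit.script
#   def update(grad: Tensor) -> Tensor:
#       def scale(g: Tensor) -> Tensor:  # nested function: unsupported
#           return g * 0.5
#       return scale(grad)
#
# Hoisting the helper to module level makes both functions scriptable;
# TorchScript compiles _scale recursively when it compiles update.

def _scale(g: Tensor) -> Tensor:
    return g * 0.5


@torch.jit.script
def update(grad: Tensor) -> Tensor:
    return _scale(grad)
```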
torch/distributed/optim/optimizer.py (Outdated)
  rpc_futs.append(rpc.rpc_async(
-     optim.owner(),
+     optimizer.owner(),
      _local_optimizer_step,
Can we have two versions of _local_optimizer_step? If we're using a TorchScript functional optimizer, we can have a TorchScript _local_optimizer_step, which means RPC will call a TorchScript function and there would be no GIL anywhere. If it is an optimizer that is not TorchScript, we call a regular Python _local_optimizer_step.
In the current version, we still end up holding the GIL since _local_optimizer_step is not TorchScript.
Hmm, I see, this makes sense. The problem is that _LocalOptimizer still contains Lock() objects to protect the optimizers that don't have a functional optimizer implementation. I could try adding a new _LocalScriptOptimizer to enable the use case you mentioned.
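To make the trade-off concrete, here is a hedged sketch of the two paths being discussed; the class names and bodies are illustrative assumptions rather than the PR's implementation. The Python wrapper serializes steps with a lock, while the scripted wrapper can update parameters without the lock or the GIL.

```python
import threading
from typing import List, Optional

import torch
from torch import Tensor


class _LocalOptimizerSketch(object):
    # Python path: a process-wide lock guards step() because regular Python
    # optimizers stepping shared parameters from multiple RPC threads are not
    # safe to run concurrently.
    global_lock = threading.Lock()

    def __init__(self, params: List[Tensor], lr: float):
        self.params = params
        self.lr = lr

    def step(self, gradients: List[Optional[Tensor]]):
        with _LocalOptimizerSketch.global_lock, torch.no_grad():
            for param, grad in zip(self.params, gradients):
                if grad is not None:
                    param.add_(grad, alpha=-self.lr)


@torch.jit.script
class _LocalScriptOptimizerSketch(object):
    # TorchScript path: step() is compiled, so it can run without the lock
    # and without holding the Python GIL while updating parameters.
    def __init__(self, params: List[Tensor], lr: float):
        self.params = params
        self.lr = lr

    def step(self, gradients: List[Optional[Tensor]]):
        with torch.no_grad():
            for i in range(len(self.params)):
                grad = gradients[i]
                if grad is not None:
                    self.params[i].add_(grad, alpha=-self.lr)
```

With two wrappers along these lines, _local_optimizer_step could also have a scripted variant that RPC invokes directly for functional optimizers, which would be the GIL-free path suggested above.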