-
Notifications
You must be signed in to change notification settings - Fork 26.3k
[ddp] parameter verification #74113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ddp] parameter verification #74113
Conversation
Check mismatch in # of parameters by broadcasting and verifying from rank 0. As a result, non-zero ranks raise an error when # of parameters are mismatched across ranks. Closes #73547 Differential Revision: [D34772067](https://our.internmc.facebook.com/intern/diff/D34772067/) [ghstack-poisoned]
CI Flow Status⚛️ CI FlowRuleset - Version:
|
🔗 Helpful links
💊 CI failures summary and remediationsAs of commit 62f0b92 (more details on the Dr. CI page): 💚 💚 Looks good so far! There are no failures yet. 💚 💚 This comment was automatically generated by Dr. CI (expand for details).Please report bugs/suggestions to the (internal) Dr. CI Users group. |
Check mismatch in # of parameters by broadcasting and verifying from rank 0. As a result, non-zero ranks raise an error when # of parameters are mismatched across ranks. Closes #73547 Differential Revision: [D34772067](https://our.internmc.facebook.com/intern/diff/D34772067/) [ghstack-poisoned]
Pull Request resolved: #74113 Check mismatch in # of parameters by broadcasting and verifying from rank 0. As a result, non-zero ranks raise an error when # of parameters are mismatched across ranks. Closes #73547 ghstack-source-id: 151159056 Differential Revision: [D34772067](https://our.internmc.facebook.com/intern/diff/D34772067/)
Check mismatch in # of parameters by broadcasting and verifying from rank 0. As a result, non-zero ranks raise an error when # of parameters are mismatched across ranks. Closes #73547 Differential Revision: [D34772067](https://our.internmc.facebook.com/intern/diff/D34772067/) [ghstack-poisoned]
Pull Request resolved: #74113 Check mismatch in # of parameters by broadcasting and verifying from rank 0. As a result, non-zero ranks raise an error when # of parameters are mismatched across ranks. Closes #73547 ghstack-source-id: 151191152 Differential Revision: [D34772067](https://our.internmc.facebook.com/intern/diff/D34772067/)
Check mismatch in # of parameters by broadcasting and verifying from rank 0. As a result, non-zero ranks raise an error when # of parameters are mismatched across ranks. Closes #73547 Differential Revision: [D34772067](https://our.internmc.facebook.com/intern/diff/D34772067/) [ghstack-poisoned]
Pull Request resolved: #74113 Check mismatch in # of parameters by broadcasting and verifying from rank 0. As a result, non-zero ranks raise an error when # of parameters are mismatched across ranks. Closes #73547 ghstack-source-id: 151275647 Differential Revision: [D34772067](https://our.internmc.facebook.com/intern/diff/D34772067/)
mrshenli
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stamp to unblock
|
|
||
| // Broadcast and verify parameter size. | ||
| std::vector<at::Tensor> param_size_vec{param_size_tensor}; | ||
| process_group->broadcast(param_size_vec)->wait(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit (can be done as future BE work item): DDP ctor will broadcast param values anyway. Is it possible to implement these two broadcasts in the same location using two async ops? I recall the param value broadcast was done in the distributed.py file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there's actually 3 collectives that go on now:
- parameter size verification
- parameter shape verification
- parameter broadcast
We could make these all async, and wait on all three, and register then() callbacks on the first two to raise appropriate errors. Filed #74185
| std::vector<at::Tensor> param_size_vec{param_size_tensor}; | ||
| process_group->broadcast(param_size_vec)->wait(); | ||
| auto res = param_size_tensor[0].item<int>(); | ||
| TORCH_CHECK( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
curious, does this mean some DDP process will crash, while other won't if their size matches? If that's the case, does it mean allgather might be better, as all processes can make the same decision here and throw the same error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point, it would be a lot better if all processes make the same decision. Next update will fix this
| group_gloo = dist.new_group( | ||
| timeout=timedelta(seconds=60), backend=dist.Backend.GLOO | ||
| ) | ||
| # Set NCCL_BLOCKING_WAIT and use a new NCCL group to improve test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
curious, why this can help improve determinism?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd like NCCL_BLOCKING_WAIT so that rank 0 doesn't need to be taken down by async error handling which is non-deterministic when it kicks in and runs, and it also takes down the process which is hard to accomodate for in unittest.
As a result, we also need a new NCCL group as the default NCCL group is initialized before this test executes as a part of setup. I'll add some clarifying comments.
Check mismatch in # of parameters by broadcasting and verifying from rank 0. As a result, non-zero ranks raise an error when # of parameters are mismatched across ranks. Closes #73547 Differential Revision: [D34772067](https://our.internmc.facebook.com/intern/diff/D34772067/) [ghstack-poisoned]
Pull Request resolved: #74113 Check mismatch in # of parameters by broadcasting and verifying from rank 0. As a result, non-zero ranks raise an error when # of parameters are mismatched across ranks. Closes #73547 ghstack-source-id: 151319259 Differential Revision: [D34772067](https://our.internmc.facebook.com/intern/diff/D34772067/)
Summary: Pull Request resolved: #74113 Check mismatch in # of parameters by broadcasting and verifying from rank 0. As a result, non-zero ranks raise an error when # of parameters are mismatched across ranks. Closes #73547 ghstack-source-id: 151319259 Test Plan: UT Reviewed By: mrshenli Differential Revision: D34772067 fbshipit-source-id: 456933111e9996823f1a220b474998e17fb74210
Stack from ghstack (oldest at bottom):
Check mismatch in # of parameters by broadcasting and verifying from rank 0. As a result, non-zero ranks raise an error when # of parameters are mismatched across ranks.
Closes #73547
Differential Revision: D34772067