-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Remove ProcessGroupRoundRobin #87088
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
[ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/87088
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit a08aedf: This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
oh no, I think we can still keep this, as I heard NCCL will support multiple NCCL streams soon, so this progress group will be useful soon. also it can be used for progress group gloo or other backends |
zhaojuanmao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please do not delete this, thanks!
Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class. [ghstack-poisoned]
Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class. [ghstack-poisoned]
Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class. [ghstack-poisoned]
|
based on previous comments, I think we still need ProcessGroupRoundRobin, maybe abandon this PR? |
|
Once I get the PG round robin changes working with the ProcessGroup -> Backend changes in the PR above, I will abandon this PR |
|
Hi @zhaojuanmao, do you know if there is any library using I can think of a few reasons for removing it:
|
Wondering whether there are some other strong reasons to remove it? if removing it because of other strong reasons, we may still add it back as common util functions (if not allowed to be in C10D) to be shared by multiple higher level APIs. |
|
When I think about it more, I think there is another important reason to keep it. With this ProcessGroupRoundRobin, DDP or FSDP does not need to add another API argument for users to turn on concurrent stream communications, users just simply construct its own RoundRobin process group and pass it to DDP/FSDP's existing process_group argument. That means there will be no extra API change for DDP/FSDP in the near future when we want to promote concurrent stream communications for DDP/FSDP users |
|
There is a fundamental reason why we cannot support The point about DDP/FSDP API is a good one. I wonder, if DDP/FSDP is going to support concurrent communicators in future, can the |
|
ah, that is a good call out regarding round robin send/recv is not supported. In this case, DDP/FSDP/OtherAPIs should not support users to pass a list of process groups and call them in a round robin fashion internally, because these process groups support send/recv. On the other hand, we can limit ProcessGroupRoundRobin not support send/recv primitives, then the higher level APIs can safely take a ProcessGroupRoundRobin instance in:) |
|
Well, when multiple process groups are passed in as a list, it actually gives DDP/FSDP more flexibility. They can choose to iterate the list when it comes to collectives, and always use the first PG in the list when it comes to P2P. |
|
still not convinced to remove it so far:), agree that either way will work. seems that your main point is to reduce c10d maintenance cost (not support send/recv is OK, right? why have to support send/recv?), I prefer to keep it because it is a nice abstraction to be used by multiple higher level APIs transparently. Would like to hear others' feedback as well:) |
That is in theory not so nice as long as ProcessGroupRoundRobin remains a derived class of ProcessGroup (which defines these APIs).
Users/libraries would probably implement a variety of scheduling algorithms which they see as most performant for themselves, for example, earliest-finishing PG first, shortest-queue first, highest-priority PG first, etc. Round-robin is just a simplest one of them, hence not representative or abstract enough. |
|
I am in the camp of removing it since it is not used currently. It is easy resurrect code (feel free to revert commit whenever) in the case that we do need it / want to build on it for future. Overall, either option is fine with me, both options require minimal work and I dont believe it's a huge design decision. If everyone is still pretty adamant about keeping and removing, then a compromise is to just keep it in another release and if there are still no issues filed / people using it / our team doesn't use it, then we can remove it then. |
Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class. [ghstack-poisoned]
Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class. [ghstack-poisoned]
Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class. [ghstack-poisoned]
Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class. [ghstack-poisoned]
Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class. [ghstack-poisoned]
Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class. [ghstack-poisoned]
Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class. [ghstack-poisoned]
|
@H-Huang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class. Differential Revision: [D41415323](https://our.internmc.facebook.com/intern/diff/D41415323) [ghstack-poisoned]
Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class. Differential Revision: [D41415323](https://our.internmc.facebook.com/intern/diff/D41415323) [ghstack-poisoned]
Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class. Differential Revision: [D41415323](https://our.internmc.facebook.com/intern/diff/D41415323) [ghstack-poisoned]
Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class. Differential Revision: [D41415323](https://our.internmc.facebook.com/intern/diff/D41415323) [ghstack-poisoned]
|
@H-Huang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
1 similar comment
|
@H-Huang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
Closing and will revisit this later |
ghstack-source-id: e2d83cd Pull Request resolved: pytorch/pytorch#87088
Stack from ghstack:
Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class.