Skip to content

Conversation

@H-Huang
Copy link
Member

@H-Huang H-Huang commented Oct 17, 2022

Stack from ghstack:

Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class.

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented Oct 17, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/87088

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a08aedf:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@zhaojuanmao
Copy link
Contributor

oh no, I think we can still keep this, as I heard NCCL will support multiple NCCL streams soon, so this progress group will be useful soon. also it can be used for progress group gloo or other backends

Copy link
Contributor

@zhaojuanmao zhaojuanmao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please do not delete this, thanks!

Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class.

[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Oct 31, 2022
ghstack-source-id: 4e095a2
Pull Request resolved: #87088
Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class.

[ghstack-poisoned]
Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class.

[ghstack-poisoned]
@zhaojuanmao
Copy link
Contributor

based on previous comments, I think we still need ProcessGroupRoundRobin, maybe abandon this PR?

@H-Huang
Copy link
Member Author

H-Huang commented Nov 2, 2022

Once I get the PG round robin changes working with the ProcessGroup -> Backend changes in the PR above, I will abandon this PR

@kwen2501
Copy link
Collaborator

kwen2501 commented Nov 3, 2022

Hi @zhaojuanmao, do you know if there is any library using ProcessGroupRoundRobin?

I can think of a few reasons for removing it:

  • We have marked it as deprecated in PyTorch 1.13; and the API for creating it (_round_robin_process_groups) is never an official API;
  • Even after NCCL supports concurrent communicators, I would prefer for user to write their own PG iterator, because dist APIs take pg as an argument, so it can be as simple as:
dist.all_reduce(tensor, pg=next(pool))

@zhaojuanmao
Copy link
Contributor

  1. ProcessGroupRoundRobin is not official API because it is not ready for use when NCCL does not support concurrent NCCL streams
  2. I can imagine a lot of users (DDP/FSDP or other dist training APIs in OSS) can take advantages of using ProcessGroupRoundRobin soon, each of these users do not need to manage their progress group iterators on their own. From this perspective, this looks like to be a nice abstraction.

Wondering whether there are some other strong reasons to remove it? if removing it because of other strong reasons, we may still add it back as common util functions (if not allowed to be in C10D) to be shared by multiple higher level APIs.

@zhaojuanmao
Copy link
Contributor

When I think about it more, I think there is another important reason to keep it. With this ProcessGroupRoundRobin, DDP or FSDP does not need to add another API argument for users to turn on concurrent stream communications, users just simply construct its own RoundRobin process group and pass it to DDP/FSDP's existing process_group argument. That means there will be no extra API change for DDP/FSDP in the near future when we want to promote concurrent stream communications for DDP/FSDP users

@kwen2501
Copy link
Collaborator

kwen2501 commented Nov 3, 2022

There is a fundamental reason why we cannot support ProcessGroupRoundRobin: it requires all ranks to advance their iterators synchronously; otherwise there will be hang. This is incompatible with send/recv's, which not all ranks would call into.

The point about DDP/FSDP API is a good one. I wonder, if DDP/FSDP is going to support concurrent communicators in future, can the process_group kwarg be extended to take a Union? In fact, it is already an implicit Union today because it accepts value None. When DDP or FSDP detects that the passed-in is a List, it would then iterate over it. This must be under DDP or FSDP's knowledge that they don't have any send/recv calls therein. (But I don't know if this is an easy promise to make from DDP and FSDP future perspective -- once they need to use send/recv, they will have to break the promise that they support process groups in a round-robin fashion.)

@zhaojuanmao
Copy link
Contributor

zhaojuanmao commented Nov 3, 2022

ah, that is a good call out regarding round robin send/recv is not supported. In this case, DDP/FSDP/OtherAPIs should not support users to pass a list of process groups and call them in a round robin fashion internally, because these process groups support send/recv. On the other hand, we can limit ProcessGroupRoundRobin not support send/recv primitives, then the higher level APIs can safely take a ProcessGroupRoundRobin instance in:)

@kwen2501
Copy link
Collaborator

kwen2501 commented Nov 3, 2022

Well, when multiple process groups are passed in as a list, it actually gives DDP/FSDP more flexibility. They can choose to iterate the list when it comes to collectives, and always use the first PG in the list when it comes to P2P.

@zhaojuanmao
Copy link
Contributor

still not convinced to remove it so far:), agree that either way will work. seems that your main point is to reduce c10d maintenance cost (not support send/recv is OK, right? why have to support send/recv?), I prefer to keep it because it is a nice abstraction to be used by multiple higher level APIs transparently.

Would like to hear others' feedback as well:)

@kwen2501
Copy link
Collaborator

kwen2501 commented Nov 3, 2022

  • Sure. Just to follow up with the question of not supporting send/recv:

That is in theory not so nice as long as ProcessGroupRoundRobin remains a derived class of ProcessGroup (which defines these APIs).

  • In real use cases of multiple process groups:

Users/libraries would probably implement a variety of scheduling algorithms which they see as most performant for themselves, for example, earliest-finishing PG first, shortest-queue first, highest-priority PG first, etc. Round-robin is just a simplest one of them, hence not representative or abstract enough.

@H-Huang
Copy link
Member Author

H-Huang commented Nov 3, 2022

I am in the camp of removing it since it is not used currently. It is easy resurrect code (feel free to revert commit whenever) in the case that we do need it / want to build on it for future.

Overall, either option is fine with me, both options require minimal work and I dont believe it's a huge design decision. If everyone is still pretty adamant about keeping and removing, then a compromise is to just keep it in another release and if there are still no issues filed / people using it / our team doesn't use it, then we can remove it then.

Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class.

[ghstack-poisoned]
Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class.

[ghstack-poisoned]
Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class.

[ghstack-poisoned]
Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class.

[ghstack-poisoned]
Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class.

[ghstack-poisoned]
Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class.

[ghstack-poisoned]
Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class.

[ghstack-poisoned]
@H-Huang
Copy link
Member Author

H-Huang commented Nov 18, 2022

@H-Huang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class.

Differential Revision: [D41415323](https://our.internmc.facebook.com/intern/diff/D41415323)

[ghstack-poisoned]
Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class.

Differential Revision: [D41415323](https://our.internmc.facebook.com/intern/diff/D41415323)

[ghstack-poisoned]
Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class.

Differential Revision: [D41415323](https://our.internmc.facebook.com/intern/diff/D41415323)

[ghstack-poisoned]
Removes ProcessGroupRoundRobin which has been marked to be deprecated after 1.13 release. This is part of the stack in anticipation of the change to move all ProcessGroups to derive from the Backend class.

Differential Revision: [D41415323](https://our.internmc.facebook.com/intern/diff/D41415323)

[ghstack-poisoned]
@H-Huang
Copy link
Member Author

H-Huang commented Dec 1, 2022

@H-Huang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

1 similar comment
@H-Huang
Copy link
Member Author

H-Huang commented Dec 6, 2022

@H-Huang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@H-Huang
Copy link
Member Author

H-Huang commented Dec 7, 2022

Closing and will revisit this later

@H-Huang H-Huang closed this Dec 7, 2022
hasanyeganeh pushed a commit to hasanyeganeh/pytorch-pytorch that referenced this pull request Dec 21, 2022
ghstack-source-id: e2d83cd
Pull Request resolved: pytorch/pytorch#87088
@facebook-github-bot facebook-github-bot deleted the gh/H-Huang/88/head branch June 8, 2023 14:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants