Fix RPC and ProcessGroup GIL deadlock #45088
Conversation
Fixes #45082

Found a few problems while working on #44983:

1. We deliberately swallow RPC timeouts during shutdown, as we haven't found a good way to handle those. When we converted `_wait_all_workers` into `_all_gather`, the same logic was inherited. However, as `_all_gather` is meant to be used in more general scenarios, we should no longer keep silent about errors. This commit lets the error propagate out of `_all_gather` and lets `shutdown()` catch and log it.
2. After fixing (1), I found that `UnpickledPythonCall` needs to acquire the GIL on destruction, and this can lead to a deadlock when used in conjunction with `ProcessGroup`, because the `ProcessGroup` ctor is a synchronization point that holds the GIL. In `init_rpc`, followers (`rank != 0`) can exit before the leader (`rank == 0`). If the two happen together, we can get the following: a follower exits `init_rpc` after running `_broadcast_to_followers` and before reaching the dtor of `UnpickledPythonCall`. It then runs the ctor of `ProcessGroup`, which holds the GIL and waits for the leader to join. However, the leader is waiting for the response from `_broadcast_to_followers`, which is blocked by the dtor of `UnpickledPythonCall`. Hence the deadlock. This commit drops the GIL in the `ProcessGroup` ctor.
3. After fixing (2), I found that the `TensorPipe` backend nondeterministically fails in `test_local_shutdown`, for a reason similar to (2), but this time it is that `shutdown()` on a follower runs before the leader finishes `init_rpc`. This commit adds a join to the `TensorPipe` backend `init_rpc` after `_all_gather`.

The 3rd one should be able to solve the 2nd one as well. But since I didn't see a reason to hold the GIL during the `ProcessGroup` ctor, I made that change too.

[ghstack-poisoned]
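Below is a minimal, self-contained sketch of the error-handling change in (1), assuming a simplified model of the RPC layer: the gather step now lets the first RPC error propagate instead of swallowing it, and `shutdown()` catches and logs it so local cleanup still proceeds. The function names and the `worker_futures` parameter are illustrative only, not the actual `torch.distributed.rpc` code.

```python
import logging

logger = logging.getLogger(__name__)


def _all_gather_sketch(worker_futures):
    # Wait for every worker's response; let the first error (e.g. an RPC
    # timeout) propagate instead of silently swallowing it, as the old
    # _wait_all_workers logic did.
    return {name: fut.wait() for name, fut in worker_futures.items()}


def shutdown_sketch(worker_futures):
    try:
        _all_gather_sketch(worker_futures)
    except RuntimeError as ex:
        # Graceful shutdown is best-effort: log the failure and continue
        # tearing down the local agent instead of crashing.
        logger.error("Failed to gather from all workers during shutdown: %s", ex)
    # ... proceed with local agent shutdown ...
```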
💊 CI failures summary and remediations

As of commit 82615bf (more details on the Dr. CI page):

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:
pritamdamania87
left a comment
Is it possible to add a unit test that would've caught this issue? Even a unit test that would fail once in 100 without this fix would be useful to catch any regressions we have in the future.
torch/distributed/rpc/api.py (Outdated)

    worker_name=follower_name, timeout=timeout
    )
    )
    fut.wait()
Previously we would call wait() on all futures during shutdown, but now we throw on the first one and probably don't wait for the other futures.
It would be nice to keep the same behavior as before. Can we add a try-catch here and save the exception? We can probably only throw the first/last exception we see and catch that in _wait_all_workers.
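A small sketch of the pattern suggested here, assuming `futs` is the list of futures collected above (illustrative only, not the actual `torch/distributed/rpc/api.py` code): wait on every future so we keep the old "wait for all workers" behavior, remember the first exception, and re-raise it at the end so the caller can still catch it.

```python
def _wait_on_all_sketch(futs):
    # Wait on every future even if some fail, preserving the previous
    # behavior of synchronizing with all workers during shutdown.
    first_error = None
    for fut in futs:
        try:
            fut.wait()
        except Exception as ex:
            # Remember only the first failure; re-raise it once all
            # futures have completed.
            if first_error is None:
                first_error = ex
    if first_error is not None:
        raise first_error
```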
I think the test described here: #45089 should reproduce the RPC/ProcessGroup GIL deadlock issue, so it should probably be usable as a unit test for (2).
After fixing (1), multiple existing tests failed due to (2) and (3). This problem didn't surface before because the existing code swallowed the failure, timed out in 5s, and only logged the error message.
Thanks, let me pull that in as a test.
Failures in macOS tests are already there on master, see:
It's actually failing for both ProcessGroup and TensorPipe, but we don't have an issue to track it. See the test below. Thanks for opening those issues.
Trying a fix for those 6 issues: #45126 |
    try:
        _tensorpipe_check_device_maps(agent, rpc_backend_options.device_maps)
        agent.join()
This makes me a bit sad, as I thought in principle we wanted to eventually get rid of the sync and join methods of the agent, and adding more usages will just make it harder. What is the reason for doing this actually? By this time there shouldn't be any RPCs in flight so joining seems to be unnecessary. The only "side effect" I see is that it adds a barrier between all agents. If that's why we use it, can we just make an explicit barrier instead of using join?
This makes me a bit sad, as I thought in principle we wanted to eventually get rid of the sync and join methods of the agent, and adding more usages will just make it harder.
Agree, I am sad too.
What is the reason for doing this actually? By this time there shouldn't be any RPCs in flight so joining seems to be unnecessary.
There can be RPCs at this point. Assume there are two processes X and Y. Y could finish init_rpc earlier than X and then start sending RPCs to X. At this point, without this join, it is possible that X has not yet configured the reverse device map, and hence hits failures.
The only "side effect" I see is that it adds a barrier between all agents. If that's why we use it, can we just make an explicit barrier instead of using join?
Which barrier are you referring to? The dist.barrier()? Will that add another dependency on c10d? I cannot tell which one is worse. :)
I see, thanks for clarifying that there could be RPCs already in flight, I hadn't realized.
So the problem, at a high level, seems to me to be that we run into trouble if one rank exits init_rpc early and starts doing stuff while other ranks are potentially still inside init_rpc, right? If that's the case, can't we fix it by doing a barrier at the very end of init_rpc?
On one hand, that would nicely match what @pritamdamania is doing for DDP in #45181 (adding a barrier at the end of init_process_group and new_group). Also, since a barrier is a "weaker" operation than joining the agent (because the latter performs multiple barriers), I'd consider a barrier to be more readable as it makes it clear what the core issue is and what the essential solution is, without going "overkill".
And, if we're doing a barrier, we could use `_all_gather` and thus avoid both the (deprecated?) `agent.join()` and a new dependency on c10d.
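A rough sketch of this barrier idea, assuming `_all_gather` accepts a single payload argument (the helper name `_init_rpc_barrier` is hypothetical; this is not the change that landed in this PR, which calls `agent.join()`, and a later comment in this thread notes the `_all_gather` variant turned out to be flaky):

```python
from torch.distributed.rpc import api


def _init_rpc_barrier():
    # Use a no-payload _all_gather purely for synchronization: block until
    # every RPC worker reaches this point, so no rank leaves init_rpc and
    # starts sending user RPCs while another rank is still initializing.
    api._all_gather(None)
```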
Yep, I would expect _all_gather to work too. I actually have one PR for that #44990. Let me land that one too.
Using _all_gather makes test_local_shutdown flaky. Will come back to debug this after branch cut.
Sounds good, thanks!
Stack from ghstack:

Fixes #45082

Found a few problems while working on #44983:

1. We deliberately swallow RPC timeouts during shutdown, as we haven't found a good way to handle those. When we converted `_wait_all_workers` into `_all_gather`, the same logic was inherited. However, as `_all_gather` is meant to be used in more general scenarios, we should no longer keep silent about errors. This commit lets the error propagate out of `_all_gather` and lets `shutdown()` catch and log it.
2. After fixing (1), I found that `UnpickledPythonCall` needs to acquire the GIL on destruction, and this can lead to a deadlock when used in conjunction with `ProcessGroup`, because the `ProcessGroup` ctor is a synchronization point that holds the GIL. In `init_rpc`, followers (`rank != 0`) can exit before the leader (`rank == 0`). If the two happen together, we can get the following: a follower exits `init_rpc` after running `_broadcast_to_followers` and before reaching the dtor of `UnpickledPythonCall`. It then runs the ctor of `ProcessGroup`, which holds the GIL and waits for the leader to join. However, the leader is waiting for the response from `_broadcast_to_followers`, which is blocked by the dtor of `UnpickledPythonCall`. Hence the deadlock. This commit drops the GIL in the `ProcessGroup` ctor.
3. After fixing (2), I found that the `TensorPipe` backend nondeterministically fails in `test_local_shutdown`, for a reason similar to (2), but this time it is that `shutdown()` on a follower runs before the leader finishes `init_rpc`. This commit adds a join to the `TensorPipe` backend `init_rpc` after `_all_gather`.

The 3rd one should be able to solve the 2nd one as well. But since I didn't see a reason to hold the GIL during the `ProcessGroup` ctor, I made that change too.

Differential Revision: [D23825592](https://our.internmc.facebook.com/intern/diff/D23825592)