Removing unnecessary wait_tensor interception in LocalTensor #169734
dzmitry-huba wants to merge 1 commit into gh/dzmitry-huba/18/base
Conversation
The base implementation of wait_tensor pops the work from the registry, waits on it, and then returns the passed-in object. The current implementation incorrectly skips draining the registry, which leaks work objects and may produce incorrect results when using LocalRunner.
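For context, here is a minimal sketch of the registry pattern the description refers to. The class and method names are illustrative only, not the actual torch internals touched by this PR; the point is that the base wait_tensor pops the pending work for a tensor, waits on it, and returns the same tensor, so every registered work object eventually gets drained.

```python
# Illustrative sketch only; WorkRegistry here is a hypothetical stand-in for
# the real c10d work registry, not an actual PyTorch API.
class WorkRegistry:
    """Maps tensors to their pending collective Work objects."""

    def __init__(self):
        self._pending = {}  # id(tensor) -> Work

    def register(self, tensor, work):
        self._pending[id(tensor)] = work

    def pop_and_wait(self, tensor):
        # Base wait_tensor behavior: pop the pending work (draining the
        # registry), block on it, then return the passed-in tensor.
        work = self._pending.pop(id(tensor), None)
        if work is not None:
            work.wait()
        return tensor
```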
```python
            return _c10d._local_functional_reduce_scatter_tensor(*args, **kwargs)
        elif func is torch.ops._c10d_functional.all_to_all_single.default:
            return _c10d._local_functional_all_to_all_single(*args, **kwargs)
        elif func is torch.ops._c10d_functional.wait_tensor.default:
```
Is the explanation of what went wrong here that we intercepted the wait_tensor ops, but lost the part of their implementation that interacted with the work registry?
Follow-up question: if we are doing LocalTensor stuff, we are not doing any actual collectives in the first place, so we shouldn't have actual work objects registered to a process group, right? Should we stop pushing wait objects into the global registry when they are not genuine waits? Maybe this does not matter, I'm not sure.
#1 Correct, we lost the part that interacted with the WorkRegistry.
#2 While we intercept some functional collectives, we do not intercept all of them (for example, those whose output allocation depends on inputs that, under LocalTensor, may be rank-specific). Others may be implemented in terms of standard collectives; those implementations take the returned Work object (even if it is fake) and register it with the registry. Hence the waits may be fake, but the work registry is still non-empty.
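To illustrate the scenario in #2 with made-up helper names (FakeWork, fallback_all_reduce, default_wait_tensor, and the _pending dict below are stand-ins, not real LocalTensor/LocalRunner code): a collective that falls back to a standard implementation still registers its (fake) Work object, so a wait_tensor override that skips the registry pop leaves that object behind.

```python
# Illustrative only: how a fake Work can still end up in the registry under
# LocalTensor when a functional collective falls back to a standard one.
class FakeWork:
    def wait(self):
        pass  # nothing to wait for: no real communication happened


_pending = {}  # id(tensor) -> Work; stand-in for the global work registry


def fallback_all_reduce(tensor):
    # A collective implemented in terms of a standard collective still
    # registers the Work object it gets back, even if that work is fake.
    _pending[id(tensor)] = FakeWork()
    return tensor


def default_wait_tensor(tensor):
    # The base wait_tensor drains the registry. If an interception skips this
    # pop, the FakeWork above is never removed, which is the leak the PR
    # description mentions.
    work = _pending.pop(id(tensor), None)
    if work is not None:
        work.wait()
    return tensor
```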
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
Pull Request resolved: #169734. Approved by: https://github.com/dolpm