
Removing unnecessary wait_tensor interception in LocalTensor #169734

Closed
dzmitry-huba wants to merge 1 commit into gh/dzmitry-huba/18/base from gh/dzmitry-huba/18/head

Conversation

@dzmitry-huba
Contributor

dzmitry-huba commented Dec 6, 2025

Stack from ghstack (oldest at bottom):

The base implementation of wait_tensor pops the pending work from the registry, waits
on it, and then returns the passed-in object. Not draining the registry
(as the current implementation incorrectly does) leaks work objects
and may produce incorrect results when using LocalRunner.

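As a rough illustration of that contract, a minimal sketch follows; WORK_REGISTRY, FakeWork, and the storage-pointer keying are illustrative stand-ins, not the actual c10d internals.

```python
# Minimal sketch of the base wait_tensor contract described above.
# All names here are hypothetical stand-ins for the real c10d machinery.
from typing import Dict

import torch


class FakeWork:
    """Stand-in for a c10d Work handle returned by an async collective."""

    def wait(self) -> None:
        pass  # a real handle would block until the collective completes


# Pending work, keyed by the output tensor's storage pointer.
WORK_REGISTRY: Dict[int, FakeWork] = {}


def wait_tensor(tensor: torch.Tensor) -> torch.Tensor:
    # Pop (not merely read) the pending work so the registry is drained;
    # leaving the entry behind is exactly the leak described above.
    work = WORK_REGISTRY.pop(tensor.untyped_storage().data_ptr(), None)
    if work is not None:
        work.wait()
    # The base implementation returns the object that was passed in.
    return tensor
```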
@pytorch-bot
Copy link

pytorch-bot bot commented Dec 6, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169734

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 1ed7463 with merge base fbe9a5b:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the release notes: distributed (c10d) label Dec 6, 2025
dzmitry-huba added a commit that referenced this pull request Dec 6, 2025

ghstack-source-id: 9d21e77
Pull Request resolved: #169734
    # Diff context: LocalTensor's dispatch chain for intercepted
    # functional collectives.
        return _c10d._local_functional_reduce_scatter_tensor(*args, **kwargs)
    elif func is torch.ops._c10d_functional.all_to_all_single.default:
        return _c10d._local_functional_all_to_all_single(*args, **kwargs)
    # The wait_tensor branch below is the interception removed by this PR.
    elif func is torch.ops._c10d_functional.wait_tensor.default:
Contributor

Is the explanation of what went wrong here that we intercepted the wait_tensor ops, but lost the part of their implementation that interacted with the work registry?

Follow-up question: if we are doing LocalTensor stuff, we are not doing any actual collectives in the first place, so we shouldn't have actual work objects registered to a process group, right? Should we stop pushing wait objects into the global registry when they are not genuine waits? Maybe this does not matter; I'm not sure.

Contributor Author

#1 Correct, we lost the part that interacted with the WorkRegistry.

#2 While we intercept some functional collectives, we do not intercept all of them (for example, those whose output allocation depends on inputs that, under local tensor, may be specific per rank). Others may be implemented in terms of standard collectives; those implementations take the returned Work object (even if it is fake) and register it with the registry. Hence the waits may be fake, but the work registry is still non-empty.
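A short sketch of the two paths described in #2; every name below is a hypothetical stand-in for the real LocalTensor/c10d machinery, not the actual implementation.

```python
# Illustrative only: why the registry can be non-empty under LocalTensor.
from typing import Dict

import torch


class FakeWork:
    """Fake Work handle: there is no real async collective to wait on."""

    def wait(self) -> None:
        pass


# Hypothetical global registry of pending work, keyed by storage pointer.
WORK_REGISTRY: Dict[int, FakeWork] = {}


def intercepted_all_reduce(t: torch.Tensor) -> torch.Tensor:
    # Path 1: the op is intercepted and simulated directly, so no Work
    # handle is ever created or registered.
    return t.clone()


def fallthrough_all_gather(t: torch.Tensor) -> torch.Tensor:
    # Path 2: the op is implemented on top of a standard collective, whose
    # returned Work handle (fake or not) is registered as usual.
    out = t.clone()
    WORK_REGISTRY[out.untyped_storage().data_ptr()] = FakeWork()
    return out


_ = intercepted_all_reduce(torch.ones(4))  # leaves the registry untouched
out = fallthrough_all_gather(torch.ones(4))
# Even though the wait is a no-op, wait_tensor must still pop the entry;
# intercepting wait_tensor without draining leaves FakeWork objects behind.
work = WORK_REGISTRY.pop(out.untyped_storage().data_ptr())
work.wait()
assert not WORK_REGISTRY
```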

@dzmitry-huba
Contributor Author

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk label Dec 6, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

umechand-amd pushed a commit to ROCm/pytorch that referenced this pull request Dec 8, 2025
…#169734)

Pull Request resolved: pytorch#169734
Approved by: https://github.com/dolpm
JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025
Pull Request resolved: #169734
Approved by: https://github.com/dolpm
github-actions bot deleted the gh/dzmitry-huba/18/head branch January 6, 2026 02:19

Labels

ciflow/trunk, Merged, release notes: distributed (c10d)
