
Add option to FakeProcessGroup to raise error if comms are invoked.#162841

Closed
ezyang wants to merge 3 commits into gh/ezyang/3151/base from gh/ezyang/3151/head

Conversation

ezyang (Contributor) commented Sep 12, 2025

Stack from ghstack (oldest at bottom):

The current behavior is to do "nothing", which means you will silently corrupt
data. If you are doing something similar to LocalTensor, where you override the
behavior of collectives so that they actually compute something numerically,
this is unwelcome behavior. Erroring out when a comm is invoked helps prevent
silent numerical incorrectness.

Authored with claude code.

Signed-off-by: Edward Yang ezyang@meta.com

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci

[ghstack-poisoned]
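To make the failure mode concrete, here is a minimal sketch against the existing fake-backend entry point (`backend="fake"` plus a `FakeStore` from `torch.testing._internal.distributed.fake_pg`). The error-raising behavior this PR adds is only described in a comment, since the exact option name and spelling live in the diff rather than in this description:

```python
import torch
import torch.distributed as dist
from torch.testing._internal.distributed.fake_pg import FakeStore

# A fake process group: collectives complete immediately without touching
# any data, which is exactly the "do nothing" behavior described above.
dist.init_process_group(backend="fake", rank=0, world_size=4, store=FakeStore())

t = torch.ones(4)
dist.all_reduce(t)  # no-op under the fake backend
print(t)            # still all ones, not what a real 4-rank all_reduce would produce

# Hypothetical: with the opt-in error mode this PR adds (option name not shown
# here), the same dist.all_reduce(t) call would raise instead of silently
# returning, surfacing the missing collective immediately.

dist.destroy_process_group()
```

Because the collective returns a completed work object, nothing downstream signals that the reduction never happened; the opt-in error turns that silent wrong answer into an immediate failure.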
pytorch-bot bot commented Sep 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162841

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 36f0bab with merge base d633bac:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ezyang added a commit that referenced this pull request Sep 12, 2025
ghstack-source-id: 9dff853
Pull-Request: #162841
pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Sep 12, 2025
albanD removed their request for review September 12, 2025 20:09
ezyang added a commit that referenced this pull request Sep 29, 2025
ghstack-source-id: 3d4910b
Pull-Request: #162841
ezyang (Contributor, Author) commented Sep 29, 2025

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk label Sep 29, 2025
pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 mandatory check failed. Dig deeper by viewing the failures on hud.

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

ezyang added a commit that referenced this pull request Sep 29, 2025
ghstack-source-id: 98cddeb
Pull-Request: #162841
ezyang (Contributor, Author) commented Sep 29, 2025

@pytorchbot merge

pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot (Collaborator) commented:

Merge failed

Reason: 1 job has failed; the first of them is: trunk / linux-jammy-rocm-py3.10 / test (distributed, 1, 1, linux.rocm.gpu.gfx942.4)

Details for Dev Infra team: raised by workflow job.

ezyang (Contributor, Author) commented Oct 1, 2025

@pytorchbot merge -i

pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged while ignoring the following check: trunk / linux-jammy-rocm-py3.10 / test (distributed, 1, 1, linux.rocm.gpu.gfx942.4)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
Add option to FakeProcessGroup to raise error if comms are invoked. (pytorch#162841)


Pull Request resolved: pytorch#162841
Approved by: https://github.com/dcci
github-actions bot deleted the gh/ezyang/3151/head branch November 1, 2025 02:20

Labels

ciflow/trunk · Merged · oncall: distributed · release notes: distributed (c10d)
