multimem reduce #164517

Closed
kwen2501 wants to merge 2 commits into gh/kwen2501/272/base from gh/kwen2501/272/head

Conversation

kwen2501 (Collaborator) commented Oct 2, 2025:

Stack from ghstack (oldest at bottom):

Modified the `multimem_one_shot_all_reduce_out` function to accept a `root` argument, making it a `multimem_reduce` op.

The original `multimem_one_shot_all_reduce` op becomes a caller of `multimem_reduce`, with each rank providing its own rank id as the root.
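
For illustration, here is a minimal host-side sketch of that relationship, not the actual PR diff: the all-reduce signature and the `get_rank_in_group` helper are assumptions, while the `multimem_one_shot_reduce_out` signature follows the snippet quoted later in this thread.

at::Tensor multimem_one_shot_all_reduce_out(
    const at::Tensor& input,
    std::string reduce_op,
    std::string group_name,
    at::Tensor out) {
  // Hypothetical helper: look up this rank's index within `group_name`.
  int64_t my_rank = get_rank_in_group(group_name);
  // Every rank passes its own rank as the root, so each rank reduces into
  // its own output buffer and the one-shot reduce behaves as an all-reduce.
  return multimem_one_shot_reduce_out(
      input, std::move(reduce_op), /*root=*/my_rank, std::move(group_name), out);
}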

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci

[ghstack-poisoned]
pytorch-bot bot commented Oct 2, 2025:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/164517

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 2f475da with merge base a707042:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kwen2501 added a commit that referenced this pull request Oct 2, 2025
ghstack-source-id: d04f156
Pull-Request: #164517
pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels on Oct 2, 2025
kwen2501 requested review from fduwjj, fegin and ngimel on Oct 2, 2025 at 22:39
kwen2501 added the release notes: distributed (symm_mem) label on Oct 2, 2025
kwen2501 (Collaborator, Author) commented Oct 3, 2025:

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk label on Oct 3, 2025
pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

shunting314 (Contributor) commented:

Do we drop support for CUDA versions before 12.3?

Otherwise, something like the following:

+at::Tensor multimem_one_shot_reduce_out(
+    const at::Tensor& input,
+    std::string reduce_op,
+    int64_t root,
+    std::string group_name,
+    at::Tensor out) {
+  TORCH_CHECK(false, "multimem_one_shot_reduce_out: requires CUDA 12.3+.");
+  return input;
+}

needs to be added after `#elif defined(CUDART_VERSION) && CUDART_VERSION < 12030` for the code to build.
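
For context, the guard structure being described would look roughly like the following. This is a hedged sketch: only the #elif condition is quoted from the comment above, while the first branch's condition and the surrounding file layout are assumptions.

#if defined(CUDART_VERSION) && CUDART_VERSION >= 12030
// Real multicast-based implementation of multimem_one_shot_reduce_out
// (and the other multimem kernels) lives in this branch.
#elif defined(CUDART_VERSION) && CUDART_VERSION < 12030
// Fallback stub: keeps the symbol defined on older CUDA toolkits so the
// op registration still compiles, but fails loudly if actually called.
at::Tensor multimem_one_shot_reduce_out(
    const at::Tensor& input,
    std::string reduce_op,
    int64_t root,
    std::string group_name,
    at::Tensor out) {
  TORCH_CHECK(false, "multimem_one_shot_reduce_out: requires CUDA 12.3+.");
  return input;
}
#endif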

facebook-github-bot (Contributor) commented:

@pytorchbot revert -m="Diff reverted internally" -c="ghfirst"

This Pull Request has been reverted by a revert inside Meta. To re-land this change, please open another pull request, assign the same reviewers, fix the CI failures that caused the revert, and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk).

pytorchmergebot (Collaborator) commented:

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Oct 7, 2025
This reverts commit d1cbb74.

Reverted #164517 on behalf of https://github.com/facebook-github-bot due to: Diff reverted internally (see the comment on #164517)
pytorchmergebot (Collaborator) commented:

@kwen2501 your PR has been successfully reverted.

pytorchmergebot added the Reverted and ci-no-td labels on Oct 7, 2025
kwen2501 (Collaborator, Author) commented Oct 7, 2025:

@shunting314 The code you cited is merely host-side code, without fancy CUDA APIs. Which part is not buildable? Can you share your error log?

shunting314 (Contributor) commented:

I don't have the log available, but it says `multimem_one_shot_reduce_out` is not defined, since the definition is in a conditional block guarded by the CUDA 12.3 check.

kwen2501 (Collaborator, Author) commented Oct 8, 2025:

I see, thank you @shunting314

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Oct 8, 2025
ghstack-source-id: e768757
Pull-Request: #164517
kwen2501 (Collaborator, Author) commented Oct 8, 2025:

Added the temporary workaround as suggested.
Longer term, we should remove all those false impls.

kwen2501 (Collaborator, Author) commented Oct 8, 2025:

@pytorchbot merge

pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
Modified `multimem_one_shot_all_reduce_out` function to accept a `root` argument, making it a `multimem_reduce` op.

The original `multimem_one_shot_all_reduce` op becomes a caller of the `multimem_reduce`, with each rank providing its own rank id as root.

Pull Request resolved: pytorch#164517
Approved by: https://github.com/ngimel
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
This reverts commit d1cbb74.

Reverted pytorch#164517 on behalf of https://github.com/facebook-github-bot due to: Diff reverted internally (see the comment on pytorch#164517)
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
Modified `multimem_one_shot_all_reduce_out` function to accept a `root` argument, making it a `multimem_reduce` op.

The original `multimem_one_shot_all_reduce` op becomes a caller of the `multimem_reduce`, with each rank providing its own rank id as root.

Pull Request resolved: pytorch#164517
Approved by: https://github.com/ngimel
@github-actions github-actions bot deleted the gh/kwen2501/272/head branch November 8, 2025 02:10

Labels

ci-no-td, ciflow/h100-symm-mem, ciflow/trunk, Merged, oncall: distributed, release notes: distributed (c10d), release notes: distributed (symm_mem), Reverted
