[symm_mem] Added a wait for signal and put signal for one side API by fduwjj · Pull Request #159837 · pytorch/pytorch

fduwjj · 2025-08-05T03:50:50Z

Stack from ghstack (oldest at bottom):

-> [symm_mem] Added a wait for signal and put signal for one side API #159837

cc @H-Huang @awgu @wanchaol @fegin @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci

[ghstack-poisoned]

ghstack-source-id: aa36682 Pull Request resolved: #159837

pytorch-bot · 2025-08-05T03:51:00Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159837

✅ You can merge normally! (1 Unrelated Failure)

As of commit 89b4650 with merge base fde929c:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

Limited CI for symmetric memory tests on H100 / linux-jammy-cuda12.8-py3.10-gcc11-sm90-symm / test (h100-symm-mem, 1, 1, linux.aws.h100.4) (gh) (similar failure)
'test/distributed/test_nvshmem_triton.py::NVSHMEMTritonTest::test_triton_wait_until'

This comment was automatically generated by Dr. CI and updates every 15 minutes.

codingwithsurya · 2025-08-09T03:52:33Z

nice :)

H-Huang · 2025-08-15T16:26:37Z

torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu

+}
+
+void nvshmem_put_with_signal(at::Tensor& tensor, int64_t peer) {
+  auto hdl = c10d::symmetric_memory::rendezvous(tensor, "0");


What does "0" mean in this case? Also, are we expected to call rendezvous on every rank in the group, or just on the ranks doing the put/get?

In this case probably all ranks?

kwen2501 · 2025-08-27T04:01:28Z

torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu

+}
+
+void nvshmem_put_with_signal(at::Tensor& tensor, int64_t peer) {
+  auto hdl = c10d::symmetric_memory::rendezvous(tensor, "0");


"0" means the global group. It is a temporary setting that can go wrong if the group is not actually global.
We need the handle to remember the group it has rendezvoused on.
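
One way to picture the suggestion above is a handle that records the group name at rendezvous time, so later calls can read it back instead of hard-coding "0". The following is only a minimal sketch under that assumption: `Handle`, `Registry`, `rendezvous`, and `lookup` are hypothetical names for illustration and are not the PyTorch symmetric memory API.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical handle: remembers which group the buffer rendezvoused on.
struct Handle {
  std::string group_name;
  int world_size;
};

// Hypothetical registry keyed by buffer address (not the real c10d registry).
class Registry {
 public:
  // Record the group at rendezvous time and return the handle.
  std::shared_ptr<Handle> rendezvous(const void* ptr,
                                     const std::string& group_name,
                                     int world_size) {
    auto handle = std::make_shared<Handle>(Handle{group_name, world_size});
    handles_[ptr] = handle;
    return handle;
  }

  // Later callers recover the group from the handle; nullptr means the
  // buffer never rendezvoused, so there is no group to assume.
  std::shared_ptr<Handle> lookup(const void* ptr) const {
    auto it = handles_.find(ptr);
    return it == handles_.end() ? nullptr : it->second;
  }

 private:
  std::map<const void*, std::shared_ptr<Handle>> handles_;
};
```

With something like this, a put or wait routine could call `lookup` on the tensor's storage and fail loudly when no handle exists, rather than silently assuming the global group.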

kwen2501 · 2025-08-27T04:17:50Z

torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu

+
+  c10::cuda::CUDAGuard guard(tensor.device());
+  auto stream = at::cuda::getCurrentCUDAStream();
+  nvshmemx_putmem_signal_on_stream(buffer_ptr, tensor.data_ptr(), buffer_size, static_cast<uint64_t*>(signal_ptr), NVSHMEM_SIGNAL_SET, 1, peer, stream);


Here the dst can be tensor.data_ptr() too. (A reminder to myself to refactor the whole file after we land MemPool support.)
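
For readers unfamiliar with the pattern in the hunk above, the ordering it relies on can be modelled on the host: the payload is written first, then a signal word is set to 1, and the receiver only reads the payload after it observes that value. This is a rough sketch with `std::atomic` and `std::thread` standing in for the GPU-side operation; `PeerBuffer`, `put_with_signal`, `wait_for_signal`, and `run_demo` are hypothetical names and nothing here calls NVSHMEM.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical host-side stand-in for a peer's buffer and its signal word.
struct PeerBuffer {
  std::vector<float> data;
  std::atomic<std::uint64_t> signal{0};
  explicit PeerBuffer(std::size_t n) : data(n, 0.0f) {}
};

// Copy the payload, then set the signal. The release store keeps the
// payload writes ordered before the signal becomes visible.
void put_with_signal(PeerBuffer& dst, const std::vector<float>& src,
                     std::uint64_t value) {
  assert(src.size() <= dst.data.size());
  std::memcpy(dst.data.data(), src.data(), src.size() * sizeof(float));
  dst.signal.store(value, std::memory_order_release);
}

// Spin until the signal equals the expected value. The acquire load pairs
// with the release store, so the payload is safe to read afterwards.
void wait_for_signal(PeerBuffer& buf, std::uint64_t expected) {
  while (buf.signal.load(std::memory_order_acquire) != expected) {
    std::this_thread::yield();
  }
}

// One sender and one receiver; returns the sum the receiver computed.
float run_demo() {
  PeerBuffer peer(4);
  const std::vector<float> payload{1.0f, 2.0f, 3.0f, 4.0f};
  float sum = 0.0f;
  std::thread receiver([&] {
    wait_for_signal(peer, 1);
    for (float v : peer.data) sum += v;
  });
  std::thread sender([&] { put_with_signal(peer, payload, 1); });
  sender.join();
  receiver.join();
  return sum;
}
```

Calling `run_demo()` from a `main` function exercises both sides. The receiver never sees a partially written payload because it reads only after the signal value matches.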

fduwjj · 2025-09-27T18:34:07Z

@pytorchbot merge

pytorchmergebot · 2025-09-27T18:36:08Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

[WIP][symm_mem] Add a wait for signal and put signal for one side API

d9e6170

[ghstack-poisoned]

fduwjj added a commit that referenced this pull request Aug 5, 2025

[WIP][symm_mem] Add a wait for signal and put signal for one side API

3f3d51d

ghstack-source-id: aa36682 Pull Request resolved: #159837

pytorch-bot bot added the ciflow/h100-symm-mem, oncall: distributed, and release notes: distributed (c10d) labels Aug 5, 2025

fduwjj marked this pull request as draft August 5, 2025 03:51

Update on "[WIP][symm_mem] Add a wait for signal and put signal for o…

1cb5cf9

…ne side API" cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k pragupta [ghstack-poisoned]

fduwjj added a commit that referenced this pull request Aug 5, 2025

[WIP][symm_mem] Add a wait for signal and put signal for one side API

69ba834

ghstack-source-id: dfe421e Pull Request resolved: #159837

H-Huang reviewed Aug 15, 2025

Update on "[WIP][symm_mem] Add a wait for signal and put signal for o…

e2b53ba

…ne side API" cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k pragupta [ghstack-poisoned]

fduwjj added a commit that referenced this pull request Aug 26, 2025

[WIP][symm_mem] Add a wait for signal and put signal for one side API

656bb71

ghstack-source-id: 5afd28a Pull Request resolved: #159837

Update on "[WIP][symm_mem] Add a wait for signal and put signal for o…

67ce6d0

…ne side API" cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k pragupta [ghstack-poisoned]

fduwjj added a commit that referenced this pull request Aug 26, 2025

[WIP][symm_mem] Add a wait for signal and put signal for one side API

7e3cb11

ghstack-source-id: 3974588 Pull Request resolved: #159837

kwen2501 reviewed Aug 27, 2025

Update on "[WIP][symm_mem] Add a wait for signal and put signal for o…

05d918d

…ne side API" cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k pragupta [ghstack-poisoned]

fduwjj added a commit that referenced this pull request Sep 24, 2025

[WIP][symm_mem] Add a wait for signal and put signal for one side API

0f90dfb

ghstack-source-id: 7def6af Pull Request resolved: #159837

kwen2501 approved these changes Sep 24, 2025

torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu (outdated, resolved)

torch/csrc/distributed/c10d/symm_mem/SymmetricMemory.cpp (resolved)

kwen2501 reviewed Sep 25, 2025

torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu (outdated, resolved)

Update on "[WIP][symm_mem] Add a wait for signal and put signal for o…

89b4650

…ne side API" cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k pragupta [ghstack-poisoned]

fduwjj added a commit that referenced this pull request Sep 27, 2025

[WIP][symm_mem] Add a wait for signal and put signal for one side API

14cf388

ghstack-source-id: 611693a Pull Request resolved: #159837

pytorch-bot bot added the ciflow/trunk label Sep 27, 2025

pytorchmergebot added the merging label Sep 27, 2025

pytorchmergebot added the Merged label Sep 27, 2025

pytorchmergebot closed this in 2ce2e48 Sep 27, 2025

pytorchmergebot removed the merging label Sep 27, 2025

jainapurva pushed a commit that referenced this pull request Sep 29, 2025

[WIP][symm_mem] Add a wait for signal and put signal for one side API (…

cdba62f

…#159837) Pull Request resolved: #159837 Approved by: https://github.com/kwen2501

fduwjj changed the title ~~[WIP][symm_mem] Add a wait for signal and put signal for one side API~~ [symm_mem] Added a wait for signal and put signal for one side API Sep 29, 2025

kwen2501 reviewed Sep 29, 2025

torch/csrc/distributed/c10d/symm_mem/nvshmem_extension.cu (resolved)

maggiemoss pushed a commit to maggiemoss/pytorch that referenced this pull request Sep 29, 2025

[WIP][symm_mem] Add a wait for signal and put signal for one side API (…

e170312

…pytorch#159837) Pull Request resolved: pytorch#159837 Approved by: https://github.com/kwen2501

github-actions bot deleted the gh/fduwjj/176/head branch October 30, 2025 02:17

fduwjj wants to merge 6 commits into gh/fduwjj/176/base from gh/fduwjj/176/head

5 participants
