
[SymmMem] Fix put_signal + wait_until hang #163194

Closed
kwen2501 wants to merge 1 commit into gh/kwen2501/251/base from gh/kwen2501/251/head

Conversation

kwen2501 (Collaborator) commented Sep 17, 2025

Stack from ghstack (oldest at bottom):

The test used the wrong pointers to refer to remote addresses:

```
dst_ptr = out_hdl.buffer_ptrs[peer]
src_ptr = inp_hdl.buffer_ptrs[rank]
sig_ptr = out_hdl.signal_pad_ptrs[peer]
```

All three indices should be `rank` instead of `peer`, because NVSHMEM APIs take local addresses as input and perform the translation to the remote address space internally. With the wrong signal address, the peer waits on a signal that is never set, hence the hang.
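For clarity, the fixed snippet (as described above) indexes all three with `rank`:

```
dst_ptr = out_hdl.buffer_ptrs[rank]       # local address of the dst buffer; NVSHMEM translates it for `peer`
src_ptr = inp_hdl.buffer_ptrs[rank]       # local address of the src buffer
sig_ptr = out_hdl.signal_pad_ptrs[rank]   # local signal pad address that the remote put_signal targets
```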

Also adjusted the signature of `nvshmem.putmem_signal_block` to accept tensors instead of raw pointers.
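As a rough illustration of the overall pattern, here is a minimal sketch of the put_signal / wait_until pairing the test exercises. The call shapes and the names `SIG_VAL`, `NVSHMEM_CMP_EQ`, and `nvshmem.signal_wait_until` are illustrative assumptions, not the exact PyTorch API:

```
# Sketch only: argument names and order are assumptions for illustration.
if rank == 0:
    # Producer: push `inp` into the peer's `out` buffer, then set the
    # peer's signal pad. Local tensors are passed; NVSHMEM translates
    # the symmetric addresses to the peer's address space internally.
    nvshmem.putmem_signal_block(out, inp, sig, SIG_VAL, peer)
else:
    # Consumer: spin on the *local* signal pad until the producer's
    # signal arrives. With the wrong (peer-indexed) address, this wait
    # never observes the signal, which is exactly the reported hang.
    nvshmem.signal_wait_until(sig, NVSHMEM_CMP_EQ, SIG_VAL)
```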

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim @dcci

[ghstack-poisoned]
pytorch-bot bot commented Sep 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163194

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d2f3060 with merge base 4840a1a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the ciflow/h100-symm-mem and oncall: distributed labels Sep 17, 2025
kwen2501 added a commit that referenced this pull request Sep 17, 2025
kwen2501 added the release notes: distributed (symm_mem) label Sep 17, 2025
kwen2501 requested review from fegin and ngimel September 17, 2025 21:20
kwen2501 (Collaborator, Author) commented:

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk label Sep 18, 2025
pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot (Collaborator) commented:

This PR (#163194) was merged in 80f8be9 but it is still open, likely due to a GitHub bug, so mergebot is closing it manually. If you think this is a mistake, please feel free to reopen and contact Dev Infra.

jeffkbkim pushed a commit to jeffkbkim/pytorch that referenced this pull request Sep 18, 2025
Pull Request resolved: pytorch#163194
Approved by: https://github.com/ngimel
ghstack dependencies: pytorch#163025, pytorch#163152
pytorchmergebot pushed a commit that referenced this pull request Sep 21, 2025
…3423)

### Issue
The previous `enable_triton` UI required the user-defined Triton kernel to have "nvshmem" in its name. If users did not, the kernel would miss the NVSHMEM initialization and silently hit a CUDA illegal memory access (IMA).

The `@requires_nvshmem` decorator eliminates this naming requirement (and the `enable_triton` call).

### Usage:
```
@requires_nvshmem
@triton.jit
def foo(...):
    ...

foo[(1, 1)](...)
```
It also removes the need to pass `extern_lib` to `foo` (the decorator now handles it).

Pull Request resolved: #163423
Approved by: https://github.com/ngimel
ghstack dependencies: #163025, #163152, #163194
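For contrast with the decorator usage above, here is a rough before/after sketch. The pre-decorator launch shape, including the `nvshmem_lib` variable and the `extern_libs` argument, is an assumption for illustration, not verified API:

```
# Before (assumed shape): the kernel name must contain "nvshmem", and the
# library handle returned by enable_triton() is passed at each launch.
nvshmem_lib = enable_triton()

@triton.jit
def my_nvshmem_kernel(...):
    ...

my_nvshmem_kernel[(1, 1)](..., extern_libs=nvshmem_lib)

# After: @requires_nvshmem handles init and linkage; any kernel name works.
foo[(1, 1)](...)
```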
kwen2501 (Collaborator, Author) commented:

@pytorchbot cherry-pick --onto release/2.9 --fixes #162934 -c critical

pytorchbot pushed a commit that referenced this pull request Sep 21, 2025

(cherry picked from commit 80f8be9)
pytorchbot (Collaborator) commented:

Cherry picking #163194

The cherry-pick PR is at #163458 and is linked with issue #162934.


mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
Camyll pushed a commit that referenced this pull request Sep 23, 2025
[SymmMem] Fix put_signal + wait_until hang (#163194)

(cherry picked from commit 80f8be9)

Co-authored-by: Ke Wen <kw2501@meta.com>
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
github-actions bot deleted the gh/kwen2501/251/head branch October 22, 2025 02:15

Labels

ciflow/h100-symm-mem · ciflow/trunk · Merged · oncall: distributed · release notes: distributed (symm_mem)


4 participants