[SymmMem] Fix put_signal + wait_until hang #163194

Closed

kwen2501 wants to merge 1 commit into gh/kwen2501/251/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163194. Note: links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit d2f3060 with merge base 4840a1a. This comment was automatically generated by Dr. CI and updates every 15 minutes.
ngimel approved these changes on Sep 18, 2025
kwen2501 (Author) commented:
@pytorchbot merge
Merge started: your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
jeffkbkim pushed a commit to jeffkbkim/pytorch that referenced this pull request on Sep 18, 2025:
The test used wrong pointers to refer to remote addresses:
```
dst_ptr = out_hdl.buffer_ptrs[peer]
src_ptr = inp_hdl.buffer_ptrs[rank]
sig_ptr = out_hdl.signal_pad_ptrs[peer]
```
All three indices should be `rank` instead of `peer`, because NVSHMEM APIs accept local addresses as input and perform the rank-to-peer translation internally. Without the correct signal address, the peer would wait on a signal that never arrives, hence the hang.
Also adjusted the signature of `nvshmem.putmem_signal_block` to accept a tensor instead of a pointer.
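For reference, a minimal sketch of the corrected indexing (variable names follow the snippet above; the surrounding test scaffolding is assumed):
```
# All three indices use the local rank: NVSHMEM takes local symmetric
# addresses and translates them to the remote PE's address space
# internally. Indexing by `peer` handed it pointers it could not
# translate, so the signal never landed and the peer's wait spun forever.
dst_ptr = out_hdl.buffer_ptrs[rank]
src_ptr = inp_hdl.buffer_ptrs[rank]
sig_ptr = out_hdl.signal_pad_ptrs[rank]
```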
Pull Request resolved: pytorch#163194
Approved by: https://github.com/ngimel
ghstack dependencies: pytorch#163025, pytorch#163152
pytorchmergebot pushed a commit that referenced this pull request on Sep 21, 2025:
…3423)

### Issue
The previous `enable_triton` UI required the user-defined Triton kernel to have "nvshmem" in its name. If users did not name it that way, the kernel would miss the NVSHMEM init and silently hit a CUDA IMA (illegal memory access).

The `@requires_nvshmem` decorator eliminates the name requirement (and the `enable_triton` call).

### Usage
```
@requires_nvshmem
@triton.jit
def foo(...):
    ...

foo[(1, 1)](...)
```

It also removes the need to pass `extern_lib` to `foo` (now handled by the decorator).

Pull Request resolved: #163423
Approved by: https://github.com/ngimel
ghstack dependencies: #163025, #163152, #163194
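For contrast, a hedged sketch of the old flow that the decorator replaces (the `enable_triton` call and the "nvshmem" name rule come from the description above; the import path is omitted and the exact `extern_libs` launch plumbing is an assumption):
```
enable_triton()  # explicit opt-in, previously required

@triton.jit
def my_nvshmem_put(...):  # the name had to contain "nvshmem"
    ...

# The NVSHMEM device library had to be threaded through each launch by
# hand (assumed shape, not the actual test code).
my_nvshmem_put[(1, 1)](..., extern_libs=extern_libs)
```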
kwen2501 (Author) commented:
@pytorchbot cherry-pick --onto release/2.9 --fixes #162934 -c critical
pytorchbot pushed a commit that referenced this pull request on Sep 21, 2025:
[SymmMem] Fix put_signal + wait_until hang (#163194), commit message as above (cherry picked from commit 80f8be9)
Cherry picking #163194: the cherry pick PR is at #163458 and it is linked with issue #162934. The tracker issues have been updated. Details for Dev Infra team: raised by workflow job.
This was referenced Sep 21, 2025
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request on Sep 22, 2025:
[SymmMem] Fix put_signal + wait_until hang (pytorch#163194), commit message as above
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request on Sep 22, 2025:
pytorch#163423 (the `requires_nvshmem` decorator change), commit message as above
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request on Sep 22, 2025:
[SymmMem] Fix put_signal + wait_until hang (pytorch#163194), commit message as above
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request on Sep 22, 2025:
pytorch#163423 (the `requires_nvshmem` decorator change), commit message as above
Camyll pushed a commit that referenced this pull request on Sep 23, 2025:
[SymmMem] Fix put_signal + wait_until hang (#163194), commit message as above (cherry picked from commit 80f8be9)
Co-authored-by: Ke Wen <kw2501@meta.com>
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request on Sep 26, 2025:
[SymmMem] Fix put_signal + wait_until hang (pytorch#163194), commit message as above
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request on Sep 26, 2025:
pytorch#163423 (the `requires_nvshmem` decorator change), commit message as above
Stack from ghstack (oldest at bottom):

The test used a wrong pointer to refer to the remote address. All three indices should be `rank` instead of `peer` because NVSHMEM APIs accept local addresses as input and perform the translation internally. Without the correct signal address, the peer would be left waiting, thus the hang. Also adjusted the signature of `nvshmem.putmem_signal_block` to accept a tensor instead of a pointer.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim @dcci