Skip to content

[symm_mem] Added a wait for signal and put signal for one side API#159837

Closed
fduwjj wants to merge 6 commits intogh/fduwjj/176/basefrom
gh/fduwjj/176/head
Closed

[symm_mem] Added a wait for signal and put signal for one side API#159837
fduwjj wants to merge 6 commits intogh/fduwjj/176/basefrom
gh/fduwjj/176/head

Conversation

@fduwjj
Copy link
Contributor

@fduwjj fduwjj commented Aug 5, 2025

fduwjj added a commit that referenced this pull request Aug 5, 2025
@pytorch-bot pytorch-bot bot added ciflow/h100-symm-mem oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Aug 5, 2025
@pytorch-bot
Copy link

pytorch-bot bot commented Aug 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159837

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 89b4650 with merge base fde929c (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@fduwjj fduwjj marked this pull request as draft August 5, 2025 03:51
…ne side API"

cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k pragupta

[ghstack-poisoned]
fduwjj added a commit that referenced this pull request Aug 5, 2025
@codingwithsurya
Copy link
Contributor

nice :)

}

void nvshmem_put_with_signal(at::Tensor& tensor, int64_t peer) {
auto hdl = c10d::symmetric_memory::rendezvous(tensor, "0");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's "0" in this case. Also are we expected to call rendezvous amongst every rank in the group? Or just the ranks that get put/get-ing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case probably all ranks?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"0" means global group. It is a temporary setting that can go wrong if the group is not actually global.
We need handle to remember the group which it has rendezvoused on.

…ne side API"

cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k pragupta

[ghstack-poisoned]
fduwjj added a commit that referenced this pull request Aug 26, 2025
…ne side API"

cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k pragupta

[ghstack-poisoned]
fduwjj added a commit that referenced this pull request Aug 26, 2025
}

void nvshmem_put_with_signal(at::Tensor& tensor, int64_t peer) {
auto hdl = c10d::symmetric_memory::rendezvous(tensor, "0");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"0" means global group. It is a temporary setting that can go wrong if the group is not actually global.
We need handle to remember the group which it has rendezvoused on.


c10::cuda::CUDAGuard guard(tensor.device());
auto stream = at::cuda::getCurrentCUDAStream();
nvshmemx_putmem_signal_on_stream(buffer_ptr, tensor.data_ptr(), buffer_size, static_cast<uint64_t*>(signal_ptr), NVSHMEM_SIGNAL_SET, 1, peer, stream);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the dst can be tensor.data_ptr() too. (A reminder for myself to refactor the whole file after we land MemPool support.

…ne side API"

cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k pragupta

[ghstack-poisoned]
fduwjj added a commit that referenced this pull request Sep 24, 2025
…ne side API"

cc H-Huang awgu wanchaol fegin wz337 wconstab d4l3k pragupta

[ghstack-poisoned]
fduwjj added a commit that referenced this pull request Sep 27, 2025
@fduwjj
Copy link
Contributor Author

fduwjj commented Sep 27, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 27, 2025
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@fduwjj fduwjj changed the title [WIP][symm_mem] Add a wait for signal and put signal for one side API [symm_mem] Added a wait for signal and put signal for one side API Sep 29, 2025
maggiemoss pushed a commit to maggiemoss/pytorch that referenced this pull request Sep 29, 2025
@github-actions github-actions bot deleted the gh/fduwjj/176/head branch October 30, 2025 02:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/h100-symm-mem ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants