
[SymmMem] Add MemPool support to CUDA backend #169740

Closed

kwen2501 wants to merge 5 commits into gh/kwen2501/290/base from gh/kwen2501/290/head

Conversation

kwen2501 (Collaborator) commented Dec 6, 2025

Stack from ghstack (oldest at bottom):

[1/N] Extended the rendezvous matching condition from an exact address match to the case where a tensor falls anywhere within an allocation's range (see the sketch after this list).

[2/N] Shifted all the heavy work (everything involving cudaMalloc) from cudaSymmetricMemory to cudaPeerAllocInfo. The former now corresponds to a tensor, while the latter corresponds to an allocation. Tensors on the same allocation share the same cudaPeerAllocInfo.

[3/N] Added tests.
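
A hedged Python sketch of what the range-based matching enables. This is not taken from this PR's tests; it assumes a torchrun launch with one rank per CUDA device:

```python
import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# One symmetric-memory allocation per rank.
buf = symm_mem.empty(1024, dtype=torch.float32, device=f"cuda:{rank}")

# A view whose data_ptr() lies inside the allocation but is not its base.
sub = buf[256:512]

# Rendezvous used to require an exact base-address match; with this stack, the
# pointer only has to fall within a known allocation range, and tensors on the
# same allocation share one peer-allocation record.
hdl_full = symm_mem.rendezvous(buf, dist.group.WORLD)
hdl_sub = symm_mem.rendezvous(sub, dist.group.WORLD)
```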

pytorch-bot bot commented Dec 6, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169740

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 3 Pending

As of commit 1082ad3 with merge base 04ae0e1:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kwen2501 added a commit that referenced this pull request Dec 6, 2025
[1/N][SymmMem] Reuse handle when ptr falls in allocation range

[2/N] Reuse peer allocation info


ghstack-source-id: a3b0e21
Pull-Request: #169740
kwen2501 added the release notes: distributed (symm_mem) and module: symm_mem labels Dec 6, 2025
eqy requested a review from galv December 6, 2025 01:45
kwen2501 added a commit that referenced this pull request Dec 6, 2025
[1/N][SymmMem] Reuse handle when ptr falls in allocation range

[2/N] Reuse peer allocation info


ghstack-source-id: 0637eec
Pull-Request: #169740
kwen2501 added a commit that referenced this pull request Dec 8, 2025
[1/N][SymmMem] Reuse handle when ptr falls in allocation range

[2/N] Reuse peer allocation info

ghstack-source-id: bdc9039
Pull-Request: #169740
kwen2501 added a commit that referenced this pull request Dec 8, 2025
[1/N][SymmMem] Reuse handle when ptr falls in allocation range

[2/N] Reuse peer allocation info

ghstack-source-id: b3c60b7
Pull-Request: #169740
kwen2501 requested review from fduwjj, fegin, and ngimel December 8, 2025 23:11
kwen2501 added a commit that referenced this pull request Dec 9, 2025
[1/N][SymmMem] Reuse handle when ptr falls in allocation range

[2/N] Reuse peer allocation info

ghstack-source-id: 2ecdb7f
Pull-Request: #169740
A collaborator opened a review thread on this snippet from the diff:

}

/* Search for a block that covers the given ptr, and write back the offset to
 * the base ptr; error out if not found */

Comment seems wrong: the function returns nullptr if not found, it does not error out.
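
A hedged Python illustration of the lookup behavior discussed in the thread (the real code is C++): find the block whose range covers a pointer and report the offset, returning None, rather than raising, when no block matches:

```python
def find_block(blocks, ptr):
    """Hypothetical stand-in; `blocks` maps base address -> allocation size in bytes."""
    for base, size in blocks.items():
        if base <= ptr < base + size:
            return base, ptr - base  # covering block and offset into it
    return None  # not found; mirrors the nullptr return noted in the review
```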

kwen2501 (Author) commented:

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk label Dec 10, 2025
pytorchmergebot commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot commented:

Starting merge as part of PR stack under #170008

pytorchmergebot commented:

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team (raised by workflow job)

Failing merge rule: Core Maintainers

kwen2501 (Author) commented:

@pytorchbot merge -f "ROCm failure comes from runner: Available diskspace is less than 30 percent. Not enough diskspace"

pytorchmergebot commented:

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Dec 10, 2025
We'd like to default some flags for SymmetricMemory pools, e.g.
- use_on_oom=False
- no_split=True

to improve UX and to strengthen safety.

We thus provide a wrapper with the above flags preset:
```
pool = torch.distributed._symmetric_memory.get_mem_pool(device)
```

Since these flags are internal to the wrapper, we also retain the flexibility to vary them in the future. (A hedged usage sketch follows this commit entry.)
Pull Request resolved: #170008
Approved by: https://github.com/ngimel
ghstack dependencies: #169739, #169740
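
A hedged usage sketch for the get_mem_pool wrapper described above. It assumes the #170008 API exactly as quoted and a torchrun launch with one rank per CUDA device; torch.cuda.use_mem_pool routes allocations made inside the context into the given pool:

```python
import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem

dist.init_process_group("nccl")
rank = dist.get_rank()
device = torch.device(f"cuda:{rank}")
torch.cuda.set_device(device)

# Pool preset with SymmetricMemory-friendly flags (use_on_oom=False, no_split=True).
pool = symm_mem.get_mem_pool(device)

# Ordinary allocations made inside this context come from the symmetric-memory
# pool, so they can be rendezvous'd afterwards even though they were not
# created via symm_mem.empty.
with torch.cuda.use_mem_pool(pool):
    t = torch.zeros(4096, device=device)

hdl = symm_mem.rendezvous(t, dist.group.WORLD)
```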
skpark-rh pushed a commit to skpark-rh/pytorch that referenced this pull request Dec 10, 2025
[1/N] Extended rendezvous matching condition from exact address match to case where tensor falls in allocation range.

[2/N] Shifted all the heavy work (everything involving cudaMalloc) from `cudaSymmetricMemory` to `cudaPeerAllocInfo`. The former now corresponds to a tensor, while the latter corresponds to an allocation. Tensors on the same allocation share the same `cudaPeerAllocInfo`.

[3/N] Added tests.
Pull Request resolved: pytorch#169740
Approved by: https://github.com/ngimel
ghstack dependencies: pytorch#169739
skpark-rh pushed a commit to skpark-rh/pytorch that referenced this pull request Dec 10, 2025 (same commit message as pytorch#170008 above)
daisyden pushed a commit to daisyden/pytorch that referenced this pull request Dec 11, 2025 (same commit message as pytorch#170008 above)
github-actions bot deleted the gh/kwen2501/290/head branch January 10, 2026 02:21

Labels

ciflow/h100-symm-mem, ciflow/trunk, Merged, module: symm_mem, open source, release notes: distributed (c10d), release notes: distributed (symm_mem)


5 participants