[SymmMem] Add MemPool support to CUDA backend #169740
kwen2501 wants to merge 5 commits into gh/kwen2501/290/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169740
Note: Links to docs will display an error until the docs builds have been completed. ❌ 3 New Failures, 3 Pending. As of commit 1082ad3 with merge base 04ae0e1. NEW FAILURES - The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
/* Search for a block that covers the given ptr, and write back the offset to
 * the base ptr; error out if not found */
Comment seems wrong: it returns nullptr if not found, not an error.
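For illustration, here is a minimal Python model of the lookup behavior the reviewer describes: scan the tracked allocation blocks for one whose address range covers the given pointer, report the offset from that block's base, and return None (the analogue of nullptr) when nothing matches. The `Block` class and `find_covering_block` helper are hypothetical names for this sketch only, not the actual C++ implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Block:
    # Hypothetical stand-in for one tracked allocation.
    base: int   # base device address
    size: int   # allocation size in bytes

def find_covering_block(blocks: list[Block], ptr: int) -> Optional[tuple[Block, int]]:
    """Return (block, offset_from_base) for the block whose range covers ptr,
    or None if no tracked block covers it (the nullptr case the reviewer notes)."""
    for block in blocks:
        if block.base <= ptr < block.base + block.size:
            return block, ptr - block.base
    return None

# Example: a pointer 256 bytes into a 4 KiB allocation resolves to offset 256.
blocks = [Block(base=0x7F0000000000, size=4096)]
assert find_covering_block(blocks, 0x7F0000000000 + 256) == (blocks[0], 256)
assert find_covering_block(blocks, 0x7F0000002000) is None
```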
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Starting merge as part of PR stack under #170008
Merge failed. Reason: 1 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
@pytorchbot merge -f "ROCm failure comes from runner: Available disk space is less than 30 percent. Not enough disk space"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
We'd like to default some flags for SymmetricMemory pools, e.g.
- use_on_oom=False
- no_split=True

to improve UX and to fortify safety. We thus provide a wrapper with the above flags preset:
```
pool = torch.distributed._symmetric_memory.get_mem_pool(device)
```
Since these flags are internal to the wrapper, we also retain the flexibility to vary them in the future.

Pull Request resolved: #170008
Approved by: https://github.com/ngimel
ghstack dependencies: #169739, #169740
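As a hedged usage sketch: assuming the `get_mem_pool` wrapper lands as described, the returned pool can be combined with the existing `torch.cuda.use_mem_pool` context manager so that ordinary allocations become rendezvous-able symmetric memory. The rendezvous call and its `group` argument shown below are assumptions based on the existing SymmetricMemory Python API, not something this PR spells out.

```python
# Minimal sketch, assuming the get_mem_pool wrapper described above and the
# torch.distributed._symmetric_memory rendezvous API.
import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem

dist.init_process_group("nccl")
rank = dist.get_rank()
device = torch.device("cuda", rank % torch.cuda.device_count())
torch.cuda.set_device(device)

# Pool preset with use_on_oom=False and no_split=True by the wrapper.
pool = symm_mem.get_mem_pool(device)

# Allocations made inside this context are served from the symmetric pool,
# so plain torch.empty tensors can later be rendezvous'd.
with torch.cuda.use_mem_pool(pool):
    t = torch.empty(1024, device=device)

# Assumed call: rendezvous the tensor across the default group.
hdl = symm_mem.rendezvous(t, group=dist.group.WORLD)
```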
[1/N] Extended rendezvous matching condition from exact address match to the case where the tensor falls within an allocation's range.
[2/N] Shifted all heavy lifting (involving cudaMalloc) from `cudaSymmetricMemory` to `cudaPeerAllocInfo`. The former now corresponds to a tensor, while the latter corresponds to an allocation. Tensors on the same allocation share the same `cudaPeerAllocInfo`.
[3/N] Added tests.

Pull Request resolved: pytorch#169740
Approved by: https://github.com/ngimel
ghstack dependencies: pytorch#169739
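To make the [2/N] refactor concrete, here is a small Python model of the ownership split it describes: one per-allocation record holds the heavy, allocation-wide state, while lightweight per-tensor handles merely reference it, so any number of tensors carved from one allocation share that record. The names `PeerAllocInfo` and `SymmMemHandle` and their fields are illustrative only; they model the design, not the actual `cudaPeerAllocInfo` / `cudaSymmetricMemory` C++ classes.

```python
from dataclasses import dataclass, field

@dataclass
class PeerAllocInfo:
    """Illustrative stand-in for cudaPeerAllocInfo: one record per allocation,
    owning the expensive state (e.g. peer-mapped base pointers)."""
    base: int                                             # device address of the allocation
    size: int                                             # allocation size in bytes
    peer_bases: list[int] = field(default_factory=list)   # mapped base pointer per rank

@dataclass
class SymmMemHandle:
    """Illustrative stand-in for cudaSymmetricMemory: one per rendezvous'd tensor,
    holding only a reference to the shared allocation record plus an offset."""
    alloc: PeerAllocInfo
    offset: int                     # tensor's byte offset into the allocation

    def peer_ptr(self, rank: int) -> int:
        # A peer's view of this tensor is that peer's base plus the same offset.
        return self.alloc.peer_bases[rank] + self.offset

# Two tensors living in the same allocation share one PeerAllocInfo.
alloc = PeerAllocInfo(base=0x1000, size=8192, peer_bases=[0x1000, 0x9000])
a = SymmMemHandle(alloc, offset=0)
b = SymmMemHandle(alloc, offset=4096)
assert a.alloc is b.alloc
assert b.peer_ptr(1) == 0x9000 + 4096
```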
Stack from ghstack (oldest at bottom):
[1/N] Extended rendezvous matching condition from exact address match to the case where the tensor falls within an allocation's range.
[2/N] Shifted all heavy lifting (involving cudaMalloc) from `cudaSymmetricMemory` to `cudaPeerAllocInfo`. The former now corresponds to a tensor, while the latter corresponds to an allocation. Tensors on the same allocation share the same `cudaPeerAllocInfo`.
[3/N] Added tests.
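A brief, hedged illustration of what [1/N] enables: with allocations served from a shared MemPool, two separately allocated tensors may land in the same underlying cudaMalloc'd segment, so rendezvous now succeeds for a tensor whose data pointer is not at the segment base, as long as it falls inside a tracked allocation's range. The `get_mem_pool` wrapper and the rendezvous call follow the usage sketched earlier; whether any given pair of tensors actually shares a segment depends on the caching allocator, so the sharing noted in the comments is an assumption, not a guarantee.

```python
# Sketch only: assumes the get_mem_pool wrapper and the symm_mem.rendezvous usage
# from the earlier example, with the process group already initialized.
import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem

device = torch.device("cuda", torch.cuda.current_device())
pool = symm_mem.get_mem_pool(device)

with torch.cuda.use_mem_pool(pool):
    x = torch.empty(512, device=device)   # may share a segment with y
    y = torch.empty(512, device=device)   # data_ptr likely offset from the segment base

# Before this PR, rendezvous required the tensor's data_ptr() to exactly match an
# allocation's base address; now it only has to fall within a tracked allocation's
# range, so both tensors can be rendezvous'd even if they live in one segment.
hx = symm_mem.rendezvous(x, group=dist.group.WORLD)
hy = symm_mem.rendezvous(y, group=dist.group.WORLD)
```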