[SymmMem] Fix memory allocation hold-up#162680
[SymmMem] Fix memory allocation hold-up#162680kwen2501 wants to merge 7 commits intogh/kwen2501/239/basefrom
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162680
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit f1c4e75 with merge base 4840a1a ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
And perhaps, this is worth a test? |
|
I see #162681 is (indirectly) testing it |
| public: | ||
| NVSHMEMPeerAllocInfo( | ||
| std::shared_ptr<NVSHMEMAllocation> allocation, | ||
| NVSHMEMAllocation* allocation, |
There was a problem hiding this comment.
Why use bare pointers here? We should use unique_ptr here and just use std::move with the member intiializers
There was a problem hiding this comment.
I will let you battle with @ezyang lol, see discussion above.
tldr:
- there is no ownership move here.
- Passing a bare pointer just for reading some values in the object.
There was a problem hiding this comment.
no ownership move means don't std::move
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 1 jobs have failed, first few of them are: Limited CI for symmetric memory tests on H100 / linux-jammy-cuda12.8-py3.10-gcc11-sm90-symm / test (h100-symm-mem, 1, 1, linux.aws.h100.4) Details for Dev Infra teamRaised by workflow job |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
|
@pytorchbot cherry-pick --onto release/2.9 -c critical |
Problem: Without MemPool it looks like nvshmem backend never deallocates memory. Cause: Handles in `symm_mems_` (a map) keeps reference to memory allocations. Solution: - Remove reference to allocation from handles -- the reference is never used anyway. - Use `unique_ptr` instead of `shared_ptr` to wrap allocation to ensure single ownership. Pull Request resolved: #162680 Approved by: https://github.com/ezyang ghstack dependencies: #163298 (cherry picked from commit 7130b17)
Cherry picking #162680The cherry pick PR is at #163375 and it is recommended to link a critical cherry pick PR with an issue. The following tracker issues are updated: Details for Dev Infra teamRaised by workflow job |
Problem: Without MemPool it looks like nvshmem backend never deallocates memory. Cause: Handles in `symm_mems_` (a map) keeps reference to memory allocations. Solution: - Remove reference to allocation from handles -- the reference is never used anyway. - Use `unique_ptr` instead of `shared_ptr` to wrap allocation to ensure single ownership. Pull Request resolved: pytorch#162680 Approved by: https://github.com/ezyang ghstack dependencies: pytorch#163298
Problem: Without MemPool it looks like nvshmem backend never deallocates memory. Cause: Handles in `symm_mems_` (a map) keeps reference to memory allocations. Solution: - Remove reference to allocation from handles -- the reference is never used anyway. - Use `unique_ptr` instead of `shared_ptr` to wrap allocation to ensure single ownership. Pull Request resolved: pytorch#162680 Approved by: https://github.com/ezyang ghstack dependencies: pytorch#163298
Problem: Without MemPool it looks like nvshmem backend never deallocates memory. Cause: Handles in `symm_mems_` (a map) keeps reference to memory allocations. Solution: - Remove reference to allocation from handles -- the reference is never used anyway. - Use `unique_ptr` instead of `shared_ptr` to wrap allocation to ensure single ownership. Pull Request resolved: pytorch#162680 Approved by: https://github.com/ezyang ghstack dependencies: pytorch#163298
[SymmMem] Fix memory allocation hold-up (#162680) Problem: Without MemPool it looks like nvshmem backend never deallocates memory. Cause: Handles in `symm_mems_` (a map) keeps reference to memory allocations. Solution: - Remove reference to allocation from handles -- the reference is never used anyway. - Use `unique_ptr` instead of `shared_ptr` to wrap allocation to ensure single ownership. Pull Request resolved: #162680 Approved by: https://github.com/ezyang ghstack dependencies: #163298 (cherry picked from commit 7130b17) Co-authored-by: Ke Wen <kw2501@meta.com>
Unique ptr ghstack-source-id: 8dac2af Pull-Request-resolved: pytorch/pytorch#162680 Barrier before and after access
Stack from ghstack (oldest at bottom):
num_active_allocationsand dealloc checks #162681Problem:
Without MemPool it looks like nvshmem backend never deallocates memory.
Cause:
Handles in
symm_mems_(a map) keeps reference to memory allocations.Solution:
unique_ptrinstead ofshared_ptrto wrap allocation to ensure single ownership.cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim @dcci