[c10d][Sym mem] Add set_signal_pad_size API for SymmetricMemory#169156
[c10d][Sym mem] Add set_signal_pad_size API for SymmetricMemory#169156yang-yu-hang wants to merge 1 commit intopytorch:mainfrom
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169156
Note: Links to docs will display an error until the docs builds have been completed. ❌ 1 New Failure, 2 Unrelated FailuresAs of commit baf585a with merge base d900f5e ( NEW FAILURE - The following job has failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
BROKEN TRUNK - The following job failed but was present on the merge base:👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
51db3dc to
5788dec
Compare
5788dec to
2596b3a
Compare
ngimel
left a comment
There was a problem hiding this comment.
Looks good, a few minor comments. Thanks!
ce459a8 to
577f9a6
Compare
577f9a6 to
1d835c1
Compare
|
@pytorchmergebot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 2 jobs have failed, first few of them are: trunk / win-vs2022-cpu-py3 / build, trunk / win-vs2022-cuda12.8-py3 / build Details for Dev Infra teamRaised by workflow job |
|
@pytorchmergebot merge -r |
|
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here |
|
Successfully rebased |
1d835c1 to
f368c05
Compare
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 2 jobs have failed, first few of them are: trunk / win-vs2022-cpu-py3 / build, trunk / win-vs2022-cuda12.8-py3 / build Details for Dev Infra teamRaised by workflow job |
|
@pytorchmergebot merge -i |
Merge startedYour change will be merged while ignoring the following 1 checks: trunk / linux-jammy-rocm-py3.10 / test (default, 4, 6, linux.rocm.gpu.gfx942.1, unstable) Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
|
The merge job was canceled or timed out. This most often happen if two merge requests were issued for the same PR, or if merge job was waiting for more than 6 hours for tests to finish. In later case, please do not hesitate to reissue the merge command |
|
@pytorchmergebot merge -i |
Merge startedYour change will be merged while ignoring the following 1 checks: trunk / linux-jammy-rocm-py3.10 / test (default, 4, 6, linux.rocm.gpu.gfx942.1, unstable) Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 1 jobs have failed, first few of them are: Limited CI for symmetric memory tests on H100 / linux-jammy-cuda12.8-py3.10-gcc11-sm90-symm / test (h100-symm-mem, 1, 1, linux.aws.h100.4) Details for Dev Infra teamRaised by workflow job |
Summary: The signal pad size for symmetric memory was previously hardcoded as a constexpr, which may be too small for workloads that launch a large number of blocks. This change exposes `set_signal_pad_size` and `get_signal_pad_size` APIs to allow users to configure the signal pad size before making symmetric memory allocations. ### Changes: #### 1. Core API (C++) - Renamed `signal_pad_size` constexpr to `default_signal_pad_size` in CUDASymmetricMemoryTypes.hpp - Added `get_signal_pad_size()` and `set_signal_pad_size(size_t)` function declarations in CUDASymmetricMemoryTypes.hpp - Implemented the functions in SymmetricMemory.cpp using `std::optional<size_t>` to distinguish between default and user-configured values - Added TORCH_API exports in SymmetricMemory.hpp for public API access #### 2. Backend Updates - Updated CUDASymmetricMemory.cu to call `get_signal_pad_size()` instead of using the hardcoded constant - Updated NCCLSymmetricMemory.cu to use configurable signal pad size with local variable `signal_pad_size` - Updated NVSHMEMSymmetricMemory.cu to use configurable signal pad size with local variable `signal_pad_size` #### 3. Python Bindings - Added Python bindings in init.cpp with comprehensive docstrings explaining usage - Added Python wrapper functions in torch/distributed/_symmetric_memory/__init__.py - Updated `__all__` to export the new API functions #### 4. Tests - Added `test_get_signal_pad_size()` to verify the API returns a positive integer and Python/C++ consistency - Added `test_set_signal_pad_size()` to verify setting, getting, and restoring signal pad size values Test Plan: Build and existing symmetric memory tests. The API follows the same pattern as existing `set_backend`/`get_backend` APIs.
a8d202a to
baf585a
Compare
|
@yang-yu-hang has imported this pull request. If you are a Meta employee, you can view this in D88278682. |
|
@pytorchmergebot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 1 mandatory check(s) failed. The first few are: Dig deeper by viewing the failures on hud |
|
@pytorchmergebot merge -r |
|
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here |
|
The merge job was canceled or timed out. This most often happen if two merge requests were issued for the same PR, or if merge job was waiting for more than 6 hours for tests to finish. In later case, please do not hesitate to reissue the merge command |
|
@pytorchbot merge -f "test failures unrelated" |
Merge startedYour change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
…rch#169156) Summary: The signal pad size for symmetric memory was previously hardcoded as a constexpr, which may be too small for workloads that launch a large number of blocks. This change exposes `set_signal_pad_size` and `get_signal_pad_size` APIs to allow users to configure the signal pad size before making symmetric memory allocations. ### Changes: #### 1. Core API (C++) - Renamed `signal_pad_size` constexpr to `default_signal_pad_size` in CUDASymmetricMemoryTypes.hpp - Added `get_signal_pad_size()` and `set_signal_pad_size(size_t)` function declarations in CUDASymmetricMemoryTypes.hpp - Implemented the functions in SymmetricMemory.cpp using `std::optional<size_t>` to distinguish between default and user-configured values - Added TORCH_API exports in SymmetricMemory.hpp for public API access #### 2. Backend Updates - Updated CUDASymmetricMemory.cu to call `get_signal_pad_size()` instead of using the hardcoded constant - Updated NCCLSymmetricMemory.cu to use configurable signal pad size with local variable `signal_pad_size` - Updated NVSHMEMSymmetricMemory.cu to use configurable signal pad size with local variable `signal_pad_size` #### 3. Python Bindings - Added Python bindings in init.cpp with comprehensive docstrings explaining usage - Added Python wrapper functions in torch/distributed/_symmetric_memory/__init__.py - Updated `__all__` to export the new API functions #### 4. Tests - Added `test_get_signal_pad_size()` to verify the API returns a positive integer and Python/C++ consistency - Added `test_set_signal_pad_size()` to verify setting, getting, and restoring signal pad size values Test Plan: `PYTHONPATH=. python3 test/distributed/test_symmetric_memory.py` Pull Request resolved: pytorch#169156 Approved by: https://github.com/ngimel
) Summary: The signal pad size for symmetric memory was previously hardcoded as a constexpr, which may be too small for workloads that launch a large number of blocks. This change exposes `set_signal_pad_size` and `get_signal_pad_size` APIs to allow users to configure the signal pad size before making symmetric memory allocations. ### Changes: #### 1. Core API (C++) - Renamed `signal_pad_size` constexpr to `default_signal_pad_size` in CUDASymmetricMemoryTypes.hpp - Added `get_signal_pad_size()` and `set_signal_pad_size(size_t)` function declarations in CUDASymmetricMemoryTypes.hpp - Implemented the functions in SymmetricMemory.cpp using `std::optional<size_t>` to distinguish between default and user-configured values - Added TORCH_API exports in SymmetricMemory.hpp for public API access #### 2. Backend Updates - Updated CUDASymmetricMemory.cu to call `get_signal_pad_size()` instead of using the hardcoded constant - Updated NCCLSymmetricMemory.cu to use configurable signal pad size with local variable `signal_pad_size` - Updated NVSHMEMSymmetricMemory.cu to use configurable signal pad size with local variable `signal_pad_size` #### 3. Python Bindings - Added Python bindings in init.cpp with comprehensive docstrings explaining usage - Added Python wrapper functions in torch/distributed/_symmetric_memory/__init__.py - Updated `__all__` to export the new API functions #### 4. Tests - Added `test_get_signal_pad_size()` to verify the API returns a positive integer and Python/C++ consistency - Added `test_set_signal_pad_size()` to verify setting, getting, and restoring signal pad size values Test Plan: `PYTHONPATH=. python3 test/distributed/test_symmetric_memory.py` Pull Request resolved: #169156 Approved by: https://github.com/ngimel
Summary:
The signal pad size for symmetric memory was previously hardcoded as a constexpr, which may be too small for workloads that launch a large number of blocks. This change exposes
set_signal_pad_sizeandget_signal_pad_sizeAPIs to allow users to configure the signal pad size before making symmetric memory allocations.Changes:
1. Core API (C++)
signal_pad_sizeconstexpr todefault_signal_pad_sizein CUDASymmetricMemoryTypes.hppget_signal_pad_size()andset_signal_pad_size(size_t)function declarations in CUDASymmetricMemoryTypes.hppstd::optional<size_t>to distinguish between default and user-configured values2. Backend Updates
get_signal_pad_size()instead of using the hardcoded constantsignal_pad_sizesignal_pad_size3. Python Bindings
__all__to export the new API functions4. Tests
test_get_signal_pad_size()to verify the API returns a positive integer and Python/C++ consistencytest_set_signal_pad_size()to verify setting, getting, and restoring signal pad size valuesTest Plan:
PYTHONPATH=. python3 test/distributed/test_symmetric_memory.py