Test Copy Engine All-Gather #170265
cc @weifengpy for potential use in FSDP for reducing compute-comm contention.

@pytorchbot merge
NCCL 2.28 added Copy Engine (CE) support.

Conditions:
- Tensors must be symmetrically registered (e.g. coming from `symm_mem.empty`)
- `NCCL_CTA_POLICY_ZERO` must be passed to `ncclConfig`, or the env var `NCCL_CTA_POLICY=2` must be set

Confirmed use of CE via profile:

<img width="988" height="132" alt="Screenshot 2025-12-11 at 4 47 50 PM" src="https://github.com/user-attachments/assets/2077d88b-34d9-4155-b323-646cab904e68" />

(The first kernel is from a regular all-gather; the second kernel is from an all-gather on tensors that have been window registered.)

Caveat: As of 2.28.9, CE collectives cannot be run on the default stream, so we are testing it with `async_op=True` or with a side stream.

Pull Request resolved: pytorch#170265
Approved by: https://github.com/fduwjj
Wonder whether Copy Engine All-Gather works with `torch.compile`?

@Microve There are two scenarios:
(1) If the eager-mode program has been rewritten to enable CE, i.e. the user has been using symmetric memory:
(2) If the eager-mode program is written without symmetric memory:
Stack from ghstack (oldest at bottom):
NCCL 2.28 added Copy Engine (CE) support.
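Conditions:
- Tensors must be symmetrically registered (e.g. coming from `symm_mem.empty`)
- `NCCL_CTA_POLICY_ZERO` must be passed to `ncclConfig`, or the env var `NCCL_CTA_POLICY=2` must be set

Confirmed use of CE via profile:

<img width="988" height="132" alt="Screenshot 2025-12-11 at 4 47 50 PM" src="https://github.com/user-attachments/assets/2077d88b-34d9-4155-b323-646cab904e68" />

(The first kernel is from a regular all-gather; the second kernel is from an all-gather on tensors that have been window registered.)

Caveat:
As of 2.28.9, CE collectives cannot be run on the default stream, so we are testing it with `async_op=True` or with a side stream.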
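
For readers who want to try the conditions and caveat described in this PR, here is a minimal sketch; it is not the test added by the PR. It sets the env var `NCCL_CTA_POLICY=2`, allocates both buffers with `symm_mem.empty`, and issues the all-gather with `async_op=True` so it stays off the default stream. The helper names `enable_ce_policy` and `ce_all_gather` are hypothetical. The calls to `symm_mem.set_backend("NCCL")` and `symm_mem.rendezvous` are assumptions about what is needed to get the tensors window registered, and the whole function assumes a `torchrun` launch with NCCL >= 2.28 and one GPU per rank.

```python
import os


def enable_ce_policy(env=None):
    """Set NCCL_CTA_POLICY=2, the env-var form of the condition named in this PR.

    Hypothetical helper. Call it before the NCCL communicator is created.
    Returns the mapping that was modified (os.environ by default).
    """
    env = os.environ if env is None else env
    env["NCCL_CTA_POLICY"] = "2"
    return env


def ce_all_gather(numel=1024):
    """Hypothetical helper: all-gather on symmetric-memory tensors.

    Assumes launch via torchrun (RANK, WORLD_SIZE, LOCAL_RANK are set).
    """
    # Imported lazily so enable_ce_policy() is usable without torch installed.
    import torch
    import torch.distributed as dist
    import torch.distributed._symmetric_memory as symm_mem

    enable_ce_policy()

    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    device = torch.device("cuda", int(os.environ["LOCAL_RANK"]))
    torch.cuda.set_device(device)
    dist.init_process_group("nccl")

    # Assumption: the NCCL symmetric-memory backend must be selected so that
    # the buffers below end up registered with NCCL.
    symm_mem.set_backend("NCCL")

    # Condition 1 from the PR: tensors come from symm_mem.empty.
    inp = symm_mem.empty(numel, dtype=torch.float32, device=device).fill_(rank)
    out = symm_mem.empty(numel * world_size, dtype=torch.float32, device=device)

    # Assumption: rendezvous is the step that performs the registration.
    group_name = dist.group.WORLD.group_name
    symm_mem.rendezvous(inp, group=group_name)
    symm_mem.rendezvous(out, group=group_name)

    # Caveat from the PR: keep the collective off the default stream.
    work = dist.all_gather_into_tensor(out, inp, async_op=True)
    work.wait()

    result = out.cpu()
    dist.destroy_process_group()
    return result
```

To try it, call `ce_all_gather()` from a script launched with `torchrun`, then check a profile as in the screenshot above to see whether the copy engine was actually used. The sketch itself does not verify that.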