
[FSDP2] provide public API to share cuda streams across roots #165024

Closed
weifengpy wants to merge 5 commits into gh/weifengpy/37/base from gh/weifengpy/37/head

Conversation

@weifengpy
Contributor

@weifengpy weifengpy commented Oct 9, 2025

For pipeline parallelism, we can have multiple FSDP roots (chunks):

model = nn.Sequential(chunk0, chunk1)
fully_shard(model[0])
fully_shard(model[1])

We can call `share_comm_ctx` to share the all-gather, reduce-scatter, and all-reduce CUDA streams across these roots, which avoids inter-stream memory fragmentation:

from torch.distributed.fsdp import share_comm_ctx
share_comm_ctx([model[0], model[1]])

Unit test: `pytest -s test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_share_comm_context`
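
For a fuller picture, the following is a minimal usage sketch under the setup described above: two pipeline chunks, each wrapped as its own FSDP2 root, then sharing one set of communication streams. The `Chunk` module, its sizes, and the process-group setup are illustrative placeholders rather than part of this PR; it assumes the script runs under torchrun with one CUDA device per rank.

```
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard, share_comm_ctx


class Chunk(nn.Module):
    """Placeholder pipeline-stage module (illustrative only)."""

    def __init__(self) -> None:
        super().__init__()
        self.proj = nn.Linear(128, 128)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


# Assumes this runs under torchrun with a CUDA device available per rank.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

chunk0, chunk1 = Chunk().cuda(), Chunk().cuda()

# Each pipeline chunk becomes its own FSDP root.
fully_shard(chunk0)
fully_shard(chunk1)

# Share the all-gather / reduce-scatter / all-reduce CUDA streams across both
# roots instead of letting each root allocate its own set of streams.
share_comm_ctx([chunk0, chunk1])
```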

Stack from ghstack (oldest at bottom):

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci

@pytorch-bot

pytorch-bot bot commented Oct 9, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/165024

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 0cc16ef with merge base 3a110c9:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@weifengpy
Contributor Author

@pytorchmergebot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 9, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 job has failed: trunk / macos-py3-arm64 / test (mps, 1, 1, macos-m2-15)

Details for Dev Infra team: raised by workflow job.



@contextlib.contextmanager
def patch_foreach_all_gather(new_foreach_all_gather: Callable):
Collaborator

Please use ParamSpec to preserve the typing for type checking if possible

Contributor Author

Good suggestion! I just tried it, but it seems quite involved for mypy: it requires specifying not only the type but also the argument name, e.g. `Arg(Stream, 'reduce_scatter_stream')`. I might need to learn more about ParamSpec to find a better way. Since this is a unit-test utility, I'd like to treat it as a follow-up.

 Error (MYPY) [assignment]
    Incompatible types in assignment (expression has type
    "Callable[[list[FSDPParam], ProcessGroup, bool, Stream, Stream,
    device, AllGather], AllGatherResult | None]", variable has type
    "Callable[[Arg(list[FSDPParam], 'fsdp_params'), Arg(ProcessGroup,
    'group'), Arg(bool, 'async_op'), Arg(Stream, 'all_gather_copy_in_stream'),
    Arg(Stream, 'all_gather_stream'), Arg(device, 'device'), Arg(AllGather,
    'all_gather_comm')], AllGatherResult | None]")

        1020  |    )
        1021  |    dist.barrier()
        1022  |    torch.distributed.fsdp._fully_shard._fsdp_param_group.foreach_all_gather = (
    >>> 1023  |        new_foreach_all_gather
        1024  |    )
        1025  |    try:
        1026  |        yield

  Error (MYPY) [assignment]
    Incompatible types in assignment (expression has type
    "Callable[[list[FSDPParam], list[Tensor], ProcessGroup, Stream,
    ReduceScatter, dtype | None, dtype | None, device, float | None,
    ProcessGroup | None, Stream, bool, Tensor | None, Callable[[Tensor],
    None] | None, bool], tuple[Tensor, Event, Event, Tensor | None, Event |
    None, Tensor | None]]", variable has type "Callable[[Arg(list[FSDPParam],
    'fsdp_params'), Arg(list[Tensor], 'unsharded_grads'), Arg(ProcessGroup,
    'reduce_scatter_group'), Arg(Stream, 'reduce_scatter_stream'),
    Arg(ReduceScatter, 'reduce_scatter_comm'), Arg(dtype | None,
    'orig_dtype'), Arg(dtype | None, 'reduce_dtype'), Arg(device, 'device'),
    Arg(float | None, 'gradient_divide_factor'), Arg(ProcessGroup | None,
    'all_reduce_group'), Arg(Stream, 'all_reduce_stream'), Arg(bool,
    'all_reduce_grads'), Arg(Tensor | None, 'partial_reduce_output'),
    Arg(Callable[[Tensor], None] | None, 'all_reduce_hook'), DefaultArg(bool,
    'force_sum_reduction_for_comms')], tuple[Tensor, Event, Event, Tensor |
    None, Event | None, Tensor | None]]")

        1066  |    )
        1067  |    dist.barrier()
        1068  |    torch.distributed.fsdp._fully_shard._fsdp_param_group.foreach_reduce = (
    >>> 1069  |        new_foreach_reduce
        1070  |    )
        1071  |    try:
        1072  |        yield
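
For reference, here is a minimal sketch (not the code in this PR) of how `ParamSpec` can tie a replacement function's signature to the original's in a generic patching helper; `patch_fn`, `module`, and `attr` are hypothetical names. Note that mypy's callable compatibility still requires the replacement's parameter names to match the original's (the `Arg(...)` names in the errors above), which is the part that makes this involved here.

```
import contextlib
from typing import Callable, Iterator, ParamSpec, TypeVar

P = ParamSpec("P")
R = TypeVar("R")


@contextlib.contextmanager
def patch_fn(
    module: object,
    attr: str,
    original: Callable[P, R],
    replacement: Callable[P, R],
) -> Iterator[None]:
    # Because both callables share the same ParamSpec, mypy checks that
    # `replacement` accepts the same parameters and returns the same type
    # as `original` at the call site.
    setattr(module, attr, replacement)
    try:
        yield
    finally:
        setattr(module, attr, original)
```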

weifengpy added a commit that referenced this pull request Oct 13, 2025
ghstack-source-id: cfa17a5
Pull Request resolved: #165024
@weifengpy
Contributor Author

@pytorchmergebot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens when two merge requests are issued for the same PR, or when the merge job has been waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see the pytorch-bot wiki.

@weifengpy
Contributor Author

@pytorchbot merge -f "unrelated CI error"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Oct 14, 2025
I just want to print CommDebugMode and know whether there is any communication, so this implements `__repr__` for `print(comm_mode)`:

```
comm_mode = CommDebugMode()
with comm_mode:
    out = torch.mm(inps, weight)
print(comm_mode)
# CommDebugMode(get_total_counts()=0)
```
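
As a side note, here is a self-contained toy sketch of the pattern (not the actual CommDebugMode code from #165006) that reproduces the output shown above.

```
# Toy illustration only: a debug-mode-like object whose __repr__ reports its
# total collective count, mirroring the CommDebugMode output shown above.
class ToyCommDebugMode:
    def __init__(self) -> None:
        self._counts: dict[str, int] = {}

    def get_total_counts(self) -> int:
        return sum(self._counts.values())

    def __repr__(self) -> str:
        return f"CommDebugMode(get_total_counts()={self.get_total_counts()})"


print(ToyCommDebugMode())  # prints: CommDebugMode(get_total_counts()=0)
```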


Pull Request resolved: #165006
Approved by: https://github.com/anshul-si
ghstack dependencies: #165024
zhudada0120 pushed a commit to zhudada0120/pytorch that referenced this pull request Oct 15, 2025
zhudada0120 pushed a commit to zhudada0120/pytorch that referenced this pull request Oct 15, 2025
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
@github-actions github-actions bot deleted the gh/weifengpy/37/head branch November 14, 2025 02:16