[PP] Fix edge case with FSDP when stages_per_rank > 3#165467

Closed
H-Huang wants to merge 1 commit into pytorch:main from H-Huang:deepseekv3

Conversation

H-Huang (Member) commented Oct 14, 2025

There is an edge case with FSDP + PP when we add UNSHARD + RESHARD actions: at most 3 stages are kept unsharded at a time, as set by the default value of max_active_stages:

def _add_unshard_reshard(
compute_actions: list[Optional[_Action]],
max_active_stages: int = 3,

This change is needed so that a stage can be unsharded and resharded multiple times.
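To illustrate why a stage may have to be unsharded more than once, here is a minimal, self-contained sketch. It is not the code in torch/distributed/pipelining/schedules.py: the function name add_unshard_reshard_sketch, the tuple-based action format, and the oldest-first eviction policy are all assumptions made for illustration. It only shows that once a rank owns more stages than max_active_stages, a stage that was resharded to make room can be needed again later and must then be unsharded a second time.

```python
from typing import List, Tuple

Action = Tuple[str, int]  # (kind, stage_index); hypothetical format, not PyTorch's _Action


def add_unshard_reshard_sketch(
    compute_stages: List[int],
    max_active_stages: int = 3,
) -> List[Action]:
    """Insert UNSHARD/RESHARD around a sequence of per-stage compute steps.

    Assumed policy: when the limit is reached, the stage that has been
    unsharded the longest is resharded first.
    """
    active: List[int] = []  # currently unsharded stages, oldest first
    out: List[Action] = []
    for stage in compute_stages:
        if stage not in active:
            if len(active) >= max_active_stages:
                # Make room by resharding the oldest active stage.
                out.append(("RESHARD", active.pop(0)))
            out.append(("UNSHARD", stage))
            active.append(stage)
        out.append(("COMPUTE", stage))
    # Reshard whatever is still unsharded at the end of the schedule.
    for stage in active:
        out.append(("RESHARD", stage))
    return out


if __name__ == "__main__":
    # Four stages on one rank, each visited twice, with the default limit of 3.
    schedule = add_unshard_reshard_sketch([0, 1, 2, 3, 0, 1, 2, 3])
    for action in schedule:
        print(action)
    # Stage 0 is unsharded twice: once at the start and once after being evicted.
    print(schedule.count(("UNSHARD", 0)))
```

With three or fewer stages per rank, every stage in this sketch is unsharded exactly once, which matches the situation where the edge case does not arise.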

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci

pytorch-bot (bot) added the oncall: distributed label Oct 14, 2025
pytorch-bot bot commented Oct 14, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/165467

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 7ab04e4 with merge base 3a110c9:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@H-Huang H-Huang requested review from fegin, kwen2501 and wwwjn October 14, 2025 20:15
H-Huang added the release notes: distributed (pipeline), module: pipelining, and ciflow/trunk labels Oct 14, 2025
@wwwjn wwwjn left a comment

LGTM!

H-Huang (Member, Author) commented Oct 15, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

zhudada0120 pushed a commit to zhudada0120/pytorch that referenced this pull request Oct 15, 2025
There is an edge case with FSDP + PP when we add UNSHARD + RESHARD actions: at most 3 stages are kept unsharded at a time, see https://github.com/pytorch/pytorch/blob/3f83e8915e86a93da2fe01fda45602dcd0e3ebfd/torch/distributed/pipelining/schedules.py#L1029-L1031

This change is needed so that a stage can be unsharded and resharded multiple times.

Pull Request resolved: pytorch#165467
Approved by: https://github.com/wwwjn
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
Labels

ciflow/trunk, Merged, module: pipelining, oncall: distributed, release notes: distributed (pipeline)

3 participants