Skip to content

Optimize global save-plan validation#166820

Closed
Aminsed wants to merge 1 commit intopytorch:mainfrom
Aminsed:optimize-global-plan-validation
Closed

Optimize global save-plan validation#166820
Aminsed wants to merge 1 commit intopytorch:mainfrom
Aminsed:optimize-global-plan-validation

Conversation

@Aminsed
Copy link
Contributor

@Aminsed Aminsed commented Nov 2, 2025

Summary

Testing

  • python test/distributed/checkpoint/test_planner.py -k ValidateGlobalPlan

Benchmarks

chunks old runtime new runtime
1 024 0.121 s 0.0014 s
2 048 0.486 s 0.0027 s
4 096 2.474 s 0.0058 s
8 192 8.014 s 0.0126 s
16 384 32.740 s 0.026 s

@ezyang

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @LucasLLC @pradeepfn

@pytorch-bot
Copy link

pytorch-bot bot commented Nov 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166820

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e656775 with merge base 4957ae5 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (checkpoint) labels Nov 2, 2025
@Skylion007
Copy link
Collaborator

Nice, now if we can just fix the linear memory requirement on coordinator_rank 0 for deduping local save plans.

@Skylion007 Skylion007 requested review from LucasLLC and ezyang November 2, 2025 18:14
@Aminsed
Copy link
Contributor Author

Aminsed commented Nov 2, 2025

@Skylion007 Makes sense, happy to explore streaming dedup in a separate PR.

@ezyang ezyang requested a review from wconstab November 3, 2025 03:46
@wconstab wconstab requested a review from pradeepfn November 3, 2025 17:17
@ezyang
Copy link
Contributor

ezyang commented Nov 3, 2025

@pradeepfn @meetv18 @daulet-askarov you three were named as potential reviewers, PTAL

@fegin fegin added oncall: distributed checkpointing Oncall label should be attached to any issues related to distributed checkpointing. ciflow/trunk Trigger trunk jobs on your pull request labels Nov 3, 2025
@pytorch-bot
Copy link

pytorch-bot bot commented Nov 3, 2025

To add the ciflow label ciflow/trunk please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/trunk Trigger trunk jobs on your pull request label Nov 3, 2025
Copy link
Contributor

@LucasLLC LucasLLC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to make sense to me, so approving to unblock. Please make sure this is properly tested!

Copy link
Collaborator

@Skylion007 Skylion007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good with minor nit

@ezyang
Copy link
Contributor

ezyang commented Nov 5, 2025

just needs typecheck fix ig

@ezyang
Copy link
Contributor

ezyang commented Nov 6, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 6, 2025
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@Aminsed Aminsed force-pushed the optimize-global-plan-validation branch from d38c652 to e656775 Compare November 8, 2025 00:40
@pytorch-bot pytorch-bot bot removed the ciflow/trunk Trigger trunk jobs on your pull request label Nov 8, 2025
@Aminsed
Copy link
Contributor Author

Aminsed commented Nov 8, 2025

@pytorchbot merge

@pytorch-bot
Copy link

pytorch-bot bot commented Nov 8, 2025

Pull workflow has not been scheduled for the PR yet. It could be because author doesn't have permissions to run those or skip-checks keywords were added to PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

@Aminsed
Copy link
Contributor Author

Aminsed commented Nov 8, 2025

Hi @ezyang, could you please re-run @pytorchbot merge when convenient? Thanks!

@Skylion007
Copy link
Collaborator

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 8, 2025
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@Skylion007 Skylion007 added the better-engineering Relatively self-contained tasks for better engineering contributors label Nov 8, 2025
Silv3S pushed a commit to Silv3S/pytorch that referenced this pull request Nov 18, 2025
## Summary
- Fixes pytorch#163548 by replacing the quadratic chunk-overlap scan in `_validate_global_plan` with a sweep-line pass that sorts chunk intervals and keeps an active set via `bisect_right`, giving O(n log n) behavior for metadata validation.
- Add focused tests in `TestValidateGlobalPlan` covering overlapping and non-overlapping shard layouts to lock in the faster path.

## Testing
- python test/distributed/checkpoint/test_planner.py -k ValidateGlobalPlan

## Benchmarks
| chunks | old runtime | new runtime |
|--------|-------------|-------------|
| 1 024  | 0.121 s     | 0.0014 s    |
| 2 048  | 0.486 s     | 0.0027 s    |
| 4 096  | 2.474 s     | 0.0058 s    |
| 8 192  | 8.014 s     | 0.0126 s    |
| 16 384 | 32.740 s    | 0.026 s     |

@ezyang

Pull Request resolved: pytorch#166820
Approved by: https://github.com/LucasLLC, https://github.com/Skylion007
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

better-engineering Relatively self-contained tasks for better engineering contributors ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed checkpointing Oncall label should be attached to any issues related to distributed checkpointing. oncall: distributed Add this issue/PR to distributed oncall triage queue open source release notes: distributed (checkpoint)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DefaultSavePlanner._validate_global_plan is costly slow when FSDP size is extremely large (quadratic)

7 participants