Optimize global save-plan validation#166820
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166820
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit e656775 with merge base 4957ae5 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
Nice, now if we can just fix the linear memory requirement on coordinator_rank 0 for deduping local save plans. |
|
@Skylion007 Makes sense, happy to explore streaming dedup in a separate PR. |
|
@pradeepfn @meetv18 @daulet-askarov you three were named as potential reviewers, PTAL |
|
To add the ciflow label This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows. |
LucasLLC
left a comment
There was a problem hiding this comment.
Seems to make sense to me, so approving to unblock. Please make sure this is properly tested!
Skylion007
left a comment
There was a problem hiding this comment.
Looks good with minor nit
|
just needs typecheck fix ig |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 1 mandatory check(s) failed. The first few are: Dig deeper by viewing the failures on hud |
d38c652 to
e656775
Compare
|
@pytorchbot merge |
|
Pull workflow has not been scheduled for the PR yet. It could be because author doesn't have permissions to run those or skip-checks keywords were added to PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra. |
|
Hi @ezyang, could you please re-run @pytorchbot merge when convenient? Thanks! |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
## Summary - Fixes pytorch#163548 by replacing the quadratic chunk-overlap scan in `_validate_global_plan` with a sweep-line pass that sorts chunk intervals and keeps an active set via `bisect_right`, giving O(n log n) behavior for metadata validation. - Add focused tests in `TestValidateGlobalPlan` covering overlapping and non-overlapping shard layouts to lock in the faster path. ## Testing - python test/distributed/checkpoint/test_planner.py -k ValidateGlobalPlan ## Benchmarks | chunks | old runtime | new runtime | |--------|-------------|-------------| | 1 024 | 0.121 s | 0.0014 s | | 2 048 | 0.486 s | 0.0027 s | | 4 096 | 2.474 s | 0.0058 s | | 8 192 | 8.014 s | 0.0126 s | | 16 384 | 32.740 s | 0.026 s | @ezyang Pull Request resolved: pytorch#166820 Approved by: https://github.com/LucasLLC, https://github.com/Skylion007
Summary
DefaultSavePlanner._validate_global_planis costly slow when FSDP size is extremely large (quadratic) #163548 by replacing the quadratic chunk-overlap scan in_validate_global_planwith a sweep-line pass that sorts chunk intervals and keeps an active set viabisect_right, giving O(n log n) behavior for metadata validation.TestValidateGlobalPlancovering overlapping and non-overlapping shard layouts to lock in the faster path.Testing
Benchmarks
@ezyang
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @LucasLLC @pradeepfn