[FSDP2] Added HSDP grad acc tests and some minor changes #125479

awgu · 2024-05-03T17:42:12Z

Stack from ghstack (oldest at bottom):

This adds HSDP to the existing gradient accumulation tests and includes some minor changes to simplify things a tiny bit.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

[ghstack-poisoned]

pytorch-bot · 2024-05-03T17:42:14Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125479

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 54ee411 with merge base b03fb49:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wanchaol · 2024-05-03T19:13:48Z

torch/testing/_internal/common_fsdp.py



+@contextlib.contextmanager
+def patch_all_reduce(new_all_reduce: Callable):


Hopefully, once #125475 lands, a lot of these can be simplified!
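For readers skimming the thread: the diff excerpt above shows only the signature of `patch_all_reduce`. A minimal sketch of how such a context manager could work is below, assuming it temporarily swaps out the module-level `all_reduce` and restores it on exit. The body is an assumption, not the code from this PR, and `dist` here is a stand-in object (not the real `torch.distributed`) so the example runs without PyTorch; `counting_all_reduce` is a hypothetical replacement used only for illustration.

```python
import contextlib
from types import SimpleNamespace
from typing import Callable

# Stand-in for the `torch.distributed` module (assumption, for illustration only).
dist = SimpleNamespace(all_reduce=lambda tensor, *args, **kwargs: tensor)


@contextlib.contextmanager
def patch_all_reduce(new_all_reduce: Callable):
    # Assumed body: save the original, install the replacement, restore on exit.
    orig_all_reduce = dist.all_reduce
    dist.all_reduce = new_all_reduce
    try:
        yield
    finally:
        # Restore even if the body of the `with` block raises.
        dist.all_reduce = orig_all_reduce


calls = []


def counting_all_reduce(tensor, *args, **kwargs):
    # Hypothetical replacement that records each call instead of communicating.
    calls.append(tensor)
    return tensor


original = dist.all_reduce
with patch_all_reduce(counting_all_reduce):
    dist.all_reduce("grad")  # routed to counting_all_reduce
```

A test can use a wrapper like this to count or suppress all-reduces during selected microbatches and then assert on what was recorded.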

awgu · 2024-05-03T20:33:56Z

@pytorchbot merge

pytorchmergebot · 2024-05-03T20:35:47Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

**Context**

We are interested in supporting the case where HSDP reduce-scatters but does not all-reduce in a microbatch backward. This saves communication while still saving memory. Only on the last microbatch do we need to both reduce-scatter and all-reduce. This is not implemented yet and will hopefully come in a future PR.

There is one notable part of doing this. On the last microbatch, we need to perform an accumulation step after reduce-scatter and before all-reduce. If not, then the preceding microbatch's gradients will not be contributed across the replica group. (In other words, we cannot simply accumulate _after_ all-reduce.)

Consider 32 GPUs with 4-way replication and 8-way sharding and 2 microbatches, and focus on global rank 0.

- After the first microbatch, rank 0 will have its shard of $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)}$, where we define $S(0) = \{0, 1, \dots, 7\}$ to be the ranks in its shard group and we define the $(1)$ superscript to denote the first microbatch.
- Upon the second microbatch, rank 0 after its reduce-scatter will additionally have its shard of $\frac{1}{8} \sum_{i \in S(0)} g_i^{(2)}$. If we only all-reduce this, then this second microbatch's gradients become $\frac{1}{32} \sum_{i=0, 1, \dots, 31} g_i^{(2)}$, so in total, rank 0 has $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)} + \frac{1}{32} \sum_{i=0, 1, \dots, 31} g_i^{(2)}$, which is wrong.
- Importantly, we must accumulate $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)} + \frac{1}{8} \sum_{i \in S(0)} g_i^{(2)} = \frac{1}{8}\sum_{i \in S(0)} (g_i^{(1)} + g_i^{(2)})$ first before all-reducing to get $\frac{1}{32} \sum_{i=0, 1, \dots, 31} (g_i^{(1)} + g_i^{(2)})$.

Now, note how under this approach, we want a factor of $\frac{1}{8}$ only (i.e. reciprocal of the shard group size), not $\frac{1}{32}$, for the first microbatch's gradients.

- For bf16/fp32, since we use `ReduceOp.AVG` and we only reduce-scatter on the first microbatch, we correctly have a factor of $\frac{1}{8}$ on the first microbatch.
- For fp16, since we precompute the gradient divide factors at init time assuming always reducing over both shard and replica groups, we incorrectly have a factor of $\frac{1}{32}$ on the first microbatch, deviating from the bf16/fp32 case.

We can address this issue by matching the bf16/fp32 vs. fp16 semantics by computing the divide factors at runtime based on which process groups were passed into the reduction function (`foreach_reduce`).

**Additional Notes**

How to implement the HSDP reduce-scatter but no all-reduce is not entirely clear yet. (What is the cleanest way to do this?) We need to store the partial reduce-scatter output and check for it upon the next backward. We should also be sure to error if the set of parameters receiving gradients changes, in which case we cannot support this easily. Anyway, we will implement this in a follow-up.

Pull Request resolved: #125484
Approved by: https://github.com/wanchaol
ghstack dependencies: #125431, #125479
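The 32-GPU example above can be checked numerically. The following is a toy sketch, not FSDP2 code: each rank's gradient is a single exact fraction, and reduce-scatter and all-reduce are modeled as plain averages over the shard group and replica group. All helper names are hypothetical, and the rank layout (rank = replica index * 8 + shard index) is an assumption chosen so that $S(0) = \{0, 1, \dots, 7\}$. The `divide_factor` helper only illustrates the idea of deriving the factor from the groups actually reduced over; it is not the signature of `foreach_reduce`.

```python
from fractions import Fraction

REPLICATE, SHARD = 4, 8        # 4-way replication, 8-way sharding
WORLD = REPLICATE * SHARD      # 32 ranks


def shard_group(rank):
    # Ranks that reduce-scatter together; shard_group(0) is ranks 0..7.
    start = (rank // SHARD) * SHARD
    return range(start, start + SHARD)


def replica_group(rank):
    # Ranks that all-reduce together; replica_group(0) is ranks 0, 8, 16, 24.
    return range(rank % SHARD, WORLD, SHARD)


def reduce_scatter_avg(grads):
    # Each rank ends up with the average over its shard group (factor 1/8).
    return [sum(grads[i] for i in shard_group(r)) / SHARD for r in range(WORLD)]


def all_reduce_avg(values):
    # Each rank ends up with the average over its replica group (factor 1/4).
    return [sum(values[i] for i in replica_group(r)) / REPLICATE for r in range(WORLD)]


def divide_factor(reduce_scatter_size, all_reduce_size=None):
    # Hypothetical runtime rule: divide only by the groups actually reduced over.
    return reduce_scatter_size * (all_reduce_size or 1)


# Arbitrary per-rank gradients for microbatches 1 and 2.
g1 = [Fraction(i) for i in range(WORLD)]
g2 = [Fraction(100 + 3 * i) for i in range(WORLD)]
expected = sum(a + b for a, b in zip(g1, g2)) / WORLD

partial1 = reduce_scatter_avg(g1)   # microbatch 1: reduce-scatter only
partial2 = reduce_scatter_avg(g2)   # microbatch 2: reduce-scatter ...

# Correct: accumulate the partial results first, then all-reduce the sum.
correct = all_reduce_avg([a + b for a, b in zip(partial1, partial2)])

# Wrong: all-reduce only microbatch 2, then add the un-all-reduced microbatch 1.
wrong = [a + b for a, b in zip(partial1, all_reduce_avg(partial2))]

print("expected:", expected, "correct:", correct[0], "wrong:", wrong[0])
```

In this toy model, accumulating before the all-reduce reproduces the global average on every rank, while all-reducing only the second microbatch leaves rank 0 with the first microbatch averaged over its shard group alone.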

[FSDP2] Added HSDP grad acc tests and some minor changes

2295372

[ghstack-poisoned]

awgu mentioned this pull request May 3, 2024

[FSDP2] Added test to show rank 0 broadcast for HSDP replicas #125431

Closed

pytorch-bot bot added ci-td-distributed oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (fsdp) release notes category labels May 3, 2024

Update on "[FSDP2] Added HSDP grad acc tests and some minor changes"

3190fd2

awgu added release notes: distributed (fsdp2) release notes category and removed release notes: distributed (fsdp) release notes category labels May 3, 2024

awgu mentioned this pull request May 3, 2024

[FSDP2] Computed grad divide factors at runtime #125484

Closed

awgu added ciflow/inductor ciflow/trunk Trigger trunk jobs on your pull request labels May 3, 2024

awgu marked this pull request as ready for review May 3, 2024 19:10

awgu requested a review from a team as a code owner May 3, 2024 19:10

awgu requested review from wanchaol and weifengpy May 3, 2024 19:10

wanchaol approved these changes May 3, 2024

pytorchmergebot added the merging label May 3, 2024

pytorchmergebot closed this in 996bb74 May 3, 2024

pytorchmergebot added the Merged label May 3, 2024

pytorchmergebot removed the merging label May 3, 2024

github-actions bot deleted the gh/awgu/579/head branch June 5, 2024 01:55
