
Conversation

@awgu (Collaborator) commented May 2, 2024

Stack from ghstack (oldest at bottom):

This PR shows a simple utility to broadcast the parameters across replicas for HSDP:

```
import torch
import torch.distributed as dist

replicate_group = mesh.get_group("replicate")
for param in model.parameters():
    # E.g. for mesh [[0, 1, 2, 3], [4, 5, 6, 7]], sharding on dim-1 and
    # replicating on dim-0, broadcast with sources 0, 1, 2, 3
    src_rank = dist.get_process_group_ranks(replicate_group)[0]
    torch.distributed.broadcast(
        param.to_local(), src=src_rank, group=replicate_group
    )
```

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

@pytorch-bot (bot) commented May 2, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125431

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 7505a71 with merge base b03fb49:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot added the ci-td-distributed, oncall: distributed, and topic: not user facing labels May 2, 2024
awgu pushed a commit that referenced this pull request May 2, 2024
Review thread on the lines:

# E.g. for mesh [[0, 1, 2, 3], [4, 5, 6, 7]] sharding on dim-1 and
# replicating on dim-0, broadcast with sources 0, 1, 2, 3
src_rank = dist.get_process_group_ranks(replicate_group)[0]
torch.distributed.broadcast(

@awgu (Collaborator, Author):
Today, in-place c10d broadcast is preferred.

If we want to use functional broadcast:

  1. We need to verify the semantics. We may still need to get the `src_rank` as we do here, which is confusing since it is the rank with respect to the global process group, not the group passed to `broadcast`.
  2. We need to swap the newly broadcasted tensor in. Since FSDP holds a reference, we cannot just `setattr(module, param_name, broadcasted_param)`, since that would leave FSDP's reference stale. We may consider using `swap_tensors`, but we are blocked by the local tensor padding issue, since the broadcasted parameter would not have padding whereas the current one is actually a view into the padded local tensor.
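
For illustration, a rough sketch of what the functional path might look like, assuming `torch.distributed._functional_collectives.broadcast` with a `(tensor, src, group)` signature and using an in-place copy as a stand-in for the blocked swap (this is not the proposed implementation):

```
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

def broadcast_params_functional(model, replicate_group):
    # Point 1: the source is still the *global* rank of the replicate group's
    # first member, not a rank within `replicate_group`, which is easy to get wrong.
    src_rank = dist.get_process_group_ranks(replicate_group)[0]
    for param in model.parameters():
        local = param.to_local()
        broadcasted = funcol.broadcast(local, src_rank, replicate_group)
        # Point 2: `broadcasted` is a new tensor, so it must be swapped back in.
        # `setattr` would leave FSDP's reference stale, and `swap_tensors` is
        # blocked by the padding issue, so this sketch just copies in place,
        # which is why the in-place c10d broadcast is the simpler choice.
        local.copy_(broadcasted)
```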

@wanchaol (Collaborator):

Yeah, I think an in-place broadcast makes sense here!

Review thread on the lines:

return 4

@unittest.skipIf(not TEST_CUDA, "no cuda")
def test_hsdp_broadcast_across_replicas(self):

@awgu (Collaborator, Author):
We might wonder: why not always have HSDP broadcast at init time? The issue is that we only need to broadcast if we are initializing from scratch (not from a checkpoint). If we are initializing from a checkpoint, then the replicas are already guaranteed to be the same, so broadcasting is wasteful and can hurt init time.
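
As a side note, a minimal sketch of how one might verify that replicas agree after the broadcast (illustrative only; `check_replicas_equal` is not an API from this PR): all-gather each local shard across the replicate group and compare against the first replica's copy.

```
import torch
import torch.distributed as dist

def check_replicas_equal(model, replicate_group):
    # For each parameter, gather the local shard from every replica and assert
    # that they all match the first replica's copy.
    for param in model.parameters():
        local = param.to_local()
        gathered = [torch.empty_like(local) for _ in range(replicate_group.size())]
        dist.all_gather(gathered, local, group=replicate_group)
        for other in gathered[1:]:
            assert torch.equal(gathered[0], other)
```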

@awgu marked this pull request as ready for review May 2, 2024 21:50
@awgu requested review from wanchaol and weifengpy May 2, 2024 21:50
@awgu added the ciflow/trunk label (Trigger trunk jobs on your pull request) May 3, 2024
@weifengpy (Contributor) left a comment:

Nice! I feel good about having the example of getting the global rank from replicate_group. Hopefully people copy from here instead of using the local rank in c10d.

…cas"


This PR shows a simple utility to broadcast the parameters across replicas for HSDP:
```
replicate_group = mesh.get_group("replicate")
for param in model.parameters():
    # E.g. for mesh [[0, 1, 2, 3], [4, 5, 6, 7]] sharding on dim-1 and
    # replicating on dim-0, broadcast with sources 0, 1, 2, 3
    src_rank = dist.get_process_group_ranks(replicate_group)[0]
    torch.distributed.broadcast(
        param.to_local(), src=src_rank, group=replicate_group
    )
```


cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k

[ghstack-poisoned]
@awgu (Collaborator, Author) commented May 3, 2024

By the way, I am open to the idea of some fsdp/utils.py file for common code snippets like this. However, in my personal opinion, (1) we should not do this by default in FSDP, and (2) we should not make it a flag.

(1) is because it is a waste to broadcast if loading from a state dict.
(2) is because a flag makes this inflexible in user code, as the user must know at FSDP sharding time whether it will load a state dict or not. We can imagine the code being cleaner if we have the shared FSDP sharding code and then, later, when choosing to load from a state dict or to init weights, we decide whether to broadcast or not (see the sketch below).
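
A minimal sketch of that user-code shape, with placeholder names (`shard_model`, `load_checkpoint`, and `init_weights` are hypothetical; the broadcast loop is the utility from this PR):

```
import torch.distributed as dist

def setup(model, mesh, checkpoint_path=None):
    # Shared sharding code: identical whether we resume or train from scratch.
    shard_model(model, mesh)  # placeholder for the FSDP sharding calls

    if checkpoint_path is not None:
        # Loading a checkpoint: replicas are already identical, so skip the
        # broadcast and avoid the extra communication at init time.
        load_checkpoint(model, checkpoint_path)
    else:
        # Training from scratch: initialize weights, then broadcast across
        # replicas so that every replica starts from the same parameters.
        init_weights(model)
        replicate_group = mesh.get_group("replicate")
        src_rank = dist.get_process_group_ranks(replicate_group)[0]
        for param in model.parameters():
            dist.broadcast(param.to_local(), src=src_rank, group=replicate_group)
    return model
```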

@weifengpy (Contributor):

> (1) we should not do this by default in FSDP and (2) we should not make it a flag.

(1) feels easy to understand.
(2) Does 'make it a flag' mean an ENV flag or 'make it an official best practice'?

@awgu (Collaborator, Author) commented May 3, 2024

> (2) Does 'make it a flag' mean an ENV flag or 'make it an official best practice'?

Today, FSDP1 has a `sync_module_states: bool` flag to control this kind of behavior. I would prefer not to add an analogous flag to FSDP2. We should definitely socialize / document properly that if you are pretraining from scratch with HSDP, you should broadcast across replicas (unless you know your seed is set properly).
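
For context, the FSDP1 flag being referred to looks roughly like this (illustrative only; wrapping details omitted):

```
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# FSDP1 broadcasts module states (parameters and buffers) from rank 0 at wrap
# time when sync_module_states=True; the preference stated above is to not add
# an analogous flag to FSDP2 and instead broadcast explicitly when pretraining
# from scratch.
model = FSDP(model, sync_module_states=True)
```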

@wanchaol (Collaborator) left a comment:
lgtm

@awgu (Collaborator, Author) commented May 3, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request May 3, 2024
This adds HSDP to the existing gradient accumulation tests and includes some minor changes to simplify things a tiny bit.

Pull Request resolved: #125479
Approved by: https://github.com/wanchaol
ghstack dependencies: #125431
pytorchmergebot pushed a commit that referenced this pull request May 3, 2024
**Context**
We are interested in supporting the case where HSDP reduce-scatters but does not all-reduce in a microbatch backward. This saves communication while still saving memory. Only on the last microbatch do we need to both reduce-scatter and all-reduce. This is not implemented yet and will hopefully come in a future PR.

There is one notable part of doing this. On the last microbatch, we need to perform an accumulation step after reduce-scatter and before all-reduce. If not, then the preceding microbatch's gradients will not be contributed across the replica group. (In other words, we cannot simply accumulate _after_ all-reduce.)

Consider 32 GPUs with 4-way replication and 8-way sharding and 2 microbatches, and focus on global rank 0.
- After the first microbatch, rank 0 will have its shard of $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)}$, where we define $S(0) = \{0, 1, \dots, 7\}$ to be the ranks in its shard group and we define the $(1)$ superscript to denote the first microbatch.
- Upon the second microbatch, rank 0 after its reduce-scatter will additionally have its shard of $\frac{1}{8} \sum_{i \in S(0)} g_i^{(2)}$. If we only all-reduce this, then this second microbatch's gradients become $\frac{1}{32} \sum_{i=0, 1, \dots, 31} g_i^{(2)}$, so in total, rank 0 has $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)} + \frac{1}{32} \sum_{i=0, 1, \dots, 31} g_i^{(2)}$, which is wrong.
- Importantly, we must accumulate $\frac{1}{8} \sum_{i \in S(0)} g_i^{(1)}  + \frac{1}{8} \sum_{i \in S(0)} g_i^{(2)} = \frac{1}{8}\sum_{i \in S(0)} (g_i^{(1)} + g_i^{(2)})$ first before all-reducing to get $\frac{1}{32} \sum_{i=0, 1, \dots, 31} (g_i^{(1)} + g_i^{(2)})$.

Now, note how under this approach, we want a factor of $\frac{1}{8}$ only (i.e. reciprocal of the shard group size), not $\frac{1}{32}$, for the first microbatch's gradients.
- For bf16/fp32, since we use `ReduceOp.AVG` and we only reduce-scatter on the first microbatch, we correctly have a factor of $\frac{1}{8}$ on the first microbatch.
- For fp16, since we precompute the gradient divide factors at init time assuming always reducing over both shard and replica groups, we incorrectly have a factor of $\frac{1}{32}$ on the first microbatch, deviating from the bf16/fp32 case.

We can address this issue, and match the bf16/fp32 and fp16 semantics, by computing the divide factors at runtime based on which process groups were passed into the reduction function (`foreach_reduce`).
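
A rough illustration of that runtime computation (a sketch with assumed names; the actual `foreach_reduce` internals may differ): the divide factors depend only on the process groups actually participating in this reduction.

```
from typing import Optional, Tuple

import torch
import torch.distributed as dist

def compute_grad_divide_factors(
    reduce_scatter_group: dist.ProcessGroup,
    all_reduce_group: Optional[dist.ProcessGroup],  # None when skipping all-reduce
    reduce_dtype: torch.dtype,
) -> Tuple[Optional[float], Optional[float]]:
    # The effective data-parallel size is the product of the sizes of the groups
    # being reduced over in *this* call, not always shard size x replica size.
    data_parallel_size = reduce_scatter_group.size()
    if all_reduce_group is not None:
        data_parallel_size *= all_reduce_group.size()
    if reduce_dtype in (torch.float32, torch.bfloat16):
        # bf16/fp32 can use ReduceOp.AVG directly, so no explicit factors are needed.
        return None, None
    # fp16: split the division into pre- and post-reduction factors (roughly
    # sqrt-balanced) to reduce the risk of overflow/underflow during the reduction.
    pre = 1
    while data_parallel_size % pre == 0 and data_parallel_size / pre > pre:
        pre *= 2
    return float(pre), data_parallel_size / pre
```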

**Additional Notes**
How to implement HSDP reduce-scatter without all-reduce is not entirely clear yet (what is the cleanest way to do this?). We need to store the partial reduce-scatter output and check for it upon the next backward. We should also be sure to error if the set of parameters receiving gradients changes, in which case we cannot support this easily. Anyway, we will implement this in a follow-up.
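
Purely as a thought sketch of the bookkeeping described above (not the actual implementation; the names are made up):

```
class _PartialReduceState:
    """Holds reduce-scattered gradients that have not yet been all-reduced."""

    def __init__(self):
        self.partial_grads = None   # reduce-scattered gradients awaiting all-reduce
        self.param_set = None       # which parameters received gradients

    def accumulate(self, param_set, reduce_scattered_grads):
        # Error out if the set of parameters receiving gradients changes across
        # microbatches, since deferred all-reduce cannot easily support that.
        if self.param_set is not None and self.param_set != param_set:
            raise RuntimeError(
                "Parameters receiving gradients changed across microbatches; "
                "cannot defer the HSDP all-reduce."
            )
        self.param_set = param_set
        if self.partial_grads is None:
            self.partial_grads = reduce_scattered_grads
        else:
            # Accumulate before the eventual all-reduce (see the math above).
            self.partial_grads = [
                a + b for a, b in zip(self.partial_grads, reduce_scattered_grads)
            ]
```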

Pull Request resolved: #125484
Approved by: https://github.com/wanchaol
ghstack dependencies: #125431, #125479
@github-actions (bot) deleted the gh/awgu/578/head branch June 5, 2024 01:53