
fix: use grad div factor when fsdp_degree=1 #167178

Closed
garrett361 wants to merge 11 commits into pytorch:main from garrett361:fully-shard-divide-fix

Conversation

@garrett361 (Contributor) commented Nov 6, 2025

fully_shard's gradient_divide_factor isn't currently respected when the sharding degree = 1. This PR ensures the division factor applies also in this case.

This is a bit of an edge case, but it arises in torchtitan, e.g. with expert parallelism and ep_degree=world_size, where we still wrap the routed experts in fully_shard because:

  1. It lets us take advantage of its mixed-precision mechanisms.
  2. A specific gradient_divide_factor is needed for correctness.

This PR ensures correctness in the reduce_scatter_group.size()==1 case.

Reproducer and sample failures are in the gist here. The net effect is that the EP grads are too large by a factor of the world size in the case described above. I checked that the proposed fix makes these tests pass.
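Roughly, the setup in question looks like the sketch below. To be clear, this is just an illustration with made-up names (MyRoutedExperts, the (ep, fsdp) mesh layout, and using world_size as the factor are all assumptions); the gist has the actual reproducer.

    import torch
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.fsdp import MixedPrecisionPolicy, fully_shard

    # Assumes torch.distributed has already been initialized.
    world_size = torch.distributed.get_world_size()

    # With ep_degree == world_size, the FSDP shard mesh over the routed
    # experts has size 1.
    mesh = init_device_mesh("cuda", (world_size, 1), mesh_dim_names=("ep", "fsdp"))

    experts = MyRoutedExperts()  # hypothetical routed-experts module
    fully_shard(
        experts,
        mesh=mesh["fsdp"],
        mp_policy=MixedPrecisionPolicy(param_dtype=torch.bfloat16),
    )
    # Before this fix the factor below was ignored, because
    # reduce_scatter_group.size() == 1, leaving the expert grads too large
    # by a factor of the world size.
    experts.set_gradient_divide_factor(world_size)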

I guess I should add a test for this, too?

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci

@pytorch-bot bot commented Nov 6, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167178

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit a88ea4f with merge base dc00842:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the oncall: distributed and release notes: distributed (fsdp) labels on Nov 6, 2025
@linux-foundation-easycla bot commented Nov 6, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@tianyu-l (Contributor) commented Nov 6, 2025

Thanks a lot! Sounds right to me. I'll let @weifengpy review / stamp.

Yes, a test would be appreciated.

@garrett361 force-pushed the fully-shard-divide-fix branch from 7078973 to 0490875 on November 6, 2025 03:45
@weifengpy (Contributor) commented:

good catch!

@weifengpy (Contributor) commented:

triggering CI - waiting for signals

@garrett361 (Contributor, Author) commented:

Thanks @weifengpy! Figuring out the CLA stuff with work, then I'll add a test.

@garrett361 (Contributor, Author) commented:

@weifengpy still waiting on the CLA stuff, but I started looking at where I'd add a test. I'm not seeing gradient_divide_factor tested anywhere in the code base. Am I missing a test somewhere? The closest I can find is in TestFullyShardCollectiveOps, where it's mentioned but left as None.

Should I add a test here? Seems like I'd have to add a decent bit of infra code to get a proper test set up.

@weifengpy (Contributor) commented:

> Should I add a test here? Seems like I'd have to add a decent bit of infra code to get a proper test set up.

Adding a unit test would be great! I was mentioning CI to make sure it does not break shard world size 2+, but having a unit test for world size 1 is better.

cc @anshul-si for a bug fix in world size 1

@garrett361 force-pushed the fully-shard-divide-fix branch from 337f84f to 8862e11 on November 11, 2025 15:26
@garrett361 (Contributor, Author) commented:

I found some other issues in the code and tests related to this topic:

  1. There is a current edge case where, if the user calls model.set_gradient_divide_factor(factor) and factor happens to equal the data parallel size, then an (I believe) unintended code path is taken: we end up dividing grads by factor and then averaging over the allreduce group, rather than summing. From the code below:

    if not overflow_risk and not force_sum_reduction_for_comms:
        if factor == data_parallel_size:
            # Warning: NCCL ReduceOp.AVG may produce incorrect results with
            # world size 1.
            if data_parallel_size == 1:
                return None, None, ReduceOp.SUM, ReduceOp.SUM
            return None, None, ReduceOp.AVG, ReduceOp.AVG
        else:
            reduce_scatter_op = torch.distributed._make_nccl_premul_sum(1 / factor)
            return None, None, reduce_scatter_op, ReduceOp.SUM

    If factor != data_parallel_size, we end up at the final line and return None, None, reduce_scatter_op, ReduceOp.SUM. But if we happen to set factor = data_parallel_size, then we enter the first if statement and (assuming data_parallel_size > 1) return None, None, ReduceOp.AVG, ReduceOp.AVG. This is a change in semantics: the final ReduceOp.SUM return value is swapped for a ReduceOp.AVG in the latter case. I believe we should always want ReduceOp.SUM when a custom division factor is provided and changed the code to reflect that (see the sketch after this list), but let me know if that is not desired.

  2. I found the relevant _test_set_reduce_scatter_divide_factor test, and as written it wasn't sensitive enough to catch some of these issues. I updated this PR to increase the sensitivity of this test and cover more cases (a sketch of the kind of check I mean is at the end of this comment). This is how I caught the above edge case.

  3. There's also the test _test_set_reduce_scatter_divide_factor_mixed_prevision, but this doesn't seem to actually test anything about custom division factors, because we apply the division factor only to the outermost fully_shard-wrapped module (the whole model), which doesn't own any parameters itself since it's a Sequential:

    model = nn.Sequential(*[MLP(16) for _ in range(3)])
    ref_model = copy.deepcopy(model).to(device_type)
    ref_model_bf16 = copy.deepcopy(ref_model).to(param_dtype)
    ref_optim = torch.optim.AdamW(ref_model.parameters(), lr=1e-2)
    for mlp in model:
        fully_shard(mlp, mp_policy=mp_policy)
    model = fully_shard(model, mp_policy=mp_policy)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-2)
    model.set_reduce_scatter_divide_factor(divide_factor)
    Does that sound right? I didn't touch this test yet, but I can if desired; I'm trying to keep the PR small.
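For item 1, here is a minimal sketch of the behavior I'd expect. This is not the literal diff in this PR; custom_factor_was_set is a made-up flag standing in for however the user-provided divide factor is tracked.

    if not overflow_risk and not force_sum_reduction_for_comms:
        if factor == data_parallel_size and not custom_factor_was_set:
            # Default path: average over the data parallel group.
            # Warning: NCCL ReduceOp.AVG may produce incorrect results with
            # world size 1.
            if data_parallel_size == 1:
                return None, None, ReduceOp.SUM, ReduceOp.SUM
            return None, None, ReduceOp.AVG, ReduceOp.AVG
        # Custom factor: pre-multiply by 1 / factor and keep both the
        # reduce-scatter and all-reduce ops as SUMs, even when factor
        # happens to equal data_parallel_size.
        reduce_scatter_op = torch.distributed._make_nccl_premul_sum(1 / factor)
        return None, None, reduce_scatter_op, ReduceOp.SUM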

CC @anshul-si @weifengpy @tianyu-l

Tested locally that everything in test/distributed/_composable/fsdp/test_fully_shard_comm.py passes.
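Concretely, the kind of factor-sensitive check I have in mind for item 2 looks roughly like the following. This is only a sketch with hypothetical names (ref_model, model, divide_factor), not the literal test code, and it assumes both models already ran a backward on the same inputs.

    import torch
    import torch.distributed as dist

    # With a custom divide factor, the fully_shard grads should equal the
    # SUM of the per-rank reference grads divided by divide_factor.
    for ref_param, param in zip(ref_model.parameters(), model.parameters()):
        ref_grad = ref_param.grad.detach().clone()
        dist.all_reduce(ref_grad, op=dist.ReduceOp.SUM)
        ref_grad.div_(divide_factor)
        torch.testing.assert_close(ref_grad, param.grad.full_tensor())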

@weifengpy (Contributor) commented:

@garrett361 thanks for the detailed unit tests

@garrett361 (Contributor, Author) commented:

Thanks @weifengpy, do you want me to make any changes to _test_set_reduce_scatter_divide_factor_mixed_prevision as well?

@garrett361 (Contributor, Author) commented:

Hi @weifengpy, checking in on this. Anything else you need from my end? Thanks!

@weifengpy (Contributor) commented:

> Hi @weifengpy, checking in on this. Anything else you need from my end? Thanks!

@garrett361 I just took another look and it's safe. We can land once CI passes. No need to cover _test_set_reduce_scatter_divide_factor_mixed_prevision.

@weifengpy (Contributor) commented:

@pytorchmergebot merge

@pytorch-bot bot added the ciflow/trunk label on Nov 19, 2025
@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

