
[FSDP2] idempotent reset_sharded_param: no-op if _local_tensor is already padded #163130

Closed
weifengpy wants to merge 7 commits into gh/weifengpy/31/base from gh/weifengpy/31/head

Conversation

@weifengpy (Contributor) commented Sep 17, 2025

resolves pytorch/torchtitan#1136

torchtitan uses a cached state dict for ft. `reset_sharded_param` should be idempotent if `model.parameters()` are already padded.

# pad DTensor._local_tensor
fully_shard(model)
sd = fsdp_model.state_dict()
# reset_sharded_param should be a no-op in lazy_init
loss = fsdp_model(inp).sum()

This PR makes `reset_sharded_param` idempotent by checking the storage data pointer and returning early.

Unit test:

pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_cached_state_dict
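To make the idea concrete, below is a minimal standalone sketch of this kind of idempotence check (illustrative only, not the actual FSDP2 `reset_sharded_param`; the helper name `pad_local_tensor` and the explicit `padded_numel` argument are assumptions for the example). Padding copies the shard into larger storage and returns a view narrowed back to the original size; a later call notices the already-padded storage and returns early.

```
import torch


def pad_local_tensor(local_tensor: torch.Tensor, padded_numel: int) -> torch.Tensor:
    """Pad `local_tensor` into storage of `padded_numel` elements and return a
    view narrowed back to the original shape. A second call on the returned
    view is a no-op."""
    elem_bytes = local_tensor.element_size()
    # Early return: if the underlying storage already has the padded size,
    # this tensor was padded before -- nothing to do.
    if local_tensor.untyped_storage().size() == padded_numel * elem_bytes:
        return local_tensor
    padded = local_tensor.new_zeros(padded_numel)
    padded[: local_tensor.numel()].copy_(local_tensor.flatten())
    # Narrow back to the original numel; the visible size now equals the
    # original size, so a size-based check cannot tell "padded" from "not yet".
    return padded[: local_tensor.numel()].view(local_tensor.shape)


if __name__ == "__main__":
    shard = torch.randn(5)             # e.g. dim-0 not divisible by world size
    once = pad_local_tensor(shard, 8)  # first call pads into new storage
    twice = pad_local_tensor(once, 8)  # second call is a no-op
    assert once.data_ptr() == twice.data_ptr()
```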

Stack from ghstack (oldest at bottom):

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim @dcci

pytorch-bot bot commented Sep 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163130

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 8351a06 with merge base f6ea41e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

weifengpy added a commit that referenced this pull request Sep 17, 2025
@pytorch-bot pytorch-bot bot added the ciflow/inductor, oncall: distributed, and release notes: distributed (fsdp) labels Sep 17, 2025
weifengpy added a commit that referenced this pull request Sep 17, 2025
@weifengpy weifengpy changed the title from "reset sharded params in state_dict() and make it idempotent" to "[FSDP2] reset sharded params in state_dict() and make it idempotent" Sep 17, 2025
@weifengpy weifengpy marked this pull request as draft September 17, 2025 01:07
@weifengpy (Contributor, Author)

adding unit tests

weifengpy added a commit that referenced this pull request Sep 17, 2025
weifengpy added a commit that referenced this pull request Sep 17, 2025
@weifengpy weifengpy marked this pull request as ready for review September 17, 2025 08:31
@weifengpy (Contributor, Author)

The CI error is not relevant.

@awgu (Collaborator) commented Sep 17, 2025

Sorry, I did not follow why we would need to reset the sharded param in the state dict pre-hook, and I could not figure it out from the unit test.

(E.g., where did the padding get lost in the first place? Why was the padding not re-added at that point?)

@fegin (Contributor) commented Sep 17, 2025

@awgu The current issue is that some training frameworks, including TorchTitan, call model.state_dict() before the first forward() and use that result throughout the entire training run without calling model.state_dict() again. This results in incorrect checkpoints being saved.
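For illustration, here is a rough sketch of the usage pattern described above (assumed names such as `dataloader` and `checkpointer`; this is not TorchTitan's actual code):

```
import torch


def train(fsdp_model: torch.nn.Module, dataloader, checkpointer, num_steps: int) -> None:
    # State dict captured once, before the first forward, and reused for the
    # whole run -- the caching pattern described above.
    cached_sd = fsdp_model.state_dict()
    for step, (inp, _) in zip(range(num_steps), dataloader):
        loss = fsdp_model(inp).sum()
        loss.backward()
        # ... optimizer step / zero_grad elided ...
        if step % 100 == 0:
            # Correctness relies on the tensors in `cached_sd` still aliasing
            # the live (padded) parameters; if lazy_init re-padded them into
            # fresh storage, this would save stale values.
            checkpointer.save(cached_sd)
```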

@tianyu-l (Contributor) left a comment

Thanks for the fix!

return
updated_local_tensor = False
# `reset_sharded_param` can be called twice
# 1st time in sd = model.state_dict()

Please update the comments based on offline discussions:

  • The first time should be during the fully_shard call.
  • The 2nd time could happen with or without a state dict load. If with a load, the 2nd time should not be a no-op.

@weifengpy (Contributor, Author)

Updated. Good catch!

weifengpy added a commit that referenced this pull request Sep 17, 2025
@weifengpy (Contributor, Author) commented Sep 17, 2025

Where did the padding get lost in the first place? Why was the padding not re-added at that point?

Good question! No padding is getting lost. Here is what I want to achieve:

fully_shard(model) with padded local_tensor -> model.state_dict() -> model(input), where reset_sharded_param should be a no-op.

I just need reset_sharded_param to be idempotent. Without the PR, we always create new padded tensors, because local_tensor.size() != padded_sharded_size is always true (local_tensor is narrowed to the original size after padding).

There is no need to call reset_sharded_param in the state dict hooks. I modified the PR accordingly.
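A tiny standalone demo of that point (plain tensors, not FSDP2 internals): after padding, the visible local tensor is narrowed back to its original size, so a size comparison always reports "not padded yet", while the storage data pointer does identify the already-padded case.

```
import torch

orig = torch.arange(5, dtype=torch.float32)    # unpadded shard
padded = torch.zeros(8)                        # padded sharded size
padded[:5].copy_(orig)
local = padded.narrow(0, 0, 5)                 # what the parameter exposes after padding

print(local.size() == padded.size())           # False: a size check would re-pad forever
print(local.data_ptr() == padded.data_ptr())   # True: the storage check detects the padding
print(local.untyped_storage().size())          # 32 bytes: the full padded storage
```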

weifengpy added a commit that referenced this pull request Sep 17, 2025
@weifengpy weifengpy changed the title from "[FSDP2] reset sharded params in state_dict() and make it idempotent" to "[FSDP2] idempotent reset_sharded_param: no-op if _local_tensor is already padded" Sep 17, 2025
@weifengpy (Contributor, Author)

@pytorchmergebot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Sep 17, 2025
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 job has failed; the first few are: trunk / linux-jammy-cuda12.8-py3.10-gcc11 / test (distributed, 3, 3, linux.g4dn.12xlarge.nvidia.gpu)

Details for Dev Infra team: raised by workflow job.

@fegin (Contributor) commented Sep 18, 2025

@weifengpy The test failure looks real.

weifengpy added a commit that referenced this pull request Sep 18, 2025
@weifengpy (Contributor, Author)

@pytorchmergebot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@wwwjn commented Sep 18, 2025

model.state_dict() -> model(input) reset_sharded_param should be no-op

Thanks for the fix @weifengpy! From reading the code and comments, reset_sharded_param() should only pad again if the model's local tensor changed, right?

Also, this PR seems to be merged into another base branch instead of the main branch; is there any plan to upstream it to main?

@weifengpy (Contributor, Author)

@weifengpy The test failure looks real.

Right, I updated the PR to skip tensor subclasses.

@weifengpy (Contributor, Author)

Thanks for the fix @weifengpy! From reading the code and comments, reset_sharded_param() should only pad again if the model's local tensor changed, right?

That's right. For example, loading a state dict triggers padding again.

Also, this PR seems to be merged into another base branch instead of the main branch; is there any plan to upstream it to main?

This is ghstack, so the branch looks weird, but it is merged into main.

@weifengpy weifengpy added the release notes: distributed (fsdp2) label and removed the release notes: distributed (fsdp) label Sep 18, 2025
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
@github-actions github-actions bot deleted the gh/weifengpy/31/head branch October 19, 2025 02:19

Labels

ciflow/inductor, ciflow/trunk, Merged, oncall: distributed, release notes: distributed (fsdp2)


Development

Successfully merging this pull request may close these issues.

Inconsistent loss when resume training with vocab size that is not divisible by world size.

6 participants