fixes keyerror when loading parameter with unsaved optimizer state #165228

Closed
arkadip-maitra wants to merge 1 commit into pytorch:main from arkadip-maitra:fix_#164257

Conversation

@arkadip-maitra
Collaborator

@arkadip-maitra arkadip-maitra commented Oct 11, 2025

@pytorch-bot

pytorch-bot bot commented Oct 11, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/165228

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 685e9fe with merge base 24db5c4:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (checkpoint) labels Oct 11, 2025
@arkadip-maitra arkadip-maitra changed the title from "fixes keyerror when loading unsaved optimizer parameter" to "fixes keyerror when loading parameter with unsaved optimizer state" Oct 11, 2025
@soulitzer soulitzer requested a review from fegin October 13, 2025 15:36
@soulitzer soulitzer added the triaged label Oct 13, 2025
@soulitzer soulitzer requested a review from weifengpy October 13, 2025 15:36
@arkadip-maitra
Collaborator Author

@weifengpy This test failure seems to be unrelated to my change. Could you please approve and merge?

@arkadip-maitra
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased fix_#164257 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fix_#164257 && git pull --rebase)

@fegin fegin added the ciflow/trunk label Oct 27, 2025
@pytorch-bot

pytorch-bot bot commented Oct 27, 2025

To add the ciflow label ciflow/trunk please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/trunk label Oct 27, 2025
@arkadip-maitra
Collaborator Author

@pytorchbot merge

@pytorch-bot

pytorch-bot bot commented Oct 28, 2025

Pull workflow has not been scheduled for the PR yet. It could be because author doesn't have permissions to run those or skip-checks keywords were added to PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

@arkadip-maitra
Collaborator Author

@fegin Could you please merge this PR?

@isuruf
Collaborator

isuruf commented Nov 3, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Nov 3, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (default, 1, 2, linux.rocm.gpu.gfx942.1)

Details for Dev Infra team Raised by workflow job

@arkadip-maitra
Collaborator Author

@isuruf Any idea why these tests are failing? They seem to be unrelated.

@isuruf
Collaborator

isuruf commented Nov 4, 2025

@pytorchbot merge -r viable/strict

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased fix_#164257 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fix_#164257 && git pull --rebase)

@pytorch-bot pytorch-bot bot removed the ciflow/trunk label Nov 4, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@pytorchmergebot
Collaborator

Merge failed

Reason: 3 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@arkadip-maitra
Collaborator Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Nov 5, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Labels

ciflow/trunk - Trigger trunk jobs on your pull request
Merged
oncall: distributed - Add this issue/PR to distributed oncall triage queue
open source
release notes: distributed (checkpoint)
triaged - This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Development

Successfully merging this pull request may close these issues.

distributed checkpoint errors with unused weights and stateful optimizer

6 participants