fix fr reset api #166970

Closed
tushar00jain wants to merge 1 commit into pytorch:main from tushar00jain:pr166970

@tushar00jain
Contributor

@tushar00jain tushar00jain commented Nov 4, 2025

Summary:

  • various places access fr's entries_ field
  • if we empty entries_ on reset, those accesses can result in an error
  • so reset only performs a soft delete instead of clearing out the entries completely
    • only id_ is reset
    • a reset_epoch is tracked, which increments every time reset is called
    • dump_entries only returns entries from the latest epoch
    • APIs that access entries also check that the reset epoch matches
  • next_ now always tracks the index in the circular buffer; this change was needed to make the soft delete easier to implement
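
The soft-delete idea in the summary can be illustrated with a small, self-contained sketch. This is not the PyTorch implementation (which is C++); the class `SoftResetRing`, its methods `record` and `get`, and the handle tuple are hypothetical names made up for illustration. Only `id_`, `next_`, `reset_epoch`, `entries_` and `dump_entries` come from the description above, and their exact semantics here are an assumption.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class Entry:
    id: int           # sequence number; restarts at 0 after each reset
    reset_epoch: int  # epoch in which this entry was recorded
    payload: Any


# (slot index, entry id, reset epoch) -- hypothetical handle given to callers
Handle = Tuple[int, int, int]


class SoftResetRing:
    """Toy ring buffer with a soft-delete reset (illustrative only)."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self.entries: List[Entry] = []  # plays the role of entries_
        self.next = 0                   # always the slot for the next write
        self.id = 0                     # plays the role of id_
        self.reset_epoch = 0

    def record(self, payload: Any) -> Handle:
        entry = Entry(self.id, self.reset_epoch, payload)
        if len(self.entries) < self.max_entries:
            self.entries.append(entry)
        else:
            self.entries[self.next] = entry  # overwrite the oldest slot
        handle = (self.next, entry.id, entry.reset_epoch)
        self.next = (self.next + 1) % self.max_entries
        self.id += 1
        return handle

    def reset(self) -> None:
        # Soft delete: old entries stay in memory, so a stale slot index
        # never points outside the buffer. Only id and epoch change.
        self.id = 0
        self.reset_epoch += 1

    def get(self, handle: Handle) -> Optional[Entry]:
        slot, entry_id, epoch = handle
        if epoch != self.reset_epoch:
            return None  # handle was issued before the latest reset
        entry = self.entries[slot]
        if entry.id != entry_id or entry.reset_epoch != epoch:
            return None  # slot has since been overwritten
        return entry

    def dump_entries(self) -> List[Entry]:
        # Oldest first: slots from next to the end, then from 0 up to next.
        ordered = self.entries[self.next:] + self.entries[: self.next]
        return [e for e in ordered if e.reset_epoch == self.reset_epoch]
```

In this sketch, a handle obtained before `reset()` resolves to `None` afterwards instead of indexing into an emptied list, and `dump_entries()` filters out entries from earlier epochs even though they are still physically present in the buffer.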

Stack created with Sapling. Best reviewed with ReviewStack.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci

@pytorch-bot

pytorch-bot bot commented Nov 4, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166970

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 43d0547 with merge base eea8ff2 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Nov 4, 2025
This was referenced Nov 4, 2025
@tushar00jain tushar00jain marked this pull request as ready for review November 4, 2025 17:50
@tushar00jain tushar00jain requested a review from fduwjj November 4, 2025 17:50
@meta-codesync

meta-codesync bot commented Nov 4, 2025

@tushar00jain has imported this pull request. If you are a Meta employee, you can view this in D86214258.

@tushar00jain tushar00jain force-pushed the pr166970 branch 2 times, most recently from d9a1ec4 to 787a54e Compare November 4, 2025 20:06
@tushar00jain
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 4, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@tushar00jain
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@tushar00jain tushar00jain force-pushed the pr166970 branch 2 times, most recently from d1d20ff to 487772e Compare November 5, 2025 19:05
@tushar00jain
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category

3 participants