Disable GC in process based async checkpointing #169613
ankitageorge wants to merge 1 commit into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169613
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 0b92c78 with merge base 5a7a65a.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@ankitageorge has exported this pull request. If you are a Meta employee, you can view the originating Diff in D88274329.
Summary:
I've been investigating the source of the large (10 min for 1k jobs) planning collectives. The source of the issue seems to be the GC running in the process during the all-gather deserialization of objects (https://fburl.com/code/6wdvfioi), causing spikes and an increase in deserialization time. With GC disabled, the planning collective for a 96 GPU job is 4s (https://fburl.com/scuba/pytorch_dcp_logging/elytxg2y); it's 30s without this enabled (https://fburl.com/scuba/pytorch_dcp_logging/hkn233tz). On a 256 GPU job, the deserialization of one save plan with GC enabled can take as long as 25s (https://fburl.com/mlhub/ewtu6439), when normally it should be on the order of milliseconds.
This diff, behind a JK, will disable the automatic GC, and behind another JK, won't run the manual GC at all. The cost of running manual GC is ~10s for the 96 GPU job (https://fburl.com/mlhub/q4uauv20, https://fburl.com/mlhub/j52cv5mj), so ideally we wouldn't have to run it. Based on some testing (https://fburl.com/mlhub/kcw221ly), we aren't actually leaking anything and shouldn't need to run manual GC, but adding a JK in case this is not true, so we don't OOM.
Test Plan:
unit tests
ran a job aps-ankitageorge-2a3f549ddc
Reviewed By: kevinmtang
Differential Revision: D88274329
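For readers without access to the internal links, a minimal sketch of the gating described above, assuming env-backed flags in place of the JKs; `DCP_DISABLE_AUTOMATIC_GC` is the variable mentioned later in this thread, while the manual-GC flag name is made up. This is not the actual DCP implementation.

```python
import gc
import logging
import os
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)

# Hypothetical env-backed flags standing in for the two JKs described above.
DISABLE_AUTOMATIC_GC = os.environ.get("DCP_DISABLE_AUTOMATIC_GC", "0") == "1"
DISABLE_MANUAL_GC = os.environ.get("DCP_DISABLE_MANUAL_GC", "0") == "1"


@contextmanager
def gc_controlled_region():
    """Turn off automatic GC around plan deserialization so collection pauses
    don't land in the middle of the all-gather, then optionally run one
    explicit collection afterwards."""
    if DISABLE_AUTOMATIC_GC:
        gc.disable()
    try:
        yield
    finally:
        if DISABLE_AUTOMATIC_GC:
            if not DISABLE_MANUAL_GC:
                start = time.monotonic()
                # gc.collect() returns the number of unreachable objects found.
                collected = gc.collect()
                logger.info(
                    "manual GC found %d unreachable objects in %.2fs",
                    collected,
                    time.monotonic() - start,
                )
            gc.enable()
```

With both flags off, the context manager is a no-op, so the default behavior is unchanged.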
Force-pushed from 54a9003 to e896282
Force-pushed from e896282 to 033c578
Just use percent formatting for the logger please (passing the formatting parameters as args). Since it's info level, the string formatting can be skipped most of the time, and an f-string is not lazy.
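To illustrate the point (illustrative values, not code from this PR): with %-style arguments the interpolation is deferred until a handler actually emits the record, while an f-string is built up front even when INFO is filtered out.

```python
import logging

logger = logging.getLogger(__name__)
collected, elapsed = 1234, 0.57  # illustrative values

# Eager: the f-string is formatted even if INFO is disabled for this logger.
logger.info(f"manual GC collected {collected} objects in {elapsed:.2f}s")

# Lazy: the format string and args are stored on the LogRecord and only
# interpolated if the record is actually emitted.
logger.info("manual GC collected %d objects in %.2fs", collected, elapsed)
```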
Would this make sense as an enum? The combinations here are mutually exclusive and would be better parsed out as an enum / string from the env rather than falling into the Boolean trap. I.e., you would never want to enable the automatic GC but disable the manual one, right? That seems like an error!
Skylion007 left a comment
Just use percent formatting: you are only formatting numbers here, and these are info-level logs, so they can often be ignored without the value ever being printed.
So "disable manual but do not disable automatic" does not seem like a valid state. It seems like "disable none", "disable automatic", and "disable both" are the valid states here? Just have this be the case with a special string value, or even by having it be a special value of 2.
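A rough sketch of what that suggestion could look like; the enum values and env variable name are made up for illustration, not a proposed API.

```python
import os
from enum import Enum


class GcMode(Enum):
    DEFAULT = "default"                      # automatic GC on, manual GC as usual
    DISABLE_AUTOMATIC = "disable_automatic"  # automatic GC off, manual collect still runs
    DISABLE_ALL = "disable_all"              # automatic GC off and no manual collect

    # The invalid fourth boolean combination (automatic GC on but manual
    # collect skipped) simply cannot be expressed.


def gc_mode_from_env(var: str = "DCP_GC_MODE") -> GcMode:
    """Parse the GC mode from an environment string, defaulting to DEFAULT."""
    raw = os.environ.get(var, GcMode.DEFAULT.value).strip().lower()
    try:
        return GcMode(raw)
    except ValueError:
        valid = ", ".join(m.value for m in GcMode)
        raise ValueError(f"{var} must be one of: {valid}; got {raw!r}") from None
```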
Force-pushed from 033c578 to 0b92c78
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: pytorch#169613
Approved by: https://github.com/kevinmtang
This PR introduced a failure in torchtitan: https://github.com/pytorch/torchtitan/actions/runs/20073644006/job/57582226977 cc @ankitageorge
@wwwjn I will fix this
Summary: Adding the psutil dependency in OSS caused issues for some flows (pytorch#169613 (comment)). Removing it, as it's not really needed; the number of objects collected by GC is enough information.
Test Plan: signals
Differential Revision: D88772827
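As a hedged sketch of that substitution (not the actual asyncprocessexecutor code): the return value of gc.collect() already carries the useful signal, so a memory read such as psutil.Process().memory_info().rss can simply go away.

```python
import gc
import logging

logger = logging.getLogger(__name__)


def collect_and_log() -> int:
    # Previously something like psutil.Process().memory_info().rss might have
    # been logged alongside this; the object count alone is enough signal and
    # avoids the extra OSS dependency.
    collected = gc.collect()  # number of unreachable objects found
    logger.info("gc.collect() found %d unreachable objects", collected)
    return collected
```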
[dcp] remove psutil dependency in asyncprocessexecutor for oss (#169985)
Pull Request resolved: #169985
Approved by: https://github.com/fegin, https://github.com/xmfan
(cherry picked from commit 766882a)
Co-authored-by: Ankita George <ankitageorge@meta.com>
@ankitageorge, this seems to cause a nasty memory bubble. See the linked PR |
@Skylion007 This PR shouldn't be the cause of this issue. The change is behind the DCP_DISABLE_AUTOMATIC_GC (self.disable_automatic_gc) env variable being set to True, but the default is False. So this shouldn't have changed anything for open source users unless they explicitly set this environment variable.
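For illustration, default-off gating along these lines leaves behavior unchanged unless the variable is explicitly set; the parsing helper is hypothetical, and only the variable name comes from the comment above.

```python
import os


def _env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean flag from the environment; unset means `default`."""
    val = os.environ.get(name)
    if val is None:
        return default
    return val.strip().lower() in ("1", "true", "yes")


# Defaults to False, so open source users see no change unless they opt in.
disable_automatic_gc = _env_flag("DCP_DISABLE_AUTOMATIC_GC")
```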