
Disable GC in process based async checkpointing #169613

Closed

ankitageorge wants to merge 1 commit into pytorch:main from ankitageorge:export-D88274329

Conversation

@ankitageorge
Contributor

Summary:
I've been investigating the source of the large planning collectives (10 min for 1k jobs). The issue seems to be the GC running in the process during the all-gather deserialization of objects (https://fburl.com/code/6wdvfioi), causing spikes and an increase in deserialization time. With GC disabled, the planning collective for a 96-GPU job takes 4s (https://fburl.com/scuba/pytorch_dcp_logging/elytxg2y), versus 30s with GC enabled (https://fburl.com/scuba/pytorch_dcp_logging/hkn233tz). On a 256-GPU job, deserializing a single save plan with GC enabled can take as long as 25s (https://fburl.com/mlhub/ewtu6439), when it should normally be on the order of milliseconds.

This diff disables automatic GC behind a JK and, behind another JK, skips the manual GC entirely. Running the manual GC costs ~10s for the 96-GPU job (https://fburl.com/mlhub/q4uauv20, https://fburl.com/mlhub/j52cv5mj), so ideally we wouldn't run it at all. Based on some testing (https://fburl.com/mlhub/kcw221ly), we aren't actually leaking anything and shouldn't need the manual GC, but the JK is there in case that turns out not to be true, so we don't OOM.
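For illustration, here is a minimal sketch of the disable-automatic-GC / optional-manual-GC pattern described above; the function and flag names are made up and this is not the actual DCP code:

```python
import gc
import logging
import pickle

logger = logging.getLogger(__name__)


def deserialize_gathered_plans(payloads, disable_automatic_gc=True, run_manual_gc=False):
    """Deserialize all-gathered plan payloads with automatic GC optionally disabled."""
    if disable_automatic_gc:
        # Keep the cyclic collector from firing mid-deserialization, which is
        # where the latency spikes described above show up.
        gc.disable()
    try:
        plans = [pickle.loads(p) for p in payloads]
    finally:
        if disable_automatic_gc:
            gc.enable()
        if run_manual_gc:
            # Optional manual pass at a controlled point; kept behind its own
            # flag in case skipping it would lead to OOMs.
            collected = gc.collect()
            logger.info("manual gc collected %d objects", collected)
    return plans
```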

Test Plan:
unit tests
ran a job aps-ankitageorge-2a3f549ddc

Reviewed By: kevinmtang

Differential Revision: D88274329

@pytorch-bot

pytorch-bot bot commented Dec 4, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169613

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 0b92c78 with merge base 5a7a65a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-codesync

meta-codesync bot commented Dec 4, 2025

@ankitageorge has exported this pull request. If you are a Meta employee, you can view the originating Diff in D88274329.

ankitageorge added a commit to ankitageorge/pytorch that referenced this pull request Dec 4, 2025
ankitageorge force-pushed the export-D88274329 branch 2 times, most recently from 54a9003 to e896282, on December 4, 2025 22:53
ankitageorge added a commit to ankitageorge/pytorch that referenced this pull request Dec 4, 2025
ankitageorge added a commit to ankitageorge/pytorch that referenced this pull request Dec 5, 2025
@Skylion007
Collaborator

Skylion007 commented Dec 6, 2025

Just use percent formatting for the logger please (passing the formatting parameters as args). Since these are info-level logs, the string formatting can usually be skipped, and an f-string is not lazy.
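A minimal illustration of the lazy %-style logging being requested (the variable name is made up):

```python
import logging

logger = logging.getLogger(__name__)
collected = 12345

# Eager: the f-string is built even if no handler ever emits INFO records.
logger.info(f"manual gc collected {collected} objects")

# Lazy: the format string and args are only combined if the record is emitted.
logger.info("manual gc collected %d objects", collected)
```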

pytorch-bot bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) on Dec 8, 2025
@Skylion007
Collaborator

Would this make sense as an enum? The combinations seem mutually exclusive, so it would be better to parse an enum / string from the env rather than fall into the Boolean trap. I.e., you would never want to keep automatic GC enabled but disable the manual one, right? That seems like an error!

@Skylion007 (Collaborator) left a comment

Just use percent formatting: you are only formatting numbers here, and these are info-level logs that are often ignored, so the values don't need to be formatted eagerly.

@Skylion007
Collaborator

So "disable manual but keep automatic enabled" doesn't seem like a valid state. It seems like the valid states here are: disable none, disable automatic only, or disable both? Just have this be a special string value, or even a special value of 2.
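A sketch of the enum-from-env idea being suggested here; the mode names and the DCP_GC_MODE variable are hypothetical:

```python
import os
from enum import Enum


class GcMode(Enum):
    DEFAULT = "default"                      # leave automatic GC on, run manual GC
    DISABLE_AUTOMATIC = "disable_automatic"  # disable automatic GC, keep manual GC
    DISABLE_ALL = "disable_all"              # disable automatic GC and skip manual GC


def gc_mode_from_env(var: str = "DCP_GC_MODE") -> GcMode:
    # Unset or unrecognized values fall back to the default behavior, which
    # avoids the invalid "manual off, automatic on" combination entirely.
    raw = os.environ.get(var, GcMode.DEFAULT.value).strip().lower()
    try:
        return GcMode(raw)
    except ValueError:
        return GcMode.DEFAULT
```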

@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

liangxs pushed a commit to liangxs/pytorch that referenced this pull request Dec 9, 2025

Pull Request resolved: pytorch#169613
Approved by: https://github.com/kevinmtang
@wwwjn

wwwjn commented Dec 9, 2025

This PR introduces psutil as a dependency, but it looks like psutil is not required or listed as a requirement when installing the PyTorch nightly, so the Torchtitan CI is broken because of ModuleNotFoundError: No module named 'psutil'

https://github.com/pytorch/torchtitan/actions/runs/20073644006/job/57582226977 cc @ankitageorge

@ankitageorge
Contributor Author

@wwwjn I will fix this

ankitageorge added a commit to ankitageorge/pytorch that referenced this pull request Dec 9, 2025
Summary: Adding the psutil dependency in OSS caused issues for some flows (pytorch#169613 (comment)). Removing it, as it's not really needed; the number of objects collected by GC is enough information.

Test Plan: signals

Differential Revision: D88772827
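For context, a minimal sketch of reporting GC activity without psutil, along the lines of what the commit message describes (names are illustrative, not the actual DCP code):

```python
import gc
import logging

logger = logging.getLogger(__name__)


def run_manual_gc_and_log() -> int:
    # gc.collect() already reports how many unreachable objects were found,
    # so no extra dependency is needed just to log what the manual pass did.
    collected = gc.collect()
    logger.info("gc.collect() reclaimed %d objects; per-generation stats: %s",
                collected, gc.get_stats())
    return collected
```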
skpark-rh pushed a commit to skpark-rh/pytorch that referenced this pull request Dec 10, 2025
pytorchmergebot pushed a commit that referenced this pull request Dec 10, 2025

Pull Request resolved: #169985
Approved by: https://github.com/fegin, https://github.com/xmfan
pytorchbot pushed a commit that referenced this pull request Dec 10, 2025
(cherry picked from commit 766882a)
atalman pushed a commit that referenced this pull request Dec 10, 2025
[dcp] remove psutil dependency in asyncprocessexecutor for oss (#169985)

Co-authored-by: Ankita George <ankitageorge@meta.com>
daisyden pushed a commit to daisyden/pytorch that referenced this pull request Dec 11, 2025
@Skylion007
Collaborator

@ankitageorge, this seems to cause a nasty memory bubble. See the linked PR

@ankitageorge
Contributor Author

@Skylion007 This PR shouldn't be the cause of that issue. The change is behind the DCP_DISABLE_AUTOMATIC_GC (self.disable_automatic_gc) env variable being set to True, and the default is False. So this shouldn't have changed anything for open-source users unless they explicitly set that environment variable.
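For reference, a sketch of how an opt-in env flag like this typically behaves; the parsing helper is hypothetical, and only the variable name comes from the comment above:

```python
import os


def _env_flag(name: str) -> bool:
    # Hypothetical helper: the flag is False unless explicitly set to a truthy value.
    return os.environ.get(name, "0").strip().lower() in ("1", "true", "yes")


# Defaults to False, so open-source users keep stock GC behavior unless they opt in.
disable_automatic_gc = _env_flag("DCP_DISABLE_AUTOMATIC_GC")
```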

