Disable GC in process based async checkpointing #169613
ankitageorge wants to merge 1 commit into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169613
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 0b92c78 with merge base 5a7a65a.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@ankitageorge has exported this pull request. If you are a Meta employee, you can view the originating Diff in D88274329.
Summary:
I've been investigating the source of the large (10 min for 1k jobs) planning collectives. The source of the issue seems to be the GC running in the process during the all-gather deserialization of objects (https://fburl.com/code/6wdvfioi), causing spikes and an increase in deserialization time. With GC disabled, the planning collective for a 96 GPU job is 4s (https://fburl.com/scuba/pytorch_dcp_logging/elytxg2y); it's 30s without this enabled (https://fburl.com/scuba/pytorch_dcp_logging/hkn233tz). On a 256 GPU job, the deserialization of one save plan with GC enabled can take as long as 25s (https://fburl.com/mlhub/ewtu6439), when normally it should be on the order of milliseconds.
This diff, behind a JK, will disable the automatic GC, and behind another JK, won't run the manual GC at all. The cost of running manual GC is ~10s for the 96 GPU job (https://fburl.com/mlhub/q4uauv20, https://fburl.com/mlhub/j52cv5mj), so ideally we wouldn't have to run it. Based on some testing (https://fburl.com/mlhub/kcw221ly), we aren't actually leaking anything and shouldn't need to run manual GC, but adding a JK in case this is not true, so we don't OOM.
Test Plan:
unit tests
ran a job aps-ankitageorge-2a3f549ddc
Reviewed By: kevinmtang
Differential Revision: D88274329
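For readers without access to the internal links, a minimal sketch of the gating described above, assuming env-backed flags in place of the JKs; `DCP_DISABLE_AUTOMATIC_GC` is the variable mentioned later in this thread, while the manual-GC flag name is made up. This is not the actual DCP implementation.

```python
import gc
import logging
import os
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)

# Hypothetical env-backed flags standing in for the two JKs described above.
DISABLE_AUTOMATIC_GC = os.environ.get("DCP_DISABLE_AUTOMATIC_GC", "0") == "1"
DISABLE_MANUAL_GC = os.environ.get("DCP_DISABLE_MANUAL_GC", "0") == "1"


@contextmanager
def gc_controlled_region():
    """Turn off automatic GC around plan deserialization so collection pauses
    don't land in the middle of the all-gather, then optionally run one
    explicit collection afterwards."""
    if DISABLE_AUTOMATIC_GC:
        gc.disable()
    try:
        yield
    finally:
        if DISABLE_AUTOMATIC_GC:
            if not DISABLE_MANUAL_GC:
                start = time.monotonic()
                # gc.collect() returns the number of unreachable objects found.
                collected = gc.collect()
                logger.info(
                    "manual GC found %d unreachable objects in %.2fs",
                    collected,
                    time.monotonic() - start,
                )
            gc.enable()
```

With both flags off, the context manager is a no-op, so the default behavior is unchanged.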
Force-pushed from 54a9003 to e896282
Force-pushed from e896282 to 033c578
Just use percent formatting for the logger please (passing the formatting parameters as args). Since it's info level, the string formatting can be skipped most of the time, and an f-string is not lazy.
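To illustrate the point (illustrative values, not code from this PR): with %-style arguments the interpolation is deferred until a handler actually emits the record, while an f-string is built up front even when INFO is filtered out.

```python
import logging

logger = logging.getLogger(__name__)
collected, elapsed = 1234, 0.57  # illustrative values

# Eager: the f-string is formatted even if INFO is disabled for this logger.
logger.info(f"manual GC collected {collected} objects in {elapsed:.2f}s")

# Lazy: the format string and args are stored on the LogRecord and only
# interpolated if the record is actually emitted.
logger.info("manual GC collected %d objects in %.2fs", collected, elapsed)
```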
Would this make sense as an enum? The combinations here are mutually exclusive and would be better parsed out as an enum / string from the env rather than falling into the Boolean trap. I.e., you would never want to enable the automatic GC but disable the manual one, right? That seems like an error!
Skylion007 left a comment
Just use percent formatting: you are only formatting numbers here, and these are info-level logs, so they can often be ignored without the value ever being printed.
So "disable manual but do not disable automatic" does not seem like a valid state. It seems like "disable none", "disable automatic", and "disable both" are the valid states here? Just have this be the case with a special string value, or even by having it be a special value of 2.
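A rough sketch of what that suggestion could look like; the enum values and env variable name are made up for illustration, not a proposed API.

```python
import os
from enum import Enum


class GcMode(Enum):
    DEFAULT = "default"                      # automatic GC on, manual GC as usual
    DISABLE_AUTOMATIC = "disable_automatic"  # automatic GC off, manual collect still runs
    DISABLE_ALL = "disable_all"              # automatic GC off and no manual collect

    # The invalid fourth boolean combination (automatic GC on but manual
    # collect skipped) simply cannot be expressed.


def gc_mode_from_env(var: str = "DCP_GC_MODE") -> GcMode:
    """Parse the GC mode from an environment string, defaulting to DEFAULT."""
    raw = os.environ.get(var, GcMode.DEFAULT.value).strip().lower()
    try:
        return GcMode(raw)
    except ValueError:
        valid = ", ".join(m.value for m in GcMode)
        raise ValueError(f"{var} must be one of: {valid}; got {raw!r}") from None
```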
Force-pushed from 033c578 to 0b92c78
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: pytorch#169613
Approved by: https://github.com/kevinmtang
This PR introduced a failure in torchtitan: https://github.com/pytorch/torchtitan/actions/runs/20073644006/job/57582226977 cc @ankitageorge
@wwwjn I will fix this
Summary: Adding the psutil dependency in OSS caused issues for some flows (pytorch#169613 (comment)). Removing it, as it's not really needed; the number of objects collected by GC is enough information.
Test Plan: signals
Differential Revision: D88772827
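As a hedged sketch of that substitution (not the actual asyncprocessexecutor code): the return value of gc.collect() already carries the useful signal, so a memory read such as psutil.Process().memory_info().rss can simply go away.

```python
import gc
import logging

logger = logging.getLogger(__name__)


def collect_and_log() -> int:
    # Previously something like psutil.Process().memory_info().rss might have
    # been logged alongside this; the object count alone is enough signal and
    # avoids the extra OSS dependency.
    collected = gc.collect()  # number of unreachable objects found
    logger.info("gc.collect() found %d unreachable objects", collected)
    return collected
```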
[dcp] remove psutil dependency in asyncprocessexecutor for oss (#169985)
Pull Request resolved: #169985
Approved by: https://github.com/fegin, https://github.com/xmfan
(cherry picked from commit 766882a)
Co-authored-by: Ankita George <ankitageorge@meta.com>
@ankitageorge, this seems to cause a nasty memory bubble. See the linked PR |
@Skylion007 This PR shouldn't be the cause of this issue. The change is behind the DCP_DISABLE_AUTOMATIC_GC (self.disable_automatic_gc) env variable being set to True, but the default is False. So this shouldn't have changed anything for open source users unless they explicitly set this environment variable.
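For illustration, default-off gating along these lines leaves behavior unchanged unless the variable is explicitly set; the parsing helper is hypothetical, and only the variable name comes from the comment above.

```python
import os


def _env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean flag from the environment; unset means `default`."""
    val = os.environ.get(name)
    if val is None:
        return default
    return val.strip().lower() in ("1", "true", "yes")


# Defaults to False, so open source users see no change unless they opt in.
disable_automatic_gc = _env_flag("DCP_DISABLE_AUTOMATIC_GC")
```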