Conversation

eellison commented Oct 6, 2025

Stack from ghstack (oldest at bottom):

  • Update the Memory Estimator to use node storages for analysis instead of manually inspecting operator schemas, which simplifies bookkeeping. This will also let the component be reused elsewhere (a rough sketch of the idea follows this list).

  • Factor the logic out into a separate class so that it can also be used in scheduling (node allocations / aliasing / uses).

  • Add tests for correctness; right now only on fwd/bwd individually, not both together.
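
For context, the kind of storage-based analysis this enables might look roughly like the sketch below (node_storages is an illustrative helper, not this PR's actual API; it assumes fake tensors live in each FX node's meta["val"]):

from torch._subclasses.fake_tensor import FakeTensor
from torch.multiprocessing.reductions import StorageWeakRef
from torch.utils._pytree import tree_leaves

def node_storages(node):
    # Storages backing a node's fake-tensor outputs. Nodes whose outputs
    # share a storage alias one another, so allocation / aliasing / use
    # bookkeeping becomes set operations over storages rather than
    # per-operator schema inspection.
    return {
        StorageWeakRef(t.untyped_storage())
        for t in tree_leaves(node.meta.get("val"))
        if isinstance(t, FakeTensor)
    }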

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

pytorch-bot bot commented Oct 6, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/164783

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 81858a8 with merge base b5e93ff:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

eellison added a commit that referenced this pull request Oct 6, 2025
eellison added a commit that referenced this pull request Oct 6, 2025
eellison added the topic: not user facing label Oct 6, 2025
eellison requested a review from ruisizhang123 October 6, 2025 21:24
eellison added a commit that referenced this pull request Oct 6, 2025
eellison added a commit that referenced this pull request Oct 6, 2025
eellison requested a review from xuanzhang816 October 6, 2025 21:56
ruisizhang123 left a comment


LGTM, thanks!

# Identify storages allocated by primal placeholder nodes
primal_storages: OrderedSet[StorageKey] = OrderedSet()
for node in fwd_graph.find_nodes(op="placeholder"):
    if node.name.startswith("primals"):

Curious: is it possible to check the input node names (which are primals) instead of hard-coding the "primals" prefix in node.name here? I'm not sure whether this hard-coding would be fragile to changes in the graph inputs.
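
One possible alternative, as a hedged sketch only (num_primals is assumed to come from the AOT/joint-graph signature; this is not code from the PR):

# Identify primal placeholders positionally rather than by name prefix.
# find_nodes(op="placeholder") yields placeholders in graph order, which
# matches input order; num_primals is an assumed, externally known count.
placeholders = fwd_graph.find_nodes(op="placeholder")
primal_nodes = placeholders[:num_primals]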

eellison added a commit that referenced this pull request Oct 7, 2025
eellison added the ciflow/trunk label Oct 7, 2025
eellison commented Oct 7, 2025

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

eellison commented Oct 7, 2025

cc @karthickai could be relevant for you

@pytorchmergebot

Merge failed

Reason: 1 jobs have failed, first few of them are: inductor / inductor-test / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu)

Details for Dev Infra team. Raised by workflow job.

eellison commented Oct 7, 2025

@pytorchbot merge -i

@pytorchmergebot

Merge started

Your change will be merged while ignoring the following 1 checks: inductor / inductor-test / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

eellison commented Oct 8, 2025

@pytorchbot merge -i

@pytorchmergebot

Merge started

Your change will be merged while ignoring the following 1 checks: pull / linux-jammy-py3.13-clang12 / test (dynamo_wrapped, 3, 3, linux.2xlarge)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Oct 8, 2025
Respect max_coll_distance from the overlap scheduler in bucketing; also add an optimization to path searching.

Pull Request resolved: #164944
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #164738, #164783
pytorchmergebot pushed a commit that referenced this pull request Oct 8, 2025
eellison added a commit to eellison/pytorch that referenced this pull request Oct 11, 2025
pytorchmergebot pushed a commit that referenced this pull request Oct 15, 2025
Add a Memory Tracker utility, which tracks live memory given an alternate ordering of nodes (a rough sketch of the idea follows this commit message).

Pull Request resolved: #165059
Approved by: https://github.com/ezyang, https://github.com/IvanKobzarev
ghstack dependencies: #164738, #164783, #164944, #164945
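
As a rough illustration of what live-memory tracking over a node ordering involves (a minimal sketch; peak_memory, node_bytes, and last_use are hypothetical names, not the utility's actual API):

def peak_memory(order, node_bytes, last_use):
    # Walk nodes in the proposed order, growing live memory at each
    # allocation and releasing a node's bytes once its last consumer
    # has run. Returns the peak of the running total.
    live = peak = 0
    pending_frees = {}  # step index -> bytes that die after that step
    for i, node in enumerate(order):
        live += node_bytes[node]
        peak = max(peak, live)
        j = last_use[node]  # index of this output's final consumer
        pending_frees[j] = pending_frees.get(j, 0) + node_bytes[node]
        live -= pending_frees.pop(i, 0)
    return peak
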
pull bot pushed a commit to AmirulAndalib/pytorch that referenced this pull request Oct 15, 2025
Bucketing: a number of smallish improvements.

- Account for bucketing in the overlap calculation: if an in-flight collective exists with the same bucket key, reduce the new collective's estimated time by its latency (a rough sketch follows this commit message).
- Update compute domination so that ordering is based on compute index, as opposed to compute depth, so we never reorder compute. This makes it a bit easier to reason about memory and pre-fetching, although we can explore reordering in the future.
- When we wait on a collective, force all collectives that were enqueued earlier on the same process group to wait as well.

Better Memory Handling:
- Pre-fetch limiting: when scheduling collectives for overlap, only pre-fetch up to a certain distance, then schedule off-path collectives (which typically reduce memory).
- When we are above peak memory, schedule waits.

TODO:
- For each compute node, we know its original memory in the graph; we could limit pre-fetching that crosses peak memory.
- Scheduling off-path collectives for overlap reduces memory, but if there isn't enough compute to overlap with, we need to schedule them proactively. Not an issue yet on the examples.
- Make some hard-coded constants configurable and clean up enablement (can be done in a subsequent PR).

On small Llama 2D backward:
578 of 618 potentially hideable collectives hidden
original mem 14.4 GB, rescheduled mem 15.9 GB

On forward:
254/256 potentially hideable collectives hidden
original mem 5.8 GB, rescheduled mem 5.8 GB

WIP: adding tests

Pull Request resolved: pytorch#165318
Approved by: https://github.com/ezyang, https://github.com/IvanKobzarev
ghstack dependencies: pytorch#164738, pytorch#164783, pytorch#164944, pytorch#164945, pytorch#165059
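
A rough sketch of the bucket-key latency discount from the first bullet above (a minimal illustration under assumed names; estimated_time_s, bucket_key, and latency_s are hypothetical fields, not the scheduler's real API):

def estimated_comm_time(new_coll, in_flight, latency_s):
    # If an in-flight collective shares a bucket key with the new one,
    # bucketing would merge them into a single launch, so the new
    # collective effectively avoids paying its own latency term again.
    est = new_coll.estimated_time_s
    if any(c.bucket_key == new_coll.bucket_key for c in in_flight):
        est = max(est - latency_s, 0.0)
    return est
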
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
eellison added a commit to eellison/pytorch that referenced this pull request Oct 26, 2025
github-actions bot deleted the gh/eellison/840/head branch November 8, 2025 02:11
