Skip to content

The Nested Pool#168382

Closed
eee4017 wants to merge 4 commits intopytorch:mainfrom
eee4017:fix-nesting-memory-pool-fast
Closed

The Nested Pool#168382
eee4017 wants to merge 4 commits intopytorch:mainfrom
eee4017:fix-nesting-memory-pool-fast

Conversation

@eee4017
Copy link
Collaborator

@eee4017 eee4017 commented Nov 21, 2025

This PR fixes issue #161193 by simply reversing the iteration order over captures_underway.
After discussing with @galv, we decided to land this minimal fix first to unblock nested MemPool usage.

Long-term, the underlying infrastructure (e.g., captures_underway) still needs refactoring to clearly define the interaction between graph capture, MemPools, and threads. That broader cleanup will be addressed in #168137.

cc @eqy @syed-ahmed @ngimel @galv

@pytorch-bot
Copy link

pytorch-bot bot commented Nov 21, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/168382

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (7 Unrelated Failures)

As of commit 30afe51 with merge base 44ac693 (image):

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Copy link
Collaborator

@ngimel ngimel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@ngimel
Copy link
Collaborator

ngimel commented Nov 22, 2025

Do we even need a broader cleanup after this? What are we trying to achieve?

Copy link
Collaborator

@galv galv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@eee4017
Copy link
Collaborator Author

eee4017 commented Nov 22, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 22, 2025
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / before-test / llm-retrieval

Details for Dev Infra team Raised by workflow job

@pytorch-bot pytorch-bot bot removed the ciflow/trunk Trigger trunk jobs on your pull request label Nov 24, 2025
@eqy
Copy link
Collaborator

eqy commented Nov 24, 2025

@pytorchmergebot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 24, 2025
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-cuda12.8-py3.10-gcc11 / test (default, 5, 5, linux.g6.4xlarge.experimental.nvidia.gpu)

Details for Dev Infra team Raised by workflow job

@ngimel
Copy link
Collaborator

ngimel commented Nov 25, 2025

Test failure looks related

@eee4017 eee4017 force-pushed the fix-nesting-memory-pool-fast branch from b9b99a2 to 9d1b2da Compare November 25, 2025 03:24
@pytorch-bot pytorch-bot bot removed the ciflow/trunk Trigger trunk jobs on your pull request label Nov 25, 2025
@pytorch-bot
Copy link

pytorch-bot bot commented Nov 25, 2025

Pull workflow has not been scheduled for the PR yet. It could be because author doesn't have permissions to run those or skip-checks keywords were added to PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

@eee4017
Copy link
Collaborator Author

eee4017 commented Nov 25, 2025

@BoyuanFeng is TestSAC.test_graph_partition_cudagraphs_aot_eager_compat_equal still a flaky test?
https://github.com/pytorch/pytorch/actions/runs/19652565263/job/56284617134

See #167274 and #167766

@ngimel
Copy link
Collaborator

ngimel commented Nov 25, 2025

If the test is flaky we can merge once actions start running again

@ngimel
Copy link
Collaborator

ngimel commented Dec 2, 2025

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased fix-nesting-memory-pool-fast onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fix-nesting-memory-pool-fast && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the fix-nesting-memory-pool-fast branch from 9d1b2da to 30afe51 Compare December 2, 2025 21:50
@ngimel
Copy link
Collaborator

ngimel commented Dec 2, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 2, 2025
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (default, 4, 6, linux.rocm.gpu.gfx942.1)

Details for Dev Infra team Raised by workflow job

@eqy
Copy link
Collaborator

eqy commented Dec 3, 2025

I think we should be good? remaining failures look like rocm trunk failures

@ngimel
Copy link
Collaborator

ngimel commented Dec 3, 2025

@pytorchbot merge -i

JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025
This PR fixes issue #161193 by simply reversing the iteration order over captures_underway.
After discussing with @galv, we decided to land this minimal fix first to unblock nested MemPool usage.

Long-term, the underlying infrastructure (e.g., captures_underway) still needs refactoring to clearly define the interaction between graph capture, MemPools, and threads. That broader cleanup will be addressed in #168137.

Pull Request resolved: #168382
Approved by: https://github.com/eqy, https://github.com/ngimel, https://github.com/galv
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants