
Revert #168264 + Python-side LRU cache when native op schema is not supported #168269

Closed

ezyang wants to merge 5 commits into gh/ezyang/3204/base from gh/ezyang/3204/head

Conversation

@ezyang (Contributor) commented Nov 20, 2025

Stack from ghstack (oldest at bottom):

This reverts #168264 but with a bugfix for the reason why it was reverted.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Nov 20, 2025
Authored with claude code

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
ghstack-source-id: b37844f
Pull-Request: #168269
py_op.ptr(),
args.ptr(),
kwargs.ptr());
py::object sharding = checked_vectorcall(
@ezyang (Contributor Author) commented on this diff:

review with "ignore whitespace changes" enabled

@ezyang requested review from wconstab and zpcore November 20, 2025 17:04
@wconstab (Contributor) commented:

you're working on a test case?

@pytorch-bot bot commented Nov 20, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/168269

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 41 Pending

As of commit 09b6fe1 with merge base d3ccb8f:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@ezyang (Contributor Author) commented Nov 20, 2025

> you're working on a test case?

yes. unfortunately claude didn't write a good test lol

@albanD (Collaborator) left a comment


Needs a test to make sure we don't regress again.
SGTM once added

checked_vectorcall(
sharding_propagator.attr("propagate").ptr(),
py_op_info.ptr());
cached_sharding = py_op_info.attr(dtensor_interned_strings.output_sharding);
Collaborator commented on this diff:

Should this be added to the C++ cache here to make subsequent calls faster?

@ezyang (Contributor Author) replied:

The problem is that when opt_native_op_schema fails to be returned, the C++ side just has no cache key to look up against, so this really wouldn't help (we HAVE to go through the pytree path before we can compute the cache key).
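
For context, a minimal Python sketch of why that is: without a native op schema, the only way to build a hashable cache key is from the pytree-flattened arguments, so the flattening work has to happen before any cache lookup can. Everything below (`_flatten_to_key`, `_propagate_slow`, the plain-dict cache) is illustrative, not the actual DTensor internals.

```python
import torch
from torch.utils._pytree import tree_flatten

# Illustrative stand-in for the Python-side cache; a real implementation
# would bound its size (LRU), this dict does not.
_py_cache: dict = {}

def _flatten_to_key(op_name, args, kwargs):
    # The key can only be derived from the flattened arguments: replace
    # tensors with hashable metadata so different shapes/dtypes get
    # different cache entries.
    flat, spec = tree_flatten((args, kwargs))
    meta = tuple(
        (tuple(t.shape), t.dtype) if isinstance(t, torch.Tensor) else t
        for t in flat
    )
    return (op_name, repr(spec), meta)

def _propagate_slow(op_name, args, kwargs):
    # Placeholder for the real (expensive) sharding propagation.
    return f"output sharding for {op_name}"

def propagate_with_python_cache(op_name, args, kwargs):
    # Without a native op schema there is no precomputed C++ key, so the
    # pytree walk above has to run on every call; caching only pays off
    # for whatever work comes after it.
    key = _flatten_to_key(op_name, args, kwargs)
    if key not in _py_cache:
        _py_cache[key] = _propagate_slow(op_name, args, kwargs)
    return _py_cache[key]
```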

if (opt_native_op_schema.has_value()) {

// Use Python-side LRU cache when native cache is not available
if (!opt_native_op_schema.has_value()) {
Member commented on this diff:

Note: the slow path will be triggered when the input args contain a List[Tensor] or contain symbolic shapes.
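
A hedged sketch of that condition; the predicate below paraphrases the note above and is not the actual C++ check.

```python
import torch

def native_op_schema_available(args) -> bool:
    # Hypothetical mirror of the C++ fast-path check: return False (take
    # the Python slow path) when an argument is a list of tensors or when
    # a tensor has symbolic (SymInt) sizes.
    for a in args:
        if isinstance(a, (list, tuple)) and any(isinstance(x, torch.Tensor) for x in a):
            return False
        if isinstance(a, torch.Tensor) and any(isinstance(s, torch.SymInt) for s in a.shape):
            return False
    return True
```

In practice that presumably puts tensor-list ops (e.g. the foreach ops) and dynamic-shape workloads on the Python slow path.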

[ghstack-poisoned]
@ezyang added the "topic: bug fixes" (topic category) label Nov 20, 2025
[ghstack-poisoned]
ezyang added a commit that referenced this pull request Nov 20, 2025
Authored with claude code

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
ghstack-source-id: dcebadd
Pull-Request: #168269
@ezyang added the "release notes: distributed (dtensor)" (release notes category) label Nov 20, 2025
@ezyang (Contributor Author) commented Nov 20, 2025

@pytorchbot merge

@pytorch-bot bot added the "ciflow/trunk" (Trigger trunk jobs on your pull request) label Nov 20, 2025
@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@pytorchmergebot (Collaborator) commented:

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (distributed, 3, 3, linux.rocm.gpu.gfx942.4)

Details for Dev Infra team (raised by workflow job).

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Nov 21, 2025
Authored with claude code

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
ghstack-source-id: 2748b69
Pull-Request: #168269
@ezyang (Contributor Author) commented Nov 21, 2025

The latest implementation is better and has lots of comments; I basically undid all the Claude code stuff.

kwargs.ptr(),
py_op_info.ptr());
py_op_info.ptr(),
/*try_cache*/ !opt_native_op_schema.has_value() ? Py_True : Py_False);
Contributor commented on this diff:

IIUC, 2 major changes since the last rev:

  1. we now do the "if sharding is actually a tensor, return early" bit 100% of the time that we call into python, whereas previously we only did it on the 'slow path' and not on the 'fast path'
  2. we always call the 'slow path' entrypoint back to python, and added the caching logic into it (gated by try_cache) rather than having 2 callouts to python in c++

LGTM!
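
For readers following along, a rough Python-side sketch of what that single entrypoint could look like. Only the `/*try_cache*/ !opt_native_op_schema.has_value()` call site above comes from the diff; the names `propagate_op_sharding`, `_propagate`, and `_propagate_cached` are made up for illustration.

```python
from functools import lru_cache

import torch

def _propagate(op_info):
    # Placeholder for ShardingPropagator.propagate(); for a few ops the
    # result is a plain tensor rather than an OutputSharding-like object.
    return f"output sharding for {op_info}"

@lru_cache(maxsize=1024)  # bound is arbitrary for this sketch
def _propagate_cached(op_info):
    # lru_cache needs hashable arguments, which is why the real code can
    # only take this branch once a usable cache key has been built.
    return _propagate(op_info)

def propagate_op_sharding(op_info, try_cache: bool):
    # (2) Single callout from C++; try_cache mirrors
    # `!opt_native_op_schema.has_value()` at the call site above.
    result = _propagate_cached(op_info) if try_cache else _propagate(op_info)
    if isinstance(result, torch.Tensor):
        # (1) "Sharding is actually a tensor, return early" now runs on
        # every call into Python, fast path and slow path alike.
        return result
    # Otherwise return the OutputSharding-like result for C++ to act on.
    return result
```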

@ezyang (Contributor Author) replied:

if i were a better man, i'd have written tests for these cases lol

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Nov 21, 2025
Authored with claude code

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
ghstack-source-id: 2748b69
Pull-Request: #168269
@ezyang changed the title from "Used Python-side LRU cache when native op schema is not supported" to "Revert #168264 + Python-side LRU cache when native op schema is not supported" Nov 21, 2025
@pytorch-bot bot added the "ci-no-td" (Do not run TD on this PR) label Nov 21, 2025
@ezyang (Contributor Author) commented Nov 21, 2025

@pytorchbot merge -f "rebuild not necessary, previous commit is identical"

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025
…upported (#168269)

This reverts #168264 but with a bugfix for the reason why it was reverted.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: #168269
Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/zpcore, https://github.com/malfet
tiendatngcs pushed a commit to tiendatngcs/pytorch-Dec25 that referenced this pull request Dec 10, 2025
Authored with claude code

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
ghstack-source-id: dcebadd
Pull-Request: pytorch/pytorch#168269
tiendatngcs pushed a commit to tiendatngcs/pytorch-Dec25 that referenced this pull request Dec 10, 2025
Authored with claude code

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
ghstack-source-id: 5240429
Pull-Request: pytorch/pytorch#168269
@github-actions github-actions bot deleted the gh/ezyang/3204/head branch December 22, 2025 02:20

Labels

ci-no-td (Do not run TD on this PR), ciflow/inductor, ciflow/trunk (Trigger trunk jobs on your pull request), Merged, release notes: distributed (dtensor) (release notes category), topic: bug fixes (topic category)


6 participants