
Bugfix to forward autodiff causing different datatype 2 #165784

Closed

skpark-rh wants to merge 65 commits into pytorch:main from skpark-rh:bugfix/dtype_foward_agrad


Conversation

@skpark-rh (Collaborator) commented Oct 17, 2025

Fixes #160513

The Problem Summary

The issue boiled down to dtype promotion logic. The code base has two different functions that handle dtype promotion. For purely multi-dimensional tensor operations, the C++ path is taken, and it follows NumPy's dtype promotion rules; that is why N-dim tensors behave correctly in #160513, since their dtypes take precedence. The problem arises with Python scalars and 0-dim tensors: when "scalars" are detected, a Python implementation of the promotion logic runs instead (torch/_prims_common/__init__.py:1544). Because that code is in Python, it cannot distinguish a wrapped number (a Python scalar wrapped into a tensor) from a genuine 0-dim tensor, so it simply takes the highest dtype, which is the double dtype of the wrapped Python number.
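
For context, here is a minimal sketch of the kind of behavior reported in #160513 (the exact values and trigger conditions are illustrative): multiplying a float32 dual number by a Python float should leave the tangent dtype at float32, but the buggy promotion path could promote it to float64.

```python
import torch
import torch.autograd.forward_ad as fwAD

primal = torch.tensor(2.0, dtype=torch.float32)   # 0-dim tensor
tangent = torch.tensor(1.0, dtype=torch.float32)

with fwAD.dual_level():
    dual = fwAD.make_dual(primal, tangent)
    out = dual * 3.14                              # Python float, i.e. a "wrapped number"
    _, out_tangent = fwAD.unpack_dual(out)

print(out.dtype)          # torch.float32
print(out_tangent.dtype)  # expected torch.float32; the bug could yield torch.float64
```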

The Fix

The Python implementation of dtype promotion has to know where a scalar came from; once a wrapped number can be distinguished from a real 0-dim tensor, the appropriate dtype can be chosen. The first approach was to try to expose the is_wrapped_number method, but it ran into a big issue: during forward AD, the derivatives of those scalars turn out to be ZeroTensors. A ZeroTensor internally uses a shortcut that initializes a meta-dtype tensor in order to skip expensive dispatch operations, and that copy does not carry everything over, in particular the is_wrapped_number_ property. I thought about modifying the copy, but that seemed to go against the spirit of what the copy was intended for; on top of that, the checks around is_wrapped_number_ impose dimension constraints, while a scalar ZeroTensor is a meta-dtype tensor, which complicates things.

So I chose the route of creating a new property, was_wrapped_number, and exposing it through the Python tensor API. I had to modify the autograd code generation so that was_wrapped_number gets set in the mul, add, and div operations in VariableType.cpp. With this property in place, the dtype promotion logic could be updated to treat wrapped numbers and 0-dim tensors differently, and once that hierarchy was respected the buggy behavior was fixed.
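
As a rough illustration of the ordering the fix restores (these are eager-mode semantics that the Python promotion logic is meant to mirror): a Python scalar only participates in category promotion, so it should never widen the dtype of an actual tensor operand, whether 0-dim or N-dim.

```python
import torch

ndim = torch.ones(3, dtype=torch.float32)
zerodim = torch.tensor(2.0, dtype=torch.float32)

# A Python float joins as a "wrapped number": it can bump an integer tensor
# into the floating-point category, but it must not widen float32 to float64.
print(torch.result_type(ndim, 3.14))     # torch.float32
print(torch.result_type(zerodim, 3.14))  # torch.float32
print(torch.result_type(torch.ones(3, dtype=torch.int64), 3.14))  # torch.float32
```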

I wrote a new ops testing module, TestForwardADWithScalars, since this bug was unusual enough to need a new testing paradigm. It only covers multiply, add, and divide; I chose these because the affected operations all boil down to those three. A sketch of the kind of check involved follows below.
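
A hedged sketch of that kind of check (the names and structure here are illustrative, not the actual TestForwardADWithScalars code): for each of mul, add, and div, the tangent coming out of a scalar op should keep the dual's original dtype.

```python
import torch
import torch.autograd.forward_ad as fwAD

def assert_tangent_dtype_preserved(op, dtype=torch.float32):
    primal = torch.tensor(2.0, dtype=dtype)
    tangent = torch.tensor(1.0, dtype=dtype)
    with fwAD.dual_level():
        dual = fwAD.make_dual(primal, tangent)
        out = op(dual, 3.0)                 # second operand is a Python scalar
        _, out_tangent = fwAD.unpack_dual(out)
    assert out_tangent.dtype == dtype, (op, out_tangent.dtype, dtype)

for op in (torch.mul, torch.add, torch.div):
    assert_tangent_dtype_preserved(op)
```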

[edit]: The final approach just uses the efficientzerotensor meta tensor and converts it back to a Python number. Since the wrapped number is converted back to a Python number, dtype promotion is preserved. This works by flagging the forward-grad zero tensor of a wrapped number as itself a wrapped number, since the tangent of a wrapped number should still be a wrapped number. That specific ZeroTensor is then sent through as a meta-dtype tensor in BinaryOps.cpp to get the appropriate dtype for the resulting arithmetic.
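
For readers unfamiliar with ZeroTensors, a quick illustration of the concept referenced above, using private APIs whose exact signatures may vary across PyTorch versions:

```python
import torch

# A ZeroTensor is a lazily-materialized all-zeros tensor that autograd uses,
# e.g. as the forward-mode tangent of an operand that has no tangent of its
# own (such as a plain Python scalar).
z = torch._efficientzerotensor((), dtype=torch.float32)
print(z._is_zerotensor())  # True
print(z.dtype)             # torch.float32
print(z.dim())             # 0
```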

@ezyang @OihanJoyot

cc @EikanWang @jgong5 @wenzhe-nrv @sanchitintel

… a "is_wrapped_number" as true if the derived derivated is also a wrapped number.
…numbers. Then using the correct dtype promotions on the python side.
… a "is_wrapped_number" as true if the derived derivated is also a wrapped number.
…numbers. Then using the correct dtype promotions on the python side.
… a "is_wrapped_number" as true if the derived derivated is also a wrapped number.
@skpark-rh requested a review from ezyang on November 3, 2025 20:58
@skpark-rh (Collaborator, Author) commented Nov 3, 2025

I was able to implement the changes requested. Let me know if I need to change something else. Thanks!

TORCH_INTERNAL_ASSERT(tensor.device().is_cpu());
if (tensor.unsafeGetTensorImpl()->is_wrapped_number() ||
(tensor._is_zerotensor() &&
tensor.unsafeGetTensorImpl()->is_wrapped_number() &&
Contributor:

This branch of the OR is pointless, since if it's true, the left-hand side is always true too.

Collaborator Author:

Yeah, that's true...

return Tensor();
}

void update_wrapped_number(Tensor& input, Tensor& output) {
Contributor:

severe parameter ordering blindness here lol

Collaborator Author:

Yeah... for sure. T_T

@ezyang (Contributor) left a comment:

looks good if ci passes

@skpark-rh (Collaborator, Author):

I had build failures and some tests that failed with complex dtypes. I just pushed fixes, and they should all pass now.

@skpark-rh (Collaborator, Author):

@pytorchmergebot merge

@pytorch-bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label on Nov 4, 2025
@pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator):

Merge failed

Reason: 2 jobs have failed, first few of them are: trunk / win-vs2022-cpu-py3 / build, trunk / win-vs2022-cuda12.8-py3 / build

Details for Dev Infra team (raised by workflow job)

@skpark-rh (Collaborator, Author) commented Nov 5, 2025

Two Windows builds are failing with these errors: rm: cannot remove './build/win_tmp/bin': Device or resource busy. This doesn't seem to be an issue with the PR, though.

[edit]: Found a nonstandard dtype, u_int64_t, that was causing compile issues on the Windows builds.

@pytorch-bot removed the ciflow/trunk (Trigger trunk jobs on your pull request) label on Nov 5, 2025
@skpark-rh (Collaborator, Author):

@pytorchmergebot merge

@pytorch-bot commented Nov 5, 2025

The pull workflow has not been scheduled for this PR yet. This could be because the author doesn't have permission to run those workflows, or because skip-checks keywords were added to the PR/commits; aborting the merge. Please get/give approval for the workflows and/or remove skip ci decorators before the next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

@skpark-rh (Collaborator, Author):

@pytorchmergebot merge

@pytorch-bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label on Nov 5, 2025
@pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@skpark-rh deleted the bugfix/dtype_foward_agrad branch on November 6, 2025 13:38
drizzlezyk pushed a commit to Ascend/pytorch that referenced this pull request Nov 17, 2025
…calars)

Co-authored-by: dilililiwhy <why.wuhuanyu@huawei.com>



# message auto-generated for no-merge-commit merge:
!26081 merge main_sync_20251028 into master

TORCH MAIN SYNC : add update_wrapped_number (bugfix to ForwardADWithScalars)

Created-by: dilililiwhy
Commit-by: dilililiwhy
Merged-by: ascend-robot
**What does this PR do / why do we need it**:
2.10.0.dev20251110


**Special notes for your reviewers**:
pytorch/pytorch#160513
pytorch/pytorch#165784

pytorch/pytorch#166657


See merge request: Ascend/pytorch!26081

Labels

ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: forward ad, oncall: jit (Add this issue/PR to JIT oncall triage queue), open source, release notes: autograd (release notes category), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)


Development

Successfully merging this pull request may close these issues.

Forward autodiff : Multiplying by python float changes the dual dtype in some situations

6 participants