
Fix NaN gradients in atan2_backward when both inputs are zero#166787

Closed
pushkar-hue wants to merge 5 commits into pytorch:main from pushkar-hue:fix-atan2-anomalies

Conversation

@pushkar-hue
Contributor

@pushkar-hue pushkar-hue commented Nov 1, 2025

Fixes #165427

Description of Bug 🐛

As reported in #165427, when both inputs of the atan2 function are zero, the gradient becomes NaN. The forward pass of atan2 avoids the division-by-zero issue, but during backpropagation the gradients become NaN.

This is because the backward pass calculates (self * self + other * other).reciprocal(), which becomes inf at (0, 0). The subsequent multiplication by zero (0 * inf) results in NaN.
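
For reference, a minimal Python repro of the reported behaviour (a sketch; assumes a build without this fix):

  import torch

  x = torch.zeros((), requires_grad=True)
  y = torch.zeros((), requires_grad=True)

  out = torch.atan2(x, y)  # forward pass is fine: atan2(0, 0) == 0
  out.backward()
  print(x.grad, y.grad)    # without the fix: tensor(nan) tensor(nan); with it: tensor(0.) tensor(0.)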

Changes

  • Added an at::where condition to handle zero denominators in atan2_backward.
  • If denom is zero, return 0 for the reciprocal; otherwise, use the original value.

Testing

  • Added test_atan2_zero_gradient in test/test_autograd.py to verify atan2 returns 0.0 gradients at (0, 0); a rough sketch of the check follows.
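
A rough sketch of what the test asserts (not the exact test body from test/test_autograd.py):

  import torch

  def check_atan2_zero_gradient():
      a = torch.zeros(3, requires_grad=True)
      b = torch.zeros(3, requires_grad=True)
      torch.atan2(a, b).sum().backward()
      assert torch.equal(a.grad, torch.zeros(3))
      assert torch.equal(b.grad, torch.zeros(3))

  check_atan2_zero_gradient()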

cc: @soulitzer

@pytorch-bot

pytorch-bot bot commented Nov 1, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166787

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit d3b3829 with merge base c5d91d9:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pushkar-hue
Contributor Author

@pytorchbot label "release notes: autograd"

@pytorch-bot pytorch-bot bot added the release notes: autograd label Nov 1, 2025
-  auto recip = (self * self + other * other).reciprocal();
+  auto denom = self * self + other * other;
+  auto recip = denom.reciprocal();
+  recip = at::where(denom == 0, at::zeros_like(recip), recip);
Collaborator

@Skylion007 Skylion007 Nov 1, 2025

In this case, I think the 2nd arg can be a scalar or a scalar Tensor. No need to allocate a large zeros matrix.

Also wondering whether there is a way to make this where update in-place. Isn't it the same as recip[denom == 0] = 0 in Python shorthand? There should be a way to do a selective scalar assignment explicitly in C++ if the shorthand doesn't work. An in-place where_ is also an option.
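
For illustration, a Python-level sketch of the equivalence being discussed (the actual change lives in the C++ backward formula):

  import torch

  denom = torch.tensor([0.0, 1.0, 4.0])
  recip = denom.reciprocal()  # [inf, 1.0, 0.25]

  # out-of-place where, with a scalar zero instead of a full zeros tensor
  recip_where = torch.where(denom == 0, torch.zeros((), dtype=recip.dtype), recip)

  # in-place "recip[denom == 0] = 0" shorthand, spelled as masked_fill_
  recip.masked_fill_(denom == 0, 0)

  assert torch.equal(recip, recip_where)  # both give [0.0, 1.0, 0.25]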

Contributor Author

@pushkar-hue pushkar-hue Nov 4, 2025

You are right, I don't need to create a Tensor; the second arg can just be a scalar. Also, I looked into it and found that the C++ equivalent of recip[denom == 0] = 0 would be recip.masked_fill_(denom == 0, 0). I am testing it out locally first; if it works, I'll push the changes.

@pushkar-hue
Contributor Author

@Skylion007 I have pushed the requested changes. You can take a look and let me know if there's anything else I need to consider.

@albanD albanD removed their request for review November 4, 2025 17:50
@pushkar-hue
Contributor Author

@Skylion007 I'm looking at the two failing CI checks:

  • The test (dynamo_wrapped...) job failed with a ConnectionResetError. This looks like a temporary network glitch.

  • The Lint/lintrunner-pyrely-partial job is failing with errors in files unrelated to my PR, such as torch/fx/experimental/validator.py.

I believe both failures are unrelated to my code changes. Could you please confirm? Thanks!

@janeyx99 janeyx99 added the triaged and open source labels and removed the open source label Nov 7, 2025
Contributor

@soulitzer soulitzer left a comment

Thanks!

@soulitzer
Contributor

@pytorchbot rebase

@soulitzer
Contributor

> I believe both failures are unrelated to my code changes. Could you please confirm? Thanks!

Yeah, failures look unrelated.

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased fix-atan2-anomalies onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fix-atan2-anomalies && git pull --rebase)

@soulitzer
Contributor

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Nov 11, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / macos-py3-arm64 / test (default, 2, 3, macos-m1-stable)

Details for Dev Infra team. Raised by workflow job.

@soulitzer
Contributor

Failures look real, and the solution seems to need a bit more complexity: you don't always want the in-place masked_fill_, because masked_fill_ is not supported by NestedTensor, and in the higher-order gradients case you don't want to mutate saved tensors.
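
For context, a sketch of the kind of higher-order-gradient use that needs the backward graph left intact (a hypothetical check, not the failing CI test itself):

  import torch

  x = torch.tensor([0.0, 1.0], requires_grad=True)
  y = torch.tensor([0.0, 2.0], requires_grad=True)

  # First backward built with create_graph=True, so the backward itself is differentiable.
  gx, gy = torch.autograd.grad(torch.atan2(x, y).sum(), (x, y), create_graph=True)

  # Differentiating through the backward again; mutating tensors that this graph
  # still needs (e.g. via an unconditional masked_fill_) can error or give wrong results.
  gx.sum().backward()
  print(x.grad, y.grad)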

@pushkar-hue
Contributor Author

pushkar-hue commented Nov 11, 2025

Does my older solution, which avoids the in-place update of recip, work? i.e. using recip = at::where(denom == 0, at::zeros_like(recip), recip);

Or maybe I can do something like this:

  if (at::GradMode::is_enabled()) {
    recip = at::where(denom == 0, at::zeros_like(recip), recip);
  } else {
    recip.masked_fill_(denom == 0, 0);
  }

Let me know if this fixes the issue or if I am missing something @soulitzer

@pushkar-hue
Contributor Author

Never mind, I tested it locally and the only solution that seems to work is just this:

  auto denom = self * self + other * other;
  auto recip = denom.reciprocal();
  recip = at::where(denom == 0, at::zeros_like(recip), recip);

I'm not sure how efficient this out-of-place update of recip is, but it's the only solution that seems to work; I may have been mistaken earlier.

@soulitzer
Contributor

Maybe try

  at::areAnyTensorSubclassLike(...) || at::GradMode::is_enabled()

Let's avoid keeping many temporary tensors around at once.

@pytorch-bot pytorch-bot bot removed the ciflow/trunk label Nov 12, 2025
@pushkar-hue
Contributor Author

Thanks @soulitzer! Your suggestion worked perfectly. I have pushed the changes and they passed all the local tests.

@pushkar-hue
Contributor Author

Sorry for the force push; I was getting some lint errors, so I had to rebase locally and resolve some merge conflicts to fix them. This should pass now, hopefully.

Contributor

@soulitzer soulitzer left a comment

Thanks!

@soulitzer
Contributor

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Nov 14, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

jsuarez5341 pushed a commit to PufferAI/pytorch that referenced this pull request Nov 15, 2025
Silv3S pushed a commit to Silv3S/pytorch that referenced this pull request Nov 18, 2025
@pushkar-hue pushkar-hue deleted the fix-atan2-anomalies branch December 26, 2025 06:18