
Fix NaN gradients in atan2_backward when both inputs are zero#166787

Closed
pushkar-hue wants to merge 5 commits into pytorch:main from pushkar-hue:fix-atan2-anomalies

Conversation

@pushkar-hue
Contributor

@pushkar-hue pushkar-hue commented Nov 1, 2025

Fixes #165427

Description of Bug 🐛

As reported in #165427, when both inputs of the atan2 function are zero, the gradient becomes NaN. The forward pass of atan2 avoids the division-by-zero issue, but during backpropagation the gradients become NaN.

This is because the backward pass calculates (self * self + other * other).reciprocal(), which becomes inf at (0, 0). The subsequent multiplication by zero (0 * inf) results in NaN.
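
For reference, a minimal Python repro of the reported behaviour (a sketch; assumes a build without this fix):

  import torch

  x = torch.zeros((), requires_grad=True)
  y = torch.zeros((), requires_grad=True)

  out = torch.atan2(x, y)  # forward pass is fine: atan2(0, 0) == 0
  out.backward()
  print(x.grad, y.grad)    # without the fix: tensor(nan) tensor(nan); with it: tensor(0.) tensor(0.)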

Changes

  • Added an at::where condition to handle zero denominators in atan2_backward.
  • If denom is zero, return 0 for the reciprocal; otherwise, use the original value.

Testing

  • Added test_atan2_zero_gradient in test/test_autograd.py to verify atan2 returns 0.0 gradients at (0, 0); a rough sketch of the check follows.
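
A rough sketch of what the test asserts (not the exact test body from test/test_autograd.py):

  import torch

  def check_atan2_zero_gradient():
      a = torch.zeros(3, requires_grad=True)
      b = torch.zeros(3, requires_grad=True)
      torch.atan2(a, b).sum().backward()
      assert torch.equal(a.grad, torch.zeros(3))
      assert torch.equal(b.grad, torch.zeros(3))

  check_atan2_zero_gradient()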

cc: @soulitzer

@pytorch-bot

pytorch-bot bot commented Nov 1, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166787

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit d3b3829 with merge base c5d91d9:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pushkar-hue
Contributor Author

@pytorchbot label "release notes: autograd"

@pytorch-bot pytorch-bot bot added the release notes: autograd label Nov 1, 2025
-  auto recip = (self * self + other * other).reciprocal();
+  auto denom = self * self + other * other;
+  auto recip = denom.reciprocal();
+  recip = at::where(denom == 0, at::zeros_like(recip), recip);
Collaborator

@Skylion007 Skylion007 Nov 1, 2025

In this case, I think the 2nd arg can be a scalar or a scalar Tensor. No need to allocate a large zeros matrix.

Also wondering whether there is a way to make this where update in-place. Isn't it the same as recip[denom == 0] = 0 in Python shorthand? There should be a way to do a selective scalar assignment explicitly in C++ if the shorthand doesn't work. An in-place where_ is also an option.
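
For illustration, a Python-level sketch of the equivalence being discussed (the actual change lives in the C++ backward formula):

  import torch

  denom = torch.tensor([0.0, 1.0, 4.0])
  recip = denom.reciprocal()  # [inf, 1.0, 0.25]

  # out-of-place where, with a scalar zero instead of a full zeros tensor
  recip_where = torch.where(denom == 0, torch.zeros((), dtype=recip.dtype), recip)

  # in-place "recip[denom == 0] = 0" shorthand, spelled as masked_fill_
  recip.masked_fill_(denom == 0, 0)

  assert torch.equal(recip, recip_where)  # both give [0.0, 1.0, 0.25]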

Contributor Author

@pushkar-hue pushkar-hue Nov 4, 2025

You are right, I don't need to create a Tensor; the second arg can just be a scalar. Also, I looked into it and found that the C++ equivalent of recip[denom == 0] = 0 would be recip.masked_fill_(denom == 0, 0). I am testing it out locally first; if it works, I'll push the changes.

@pushkar-hue
Contributor Author

@Skylion007 I have pushed the requested changes. You can take a look and let me know if there's anything else I need to consider.

@albanD albanD removed their request for review November 4, 2025 17:50
@pushkar-hue
Contributor Author

@Skylion007 I'm looking at the two failing CI checks:

  • The test (dynamo_wrapped...) job failed with a ConnectionResetError. This looks like a temporary network glitch.

  • The Lint/lintrunner-pyrely-partial job is failing with errors in files unrelated to my PR, such as torch/fx/experimental/validator.py.

I believe both failures are unrelated to my code changes. Could you please confirm? Thanks!

@janeyx99 janeyx99 added the triaged and open source labels and removed the open source label Nov 7, 2025
Contributor

@soulitzer soulitzer left a comment

Thanks!

@soulitzer
Contributor

@pytorchbot rebase

@soulitzer
Contributor

> I believe both failures are unrelated to my code changes. Could you please confirm? Thanks!

Yeah, failures look unrelated.

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased fix-atan2-anomalies onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fix-atan2-anomalies && git pull --rebase)

@soulitzer
Contributor

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Nov 11, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / macos-py3-arm64 / test (default, 2, 3, macos-m1-stable)

Details for Dev Infra team. Raised by workflow job.

@soulitzer
Contributor

Failures look real, and the solution seems to need a bit more complexity: you don't always want the in-place masked_fill_, because masked_fill_ is not supported by NestedTensor, and in the higher-order gradients case you don't want to mutate saved tensors.
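
For context, a sketch of the kind of higher-order-gradient use that needs the backward graph left intact (a hypothetical check, not the failing CI test itself):

  import torch

  x = torch.tensor([0.0, 1.0], requires_grad=True)
  y = torch.tensor([0.0, 2.0], requires_grad=True)

  # First backward built with create_graph=True, so the backward itself is differentiable.
  gx, gy = torch.autograd.grad(torch.atan2(x, y).sum(), (x, y), create_graph=True)

  # Differentiating through the backward again; mutating tensors that this graph
  # still needs (e.g. via an unconditional masked_fill_) can error or give wrong results.
  gx.sum().backward()
  print(x.grad, y.grad)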

@pushkar-hue
Contributor Author

pushkar-hue commented Nov 11, 2025

Does my older solution, which avoids the in-place update of recip, work? i.e. using recip = at::where(denom == 0, at::zeros_like(recip), recip);

Or maybe I can do something like this:

  if (at::GradMode::is_enabled()) {
    recip = at::where(denom == 0, at::zeros_like(recip), recip);
  } else {
    recip.masked_fill_(denom == 0, 0);
  }

Let me know if this fixes the issue or if I am missing something @soulitzer

@pushkar-hue
Contributor Author

Never mind, I tested it locally and the only solution that seems to work is just this:

  auto denom = self * self + other * other;
  auto recip = denom.reciprocal();
  recip = at::where(denom == 0, at::zeros_like(recip), recip);

I'm not sure how efficient this out-of-place update of recip is, but it's the only solution that seems to work; I may have been mistaken earlier.

@soulitzer
Contributor

Maybe try

  at::areAnyTensorSubclassLike(...) || at::GradMode::is_enabled()

Let's avoid keeping many temporary tensors around at once.

@pytorch-bot pytorch-bot bot removed the ciflow/trunk label Nov 12, 2025
@pushkar-hue
Contributor Author

Thanks @soulitzer! Your suggestion worked perfectly. I have pushed the changes and they passed all the local tests.

@pushkar-hue
Contributor Author

Sorry for the force push; I was getting some lint errors, so I had to rebase locally and resolve some merge conflicts to fix them. This should pass now, hopefully.

Contributor

@soulitzer soulitzer left a comment

Thanks!

@soulitzer
Contributor

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Nov 14, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

jsuarez5341 pushed a commit to PufferAI/pytorch that referenced this pull request Nov 15, 2025
Silv3S pushed a commit to Silv3S/pytorch that referenced this pull request Nov 18, 2025
@pushkar-hue pushkar-hue deleted the fix-atan2-anomalies branch December 26, 2025 06:18