
Conversation

@mfkasim1 (Contributor)

Fixes #43405.

This pull request adds a feature that prints all relevant tracebacks when detect_anomaly detects a NaN in nested backward operations.
The way it works: each node is assigned as the parent of every node it produces during its backward computation. Then, if one of those children produces a NaN, the tracebacks of the parent and grandparents (if any) are printed as well.
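For readers unfamiliar with the internals, here is a minimal, purely conceptual Python sketch of that bookkeeping (the real logic lives in the C++ autograd engine; `Node`, `parent`, and `forward_traceback` below are hypothetical stand-ins, not PyTorch APIs):

```python
import traceback

class Node:
    """Hypothetical stand-in for an autograd graph node."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent  # set when this node is created inside another node's backward
        # Anomaly mode records where the forward call happened.
        self.forward_traceback = "".join(traceback.format_stack())

def print_traceback_chain(node):
    """On a NaN, print the stored traceback of the node, then its parent, grandparent, ..."""
    while node is not None:
        print(f"Traceback of forward call that created {node.name}:")
        print(node.forward_traceback)
        node = node.parent

# Example: c was created during b's backward, and b during a's backward.
a = Node("a")
b = Node("b", parent=a)
c = Node("c", parent=b)
print_traceback_chain(c)
```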

The parent is stored in the parent_node_ member of the Node class, which is accessible in C++ via node->parent() and in Python via node.parent_function.
A node has a parent iff:

  1. it is created from a backward operation, and
  2. it is created while anomaly mode and grad mode are both enabled.
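For illustration only, one could walk this parent chain from Python roughly as follows. The tensors and graph here are made up for the example, and the sketch assumes the new `parent_function` attribute described above is None (or absent) for nodes with no parent:

```python
import torch

# Hedged illustration (not from the PR itself): walk the parent chain of a
# backward node via the new `parent_function` attribute described above.
with torch.autograd.detect_anomaly():
    x = torch.tensor(2.0, requires_grad=True)
    y = x * x
    # g is produced by a backward computation, so (per the rules above) its
    # graph nodes should carry a parent when anomaly and grad mode are on.
    g, = torch.autograd.grad(y, (x,), create_graph=True)

    node = g.grad_fn
    while node is not None:
        print(node)
        # Assumption: `parent_function` is None (or missing) when a node was
        # not created during a backward pass.
        node = getattr(node, "parent_function", None)
```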

An example of this feature:

import torch

def example():
    x = torch.tensor(1.0, requires_grad=True)
    y = torch.tensor(1e-8, requires_grad=True)  # small to induce nan in n-th backward
    a = x * y
    b = x * y
    z1 = a / b  # can produce nan in n-th backward as long as #43414 is unsolved
    z = z1 * z1
    gy , = torch.autograd.grad( z , (y,), create_graph=True)
    gy2, = torch.autograd.grad(gy , (y,), create_graph=True)
    gy3, = torch.autograd.grad(gy2, (y,), create_graph=True)
    gy4, = torch.autograd.grad(gy3, (y,), create_graph=True)
    return gy4

with torch.autograd.detect_anomaly():
    gy4 = example()

with output:

example.py:16: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
  with torch.autograd.detect_anomaly():
/home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py:190: UserWarning: Error detected in DivBackward0. Traceback of forward call that caused the error:
  File "example.py", line 17, in <module>
    gy4 = example()
  File "example.py", line 12, in example
    gy3, = torch.autograd.grad(gy2, (y,), create_graph=True)
  File "/home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 190, in grad
    return Variable._execution_engine.run_backward(
 (Triggered internally at  ../torch/csrc/autograd/python_anomaly_mode.cpp:61.)
  return Variable._execution_engine.run_backward(
/home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py:190: UserWarning:

Traceback of forward call that induces the previous calculation:
  File "example.py", line 17, in <module>
    gy4 = example()
  File "example.py", line 11, in example
    gy2, = torch.autograd.grad(gy , (y,), create_graph=True)
  File "/home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 190, in grad
    return Variable._execution_engine.run_backward(
 (Triggered internally at  ../torch/csrc/autograd/python_anomaly_mode.cpp:65.)
  return Variable._execution_engine.run_backward(
/home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py:190: UserWarning:

Traceback of forward call that induces the previous calculation:
  File "example.py", line 17, in <module>
    gy4 = example()
  File "example.py", line 8, in example
    z1 = a / b  # can produce nan in n-th backward as long as #43414 is unsolved
 (Triggered internally at  ../torch/csrc/autograd/python_anomaly_mode.cpp:65.)
  return Variable._execution_engine.run_backward(
Traceback (most recent call last):
  File "example.py", line 17, in <module>
    gy4 = example()
  File "example.py", line 13, in example
    gy4, = torch.autograd.grad(gy3, (y,), create_graph=True)
  File "/home/mfkasim/anaconda2/envs/base3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 190, in grad
    return Variable._execution_engine.run_backward(
RuntimeError: Function 'DivBackward0' returned nan values in its 1th output.

cc & thanks to @albanD

@dr-ci (bot) commented Aug 26, 2020

💊 CI failures summary and remediations

As of commit 46b57a1 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

@albanD (Collaborator) left a comment

Thanks for the PR.
The high level idea looks good.
I think we can simplify the implementation a bit, though. Can you check the inline comments below and let me know what you think?

@mfkasim1 requested a review from @albanD on August 27, 2020, 10:48
@albanD (Collaborator) left a comment

This looks quite good, a few more comments below.

@mfkasim1 requested a review from @albanD on August 27, 2020, 18:22
@albanD (Collaborator) left a comment

The C++ looks good!
I just had a chance to take a quick look at the tests. They are good; just a small comment to make sure things are tested and we don't leak memory.
Then it should be good to go!

@albanD (Collaborator) left a comment

LGTM

Can you just edit the comment mentioned above and this will be good to go!

@facebook-github-bot (Contributor) left a comment

@albanD has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


@mfkasim1 (Contributor, Author)

Thanks for your help @albanD!

@facebook-github-bot (Contributor)

@albanD merged this pull request in 576880f.


Development

Successfully merging this pull request may close these issues:

Uninformative forward trace in detect_anomaly for double backward