Conversation

@albanD (Collaborator) commented May 18, 2018

This PR contains two main features:

  • A metadata dictionary associated with every C++ Function. If unused, it costs a single if statement in the Function destructor. The dictionary remains alive even after the Python Function object goes out of scope. This is not documented, since it can only be used from Python by recovering the Function object through graph traversal, which is not documented either.
  • A new anomaly-detection feature for autograd. If unused, it costs one if statement on Function creation and one per autograd function execution during backward. It can be enabled with a context manager or a global flag. When enabled, it has two effects:
    • If any function returns nan during the backward pass, an error is raised.
    • If an error is raised during the backward pass, the traceback of the forward call that created the failing backward function is printed.
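For reference, the behavior described above is reachable from Python through the `torch.autograd.detect_anomaly` context manager this PR introduces. A minimal sketch (sqrt of a negative number is just a convenient way to make the backward pass produce nan):

```python
import torch

x = torch.tensor([-1.0], requires_grad=True)
with torch.autograd.detect_anomaly():
    y = torch.sqrt(x)  # forward produces nan; this alone does not raise
    try:
        # The gradient of sqrt at -1 is 0.5 / sqrt(-1) = nan,
        # so anomaly mode raises during the backward pass.
        y.backward()
    except RuntimeError as err:
        print("anomaly detected:", err)
```

Alongside the error, the traceback of the forward call that created the failing backward function is printed as a warning.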

Please double-check the Python refcounting, as I'm not used to writing this kind of code and I might have missed something.

To be done in a future PR:

  • Add checks for nan during the nn.Module forward pass. I can't find a good way to do this check for a generic forward that does not use nn.Module, especially when no grads are required.
  • Use the metadata to instrument an nn.Module-based forward to allow better graph printing (e.g. collapsing modules).
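Though undocumented, the metadata dictionary can already be exercised from Python by walking the graph through `grad_fn`. A minimal sketch; the "name" key is purely illustrative, not an API convention:

```python
import torch

x = torch.tensor([1.0], requires_grad=True)
y = x * 2

# y.grad_fn is the C++ Function node for the multiplication.
# Its metadata dict persists with the graph, independently of
# any Python-side references to the node.
y.grad_fn.metadata["name"] = "double"
print(y.grad_fn.metadata["name"])  # -> double
```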


@soumith (Contributor) commented May 30, 2018

@pytorchbot retest this please


@colesbury (Member) left a comment

This is great! Some comments about the C++:


@albanD (Collaborator, Author) commented Jun 11, 2018

@ezyang I fixed it so that it works without any NO_PYTHON macro.
It still uses std::cout at the moment. I think the right thing to do is to introduce a new warning system for this to use, which would print to std::cerr in the C++ build and raise a Python warning in the Python build.
Given that this might create more discussion, can we get this in as is, and I'll switch it to the warning system as soon as that is done?

@ezyang ezyang merged commit 78e3259 into pytorch:master Jun 12, 2018
@ezyang (Contributor) commented Jun 12, 2018

Yeah, getting cout to the right place is a discussion in and of itself, and we've got a few other places that would benefit from that treatment. No need to block this PR on it.

@albanD albanD deleted the detect_anomaly branch June 12, 2018 09:22
@apaszke (Contributor) left a comment

Did we do any microbenchmarks before merging this?

    bool keep_graph,
    bool create_graph,
    const edge_list& outputs = {}) override;
virtual std::unique_ptr<AnomalyMetadata> make_anomaly_metadata() override;

}

private:
  static bool _enabled;

@albanD (Collaborator, Author) commented Jun 15, 2018

@apaszke I ran some benchmarks on an op that adds a constant to a small tensor of size 10. Both before this PR, and with this PR with anomaly mode disabled, one call takes 1.08 µs.
Note that for such a small op, the overhead is significant when the feature is enabled: one call then takes 85 µs. But given that we have to acquire the GIL, call into Python, and gather a whole traceback, that's expected.
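A microbenchmark along these lines can be sketched with `timeit` and the global switch (`torch.autograd.set_detect_anomaly`); the exact op and iteration count here are arbitrary choices, and absolute numbers will vary by machine:

```python
import timeit
import torch

x = torch.ones(10, requires_grad=True)

def step():
    # Small op: add a constant to a size-10 tensor, then backward.
    y = x + 1.0
    y.backward(torch.ones_like(y))

torch.autograd.set_detect_anomaly(False)
off = timeit.timeit(step, number=1000) / 1000

torch.autograd.set_detect_anomaly(True)
on = timeit.timeit(step, number=1000) / 1000
torch.autograd.set_detect_anomaly(False)  # restore the default

print(f"disabled: {off * 1e6:.2f} us/call, enabled: {on * 1e6:.2f} us/call")
```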

@nnop commented Apr 18, 2020

Does this work for Inf values?

@albanD (Collaborator, Author) commented Apr 20, 2020

No, this only detects NaN.
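To illustrate: an inf gradient passes through anomaly mode silently, so it has to be caught by hand (e.g. with `torch.isinf`). A minimal sketch, using sqrt at zero to produce an inf gradient:

```python
import torch

x = torch.tensor([0.0], requires_grad=True)
with torch.autograd.detect_anomaly():
    y = torch.sqrt(x)
    # The gradient of sqrt at 0 is 0.5 / sqrt(0) = inf.
    # Unlike nan, this does NOT raise under anomaly mode.
    y.backward()

# inf must be detected with an explicit check instead:
assert torch.isinf(x.grad).any()
print("inf gradient went undetected by anomaly mode")
```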
