Distributed autograd converts all python exceptions to RuntimeError #32636

@peterbell10

🐛 Bug

If a torch.autograd.Function raises a Python exception during the backward pass, distributed autograd always converts that exception to RuntimeError, discarding its original type.

To Reproduce

See DistAutogradTest.test_backward_autograd_engine_error:

with self.assertRaisesRegex(RuntimeError, 'Simulate error on backward pass'):
    # Run backwards, and validate we receive an error.
    dist_autograd.backward([val.sum()])

The test explicitly expects RuntimeError even though SimulateBackwardError raises Exception:

def backward(ctx, input):
    raise Exception('Simulate error on backward pass')

Expected behavior

The exception that reaches the caller should have the same type as the one raised in the Python code.

Additional context

The root cause seems to be that autograd raises a python_error exception in C++, which the default pybind11 exception translator converts to RuntimeError. In #30588 I register a new exception translator that treats python_errors correctly. However, this failed CI with some workers crashing (see #30588 (comment)).
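To make the difference between the two translation strategies concrete, here is a pure-Python analogy. It is only a sketch and uses neither pybind11 nor PyTorch: PythonError, run_engine, default_translator, and preserving_translator are hypothetical names invented for illustration, standing in for the C++ python_error, the autograd engine, and the two exception translators respectively.

```python
class PythonError(Exception):
    """Hypothetical stand-in for the C++ python_error: wraps the original exception."""

    def __init__(self, original):
        super().__init__(str(original))
        self.original = original


def failing_backward():
    # Same error as SimulateBackwardError.backward in the report.
    raise Exception("Simulate error on backward pass")


def run_engine(backward_fn):
    """Hypothetical engine: wraps any Python exception, as the C++ layer is assumed to."""
    try:
        backward_fn()
    except Exception as exc:
        raise PythonError(exc) from exc


def default_translator(call):
    """Mimics the reported behavior: every wrapped error surfaces as RuntimeError."""
    try:
        call()
    except PythonError as err:
        raise RuntimeError(str(err)) from err


def preserving_translator(call):
    """Mimics the expected behavior: re-raise the original exception unchanged."""
    try:
        call()
    except PythonError as err:
        raise err.original from None


def raised_type(translator):
    """Return the type of the exception that reaches the caller, or None."""
    try:
        translator(lambda: run_engine(failing_backward))
    except BaseException as exc:
        return type(exc)
    return None
```

In this sketch, raised_type(default_translator) yields RuntimeError while raised_type(preserving_translator) yields Exception, which mirrors the gap between the current and the expected behavior described above. Whether the real fix can simply re-raise the stored error depends on details of the RPC layer that this analogy does not model.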

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jjlilley

Metadata

Labels

better-engineering (Relatively self-contained tasks for better engineering contributors)
module: rpc (Related to RPC, distributed autograd, RRef, and distributed optimizer)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
