Description
🐛 Bug
If an instance of torch.autograd.Function raises a Python exception during the backward pass, distributed autograd always converts that exception to RuntimeError, discarding its original type.
To Reproduce
See DistAutogradTest.test_backward_autograd_engine_error:
pytorch/torch/testing/_internal/distributed/rpc/dist_autograd_test.py
Lines 1055 to 1058 in 6ad9e5c
    with self.assertRaisesRegex(RuntimeError, 'Simulate error on backward pass'):
        # Run backwards, and validate we receive an error.
        dist_autograd.backward([val.sum()])
The test explicitly expects RuntimeError even though SimulateBackwardError raises Exception:
pytorch/torch/testing/_internal/distributed/rpc/dist_autograd_test.py
Lines 144 to 145 in 6ad9e5c
    def backward(ctx, input):
        raise Exception('Simulate error on backward pass')
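The fact that this test passes is itself evidence of the conversion: assertRaisesRegex(RuntimeError, ...) only catches RuntimeError and its subclasses, so a plain Exception would propagate out of the with block and the test would error. A minimal stdlib-only sketch of that point, with no torch involved (raise_plain_exception is a hypothetical stand-in for SimulateBackwardError.backward, not the real function):

```python
import unittest

MESSAGE = 'Simulate error on backward pass'


def raise_plain_exception():
    # Hypothetical stand-in for the backward() above: raises a plain Exception.
    raise Exception(MESSAGE)


def plain_exception_escapes_runtime_error_check():
    """Return True if a plain Exception is NOT caught by a RuntimeError check."""
    case = unittest.TestCase()
    try:
        with case.assertRaisesRegex(RuntimeError, MESSAGE):
            raise_plain_exception()
    except RuntimeError:
        # Would mean the exception had been converted to RuntimeError.
        return False
    except Exception as exc:
        # The original exception propagated with its type intact.
        return type(exc) is Exception
    return False
```

So if the type were preserved end to end, the existing test would have to expect Exception rather than RuntimeError.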
Expected behavior
The exception that reaches the caller should have the same type as the one raised in the Python code.
Additional context
The root cause seems to be that autograd raises a python_error exception in C++, which the default pybind11 exception translator converts to RuntimeError. In #30588 I registered a new exception translator that handles python_error correctly. However, this failed CI with some workers crashing (see #30588 (comment)).
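To illustrate the suspected mechanism, here is a rough pure-Python model of the two translation strategies. It is only an analogy, not pybind11 or PyTorch code: PythonError, run_backward, default_translator and preserving_translator are hypothetical names standing in for the C++ python_error carrier, the autograd engine, the default translator, and a type-preserving translator.

```python
class PythonError(Exception):
    """Hypothetical stand-in for the C++ python_error carrier."""

    def __init__(self, original):
        super().__init__(str(original))
        self.original = original  # keep the original Python exception


def run_backward(fn):
    # Model of the engine: any Python exception is wrapped in the carrier.
    try:
        fn()
    except Exception as exc:
        raise PythonError(exc) from exc


def default_translator(fn):
    # Model of the reported behavior: the carrier is treated as a generic
    # error and surfaces as RuntimeError, keeping only the message.
    try:
        run_backward(fn)
    except PythonError as err:
        raise RuntimeError(str(err)) from None


def preserving_translator(fn):
    # Model of the proposed behavior: re-raise the stored original exception.
    try:
        run_backward(fn)
    except PythonError as err:
        raise err.original from None


def failing_backward():
    raise Exception('Simulate error on backward pass')


def raised_type(translator):
    # Helper: report which exception type the caller would observe.
    try:
        translator(failing_backward)
    except Exception as exc:
        return type(exc)
    return None
```

In this model raised_type(default_translator) is RuntimeError, while raised_type(preserving_translator) is the original Exception, which is the behavior asked for under "Expected behavior". The model says nothing about why the workers crashed in CI.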
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jjlilley