Description
🐛 Describe the bug
We are seeing non-deterministic segfaults in our CI triggered by e.g. test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_slowpath_outplace_cuda_int64.
Running the test standalone has not reproduced the issue so far, and we do not yet know exactly how to trigger it.
Our CI shows a failure rate of approx. 10% according to @nWEIdia.
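Since the crash is intermittent, one way to estimate the failure rate outside CI is to rerun the test in a fresh process many times and count the runs killed by SIGSEGV. The following is only a minimal sketch: the helper name `count_crashes` is hypothetical, and it assumes a POSIX system, where a negative return code means the child process was killed by that signal.

```python
import signal
import subprocess
import sys


def count_crashes(cmd, runs):
    """Run `cmd` in a fresh process `runs` times; count runs killed by SIGSEGV."""
    crashes = 0
    for _ in range(runs):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # On POSIX, returncode == -N means the child was terminated by signal N.
        if proc.returncode == -signal.SIGSEGV:
            crashes += 1
    return crashes
```

A possible invocation, assuming the test is run through pytest from the repository root (the path and selection flags are assumptions and may differ in your setup): `count_crashes([sys.executable, "-m", "pytest", "test/test_foreach.py", "-k", "test_parity__foreach_neg_slowpath_outplace_cuda_int64"], 50)`. If the segfault only appears when other tests run in the same process, widening the `-k` selection may be necessary.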
The stacktrace shows:
0 0x0000000000042520 __sigaction() ???:0
1 0x00000000000969fc pthread_kill() ???:0
2 0x0000000000042476 raise() ???:0
3 0x0000000000042520 __sigaction() ???:0
4 0x0000000000044c1d getenv() ???:0
5 0x0000000006f5d869 libkineto::Config::Config() ???:0
6 0x0000000006fa1525 libkineto::ConfigLoader::updateConfigThread() ???:0
7 0x00000000000dc253 std::error_code::default_error_condition() ???:0
8 0x0000000000094ac3 pthread_condattr_setpshared() ???:0
9 0x0000000000126850 __xmknodat() ???:0
This might point to the getenv call trying to read kUseDaemonEnvVar, which might be invalid in the failing thread.
This is all speculation; potentially we are seeing a lifetime issue similar to the ones previously fixed in pytorch/kineto#696 and pytorch/kineto#965.
We will keep debugging and will update the issue with more information. For now, we are creating it as a placeholder.
Versions
PyTorch source build based on: #132066
cc @msaroufim @robieta @chaekit @aaronenyeshi @guotuofeng @guyang3532 @dzhulgakov @davidberard98 @briancoutinho @sraikund16 @sanrise @crcrpar @mcarilli @janeyx99