test_parity__foreach_* tests segfault in kineto #134596

@ptrblck

🐛 Describe the bug

We are seeing non-deterministic segfaults in our CI, triggered by, e.g., test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_slowpath_outplace_cuda_int64.
Running the test standalone has not reproduced the issue so far, and we do not yet know how to trigger it reliably.
According to @nWEIdia, our CI shows a failure rate of approximately 10%.
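
Since the failure is intermittent, one way to look for a reproducer is to rerun a larger slice of the test file many times and count how often the process dies from a signal. The following is only a minimal sketch under stated assumptions: the helper names `classify` and `run_repeatedly` are hypothetical, the tests are assumed to be runnable through pytest, and nothing here is known to actually trigger the crash. It sets PYTHONFAULTHANDLER=1 so that Python dumps its own tracebacks to stderr if a child process segfaults.

```python
import os
import signal
import subprocess
import sys
from collections import Counter


def classify(returncode):
    """Map a child's return code to 'pass', 'fail', or a signal name."""
    if returncode == 0:
        return "pass"
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as return code -N.
        try:
            return signal.Signals(-returncode).name
        except ValueError:
            return f"signal {-returncode}"
    return "fail"


def run_repeatedly(cmd, runs):
    """Run cmd `runs` times and count the outcomes (hypothetical helper)."""
    # PYTHONFAULTHANDLER=1 enables Python's faulthandler in the child process.
    env = dict(os.environ, PYTHONFAULTHANDLER="1")
    outcomes = Counter()
    for _ in range(runs):
        # stderr is left attached so any crash traceback stays visible.
        proc = subprocess.run(cmd, env=env, stdout=subprocess.DEVNULL)
        outcomes[classify(proc.returncode)] += 1
    return outcomes
```

A call such as `run_repeatedly([sys.executable, "-m", "pytest", "test_foreach.py", "-k", "test_parity__foreach"], runs=50)` would then report how many runs passed, failed, or ended in SIGSEGV. The `-k` filter and the run count are assumptions chosen for illustration; because the single test does not crash standalone, a broader selection may be needed.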

The stacktrace shows:

 0 0x0000000000042520 __sigaction()  ???:0
 1 0x00000000000969fc pthread_kill()  ???:0
 2 0x0000000000042476 raise()  ???:0
 3 0x0000000000042520 __sigaction()  ???:0
 4 0x0000000000044c1d getenv()  ???:0
 5 0x0000000006f5d869 libkineto::Config::Config()  ???:0
 6 0x0000000006fa1525 libkineto::ConfigLoader::updateConfigThread()  ???:0
 7 0x00000000000dc253 std::error_code::default_error_condition()  ???:0
 8 0x0000000000094ac3 pthread_condattr_setpshared()  ???:0
 9 0x0000000000126850 __xmknodat()  ???:0

This might point to the getenv call in libkineto::Config::Config(), which tries to read kUseDaemonEnvVar; that variable might be invalid in the failing thread.

This is all speculation; we may be seeing a lifetime issue similar to the ones previously fixed in pytorch/kineto#696 and pytorch/kineto#965.

CC @malfet @eqy

We will keep debugging and will update this issue with more information. For now, it serves as a placeholder.

Versions

PyTorch source build based on: #132066

cc @msaroufim @robieta @chaekit @aaronenyeshi @guotuofeng @guyang3532 @dzhulgakov @davidberard98 @briancoutinho @sraikund16 @sanrise @crcrpar @mcarilli @janeyx99

Metadata

Labels

    module: crash (Problem manifests as a hard crash, as opposed to a RuntimeError)
    module: cuda (Related to torch.cuda, and CUDA support in general)
    module: mta (Issues related to multi-tensor apply kernels and foreach functions)
    oncall: profiler (profiler-related issues (cpu, gpu, kineto))
    triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
