Conversation


@bhosmer bhosmer commented Aug 3, 2020

Stack from ghstack:

[ghstack-poisoned]
@bhosmer bhosmer requested a review from apaszke as a code owner August 3, 2020 05:56
bhosmer pushed a commit that referenced this pull request Aug 3, 2020
ghstack-source-id: f97a929
Pull Request resolved: #42436
@facebook-github-bot added the `oncall: jit` (Add this issue/PR to JIT oncall triage queue) label Aug 3, 2020

dr-ci bot commented Aug 3, 2020

💊 CI failures summary and remediations

As of commit d522aa9 (more details on the Dr. CI page):


  • 4/4 failures possibly* introduced in this PR
    • 1/4 non-CircleCI failure(s)

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_test (1/2)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Aug 03 07:21:07 test_udf_remote_message_delay_timeout_to_self (__main__.FaultyAgentRpcTestWithSpawn) ... [E request_callback_no_python.cpp:618] Received error while processing request type 5: false INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp":379, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.
Aug 03 07:20:30 frame #8: <unknown function> + 0xa45a02 (0x7ff8bad5aa02 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so) 
Aug 03 07:20:30 frame #9: c10::ThreadPool::main_loop(unsigned long) + 0x2fb (0x7ff8b9cdd73b in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so) 
Aug 03 07:20:30 frame #10: <unknown function> + 0xc8421 (0x7ff8b0665421 in /opt/conda/lib/libstdc++.so.6) 
Aug 03 07:20:30 frame #11: <unknown function> + 0x76ba (0x7ff8c2ea76ba in /lib/x86_64-linux-gnu/libpthread.so.0) 
Aug 03 07:20:30 frame #12: clone + 0x6d (0x7ff8c2bdd4dd in /lib/x86_64-linux-gnu/libc.so.6) 
Aug 03 07:20:30  
Aug 03 07:20:30 ok (3.222s) 
Aug 03 07:20:54   test_rpc_script_timeout (__main__.FaultyAgentRpcTestWithSpawn) ... ok (8.739s) 
Aug 03 07:20:57   test_rref_to_here_timeout (__main__.FaultyAgentRpcTestWithSpawn) ... ok (3.223s) 
Aug 03 07:21:04   test_udf_remote_message_delay_timeout (__main__.FaultyAgentRpcTestWithSpawn) ... ok (7.233s) 
Aug 03 07:21:07   test_udf_remote_message_delay_timeout_to_self (__main__.FaultyAgentRpcTestWithSpawn) ... [E request_callback_no_python.cpp:618] Received error while processing request type 5: false INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp":379, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created. 
Aug 03 07:21:07 Exception raised from getOwnerRRef at /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp:379 (most recent call first): 
Aug 03 07:21:07 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x69 (0x7fdb33104989 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so) 
Aug 03 07:21:07 frame #1: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 0x4a4 (0x7fdb2d0fa084 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so) 
Aug 03 07:21:07 frame #2: torch::distributed::rpc::RequestCallbackImpl::processPythonRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::function<void (torch::distributed::rpc::Message)> const&, long, std::shared_ptr<torch::utils::Future<torch::distributed::rpc::Message> > const&) const + 0x83 (0x7fdb341946f3 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so) 
Aug 03 07:21:07 frame #3: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, long, std::shared_ptr<torch::utils::Future<torch::distributed::rpc::Message> > const&) const + 0x7b0 (0x7fdb2d0e7610 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so) 
Aug 03 07:21:07 frame #4: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, long, std::shared_ptr<torch::utils::Future<torch::distributed::rpc::Message> > const&) const + 0x2f (0x7fdb34192b2f in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so) 
Aug 03 07:21:07 frame #5: <unknown function> + 0x35bcb8e (0x7fdb2d0e3b8e in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so) 
Aug 03 07:21:07 frame #6: torch::distributed::rpc::RequestCallbackNoPython::processMessage(torch::distributed::rpc::Message&) const + 0x3b6 (0x7fdb2d0e5806 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so) 
Aug 03 07:21:07 frame #7: torch::distributed::rpc::RequestCallback::operator()(torch::distributed::rpc::Message&) const + 0x1e (0x7fdb2d0e2bbe in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so) 
Aug 03 07:21:07 frame #8: torch::distributed::rpc::ProcessGroupAgent::handleRecv(torch::distributed::rpc::RecvWork&) + 0xc4 (0x7fdb3416f7d4 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so) 

See CircleCI build pytorch_linux_bionic_py3_6_clang9_test (2/2)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Aug 03 07:24:12 test_udf_remote_message_delay_timeout_to_self (__main__.FaultyAgentRpcTestWithSpawn) ... [E request_callback_no_python.cpp:618] Received error while processing request type 5: false INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp":379, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.
Aug 03 07:23:35 frame #9: c10::ThreadPool::main_loop(unsigned long) + 0x15a (0x7f1ae12deaca in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so) 
Aug 03 07:23:35 frame #10: <unknown function> + 0xc819d (0x7f1ae11f419d in /opt/conda/lib/libstdc++.so.6) 
Aug 03 07:23:35 frame #11: <unknown function> + 0x76db (0x7f1af54ad6db in /lib/x86_64-linux-gnu/libpthread.so.0) 
Aug 03 07:23:35 frame #12: clone + 0x3f (0x7f1af51d6a3f in /lib/x86_64-linux-gnu/libc.so.6) 
Aug 03 07:23:35  
Aug 03 07:23:36 ok (3.224s) 
Aug 03 07:23:50   test_rpc_builtin_timeout (__main__.FaultyAgentRpcTestWithSpawn) ... ok (14.755s) 
Aug 03 07:23:59   test_rpc_script_timeout (__main__.FaultyAgentRpcTestWithSpawn) ... ok (8.739s) 
Aug 03 07:24:02   test_rref_to_here_timeout (__main__.FaultyAgentRpcTestWithSpawn) ... ok (3.224s) 
Aug 03 07:24:10   test_udf_remote_message_delay_timeout (__main__.FaultyAgentRpcTestWithSpawn) ... ok (7.233s) 
Aug 03 07:24:12   test_udf_remote_message_delay_timeout_to_self (__main__.FaultyAgentRpcTestWithSpawn) ... [E request_callback_no_python.cpp:618] Received error while processing request type 5: false INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp":379, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created. 
Aug 03 07:24:12 Exception raised from getOwnerRRef at /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp:379 (most recent call first): 
Aug 03 07:24:12 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x7d (0x7fe4730e85dd in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so) 
Aug 03 07:24:12 frame #1: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 0x4c3 (0x7fe476b2dcf3 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so) 
Aug 03 07:24:12 frame #2: torch::distributed::rpc::RequestCallbackImpl::processPythonRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::function<void (torch::distributed::rpc::Message)> const&, long, std::shared_ptr<torch::utils::Future<torch::distributed::rpc::Message> > const&) const + 0x71 (0x7fe47e0a8151 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so) 
Aug 03 07:24:12 frame #3: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, long, std::shared_ptr<torch::utils::Future<torch::distributed::rpc::Message> > const&) const + 0xcce (0x7fe476b1d09e in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so) 
Aug 03 07:24:12 frame #4: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, long, std::shared_ptr<torch::utils::Future<torch::distributed::rpc::Message> > const&) const + 0x21 (0x7fe47e0aa8a1 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so) 
Aug 03 07:24:12 frame #5: <unknown function> + 0x381e5d2 (0x7fe476b225d2 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so) 
Aug 03 07:24:12 frame #6: torch::distributed::rpc::RequestCallbackNoPython::processMessage(torch::distributed::rpc::Message&) const + 0x211 (0x7fe476b1baf1 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so) 
Aug 03 07:24:12 frame #7: torch::distributed::rpc::RequestCallback::operator()(torch::distributed::rpc::Message&) const + 0xa (0x7fe476b1b67a in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so) 
Aug 03 07:24:12 frame #8: torch::distributed::rpc::ProcessGroupAgent::handleRecv(torch::distributed::rpc::RecvWork&) + 0x9b (0x7fe47e08249b in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so) 

1 failure not recognized by patterns:

Job | Step | Action
--- | --- | ---
CircleCI pytorch_linux_xenial_py3_6_gcc5_4_ge_config_simple_test | Report results | 🔁 rerun

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.


This comment has been revised 3 times.

Author

bhosmer commented Nov 21, 2020

Superseded by #42438

@bhosmer bhosmer closed this Nov 21, 2020
@facebook-github-bot facebook-github-bot deleted the gh/bhosmer/34/head branch December 21, 2020 15:21

Labels

cla signed · oncall: jit (Add this issue/PR to JIT oncall triage queue)


3 participants