
DISABLED test_backward_ddp_outside (__main__.TensorPipeDdpUnderDistAutogradTestWithSpawn) #45117

Description (reported by @lw)

https://app.circleci.com/pipelines/github/pytorch/pytorch/217196/workflows/15141629-1343-446c-a9cf-230f9c6b4527/jobs/7655526/steps

Probably the same issue as #40378, which affects the ProcessGroup agent.
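
The assertion in the log below suggests a race: the master RPCs the trainer-setup UDF onto worker0, and that UDF calls dist.new_group() before worker0's main thread has finished dist.init_process_group(). A hypothetical minimal sketch of that pattern (world size, names, and the init_trainer helper are simplified stand-ins, not the actual ddp_under_dist_autograd_test code; the dedicated remote worker is omitted):

```python
# Hypothetical repro sketch of the suspected race, not the real test.
import os

import torch.distributed as dist
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp

TRAINERS = 4      # ranks 0-3 form the trainer process group
MASTER_RANK = 4   # drives RPC only; not part of the process group
WORLD_SIZE = 5


def init_trainer(ranks):
    # Runs inside an RPC on a trainer rank. Raises
    # "Default process group is not initialized" if that rank has not yet
    # called dist.init_process_group().
    dist.new_group(ranks)
    return True


def run(rank):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=WORLD_SIZE)
    if rank == MASTER_RANK:
        # The race: these RPCs can land on the trainers before they have
        # finished dist.init_process_group() in the else-branch below.
        futs = [
            rpc.rpc_async(f"worker{r}", init_trainer,
                          args=(list(range(TRAINERS)),))
            for r in range(TRAINERS)
        ]
        for fut in futs:
            fut.wait()  # surfaces the AssertionError, as in the log below
    else:
        dist.init_process_group(
            "gloo",
            init_method="tcp://localhost:29501",
            rank=rank,
            world_size=TRAINERS,
        )
    rpc.shutdown()


if __name__ == "__main__":
    mp.spawn(run, nprocs=WORLD_SIZE)
```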

Sep 22 01:48:31   test_backward_ddp_outside (__main__.TensorPipeDdpUnderDistAutogradTestWithSpawn) ... 2020-09-22 01:48:31,377 ddp_under_dist_autograd_test.py:342 INFO p:process 0 t:MainThread: Running the trainer #0...
Sep 22 01:48:31 2020-09-22 01:48:31,377 ddp_under_dist_autograd_test.py:344 INFO p:process 0 t:MainThread: Initing trainer process group by trainer #0 with ranks [0, 1, 2, 3]
Sep 22 01:48:31 2020-09-22 01:48:31,378 ddp_under_dist_autograd_test.py:342 INFO p:process 1 t:MainThread: Running the trainer #1...
Sep 22 01:48:31 2020-09-22 01:48:31,378 ddp_under_dist_autograd_test.py:344 INFO p:process 1 t:MainThread: Initing trainer process group by trainer #1 with ranks [0, 1, 2, 3]
Sep 22 01:48:31 2020-09-22 01:48:31,378 ddp_under_dist_autograd_test.py:328 INFO p:process 4 t:MainThread: The remote worker is running.
Sep 22 01:48:31 2020-09-22 01:48:31,378 ddp_under_dist_autograd_test.py:342 INFO p:process 2 t:MainThread: Running the trainer #2...
Sep 22 01:48:31 2020-09-22 01:48:31,378 ddp_under_dist_autograd_test.py:344 INFO p:process 2 t:MainThread: Initing trainer process group by trainer #2 with ranks [0, 1, 2, 3]
Sep 22 01:48:31 2020-09-22 01:48:31,379 ddp_under_dist_autograd_test.py:362 INFO p:process 5 t:MainThread: Running the master process...
Sep 22 01:48:31 2020-09-22 01:48:31,379 ddp_under_dist_autograd_test.py:342 INFO p:process 3 t:MainThread: Running the trainer #3...
Sep 22 01:48:31 2020-09-22 01:48:31,380 ddp_under_dist_autograd_test.py:344 INFO p:process 3 t:MainThread: Initing trainer process group by trainer #3 with ranks [0, 1, 2, 3]
Sep 22 01:48:31 2020-09-22 01:48:31,383 ddp_under_dist_autograd_test.py:353 INFO p:process 2 t:MainThread: Waiting for shutdown signal on trainer #2...
Sep 22 01:48:31 2020-09-22 01:48:31,383 ddp_under_dist_autograd_test.py:353 INFO p:process 3 t:MainThread: Waiting for shutdown signal on trainer #3...
Sep 22 01:48:31 2020-09-22 01:48:31,384 ddp_under_dist_autograd_test.py:375 INFO p:process 5 t:MainThread: Created remote rrefs on master
Sep 22 01:48:31 2020-09-22 01:48:31,385 ddp_under_dist_autograd_test.py:119 INFO p:process 4 t:Dummy-1: Initing RemoteNet with 5 3
Sep 22 01:48:31 2020-09-22 01:48:31,387 ddp_under_dist_autograd_test.py:93 INFO p:process 4 t:Dummy-2: Initing RemoteEM with 2 3
Sep 22 01:48:31 2020-09-22 01:48:31,409 ddp_under_dist_autograd_test.py:353 INFO p:process 1 t:MainThread: Waiting for shutdown signal on trainer #1...
Sep 22 01:48:31 2020-09-22 01:48:31,409 ddp_under_dist_autograd_test.py:353 INFO p:process 0 t:MainThread: Waiting for shutdown signal on trainer #0...
Sep 22 01:48:31 ERROR:root:Caught exception: 
Sep 22 01:48:31 Traceback (most recent call last):
Sep 22 01:48:31   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 246, in wrapper
Sep 22 01:48:31     fn()
Sep 22 01:48:31   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/dist_utils.py", line 70, in new_test_method
Sep 22 01:48:31     return_value = old_test_method(self, *arg, **kwargs)
Sep 22 01:48:31   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/distributed/ddp_under_dist_autograd_test.py", line 471, in test_backward_ddp_outside
Sep 22 01:48:31     self._do_test(DdpMode.OUTSIDE)
Sep 22 01:48:31   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/distributed/ddp_under_dist_autograd_test.py", line 455, in _do_test
Sep 22 01:48:31     self._master_process(ddp_mode, simulate_uneven_inputs)
Sep 22 01:48:31   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/distributed/ddp_under_dist_autograd_test.py", line 377, in _master_process
Sep 22 01:48:31     ddp_mode, simulate_uneven_inputs, remote_em_rref, remote_net_rref
Sep 22 01:48:31   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/distributed/ddp_under_dist_autograd_test.py", line 423, in do_test_on_master
Sep 22 01:48:31     ddp_grads, non_ddp_grads = future.wait()
Sep 22 01:48:31   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py", line 177, in _handle_exception
Sep 22 01:48:31     raise result.exception_type(result.msg)
Sep 22 01:48:31 AssertionError: On WorkerInfo(id=0, name=worker0):
Sep 22 01:48:31 AssertionError('On WorkerInfo(id=0, name=worker0):\nAssertionError(\'Default process group is not initialized\')\nTraceback (most recent call last):\n  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py", line 164, in _run_function\n    result = python_udf.func(*python_udf.args, **python_udf.kwargs)\n  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/distributed/ddp_under_dist_autograd_test.py", line 186, in __init__\n    if ddp_mode in (DdpMode.INSIDE, DdpMode.OUTSIDE)\n  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1979, in new_group\n    _check_default_pg()\n  File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 211, in _check_default_pg\n    "Default process group is not initialized"\nAssertionError: Default process group is not initialized\n')
Sep 22 01:48:31 Traceback (most recent call last):
Sep 22 01:48:31   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py", line 164, in _run_function
Sep 22 01:48:31     result = python_udf.func(*python_udf.args, **python_udf.kwargs)
Sep 22 01:48:31   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/distributed/ddp_under_dist_autograd_test.py", line 78, in _call_method
Sep 22 01:48:31     return method(rref.local_value(), *args, **kwargs)
Sep 22 01:48:31   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py", line 177, in _handle_exception
Sep 22 01:48:31     raise result.exception_type(result.msg)
Sep 22 01:48:31 AssertionError: On WorkerInfo(id=0, name=worker0):
Sep 22 01:48:31 AssertionError('Default process group is not initialized')
Sep 22 01:48:31 Traceback (most recent call last):
Sep 22 01:48:31   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/rpc/internal.py", line 164, in _run_function
Sep 22 01:48:31     result = python_udf.func(*python_udf.args, **python_udf.kwargs)
Sep 22 01:48:31   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/distributed/ddp_under_dist_autograd_test.py", line 186, in __init__
Sep 22 01:48:31     if ddp_mode in (DdpMode.INSIDE, DdpMode.OUTSIDE)
Sep 22 01:48:31   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1979, in new_group
Sep 22 01:48:31     _check_default_pg()
Sep 22 01:48:31   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 211, in _check_default_pg
Sep 22 01:48:31     "Default process group is not initialized"
Sep 22 01:48:31 AssertionError: Default process group is not initialized
Sep 22 01:48:31 
Sep 22 01:48:31 
Sep 22 01:48:31 exiting process with exit code: 10
Sep 22 01:48:31 [W tensorpipe_agent.cpp:577] RPC agent for worker3 encountered error when reading incoming request from worker5: EOF: end of file (this is expected to happen during shutdown)
Sep 22 01:48:31 [W tensorpipe_agent.cpp:577] RPC agent for worker2 encountered error when reading incoming request from worker5: EOF: end of file (this is expected to happen during shutdown)
Sep 22 01:48:31 [W tensorpipe_agent.cpp:577] RPC agent for worker1 encountered error when reading incoming request from worker5: EOF: end of file (this is expected to happen during shutdown)
Sep 22 01:48:31 [W tensorpipe_agent.cpp:577] RPC agent for worker0 encountered error when reading incoming request from worker5: EOF: end of file (this is expected to happen during shutdown)
Sep 22 01:48:31 [W tensorpipe_agent.cpp:577] RPC agent for worker4 encountered error when reading incoming request from worker5: EOF: end of file (this is expected to happen during shutdown)
Sep 22 01:48:31 Process 5 terminated with exit code 10, terminating remaining processes.
Sep 22 01:48:31 ERROR (2.373s)
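
If the root cause is indeed the master's RPC racing ahead of the trainers' process-group initialization, one possible mitigation (an assumption on my part, not necessarily the fix applied upstream) is to have the RPC-invoked setup code wait until the default process group exists before calling dist.new_group():

```python
# Possible mitigation sketch (hypothetical; wait_for_default_pg is not a
# torch API): block RPC-invoked setup until dist.init_process_group() has
# run in the hosting process, so dist.new_group() cannot fire too early.
import time

import torch.distributed as dist


def wait_for_default_pg(timeout_s: float = 30.0, poll_s: float = 0.1) -> None:
    """Poll until the default process group is initialized in this process."""
    deadline = time.monotonic() + timeout_s
    while not dist.is_initialized():
        if time.monotonic() > deadline:
            raise RuntimeError("default process group was never initialized")
        time.sleep(poll_s)


def init_trainer(ranks):
    wait_for_default_pg()  # guard against the race shown in the log above
    dist.new_group(ranks)
    return True
```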

cc @ezyang @gchanan @zou3519 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jiayisuse @agolynski @jjlilley


Labels: high priority, module: flaky-tests, module: rpc, oncall: distributed, triage review, triaged
