RPC should destroy/deinitialize dist autograd container when rpc is shutdown #39340

@rohan-varma

🚀 Feature/Enhancement

During RPC initialization, each participating process initializes a singleton DistAutogradContainer. During shutdown, we clean up items such as the RRef context and the RPC agent, but we do not reset this singleton dist autograd container. This prevents RPC from being reinitialized: a group of processes cannot call rpc.shutdown(); dist.barrier(); rpc.init_rpc(...). The second init_rpc call fails with the following error:

/test/distributed/rpc/faulty_agent/rpc_spawn_faulty#binary,link-tree
/torch/distributed/rpc/__init__.py", line 85, in init_rpc
>     dist_autograd._init(rank)
> RuntimeError: Container is already initialized! Cannot initialize it twice!

Motivation

I was looking into whether we could reuse spawned subprocesses across our RPC tests, which would reduce the time it takes to run the RPC test suite. However, this requires processes to be able to re-init RPC, which is blocked by this error. I was also writing a set of similar tests in #38590 and wanted to use this pattern to reduce code duplication, but couldn't: the tests need separate initializations of RPC, since they test the RRef deletion behavior triggered by the shutdown sequence.

Pitch

I don't know that we necessarily need to destroy the container. Maybe we could just de-initialize it on shutdown, and then re-initialize it if RPC is started back up again.
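To make the idea concrete, here is a minimal toy sketch in pure Python. It is not the actual DistAutogradContainer code: `ToyAutogradContainer`, `toy_init_rpc`, `toy_shutdown`, and the `reset_container` flag are all hypothetical names made up for illustration. It only models a singleton that refuses a second init unless shutdown de-initializes it first.

```python
import threading


class ToyAutogradContainer:
    """Hypothetical stand-in for the per-process dist autograd container."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self, worker_id):
        self.worker_id = worker_id

    @classmethod
    def init(cls, worker_id):
        # Mirrors the behavior reported above: a second init is an error.
        with cls._lock:
            if cls._instance is not None:
                raise RuntimeError(
                    "Container is already initialized! Cannot initialize it twice!"
                )
            cls._instance = cls(worker_id)
            return cls._instance

    @classmethod
    def deinit(cls):
        # Proposed addition (assumption): drop the state so init can run again.
        with cls._lock:
            cls._instance = None

    @classmethod
    def is_initialized(cls):
        return cls._instance is not None


def toy_init_rpc(rank):
    # Stand-in for the part of init_rpc that initializes the container.
    ToyAutogradContainer.init(rank)


def toy_shutdown(reset_container):
    # reset_container=False models shutdown as described in this issue;
    # reset_container=True models the proposed de-initialization.
    if reset_container:
        ToyAutogradContainer.deinit()
```

With `reset_container=False`, calling `toy_init_rpc` again after `toy_shutdown` raises the same "already initialized" RuntimeError. With `reset_container=True`, the second `toy_init_rpc` succeeds. Whether the real container should be destroyed or just reset like this is still an open question.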

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @rohan-varma @xush6528 @jjlilley @osalpekar

Metadata

Assignees

No one assigned

    Labels

    better-engineering (Relatively self-contained tasks for better engineering contributors)
    feature (A request for a proper, new feature.)
    module: rpc (Related to RPC, distributed autograd, RRef, and distributed optimizer)
    triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module)
