-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Description
🚀 Feature/Enhnacement
During RPC initialization, each participating process initializes a singleton DistAutogradContainer. During shutdown, we cleanup items such as the RRef context and RPC agent, but do not reset this singleton dist autograd container. This prevents RPC from being reinitialized (i.e. a group of processes cannot call rpc.shutdown() ; dist.barrier() ; rpc.init_rpc(...). It will fail with the following error:
/test/distributed/rpc/faulty_agent/rpc_spawn_faulty#binary,link-tree
/torch/distributed/rpc/__init__.py", line 85, in init_rpc
> dist_autograd._init(rank)
> RuntimeError: Container is already initialized! Cannot initialize it twice!
Motivation
I was looking into seeing if we could reuse spawned subprocesses across our RPC tests, which would decrease the amount of time it takes to run the RPC test suite; however, this would require processes to be able to re-init RPC, which is blocked by this error. Also, I was writing a set of similar tests in #38590, and wanted to use this pattern to reduce code duplication, but couldn't (the tests need to use different initializations of RPC since they test the RRef deletion behavior triggered by the shutdown sequence).
Pitch
I don't know if we necessarily need to destroy the container, maybe we could just de-initialize it, and then re-init if RPC is being started back up again.
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @rohan-varma @xush6528 @jjlilley @osalpekar