feat: move async rollout worker to separate process #5749
Conversation
wadeKeith left a comment:
Nice refactor! Moving the async rollout worker to a separate process with dedicated weight transfer improves isolation and avoids GIL contention. The weight_transfer module is cleanly separated. LGTM! Reviewed by Hermes Agent.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 347b9b2.
```python
lambda: self._post("/v1/completions", payload, self.request_timeout),
max_attempts=30,
label="vllm /v1/completions",
)
```
Nested retry creates up to 90 redundant requests
Medium Severity
`_generate_one_turn` wraps `self._post(...)` in `_retry_on_http_error` with `max_attempts=30`, but `_post` itself already calls `_retry_on_http_error` internally with `max_attempts=3`. This creates nested retries: each of the 30 outer attempts exhausts all 3 inner attempts (with exponential backoff) before the outer backoff kicks in, leading to up to 90 total HTTP requests per completion. The inner retries add roughly 3–7 s of wasted delay per outer iteration and produce duplicate warning logs that obscure the actual failure. The old code had a clear separation: `_post` only retried timeouts, while the outer loop retried connection drops.
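The multiplication is easy to reproduce with a toy retry wrapper (hypothetical names; the real `_retry_on_http_error` helper's backoff is omitted here):

```python
def retry(fn, max_attempts):
    """Toy stand-in for a retry helper (backoff omitted)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # exhausted: surface the error to the caller

attempts = 0

def flaky_post():
    """Always-failing stand-in for _post's underlying HTTP call."""
    global attempts
    attempts += 1
    raise ConnectionError("server down")

# _post retries internally (inner, max_attempts=3); _generate_one_turn
# then wraps the whole thing again (outer, max_attempts=30).
try:
    retry(lambda: retry(flaky_post, max_attempts=3), max_attempts=30)
except ConnectionError:
    pass

print(attempts)  # 90 — every outer attempt re-runs all 3 inner attempts
```

Collapsing to a single retry layer (or making the inner layer retry only timeouts, as the old code did) keeps the attempt count bounded and the logs readable.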
Additional Locations (1)


What does this PR do?
Rollout generate and score loops now run in a spawned child process instead of a thread inside the trainer. The trainer's autograd engine no longer competes with `recursive_parse`/`accuracy_reward` for the GIL.

- `AsyncRolloutWorker` (parent): spawns the child rollout process; owns the shared `mp.Queue`/`mp.Value`/`mp.Event`s.
- `_AsyncRolloutLoop` handles the rollout logic (private, child-only): tokenizer, dataset iterator, reward funcs, asyncio loops.
- `WeightTransferClient` (new, `weight_transfer.py`): NCCL group with vLLM plus `/pause`, `/resume`, `/init_weight_transfer_engine`, `/update_weights`. The rollout worker only talks to `/v1/completions`.

Two callbacks register in `on_train_begin` (post-`accelerator.prepare`), guarded with `_fired` flags:

- `_InitialWeightSyncCallback`: `init_weight_transfer()` + `cold_sync_weight()`
- `_StartRolloutWorkerCallback`: `rollout_worker.start()` (registration order enforces "sync before start")

`remove_callback(type(self))` (PR #5319's pattern) drops the next callback in line when used with two callbacks, because `CallbackHandler.call_event` iterates the list directly. `_fired` is safe.

Two correctness fixes are folded in because they touch the same `_AsyncRolloutLoop` methods this PR moves (they would conflict otherwise):

- The `(ServerDisconnectedError, ClientConnectionError, ClientResponseError)` catch in `_generate_one_turn` and `_post` missed `ClientPayloadError` (which vLLM's async engine fires when `asyncio.CancelledError` interrupts a long request mid-stream), killing the trainer on a single transient network blip. Broadened to `(aiohttp.ClientError, asyncio.TimeoutError, TimeoutError, ConnectionResetError)` with bounded exponential backoff; a shared `_retry_on_http_error` helper is used by both call sites.
- `_score_group`: `np.nansum` on an all-NaN column silently returns 0, so a completion where every reward func returned `None` (gold unparseable by `math_verify`, ~30% of DeepMath / OpenR1-Math rows) ended up with reward 0 and a real advantage signal that pushed the policy away from actually-correct text answers. Now all-NaN columns are marked NaN, advantage is computed on the scorable subset only, and unscorable advantages are 0.

Why
Training a Qwen3-30B-A3B @ 16k completion length, rollout's per-completion Python work holds the GIL in 1–5 s bursts. Rank 0's autograd engine starves, especially with the DeepSpeed backend, so ranks 1–7 time out on the next NCCL collective.
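Moving that per-completion work into a spawned child gives it its own interpreter and GIL. A minimal sketch of the parent/child pattern (the `multiprocessing` primitives are real; `rollout_loop` and the sample shape are illustrative stand-ins, not the PR's code):

```python
import multiprocessing as mp

def rollout_loop(queue, model_version, stop):
    """Child-side stand-in for _AsyncRolloutLoop: generate/score work
    runs under this process's own GIL, so the trainer's autograd
    engine never competes with it."""
    for i in range(3):
        if stop.is_set():
            break
        # Tag each sample with the weight version it was produced under
        # (unsynchronized read: fine for illustration).
        queue.put({"idx": i, "version": model_version.value})

if __name__ == "__main__":
    ctx = mp.get_context("spawn")        # spawned child, as in the PR
    queue = ctx.Queue()                  # samples flow child -> parent
    version = ctx.Value("i", 0)          # shared model-version counter
    stop = ctx.Event()
    worker = ctx.Process(target=rollout_loop, args=(queue, version, stop))
    worker.start()
    samples = [queue.get(timeout=10) for _ in range(3)]
    stop.set()
    worker.join()
    print([s["idx"] for s in samples])  # [0, 1, 2]
```

A single producer feeding a FIFO `mp.Queue` preserves sample order, and the parent only pays the cost of deserializing finished samples.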
I used both `py-spy` and NCCL flight recorder info to confirm `MainThread` was idle in `_engine_run_backward` while `recursive_parse` was active and holding the GIL.

Moving the weight init to a callback structure also fixes the DS-Z2 multi-node crash where the NCCL `StatelessProcessGroup`'s IPC pages were bound before `Stage1And2ZeroOptimizer.__init__ → empty_cache()`; I was getting `cudaErrorIllegalAddress`.

Validation
`train_runtime` measured with `examples/scripts/async_grpo.py`, 1 node, 2 GPUs.

No `cudaErrorIllegalAddress`, no SIGBUS, no NCCL watchdog timeouts.

Test plan
- `examples/scripts/async_grpo.py` end-to-end
- `tests/experimental/test_async_grpo_trainer.py` against the stub worker (stub trimmed to the new protocol)

Note
Medium Risk
Moderate risk due to a large refactor of the async rollout/weight-sync lifecycle (thread→process, new callbacks, new NCCL weight-transfer client) that can affect training stability and distributed synchronization.
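The weight-sync lifecycle at the center of this refactor (pause generation, push weights, resume) can be sketched as a thin client over the server endpoints the PR names. This is a control-flow illustration with a stubbed transport; the method names besides the endpoints are assumptions, and the real client uses HTTP plus an NCCL broadcast:

```python
class WeightTransferClient:
    """Sketch of the weight-transfer control flow (endpoints from
    the PR description; transport is a stub, not aiohttp)."""

    def __init__(self, post):
        self._post = post            # callable(endpoint) -> None

    def init_weight_transfer(self):
        # One-time NCCL group setup with the vLLM server.
        self._post("/init_weight_transfer_engine")

    def sync(self, send_weights):
        # Periodic sync: quiesce generation, push weights, then let
        # /v1/completions traffic continue.
        self._post("/pause")
        try:
            send_weights()           # NCCL broadcast in the real client
            self._post("/update_weights")
        finally:
            self._post("/resume")    # never leave the server paused

log = []
client = WeightTransferClient(log.append)
client.init_weight_transfer()
client.sync(lambda: log.append("nccl_broadcast"))
print(log)
# ['/init_weight_transfer_engine', '/pause', 'nccl_broadcast',
#  '/update_weights', '/resume']
```

The `try/finally` around the transfer is the important part of the shape: a failed broadcast must still `/resume` the server, or generation stays paused forever.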
Overview
Moves async GRPO rollouts off the trainer thread and into a spawned child process.
`AsyncRolloutWorker` becomes a parent-side controller that manages an `mp.Process` running `_AsyncRolloutLoop`, communicating via a shared `mp.Queue` (samples) and `mp.Value` (model version) to reduce GIL contention.

Splits vLLM weight syncing into a new `WeightTransferClient` and changes when syncing happens. The trainer now initializes NCCL and performs an initial cold sync via `on_train_begin` callbacks (then starts the rollout worker); periodic syncs call pause/send_weights/resume on the new client, and cleanup also destroys the NCCL group.

Improves rollout robustness/correctness. HTTP calls to vLLM use a shared retry helper with bounded backoff, and reward aggregation preserves all-`None` reward rows as NaN (advantages set to 0, normalization computed only on scorable samples). Tests update the stub worker/protocol to match the new interface (no pause/resume/send_weights).

Reviewed by Cursor Bugbot for commit 21415fd. Bugbot is set up for automated code reviews on this repo.
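As a footnote, the reward-aggregation fix described above can be illustrated with a small NumPy sketch (shapes, axis convention, and the epsilon are assumptions; the all-NaN masking and scorable-subset normalization mirror the described behavior):

```python
import numpy as np

def score_group(rewards):
    """rewards: (num_completions, num_reward_funcs), None -> NaN.
    All-NaN rows stay NaN (np.nansum would silently give 0);
    advantages are normalized over scorable rows only, and
    unscorable rows get advantage 0."""
    rewards = np.asarray(rewards, dtype=float)
    all_nan = np.isnan(rewards).all(axis=1)
    total = np.nansum(rewards, axis=1)
    total[all_nan] = np.nan                 # don't let nansum fake a 0
    scorable = ~all_nan
    adv = np.zeros_like(total)
    if scorable.any():
        mean = total[scorable].mean()
        std = total[scorable].std() + 1e-8  # hypothetical epsilon
        adv[scorable] = (total[scorable] - mean) / std
    return total, adv

total, adv = score_group([
    [1.0, np.nan],
    [np.nan, np.nan],   # every reward func returned None
    [0.0, 0.0],
])
print(adv[1])  # 0.0 — the unscorable completion contributes no advantage
```

Without the masking, the middle completion's `nansum` of 0 would enter the group mean and std, manufacturing a negative advantage for a completion that was never actually scored.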