
feat: move async rollout worker to separate process #5749

Open
AmineDiro wants to merge 14 commits into main from feat/async-rollout-mp-worker

Conversation

Member

@AmineDiro AmineDiro commented May 11, 2026

What does this PR do?

Rollout generate and score loops now run in a spawned child process instead of a thread inside the trainer. The trainer's autograd engine no longer competes with recursive_parse / accuracy_reward for the GIL.

  • AsyncRolloutWorker (parent): spawns the child rollout process and owns the shared mp.Queue / mp.Value / mp.Event objects (a minimal sketch follows this list).
  • _AsyncRolloutLoop (private, child-only): the rollout logic itself (tokenizer, dataset iterator, reward funcs, asyncio loops).
  • WeightTransferClient (new, weight_transfer.py): owns the NCCL group with vLLM plus the /pause, /resume, /init_weight_transfer_engine, and /update_weights calls. The rollout worker only talks to /v1/completions.
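A minimal sketch of the parent/child split, assuming simplified names and a spawn start method (the actual AsyncRolloutWorker / _AsyncRolloutLoop carry more state than shown):

```python
import multiprocessing as mp

class AsyncRolloutWorker:
    """Parent-side controller (illustrative; not the PR's exact API)."""

    def __init__(self):
        ctx = mp.get_context("spawn")           # spawn, not fork: keep CUDA state out of the child
        self.sample_queue = ctx.Queue()         # child -> parent: scored rollout batches
        self.model_version = ctx.Value("i", 0)  # parent -> child: current weight version
        self.stop_event = ctx.Event()           # parent -> child: shutdown signal
        self._proc = None

    def start(self):
        ctx = mp.get_context("spawn")
        self._proc = ctx.Process(
            target=_rollout_loop_main,          # child entry point
            args=(self.sample_queue, self.model_version, self.stop_event),
            daemon=True,
        )
        self._proc.start()

    def close(self):
        self.stop_event.set()
        if self._proc is not None:
            self._proc.join(timeout=30)

def _rollout_loop_main(queue, model_version, stop_event):
    # Child process: tokenizer, dataset iterator, reward funcs, and the asyncio
    # loop all live here, so their GIL-heavy work never blocks the trainer.
    while not stop_event.is_set():
        ...  # generate via vLLM /v1/completions, score, queue.put(batch)
```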

Two callbacks do their work in on_train_begin (after accelerator.prepare), each guarded with a _fired flag (see the sketch after this list):

  • _InitialWeightSyncCallback: init_weight_transfer() + cold _sync_weight()
  • _StartRolloutWorkerCallback: rollout_worker.start() (registration order enforces "sync before start")
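A hedged sketch of the one-shot callback pattern, assuming trainer attribute names that may differ from the actual code:

```python
from transformers import TrainerCallback

class _InitialWeightSyncCallback(TrainerCallback):
    """Fires exactly once, after accelerator.prepare has run (illustrative)."""

    def __init__(self, trainer):
        self.trainer = trainer
        self._fired = False

    def on_train_begin(self, args, state, control, **kwargs):
        if self._fired:                       # idempotent: safe if the event re-fires
            return
        self._fired = True
        self.trainer.init_weight_transfer()   # NCCL group setup with vLLM
        self.trainer._sync_weight()           # cold sync before rollouts start
```

Registering _StartRolloutWorkerCallback after this one preserves the "sync before start" ordering, since callbacks fire in registration order.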

remove_callback(type(self)) (the pattern from PR #5319) drops the next callback in line when used with two callbacks, because CallbackHandler.call_event iterates the callback list directly while it is being mutated. The _fired flag is safe; the demo below shows the skip.
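A self-contained illustration of the skip (an assumed simplification: when the handler iterates its callback list directly, removing the current element shifts the next one into an already-visited slot):

```python
callbacks = ["sync_cb", "start_cb"]
for cb in callbacks:
    print("firing", cb)
    callbacks.remove(cb)  # self-removal mid-iteration skips "start_cb"
# prints only: firing sync_cb
```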

Two correctness fixes are folded in because they touch the same _AsyncRolloutLoop methods this PR moves (landing them separately would conflict); sketches of both follow the list:

  • widen the aiohttp retry: the narrow (ServerDisconnectedError, ClientConnectionError, ClientResponseError) catch in _generate_one_turn and _post missed ClientPayloadError (which vLLM's async engine fires when asyncio.CancelledError interrupts a long request mid-stream), killing the trainer on a single transient network blip. Broadened to (aiohttp.ClientError, asyncio.TimeoutError, TimeoutError, ConnectionResetError) with bounded exponential backoff; a shared _retry_on_http_error helper is used by both call sites.
  • preserve NaN in _score_group: np.nansum on an all-NaN column silently returns 0, so a completion where every reward func returned None (the gold answer is unparseable by math_verify for ~30% of DeepMath / OpenR1-Math rows) ended up with reward 0, producing a real advantage signal that pushed the policy away from actually-correct text answers. All-NaN columns are now marked NaN, the advantage is computed on the scorable subset only, and unscorable advantages are set to 0.
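A sketch of the shared retry helper, assuming the backoff constants and logging (the real _retry_on_http_error may differ):

```python
import asyncio
import aiohttp

RETRYABLE = (aiohttp.ClientError, asyncio.TimeoutError, TimeoutError, ConnectionResetError)

async def _retry_on_http_error(coro_factory, max_attempts=30, base_delay=0.5,
                               max_delay=30.0, label=""):
    """Retry a zero-arg coroutine factory with bounded exponential backoff."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return await coro_factory()
        except RETRYABLE as exc:
            if attempt == max_attempts:
                raise
            print(f"[{label}] attempt {attempt}/{max_attempts} failed ({exc!r}); "
                  f"retrying in {delay:.1f}s")
            await asyncio.sleep(delay)
            delay = min(delay * 2, max_delay)  # bounded, not unbounded, backoff
```

And a sketch of the NaN-preserving scoring (orientation, the normalization epsilon, and return values are assumptions):

```python
import numpy as np

def score_group(rewards):
    """rewards: (num_completions, num_reward_funcs); None becomes NaN under the float cast."""
    rewards = np.asarray(rewards, dtype=np.float64)
    all_nan = np.all(np.isnan(rewards), axis=1)            # no reward func could score these
    total = np.where(all_nan, np.nan, np.nansum(rewards, axis=1))
    advantages = np.zeros(total.shape)                     # unscorable completions get 0
    scorable = ~all_nan
    if scorable.any():
        mu = total[scorable].mean()
        sigma = total[scorable].std() + 1e-8
        advantages[scorable] = (total[scorable] - mu) / sigma  # normalize on scorable subset only
    return total, advantages
```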

Why

When training Qwen3-30B-A3B at 16k completion length, rollout's per-completion Python work holds the GIL in 1-5s bursts. Rank 0's autograd engine starves (especially with the DeepSpeed backend), so ranks 1-7 time out on the next NCCL collective.

I used both py-spy and NCCL flight recorder output to confirm that the MainThread sat idle in _engine_run_backward while recursive_parse was active and holding the GIL.

Moving the weight init into the callback structure also fixes the DS-Z2 multi-node crash where the NCCL StatelessProcessGroup's IPC pages were bound before Stage1And2ZeroOptimizer.__init__ called empty_cache(); this surfaced as cudaErrorIllegalAddress.

Validation

| Config | Steps | train_runtime |
| --- | --- | --- |
| Qwen3-30B-A3B + DS-Z2 + EP=8 + 16k DAPO, 2 trainer nodes + 1 vLLM | 5 | 924.3s |
| Qwen3-4B + OpenR1-Math (examples/scripts/async_grpo.py), 1 node, 2 GPUs | 10 | 229.8s |

No cudaErrorIllegalAddress, no SIGBUS, no NCCL watchdog timeouts.

Test plan

  • Multi-node DS-Z2 + EP=8 + 30B end-to-end
  • Single-node examples/scripts/async_grpo.py end-to-end
  • tests/experimental/test_async_grpo_trainer.py against stub worker (stub trimmed to new protocol)

Note

Medium Risk
Moderate risk due to a large refactor of the async rollout/weight-sync lifecycle (thread→process, new callbacks, new NCCL weight-transfer client) that can affect training stability and distributed synchronization.

Overview
Moves async GRPO rollouts off the trainer thread and into a spawned child process. AsyncRolloutWorker becomes a parent-side controller that manages an mp.Process running _AsyncRolloutLoop, communicating via shared mp.Queue (samples) and mp.Value (model version) to reduce GIL contention.

Splits vLLM weight syncing into a new WeightTransferClient and changes when syncing happens. The trainer now initializes NCCL + performs an initial cold sync via on_train_begin callbacks (then starts the rollout worker), and periodic syncs call pause/send_weights/resume on the new client; cleanup also destroys the NCCL group.
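A hedged sketch of the client's sync sequence (the HTTP wrapper, payloads, and the send_weights signature are assumptions; the actual weight bytes travel over the NCCL group, with HTTP only coordinating vLLM):

```python
import requests

class WeightTransferClient:
    """Illustrative; mirrors the endpoints named in the PR, not its exact code."""

    def __init__(self, base_url: str):
        self.base_url = base_url

    def _post(self, path: str, json: dict | None = None):
        resp = requests.post(f"{self.base_url}{path}", json=json, timeout=120)
        resp.raise_for_status()

    def init_weight_transfer(self):
        self._post("/init_weight_transfer_engine")  # NCCL group setup with vLLM

    def pause(self):
        self._post("/pause")                        # halt in-flight generation

    def send_weights(self, named_params):
        for name, param in named_params:
            # metadata over HTTP; the tensor itself is broadcast over NCCL
            self._post("/update_weights",
                       json={"name": name, "dtype": str(param.dtype),
                             "shape": list(param.shape)})

    def resume(self):
        self._post("/resume")                       # resume generation on new weights
```

A periodic sync in the trainer would then be pause → send_weights → resume, matching the lifecycle described above.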

Improves rollout robustness/correctness. HTTP calls to vLLM use a shared retry helper with bounded backoff, and reward aggregation preserves all-None reward rows as NaN (advantages set to 0 and normalization computed only on scorable samples). Tests update the stub worker/protocol to match the new interface (no pause/resume/send_weights).

Reviewed by Cursor Bugbot for commit 21415fd.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment thread: trl/experimental/async_grpo/async_rollout_worker.py
@AmineDiro force-pushed the feat/async-rollout-mp-worker branch from cf45c51 to f344029 (May 12, 2026 08:02)

@wadeKeith wadeKeith left a comment


Nice refactor! Moving the async rollout worker to a separate process with dedicated weight transfer improves isolation and avoids GIL contention. The weight_transfer module is cleanly separated. LGTM! Reviewed by Hermes Agent.

@AmineDiro
Member Author

347b9b2

This commit folds in both correctness fixes described in the PR summary above: the widened aiohttp retry (broadened exception tuple, bounded exponential backoff, shared _retry_on_http_error helper) and the NaN-preserving _score_group aggregation.


@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 347b9b2.

lambda: self._post("/v1/completions", payload, self.request_timeout),
max_attempts=30,
label="vllm /v1/completions",
)


Nested retry creates up to 90 redundant requests

Medium Severity

_generate_one_turn wraps self._post(...) in _retry_on_http_error with max_attempts=30, but _post itself already calls _retry_on_http_error internally with max_attempts=3. This creates nested retries: each of the 30 outer attempts exhausts all 3 inner attempts (with exponential backoff) before the outer backoff kicks in, leading to up to 90 total HTTP requests per completion. The inner retries add ~3–7s of wasted delay per outer iteration and produce duplicate warning logs that obscure the actual failure. The old code had clear separation — _post only retried timeouts, while the outer loop retried connection drops.
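A self-contained toy reproducing the multiplication (the retry function here is a stand-in, not the PR's helper):

```python
attempts = 0

def flaky():
    global attempts
    attempts += 1
    raise ConnectionResetError

def retry(fn, n):
    for i in range(n):
        try:
            return fn()
        except ConnectionResetError:
            if i == n - 1:
                raise  # exhausted: propagate to the caller

try:
    retry(lambda: retry(flaky, 3), 30)  # inner retries nested inside outer retries
except ConnectionResetError:
    pass
print(attempts)  # 90
```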

Additional Locations (1)

