feat: move async rollout worker to separate process #5749
Conversation
wadeKeith left a comment:
Nice refactor! Moving the async rollout worker to a separate process with dedicated weight transfer improves isolation and avoids GIL contention. The weight_transfer module is cleanly separated. LGTM! Reviewed by Hermes Agent.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 347b9b2.
```python
lambda: self._post("/v1/completions", payload, self.request_timeout),
max_attempts=30,
label="vllm /v1/completions",
)
```
Nested retry creates up to 90 redundant requests
Medium Severity
`_generate_one_turn` wraps `self._post(...)` in `_retry_on_http_error` with `max_attempts=30`, but `_post` itself already calls `_retry_on_http_error` internally with `max_attempts=3`. This creates nested retries: each of the 30 outer attempts exhausts all 3 inner attempts (with exponential backoff) before the outer backoff kicks in, leading to up to 90 total HTTP requests per completion. The inner retries add roughly 3–7 s of wasted delay per outer iteration and produce duplicate warning logs that obscure the actual failure. The old code had a clear separation: `_post` only retried timeouts, while the outer loop retried connection drops.
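The multiplication is easy to reproduce with a toy retry wrapper (hypothetical names; the real `_retry_on_http_error` helper's backoff is omitted here):

```python
def retry(fn, max_attempts):
    """Toy stand-in for a retry helper (backoff omitted)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # exhausted: surface the error to the caller

attempts = 0

def flaky_post():
    """Always-failing stand-in for _post's underlying HTTP call."""
    global attempts
    attempts += 1
    raise ConnectionError("server down")

# _post retries internally (inner, max_attempts=3); _generate_one_turn
# then wraps the whole thing again (outer, max_attempts=30).
try:
    retry(lambda: retry(flaky_post, max_attempts=3), max_attempts=30)
except ConnectionError:
    pass

print(attempts)  # 90 — every outer attempt re-runs all 3 inner attempts
```

Collapsing to a single retry layer (or making the inner layer retry only timeouts, as the old code did) keeps the attempt count bounded and the logs readable.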
Additional Locations (1)


What does this PR do?
Rollout generate and score loops now run in a spawned child process instead of a thread inside the trainer. The trainer's autograd engine no longer competes with `recursive_parse`/`accuracy_reward` for the GIL.

- `AsyncRolloutWorker` (parent): spawns the child rollout process; owns the shared `mp.Queue`/`mp.Value`/`mp.Event`s.
- `_AsyncRolloutLoop` handles the rollout logic (private, child-only): tokenizer, dataset iterator, reward funcs, asyncio loops.
- `WeightTransferClient` (new, `weight_transfer.py`): NCCL group with vLLM plus `/pause`, `/resume`, `/init_weight_transfer_engine`, `/update_weights`. The rollout worker only talks to `/v1/completions`.

Two callbacks register in `on_train_begin` (post-`accelerator.prepare`), guarded with `_fired` flags:

- `_InitialWeightSyncCallback`: `init_weight_transfer()` + `cold_sync_weight()`
- `_StartRolloutWorkerCallback`: `rollout_worker.start()` (registration order enforces "sync before start")

`remove_callback(type(self))` (PR #5319's pattern) drops the next callback in line when used with two callbacks, because `CallbackHandler.call_event` iterates the list directly. `_fired` is safe.

Two correctness fixes are folded in because they touch the same `_AsyncRolloutLoop` methods this PR moves (they would conflict otherwise):

- The `(ServerDisconnectedError, ClientConnectionError, ClientResponseError)` catch in `_generate_one_turn` and `_post` missed `ClientPayloadError` (which vLLM's async engine fires when `asyncio.CancelledError` interrupts a long request mid-stream), killing the trainer on a single transient network blip. Broadened to `(aiohttp.ClientError, asyncio.TimeoutError, TimeoutError, ConnectionResetError)` with bounded exponential backoff; a shared `_retry_on_http_error` helper is used by both call sites.
- `_score_group`: `np.nansum` on an all-NaN column silently returns 0, so a completion where every reward func returned `None` (gold unparseable by `math_verify`, ~30% of DeepMath / OpenR1-Math rows) ended up with reward 0 and a real advantage signal that pushed the policy away from actually-correct text answers. Now all-NaN columns are marked NaN, advantage is computed on the scorable subset only, and unscorable advantages are 0.

Why
Training a Qwen3-30B-A3B @ 16k completion length, rollout's per-completion Python work holds the GIL in 1–5 s bursts. Rank 0's autograd engine starves, especially with the DeepSpeed backend, so ranks 1–7 time out on the next NCCL collective.
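Moving that per-completion work into a spawned child gives it its own interpreter and GIL. A minimal sketch of the parent/child pattern (the `multiprocessing` primitives are real; `rollout_loop` and the sample shape are illustrative stand-ins, not the PR's code):

```python
import multiprocessing as mp

def rollout_loop(queue, model_version, stop):
    """Child-side stand-in for _AsyncRolloutLoop: generate/score work
    runs under this process's own GIL, so the trainer's autograd
    engine never competes with it."""
    for i in range(3):
        if stop.is_set():
            break
        # Tag each sample with the weight version it was produced under
        # (unsynchronized read: fine for illustration).
        queue.put({"idx": i, "version": model_version.value})

if __name__ == "__main__":
    ctx = mp.get_context("spawn")        # spawned child, as in the PR
    queue = ctx.Queue()                  # samples flow child -> parent
    version = ctx.Value("i", 0)          # shared model-version counter
    stop = ctx.Event()
    worker = ctx.Process(target=rollout_loop, args=(queue, version, stop))
    worker.start()
    samples = [queue.get(timeout=10) for _ in range(3)]
    stop.set()
    worker.join()
    print([s["idx"] for s in samples])  # [0, 1, 2]
```

A single producer feeding a FIFO `mp.Queue` preserves sample order, and the parent only pays the cost of deserializing finished samples.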
I used both `py-spy` and NCCL flight recorder info to confirm `MainThread` was idle in `_engine_run_backward` while `recursive_parse` was active and holding the GIL.

Moving the weight init to a callback structure also fixes the DS-Z2 multi-node crash where the NCCL `StatelessProcessGroup`'s IPC pages were bound before `Stage1And2ZeroOptimizer.__init__ → empty_cache()`; I was getting `cudaErrorIllegalAddress`.

Validation
`train_runtime` measured with `examples/scripts/async_grpo.py`, 1 node, 2 GPUs.

No `cudaErrorIllegalAddress`, no SIGBUS, no NCCL watchdog timeouts.

Test plan
- `examples/scripts/async_grpo.py` end-to-end
- `tests/experimental/test_async_grpo_trainer.py` against the stub worker (stub trimmed to the new protocol)

Note
Medium Risk
Moderate risk due to a large refactor of the async rollout/weight-sync lifecycle (thread→process, new callbacks, new NCCL weight-transfer client) that can affect training stability and distributed synchronization.
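The weight-sync lifecycle at the center of this refactor (pause generation, push weights, resume) can be sketched as a thin client over the server endpoints the PR names. This is a control-flow illustration with a stubbed transport; the method names besides the endpoints are assumptions, and the real client uses HTTP plus an NCCL broadcast:

```python
class WeightTransferClient:
    """Sketch of the weight-transfer control flow (endpoints from
    the PR description; transport is a stub, not aiohttp)."""

    def __init__(self, post):
        self._post = post            # callable(endpoint) -> None

    def init_weight_transfer(self):
        # One-time NCCL group setup with the vLLM server.
        self._post("/init_weight_transfer_engine")

    def sync(self, send_weights):
        # Periodic sync: quiesce generation, push weights, then let
        # /v1/completions traffic continue.
        self._post("/pause")
        try:
            send_weights()           # NCCL broadcast in the real client
            self._post("/update_weights")
        finally:
            self._post("/resume")    # never leave the server paused

log = []
client = WeightTransferClient(log.append)
client.init_weight_transfer()
client.sync(lambda: log.append("nccl_broadcast"))
print(log)
# ['/init_weight_transfer_engine', '/pause', 'nccl_broadcast',
#  '/update_weights', '/resume']
```

The `try/finally` around the transfer is the important part of the shape: a failed broadcast must still `/resume` the server, or generation stays paused forever.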
Overview
Moves async GRPO rollouts off the trainer thread and into a spawned child process.
`AsyncRolloutWorker` becomes a parent-side controller that manages an `mp.Process` running `_AsyncRolloutLoop`, communicating via a shared `mp.Queue` (samples) and `mp.Value` (model version) to reduce GIL contention.

Splits vLLM weight syncing into a new `WeightTransferClient` and changes when syncing happens. The trainer now initializes NCCL and performs an initial cold sync via `on_train_begin` callbacks (then starts the rollout worker); periodic syncs call pause/send_weights/resume on the new client, and cleanup also destroys the NCCL group.

Improves rollout robustness/correctness. HTTP calls to vLLM use a shared retry helper with bounded backoff, and reward aggregation preserves all-`None` reward rows as NaN (advantages set to 0, normalization computed only on scorable samples). Tests update the stub worker/protocol to match the new interface (no pause/resume/send_weights).

Reviewed by Cursor Bugbot for commit 21415fd. Bugbot is set up for automated code reviews on this repo.
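As a footnote, the reward-aggregation fix described above can be illustrated with a small NumPy sketch (shapes, axis convention, and the epsilon are assumptions; the all-NaN masking and scorable-subset normalization mirror the described behavior):

```python
import numpy as np

def score_group(rewards):
    """rewards: (num_completions, num_reward_funcs), None -> NaN.
    All-NaN rows stay NaN (np.nansum would silently give 0);
    advantages are normalized over scorable rows only, and
    unscorable rows get advantage 0."""
    rewards = np.asarray(rewards, dtype=float)
    all_nan = np.isnan(rewards).all(axis=1)
    total = np.nansum(rewards, axis=1)
    total[all_nan] = np.nan                 # don't let nansum fake a 0
    scorable = ~all_nan
    adv = np.zeros_like(total)
    if scorable.any():
        mean = total[scorable].mean()
        std = total[scorable].std() + 1e-8  # hypothetical epsilon
        adv[scorable] = (total[scorable] - mean) / std
    return total, adv

total, adv = score_group([
    [1.0, np.nan],
    [np.nan, np.nan],   # every reward func returned None
    [0.0, 0.0],
])
print(adv[1])  # 0.0 — the unscorable completion contributes no advantage
```

Without the masking, the middle completion's `nansum` of 0 would enter the group mean and std, manufacturing a negative advantage for a completion that was never actually scored.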