Skip to content

fix(cli): close aibridge daemon before WebSocket shutdown wait#25719

Merged
ethanndickson merged 1 commit into
mainfrom
ci-windows-2qg0
May 27, 2026
Merged

fix(cli): close aibridge daemon before WebSocket shutdown wait#25719
ethanndickson merged 1 commit into
mainfrom
ci-windows-2qg0

Conversation

@ethanndickson

@ethanndickson ethanndickson commented May 27, 2026

Copy link
Copy Markdown
Member

Warning

The investigation and solution in this PR were done with Mux. I've reviewed the investigation methodology, evidence and solution, and it all appears sound.

Summary

PR #25570 (refactor: move aibridged out of enterprise to AGPL, merged 2026-05-22) added an in-memory aibridge DRPC server in coderd/aibridged.go that does api.WebsocketWaitGroup.Add(1) and only releases Done() when its client session is closed. PR #25575 then flipped CODER_AI_GATEWAY_ENABLED to default to true, so every cli.Server() invocation now spins up that goroutine.

In cli/server.go, the only call to aibridgeDaemon.Close() was a defer scheduled at function return. During graceful shutdown the code first calls coderAPICloser.Close(), which waits on api.WebsocketWaitGroup. That wait sits for the full 10s timeout in coderd/coderd.go (websocket shutdown timed out after 10 seconds), then returns, then the function unwinds, and only then does the deferred aibridgeDaemon.Close() fire and let the goroutine call Done().

The 10s tax was previously latent (aibridged was enterprise-only and opt-in). After the two May 22 PRs it hit every cli.Server() test. On Linux/macOS CI it just makes the suite slower; on the Depot Windows runner, the ramdisk reservation leaves only ~17 GiB of headroom and the ~10s shutdown tails of multiple concurrent package binaries overlap into an OOM, presenting as test-go-pg (windows-2022) jobs that die silently at the ~600s watchdog with an empty steps array.

See Slack: https://codercom.slack.com/archives/C05AE94121Z/p1779807717764189

Fix

Close aibridgeDaemon explicitly during graceful shutdown, before coderAPICloser.Close() waits on the WebSocket wait group. This matches the existing ordered-shutdown pattern used for tunnel and notificationsManager. The deferred aibridgeDaemon.Close() is retained as a safety net for early-return paths, and is safe to double-call because aibridged.Server.Close() is already idempotent via shutdownOnce in coderd/aibridged/aibridged.go.

Regression test

TestServer_AIGatewayShutdownOrdering boots a real coder server with --ai-gateway-enabled=true, cancels its context, and asserts graceful shutdown finishes in under 8s. With the fix the test runs in ~0.1s; without the fix it fails deterministically at ~10.0s. The flag is passed explicitly so the test continues to guard the ordering even if the deployment default is ever flipped back.

Evidence this fixes the OOM

On Linux the patched cli test package drops from 114 s back to its pre-regression 30 s wall time at the same single-process peak RSS (~7.6 GiB), and the websocket shutdown timed out after 10 seconds log line disappears from every server-test run. Since the Windows OOM is the sum of multiple concurrent 10 s shutdown tails overlapping past the runner's ~17 GiB headroom, removing those tails returns the concurrent-RSS budget to its pre-regression level. The Windows OOM was intermittent (a handful of hits across many runs since May 22), so a single green test-go-pg (windows-2022) job on this PR is not by itself proof; confirmation will come from watching Windows runs on main over the next several days and seeing the ~600 s silent-kill fingerprint stop recurring.

Relates to ENG-2771

PR #25570 added an in-memory aibridge DRPC server in coderd/aibridged.go
that registers a goroutine in api.WebsocketWaitGroup and only releases
Done() when its client session is closed. PR #25575 then defaulted
CODER_AI_GATEWAY_ENABLED to true, so every cli.Server() invocation now
spins up that goroutine.

The previous shutdown sequence in cli/server.go closed aibridgeDaemon
via a deferred call scheduled at function return, but graceful shutdown
called coderAPICloser.Close() first, which waits on WebsocketWaitGroup.
That wait hit the full 10s timeout in coderd/coderd.go on every server
shutdown, logging "websocket shutdown timed out after 10 seconds".

On the Depot Windows runner the ramdisk reservation leaves ~17 GiB of
headroom, and the 10s shutdown tails of multiple concurrent package
binaries overlap into an OOM, presenting as test-go-pg (windows-2022)
jobs that die silently at the ~600s watchdog with empty steps arrays.

Close aibridgeDaemon explicitly during graceful shutdown, before
coderAPICloser.Close() waits on the WebSocket wait group. This matches
the existing ordered-shutdown pattern used for tunnel. The deferred
Close is retained as a safety net for early-return paths and is safe to
double-call because aibridged.Server.Close() is idempotent via
shutdownOnce.

Add TestServer_AIGatewayShutdownOrdering which fails deterministically
at ~10s without this fix and passes in ~0.1s with it.
@coder-tasks

coder-tasks Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

No documentation changes needed. This is an internal bug fix for shutdown ordering that does not expose user-facing features or API changes.

@dannykopping dannykopping left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@ethanndickson ethanndickson merged commit e91bec8 into main May 27, 2026
77 checks passed
@ethanndickson ethanndickson deleted the ci-windows-2qg0 branch May 27, 2026 07:33
@github-actions github-actions Bot locked and limited conversation to collaborators May 27, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants