fix(cli): close aibridge daemon before WebSocket shutdown wait#25719
Merged
Conversation
PR #25570 added an in-memory aibridge DRPC server in coderd/aibridged.go that registers a goroutine in api.WebsocketWaitGroup and only releases Done() when its client session is closed. PR #25575 then defaulted CODER_AI_GATEWAY_ENABLED to true, so every cli.Server() invocation now spins up that goroutine. The previous shutdown sequence in cli/server.go closed aibridgeDaemon via a deferred call scheduled at function return, but graceful shutdown called coderAPICloser.Close() first, which waits on WebsocketWaitGroup. That wait hit the full 10s timeout in coderd/coderd.go on every server shutdown, logging "websocket shutdown timed out after 10 seconds". On the Depot Windows runner the ramdisk reservation leaves ~17 GiB of headroom, and the 10s shutdown tails of multiple concurrent package binaries overlap into an OOM, presenting as test-go-pg (windows-2022) jobs that die silently at the ~600s watchdog with empty steps arrays. Close aibridgeDaemon explicitly during graceful shutdown, before coderAPICloser.Close() waits on the WebSocket wait group. This matches the existing ordered-shutdown pattern used for tunnel. The deferred Close is retained as a safety net for early-return paths and is safe to double-call because aibridged.Server.Close() is idempotent via shutdownOnce. Add TestServer_AIGatewayShutdownOrdering which fails deterministically at ~10s without this fix and passes in ~0.1s with it.
Contributor
|
No documentation changes needed. This is an internal bug fix for shutdown ordering that does not expose user-facing features or API changes. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Warning
The investigation and solution in this PR were done with Mux. I've reviewed the investigation methodology, evidence and solution, and it all appears sound.
Summary
PR #25570 (
refactor: move aibridged out of enterprise to AGPL, merged 2026-05-22) added an in-memory aibridge DRPC server incoderd/aibridged.gothat doesapi.WebsocketWaitGroup.Add(1)and only releasesDone()when its client session is closed. PR #25575 then flippedCODER_AI_GATEWAY_ENABLEDto default totrue, so everycli.Server()invocation now spins up that goroutine.In
cli/server.go, the only call toaibridgeDaemon.Close()was adeferscheduled at function return. During graceful shutdown the code first callscoderAPICloser.Close(), which waits onapi.WebsocketWaitGroup. That wait sits for the full 10s timeout incoderd/coderd.go(websocket shutdown timed out after 10 seconds), then returns, then the function unwinds, and only then does the deferredaibridgeDaemon.Close()fire and let the goroutine callDone().The 10s tax was previously latent (aibridged was enterprise-only and opt-in). After the two May 22 PRs it hit every
cli.Server()test. On Linux/macOS CI it just makes the suite slower; on the Depot Windows runner, the ramdisk reservation leaves only ~17 GiB of headroom and the ~10s shutdown tails of multiple concurrent package binaries overlap into an OOM, presenting astest-go-pg (windows-2022)jobs that die silently at the ~600s watchdog with an emptystepsarray.See Slack: https://codercom.slack.com/archives/C05AE94121Z/p1779807717764189
Fix
Close
aibridgeDaemonexplicitly during graceful shutdown, beforecoderAPICloser.Close()waits on the WebSocket wait group. This matches the existing ordered-shutdown pattern used fortunnelandnotificationsManager. The deferredaibridgeDaemon.Close()is retained as a safety net for early-return paths, and is safe to double-call becauseaibridged.Server.Close()is already idempotent viashutdownOnceincoderd/aibridged/aibridged.go.Regression test
TestServer_AIGatewayShutdownOrderingboots a realcoder serverwith--ai-gateway-enabled=true, cancels its context, and asserts graceful shutdown finishes in under 8s. With the fix the test runs in ~0.1s; without the fix it fails deterministically at ~10.0s. The flag is passed explicitly so the test continues to guard the ordering even if the deployment default is ever flipped back.Evidence this fixes the OOM
On Linux the patched
clitest package drops from 114 s back to its pre-regression 30 s wall time at the same single-process peak RSS (~7.6 GiB), and thewebsocket shutdown timed out after 10 secondslog line disappears from every server-test run. Since the Windows OOM is the sum of multiple concurrent 10 s shutdown tails overlapping past the runner's ~17 GiB headroom, removing those tails returns the concurrent-RSS budget to its pre-regression level. The Windows OOM was intermittent (a handful of hits across many runs since May 22), so a single greentest-go-pg (windows-2022)job on this PR is not by itself proof; confirmation will come from watching Windows runs onmainover the next several days and seeing the ~600 s silent-kill fingerprint stop recurring.Relates to ENG-2771