perf(coderd): reduce chat streaming latency with event-driven acquisition #23745

Merged
kylecarbs merged 7 commits into main from kylecarbs/reduce-chat-streaming-latency
Mar 28, 2026

@kylecarbs kylecarbs commented Mar 28, 2026

Previously, when a user sent a message, there was a 0–1000ms (avg ~500ms) polling delay before processing began. SendMessage/CreateChat/EditMessage set status='pending' in the DB and returned, but nothing woke the processing loop — it was a blind 1-second ticker.

Changes

Event-driven acquisition (main change): Adds a wakeCh channel to the chatd Server. CreateChat, SendMessage, EditMessage, and PromoteQueued call signalWake() after committing their transactions, which wakes the run loop to call processOnce immediately. The 1-second ticker remains as a fallback safety net for edge cases (stale recovery, missed signals).
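
A minimal sketch of the wake-channel pattern described above, not the actual chatd code. The `server` type, the capacity-1 channel, and the non-blocking send are assumptions made for illustration; only the names `wakeCh`, `signalWake`, and `processOnce` come from this PR.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// server is a hypothetical stand-in for the chatd Server.
type server struct {
	wakeCh    chan struct{} // assumed capacity 1 so pending wake-ups coalesce
	processed atomic.Int64  // counts processOnce calls, for the demo only
}

func newServer() *server {
	return &server{wakeCh: make(chan struct{}, 1)}
}

// signalWake requests an immediate processing pass. It never blocks: if a
// wake-up is already queued, the extra signal is dropped.
func (s *server) signalWake() {
	select {
	case s.wakeCh <- struct{}{}:
	default:
	}
}

// processOnce stands in for acquiring and processing pending chats.
func (s *server) processOnce() {
	s.processed.Add(1)
}

// run calls processOnce on every wake signal and on every fallback tick.
func (s *server) run(ctx context.Context, fallback time.Duration) {
	ticker := time.NewTicker(fallback)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-s.wakeCh:
		case <-ticker.C:
		}
		s.processOnce()
	}
}

// wokeWithoutTicker reports whether a signal alone triggers processing when
// the fallback ticker is far too slow to fire during the check.
func wokeWithoutTicker() bool {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	s := newServer()
	go s.run(ctx, time.Hour)
	s.signalWake()
	deadline := time.Now().Add(2 * time.Second)
	for time.Now().Before(deadline) {
		if s.processed.Load() > 0 {
			return true
		}
		time.Sleep(time.Millisecond)
	}
	return false
}

// queuedSignals returns how many wake-ups stay queued after n signals
// arrive with no run loop draining the channel.
func queuedSignals(n int) int {
	s := newServer()
	for i := 0; i < n; i++ {
		s.signalWake()
	}
	return len(s.wakeCh)
}

func main() {
	fmt.Println("woke without ticker:", wokeWithoutTicker())
	fmt.Println("queued after 3 signals:", queuedSignals(3))
}
```

With a capacity-1 channel, a burst of signals collapses into a single pending wake-up, so callers never block after committing and the loop does not run redundant passes.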

Buffer WebSocket write channel: Changes the OneWayWebSocketEventSender event channel from unbuffered to buffered (64), decoupling the event producer from WebSocket write speed. The existing 10s write timeout guards against stuck connections.
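
The effect of the buffer can be sketched as follows. This is an illustration under stated assumptions, not the `OneWayWebSocketEventSender` implementation: `accepted`, `drain`, and `drainedCount` are hypothetical helpers, and the `write` callback stands in for a WebSocket write. Only the buffer size (64) and the 10s write timeout are taken from the description above.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// accepted returns how many of n events a producer can enqueue before it
// would block, when nothing is draining the channel.
func accepted(buffer, n int) int {
	events := make(chan string, buffer)
	count := 0
	for i := 0; i < n; i++ {
		select {
		case events <- "token":
			count++
		default:
			return count
		}
	}
	return count
}

// drain forwards queued events to write, bounding each write with its own
// timeout so one stuck write cannot hold the pipeline open forever.
func drain(events <-chan string, write func(context.Context, string) error, timeout time.Duration) error {
	for ev := range events {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		err := write(ctx, ev)
		cancel()
		if err != nil {
			return err
		}
	}
	return nil
}

// drainedCount enqueues three events with no writer running, then drains
// them and returns how many writes happened (-1 on error).
func drainedCount() int {
	events := make(chan string, 64)
	for _, ev := range []string{"a", "b", "c"} {
		events <- ev // does not block: the buffer absorbs the burst
	}
	close(events)
	written := 0
	err := drain(events, func(ctx context.Context, _ string) error {
		if ctx.Err() != nil {
			return ctx.Err()
		}
		written++
		return nil
	}, 10*time.Second)
	if err != nil {
		return -1
	}
	return written
}

func main() {
	fmt.Println("unbuffered, no reader:", accepted(0, 100))
	fmt.Println("buffered (64), no reader:", accepted(64, 100))
	fmt.Println("drained:", drainedCount())
}
```

An unbuffered channel accepts nothing until a reader is ready, so every token waits on the previous write. A buffer lets the producer run ahead by up to its capacity before backpressure applies.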

Implementation plan & analysis

The full latency analysis identified these sources of delay in the streaming pipeline:

  1. Chat acquisition polling — 0–1000ms (avg 500ms) dead time per message. Fixed by wake channel.
  2. Unbuffered WebSocket write channel — each token blocked on the previous WS write completing. Fixed by buffering.
  3. PersistStep DB transaction per step: FOR UPDATE lock + batch insert. Not addressed in this PR (medium risk; a fix would overlap the DB write with the next provider TTFB).
  4. Multi-hop channel pipeline — 4 channel hops per token. Not addressed (medium complexity).

Test stabilization notes

signalWake() causes the chatd daemon to process chats immediately after creation/send/edit, which exposed timing assumptions in several tests that expected chats to remain in pending status long enough to assert on. These tests were updated with require.Eventually + WaitUntilIdleForTest patterns to wait for processing to settle before asserting.
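
The wait-then-assert pattern can be sketched with the standard library alone. This is a rough illustration, not the test code from this PR: `eventually` is a hypothetical stand-in for testify's `require.Eventually`, the `sync.WaitGroup` stands in for `WaitUntilIdleForTest`, and the `chat` type and the `"done"` status are invented for the example.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// eventually polls cond every tick until it returns true or waitFor elapses.
func eventually(cond func() bool, waitFor, tick time.Duration) bool {
	deadline := time.Now().Add(waitFor)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(tick)
	}
}

// chat is a hypothetical record whose status a background worker changes.
type chat struct {
	mu     sync.Mutex
	status string
}

func (c *chat) set(s string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.status = s
}

func (c *chat) get() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.status
}

// settles simulates a worker that picks the chat up right away, then waits
// for the status to leave "pending" before asserting on the final state.
func settles() bool {
	c := &chat{status: "pending"}
	var inflight sync.WaitGroup
	inflight.Add(1)
	go func() {
		defer inflight.Done()
		time.Sleep(5 * time.Millisecond)
		c.set("done") // hypothetical terminal status
	}()
	ok := eventually(func() bool { return c.get() != "pending" }, 2*time.Second, time.Millisecond)
	inflight.Wait() // wait for idle outside the polling callback
	return ok && c.get() == "done"
}

// timesOut shows that eventually gives up when the condition never holds.
func timesOut() bool {
	return !eventually(func() bool { return false }, 10*time.Millisecond, time.Millisecond)
}

func main() {
	fmt.Println("settled:", settles())
	fmt.Println("timed out as expected:", timesOut())
}
```

The idle wait sits outside the polling callback, matching the later commit note about keeping `WaitUntilIdleForTest` out of the `require.Eventually` goroutine.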

The race detector (test-go-race-pg) shows failures in TestCreateWorkspaceTool_EndToEnd and TestAwaitSubagentCompletion — these appear to be pre-existing races in the end-to-end chat flow that are now exercised more aggressively because processing starts immediately instead of after a 1s delay. Main branch CI (race detector) passes without these changes.

…tion

Previously, when a user sent a message, there was a 0-1000ms (avg 500ms)
polling delay before processing began. SendMessage/CreateChat/EditMessage
set status='pending' in the DB and returned, but nothing woke the
processing loop — it was a blind 1-second ticker.

This change eliminates the polling delay by adding a wake channel that
signals the run loop to call processOnce immediately. The ticker remains
as a fallback safety net.

Additional latency reductions:
- Buffer the WebSocket write channel (64) in OneWayWebSocketEventSender
  to decouple producers from write speed.
- Reduce chat stream batch size from 256 to 32 so the client gets
  smaller JSON payloads to parse sooner.
@kylecarbs kylecarbs requested a review from ibetitsmike March 28, 2026 16:01
@kylecarbs kylecarbs marked this pull request as ready for review March 28, 2026 16:01
…g races

CreateChat now signals the wake channel immediately, causing the chatd
daemon to process chats before TestListChats can assert ordering.
Add require.Eventually loops to wait for chats to reach terminal status
before asserting list ordering and pagination.
…learsQueue

EditMessage calls signalWake() after committing, causing the chatd
daemon to process the chat immediately. The DB assertion for pending
status raced with this processing. Use require.Eventually + WaitUntilIdleForTest
to wait for the processing to settle before asserting.
…iaPubsub

CreateChat calls signalWake(), causing the daemon to process chats
immediately. The race detector exposed a race where both parent and
child chats were processed (and errored) before the test could set
up the pubsub scenario. Wait for inflight processing to settle and
reset chat statuses before proceeding with the test.
- Move WaitUntilIdleForTest out of require.Eventually goroutine to
  avoid WaitGroup Add/Wait race in TestEditMessage test
- Remove stale creation-time UpdatedAt ordering check in
  TestListChats/Success — the descending-sort invariant is already
  verified by the adjacent loop
@kylecarbs kylecarbs merged commit 386b449 into main Mar 28, 2026
26 checks passed
@kylecarbs kylecarbs deleted the kylecarbs/reduce-chat-streaming-latency branch March 28, 2026 19:26
@github-actions github-actions bot locked and limited conversation to collaborators Mar 28, 2026