
feat(chatd): add LLM stream retry with exponential backoff#22418

Merged
kylecarbs merged 4 commits into main from feat/llm-stream-retry
Feb 27, 2026
Conversation

@kylecarbs
Member

Summary

Adds automatic retry with exponential backoff for transient LLM errors during chat streaming and title generation. Inspired by coder/mux's retry mechanism.

Key Behaviors

  • Infinite retries with exponential backoff: 1s → 2s → 4s → ... → 60s cap
  • Deterministic delays (no jitter)
  • Error classification: retryable (429, 5xx, overloaded, rate limit, network errors) vs non-retryable (auth, quota, context exceeded, model not found, canceled)
  • Retry status published to SSE stream so frontend can show "Retrying in Xs..." UI
  • Title generation retries silently (best-effort, nil onRetry callback)

New Package: coderd/chatd/chatretry/

  • classify.go: IsRetryable(err) and StatusCodeRetryable(code)
  • backoff.go: Delay(attempt), exponential doubling with a 60s cap
  • retry.go: Retry(ctx, fn, onRetry), infinite retry loop with a context-aware timer

Test Helpers: coderd/chatd/chattest/errors.go

Anthropic and OpenAI error response builders for use in chattest providers:

  • AnthropicErrorResponse(), AnthropicOverloadedResponse(), AnthropicRateLimitResponse()
  • OpenAIErrorResponse(), OpenAIRateLimitResponse(), OpenAIServerErrorResponse()

SDK Changes: codersdk/chats.go

  • New ChatStreamEventType: "retry"
  • New ChatStreamRetry struct with Attempt, DelayMs, Error, RetryingAt fields
  • TypeScript types auto-generated

Changed Files

  • coderd/chatd/chatloop/chatloop.go — wraps agent.Stream() in chatretry.Retry()
  • coderd/chatd/chatd.go — publishes retry events to SSE stream with logging
  • coderd/chatd/title.go — wraps model.Generate() in silent retry
  • coderd/chatd/chattest/anthropic.go / openai.go — error injection support

Tests

Adds tests covering error classification (33), backoff (9), and retry scenarios (8).

Frontend: Retry Indicator

Consumes the 'retry' SSE event in the ChatContext store and displays 'Thinking... attempt N' in the streaming placeholder when the server is retrying a failed LLM call. The attempt indicator uses a muted style next to the shimmer text.

Changes:
  • ChatContext.ts: add retryState to the store, handle 'retry' SSE events, clear retry state on status transitions
  • AgentDetail.tsx: thread retryState through to ConversationTimeline
  • ConversationTimeline.tsx: export StreamingOutput, add a retryState prop, render 'attempt N' next to the shimmer
  • StreamingOutput.stories.tsx: 6 stories covering the placeholder, retry attempts 1/3/12, streaming text, and post-retry states
kylecarbs merged commit 2bdacae into main on Feb 27, 2026; 28 checks passed.
kylecarbs deleted the feat/llm-stream-retry branch on Feb 27, 2026.
The github-actions bot locked and limited the conversation to collaborators on Feb 27, 2026.