Add dedicated streaming server for fast GGUF SSE inference #4635
danielhanchen wants to merge 2 commits into main
Conversation
Adds a separate streaming server (Option B) that bypasses the main
app's asyncio event loop for GGUF token streaming.
When UNSLOTH_FAST_SSE=1 or UNSLOTH_STREAM_SERVER=1 is set, a
lightweight FastAPI app starts in a daemon thread on a random port.
The frontend negotiates a one-time token via /api/inference/stream-url
and streams directly from this server, falling back to the baseline
/v1/chat/completions transparently on failure.
Architecture:
- streaming_server.py: standalone FastAPI app with three paths:
- Path A (hot): async httpx.AsyncClient streaming direct to
llama-server. No cumulative-to-delta conversion needed since
llama-server sends delta tokens natively via OpenAI SSE.
- Path B: tool calling via asyncio.to_thread (tool execution
is the bottleneck, not the streaming proxy).
- Path C: non-streaming one-shot JSON response.
- stream_token_store.py: thread-safe one-time token store (10s TTL)
shared between the main app and streaming server.
- routes/inference.py: /stream-url endpoint for token issuance,
/direct-stream for Option C, /internal/consume-stream-token for
future subprocess mode.
- llama_cpp.py: --api-key support so the streaming server can
authenticate to llama-server directly.
- chat-api.ts: transparent fast-path negotiation with silent fallback.
Handles: vision (image_url + legacy image_base64), thinking models
(reasoning_content -> <think> tags), CORS preflight, friendly error
messages, stream_options usage/timings passthrough.
Only sends repeat_penalty when the client explicitly provides it,
avoiding the 24% TPS penalty from repetition scanning.
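The one-time token store described above can be sketched as a small thread-safe class. This is a hypothetical minimal version, not the actual `stream_token_store.py`: it assumes a monotonic-clock TTL and removal-on-first-read, which is what "one-time token, 10s TTL" implies.

```python
import secrets
import threading
import time
from typing import Dict, Optional, Tuple


class OneTimeTokenStore:
    """Thread-safe store of single-use stream tokens with a short TTL.

    Illustrative sketch only; the real stream_token_store.py may differ.
    """

    def __init__(self, ttl_seconds: float = 10.0):
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        # token -> (subject, absolute expiry on the monotonic clock)
        self._tokens: Dict[str, Tuple[str, float]] = {}

    def create(self, subject: str) -> str:
        token = secrets.token_urlsafe(32)
        with self._lock:
            self._tokens[token] = (subject, time.monotonic() + self._ttl)
        return token

    def consume(self, token: str) -> Optional[str]:
        """Return the subject and invalidate the token; None if unknown or expired."""
        with self._lock:
            entry = self._tokens.pop(token, None)  # one-time: always removed
        if entry is None:
            return None
        subject, expiry = entry
        return subject if time.monotonic() <= expiry else None
```

Because the main app and the streaming server run in the same process (the server lives in a daemon thread), sharing a single instance of such a store is enough; no IPC is needed until the future subprocess mode.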
```python
yield f"data: {json.dumps({'error': {'message': _friendly_error(e), 'type': 'server_error'}})}\n\n"
# ...
return StreamingResponse(
    sse_generator(), media_type = "text/event-stream", headers = _SSE_HEADERS
```
Check warning: Code scanning / CodeQL
Information exposure through an exception (Medium)

Copilot Autofix (AI, 2 days ago)
In general, to fix information exposure through exceptions, never send raw exception messages or stack traces to clients. Instead, log detailed error information on the server side, and return a generic, non-sensitive message to the client. If you need user-friendly messages, base them on known-safe patterns rather than arbitrary exception text.
For this file, the best targeted fix is to modify _friendly_error so that it no longer returns text dynamically constructed from str(e) (such as interpolating token counts from a backend error message). Instead, it should inspect the message only to decide which predefined generic text to return. That way, even if the backend error message contains sensitive details, these never flow back to the client. Concretely:
- Keep using `msg = str(e)` and regex checks to detect specific error cases, but avoid returning `m.group(1)` or `m.group(2)` to the user.
- Replace the current context-window error string with a static, generic explanation without embedding numeric values from the backend.
- Keep the existing generic lost-connection message; it already does not reuse raw backend text.
- Keep the default fallback `"An internal error occurred"`.

All other logic stays the same. No changes are required at the call site (the `except Exception as e:` block and the `yield` with `_friendly_error(e)` remain). No new imports or helper methods are needed.
```diff
@@ -43,15 +43,15 @@
 def _friendly_error(e):
     """Convert raw exception messages to user-readable strings."""
     msg = str(e)
-    m = re.search(
+    # Detect specific known error patterns, but do not expose raw values from the
+    # backend error message to the client. Only return generic, pre-defined text.
+    if re.search(
         r"request \((\d+) tokens?\) exceeds the available context size \((\d+) tokens?\)",
         msg,
-    )
-    if m:
+    ):
         return (
-            f"Message too long: {m.group(1)} tokens exceeds the {m.group(2)}-token "
-            f"context window. Try increasing the Context Length in Model settings, "
-            f"or shorten the conversation."
+            "Message too long for the current context window. "
+            "Try increasing the Context Length in Model settings, or shorten the conversation."
         )
     if "Lost connection to llama-server" in msg:
         return "Lost connection to the model server. It may have crashed -- try reloading the model."
```
```python
yield f"data: {json.dumps({'error': {'message': _friendly_error(e), 'type': 'server_error'}})}\n\n"
# ...
return StreamingResponse(
    tool_sse(), media_type = "text/event-stream", headers = _SSE_HEADERS
```
Check warning: Code scanning / CodeQL
Information exposure through an exception (Medium)

Copilot Autofix (AI, 2 days ago)
In general terms, the fix is to ensure that user-facing error messages are generic and do not depend on the raw exception text. The exception (and any stack trace or detailed message) should be logged server-side only, while the client receives a high-level, non-sensitive description. Helper functions such as _friendly_error should avoid calling str(e) for arbitrary exceptions when constructing client-visible messages.
The best focused fix here is to change _friendly_error so that it only uses raw exception messages for very narrow, known-safe patterns (or not at all), and otherwise always returns a generic string without leaking str(e). In this code, _friendly_error already returns a generic "An internal error occurred" for the default case; the problematic bit is using str(e) unconditionally to do regex and substring matches. To avoid exposing arbitrary exception details while retaining functionality, we can still call str(e) internally but constrain what we return: for the context-window regex we only surface the numeric token counts (derived via regex groups), and for the connection-loss case we return a static string that does not echo the original message. We should not fall back to returning msg itself anywhere.
Concretely, in studio/backend/streaming_server.py, we keep the structure of _friendly_error but ensure it never returns raw or formatted exception messages or stack traces, only predefined strings. We do not need to touch the tool_sse function or the StreamingResponse construction; once _friendly_error is safe, the streamed JSON error notice is also safe. No new imports or helper methods are required.
```diff
@@ -41,20 +41,23 @@
 def _friendly_error(e):
-    """Convert raw exception messages to user-readable strings."""
+    """Convert raw exception messages to user-readable strings without leaking internals."""
     msg = str(e)
     m = re.search(
         r"request \((\d+) tokens?\) exceeds the available context size \((\d+) tokens?\)",
         msg,
     )
     if m:
+        # Only expose the numeric limits, not the original error text.
         return (
             f"Message too long: {m.group(1)} tokens exceeds the {m.group(2)}-token "
             f"context window. Try increasing the Context Length in Model settings, "
             f"or shorten the conversation."
         )
     if "Lost connection to llama-server" in msg:
+        # Return a generic connectivity message, without including the raw exception.
         return "Lost connection to the model server. It may have crashed -- try reloading the model."
+    # For all other errors, avoid exposing any exception details.
     return "An internal error occurred"
```
Code Review
This pull request implements a high-performance "fast-path" streaming architecture for the inference backend. It introduces a dedicated FastAPI streaming server (Option B) that runs in a separate thread to eliminate asyncio contention, supported by a short-lived, one-time token store for authentication. Additionally, it adds support for direct client access to the underlying llama-server via API keys (Option C). The frontend is updated to automatically negotiate these fast-paths with a silent fallback to the standard OpenAI-compatible endpoint. Review feedback identifies a logic flaw in the streaming server's state management for thinking tags and suggests consistent use of None for optional authorization headers in the backend.
studio/backend/streaming_server.py (254-258)

The logic for handling the end of the stream when `in_thinking` is true seems flawed:

- The condition `if has_content_tokens:` inside `if in_thinking:` is unreachable, because `in_thinking` is set to `False` as soon as a content token is processed.
- The `else` block incorrectly re-yields the entire `reasoning_text`, causing duplicated content in the output.

When the stream ends and only reasoning tokens have been sent (`in_thinking` is true), the correct action is to simply close the tag:

```python
if in_thinking:
    # Only reasoning tokens were received, so close the <think> tag.
    yield f"data: {json.dumps({'id': completion_id, 'object': 'chat.completion.chunk', 'created': created, 'model': model_name, 'choices': [{'index': 0, 'delta': {'content': '</think>'}, 'finish_reason': None}]}, separators=_SEP)}\n\n"
```

studio/backend/core/inference/llama_cpp.py (2098)

For consistency with other parts of this file (e.g., line 1564), it's better to use `None` instead of an empty dictionary `{}` when `self._api_key` is not present. This pattern is also repeated on line 2217. While httpx handles both correctly, using `None` is more idiomatic and consistent with the other changes in this PR.

```python
_auth_headers = {"Authorization": f"Bearer {self._api_key}"} if self._api_key else None
```

References: using `None` for optional dictionary parameters is more idiomatic in Python and maintains consistency within the codebase.
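The reasoning-to-`<think>` conversion that the first comment targets can be modeled as a small generator. This is a hypothetical sketch, not the PR's actual code: deltas are simplified to plain dicts instead of full SSE chunks, and the field name `reasoning_content` follows the llama-server delta shape described in the PR. It shows why `in_thinking` is cleared on the first content token and why the closing tag must be emitted exactly once at end-of-stream.

```python
def wrap_thinking_deltas(deltas):
    """Yield content deltas with reasoning tokens wrapped in <think>...</think>.

    Sketch of the streaming server's state machine: `in_thinking` tracks
    whether we are inside an open <think> tag.
    """
    in_thinking = False
    for delta in deltas:
        reasoning = delta.get("reasoning_content")
        content = delta.get("content")
        if reasoning:
            if not in_thinking:
                in_thinking = True
                yield {"content": "<think>" + reasoning}  # open tag on first reasoning token
            else:
                yield {"content": reasoning}
        if content:
            if in_thinking:
                in_thinking = False  # first content token closes the tag...
                yield {"content": "</think>" + content}
            else:
                yield {"content": content}
    if in_thinking:
        # ...so reaching end-of-stream with in_thinking still True means only
        # reasoning tokens arrived: just close the tag, re-yielding nothing.
        yield {"content": "</think>"}
```

Because `in_thinking` goes `False` as soon as any content arrives, the end-of-stream branch can never see both `in_thinking` and prior content tokens, which is exactly the unreachable condition the review flagged.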
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e945f43652
```ts
const response = useFastPath
  ? await fetch(fastUrl, {
      method: "POST",
```
Fall back to baseline when fast stream request fails
When useFastPath is true, the fast-path POST is awaited directly and any network error or non-2xx response bubbles out as a hard failure instead of retrying /v1/chat/completions. This breaks chats whenever the dedicated server is briefly unavailable (startup race, token expiry, port refusal, etc.), even though the fast path is meant to be optional and silently degradable.
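The silent-degradation pattern this comment asks for can be sketched language-agnostically; the following is a hypothetical Python rendering of what the `chat-api.ts` logic should do (names like `stream_with_fallback` and `do_request` are illustrative stand-ins for the actual `fetch` calls):

```python
def stream_with_fallback(fast_url, baseline_url, payload, do_request):
    """Try the fast-path endpoint first; on ANY failure, retry the baseline
    endpoint instead of surfacing a hard error.

    `do_request` stands in for the real HTTP call and may raise on network
    errors or non-2xx responses.
    """
    if fast_url is not None:
        try:
            return do_request(fast_url, payload)
        except Exception:
            pass  # startup race, token expiry, port refusal: degrade silently
    return do_request(baseline_url, payload)
```

The key point is that the fast-path request sits inside its own try/except, so a refused connection or expired one-time token never escapes to the user; only a failure of the baseline path itself is a real error.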
```python
token = create_stream_token(current_subject)
return {
    "supported": True,
    "stream_url": f"http://127.0.0.1:{stream_port}/stream",
```
Build stream URL from request host instead of 127.0.0.1
Returning a hardcoded loopback URL makes the browser connect to its own localhost, not the backend host, whenever Studio is accessed remotely (or through a proxy). In those environments the fast path becomes unreachable (and can also hit mixed-content issues due to hardcoded http), so enabling fast SSE can regress normal streaming behavior.
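One way to address this, sketched below under the assumption that the route handler can see the incoming request's Host header (the function name and parameters are illustrative, not the PR's actual code), is to rebuild the URL from that header instead of hardcoding the loopback address:

```python
def build_stream_url(request_host_header: str, stream_port: int, scheme: str = "http") -> str:
    """Derive the fast-path stream URL from the incoming request's Host header
    so remote or proxied clients reach the backend host, not their own localhost.

    Note: bracketed IPv6 literals and X-Forwarded-Host / X-Forwarded-Proto
    (for proxies and https) would need extra handling in real code.
    """
    host = request_host_header.split(":")[0]  # drop any port from the Host header
    return f"{scheme}://{host}:{stream_port}/stream"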
```python
else:
    self._api_key = None

logger.info(f"Starting llama-server: {' '.join(cmd)}")
```
Redact direct-stream API key from startup logs
When UNSLOTH_DIRECT_STREAM=1, cmd includes --api-key <secret> and this log statement prints the full command line, exposing the bearer key in logs. Anyone with log access can reuse that key to call llama-server directly, which defeats the purpose of protecting the direct endpoint with an API key.
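A minimal fix, sketched here as a hypothetical helper (not the PR's actual code), is to redact the value that follows any secret-bearing flag before the command line reaches the logger:

```python
from typing import List, Tuple


def redact_cmd(cmd: List[str], secret_flags: Tuple[str, ...] = ("--api-key",)) -> List[str]:
    """Return a copy of a command line with values following secret flags
    replaced by a placeholder, so logging ' '.join(cmd) never prints the key.
    """
    out = []
    redact_next = False
    for arg in cmd:
        if redact_next:
            out.append("<redacted>")
            redact_next = False
        else:
            out.append(arg)
            if arg in secret_flags:
                redact_next = True
    return out
```

The log line would then become `logger.info(f"Starting llama-server: {' '.join(redact_cmd(cmd))}")`, leaving the real `cmd` untouched for the subprocess launch.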
Summary

- When `UNSLOTH_FAST_SSE=1` is set, the frontend transparently negotiates a fast path via one-time tokens
- Falls back to `/v1/chat/completions` silently on any failure (no user-visible changes)

How it works

- `main.py` spawns a lightweight FastAPI app on a random port in a daemon thread
- The frontend calls `GET /api/inference/stream-url` to get a one-time token (10s TTL)
- The frontend streams from `http://127.0.0.1:{port}/stream` with the token in the `X-Stream-Token` header
- The streaming server proxies to llama-server via `httpx.AsyncClient`
- On any failure, the frontend falls back to the baseline `/v1/chat/completions` endpoint

Three streaming paths

- Path A (hot): async streaming direct to llama-server
- Path B: tool calling via `asyncio.to_thread` (tool execution is the bottleneck)
- Path C: non-streaming one-shot JSON response

Feature parity

- Vision: `image_url` content parts and legacy `image_base64` field, PNG conversion
- Thinking models: `reasoning_content` in delta wrapped in `<think>` tags
- Tool calling: `tool_status`, `tool_start`, `tool_end` SSE events
- Usage/timings: `stream_options.include_usage` passthrough from llama-server

Files changed

- `streaming_server.py`: standalone FastAPI streaming app
- `stream_token_store.py`: one-time token store
- `main.py`: starts the streaming server when `UNSLOTH_FAST_SSE=1`
- `routes/inference.py`: `/stream-url`, `/direct-stream`, `/internal/consume-stream-token` endpoints
- `core/inference/llama_cpp.py`: `--api-key` support, auth header passthrough in all streaming methods
- `chat-api.ts`: transparent fast-path negotiation with silent fallback

Key design decision: `repeat_penalty`

The streaming server only sends `repeat_penalty` to llama-server when the client explicitly provides `repetition_penalty`. Without it, llama-server uses its own default (1.0, no penalty). This avoids a 24% TPS cost from repetition scanning that was previously always applied. See #4634 for the related Pydantic default fix.
Test plan

- With `UNSLOTH_FAST_SSE=1`, verify streaming works through the fast path
- Without `UNSLOTH_FAST_SSE`, verify baseline streaming still works (no regression)
- With `stream: false`, verify JSON response
- Verify thinking-model reasoning streams correctly (wrapped in `<think>` tags)