
Add dedicated streaming server for fast GGUF SSE inference#4635

Open
danielhanchen wants to merge 2 commits into main from feat/studio-dedicated-streaming-server

Conversation

@danielhanchen
Contributor

Summary

  • Adds a dedicated streaming server that runs in a separate thread with its own event loop
  • When UNSLOTH_FAST_SSE=1 is set, the frontend transparently negotiates a fast path via one-time tokens
  • Falls back to the baseline /v1/chat/completions silently on any failure (no user-visible changes)

How it works

  1. On startup, main.py spawns a lightweight FastAPI app on a random port in a daemon thread
  2. Frontend calls GET /api/inference/stream-url to get a one-time token (10s TTL)
  3. Frontend streams from http://127.0.0.1:{port}/stream with the token in X-Stream-Token header
  4. The streaming server proxies directly to llama-server using httpx.AsyncClient
  5. If anything fails, the frontend falls back to the standard /v1/chat/completions endpoint
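The negotiation in the steps above can be sketched as follows. This is a minimal Python sketch with a hypothetical helper name (`negotiate_stream_url`); the real logic lives in chat-api.ts and routes/inference.py.

```python
# Sketch of the fast-path negotiation with silent fallback. `fetch_json` is
# any callable that performs GET /api/inference/stream-url and returns the
# parsed JSON body, or raises on failure.

def negotiate_stream_url(fetch_json):
    """Return (stream_url, token) on success, or None to signal fallback."""
    try:
        info = fetch_json("/api/inference/stream-url")
    except Exception:
        return None  # any failure -> fall back to /v1/chat/completions
    if not info.get("supported"):
        return None
    return info["stream_url"], info["token"]


# Usage with a stubbed backend response:
result = negotiate_stream_url(
    lambda _: {"supported": True,
               "stream_url": "http://127.0.0.1:49321/stream",
               "token": "abc123"}
)
```

The key property is that every failure mode (network error, unsupported backend, missing token) collapses to `None`, so the caller always has a single fallback branch.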

Three streaming paths

| Path | Use case | Method |
| --- | --- | --- |
| A (hot path) | Standard streaming | Async httpx direct to llama-server |
| B | Tool calling | asyncio.to_thread (tool execution is the bottleneck) |
| C | Non-streaming | One-shot JSON response |
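The three-way routing above amounts to a small dispatch on the request body. A minimal sketch (assumed request shape; the actual routing lives in streaming_server.py):

```python
# Choose a streaming path from an OpenAI-style chat completion request body.

def choose_path(body: dict) -> str:
    if not body.get("stream", True):
        return "C"  # non-streaming: one-shot JSON response
    if body.get("tools"):
        return "B"  # tool calling: run via asyncio.to_thread
    return "A"      # hot path: async httpx proxy straight to llama-server
```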

Feature parity

  • Vision: image_url content parts and legacy image_base64 field, PNG conversion
  • Thinking models: reasoning_content in delta wrapped in <think> tags
  • Tool calling: tool_status, tool_start, tool_end SSE events
  • CORS: OPTIONS preflight handler
  • Error handling: friendly messages for context overflow, connection loss
  • Usage/timings: stream_options.include_usage passthrough from llama-server
  • Auth: one-time tokens with 10s TTL, consumed on use
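The thinking-model bullet above (wrapping `reasoning_content` deltas in `<think>` tags) can be sketched as a generator. This is a hedged illustration of the described behavior, not the PR's exact code; field names follow the llama-server OpenAI-style delta format:

```python
# Wrap reasoning_content deltas in <think>...</think>; close the tag when
# the first regular content token arrives, or at end of stream if only
# reasoning tokens were received.

def wrap_thinking(deltas):
    """Yield content strings, wrapping reasoning_content in <think> tags."""
    in_thinking = False
    for delta in deltas:
        reasoning = delta.get("reasoning_content")
        content = delta.get("content")
        if reasoning:
            if not in_thinking:
                in_thinking = True
                yield "<think>"
            yield reasoning
        if content:
            if in_thinking:
                in_thinking = False
                yield "</think>"
            yield content
    if in_thinking:  # stream ended with only reasoning tokens
        yield "</think>"
```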

Files changed

| File | Change |
| --- | --- |
| streaming_server.py | New. Standalone FastAPI app with all three streaming paths |
| stream_token_store.py | New. Thread-safe one-time token store |
| main.py | Starts the streaming server daemon thread when UNSLOTH_FAST_SSE=1 |
| routes/inference.py | Adds /stream-url, /direct-stream, and /internal/consume-stream-token endpoints |
| core/inference/llama_cpp.py | --api-key support, auth header passthrough in all streaming methods |
| chat-api.ts | Transparent fast-path negotiation with silent fallback |
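The one-time token store described above (stream_token_store.py) can be sketched roughly as below. Function and variable names here are assumptions for illustration, not the actual implementation:

```python
# Thread-safe one-time token store with a 10 s TTL. Tokens are consumed
# (removed) on first use, so a replayed token always fails.
import secrets
import threading
import time

_TTL = 10.0
_lock = threading.Lock()
_tokens = {}  # token -> (subject, expiry)

def create_stream_token(subject):
    token = secrets.token_urlsafe(32)
    with _lock:
        _tokens[token] = (subject, time.monotonic() + _TTL)
    return token

def consume_stream_token(token):
    """Return the subject and invalidate the token; None if bad or expired."""
    with _lock:
        entry = _tokens.pop(token, None)  # one-time: removed on first lookup
    if entry is None:
        return None
    subject, expiry = entry
    return subject if time.monotonic() <= expiry else None
```

Because the store is a plain module-level dict behind a lock, it can be imported by both the main app (to issue tokens) and the streaming server thread (to consume them) without any IPC.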

Key design decision: repeat_penalty

The streaming server only sends repeat_penalty to llama-server when the client explicitly provides repetition_penalty. Without it, llama-server uses its own default (1.0, no penalty). This avoids a 24% TPS cost from repetition scanning that was previously always applied. See #4634 for the related Pydantic default fix.
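The rule above reduces to a conditional when building the llama-server payload. A minimal sketch (hypothetical helper; request fields follow the OpenAI-style API):

```python
# Forward repeat_penalty only when the client explicitly set
# repetition_penalty, so llama-server keeps its own 1.0 default
# (no repetition scanning) otherwise.

def build_llama_payload(request: dict) -> dict:
    payload = {
        "messages": request["messages"],
        "stream": request.get("stream", True),
    }
    if "repetition_penalty" in request:
        payload["repeat_penalty"] = request["repetition_penalty"]
    return payload
```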

Test plan

  • Set UNSLOTH_FAST_SSE=1, verify streaming works through the fast path
  • Unset UNSLOTH_FAST_SSE, verify baseline streaming still works (no regression)
  • Send stream: false, verify JSON response
  • Enable tools, verify tool calling SSE events
  • Test with a vision model and image payload
  • Test with a thinking/reasoning model (verify <think> tags)
  • Send an expired/invalid token, verify 401 response
  • Kill the streaming server mid-stream, verify frontend falls back gracefully

Adds a separate streaming server (Option B) that bypasses the main
app's asyncio event loop for GGUF token streaming.

When UNSLOTH_FAST_SSE=1 or UNSLOTH_STREAM_SERVER=1 is set, a
lightweight FastAPI app starts in a daemon thread on a random port.
The frontend negotiates a one-time token via /api/inference/stream-url
and streams directly from this server, falling back to the baseline
/v1/chat/completions transparently on failure.

Architecture:
- streaming_server.py: standalone FastAPI app with three paths:
  - Path A (hot): async httpx.AsyncClient streaming direct to
    llama-server. No cumulative-to-delta conversion needed since
    llama-server sends delta tokens natively via OpenAI SSE.
  - Path B: tool calling via asyncio.to_thread (tool execution
    is the bottleneck, not the streaming proxy).
  - Path C: non-streaming one-shot JSON response.
- stream_token_store.py: thread-safe one-time token store (10s TTL)
  shared between the main app and streaming server.
- routes/inference.py: /stream-url endpoint for token issuance,
  /direct-stream for Option C, /internal/consume-stream-token for
  future subprocess mode.
- llama_cpp.py: --api-key support so the streaming server can
  authenticate to llama-server directly.
- chat-api.ts: transparent fast-path negotiation with silent fallback.

Handles: vision (image_url + legacy image_base64), thinking models
(reasoning_content -> <think> tags), CORS preflight, friendly error
messages, stream_options usage/timings passthrough.
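Path A's passthrough (no cumulative-to-delta conversion) can be sketched as a line-level proxy. This is an assumption-laden illustration, not the PR's code: llama-server already emits OpenAI-style delta chunks over SSE, so each `data:` line is forwarded verbatim.

```python
# Forward upstream SSE data lines unchanged, re-terminating each event,
# and stop after the [DONE] sentinel.

def proxy_sse_lines(upstream_lines):
    for line in upstream_lines:
        if line.startswith("data: "):
            yield line + "\n\n"  # each SSE event ends with a blank line
        if line == "data: [DONE]":
            return
```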

Only sends repeat_penalty when the client explicitly provides it,
avoiding the 24% TPS penalty from repetition scanning.
    yield f"data: {json.dumps({'error': {'message': _friendly_error(e), 'type': 'server_error'}})}\n\n"

    return StreamingResponse(
        sse_generator(), media_type = "text/event-stream", headers = _SSE_HEADERS
    )

Check warning

Code scanning / CodeQL

Information exposure through an exception Medium

Stack trace information
flows to this location and may be exposed to an external user.

Copilot Autofix


In general, to fix information exposure through exceptions, never send raw exception messages or stack traces to clients. Instead, log detailed error information on the server side, and return a generic, non-sensitive message to the client. If you need user-friendly messages, base them on known-safe patterns rather than arbitrary exception text.

For this file, the best targeted fix is to modify _friendly_error so that it no longer returns text dynamically constructed from str(e) (such as interpolating token counts from a backend error message). Instead, it should inspect the message only to decide which predefined generic text to return. That way, even if the backend error message contains sensitive details, these never flow back to the client. Concretely:

  • Keep using msg = str(e) and regex checks to detect specific error cases, but avoid returning m.group(1) or m.group(2) to the user.
  • Replace the current context-window error string with a static, generic explanation without embedding numeric values from the backend.
  • Keep the existing generic lost-connection message; it already does not reuse raw backend text.
  • Keep the default fallback "An internal error occurred".

All other logic stays the same. No changes are required at the call site (the except Exception as e: block and the yield with _friendly_error(e) remain). No new imports or helper methods are needed.

Suggested changeset 1
studio/backend/streaming_server.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/studio/backend/streaming_server.py b/studio/backend/streaming_server.py
--- a/studio/backend/streaming_server.py
+++ b/studio/backend/streaming_server.py
@@ -43,15 +43,15 @@
 def _friendly_error(e):
     """Convert raw exception messages to user-readable strings."""
     msg = str(e)
-    m = re.search(
+    # Detect specific known error patterns, but do not expose raw values from the
+    # backend error message to the client. Only return generic, pre-defined text.
+    if re.search(
         r"request \((\d+) tokens?\) exceeds the available context size \((\d+) tokens?\)",
         msg,
-    )
-    if m:
+    ):
         return (
-            f"Message too long: {m.group(1)} tokens exceeds the {m.group(2)}-token "
-            f"context window. Try increasing the Context Length in Model settings, "
-            f"or shorten the conversation."
+            "Message too long for the current context window. "
+            "Try increasing the Context Length in Model settings, or shorten the conversation."
         )
     if "Lost connection to llama-server" in msg:
         return "Lost connection to the model server. It may have crashed -- try reloading the model."
EOF
    yield f"data: {json.dumps({'error': {'message': _friendly_error(e), 'type': 'server_error'}})}\n\n"

    return StreamingResponse(
        tool_sse(), media_type = "text/event-stream", headers = _SSE_HEADERS
    )

Check warning

Code scanning / CodeQL

Information exposure through an exception Medium

Stack trace information
flows to this location and may be exposed to an external user.

Copilot Autofix


In general terms, the fix is to ensure that user-facing error messages are generic and do not depend on the raw exception text. The exception (and any stack trace or detailed message) should be logged server-side only, while the client receives a high-level, non-sensitive description. Helper functions such as _friendly_error should avoid calling str(e) for arbitrary exceptions when constructing client-visible messages.

The best focused fix here is to change _friendly_error so that it only uses raw exception messages for very narrow, known-safe patterns (or not at all), and otherwise always returns a generic string without leaking str(e). In this code, _friendly_error already returns a generic "An internal error occurred" for the default case; the problematic bit is using str(e) unconditionally to do regex and substring matches. To avoid exposing arbitrary exception details while retaining functionality, we can still call str(e) internally but constrain what we return: for the context-window regex we only surface the numeric token counts (derived via regex groups), and for the connection-loss case we return a static string that does not echo the original message. We should not fall back to returning msg itself anywhere.

Concretely, in studio/backend/streaming_server.py, we keep the structure of _friendly_error but ensure it never returns raw or formatted exception messages or stack traces, only predefined strings. We do not need to touch the tool_sse function or the StreamingResponse construction; once _friendly_error is safe, the streamed JSON error notice is also safe. No new imports or helper methods are required.

Suggested changeset 1
studio/backend/streaming_server.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/studio/backend/streaming_server.py b/studio/backend/streaming_server.py
--- a/studio/backend/streaming_server.py
+++ b/studio/backend/streaming_server.py
@@ -41,20 +41,23 @@
 
 
 def _friendly_error(e):
-    """Convert raw exception messages to user-readable strings."""
+    """Convert raw exception messages to user-readable strings without leaking internals."""
     msg = str(e)
     m = re.search(
         r"request \((\d+) tokens?\) exceeds the available context size \((\d+) tokens?\)",
         msg,
     )
     if m:
+        # Only expose the numeric limits, not the original error text.
         return (
             f"Message too long: {m.group(1)} tokens exceeds the {m.group(2)}-token "
             f"context window. Try increasing the Context Length in Model settings, "
             f"or shorten the conversation."
         )
     if "Lost connection to llama-server" in msg:
+        # Return a generic connectivity message, without including the raw exception.
         return "Lost connection to the model server. It may have crashed -- try reloading the model."
+    # For all other errors, avoid exposing any exception details.
     return "An internal error occurred"
 
 
EOF
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request implements a high-performance "fast-path" streaming architecture for the inference backend. It introduces a dedicated FastAPI streaming server (Option B) that runs in a separate thread to eliminate asyncio contention, supported by a short-lived, one-time token store for authentication. Additionally, it adds support for direct client access to the underlying llama-server via API keys (Option C). The frontend is updated to automatically negotiate these fast-paths with a silent fallback to the standard OpenAI-compatible endpoint. Review feedback identifies a logic flaw in the streaming server's state management for thinking tags and suggests consistent use of None for optional authorization headers in the backend.

I am having trouble creating individual review comments, so my feedback is listed below.

studio/backend/streaming_server.py (254-258)

critical

The logic for handling the end of the stream when in_thinking is true seems flawed.

  1. The condition if has_content_tokens: inside if in_thinking: is unreachable because in_thinking is set to false as soon as a content token is processed.
  2. The else block incorrectly re-yields the entire reasoning_text, causing duplicated content in the output.

When the stream ends and only reasoning tokens have been sent (in_thinking is true), the correct action is to simply close the tag.

                                if in_thinking:
                                    # Only reasoning tokens were received, so close the <think> tag.
                                    yield f"data: {json.dumps({'id': completion_id, 'object': 'chat.completion.chunk', 'created': created, 'model': model_name, 'choices': [{'index': 0, 'delta': {'content': '</think>'}, 'finish_reason': None}]}, separators=_SEP)}\n\n"

studio/backend/core/inference/llama_cpp.py (2098)

medium

For consistency with other parts of this file (e.g., line 1564), it's better to use None instead of an empty dictionary {} when self._api_key is not present. This pattern is also repeated on line 2217. While httpx handles both correctly, using None is more idiomatic and consistent with the other changes in this PR.

            _auth_headers = {"Authorization": f"Bearer {self._api_key}"} if self._api_key else None
References
  1. Using None for optional dictionary parameters is more idiomatic in Python and maintains consistency within the codebase.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e945f43652


Comment on lines +204 to +206
const response = useFastPath
? await fetch(fastUrl, {
method: "POST",

P1: Fall back to baseline when the fast stream request fails

When useFastPath is true, the fast-path POST is awaited directly and any network error or non-2xx response bubbles out as a hard failure instead of retrying /v1/chat/completions. This breaks chats whenever the dedicated server is briefly unavailable (startup race, token expiry, port refusal, etc.), even though the fast path is meant to be optional and silently degradable.


token = create_stream_token(current_subject)
return {
"supported": True,
"stream_url": f"http://127.0.0.1:{stream_port}/stream",

P1: Build the stream URL from the request host instead of 127.0.0.1

Returning a hardcoded loopback URL makes the browser connect to its own localhost, not the backend host, whenever Studio is accessed remotely (or through a proxy). In those environments the fast path becomes unreachable (and can also hit mixed-content issues due to hardcoded http), so enabling fast SSE can regress normal streaming behavior.


else:
self._api_key = None

logger.info(f"Starting llama-server: {' '.join(cmd)}")

P2: Redact the direct-stream API key from startup logs

When UNSLOTH_DIRECT_STREAM=1, cmd includes --api-key <secret> and this log statement prints the full command line, exposing the bearer key in logs. Anyone with log access can reuse that key to call llama-server directly, which defeats the purpose of protecting the direct endpoint with an API key.

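The suggested redaction can be sketched as a small helper that masks the value following `--api-key` before the command line is logged (hypothetical helper name, for illustration only):

```python
# Mask the argument that follows --api-key so secrets never reach the logs.

def redact_cmd(cmd):
    out = []
    hide_next = False
    for arg in cmd:
        if hide_next:
            out.append("****")
            hide_next = False
        else:
            out.append(arg)
            if arg == "--api-key":
                hide_next = True
    return " ".join(out)
```

Logging `redact_cmd(cmd)` instead of `' '.join(cmd)` keeps the startup log useful while hiding the bearer key.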
